
Improving handling of large files in k6 #2974

Open
2 of 5 tasks
oleiade opened this issue Mar 13, 2023 · 11 comments
Assignee: oleiade
Labels: enhancement, evaluation needed (proposal needs to be validated or tested before fully implementing it in k6)

Comments

@oleiade
Member

oleiade commented Mar 13, 2023

Story

Problem Statement

Handling large files in k6, whether binary files or structured formats such as CSV, leads to high memory usage. As a result, our users' experience, especially in the cloud, degrades as soon as they need to handle large data sets, such as lists of user IDs.

Our users run into this issue in various situations, and it stems from a number of design and implementation decisions in the current state of the k6 open-source tool.

Objectives

Product-oriented

The product-oriented objective of this story, and its definition of success, is to land a new streaming CSV parser in k6, allowing users to parse and use big CSV files (> 500MB) that wouldn't fit in memory and would otherwise likely crash their k6 scripts. We are keen to land any technical improvements to k6 that make this possible along the way.
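To make this objective concrete, here is a minimal sketch of the developer experience we are aiming for; the module name and parser interface below are hypothetical, purely for illustration:

// Hypothetical sketch: the module and API below do not exist in k6 at this point.
import { CSVParser } from "k6/x/streaming-csv"; // hypothetical module

// Constructing the parser registers the file without loading it into memory.
const parser = new CSVParser("./users.csv"); // placeholder path

export default async function () {
    // Each iteration pulls a single record, so only the current record has
    // to fit in memory, regardless of the file's total size.
    const record = await parser.next(); // hypothetical async record iterator
    if (record) {
        // e.g. use record[0] as a user id in a request body
    }
}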

Technology-oriented

From a technological standpoint, the objective of this story is to land all the design and technical changes necessary to complete the product story. Our primary objective is to let users work with large data-set files in their k6 scripts without running into out-of-memory errors. As we pursue this goal, we aim to pick the solutions with the least overhead, and we are willing to take on technical debt if necessary.

Resolution

Through internal workshops with @sniku and @mstoykov, we surfaced various topics and issues that must be addressed to fulfill the objective.

Must-have

The bare-minimum items needed even to start tackling the top-level product objective are:

  1. Support finer-grained and richer access to tar archives content #2975: we end up caching files in memory because we cannot read them directly from a tar archive without decompressing them first.
  2. Reduce the caching of files inside k6: as a consequence of the above, k6 caches the files users open in memory and duplicates them per VU, and the behavior is inconsistent. If a more convenient tar library allowing direct access to files within archives were available, we might want to revisit this behavior.

Nice to have

While we're at it, another set of features and refactors would benefit the larger story of handling large files in k6:

  1. Design a File API for k6 #2977. Currently, k6's open() method is somewhat misnamed, as it actually performs a readFile() operation. This is partly a result of k6 archiving users' content in a single tar archive and having to access resources through it. With more efficient and flexible access to k6's tar archive content, we believe k6 would also benefit from a more "standard" file API to open, read, and seek through files more conveniently. This would support streaming use cases by allowing more flexible navigation through a file's content, while also benefiting from OS-level optimizations such as the buffer cache.
  2. Add Streams API support to k6 #2978. Another key aspect of handling files more efficiently in k6 is how we access them. As illustrated above, we currently only have a way to load the whole content of a file into memory. To support the specific product goal, as well as other endeavors such as #2273 or our work towards a new HTTP API, we believe that adding even partial (read operations only) support for the Streams API to k6 would be beneficial; it would establish a healthy baseline API for streaming IO in k6 (see the sketch after this list).
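For concreteness, consuming a read-only stream would follow the standard WHATWG Streams pattern shown below; how the stream object is obtained is deliberately left open, since k6 does not expose one at this point:

// Standard WHATWG Streams consumption (read operations only).
// How `stream` is obtained is unspecified here: treat this as a sketch of
// the baseline API, not as an existing k6 feature.
async function consume(stream) {
    const reader = stream.getReader();
    try {
        while (true) {
            const { done, value } = await reader.read();
            if (done) break; // stream exhausted
            // process `value`, e.g. a Uint8Array chunk of file bytes
        }
    } finally {
        reader.releaseLock();
    }
}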

Problem Space

This issue approaches the problem at hand with a pragmatic, product-oriented objective. However, this set of issues has already been approached from various angles in the past and is connected to longer-term plans, as the following list demonstrates:

@oleiade oleiade self-assigned this Mar 13, 2023
@oleiade oleiade changed the title Improving k6 large files handling Improving handling of large files in k6 Mar 13, 2023
@oleiade oleiade added the evaluation needed proposal needs to be validated or tested before fully implementing it in k6 label Mar 13, 2023
@na--
Member

na-- commented Apr 4, 2023

While thinking about #2975 (comment), I realized that I probably disagree with something here. I think the first points of the "Must-have" and "Nice to have" sections are somewhat flipped and need to be exchanged 😅 That is, #2975 is probably nice to have, while #2977 seems like a must.

What will happen if we only implement #2975? We'll have a very efficient archive bundle format (.tar or otherwise) that doesn't load everything into memory when k6 executes it. That would be awesome! It would mean users would be able to potentially cram huge static files in these archives. However, they would have no way to actually use these files in their scripts besides using open(). If it's data (and not, say, HTTP request bodies), maybe they can also use a SharedArray, which will still make at least two (one temporary and one permanent, with extra JSON overhead) copies of the whole contents in memory... 😅

Whereas, if we don't touch .tar archives and still load their contents fully in memory, but we stop copying all of the file contents everywhere, and we add a way to open files without fully reading them into memory, users will still be able to work with them somewhat efficiently. Loading 500 MB in memory is not great, but as long as it happens only once, it's fairly tolerable and fixing it becomes "nice to have", not a must.

@oleiade
Member Author

oleiade commented Apr 4, 2023

TL;DR: The tar improvements are less obviously immediately valuable. I agree with that, and agree that we should prioritize #2977 over them 👍🏻

I think part of the idea with the tar archive lib on "steroids" was to do it hand in hand with #2977, with the assumption that one would be able to obtain a file handle on anything inside the tar archive without having to hold it twice in memory (once because the tar archive is in memory, and once for the copy of the file's content).

Having done some research on this in the past, just having a more transparent API around files would already help tremendously, and would give users more granularity in how they handle data in scripts. As of today, users don't have much choice; it's all or nothing: load all the data into memory, or nothing. With a File API, one could instead open a file once and read its content whenever needed, just in time (modern OSes all have some flavor of a buffer cache, which caches the content of read syscalls, so that when you read a file N times in a row, all the reads past the first one are served from this cache and are much, MUCH quicker). This also has the potential to improve memory usage.
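A rough sketch of that "open once, read just in time" pattern, using a hypothetical file-handle shape (no concrete File API existed in k6 at this point):

// Hypothetical file-handle API, for illustration only.
// The handle would be obtained once, in the init context, and each VU would
// read only the bytes it needs per iteration; repeated reads of the same
// region would be served from the OS buffer cache.
let handle; // hypothetical, e.g. handle = openFile("./users.csv")

export default async function () {
    const chunk = new Uint8Array(64 * 1024);
    await handle.seek(0); // hypothetical: rewind to the start of the file
    const n = await handle.read(chunk); // hypothetical: read up to 64 KiB
    // use chunk.subarray(0, n) just in time, instead of preloading everything
}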

@fly3366

fly3366 commented May 19, 2023

LGTM. randomString with a very large size may take more time during bootstrap, but load testing a DB or a KV store often requires generating large values.

@jade-lucas

I have a need to load test an API that takes in PDFs. I was hoping that we could utilize SharedArray to share blobs (ArrayBuffer) across VUs. We are trying to load test with a wide range of PDF sizes, up to 50MB, and we need to simulate 2000 VUs. Simple math shows that the feasibility is not in our favor if every VU needs to hold a 50MB blob in memory. Having a way to share blobs across VUs, or having streaming support, would make this more feasible. I personally think a streaming option in the k6/http functions would be the most flexible and scalable; it would be a similar pattern to one seen in many code bases. We just started using k6 a few months ago to load test our mission-critical services where I'm employed, and so far it's been a really good experience. I think k6 would benefit greatly, and could really expand its capabilities, if it can crack the handling of large sets of unstructured data like PDFs, JPEGs, etc.

Our POC to test SharedArray with ArrayBuffer. It currently doesn't work. Note: we are using TypeScript and transpiling with webpack using ts-loader; not that it should make a difference, I don't think. Also, if I am doing something wrong with this POC, please leave a comment, as there could be others thinking of the same approach we did.

import { SharedArray } from "k6/data";

// NOTE: `path` must be declared; the value below is a placeholder.
const path = "./sample.pdf";

// Attempt to share a binary file (ArrayBuffer) across VUs.
const data: ArrayBuffer[] = new SharedArray<ArrayBuffer>("pdfs", function (): ArrayBuffer[] {
    const data: ArrayBuffer[] = [open(path, 'b')];
    console.info("Bytes in LoadTestFiles: " + data[0].byteLength);
    return data;
});

// Start our k6 test
export default (): void => {
    console.info("Number of items in data: " + data.length);
    // This logs undefined: SharedArray JSON-serializes its contents,
    // and ArrayBuffers don't survive the round trip.
    console.info("Bytes in VU: " + data[0].byteLength);
};

Output (screenshot): notice the size being undefined in the test.
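For anyone hitting the same wall before streaming support lands, one possible workaround (a sketch, not a recommendation: it trades CPU time and per-VU decode allocations for correctness) is to store the binary data base64-encoded, since SharedArray only round-trips JSON-serializable values; the file path below is a placeholder:

import { SharedArray } from "k6/data";
import { b64encode, b64decode } from "k6/encoding";

// SharedArray JSON-serializes its contents, which silently drops ArrayBuffers.
// A base64 string survives the round trip, at the cost of each VU decoding
// its own copy whenever it needs the bytes.
const pdfs = new SharedArray("pdfs", function () {
    return [b64encode(open("./sample.pdf", "b"))]; // placeholder path
});

export default function () {
    const buf = b64decode(pdfs[0]); // reconstructs an ArrayBuffer per use
    console.info("Bytes in VU: " + buf.byteLength);
}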

@oleiade
Member Author

oleiade commented May 24, 2023

Hi @jade-lucas

Thanks a lot for your constructive feedback and your concrete use case. We are currently experimenting with this topic, and we expect the first improvements to land in k6 in the not-so-distant future (no ETA yet). We have prioritized #2977 and have #2978 on our radar.

We expect #2977 might help with your issue handling PDF files. Streaming in HTTP is, unfortunately, further down the road, as it is expected to be part of the next http module we're working on at the moment (research phase). I'll make sure to keep you posted when something concrete lands in k6 🤝

@rojas-diego

I have run into the same issue as jade-lucas. We need to load test an API at a large scale with binary file uploads. Having little knowledge of JavaScript buffers, I at first couldn't understand why open(file) worked and open(file, "b") didn't when using SharedArray. I think a note about this in the documentation could help folks unfamiliar with the underlying implementation.

As mentioned, being able to share the contents of binary files between VUs and stream their contents over HTTP would be awesome and greatly aid our use-case.

Anyway, our experience of K6 has been amazing except for this one hurdle. Thanks for the great OSS 🙌🏻

@oleiade
Member Author

oleiade commented Nov 30, 2023

Quick update on this:

  • The upcoming version v0.48 of k6 will provide a k6/experimental/fs module, which allows for a smaller memory footprint when dealing with binary files (a usage sketch follows below).
  • We have started working actively towards Add Streams API support to k6 #2978 and expect to deliver it within one or two releases.
  • This also serves Implement phase 1 of new HTTP API (PoC) #3038, in that eventually we should be able to provide an HTTP client that streams data from a k6/experimental/fs.File and doesn't have some of the other memory usage issues the current module has.
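For reference, a usage sketch adapted from the k6/experimental/fs module documentation (the file path is a placeholder):

import { open } from "k6/experimental/fs";

// Open once, in the init context; k6 keeps a single copy of the file and
// hands VUs a handle instead of duplicating the contents per VU.
let file;
(async function () {
    file = await open("./data.bin"); // placeholder path
})();

export default async function () {
    const buffer = new Uint8Array(4096);
    let totalBytesRead = 0;
    while (true) {
        // read() fills `buffer` and resolves to the number of bytes read,
        // or null once the end of the file is reached.
        const bytesRead = await file.read(buffer);
        if (bytesRead == null) break; // EOF
        totalBytesRead += bytesRead;
    }
}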

@nk-tedo-001

  • The upcoming version v0.48 of k6 will provide a k6/experimental/fs module which allows for a better memory footprint when dealing with binary files.

Any docs on this? I can't find a description at https://k6.io/docs/javascript-api/k6-experimental/

@oleiade
Member Author

oleiade commented Dec 20, 2023

Hi @nk-tedo-001 👋🏻

Our docs have recently migrated to Grafana's; you can find more information there: https://grafana.com/docs/k6/latest/javascript-api/k6-experimental/fs/ 🙇🏻

@nk-tedo-001

With the fs module, k6 no longer exceeds its memory limits!

Great job!

@oleiade
Member Author

oleiade commented Dec 27, 2023

Thank you 🙇🏻 I'm glad it was helpful 🎉
