
Improving handling of large files in k6 #2974

Open
2 of 5 tasks
oleiade opened this issue Mar 13, 2023 · 11 comments
Assignee: oleiade
Labels: enhancement, evaluation needed (proposal needs to be validated or tested before fully implementing it in k6)

Comments

@oleiade
Member

oleiade commented Mar 13, 2023

Story

Problem Statement

Handling large files in k6, whether binary files or structured formats such as CSV, leads to high memory usage. As a result, our users' experience, especially in the cloud, degrades as soon as they need to handle large data sets, such as lists of user IDs.

Our users run into this issue in various situations, and it stems from a number of design and implementation decisions in the current state of the k6 open-source tool.

Objectives

Product-oriented

The product-oriented objective of this story, and its definition of success, is to land a new streaming CSV parser in k6, allowing users to parse and use big CSV files (> 500MB) that wouldn't fit in memory and would otherwise likely crash their k6 scripts. We are keen to land any technical improvements to k6 that make this possible along the way.
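To make this objective concrete, here is a minimal sketch of the developer experience we are aiming for; the module name and parser interface below are hypothetical, purely for illustration:

// Hypothetical sketch: the module and API below do not exist in k6 at this point.
import { CSVParser } from "k6/x/streaming-csv"; // hypothetical module

// Constructing the parser registers the file without loading it into memory.
const parser = new CSVParser("./users.csv"); // placeholder path

export default async function () {
    // Each iteration pulls a single record, so only the current record has
    // to fit in memory, regardless of the file's total size.
    const record = await parser.next(); // hypothetical async record iterator
    if (record) {
        // e.g. use record[0] as a user id in a request body
    }
}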

Technology-oriented

From a technological standpoint, the objective of this story is to land all the design and technical changes necessary to complete the product story. Our primary objective is to let users work with large data-set files in their k6 scripts without running into out-of-memory errors. As we pursue this goal, we aim to pick the solutions with the least overhead, and we are willing to take on technical debt if necessary.

Resolution

Through internal workshops with @sniku and @mstoykov, we surfaced various topics and issues that must be addressed to fulfill the objective.

Must-have

The bare-minimum items needed even to start tackling the top-level product objective are:

  1. Support finer-grained and richer access to tar archives content #2975: we end up caching files in memory because we cannot read them directly from a tar archive without decompressing them first.
  2. Reduce the caching of files inside k6: as a consequence of the above, k6 caches the files users open in memory and duplicates them per VU, and the behavior is inconsistent. If a more convenient tar library allowing direct access to files within archives were available, we might want to revisit this behavior.

Nice to have

While we're at it, another set of features and refactors would benefit the larger story of handling large files in k6:

  1. Design a File API for k6 #2977. Currently, k6's open() method is somewhat misnamed, as it actually performs a readFile() operation. This is partly a result of k6 archiving users' content in a single tar archive and having to access resources through it. With more efficient and flexible access to k6's tar archive content, we believe k6 would also benefit from a more "standard" file API to open, read, and seek through files more conveniently. This would support streaming use cases by allowing more flexible navigation through a file's content, while also benefiting from OS-level optimizations such as the buffer cache.
  2. Add Streams API support to k6 #2978. Another key aspect of handling files more efficiently in k6 is how we access them. As illustrated above, we currently only have a way to load the whole content of a file into memory. To support the specific product goal, as well as other endeavors such as #2273 or our work towards a new HTTP API, we believe that adding even partial (read operations only) support for the Streams API to k6 would be beneficial; it would establish a healthy baseline API for streaming IO in k6 (see the sketch after this list).
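For concreteness, consuming a read-only stream would follow the standard WHATWG Streams pattern shown below; how the stream object is obtained is deliberately left open, since k6 does not expose one at this point:

// Standard WHATWG Streams consumption (read operations only).
// How `stream` is obtained is unspecified here: treat this as a sketch of
// the baseline API, not as an existing k6 feature.
async function consume(stream) {
    const reader = stream.getReader();
    try {
        while (true) {
            const { done, value } = await reader.read();
            if (done) break; // stream exhausted
            // process `value`, e.g. a Uint8Array chunk of file bytes
        }
    } finally {
        reader.releaseLock();
    }
}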

Problem Space

This issue approaches the problem at hand with a pragmatic, product-oriented objective. However, this set of issues has already been approached from various angles in the past and is connected to longer-term plans, as the following list demonstrates:

@oleiade oleiade self-assigned this Mar 13, 2023
@oleiade oleiade changed the title Improving k6 large files handling Improving handling of large files in k6 Mar 13, 2023
@oleiade oleiade added the evaluation needed proposal needs to be validated or tested before fully implementing it in k6 label Mar 13, 2023
@na--
Member

na-- commented Apr 4, 2023

While thinking about #2975 (comment), I realized that I probably disagree with something here. I think the first points of the "Must-have" and "Nice to have" sections are somewhat flipped and need to be exchanged 😅 That is, #2975 is probably nice to have, while #2977 seems like a must.

What will happen if we only implement #2975? We'll have a very efficient archive bundle format (.tar or otherwise) that doesn't load everything into memory when k6 executes it. That would be awesome! It would mean users would be able to potentially cram huge static files in these archives. However, they would have no way to actually use these files in their scripts besides using open(). If it's data (and not, say, HTTP request bodies), maybe they can also use a SharedArray, which will still make at least two (one temporary and one permanent, with extra JSON overhead) copies of the whole contents in memory... 😅

Whereas, if we don't touch .tar archives and still load their contents fully in memory, but we stop copying all of the file contents everywhere, and we add a way to open files without fully reading them into memory, users will still be able to work with them somewhat efficiently. Loading 500 MB in memory is not great, but as long as it happens only once, it's fairly tolerable and fixing it becomes "nice to have", not a must.

@oleiade
Member Author

oleiade commented Apr 4, 2023

TL;DR: The tar improvements are less obviously immediately valuable. I agree with that, and agree that we should prioritize #2977 over them 👍🏻

I think part of the idea with the tar archive lib on "steroids" was to do it hand in hand with #2977, with the assumption that one would be able to obtain a file handle on anything inside the tar archive without having to hold it twice in memory (once because the tar archive is in memory, and once for the copy of the file's content).

Having done some research on this in the past, just having a more transparent API around files would already help tremendously, and would give users more granularity in how they handle data in scripts. As of today, users don't have much choice; it's all or nothing: load all the data into memory, or nothing. With a File API, one could instead open a file once and read its content whenever needed, just in time (modern OSes all have some flavor of a buffer cache, which caches the content of read syscalls, so that when you read a file N times in a row, all the reads past the first one are served from this cache and are much, MUCH quicker). This also has the potential to improve memory usage.
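A rough sketch of that "open once, read just in time" pattern, using a hypothetical file-handle shape (no concrete File API existed in k6 at this point):

// Hypothetical file-handle API, for illustration only.
// The handle would be obtained once, in the init context, and each VU would
// read only the bytes it needs per iteration; repeated reads of the same
// region would be served from the OS buffer cache.
let handle; // hypothetical, e.g. handle = openFile("./users.csv")

export default async function () {
    const chunk = new Uint8Array(64 * 1024);
    await handle.seek(0); // hypothetical: rewind to the start of the file
    const n = await handle.read(chunk); // hypothetical: read up to 64 KiB
    // use chunk.subarray(0, n) just in time, instead of preloading everything
}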

@fly3366

fly3366 commented May 19, 2023

LGTM. randomString with a very large size may take more time during bootstrap, but load testing a DB or a KV store often requires generating large values.

@jade-lucas

I have a need to load test an API that takes in PDFs. I was hoping that we could utilize SharedArray to share blobs (ArrayBuffer) across VUs. We are trying to load test with a wide range of PDF sizes, up to 50MB, and we need to simulate 2000 VUs. Simple math shows that the feasibility is not in our favor if every VU needs to hold a 50MB blob in memory. Having a way to share blobs across VUs, or having streaming support, would make this more feasible. I personally think a streaming option in the k6/http functions would be the most flexible and scalable; it would be a similar pattern to one seen in many code bases. We just started using k6 a few months ago to load test our mission-critical services where I'm employed, and so far it's been a really good experience. I think k6 would benefit greatly, and could really expand its capabilities, if it can crack the handling of large sets of unstructured data like PDFs, JPEGs, etc.

Our POC to test SharedArray with ArrayBuffer. It currently doesn't work. Note: we are using TypeScript and transpiling with webpack using ts-loader; not that it should make a difference, I don't think. Also, if I am doing something wrong with this POC, please leave a comment, as there could be others thinking of the same approach we did.

import { SharedArray } from "k6/data";

// NOTE: `path` must be declared; the value below is a placeholder.
const path = "./sample.pdf";

// Attempt to share a binary file (ArrayBuffer) across VUs.
const data: ArrayBuffer[] = new SharedArray<ArrayBuffer>("pdfs", function (): ArrayBuffer[] {
    const data: ArrayBuffer[] = [open(path, 'b')];
    console.info("Bytes in LoadTestFiles: " + data[0].byteLength);
    return data;
});

// Start our k6 test
export default (): void => {
    console.info("Number of items in data: " + data.length);
    // This logs undefined: SharedArray JSON-serializes its contents,
    // and ArrayBuffers don't survive the round trip.
    console.info("Bytes in VU: " + data[0].byteLength);
};

Output (screenshot): notice the size being undefined in the test.
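For anyone hitting the same wall before streaming support lands, one possible workaround (a sketch, not a recommendation: it trades CPU time and per-VU decode allocations for correctness) is to store the binary data base64-encoded, since SharedArray only round-trips JSON-serializable values; the file path below is a placeholder:

import { SharedArray } from "k6/data";
import { b64encode, b64decode } from "k6/encoding";

// SharedArray JSON-serializes its contents, which silently drops ArrayBuffers.
// A base64 string survives the round trip, at the cost of each VU decoding
// its own copy whenever it needs the bytes.
const pdfs = new SharedArray("pdfs", function () {
    return [b64encode(open("./sample.pdf", "b"))]; // placeholder path
});

export default function () {
    const buf = b64decode(pdfs[0]); // reconstructs an ArrayBuffer per use
    console.info("Bytes in VU: " + buf.byteLength);
}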

@oleiade
Member Author

oleiade commented May 24, 2023

Hi @jade-lucas

Thanks a lot for your constructive feedback and your concrete use case. We are currently experimenting with this topic, and we expect the first improvements to land in k6 in the not-so-distant future (no ETA yet). We have prioritized #2977 and have #2978 on our radar.

We expect #2977 might help with your issue handling PDF files. Streaming in HTTP is, unfortunately, further down the road, as it is expected to be part of the next http module we're working on at the moment (research phase). I'll make sure to keep you posted when something concrete lands in k6 🤝

@rojas-diego

I have run into the same issue as jade-lucas. We need to load test an API at a large scale with binary file uploads. Having little knowledge of JavaScript buffers, I at first couldn't understand why open(file) worked and open(file, "b") didn't when using SharedArray. I think a note about this in the documentation could help folks unfamiliar with the underlying implementation.

As mentioned, being able to share the contents of binary files between VUs and stream their contents over HTTP would be awesome and greatly aid our use-case.

Anyway, our experience of K6 has been amazing except for this one hurdle. Thanks for the great OSS 🙌🏻

@oleiade
Member Author

oleiade commented Nov 30, 2023

Quick update on this:

  • The upcoming version v0.48 of k6 will provide a k6/experimental/fs module, which allows for a smaller memory footprint when dealing with binary files (a usage sketch follows below).
  • We have started working actively towards Add Streams API support to k6 #2978 and expect to deliver it within one or two releases.
  • This also serves Implement phase 1 of new HTTP API (PoC) #3038, in that eventually we should be able to provide an HTTP client that streams data from a k6/experimental/fs.File and doesn't have some of the other memory usage issues the current module has.
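For reference, a usage sketch adapted from the k6/experimental/fs module documentation (the file path is a placeholder):

import { open } from "k6/experimental/fs";

// Open once, in the init context; k6 keeps a single copy of the file and
// hands VUs a handle instead of duplicating the contents per VU.
let file;
(async function () {
    file = await open("./data.bin"); // placeholder path
})();

export default async function () {
    const buffer = new Uint8Array(4096);
    let totalBytesRead = 0;
    while (true) {
        // read() fills `buffer` and resolves to the number of bytes read,
        // or null once the end of the file is reached.
        const bytesRead = await file.read(buffer);
        if (bytesRead == null) break; // EOF
        totalBytesRead += bytesRead;
    }
}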

@nk-tedo-001

  • The upcoming version v0.48 of k6 will provide a k6/experimental/fs module which allows for a better memory footprint when dealing with binary files.

Any docs on this? I can't find a description at https://k6.io/docs/javascript-api/k6-experimental/

@oleiade
Member Author

oleiade commented Dec 20, 2023

Hi @nk-tedo-001 👋🏻

Our docs have recently migrated to Grafana's; you can find more information there: https://grafana.com/docs/k6/latest/javascript-api/k6-experimental/fs/ 🙇🏻

@nk-tedo-001

With the fs module, k6 no longer exceeds its memory limits!

Great job!

@oleiade
Member Author

oleiade commented Dec 27, 2023

Thank you 🙇🏻 I'm glad it was helpful 🎉
