Ranged read HTTP header leaves TO bytes blank, causing S3 filesystems to read full file length #725
I've been looking through the […]. If you skip that first seek, the first call to […].

I guess the most obvious solution then is to set the to-bytes in the call to […]. However, I question why […]
Here's the effect on SeaweedFS, just to elucidate what this issue does in practice. At about 04:00 I was running a continuous loop reading 100 bytes from a 4 GB file, issuing a few calls per second; no appreciable bandwidth was reported during that time. This was using the patch discussed above along with […]. At 04:10 I simply changed […].

Here's the effect on a Ceph cluster that was (nearly) unused during the same test. Ceph does handle the situation better. The example above was running about 3 reads/s in a loop, each reading 200k bytes, so well under 1 MB/s of end-user IO, but Ceph monitoring shows a throughput of 150 MB/s, two orders of magnitude over what is actually being requested by the user. Ceph doesn't become inoperable the way SeaweedFS does at the scale of this test, but it drastically limits scalability once we get to higher-performance IO (for example, IO-heavy ML jobs).
Nice hunting champ! 👑 I was looking at where the […]. If I'm reading it correctly, it seems like it's just there to prevent unexpected behaviour when accessing content_length, or to validate that the object exists in storage before any data is actually read.
So I only succeeded in solving the problem for a single read less than the buffer size (~132k by default). There's a more fundamental issue at play.
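For reference, the buffer size mentioned here is the one that can be tuned through transport parameters; this is only a minimal sketch, assuming the S3 backend's buffer_size argument is forwarded there and with a placeholder bucket/key:

```python
import smart_open

# Hypothetical object; the point is only to show where buffer_size is passed.
url = "s3://my-bucket/data/huge-file.bin"

# Reads that fit inside the first buffered request (128 KiB here, matching the
# "~132k" default mentioned above) were covered by the partial fix; larger
# reads still fall through to the underlying range-less stream.
with smart_open.open(url, "rb", transport_params={"buffer_size": 128 * 1024}) as fin:
    sample = fin.read(100)
```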
So we're damned if we do and damned if we don't. One solution is to give the user the ability to specify a range in the open request; if they don't, we fall back on multiple HTTP requests, each of which can set the range correctly. Another solution would be for the filesystem to only queue up what it needs as bytes are read; while we might be able to get one filesystem to implement this, ensuring that all S3 filesystems operate that way is probably wishful thinking. I can't think of another way to handle this situation that both protects the filesystem and keeps network overhead to a minimum. Can anyone else?
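For what it's worth, the "multiple HTTP requests which can set the range correctly" fallback could look roughly like the sketch below, written directly against boto3 rather than smart_open internals; the bucket, key, and 1 MB chunk size are illustrative only:

```python
import boto3

s3 = boto3.client("s3")


def read_range(bucket, key, start, size, chunk_size=1024 * 1024):
    """Read `size` bytes at offset `start` using several bounded GETs.

    Every request carries an explicit last byte, so the server never has a
    reason to stage the whole object, at the cost of more round trips.
    """
    out = bytearray()
    pos, remaining = start, size
    while remaining > 0:
        step = min(chunk_size, remaining)
        resp = s3.get_object(
            Bucket=bucket,
            Key=key,
            Range="bytes=%d-%d" % (pos, pos + step - 1),
        )
        data = resp["Body"].read()
        if not data:  # ran off the end of the object
            break
        out += data
        pos += len(data)
        remaining -= len(data)
    return bytes(out)
```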
After sleeping on this, here's the solution I propose implementing (I'm working towards this and a pull request). We have two use cases that are at odds here:

Use case 1: The file is opened for streaming and many reads follow.

Use case 2: A single small ranged read of a large file. This use case is common in many applications such as machine learning, where a large binary data file contains many small samples. The file is opened, read once, and closed, often at a very high frequency. For example, my use case is to read up to 20 samples/s of 25 MB segments of a multi-GB file, up to 500 MB/s, to feed a single GPU, and I often run tens of GPUs. In this use case, the overhead of an unspecified byte range is a showstopper because it prevents the S3 filesystem from optimizing its internal queuing of the file.

These two use cases are fundamentally at odds, so any solution is a balancing act between them. I propose we remove […]. Description of these modes: […]
Tasks necessary to implement this (assuming no one raises a concern here, I'll work these into a pull request): […]
Sounds reasonable, but let's use shorter values for the options, e.g. […]. @jcushman Can you comment on the defer seek part?
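For context on the defer-seek part, here's a minimal sketch of how that behaviour is switched on, assuming the transport parameter is named defer_seek as in recent smart_open versions, and with a hypothetical object URL:

```python
import smart_open

# With defer_seek=True the initial GET (and its open-ended Range header) is
# postponed until the first read or seek, so an open immediately followed by
# a seek does not pay for an extra request.
url = "s3://my-bucket/data/huge-file.bin"  # hypothetical object
with smart_open.open(url, "rb", transport_params={"defer_seek": True}) as fin:
    fin.seek(1024 * 1024)
    sample = fin.read(32 * 1024)
```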
After thinking about this some more, I can think of another alternative. We can specify the range to always read at most N bytes, where N is configurable via transport parameters. For the streaming case, N would be sufficiently large, e.g. several MB (we could make this the default). For the reading case, it would be smaller, e.g. several kB. I think this would satisfy both use cases, and it would be simpler than the optimize-for-partial-reads/streaming/auto option. What's your opinion?
This works as long as the first read request still sets the first byte range to exactly what is requested in that first read. That covers the open, single-read, close use case, where even a modest inefficiency in the byte range would reduce scalability when pushing the system to its limits, such as occurs in ML.

I like your solution better for the streaming case. Although it necessitates more HTTP requests, setting the value to a reasonably large number (in the MB range; maybe 10 MB is good?) limits the damage that can be done if the file is obscenely large. In fact, looking at the Ceph / SeaweedFS comparison here, most likely Ceph is already implementing similar logic in their S3 handling (we see ~1 MB/s of 200k end-user reads actually transfer 150 MB/s internally), which means Ceph doesn't crash under current conditions, it just performs sub-optimally. SeaweedFS doesn't have this logic and internally queues a few orders of magnitude more data. Setting this limit would let […].

So then I'm implementing this set of changes: […]
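To make the agreed behaviour concrete, here's a minimal sketch of the range-planning logic being described: the first read asks for exactly what the caller requested, while later reads are capped at a configurable maximum. The function name and the 10 MB default are illustrative, not smart_open's actual implementation:

```python
DEFAULT_MAX_RANGE = 10 * 1024 * 1024  # illustrative 10 MB cap


def plan_range(position, requested, first_read, max_range=DEFAULT_MAX_RANGE):
    """Return the Range header value for the next GET.

    position   -- current offset into the object
    requested  -- number of bytes the caller asked for, or None for "read all"
    first_read -- True for the very first read after open/seek
    """
    if first_read and requested is not None:
        # Open/read/close use case: one tightly bounded request, no overhead.
        length = requested
    elif requested is None:
        # Streaming "read everything" call: never ask for more than the cap.
        length = max_range
    else:
        length = min(requested, max_range)
    return "bytes=%d-%d" % (position, position + length - 1)


# A single 32 KiB read at offset 1 GiB maps to an exact range...
print(plan_range(2**30, 32 * 1024, first_read=True))  # bytes=1073741824-1073774591
# ...while an unbounded streaming read is capped at 10 MB per request.
print(plan_range(0, None, first_read=False))          # bytes=0-10485759
```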
Yes, that makes sense.
Sounds like a fair bit of work, have fun ;)
A few details: […]
It is a bit of work, but I would have built this functionality myself anyway, so I'm much happier having it in a widely supported platform. Plus the two […]
Actually, while I'm at it, the […]
Now I'm a big fan of smart_open 💯, but the other day I discovered s3fs, gcsfs and fsspec. I wonder how those would deal with this problem. Admittedly their interfaces are not as nice, but it almost makes me wonder if smart_open would just become the magical wrapper interface, or if that takes too much away from what this project was doing. 🤷‍♀️
I doubt the FUSE-based filesystem will be especially efficient for streaming or for high-performance reads of large files, but I haven't really put it to the test; I vaguely recall trying it out a few years ago, and that didn't go anywhere. I'm not using GCP, so no comment on gcsfs, but it's only one filesystem; having the multi-filesystem back end of […]
I've finished a first pass at this optimization. We've got a clean set of unit tests (your existing unit tests were fantastic and a huge help in catching edge cases). I kept all of your unit tests except those specific to […]. It ended up being nearly a full re-write of the […]. The commit is currently on my fork at the link below. I could test it under load in our lab for a while before sending a pull request, or I can send a pull request now on the basis of the passing unit tests; let me know what works best for you.
An in-the-wild load test looks pretty good. I'm running two GPU-based ML jobs performing small random reads over a 2 TB dataset (3.5 GB data files). Namespace bandwidth (top) shows end-user data received by the jobs, and Ceph/S3 bandwidth (bottom) shows what Ceph is reporting (this is on an otherwise unused S3 region, independent of the cluster's main filesystem). Ceph is reporting ~350 MB/s vs. ~650 MB/s of end-user received throughput, which is ideal; the difference is due to cache hits in front of Ceph/S3. Before this update, the same job would have brought Ceph/S3 to its knees.
That looks good! What are the actual file types you're reading, parquet?
These are raw binary electrophysiology data: 25 kHz recordings of electrical potential at 512 recording sites in freely behaving mice, capable of resolving individual neuron spiking, a la Neuralink (probably slightly higher resolution here). The format is a proprietary Whitematter/eCube format, but it's just a binary dump of int16s. I'm at UC Santa Cruz and do this jointly with Wash U. in St. Louis, with the multi-institution cluster run by UCSD.
Fascinating, thanks for the reply, sounds like you have some interesting research going on.
I briefly considered being more predictive about this kind of access pattern, but there wasn't a readily obvious way to do that. At this point it would be easy to implement more predictive logic just by manipulating the […].

Depending on the recording rig, we do use formats like HDF5 (we support 5 different hardware platforms in our code base). HDF5 does a lot of index scanning followed by a ton of sequential reads, which is a terror to stream over a high-latency/high-bandwidth S3 link (it's impossible to set the user block size optimally for all use cases). In this situation my general-purpose solution is to provide a data transformation service that reformats the data from its original form (typically optimized for write performance and anti-optimized for read performance) into a pure binary format explicitly optimized for the particular use case at hand (the user specifies their requirements in the data transformation step). That transformation step allows me to return to the optimal open-read-close paradigm when operating at scale.
Problem description
smart_open is effectively crashing SeaweedFS/S3 and Ceph/S3 filesystems when doing many small reads over large files (e.g. a 32k read on a 4 GB file).

On a SeaweedFS/S3 filesystem (also demonstrated on Ceph/S3), using the code shown in the reproduce section below, I am reading a small segment of data (32k in this example) from a large file (4 GB in this example). This simple request causes SeaweedFS to move the full file for just a 32k range read. This appears to be expected behavior based on a reading of the protocol specifications. Notably, boto3 does not trigger the same behavior.

When I run the code and look at the HTTP traffic being generated, we see the following GET request header:

Range: bytes=0-

That header is our culprit. My assumption is that smart_open intends to open the file for streaming and read data from the stream as dictated by calls to f.read(...).

When performing the ranged read with plain boto3 code, the header looks like this: […]. Using boto3 does not cause any issue. With smart_open, the SeaweedFS/S3 filesystem interprets the lack of a to-bytes value as a request for the full file. It then moves the full (4 GB) data file from a volume server to an S3 server, of which just 32k is passed to the end-user job. This very quickly oversaturates the network.

The protocol specifications seem to agree that SeaweedFS/S3's behavior is a correct way to interpret this Range header; after all, how can the filesystem know that the user won't eventually read the whole file given this header?
https://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.35
Steps/code to reproduce the problem
This code triggers the full (4 GB) file transfer (internally, not to the end user application) for a small 32k ranged read.
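The original snippet is not reproduced in this extract, but the pattern was along these lines (bucket and key are placeholders):

```python
import smart_open

# A tiny ranged read from a very large object. The underlying GET is sent with
# "Range: bytes=0-" (no end byte), which SeaweedFS satisfies by staging the
# entire file internally.
url = "s3://my-bucket/data/4gb-file.bin"  # hypothetical 4 GB object
with smart_open.open(url, "rb") as fin:
    data = fin.read(32 * 1024)
print(len(data))
```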
This boto3 version of the code does not trigger the same issue:
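Again, the original snippet is not included in this extract; a boto3 equivalent would look roughly like this, with an explicit end byte in the Range header (bucket and key hypothetical):

```python
import boto3

s3 = boto3.client("s3")

# Both ends of the range are specified, so only ~32 KiB moves through the cluster.
resp = s3.get_object(
    Bucket="my-bucket",
    Key="data/4gb-file.bin",
    Range="bytes=0-32767",
)
data = resp["Body"].read()
print(len(data))
```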
I'd like to open a discussion to determine how to properly interpret the S3 protocol and figure out whether an issue like this should be in the domain of the filesystem(s) or should be changed in smart_open.

Versions