Parallel uploads for multipart #1020
The problem is not the hash, really. The problem is the split chain itself, and v1/v2 are the same here; pre-#957 code didn't support this either. Chaining means each split chunk references the previous one, so chunks can only be stored in order.

Potential solutions (see the sketch after this list):

- Temporary objects. They require additional logic on the S3 side and can leave garbage that is harder to trace (additional attributes?). They can be optional: we can try pushing the next split chunk when possible and resort to additional objects only when a part arrives out of sequence. And they will seriously affect multipart completion, which will require quite some time to reslice everything (hashing alone would be much easier, but that's not the problem we have).
- Supporting "slots": some additional "part number" attribute that is used instead of "previous". It completely breaks the backward-walking assembly logic and makes link objects more important, but it's still a possibility, and we can still find all related objects this way. It can also simplify part reuploading. At the same time, it's a protocol change. Can this be useful for standalone NeoFS? Not sure.
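For illustration, a minimal sketch (in Go, with hypothetical simplified types, not the actual NeoFS API) of the difference between the current "previous"-chained split metadata and the slot-style alternative:

```go
package main

import "fmt"

// ObjectID stands in for a real NeoFS object identifier.
type ObjectID string

// Current scheme: each split chunk references the one before it,
// so chunk N cannot be stored before chunk N-1 is known.
type chainedChunk struct {
	ID       ObjectID
	Previous *ObjectID // nil for the first chunk in the chain
}

// Slot-style alternative: each chunk carries its own position,
// so chunks can be stored in any order.
type slottedChunk struct {
	ID         ObjectID
	PartNumber int // replaces the "previous" link
}

func main() {
	first := chainedChunk{ID: "A"}
	second := chainedChunk{ID: "B", Previous: &first.ID}
	fmt.Println(*second.Previous) // "B" depends on "A" existing first

	// Slotted parts are independent of each other and may arrive out of order.
	parts := []slottedChunk{{ID: "B", PartNumber: 2}, {ID: "A", PartNumber: 1}}
	fmt.Println(parts)
}
```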
From the NeoFS side I see some questions, and the main one is: if we can solve them successfully, why have we needed this backward-chained logic from the beginning, and for so long, if we could have accepted a simpler scheme (albeit one based on agreements that would have to be taken as truth)?

I don't mind considering protocol changes, but for now this looks more like trying to play against NeoFS and inventing kludges around it.
Chained objects are more robust, and they're very good for streams of data. The typical NeoFS slicing pattern is exactly that: you know the previous object's hash, you know all the hashes, you can build the links and indexes efficiently, and you can always follow the chain exactly. A slot-like structure is more fragile, and it's not simpler: without an index object it requires searches to find the other parts. Also, regarding its use for S3, one thing to keep in mind is that we probably can't ensure a 1:1 slot mapping between NeoFS and S3, since parts there range from 5 MB to 5 GB, and 5 GB is a big (split) object in NeoFS. Split hierarchies are something we've long tried to avoid, and I'd still try to.

Unfortunately, it looks like this limits us to some S3-specific scheme with regular objects that are then reassembled upon upload completion, which totally destroys the optimization we have now (almost-free multipart upload completion). I'm all ears for other ideas.
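To make the backward-walking assembly concrete, a hedged sketch: `headPrevious` below is a hypothetical stand-in for a NeoFS object-head lookup that returns a chunk's "previous" link; starting from the last chunk, the whole chain is recovered and reversed into payload order.

```go
package main

import "fmt"

type ObjectID string

// assembleChain walks a backward-linked split chain from the last chunk
// to the first and returns chunk IDs in payload order. headPrevious is a
// stand-in for a real object-head request; it returns nil at the chain start.
func assembleChain(last ObjectID, headPrevious func(ObjectID) (*ObjectID, error)) ([]ObjectID, error) {
	var rev []ObjectID
	for id := &last; id != nil; {
		rev = append(rev, *id)
		prev, err := headPrevious(*id)
		if err != nil {
			return nil, err
		}
		id = prev
	}
	// We collected last..first; the payload order is first..last.
	for i, j := 0, len(rev)-1; i < j; i, j = i+1, j-1 {
		rev[i], rev[j] = rev[j], rev[i]
	}
	return rev, nil
}

func ptr(id ObjectID) *ObjectID { return &id }

func main() {
	links := map[ObjectID]*ObjectID{"C": ptr("B"), "B": ptr("A"), "A": nil}
	chain, _ := assembleChain("C", func(id ObjectID) (*ObjectID, error) { return links[id], nil })
	fmt.Println(chain) // [A B C]
}
```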
The proposal here is to store each part as a regular intermediate NeoFS object and reslice the parts into the final object upon upload completion.

Intermediate objects can be found via attributes if we need to respond to ListParts. Unfortunately, we can't easily expire them: the default S3 behavior is to keep a multipart upload open for as long as needed, even though lifecycle policies are recommended in practice. We will try to minimize the reslicing overhead as much as possible, while at the same time making it possible to use S3 multipart uploads the way they were designed.
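A minimal sketch of the proposed completion flow, assuming parts are regular objects tagged with a hypothetical part-number attribute (names are illustrative, not the gate's actual code):

```go
package main

import (
	"fmt"
	"io"
	"sort"
	"strings"
)

type ObjectID string

type part struct {
	ID     ObjectID
	Number int // from a hypothetical "part number" attribute
}

// completeMultipart orders the parts found via attribute search and
// reslices their concatenated payload into the final object. readPart and
// slice are stand-ins for real NeoFS payload reads and the object slicer.
func completeMultipart(parts []part, readPart func(ObjectID) io.Reader, slice func(io.Reader) error) error {
	sort.Slice(parts, func(i, j int) bool { return parts[i].Number < parts[j].Number })
	readers := make([]io.Reader, 0, len(parts))
	for _, p := range parts {
		readers = append(readers, readPart(p.ID))
	}
	// This concatenated read is exactly the reslicing overhead discussed above.
	return slice(io.MultiReader(readers...))
}

func main() {
	parts := []part{{ID: "b", Number: 2}, {ID: "a", Number: 1}}
	payloads := map[ObjectID]string{"a": "hello, ", "b": "world"}
	readPart := func(id ObjectID) io.Reader { return strings.NewReader(payloads[id]) }
	slice := func(r io.Reader) error {
		data, _ := io.ReadAll(r)
		fmt.Println(string(data)) // hello, world
		return nil
	}
	_ = completeMultipart(parts, readPart, slice)
}
```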
Multipart uploads don't work in the general case
Current Behavior
The AWS SDK uploads multipart parts in 5 parallel threads, while the gate expects the parts to arrive sequentially, one by one.
Expected Behavior
It should be possible to upload parts in any order.
Possible Solution
Collect the final object hash in a different way
Steps to Reproduce
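A minimal reproduction sketch (not the exact client from the report) using the AWS SDK for Go v2 upload manager, which splits the object into parts and uploads them with 5 concurrent goroutines; the endpoint, bucket, and key are placeholders for a NeoFS S3 gate setup:

```go
package main

import (
	"bytes"
	"context"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/feature/s3/manager"
	"github.com/aws/aws-sdk-go-v2/service/s3"
)

func main() {
	ctx := context.Background()
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		log.Fatal(err)
	}
	client := s3.NewFromConfig(cfg, func(o *s3.Options) {
		o.BaseEndpoint = aws.String("http://localhost:8080") // placeholder gate endpoint
		o.UsePathStyle = true
	})

	uploader := manager.NewUploader(client, func(u *manager.Uploader) {
		u.Concurrency = 5            // the SDK default; parts are uploaded in parallel
		u.PartSize = 5 * 1024 * 1024 // minimum S3 part size
	})

	// ~25 MB payload -> 5 parts uploaded concurrently, likely out of order.
	body := bytes.Repeat([]byte("x"), 25*1024*1024)
	_, err = uploader.Upload(ctx, &s3.PutObjectInput{
		Bucket: aws.String("test-bucket"),
		Key:    aws.String("big-object"),
		Body:   bytes.NewReader(body),
	})
	if err != nil {
		log.Fatal(err) // fails with OperationAborted (409) against the gate
	}
}
```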
The upload fails with `OperationAborted` (HTTP 409).
Related to #1016
Your Environment