[Feature Proposal] Enhancement of Repository Plugin #6354
Comments
Tagging @elfisher @muralikpbhat @reta @mch2 @dreamer-89 @andrross @Bukhtawar @sachinpkale @itiyamas for feedback. Please do tag others who can review this.
This proposal looks like a good idea to me. I suggest putting together a quick proof-of-concept of the interface changes you have in mind, as I suspect it'll be easier for folks to provide feedback on a more concrete proposal. It would also be great to have at least rough implementations from multiple object stores to ensure the API has the right abstractions (though what you're proposing does look pretty generic).
This proposal looks great. My rule of thumb is: if we are making the product better, it's a go. From what I understand, we are improving the upload (or download) to the remote store using multipart [1]. It absolutely makes sense, as it improves performance and reliability dramatically. The trickier part is to convert the single InputStream into multiple streams.

Also, as an optimization, multipart upload/download only makes sense for large files (>100 MB recommended for S3), so we should gate this and make sure we don't overkill for small files.

Overall it makes sense. I second @andrross's idea: let's implement this for one blob store, and we can converge on the right set of abstractions.

[1] https://docs.aws.amazon.com/AmazonS3/latest/userguide/mpuoverview.html
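For illustration, the size gate suggested above could be as simple as the following sketch (class and method names are hypothetical; the threshold is the S3 recommendation from [1]):

```java
// Minimal sketch of the size gate described above; names are illustrative.
// The 100 MB threshold follows the S3 recommendation cited in [1].
final class TransferSizeGate {
    static final long MULTIPART_THRESHOLD_BYTES = 100L * 1024 * 1024;

    static boolean shouldUseMultipart(long blobSize) {
        // Small blobs take the simple single-stream path; large ones go multipart.
        return blobSize >= MULTIPART_THRESHOLD_BYTES;
    }
}
```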
@saratvemulapalli Your overall summary is correct: we use multiple streams emitted from a file for the transfer instead of a single stream. There are a few corrections in the detailed view, though.
We will add details in the design doc.
@vikasvb90 thank you. That makes sense.
I think the proposal makes sense, but I would love to see the API sketched out, as @andrross suggested. The repository plugin is pretty generic in nature; there are feature disparities between S3 / Azure / HDFS / GCS / OCS / whatever comes next, and these need to be baked into the API concisely.
This is a great proposal!
|
The limiting factor, as I understand it, is that the interface passes in a single InputStream for one file. Given that an InputStream only gives sequential access to a file, the only way to do concurrent multipart uploads without changing the interface would be to buffer the parts into memory from that InputStream. That could be done for a proof-of-concept, but I'm not sure how valuable that would be given that the memory requirements would likely be unacceptable in the general case. I think a proof-of-concept including the interface changes would be pretty straightforward.
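To make the memory concern concrete, here is a rough sketch of that buffering workaround under the current single-stream contract (uploadPart is a hypothetical stand-in for a vendor SDK call):

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;

// Naive workaround: read each part fully into memory so uploads can run
// concurrently. With N parts in flight of size P, this holds N * P bytes on
// the heap, which is the memory concern raised above.
final class BufferedMultipartUpload {
    static List<Future<?>> upload(InputStream in, long contentLength, int partSize,
                                  ExecutorService executor) throws IOException {
        List<Future<?>> pending = new ArrayList<>();
        int partNumber = 0;
        long remaining = contentLength;
        while (remaining > 0) {
            int toRead = (int) Math.min(partSize, remaining);
            byte[] part = in.readNBytes(toRead); // serial read forced by the single stream
            remaining -= part.length;
            final int number = partNumber++;
            pending.add(executor.submit(() -> uploadPart(number, part))); // hypothetical
        }
        return pending; // caller waits on futures and handles failures
    }

    // Hypothetical stand-in for a vendor-specific part upload.
    private static Void uploadPart(int partNumber, byte[] bytes) {
        return null;
    }
}
```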
|
Even with a single InputStream …
@dblock I am not sure if it is also the case for other remote stores, but for S3 it will surely not work. This means that we need one stream per part, each referring to a particular portion of the file and responsible for uploading only that portion.
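For context, one independent stream per part can be produced from a local file with plain JDK I/O; a rough sketch (the helper name is illustrative):

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.channels.Channels;
import java.nio.channels.SeekableByteChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Sketch: an independent stream over [offset, offset + length) of a file.
// Each part gets its own channel, so parts can be consumed in parallel.
final class PartStreams {
    static InputStream partStream(Path file, long offset, long length) throws IOException {
        SeekableByteChannel channel = Files.newByteChannel(file, StandardOpenOption.READ);
        channel.position(offset);
        InputStream unbounded = Channels.newInputStream(channel);
        return new FilterInputStream(unbounded) {
            private long remaining = length; // cap reads so this stream covers only its part

            @Override
            public int read() throws IOException {
                if (remaining <= 0) return -1;
                int b = super.read();
                if (b != -1) remaining--;
                return b;
            }

            @Override
            public int read(byte[] buf, int off, int len) throws IOException {
                if (remaining <= 0) return -1;
                int n = super.read(buf, off, (int) Math.min(len, remaining));
                if (n > 0) remaining -= n;
                return n;
            }
        };
    }
}
```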
The purpose of this issue is to gather community feedback on the proposal to enhance the repository plugin.
The OpenSearch repository plugin today provides transfer capabilities to remote stores via streams. The plugin allows users to store index data in off-cluster external repositories such as Amazon S3, Google Cloud Storage, or a shared filesystem, in addition to the local disk of the OpenSearch cluster. By using the repository plugin, OpenSearch users can take advantage of the Snapshot feature to back up and restore data, protecting against data loss, enabling disaster recovery, and creating replicas of their data for testing and development purposes. With remote-backed storage, users now have the ability to protect against data loss by automatically creating continuous backups of all index transactions and sending them to remote storage. OpenSearch users can achieve request-level durability using remote-backed storage.
Problem Statement
The OpenSearch repository plugin today provides interfaces such as writeBlob to facilitate the transfer of a file using a single InputStream. This means that the file referenced by the InputStream must be processed serially: the underlying plugin transfers buffered content of the file one chunk at a time, and only after the first buffer is successfully processed is the next buffer of content read and transferred. Parallel processing of multiple parts of a file is therefore not possible, so use cases such as downloading or uploading multiple parts of a file in parallel cannot be supported.
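For reference, the existing single-stream contract looks roughly like this (simplified sketch, not the exact signature):

```java
import java.io.IOException;
import java.io.InputStream;

// Simplified sketch of today's contract: a single InputStream per blob forces
// the file's bytes to be consumed sequentially.
interface BlobContainerSketch {
    void writeBlob(String blobName, InputStream inputStream, long blobSize, boolean failIfAlreadyExists)
        throws IOException;
}
```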
The S3 repository plugin, for instance, provides support for multipart upload, but due to the single-InputStream restriction of the base plugin, each part is uploaded serially, even though S3 supports parallel upload of the individual parts of a file.
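As an illustration of what per-part streams would unlock, here is a hedged sketch using the AWS SDK for Java v1 multipart APIs (the openPartStream and partSizeFor helpers are hypothetical placeholders):

```java
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.CompleteMultipartUploadRequest;
import com.amazonaws.services.s3.model.InitiateMultipartUploadRequest;
import com.amazonaws.services.s3.model.InitiateMultipartUploadResult;
import com.amazonaws.services.s3.model.PartETag;
import com.amazonaws.services.s3.model.UploadPartRequest;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;

// Sketch: once each part has its own stream, UploadPart calls can run
// concurrently instead of serially.
final class ParallelS3Upload {
    static void upload(AmazonS3 s3, String bucket, String key, int numberOfParts,
                       ExecutorService executor) throws ExecutionException, InterruptedException {
        InitiateMultipartUploadResult init =
            s3.initiateMultipartUpload(new InitiateMultipartUploadRequest(bucket, key));
        List<Future<PartETag>> futures = new ArrayList<>();
        for (int partNumber = 1; partNumber <= numberOfParts; partNumber++) {
            final int n = partNumber;
            futures.add(executor.submit(() -> {
                UploadPartRequest request = new UploadPartRequest()
                    .withBucketName(bucket)
                    .withKey(key)
                    .withUploadId(init.getUploadId())
                    .withPartNumber(n)
                    .withInputStream(openPartStream(n)) // hypothetical per-part stream
                    .withPartSize(partSizeFor(n));      // hypothetical size helper
                return s3.uploadPart(request).getPartETag();
            }));
        }
        List<PartETag> etags = new ArrayList<>();
        for (Future<PartETag> f : futures) {
            etags.add(f.get()); // propagate part failures before completing
        }
        s3.completeMultipartUpload(
            new CompleteMultipartUploadRequest(bucket, key, init.getUploadId(), etags));
    }

    private static java.io.InputStream openPartStream(int partNumber) {
        throw new UnsupportedOperationException("hypothetical helper");
    }

    private static long partSizeFor(int partNumber) {
        throw new UnsupportedOperationException("hypothetical helper");
    }
}
```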
Proposed Solution
We propose to enhance the existing repository plugin to support multiple stream suppliers, so that underlying vendor plugins can optionally provide multi-stream implementations for remote transfers. Provisioning multiple streams, rather than abstracting the transfer at the file level, gives core OpenSearch code control to pre-process buffered content with multiple stream wrappers after the content is read and before the transfer takes place. Using stream suppliers instead of concrete streams further allows stream creation to be deferred until the remote transfer is started. The following are some of the abstractions we propose to provide the required support:
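As a rough illustration, stream-supplier abstractions along these lines might look like the following (all names are hypothetical, not the final API):

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical sketch only. A supplier defers opening a part's stream until
// the remote transfer actually starts.
@FunctionalInterface
interface StreamSupplier {
    /** Opens a stream over [position, position + length) of the underlying file. */
    InputStream supply(long position, long length) throws IOException;
}

// Describes a multi-part transfer: part boundaries plus lazy stream creation.
// Core code can wrap each part stream (e.g. for checksums or throttling)
// after content is read and before the transfer takes place.
final class StreamContext {
    private final StreamSupplier streamSupplier;
    private final long partSize;
    private final long contentLength;

    StreamContext(StreamSupplier streamSupplier, long partSize, long contentLength) {
        this.streamSupplier = streamSupplier;
        this.partSize = partSize;
        this.contentLength = contentLength;
    }

    int numberOfParts() {
        return (int) ((contentLength + partSize - 1) / partSize); // ceiling division
    }

    /** Opens part {@code partNumber}'s stream only when the transfer begins. */
    InputStream openPartStream(int partNumber) throws IOException {
        long position = (long) partNumber * partSize;
        long length = Math.min(partSize, contentLength - position);
        return streamSupplier.supply(position, length);
    }
}
```

A vendor plugin that supports parallel part transfer could then fan out one worker per part via openPartStream, while a vendor without that capability could still consume the parts sequentially.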
Credits - @vikasvb90, @ashking94