Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove S3 output stream #27280

Merged
merged 3 commits into from
Nov 10, 2017
Merged

Remove S3 output stream #27280

merged 3 commits into from
Nov 10, 2017

Conversation

tlrx
Copy link
Member

@tlrx tlrx commented Nov 6, 2017

The S3OutputStream class has been added in elasticsearch 1.4 in order to take advantage of the AWS Multipart Upload API. At that time, the Snapshot/Restore API didn't communicate the size of the blobs to be written to the repository implementation: the logic was to open an Outputstream and start to write bytes to it. With this logic, we decided to bufferize the bytes in memory until we know which API to use between a single upload request or multipart upload requests. This is why the buffer and associated logic was implemented as an OutputStream.

Now the blob size information is available before writing anything, the repository implementation can know upfront what will be the more suitable API to upload the blob to S3.

This pull request removes the DefaultS3OutputStream and S3OutputStream classes and moves the implementation of the upload logic directly in the S3BlobContainer. I think it's easier to understand and easier to maintain. It also avoids the internal buffering by passing the InputStream directly to the S3 client that takes care of buffering (up to 16Mb) and retry the requests. It removes pressure on memory (some allocations outside TLAB drop from 2.7Gb to 57Mb in my tests while total snapshot time drops from 35%), specially on nodes with small heap like 1Gb-4Gb.

Note: this has been tested together with #27278.

Copy link
Contributor

@imotov imotov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM from the snapshot/restore perspective. Love the simplification. Left one minor observation. Feel free to ignore it if you disagree.

Streams.copy(inputStream, stream);
}

final UploadMethod method = (blobSize <= blobStore.bufferSizeInBytes()) ? this::executeSingleUpload : this::executeMultipartUpload;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I fell like a functional interface here doesn't really buy us anything comparing to a simple if statement with two calls.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree and pushed 4421ee1

Copy link
Contributor

@ywelsch ywelsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Left one minor nit.

@@ -87,7 +87,7 @@
* use of the Multipart API and may result in upload errors. Defaults to the minimum between 100MB and 5% of the heap size.
*/
static final Setting<ByteSizeValue> BUFFER_SIZE_SETTING = Setting.byteSizeSetting("buffer_size", DEFAULT_BUFFER_SIZE,
new ByteSizeValue(5, ByteSizeUnit.MB), new ByteSizeValue(5, ByteSizeUnit.TB));
new ByteSizeValue(5, ByteSizeUnit.MB), new ByteSizeValue(5, ByteSizeUnit.GB));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you extract these as contants and then use these instead of redefining them in S3BlobContainer? Can you also document the limitations of S3 in the Javadoc here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed 4aae0ce

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Thanks :)

@tlrx tlrx mentioned this pull request Nov 9, 2017
@tlrx tlrx force-pushed the remove-s3-outputstream branch from 4aae0ce to cc3c690 Compare November 10, 2017 09:34
@tlrx tlrx merged commit 9c4d6c6 into elastic:master Nov 10, 2017
tlrx added a commit that referenced this pull request Nov 10, 2017
Now the blob size information is available before writing anything, 
the repository implementation can know upfront what will be the 
more suitable API to upload the blob to S3.

This commit removes the DefaultS3OutputStream and S3OutputStream 
classes and moves the implementation of the upload logic directly in the 
S3BlobContainer.

related #26993
closes #26969
tlrx added a commit that referenced this pull request Nov 10, 2017
Now the blob size information is available before writing anything, 
the repository implementation can know upfront what will be the 
more suitable API to upload the blob to S3.

This commit removes the DefaultS3OutputStream and S3OutputStream 
classes and moves the implementation of the upload logic directly in the 
S3BlobContainer.

related #26993
closes #26969
@tlrx tlrx added the v6.0.1 label Nov 10, 2017
martijnvg added a commit that referenced this pull request Nov 14, 2017
* es/master: (24 commits)
  Reduce synchronization on field data cache
  add json-processor support for non-map json types (#27335)
  Properly format IndexGraveyard deletion date as date (#27362)
  Upgrade AWS SDK Jackson Databind to 2.6.7.1
  Stop responding to ping requests before master abdication (#27329)
  [Test] Fix POI version in packaging tests
  Allow affix settings to specify dependencies (#27161)
  Tests: Improve size regex in documentation test (#26879)
  reword comment
  Remove unnecessary logger creation for doc values field data
  [Geo] Decouple geojson parse logic from ShapeBuilders
  [DOCS] Fixed link to docker content
  Plugins: Add versionless alias to all security policy codebase properties (#26756)
  [Test] #27342 Fix SearchRequests#testValidate
  [DOCS] Move X-Pack-specific Docker content (#27333)
  Fail queries with scroll that explicitely set request_cache (#27342)
  [Test] Fix S3BlobStoreContainerTests.testNumberOfMultiparts()
  Set minimum_master_nodes to all nodes for REST tests (#27344)
  [Tests] Relax allowed delta in extended_stats aggregation (#27171)
  Remove S3 output stream (#27280)
  ...
tlrx added a commit that referenced this pull request Nov 14, 2017
Now the blob size information is available before writing anything,
the repository implementation can know upfront what will be the
more suitable API to upload the blob to S3.

This commit removes the DefaultS3OutputStream and S3OutputStream
classes and moves the implementation of the upload logic directly in the
S3BlobContainer.

related #26993
closes #26969
@tlrx tlrx added the v5.6.5 label Nov 14, 2017
@tlrx
Copy link
Member Author

tlrx commented Nov 14, 2017

This change has been backported to 5.6.5, 6.0.1, 6.1.0 and master.

@tlrx tlrx deleted the remove-s3-outputstream branch November 14, 2017 10:29
@clintongormley clintongormley added :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs and removed :Plugin Repository S3 labels Feb 14, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants