
S3: getObject, ResponseInputStream close not working as expected. #2117

Open
JstDust opened this issue Oct 26, 2020 · 5 comments
Labels
bug This issue is a bug. p3 This is a minor priority issue

Comments

@JstDust commented Oct 26, 2020

Describe the bug

When you invoke close on the ResponseInputStream returned by getObject before you reach the end of the data, the stream will not close immediately (or within a reasonable time) on big files.

Expected Behavior

When invoking close on a ResponseInputStream, you would expect it to close within a few seconds at most.

Current Behavior

It does not close until it has completed a full read of the stream internally (this seems to be related to internal connection recycling).

Steps to Reproduce

  • Upload a big file to S3, at least 1–2 GB, to be able to observe the issue on a fast internet connection, then run:

var r = client.getObject(c -> c.bucket(bucket).key(key).versionId(versionId));
r.read(new byte[1024]); // just to make sure we actually opened the resource and did something
r.close(); // now you have to wait until the entire file is read internally by close()

Possible Solution

  • The internal handler closes the stream immediately, possibly by simply closing the underlying connection.
  • It could make the decision based on the number of bytes remaining to be read, to keep the impact small on big files (e.g. if more than 1 MB remains, close the connection; otherwise consume the rest), as sketched below.
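A rough sketch of that decision (illustrative only; EarlyClose, closeEarly and DRAIN_THRESHOLD are hypothetical names, not SDK API):

import java.io.IOException;
import software.amazon.awssdk.core.ResponseInputStream;
import software.amazon.awssdk.services.s3.model.GetObjectResponse;

final class EarlyClose {
    private static final long DRAIN_THRESHOLD = 1024 * 1024; // 1 MB, arbitrary cutoff

    /** Abort if draining the remaining body would be expensive, otherwise let close() drain it. */
    static void closeEarly(ResponseInputStream<GetObjectResponse> in, long bytesReadSoFar)
            throws IOException {
        Long contentLength = in.response().contentLength();
        if (contentLength != null && contentLength - bytesReadSoFar > DRAIN_THRESHOLD) {
            in.abort(); // closes the underlying connection immediately, nothing is drained
        }
        in.close(); // cheap now: either already aborted, or only a little data is left to drain
    }
}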

Context

I was trying to create a test case for a wrapper I created for handling network-related issues (e.g. connection reset): it simply tracks the position in the stream and, when the connection fails, uses a range request to continue the stream transparently (see the sketch below).
While testing it with a wrapper around the ResponseInputStream that randomly throws an IOException, I noticed the extreme delay when closing the ResponseInputStream.
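A rough sketch of that wrapper idea (simplified and untested; a real version would cap the number of retries):

import java.io.IOException;
import java.io.InputStream;
import software.amazon.awssdk.core.ResponseInputStream;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.GetObjectResponse;

class ResumingS3InputStream extends InputStream {
    private final S3Client client;
    private final String bucket;
    private final String key;
    private ResponseInputStream<GetObjectResponse> in;
    private long position; // bytes successfully read so far

    ResumingS3InputStream(S3Client client, String bucket, String key) {
        this.client = client;
        this.bucket = bucket;
        this.key = key;
        this.in = client.getObject(b -> b.bucket(bucket).key(key));
    }

    @Override
    public int read() throws IOException {
        try {
            int b = in.read();
            if (b >= 0) position++;
            return b;
        } catch (IOException e) {
            reopen();      // resume from the last known position
            return read(); // unbounded retry, for brevity only
        }
    }

    private void reopen() {
        in.abort(); // drop the broken connection without draining it
        in = client.getObject(b -> b.bucket(bucket).key(key)
                                    .range("bytes=" + position + "-"));
    }

    @Override
    public void close() throws IOException {
        in.close(); // and here is where the delay described below shows up
    }
}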

Workaround

var r = client.getObject(c -> c.bucket(bucket).key(key).versionId(versionId));
r.read(new byte[1024]); // just to make sure we actually opened the resource and did something
r.abort(); // <<< workaround, but this is not part of the InputStream contract
r.close(); // now the connection closes almost immediately

Your Environment

  • AWS Java SDK version used: 2.15.14
  • JDK version used: 11
  • Operating System and version: Windows 10
@JstDust added the bug and needs-triage labels Oct 26, 2020
@debora-ito (Member) commented
It looks like abort is what you are looking for, according to the ResponseInputStream documentation:

If it is not desired to read remaining data from the stream, you can explicitly abort the connection via abort(). Note that this will close the underlying connection and require establishing an HTTP connection which may outweigh the cost of reading the additional data.

I'm not sure what the expected behavior of close is after a read has started; I'll do some research.

@debora-ito removed the needs-triage label Nov 10, 2020
@JstDust (Author) commented Nov 17, 2020

@debora-ito It's fine that there is an abort() call for this, but it simply does not work well when you use wrappers to actually read the contents of a stream, e.g. a simple InputStreamReader, a GZIPInputStream, etc.

Those are often returned directly, like below; at that point the abort() and the whole S3 backend are lost to the caller.

public Reader readText(String bucket, String key) {
  return new InputStreamReader(client.getObject(.....), StandardCharsets.UTF_8);
}

And the only thing that remains of the contract is the close() operation. In the Javadocs this is very clearly documented:
https://docs.oracle.com/javase/8/docs/api/java/io/InputStream.html#close--
https://docs.oracle.com/javase/8/docs/api/java/io/InputStreamReader.html#close--

"Closes the stream and releases any system resources associated with it."

I do not consider it an issue that on EOF the connection is returned to the pool. But if for some reason the stream gets closed early, continuing to read inside the close() call is effectively not part of the contract, and it does cause long waits when something fails, e.g. when using a try-with-resources statement like below.

try (BufferedReader reader = (...s3call...)) {
  String line;
  while ((line = reader.readLine()) != null) {
    if (line.length() < 10) throw new IllegalStateException("Line too short");
    ....some code...
  }
} // now have fun waiting if that is the first line of a multi-gigabyte file

So I consider it a violation of the InputStream#close() contract that getObject returns an InputStream whose close() continues reading.
Also, the gains of this aggressive return-to-pool behavior seem minimal at best, especially on an early close. Most of the time it is probably more IO- and CPU-friendly to just close the underlying connection whenever it cannot be returned to the pool immediately when close is invoked.

@shearn89 commented Mar 4, 2022

This appears to happen with other streams as well. I'm reading from a Kinesis Video stream that returns a GetMediaResponse for the important bit (reading the data). When I'm done, for example because I've hit my stopFragmentNumber, I stop processing. When I then call close() on the ResponseInputStream<GetMediaResponse> object, it takes about 3 seconds to close.

@shearn89 commented Mar 4, 2022

I can use the same workaround by doing:

ResponseInputStream<GetMediaResponse> inputStream =
        (ResponseInputStream<GetMediaResponse>) kvsStreamTrackObject.getInputStream();
inputStream.abort();

...although I then also have to call .release(), or else it throws an error on close.
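So the full sequence looks roughly like this (release() as described above; whether and where it is available may depend on the SDK version):

inputStream.abort();   // close the underlying connection immediately
inputStream.release(); // as noted above, avoids an error on the subsequent close()
inputStream.close();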

aws-sdk-java-automation added a commit that referenced this issue Aug 10, 2022
@yasminetalby added the p3 label Nov 12, 2022
@hohwille commented
I fully agree with @JstDust. As a temporary workaround one has to create one's own S3InputStream class that delegates all method calls to a wrapped ResponseInputStream, but whose close() method first calls abort() and then close(), as sketched below.
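A minimal sketch of such a wrapper (untested, assuming the S3 GetObjectResponse stream):

import java.io.FilterInputStream;
import java.io.IOException;
import software.amazon.awssdk.core.ResponseInputStream;
import software.amazon.awssdk.services.s3.model.GetObjectResponse;

class S3InputStream extends FilterInputStream {
    private final ResponseInputStream<GetObjectResponse> delegate;

    S3InputStream(ResponseInputStream<GetObjectResponse> delegate) {
        super(delegate); // FilterInputStream forwards all other calls to the delegate
        this.delegate = delegate;
    }

    @Override
    public void close() throws IOException {
        delegate.abort(); // drop the connection instead of draining the remaining body
        super.close();
    }
}

Note that the unconditional abort() also prevents connection reuse when the stream was fully read; a smarter version would only abort when EOF has not yet been reached.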

@debora-ito IMHO there is not much to clarify here, as this is just the contract every Java developer would expect. I can imagine that it is a huge effort to get the S3 clients for all the different programming languages perfect, and that it may be tricky to change existing behavior, but IMHO this is the way to go.
Java developers just want a seamless API experience for S3. After all, S3 stands for "Simple Storage Service", so developers would love to have it simple instead of figuring everything out themselves and adding workarounds for gaps in the SDK, such as also the missing ability to put via an OutputStream.

Could we not expect things like this to be solved out of the box by the SDK of the #1 hyperscaler?
From my PoV this would be lovely...
