-
-
Notifications
You must be signed in to change notification settings - Fork 382
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Rewrite of s3.Reader class to protect S3 from open range headers (#725) #734
base: develop
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left you a ton of comments, please have a look. Thanks!
@@ -232,7 +233,7 @@ def open( | |||
buffer_size=DEFAULT_BUFFER_SIZE, | |||
min_part_size=DEFAULT_MIN_PART_SIZE, | |||
multipart_upload=True, | |||
defer_seek=False, | |||
stream_range=10485760, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
stream_range=10485760, | |
stream_range=DEFAULT_STREAM_RANGE, |
single HTTP request across multiple read calls, this is an important protection | ||
for the S3 server, ensuring the S3 server doesn't get an open-ended byte-range request | ||
which can cause it to internally queue up a massive file when only a small bit of it may ultimately be | ||
read by the user. Note that the first read call (after opening or seeking) will always set the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The note is an implementation detail. It doesn't belong in user documentation. Please remove it.
You've already mentioned this in a code comment below - that is sufficient.
smart_open/s3.py
Outdated
|
||
def __str__(self): | ||
return 'smart_open.s3._SeekableReader(%r, %r)' % (self._bucket, self._key) | ||
# class _SeekableRawReader(object): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this commented out? If this class is no longer needed, please remove it.
client=None, | ||
client_kwargs=None, | ||
): | ||
assert isinstance(stream_range, int), \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please raise a ValueError instead.
self._line_terminator = line_terminator | ||
|
||
# the max_stream_size setting limits how much data will be read in a single HTTP request, this is an |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean stream_range? I don't see max_stream_size anywhere.
if self._body is not None: | ||
self._body.close() | ||
self._body = self._stream_range_from = self._stream_range_to = None | ||
self._eof = False if self._stream_range_from is None or self._position < self._content_length - 1 else True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please rewrite this as a block of conditionals.
if self._stream_range_from is None or self._position < self._content_length - 1:
self._eof = False
else:
self._eof = True
stop = None if size == -1 else self._position + size | ||
self._open_body(start=self._position, stop=stop) | ||
|
||
# |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like copypasta from below, please refactor.
self._eof = True | ||
# check if existing body range is sufficient to satisfy this request, if not close it so that it re-opens | ||
# with an appropriate range. | ||
is_stream_limit_before_eof = self._body is not None \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use all() to simplify this expression.
is_stream_limit_before_eof = all(
self._body is not None,
size < 0,
...
)
this avoids the repetitive "and" and artificial line breaks.
# open the HTTP request if it's not open already | ||
if self._body is None: | ||
start = self._position + len(self._buffer) | ||
stop = None if size < 0 else \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please don't abuse the ternary operator :)
if size < 0:
stop = None
else if self._content_length is None:
stop = start + size - 1
else:
stop = start + max(size, self._stream_range) - 1
|
||
# If we can figure out that we've read past the EOF, then we can save | ||
# an extra API call. | ||
reached_eof = True if self._content_length is not None and self._position >= self._content_length else False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ternary operator.
Also, here's a question: why do we spend so much effort handling the case when the content length is None? I think we ever reach a point where we need it and don't have it, we should just query it.
What do you think?
@piskvorky There's a ton of work left for this PR, my recommendation is to remove it from the milestone and deal with it after we release 6.3.0 |
…3 implementations.
came across a nice read about S3 and range requests https://blog.limbus-medtec.com/the-aws-s3-denial-of-wallet-amplification-attack-bc5a97cc041d |
fwiw, we routinely do a |
Motivation
The
smart_open.s3.Reader
class left the HTTP range header open (e.g.bytes=0-
). That was done to allow multiple (future) calls toread
to stream bytes from the same HTTP body to avoid duplicating HTTP headers and connections. The open range header causes many S3 servers to internally move the full file, or a large block of the file, from its storage location to a server handling the S3 request. This was demonstrated on Ceph/S3 and SeaweedFS/S3, see issue #725. In the case of a small read against a large file this can quickly overwhelm the S3 filesystem.Fixes #725.
Work Performed
The class
smart_open.s3.Reader
was re-written to ensure the HTTP range headers are always set reasonably to protect the S3 filesystem. The rewrite strikes a balance between optimizing for the open-read-close use case and the open-repeat-read-close use case. See the git issue #725 for a detailed discussion.Tests
Existing unit tests for
smart_open.s3
all pass. A few unit tests were removed for features that were eliminated and a few unit tests were added to cover new features. Some unrelated unit tests were updated because they failed to clean up after themselves.Checklist
Before you create the PR, please make sure you have:
Workflow
Please avoid rebasing and force-pushing to the branch of the PR once a review is in progress.
Rebasing can make your commits look a bit cleaner, but it also makes life more difficult from the reviewer, because they are no longer able to distinguish between code that has already been reviewed, and unreviewed code.