Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rewrite of s3.Reader class to protect S3 from open range headers (#725) #734

Open
wants to merge 2 commits into
base: develop
Choose a base branch
from

Conversation

davidparks21
Copy link

@davidparks21 davidparks21 commented Oct 19, 2022

Motivation

The smart_open.s3.Reader class left the HTTP range header open (e.g. bytes=0-). That was done to allow multiple (future) calls to read to stream bytes from the same HTTP body to avoid duplicating HTTP headers and connections. The open range header causes many S3 servers to internally move the full file, or a large block of the file, from its storage location to a server handling the S3 request. This was demonstrated on Ceph/S3 and SeaweedFS/S3, see issue #725. In the case of a small read against a large file this can quickly overwhelm the S3 filesystem.

Fixes #725.

Work Performed

The class smart_open.s3.Reader was re-written to ensure the HTTP range headers are always set reasonably to protect the S3 filesystem. The rewrite strikes a balance between optimizing for the open-read-close use case and the open-repeat-read-close use case. See the git issue #725 for a detailed discussion.

Tests

Existing unit tests for smart_open.s3 all pass. A few unit tests were removed for features that were eliminated and a few unit tests were added to cover new features. Some unrelated unit tests were updated because they failed to clean up after themselves.

Checklist

Before you create the PR, please make sure you have:

  • Picked a concise, informative and complete title
  • Clearly explained the motivation behind the PR
  • Linked to any existing issues that your PR will be solving
  • Included tests for any new functionality
  • Checked that all unit tests pass

Workflow

Please avoid rebasing and force-pushing to the branch of the PR once a review is in progress.
Rebasing can make your commits look a bit cleaner, but it also makes life more difficult from the reviewer, because they are no longer able to distinguish between code that has already been reviewed, and unreviewed code.

@piskvorky piskvorky added this to the 6.3.0 milestone Nov 19, 2022
Copy link
Collaborator

@mpenkov mpenkov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left you a ton of comments, please have a look. Thanks!

@@ -232,7 +233,7 @@ def open(
buffer_size=DEFAULT_BUFFER_SIZE,
min_part_size=DEFAULT_MIN_PART_SIZE,
multipart_upload=True,
defer_seek=False,
stream_range=10485760,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
stream_range=10485760,
stream_range=DEFAULT_STREAM_RANGE,

single HTTP request across multiple read calls, this is an important protection
for the S3 server, ensuring the S3 server doesn't get an open-ended byte-range request
which can cause it to internally queue up a massive file when only a small bit of it may ultimately be
read by the user. Note that the first read call (after opening or seeking) will always set the
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The note is an implementation detail. It doesn't belong in user documentation. Please remove it.

You've already mentioned this in a code comment below - that is sufficient.

smart_open/s3.py Outdated

def __str__(self):
return 'smart_open.s3._SeekableReader(%r, %r)' % (self._bucket, self._key)
# class _SeekableRawReader(object):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this commented out? If this class is no longer needed, please remove it.

client=None,
client_kwargs=None,
):
assert isinstance(stream_range, int), \
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please raise a ValueError instead.

self._line_terminator = line_terminator

# the max_stream_size setting limits how much data will be read in a single HTTP request, this is an
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean stream_range? I don't see max_stream_size anywhere.

if self._body is not None:
self._body.close()
self._body = self._stream_range_from = self._stream_range_to = None
self._eof = False if self._stream_range_from is None or self._position < self._content_length - 1 else True
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please rewrite this as a block of conditionals.

if self._stream_range_from is None or self._position < self._content_length - 1:
    self._eof = False
else:
    self._eof = True

stop = None if size == -1 else self._position + size
self._open_body(start=self._position, stop=stop)

#
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like copypasta from below, please refactor.

self._eof = True
# check if existing body range is sufficient to satisfy this request, if not close it so that it re-opens
# with an appropriate range.
is_stream_limit_before_eof = self._body is not None \
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use all() to simplify this expression.

is_stream_limit_before_eof = all(
    self._body is not None,
    size < 0,
    ...
)

this avoids the repetitive "and" and artificial line breaks.

# open the HTTP request if it's not open already
if self._body is None:
start = self._position + len(self._buffer)
stop = None if size < 0 else \
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please don't abuse the ternary operator :)

if size < 0:
    stop = None
else if self._content_length is None:
    stop = start + size - 1
else:
    stop = start + max(size, self._stream_range) - 1


# If we can figure out that we've read past the EOF, then we can save
# an extra API call.
reached_eof = True if self._content_length is not None and self._position >= self._content_length else False
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ternary operator.

Also, here's a question: why do we spend so much effort handling the case when the content length is None? I think we ever reach a point where we need it and don't have it, we should just query it.

What do you think?

@mpenkov
Copy link
Collaborator

mpenkov commented Nov 29, 2022

@piskvorky There's a ton of work left for this PR, my recommendation is to remove it from the milestone and deal with it after we release 6.3.0

@piskvorky piskvorky removed this from the 6.3.0 milestone Nov 29, 2022
@mpenkov mpenkov added the stale No recent activity from author label Feb 22, 2024
@ddelange
Copy link
Contributor

ddelange commented May 1, 2024

came across a nice read about S3 and range requests https://blog.limbus-medtec.com/the-aws-s3-denial-of-wallet-amplification-attack-bc5a97cc041d

@ddelange
Copy link
Contributor

ddelange commented Oct 4, 2024

fwiw, we routinely do a fp.read() (without size argument) on a ~1GB file and expect that to be a single GET request. If that would be split in 10MB reads internally by smart_open, that would increase the amount of GET requests (and costs) x100. Can the default behaviour of fp.read() remain unchanged, i.e. performing only one single get_object call?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
stale No recent activity from author
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Ranged read http header leaves TO bytes blank, causing S3 filesystems to read full file length
4 participants