Rewrite of s3.Reader class to protect S3 from open range headers (#725) #734

davidparks21 · 2022-10-19T20:01:44Z

Motivation

The smart_open.s3.Reader class left the HTTP range header open (e.g. bytes=0-). That was done to allow multiple (future) calls to read to stream bytes from the same HTTP body to avoid duplicating HTTP headers and connections. The open range header causes many S3 servers to internally move the full file, or a large block of the file, from its storage location to a server handling the S3 request. This was demonstrated on Ceph/S3 and SeaweedFS/S3, see issue #725. In the case of a small read against a large file this can quickly overwhelm the S3 filesystem.

Fixes #725.

Work Performed

The class smart_open.s3.Reader was re-written to ensure the HTTP range headers are always set reasonably to protect the S3 filesystem. The rewrite strikes a balance between optimizing for the open-read-close use case and the open-repeat-read-close use case. See the git issue #725 for a detailed discussion.

Tests

Existing unit tests for smart_open.s3 all pass. A few unit tests were removed for features that were eliminated and a few unit tests were added to cover new features. Some unrelated unit tests were updated because they failed to clean up after themselves.

Checklist

Before you create the PR, please make sure you have:

Picked a concise, informative and complete title
Clearly explained the motivation behind the PR
Linked to any existing issues that your PR will be solving
Included tests for any new functionality
Checked that all unit tests pass

Workflow

Please avoid rebasing and force-pushing to the branch of the PR once a review is in progress.
Rebasing can make your commits look a bit cleaner, but it also makes life more difficult from the reviewer, because they are no longer able to distinguish between code that has already been reviewed, and unreviewed code.

…ers (piskvorky#725)

mpenkov

Left you a ton of comments, please have a look. Thanks!

mpenkov · 2022-11-26T08:56:59Z

smart_open/s3.py

@@ -232,7 +233,7 @@ def open(
    buffer_size=DEFAULT_BUFFER_SIZE,
    min_part_size=DEFAULT_MIN_PART_SIZE,
    multipart_upload=True,
-    defer_seek=False,
+    stream_range=10485760,


Suggested change

stream_range=10485760,

stream_range=DEFAULT_STREAM_RANGE,

mpenkov · 2022-11-26T08:58:18Z

smart_open/s3.py

+        single HTTP request across multiple read calls, this is an important protection
+        for the S3 server, ensuring the S3 server doesn't get an open-ended byte-range request
+        which can cause it to internally queue up a massive file when only a small bit of it may ultimately be
+        read by the user. Note that the first read call (after opening or seeking) will always set the


The note is an implementation detail. It doesn't belong in user documentation. Please remove it.

You've already mentioned this in a code comment below - that is sufficient.

mpenkov · 2022-11-26T08:59:04Z

smart_open/s3.py

-
-    def __str__(self):
-        return 'smart_open.s3._SeekableReader(%r, %r)' % (self._bucket, self._key)
+# class _SeekableRawReader(object):


Why is this commented out? If this class is no longer needed, please remove it.

mpenkov · 2022-11-26T08:59:56Z

smart_open/s3.py

        client=None,
        client_kwargs=None,
    ):
+        assert isinstance(stream_range, int), \


Please raise a ValueError instead.

mpenkov · 2022-11-26T09:02:54Z

smart_open/s3.py

        self._line_terminator = line_terminator

+        # the max_stream_size setting limits how much data will be read in a single HTTP request, this is an


Do you mean stream_range? I don't see max_stream_size anywhere.

mpenkov · 2022-11-26T09:18:08Z

smart_open/s3.py

+        if self._body is not None:
+            self._body.close()
+            self._body = self._stream_range_from = self._stream_range_to = None
+        self._eof = False if self._stream_range_from is None or self._position < self._content_length - 1 else True


Please rewrite this as a block of conditionals.

if self._stream_range_from is None or self._position < self._content_length - 1: self._eof = False else: self._eof = True

mpenkov · 2022-11-26T09:18:34Z

smart_open/s3.py

+            stop = None if size == -1 else self._position + size
+            self._open_body(start=self._position, stop=stop)
+
+        #


This looks like copypasta from below, please refactor.

mpenkov · 2022-11-26T09:20:53Z

smart_open/s3.py

-                self._eof = True
+        # check if existing body range is sufficient to satisfy this request, if not close it so that it re-opens
+        # with an appropriate range.
+        is_stream_limit_before_eof = self._body is not None \


Please use all() to simplify this expression.

is_stream_limit_before_eof = all( self._body is not None, size < 0, ... )

this avoids the repetitive "and" and artificial line breaks.

mpenkov · 2022-11-26T09:23:00Z

smart_open/s3.py

+        # open the HTTP request if it's not open already
+        if self._body is None:
+            start = self._position + len(self._buffer)
+            stop = None if size < 0 else \


Please don't abuse the ternary operator :)

if size < 0: stop = None else if self._content_length is None: stop = start + size - 1 else: stop = start + max(size, self._stream_range) - 1

mpenkov · 2022-11-26T09:24:32Z

smart_open/s3.py

+
+        # If we can figure out that we've read past the EOF, then we can save
+        # an extra API call.
+        reached_eof = True if self._content_length is not None and self._position >= self._content_length else False


Ternary operator.

Also, here's a question: why do we spend so much effort handling the case when the content length is None? I think we ever reach a point where we need it and don't have it, we should just query it.

What do you think?

mpenkov · 2022-11-29T07:41:27Z

@piskvorky There's a ton of work left for this PR, my recommendation is to remove it from the milestone and deal with it after we release 6.3.0

…3 implementations.

ddelange · 2024-05-01T19:05:29Z

came across a nice read about S3 and range requests https://blog.limbus-medtec.com/the-aws-s3-denial-of-wallet-amplification-attack-bc5a97cc041d

ddelange · 2024-10-04T16:35:29Z

fwiw, we routinely do a fp.read() (without size argument) on a ~1GB file and expect that to be a single GET request. If that would be split in 10MB reads internally by smart_open, that would increase the amount of GET requests (and costs) x100. Can the default behaviour of fp.read() remain unchanged, i.e. performing only one single get_object call?

Rewrite of s3.Reader class to protect S3 servers from open range head…

5a321d0

…ers (piskvorky#725)

piskvorky added this to the 6.3.0 milestone Nov 19, 2022

mpenkov requested changes Nov 26, 2022

View reviewed changes

piskvorky removed this from the 6.3.0 milestone Nov 29, 2022

Corrected for fact that ActualObjectSize may not be provided by all S…

aafcdcf

…3 implementations.

mpenkov added the stale No recent activity from author label Feb 22, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Rewrite of s3.Reader class to protect S3 from open range headers (#725) #734

Rewrite of s3.Reader class to protect S3 from open range headers (#725) #734

davidparks21 commented Oct 19, 2022 •

edited by piskvorky

Loading

mpenkov left a comment

mpenkov Nov 26, 2022

mpenkov Nov 26, 2022

mpenkov Nov 26, 2022

mpenkov Nov 26, 2022

mpenkov Nov 26, 2022

mpenkov Nov 26, 2022

mpenkov Nov 26, 2022

mpenkov Nov 26, 2022

mpenkov Nov 26, 2022

mpenkov Nov 26, 2022

mpenkov commented Nov 29, 2022

ddelange commented May 1, 2024

ddelange commented Oct 4, 2024

		self._line_terminator = line_terminator

		# the max_stream_size setting limits how much data will be read in a single HTTP request, this is an

Rewrite of s3.Reader class to protect S3 from open range headers (#725) #734

Are you sure you want to change the base?

Rewrite of s3.Reader class to protect S3 from open range headers (#725) #734

Conversation

davidparks21 commented Oct 19, 2022 • edited by piskvorky Loading

Motivation

Work Performed

Tests

Checklist

Workflow

mpenkov left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mpenkov commented Nov 29, 2022

ddelange commented May 1, 2024

ddelange commented Oct 4, 2024

davidparks21 commented Oct 19, 2022 •

edited by piskvorky

Loading