Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Integer overflow in TCP comm #4538

Closed
fjetter opened this issue Feb 23, 2021 · 27 comments · Fixed by #5115
Closed

Integer overflow in TCP comm #4538

fjetter opened this issue Feb 23, 2021 · 27 comments · Fixed by #5115

Comments

@fjetter
Copy link
Member

fjetter commented Feb 23, 2021

What happened:

I am receiving integer overflow errors in some large scale computations. Haven't had the chance to construct a MCVE but will try to.

cc @jakirkham I believe you reworked this section recently. Maybe you spot what's going wrong right away

What you expected to happen:

Minimal Complete Verifiable Example:

N/A

Anything else we need to know?:

Traceback (most recent call last):
  File "/mnt/mesos/sandbox/venv/lib/python3.6/site-packages/distributed/core.py", line 554, in handle_stream
    msgs = await comm.read()
  File "/mnt/mesos/sandbox/venv/lib/python3.6/site-packages/distributed/comm/tcp.py", line 199, in read
    n = await stream.read_into(frames)
  File "/mnt/mesos/sandbox/venv/lib/python3.6/site-packages/tornado/iostream.py", line 475, in read_into
    self._try_inline_read()
  File "/mnt/mesos/sandbox/venv/lib/python3.6/site-packages/tornado/iostream.py", line 842, in _try_inline_read
    pos = self._read_to_buffer_loop()
  File "/mnt/mesos/sandbox/venv/lib/python3.6/site-packages/tornado/iostream.py", line 755, in _read_to_buffer_loop
    if self._read_to_buffer() == 0:
  File "/mnt/mesos/sandbox/venv/lib/python3.6/site-packages/tornado/iostream.py", line 867, in _read_to_buffer
    bytes_read = self.read_from_fd(buf)
  File "/mnt/mesos/sandbox/venv/lib/python3.6/site-packages/tornado/iostream.py", line 1592, in read_from_fd
    return self.socket.recv_into(buf, len(buf))
  File "/usr/lib/python3.6/ssl.py", line 1009, in recv_into
    return self.read(nbytes, buffer)
  File "/usr/lib/python3.6/ssl.py", line 871, in read
    return self._sslobj.read(len, buffer)
  File "/usr/lib/python3.6/ssl.py", line 631, in read
    v = self._sslobj.read(len, buffer)
OverflowError: signed integer is greater than maximum (distributed.core)

Environment:

  • Dask version: dask/dask@91cee5b
  • tornado version: 6.1
  • Python version: py3.6
  • Operating System: Deb9
  • Install method (conda, pip, source): pip
@fjetter
Copy link
Member Author

fjetter commented Feb 23, 2021

might've been introduced by #4506

@jakirkham
Copy link
Member

Have we confirmed it is due to that change? Does the previous commit work?

@fjetter
Copy link
Member Author

fjetter commented Feb 23, 2021

I reverted the commit and didn't run into the issue anymore but I'm also not sure if my computation triggers this reliably.

@jakirkham
Copy link
Member

Ok thanks for that data point

It might be worth sticking a print(frames_nbytes) in read to see how large the message is when things fail

@jakirkham
Copy link
Member

Does this show up in other Python versions?

@jakirkham
Copy link
Member

Asking about other Python versions as 3.6 is the only one with SSLObject. It got removed with PR ( python/cpython#5252 ) in 3.8, which was also backported to 3.7. So would be good to know if this is just an oddity of 3.6

@jakirkham
Copy link
Member

Does this show up in other Python versions?

Just checking in here. @fjetter do you have insight into this question?

@fjetter
Copy link
Member Author

fjetter commented Feb 25, 2021

As of now, I don't have any more insights. I will try to set up another test run today to see if I can reproduce it another time. I will also try another python version. Since this requires a bit of time on my end, is there anything else worth checking out in these test runs?

  • print statement for the frames_nbytes
  • python version
    anything else worthy logging/inspecting?

@fjetter
Copy link
Member Author

fjetter commented Feb 25, 2021

So far, I was not able to reproduce :/

@jakirkham
Copy link
Member

Interesting. What things have been tried thus far?

Yeah those are the main things I can think of. Are there other things that you think would be relevant to test?

@jrbourbeau
Copy link
Member

Just checking in here, @fjetter any luck on reproducing? I don't mean to apply unneeded pressure, but I'm wondering if this is a blocker for our upcoming release (xref dask/community#129)

@fjetter
Copy link
Member Author

fjetter commented Mar 4, 2021

I'm not actively working on this due to limited time and my last efforts reproducing it was not successful. Considering that the exception is caused by a SSLObject which only exists in py3.6, I'm wondering if it is a reasonable assumption that this will only be an issue on py36?
My gut feeling tells me that a very difficult to reproduce, yet to be confirmed issue which likely only triggers on a python version we're about to drop should not block a release.

@jakirkham
Copy link
Member

Should we go ahead and close now that Python 3.6 is dropped?

@jrbourbeau
Copy link
Member

Yeah, let's close for now and re-open is needed

@RichardScottOZ
Copy link

I just saw this:-

distributed.core - ERROR - signed integer is greater than maximum
Traceback (most recent call last):
  File "/env/lib/python3.8/site-packages/distributed/core.py", line 554, in handle_stream
    msgs = await comm.read()
  File "/env/lib/python3.8/site-packages/distributed/comm/tcp.py", line 199, in read
    n = await stream.read_into(frames)
  File "/env/lib/python3.8/site-packages/tornado/iostream.py", line 475, in read_into
    self._try_inline_read()
  File "/env/lib/python3.8/site-packages/tornado/iostream.py", line 842, in _try_inline_read
    pos = self._read_to_buffer_loop()
  File "/env/lib/python3.8/site-packages/tornado/iostream.py", line 755, in _read_to_buffer_loop
    if self._read_to_buffer() == 0:
  File "/env/lib/python3.8/site-packages/tornado/iostream.py", line 867, in _read_to_buffer
    bytes_read = self.read_from_fd(buf)
  File "/env/lib/python3.8/site-packages/tornado/iostream.py", line 1592, in read_from_fd
    return self.socket.recv_into(buf, len(buf))
  File "/usr/lib/python3.8/ssl.py", line 1241, in recv_into
    return self.read(nbytes, buffer)
  File "/usr/lib/python3.8/ssl.py", line 1099, in read
    return self._sslobj.read(len, buffer)
OverflowError: signed integer is greater than maximum

@RichardScottOZ
Copy link

RichardScottOZ commented May 21, 2021

Definitely not 3.6.

@jakirkham
Copy link
Member

Do you have a reproducer? If so, can you please share this in a new issue?

@RichardScottOZ
Copy link

Not easy to reproduce in that is large scale satellite data processing via Open Data Cube on a Dask Gateway Kubernetes cluster, but can give some details.

@quasiben
Copy link
Member

It looks like this is also happening in Python >= 3.8:
https://bugs.python.org/issue42853

@jakirkham
Copy link
Member

jakirkham commented Jun 2, 2021

At the end of that Python issue Christian notes this is fixed in Python 3.10. PR ( python/cpython#25468 ) seems to be the relevant change.

Also he notes that this due to limitations of OpenSSL 1.0.2 and so remains an issue where that OpenSSL version is used. This is not an issue when OpenSSL 1.1.1 is used with the patch noted above

Edit: Asked on the issue whether it would be possible to backport the fix (note this would still require package builders to use the newer OpenSSL)

@RichardScottOZ
Copy link

Seems to happen on the bigger/longer workloads I was doing- haven't seen it on small ones as yet.

@jakirkham
Copy link
Member

Yeah suspect that it is likely reproducible with a large message if that CPython bug is relevant. If not, maybe we are missing something

@jakirkham
Copy link
Member

Thinking about this a bit more, one option would be to take smaller views of the buffer we are reading into and fill those. We can ensure these buffers are big enough for OpenSSL to handle, but no bigger. After looking at the CPython code and OpenSSL, this appears to be the maximum value of a C int, which will be implementation dependent. Have written some code in PR ( #5115 ), which should workaround this issue in this way. If anyone can reproduce, would be good to have some feedback on whether that works.

@mrocklin
Copy link
Member

Should we be adjusting our maximum shard size in comms under certain conditions? We already reduce this for different systems. For example in websockets we restrict things today to 8MiB frames. Maybe we should do the same for TLS under certain versions

@jakirkham
Copy link
Member

It's possible we will want to change the frame size if we discover this batching of smaller amounts of data through OpenSSL is causing us performance issues. That said, I haven't dived into that side of things.

On a different note, it's interesting that OpenSSL uses an int here at all instead of size_t, which is what OpenSSL 1.1.1+ does and is what one would use in C typically. Maybe this code is quite old and precedes size_t or they would have been nearly equivalent at the time? Or no one thought so much data would be given to OpenSSL? Or perhaps whatever algorithm was used couldn't handle size_t amounts of data previously?

@jakirkham
Copy link
Member

jakirkham commented Jul 28, 2021

As this error could show up in the write code path as well, added PR ( #5134 ) to protect against that too.

Edit: Superseded by PR ( #5141 )

mrocklin added a commit to mrocklin/distributed that referenced this issue Jul 29, 2021
Supercedes dask#5134

Copying over the summary of that PR

Works around the OpenSSL 1.0.2 bug demonstrated in issue ( dask#4538 ), except unlike PR ( dask#5115 ) which did this for reading, this does the same thing for writing. The error may be less likely to show up in the write path (as frames may simply be smaller than this limit). Still it seems like a good idea to protect against OverflowErrors from OpenSSL
jrbourbeau pushed a commit that referenced this issue Jul 29, 2021
Supercedes #5134

Copying over the summary of that PR

Works around the OpenSSL 1.0.2 bug demonstrated in issue ( #4538 ), except unlike PR ( #5115 ) which did this for reading, this does the same thing for writing. The error may be less likely to show up in the write path (as frames may simply be smaller than this limit). Still it seems like a good idea to protect against OverflowErrors from OpenSSL
@jakirkham
Copy link
Member

We are cutting a release tomorrow. These fixes should be in there ( dask/community#173 )

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants