-
-
Notifications
You must be signed in to change notification settings - Fork 719
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Integer overflow in TCP comm #4538
Comments
might've been introduced by #4506 |
Have we confirmed it is due to that change? Does the previous commit work? |
I reverted the commit and didn't run into the issue anymore but I'm also not sure if my computation triggers this reliably. |
Ok thanks for that data point It might be worth sticking a |
Does this show up in other Python versions? |
Asking about other Python versions as 3.6 is the only one with SSLObject. It got removed with PR ( python/cpython#5252 ) in 3.8, which was also backported to 3.7. So would be good to know if this is just an oddity of 3.6 |
Just checking in here. @fjetter do you have insight into this question? |
As of now, I don't have any more insights. I will try to set up another test run today to see if I can reproduce it another time. I will also try another python version. Since this requires a bit of time on my end, is there anything else worth checking out in these test runs?
|
So far, I was not able to reproduce :/ |
Interesting. What things have been tried thus far? Yeah those are the main things I can think of. Are there other things that you think would be relevant to test? |
Just checking in here, @fjetter any luck on reproducing? I don't mean to apply unneeded pressure, but I'm wondering if this is a blocker for our upcoming release (xref dask/community#129) |
I'm not actively working on this due to limited time and my last efforts reproducing it was not successful. Considering that the exception is caused by a SSLObject which only exists in py3.6, I'm wondering if it is a reasonable assumption that this will only be an issue on py36? |
Should we go ahead and close now that Python 3.6 is dropped? |
Yeah, let's close for now and re-open is needed |
I just saw this:- distributed.core - ERROR - signed integer is greater than maximum
Traceback (most recent call last):
File "/env/lib/python3.8/site-packages/distributed/core.py", line 554, in handle_stream
msgs = await comm.read()
File "/env/lib/python3.8/site-packages/distributed/comm/tcp.py", line 199, in read
n = await stream.read_into(frames)
File "/env/lib/python3.8/site-packages/tornado/iostream.py", line 475, in read_into
self._try_inline_read()
File "/env/lib/python3.8/site-packages/tornado/iostream.py", line 842, in _try_inline_read
pos = self._read_to_buffer_loop()
File "/env/lib/python3.8/site-packages/tornado/iostream.py", line 755, in _read_to_buffer_loop
if self._read_to_buffer() == 0:
File "/env/lib/python3.8/site-packages/tornado/iostream.py", line 867, in _read_to_buffer
bytes_read = self.read_from_fd(buf)
File "/env/lib/python3.8/site-packages/tornado/iostream.py", line 1592, in read_from_fd
return self.socket.recv_into(buf, len(buf))
File "/usr/lib/python3.8/ssl.py", line 1241, in recv_into
return self.read(nbytes, buffer)
File "/usr/lib/python3.8/ssl.py", line 1099, in read
return self._sslobj.read(len, buffer)
OverflowError: signed integer is greater than maximum |
Definitely not 3.6. |
Do you have a reproducer? If so, can you please share this in a new issue? |
Not easy to reproduce in that is large scale satellite data processing via Open Data Cube on a Dask Gateway Kubernetes cluster, but can give some details. |
It looks like this is also happening in Python >= 3.8: |
At the end of that Python issue Christian notes this is fixed in Python 3.10. PR ( python/cpython#25468 ) seems to be the relevant change. Also he notes that this due to limitations of OpenSSL 1.0.2 and so remains an issue where that OpenSSL version is used. This is not an issue when OpenSSL 1.1.1 is used with the patch noted above Edit: Asked on the issue whether it would be possible to backport the fix (note this would still require package builders to use the newer OpenSSL) |
Seems to happen on the bigger/longer workloads I was doing- haven't seen it on small ones as yet. |
Yeah suspect that it is likely reproducible with a large message if that CPython bug is relevant. If not, maybe we are missing something |
Thinking about this a bit more, one option would be to take smaller views of the buffer we are reading into and fill those. We can ensure these buffers are big enough for OpenSSL to handle, but no bigger. After looking at the CPython code and OpenSSL, this appears to be the maximum value of a C |
Should we be adjusting our maximum shard size in comms under certain conditions? We already reduce this for different systems. For example in websockets we restrict things today to 8MiB frames. Maybe we should do the same for TLS under certain versions |
It's possible we will want to change the frame size if we discover this batching of smaller amounts of data through OpenSSL is causing us performance issues. That said, I haven't dived into that side of things. On a different note, it's interesting that OpenSSL uses an |
Supercedes dask#5134 Copying over the summary of that PR Works around the OpenSSL 1.0.2 bug demonstrated in issue ( dask#4538 ), except unlike PR ( dask#5115 ) which did this for reading, this does the same thing for writing. The error may be less likely to show up in the write path (as frames may simply be smaller than this limit). Still it seems like a good idea to protect against OverflowErrors from OpenSSL
Supercedes #5134 Copying over the summary of that PR Works around the OpenSSL 1.0.2 bug demonstrated in issue ( #4538 ), except unlike PR ( #5115 ) which did this for reading, this does the same thing for writing. The error may be less likely to show up in the write path (as frames may simply be smaller than this limit). Still it seems like a good idea to protect against OverflowErrors from OpenSSL
We are cutting a release tomorrow. These fixes should be in there ( dask/community#173 ) |
What happened:
I am receiving integer overflow errors in some large scale computations. Haven't had the chance to construct a MCVE but will try to.
cc @jakirkham I believe you reworked this section recently. Maybe you spot what's going wrong right away
What you expected to happen:
Minimal Complete Verifiable Example:
N/A
Anything else we need to know?:
Environment:
The text was updated successfully, but these errors were encountered: