
SSL Overflow Errors on distributed 2022.05 #6556

Closed
hhuuggoo opened this issue Jun 9, 2022 · 4 comments · Fixed by #6557
Labels
bug Something is broken

Comments

hhuuggoo (Contributor) commented Jun 9, 2022

What happened:
I tried to send a ridiculously large numpy array to a dask cluster. This resulted in an OverflowError.

2022-06-09 20:24:23,931 - distributed.batched - ERROR - Error in batched write
Traceback (most recent call last):
  File "/srv/conda/envs/saturn/lib/python3.9/site-packages/distributed/batched.py", line 94, in _background_send
    nbytes = yield self.comm.write(
  File "/srv/conda/envs/saturn/lib/python3.9/site-packages/tornado/gen.py", line 762, in run
    value = future.result()
  File "/srv/conda/envs/saturn/lib/python3.9/site-packages/distributed/comm/tcp.py", line 314, in write
    stream.write(b"")
  File "/srv/conda/envs/saturn/lib/python3.9/site-packages/tornado/iostream.py", line 544, in write
    self._handle_write()
  File "/srv/conda/envs/saturn/lib/python3.9/site-packages/tornado/iostream.py", line 1486, in _handle_write
    super()._handle_write()
  File "/srv/conda/envs/saturn/lib/python3.9/site-packages/tornado/iostream.py", line 971, in _handle_write
    num_bytes = self.write_to_fd(self._write_buffer.peek(size))
  File "/srv/conda/envs/saturn/lib/python3.9/site-packages/tornado/iostream.py", line 1568, in write_to_fd
    return self.socket.send(data)  # type: ignore
  File "/srv/conda/envs/saturn/lib/python3.9/ssl.py", line 1173, in send
    return self._sslobj.write(data)
OverflowError: string longer than 2147483647 bytes

What you expected to happen:

No errors.

Minimal Complete Verifiable Example:

import numpy as np
from dask.distributed import Client

arr = np.random.random(300_000_000)  # ~2.4 GB of float64 data
c = Client(...)  # assuming you have a cluster that supports TLS
print(c.scheduler)

def func(x):
    return x.sum()

fut = c.submit(func, arr)
print('submitted')
result = fut.result()
print('done')

Anything else we need to know?:

There was a related fix in #5141; however, that fix works by passing in frame_split_size, which gets passed down to distributed.protocol.core.dumps. frame_split_size is only applied when msgpack encounters a type it does not understand. The numpy array has already become a bytearray by the time it reaches dumps, so frame_split_size is never applied and we end up with a single 2.4 GB message.
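To illustrate what frame_split_size is meant to achieve, here is a minimal sketch of splitting one oversized frame into bounded chunks. The split_frame helper is hypothetical, not distributed's actual implementation; it just shows why splitting keeps every individual buffer below the SSL write limit.

```python
def split_frame(frame, frame_split_size=2**26):
    """Split one large frame into zero-copy chunks of at most
    frame_split_size bytes each.

    Sending many bounded chunks avoids handing a single multi-GB buffer
    to the SSL layer, which rejects writes longer than 2**31 - 1 bytes.
    """
    view = memoryview(frame)
    return [view[i:i + frame_split_size]
            for i in range(0, len(view), frame_split_size)]

# Example: a ~1 MB frame split into 256 KB chunks (4 full + 1 partial).
chunks = split_frame(bytearray(2**20 + 5), frame_split_size=2**18)
```

Because the chunks are memoryviews over the original buffer, the split itself copies no data.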

Tornado actually has a fix on master that resolves this issue; however, it is not in the latest tornado release, which is quite old.

https://github.com/tornadoweb/tornado/blob/master/tornado/iostream.py#L1565

When I monkey patch write_to_fd, the problem is resolved.
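A sketch of the kind of patch involved is below. It is shown against a stub socket rather than tornado's SSLIOStream so it is self-contained, and the limit is a parameter for demonstration purposes; the names are illustrative, not tornado's actual code. The idea is simply to cap each send() below the SSL limit and let the caller's write loop retry the remainder.

```python
_SSL_MAX_WRITE = (1 << 31) - 1  # ssl rejects single writes longer than this


class StubSSLSocket:
    """Stands in for an SSL-wrapped socket whose send() rejects
    oversized buffers, mimicking ssl.SSLObject.write."""

    def send(self, data):
        if len(data) > _SSL_MAX_WRITE:
            raise OverflowError("string longer than 2147483647 bytes")
        return len(data)  # pretend the whole buffer was written


def write_to_fd(sock, data, max_write=_SSL_MAX_WRITE):
    # Cap each write below the SSL limit; a write loop like tornado's
    # retries the remainder because send() returns the byte count written.
    view = memoryview(data)
    if len(view) > max_write:
        view = view[:max_write]
    try:
        return sock.send(view)
    finally:
        del view  # drop the reference to the (possibly huge) buffer promptly
```

With a lowered max_write you can see the capping in action: writing 10 bytes with max_write=4 sends only the first 4, and the caller would re-enter with the remaining 6.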

Environment:

  • Dask version: 2022.05
  • Python version: 3.9
  • Operating System: Linux
  • Install method (conda, pip, source): pip
mrocklin (Member) commented Jun 9, 2022

cc @jakirkham, I suspect, unless @hhuuggoo you're interested in taking this on.

hhuuggoo (Contributor, Author) commented Jun 9, 2022

> cc @jakirkham, I suspect, unless @hhuuggoo you're interested in taking this on.

happy to try, I would need some guidance though. Depending on the recommended approach I might be out of my depth.

mrocklin (Member) commented Jun 9, 2022 via email

jakirkham (Member) commented

@hhuuggoo, PR #5134 has the needed logic. I would suggest resurrecting that.

@fjetter fjetter added the bug Something is broken label Jun 10, 2022
4 participants