Eager debug mode urllib3 error #988

Open
gbpdt opened this issue Sep 19, 2024 · 5 comments
Labels
bug Something isn't working Trn1

Comments

@gbpdt

gbpdt commented Sep 19, 2024

When I try to use eager debug mode, I receive the following error:

WARNING:torch_neuron:Eager debug mode is enabled. In this mode all operations would be executed eagerly. This will result in high execution times.
2024-09-19 11:56:33.000250:  2789  INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
2024-09-19 11:56:33.000958:  2789  INFO ||NEURON_CC_WRAPPER||: Call compiler client with cmd: --target=trn1 --framework=XLA /tmp/no-user/neuroncc_compile_workdir/3b831dea-72eb-4ca0-9587-f8b4292ba794/model.MODULE_2968843894416043412+d7517139.hlo_module.pb --output /tmp/no-user/neuroncc_compile_workdir/3b831dea-72eb-4ca0-9587-f8b4292ba794/model.MODULE_2968843894416043412+d7517139.neff -O0 --internal-tensorizer-opt-level=eager
WARNING:neuronxcc.cli.Client:Failed to connect to com.amazon.neuronxcc.2989. Request will be retried in 0.5s.
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/requests/adapters.py", line 633, in send
    conn = self.get_connection_with_tls_context(
  File "/usr/local/lib/python3.10/site-packages/requests/adapters.py", line 489, in get_connection_with_tls_context
    conn = self.poolmanager.connection_from_host(
  File "/usr/local/lib/python3.10/site-packages/urllib3/poolmanager.py", line 246, in connection_from_host
    return self.connection_from_context(request_context)
  File "/usr/local/lib/python3.10/site-packages/urllib3/poolmanager.py", line 258, in connection_from_context
    raise URLSchemeUnknown(scheme)
urllib3.exceptions.URLSchemeUnknown: Not supported URL scheme http+unix

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/neuronxcc/cli/Client.py", line 84, in run
    response = self.session.post(f"http+unix://\0{self.abstractSocket}/compile",
  File "/usr/local/lib/python3.10/site-packages/requests/sessions.py", line 637, in post
    return self.request("POST", url, data=data, json=json, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python3.10/site-packages/requests/sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/requests/adapters.py", line 637, in send
    raise InvalidURL(e, request=request)
requests.exceptions.InvalidURL: Not supported URL scheme http+unix

The following simple test code reproduces the issue when run with NEURON_USE_EAGER_DEBUG_MODE=1 set in the environment, using both Neuron 2.19.1 and 2.20.0 via the DLC container:

#!/usr/bin/env python3

import torch
import torch_xla.core.xla_model as xm

class MyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(in_features=1, out_features=1)

    def forward(self, x):
        return self.linear(x)


def main():
    model = MyModel()
    device = xm.xla_device()
    model.to(device)
    data = torch.rand(10).to(device)
    model(data)

if __name__ == "__main__":
    main()
@aws-taylor
Contributor

This is caused by a backwards-incompatible change in the Python requests library: as of 2.32.0, custom schemes such as http+unix and http+docker are no longer supported. The neuronx-cc package pins requests<2.30.0 as a dependency to avoid this issue. I suspect that your environment has overridden or ignored this constraint, causing the above problem. Downgrading requests should address the issue. You can read more about this change here: psf/requests#6707

@jyang-aws
Contributor

@gbpdt I agree with Taylor.
When I try your test case, it works for me with both release 2.20 and release 2.19. You may want to follow https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/index.html to make sure your setup is up to date.

Also, I needed to change the layer definition to
self.linear = torch.nn.Linear(in_features=10, out_features=1)
so that it matches the input dimensions (the test passes a tensor of 10 elements).
Please let us know if you have further questions.

@gbpdt
Author

gbpdt commented Sep 19, 2024

Hi, I don't actually have this problem in my own Python environment, which doesn't have the problematic version. I'm reporting that the problem exists in the Deep Learning Containers (the output above was from public.ecr.aws/neuron/pytorch-training-neuronx:2.1.2-neuronx-py310-sdk2.20.0-ubuntu20.04). Thanks.

@jyang-aws jyang-aws reopened this Sep 20, 2024
@jyang-aws
Contributor

I see. Thanks! Let me reopen this issue and we'll investigate and fix.

@jeffhataws
Contributor

The latest neuronx-cc (version 2.15.128.0+56dc5a86, from Neuron 2.20) actually pins requests to < 2.32.0:

│   │   └── neuronx-cc [required: ~=2.0, installed: 2.15.128.0+56dc5a86]
...
│   │       ├── requests [required: <2.32.0, installed: 2.31.0]
...
│   │       │   └── urllib3 [required: >=1.21.1,<3, installed: 1.26.19]
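To check the same pins without a dependency-tree tool, the declared requirements can be read from the installed package metadata. The following is a minimal sketch using only the standard library; the helper names `requirement_name` and `declared_pins` are made up for illustration, and the output depends on what is installed in your environment.

```python
import re
from importlib import metadata


def requirement_name(requirement):
    """Extract the package name from a requirement string such as 'requests<2.32.0'."""
    return re.split(r"[\s<>=!~;(\[]", requirement, maxsplit=1)[0]


def declared_pins(package, wanted=("requests", "urllib3")):
    """Return the requirement strings `package` declares for the `wanted` names.

    Returns None when `package` is not installed in this environment.
    """
    try:
        requirements = metadata.requires(package) or []
    except metadata.PackageNotFoundError:
        return None
    return [req for req in requirements if requirement_name(req).lower() in wanted]


if __name__ == "__main__":
    # neuronx-cc declares the requests pin; requests declares the urllib3 range.
    for package in ("neuronx-cc", "requests"):
        print(package, "->", declared_pins(package))
```

This only reports what each package declares. It does not tell you whether the installed versions actually satisfy those declarations.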

Even with a correct requests version (< 2.32), you may also see the following error if urllib3 is 2.0 or newer:

2024-08-18 20:13:52.000136:  72870  INFO ||NEURON_CC_WRAPPER||: Call compiler client with cmd: --target=trn1 --framework=XLA /tmp/ubuntu/neuroncc_compile_workdir/c84164f9-1a5d-441f-b20c-a47f8df7f7f9/model.MODULE_646047888363632588+d7517139.hlo_module.pb --output /tmp/ubuntu/neuroncc_compile_workdir/c84164f9-1a5d-441f-b20c-a47f8df7f7f9/model.MODULE_646047888363632588+d7517139.neff -O0 --internal-tensorizer-opt-level=eager
WARNING:neuronxcc.cli.Client:Failed to connect to com.amazon.neuronxcc.72960. Request will be retried in 0.5s.
Traceback (most recent call last):
  File "/home/ubuntu/aws_neuron_venv_pytorch/lib/python3.10/site-packages/neuronxcc/cli/Client.py", line 84, in run
    response = self.session.post(f"http+unix://\0{self.abstractSocket}/compile",
  File "/home/ubuntu/aws_neuron_venv_pytorch/lib/python3.10/site-packages/requests/sessions.py", line 637, in post
    return self.request("POST", url, data=data, json=json, **kwargs)
  File "/home/ubuntu/aws_neuron_venv_pytorch/lib/python3.10/site-packages/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/ubuntu/aws_neuron_venv_pytorch/lib/python3.10/site-packages/requests/sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
  File "/home/ubuntu/aws_neuron_venv_pytorch/lib/python3.10/site-packages/requests/adapters.py", line 486, in send
    resp = conn.urlopen(
  File "/home/ubuntu/aws_neuron_venv_pytorch/lib/python3.10/site-packages/urllib3/connectionpool.py", line 789, in urlopen
    response = self._make_request(
  File "/home/ubuntu/aws_neuron_venv_pytorch/lib/python3.10/site-packages/urllib3/connectionpool.py", line 495, in _make_request
    conn.request(
TypeError: HTTPConnection.request() got an unexpected keyword argument 'chunked'

The workaround for both "Not supported URL scheme http+unix" and "unexpected keyword argument 'chunked'" is to install older versions of both packages that satisfy the neuronx-cc dependency constraints:

pip install requests==2.31.0 urllib3==1.26.20
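After downgrading, it can help to confirm that the interpreter actually running the job picked up the expected versions. The sketch below checks the two conditions reported in this thread (requests below 2.32.0 and urllib3 below 2.0). The thresholds are taken from the comments above rather than from official documentation, and the function names are illustrative.

```python
import re
from importlib import metadata


def parse_version(text):
    """Turn '2.31.0' into (2, 31, 0), dropping any local suffix such as '+56dc5a86'."""
    parts = []
    for piece in text.split("+")[0].split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def is_compatible(requests_version, urllib3_version):
    """True when both versions fall in the range this thread reports as working."""
    return (parse_version(requests_version) < (2, 32)
            and parse_version(urllib3_version) < (2,))


if __name__ == "__main__":
    try:
        installed = (metadata.version("requests"), metadata.version("urllib3"))
    except metadata.PackageNotFoundError as exc:
        print(f"Package not installed: {exc}")
    else:
        status = "OK" if is_compatible(*installed) else "likely to hit the errors above"
        print(f"requests {installed[0]}, urllib3 {installed[1]}: {status}")
```

Run it with the same Python that launches the training script (for example, inside the container), since a different virtual environment may report different versions.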

@jyang-aws jyang-aws added Trn1 bug Something isn't working labels Sep 23, 2024