
Distributed package doesn't have NCCL built in #50

Closed
alescire94 opened this issue Mar 2, 2023 · 26 comments

Labels
compatibility (issues arising from specific hardware or system configs), documentation (Improvements or additions to documentation)

Comments

@alescire94 commented Mar 2, 2023

Got the following error when executing:
torchrun --nproc_per_node 1 example.py --ckpt_dir models/7B --tokenizer_path models/tokenizer.model

[screenshot of the error]

Additional info:
CUDA: 11.4
GPU: NVIDIA GeForce RTX 3090
torch: 1.12.1
Ubuntu 20.04.2 LTS

Does anyone know how to solve it?
Thanks in advance!

@sz85512678

Have the same issue, running on a MacBook Pro.

@christophelebrun

Same, running on MacBook Pro M1.

@Amadeus-AI

Try changing

torch.distributed.init_process_group("nccl")

to

torch.distributed.init_process_group("gloo")
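
For a version that works in both setups (my own sketch, not from the thread), you can pick the backend based on whether CUDA is actually available:

import torch
import torch.distributed as dist

# NCCL requires CUDA; gloo runs on the CPU, so fall back to it otherwise.
backend = "nccl" if torch.cuda.is_available() else "gloo"
dist.init_process_group(backend)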

@Elvincth commented Mar 2, 2023

Try changing

torch.distributed.init_process_group("nccl")

to

torch.distributed.init_process_group("gloo")

Running on a MacBook Pro M1; after changing to this, I got a new error: AttributeError: module 'torch._C' has no attribute '_cuda_setDevice'
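
That error comes from calling torch.cuda.set_device() in a CPU-only build of PyTorch. A minimal guard (my own sketch, assuming local_rank is already defined as in example.py) avoids it:

import torch

# Only touch the CUDA device selector when CUDA actually exists in this build.
if torch.cuda.is_available():
    torch.cuda.set_device(local_rank)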

@AmirFone commented Mar 2, 2023

Running on a MacBook Pro M1; after changing to this, I got a new error: AttributeError: module 'torch._C' has no attribute '_cuda_setDevice'

I'm getting the same error. Have you been able to resolve it?

@b0kch01 commented Mar 2, 2023

Running on a MacBook Pro M1; after changing to this, I got a new error: AttributeError: module 'torch._C' has no attribute '_cuda_setDevice'

I'm getting the same error. Have you been able to resolve it?

To run on M1, you have to go through the repository and modify every line that references CUDA to use the CPU instead. Even then, the sample prompt took over an hour to run for me on the smallest LLaMA model.

@jaygdesai commented Mar 2, 2023

Running on a MacBook Pro M1; after changing to this, I got a new error: AttributeError: module 'torch._C' has no attribute '_cuda_setDevice'

I'm getting the same error. Have you been able to resolve it?

To run on M1, you have to go through the repository and modify every line that references CUDA to use the CPU instead. Even then, the sample prompt took over an hour to run for me on the smallest LLaMA model.

I am trying to do the same. Can you explain exactly how you achieved this? Also, if you could share your example.py, that would be great.

@b0kch01 commented Mar 2, 2023

Make the following changes if you want to use the CPU:

example.py

# torch.distributed.init_process_group("nccl")  # removed: GPUs absent or not set up properly
torch.distributed.init_process_group("gloo")  # gloo backend runs on the CPU
# torch.cuda.set_device(local_rank)  # removed for the same reason


# torch.set_default_tensor_type(torch.cuda.HalfTensor)
torch.set_default_tensor_type(torch.FloatTensor)  # full-precision CPU tensors

generation.py

# tokens = torch.full((bsz, total_len), self.tokenizer.pad_id).cuda().long()
tokens = torch.full((bsz, total_len), self.tokenizer.pad_id).long()

model.py

self.cache_k = torch.zeros(
    (args.max_batch_size, args.max_seq_len, self.n_local_heads, self.head_dim)
)#.cuda()
self.cache_v = torch.zeros(
    (args.max_batch_size, args.max_seq_len, self.n_local_heads, self.head_dim)
)#.cuda()
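
One more place worth checking (an assumption on my part; some versions of example.py already do this): make sure the checkpoint tensors themselves are loaded onto the CPU rather than the GPU:

# Load the shard onto the CPU regardless of how it was saved.
checkpoint = torch.load(ckpt_path, map_location="cpu")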

@AmirFone commented Mar 2, 2023

Thank you! @b0kch01

@jaygdesai

I am now stuck on the following issue after following the instructions provided by @b0kch01:

[screenshot of the error]

Any guidance?

@paulutsch commented Mar 2, 2023

[screenshot of the error]

After following @b0kch01's advice, I'm stuck here. Can anyone help out?

@b0kch01 commented Mar 3, 2023

After following @b0kch01's advice, I'm stuck here. Can anyone help out?

When the log says MP=0, it means it cannot find any of the weights in the path you provided.
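
To see why: as I read example.py, the loader derives MP from how many *.pth shards it finds under --ckpt_dir, so an empty or wrong path yields MP=0. A rough sketch of that logic (assuming ckpt_dir holds the --ckpt_dir argument):

from pathlib import Path

# One consolidated.XX.pth shard per model-parallel rank.
checkpoints = sorted(Path(ckpt_dir).glob("*.pth"))
print(f"Found MP={len(checkpoints)} checkpoints")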

@Urammar commented Mar 3, 2023

Those errors both typically indicate that you haven't actually pointed at either the model or the tokenizer. I noticed the tokenizer doesn't seem to get downloaded with the 7B.

I'll upload it here to save you the trouble. This should be extracted to whatever directory you set your download to (one level above your 7B model directory itself), or you can just point to it manually on the command line:

tokenizer.zip

@paulutsch commented Mar 3, 2023

Those errors both typically indicate that you haven't actually pointed at either the model or the tokenizer. I noticed the tokenizer doesn't seem to get downloaded with the 7B.

I'll upload it here to save you the trouble. This should be extracted to whatever directory you set your download to (one level above your 7B model directory itself), or you can just point to it manually on the command line:

tokenizer.zip

Thanks! Apparently, download.sh needs to be edited to work on macOS. Luckily, this guy made it work: https://github.com/facebookresearch/llama/pull/39/files

I implemented the changes and am now downloading the weights.

@mawilson1234

I'm getting stuck in a different way after following @b0kch01's changes. I'm not running into the MP=0 issue, but I get some warnings about being unable to initialize the client and server sockets. Otherwise it looks similar to @jaygdesai's issue, but with exitcode 7 instead of 9.

[screenshot of the error]

@b0kch01 commented Mar 3, 2023

If you'd like, you can try my llama-cpu fork. Tested to work on my MacBook Pro M1 Max.

@turbo commented Mar 3, 2023

Just in case someone's going to ask for MPS (M1/M2 GPU support): the code uses view_as_complex, which is not supported on MPS and has no PYTORCH_ENABLE_MPS_FALLBACK path due to memory-sharing issues. Even modifying the code to use MPS does not enable GPU support on Apple Silicon until pytorch/pytorch#77764 is fixed. So it's CPU only for now, but @b0kch01's version works nicely 🙂
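
If you want to check your own machine, a quick probe (my own sketch) shows whether the MPS backend is built into your wheel and usable at runtime:

import torch

# is_built() covers the wheel; is_available() covers the runtime/OS.
print(torch.backends.mps.is_built())
print(torch.backends.mps.is_available())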

@mawilson1234 commented Mar 3, 2023

Unfortunately, I'm still unable to run anything using @b0kch01's llama-cpu repo on Linux.

[W socket.cpp:426] [c10d] The server socket cannot be initialized on [::]:29500 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:601] [c10d] The client socket cannot be initialized to connect to [localhost]:29500 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:601] [c10d] The client socket cannot be initialized to connect to [localhost]:29500 (errno: 97 - Address family not supported by protocol).
Locating checkpoints
Found MP=1 checkpoints
Creating checkpoint instance...
Grabbing params...
Loading model arguments...
Creating tokenizer...
Creating transformer...
-- Creating embedding
-- Creating transformer blocks (32)

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 0 (pid: 13822) of binary: /gpfs/gibbs/project/frank/maw244/conda_envs/llama/bin/python3.10
Traceback (most recent call last):
	File "/gpfs/gibbs/project/frank/maw244/conda_envs/llama/bin/torchrun", line 8, in <module>
		sys.exit(main())
	File "/gpfs/gibbs/project/frank/maw244/conda_envs/llama/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
		return f(*args, **kwargs)
	File "/gpfs/gibbs/project/frank/maw244/conda_envs/llama/lib/python3.10/site-packages/torch/distributed/run.py", line 762, in main
		run(args)
	File "/gpfs/gibbs/project/frank/maw244/conda_envs/llama/lib/python3.10/site-packages/torch/distributed/run.py", line 753, in run
		elastic_launch(
	File "/gpfs/gibbs/project/frank/maw244/conda_envs/llama/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
		return launch_agent(self._config, self._entrypoint, list(args))
	File "/gpfs/gibbs/project/frank/maw244/conda_envs/llama/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
		raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
======================================================
example.py FAILED
------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------
Root Cause (first observed failure):
[0]:
time      : 2023-03-03_17:31:31
host      : c14n02.grace.hpc.yale.internal
rank      : 0 (local_rank: 0)
exitcode  : -9 (pid: 13822)
error_file: <N/A>
traceback : Signal 9 (SIGKILL) received by PID 13822
======================================================

Here's the version on the cluster:

LSB Version:    :core-4.1-amd64:core-4.1-ia32:core-4.1-noarch:cxx-4.1-amd64:cxx-4.1-ia32:cxx-4.1-noarch:desktop-4.1-amd64:desktop-4.1-ia32:desktop-4.1-noarch:languages-4.1-amd64:languages-4.1-noarch:printing-4.1-amd64:printing-4.1-noarch
Distributor ID: RedHatEnterpriseServer
Description:    Red Hat Enterprise Linux Server release 7.9 (Maipo)
Release:        7.9
Codename:       Maipo

Edit:

Nevermind, it was an issue with the memory available on the interactive node. Got it working by adjusting the batch size and seq len and submitting it as a job. Thank you, @b0kch01!
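
For anyone hitting the same SIGKILL (exitcode -9 is usually the OOM killer): the knobs in question are the defaults in example.py's main(). Something like this (values illustrative; signature based on the stock example.py as I recall it):

def main(
    ckpt_dir: str,
    tokenizer_path: str,
    temperature: float = 0.8,
    top_p: float = 0.95,
    max_seq_len: int = 256,    # smaller sequence cache = less memory
    max_batch_size: int = 1,   # one prompt at a time
):
    ...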

@jiyaos commented Mar 4, 2023

[screenshot of the error]

Hi, I'm stuck here, following @b0kch01's llama-cpu repo on my mac. Any suggestion? Thanks.

@paulutsch

[screenshot of the error]

Hi, I'm stuck here, following @b0kch01's llama-cpu repo on my mac. Any suggestion? Thanks.

I'm facing the same issue, but only with the 7B model. It seems like the consolidated.01.pth file doesn't get downloaded when running download.sh: it's the only file for which I get a 403 status instead of 200.

I opened a new issue here.

@jiyaos commented Mar 5, 2023

Thank you @Urammar

@astelmach01 commented Jul 19, 2023

Running on a MacBook Pro M1; after changing to this, I got a new error: AttributeError: module 'torch._C' has no attribute '_cuda_setDevice'

I'm getting the same error. Have you been able to resolve it?

To run on M1, you have to go through the repository and modify every line that references CUDA to use the CPU instead. Even then, the sample prompt took over an hour to run for me on the smallest LLaMA model.

Doesn't PyTorch support the Apple Silicon GPU, so it wouldn't be as slow as a snail?

@piex-1 commented Jul 26, 2023

When I was using my own Jetson AGX Orin developer kit, I also had this error. I checked online and found that Jetson does not seem to support NCCL. Is this normal? I don't want to run on the CPU.

[screenshot of the error]
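
If the thread is right that Jetson builds ship without NCCL, you can confirm it directly and pick a backend accordingly (a quick check, my own sketch):

import torch.distributed as dist

print(dist.is_nccl_available())  # False on builds compiled without NCCL
print(dist.is_gloo_available())  # gloo still works for single-device runs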

@aragon5956

Try changing

torch.distributed.init_process_group("nccl")

to

torch.distributed.init_process_group("gloo")

Where?

@albertodepaola added the documentation and compatibility labels Sep 6, 2023
@albertodepaola

Closing as the author is inactive. If anyone has further questions, feel free to open a new issue. For future reference, check both the llama and llama-recipes repos for getting-started guides.

@hua3721 commented Apr 7, 2024

This solution worked for me: #947
