Error building extension 'cpu_adam' #889
I am not sure what exactly is happening here: in the trace log I can see that it is trying to load the extension module cpu_adam, yet ds_report says it is not installed! I am thinking this might be a problem of the system caching this module somehow. Thanks.
Hi Reza, thanks for your reply! I tried removing it, and it did have the cached module in it. Funnily enough, I tried running this on Colab and it seemed to have loaded the extension there. Albeit, I did get another error on Colab.
Quick update: I ran the command from the notebook.
I was able to get the example from the notebook going after I downgraded the DeepSpeed version to 0.3.10. I do have a follow-up question, though: correct me if I'm wrong, but is the only way to use DeepSpeed to go through the HuggingFace Trainer?
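For context, the Trainer route being asked about looks roughly like the minimal sketch below (illustrative, not the actual notebook code; the model name and the `ds_config.json` file are assumptions):

```python
# Minimal sketch of DeepSpeed via the HuggingFace Trainer.
# Assumes a ds_config.json in the working directory and a script started
# with the deepspeed launcher, e.g.: deepspeed train.py
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model = AutoModelForSequenceClassification.from_pretrained("xlm-roberta-base")
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

# Tiny in-memory dataset so the sketch is self-contained.
enc = tokenizer(["hello world", "bonjour le monde"], truncation=True, padding=True)
train_dataset = [
    {"input_ids": enc["input_ids"][i],
     "attention_mask": enc["attention_mask"][i],
     "labels": i % 2}
    for i in range(2)
]

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=2,
    deepspeed="ds_config.json",  # this one argument turns DeepSpeed on
)
Trainer(model=model, args=args, train_dataset=train_dataset).train()
```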
We already have examples of running some transformer networks. For this argument, I think you can just add local_rank to your parser arguments, the same as here.
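For reference, that amounts to one extra parser argument (a minimal sketch; the surrounding script is assumed):

```python
# Sketch: the deepspeed launcher passes --local_rank to every process,
# so the training script's parser must accept it.
import argparse

parser = argparse.ArgumentParser(description="my training script")
parser.add_argument("--local_rank", type=int, default=-1,
                    help="local rank supplied by the deepspeed/distributed launcher")
args = parser.parse_args()
print(f"running with local_rank={args.local_rank}")
```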
So is this an issue with DeepSpeed 0.3.13? I'm facing the same issue with 0.3.13. Also, are you able to run HuggingFace 4.4.2 with DeepSpeed 0.3.10? I think you would have had to downgrade to HuggingFace 4.3.x.
Hi @arthur-morgan-712, could you try pre-building the deepspeed ops while installing (e.g. `DS_BUILD_OPS=1 pip install deepspeed`), as suggested here?
This is no longer needed in the latest version.
Not at all. You can do your own integration and not rely on the HF Trainer. If you have build problems, please make sure you read the troubleshooting notes. Perhaps try to pre-build deepspeed: #885 (comment)
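A bare-bones sketch of such a standalone integration (a toy model stands in for a real network; the config file comes via the standard `--deepspeed_config` flag):

```python
# Sketch: using DeepSpeed without the HF Trainer. Any nn.Module can be
# wrapped with deepspeed.initialize; the returned engine drives the loop.
import argparse
import torch
import deepspeed

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=-1)
parser = deepspeed.add_config_arguments(parser)  # adds --deepspeed_config etc.
args = parser.parse_args()

model = torch.nn.Linear(10, 2)  # stand-in for a real network

engine, optimizer, _, _ = deepspeed.initialize(
    args=args, model=model, model_parameters=model.parameters())

x = torch.randn(4, 10).to(engine.device)
y = torch.randint(0, 2, (4,)).to(engine.device)
loss = torch.nn.functional.cross_entropy(engine(x), y)
engine.backward(loss)  # DeepSpeed handles loss scaling / gradient allreduce
engine.step()
```

It would then be launched with something like `deepspeed train.py --deepspeed_config ds_config.json`.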
Thanks @stas00 for clarifying this :)
And the error is right there in your report: #889 (comment)
You're missing the right build tools.
@RezaYazdaniAminabadi, perhaps this could be checked for and reported more clearly.
Still running into this on CUDA 10.2. I tried downgrading deepspeed to 0.3.10 and ran into another error. Any other potential solutions to date, or is this still open?
I'm also running into this.
This sounds like a permission issue. Try to set, e.g., a different, writable extensions directory.
That won't work, because deepspeed hardcodes the default extension path to be `/tmp/torch_extensions`. The default is not used if `TORCH_EXTENSIONS_DIR` is set in the environment, but it would certainly be an improvement for it to follow PyTorch's own per-user default.
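In the meantime, the override can also be set from inside a notebook, as long as it happens before deepspeed triggers the JIT build (a sketch; the path is just an example of a user-writable location):

```python
# Sketch: redirect the torch extension build cache to a writable directory.
# The variable must be set before deepspeed JIT-builds the op (which
# typically happens at deepspeed.initialize time).
import os

os.environ["TORCH_EXTENSIONS_DIR"] = os.path.expanduser("~/.cache/torch_extensions")

import deepspeed  # noqa: E402  (later op builds land under the path above)
```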
@stas00 @RezaYazdaniAminabadi

The above exception was the direct cause of the following exception:
Traceback (most recent call last):
@arthur-morgan-712, you have a problem with your cuda environment:
Properly install the CUDA environment, including all the dev header files, and do the run again; it'll work. On Ubuntu I usually recommend the NVIDIA pre-packaged CUDA .deb files, but be careful: the latest CUDA is already at 12.x, so make sure you're installing the same major version as your PyTorch, most likely CUDA 11.x (I think 11.8 is the latest in that line). E.g., on my system the missing file is under the CUDA toolkit's include directory.
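A quick way to sanity-check this from Python (a sketch, not part of the original comment):

```python
# Sketch: verify that PyTorch can locate a CUDA toolkit for JIT builds
# and that the dev headers are actually present.
import os
from torch.utils.cpp_extension import CUDA_HOME  # toolkit path torch builds against

print("CUDA_HOME:", CUDA_HOME)  # None means no toolkit found; builds will fail
if CUDA_HOME is not None:
    header = os.path.join(CUDA_HOME, "include", "cuda_runtime.h")
    print("dev headers present:", os.path.exists(header))
```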
In my case, this bug was due to a ninja compile error. You can change directory to the extension build folder (e.g. under `/tmp/torch_extensions`) and run `ninja -v` there to see the actual compile error.
Closing this issue as the original issue was resolved. If anyone is having issues with this, please open a new issue and link this one and we would be happy to take a look. |
Hi @chiang,
It can't find your CUDA install. If you have it installed, search for where `libcurand.so` is; e.g., on my machine it's under `/usr/local/cuda/lib64`. But most likely you don't have CUDA installed. You can install it system-wide or into your conda environment. I haven't tried it, but this is probably the right package if you want to install it into your conda env, and it has all the latest versions as well: https://anaconda.org/nvidia/cuda-libraries. Use the same version as your PyTorch; to check which CUDA version your PyTorch build uses, run:
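Presumably the usual check:

```python
# The standard way to see which CUDA version this PyTorch build uses
# (the exact snippet was stripped from the comment; this is an assumption).
import torch
print(torch.version.cuda)  # e.g. "10.2"; install a toolkit matching this
```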
Hi @stas00,
@stas00, please see the details below.
I exported LD_LIBRARY_PATH first and ran `ninja -v` under the folder in question, but there was no change in the linkage error. Why?
I see you have used conda. Did you check that you have `libcurand.so`, and which conda package did you install?
I linked `libcurand.so` into the folder and there was still no change. And I found it works fine if I add a link dir, `-L/usr/local/cuda/lib64`, to the `ldflags` in the `build.ninja` file. It seems like deepspeed generates a wrong ninja config, which causes the build to fail. I think this is a bug and requires a fix.
@stas00, for the details, please refer to the 5813 issue I mentioned above.
In which case you need to set `LD_LIBRARY_PATH` so the CUDA libraries can be found, e.g.: `export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH`
To automate this, see: https://askubuntu.com/questions/210884/setting-ld-library-path-for-cuda. The other solution that often helps is to set
Nope. The env var LD_LIBRARY_PATH takes effect at runtime, so it is only used when the program is loaded, and for now cpu_adam fails at compile time. So LD_LIBRARY_PATH provides no help. I manually modify the `build.ninja` file instead.
What I shared works for me. That's what I use on my desktop to build deepspeed. |
Hey guys, I'm having a problem getting DeepSpeed working with XLM-RoBERTa. I'm trying to run it on an Amazon Linux machine, which is based on Red Hat. Here are some versions of the packages/dependencies I'm using:
cuda version: 10.2
transformers: 4.4.2
pytorch: 1.7.1
deepspeed: 0.3.13
gcc/c++/g++: (GCC) 7.2.1 20170915 (Red Hat 7.2.1-2)
I must admit I had some issues upgrading the CUDA version from the instance's default 10.0 to 10.2, and GCC from 4.8.5 to 7.2.1. But since I no longer get the errors that the torch and installed CUDA versions differ, or that GCC is older than version 5, I'd assume I'm in the clear.
Here's the essential part of the code I'm running (from a notebook):
Here's the content of my config file:
Here's the output of my `ds_config`:

And finally, here's the stack trace:
Thanks in advance for your help!