Problem in BERT pre-training: how to train on multiple GPUs #1508
Comments
Please provide the complete error message |
the whole message:
|
First, I want to make sure: is my method correct for pre-training a BERT model on multiple GPUs? |
Do you mean that you use the gluon-nlp master branch with MXNet 1.7? That combination is not supported. You need to use the MXNet 2 Alpha release https://github.com/apache/incubator-mxnet/releases/v2.0.0-alpha with the GluonNLP master branch. If you would rather not compile MXNet from source, you can also just follow https://github.com/dmlc/gluon-nlp#installation |
I use gluon-nlp branch 2.0 with MXNet 1.7. Is it also not supported? |
I think my 'mpirun' environment may be wrong, for example the optional parameters:
This may cause problems with inter-process communication. So, which parameters need to be set for multi-GPU training? |
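For reference, a minimal sketch of what a single-machine multi-GPU launch can look like when the job is started through Horovod. The mpirun flags follow the Open MPI invocation described in Horovod's own documentation; the GPU count (4) and the script name `run_pretraining.py` with its arguments left out are assumptions, not something stated in this thread. The script only prints the commands so they can be inspected before being run.

```shell
#!/bin/sh
# Assumptions (not from the thread): 4 GPUs on one machine, and a training
# script called run_pretraining.py whose own arguments are omitted here.
NUM_GPUS=4
TRAIN_CMD="python3 run_pretraining.py"

# Option 1: Horovod's launcher, which builds the MPI command line itself.
HOROVODRUN_CMD="horovodrun -np ${NUM_GPUS} -H localhost:${NUM_GPUS} ${TRAIN_CMD}"

# Option 2: plain Open MPI, one process per GPU.
#   -bind-to none -map-by slot : do not pin each process to a single core
#   -x VAR                     : forward an environment variable to every rank
#   -mca pml ob1 -mca btl ^openib : use TCP for MPI traffic, leave GPU traffic to NCCL
MPIRUN_CMD="mpirun -np ${NUM_GPUS} -H localhost:${NUM_GPUS} -bind-to none -map-by slot -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH -mca pml ob1 -mca btl ^openib ${TRAIN_CMD}"

# Dry run: print both commands instead of executing them.
echo "$HOROVODRUN_CMD"
echo "$MPIRUN_CMD"
```

If the printed command looks right for your machine, paste it into the shell to run it. `NCCL_DEBUG=INFO` is there only to make communication problems between processes easier to spot in the logs.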
I don't know how this branch was created, but there is actually no gluon-nlp 2.0. cc @szha @sxjscience let's delete the branch? |
I have no idea about the 2.0 branch. We may just delete it. @yangshuo0323 Feel free to try out the BERT pretraining code in https://github.com/dmlc/gluon-nlp/tree/master/scripts/pretraining/bert |
I have tried gluon-nlp branch 0.10.0, and the same error occurred. So gluon-nlp (0.10.0) and MXNet (1.6.0 or 1.7.0) are compatible, right? I will check the rest of my software environment... |
That should work. In fact, is it feasible to try out our new version with the custom version of MXNet 2.0 and the GluonNLP master branch?
|
Ok, I will try out the new version of MXNet and GluonNLP. Thank you so much!
|
@yangshuo0323 Thanks! I encourage you to try our new version, and we can help if you run into any problems training the model. To try the new MXNet, you can install it with one of the following commands:
# Install the version with CUDA 10.1
python3 -m pip install -U --pre "mxnet-cu101>=2.0.0b20210121" -f https://dist.mxnet.io/python
# Install the version with CUDA 10.2
python3 -m pip install -U --pre "mxnet-cu102>=2.0.0b20210121" -f https://dist.mxnet.io/python
# Install the version with CUDA 11
python3 -m pip install -U --pre "mxnet-cu110>=2.0.0b20210121" -f https://dist.mxnet.io/python
# Install the cpu-only version
python3 -m pip install -U --pre "mxnet>=2.0.0b20210121" -f https://dist.mxnet.io/python
Also, you can just clone gluonnlp/master and install it via the following command:
python3 -m pip install -U -e ."[extras]"
This will give the
Also, we recommend installing horovod via
HOROVOD_GPU_OPERATIONS=NCCL HOROVOD_WITHOUT_GLOO=1 HOROVOD_WITH_MPI=1 HOROVOD_WITH_MXNET=1 HOROVOD_WITHOUT_TENSORFLOW=1 python3 -m pip install --no-cache-dir horovod
After that, feel free to try out the example in https://github.com/dmlc/gluon-nlp/tree/master/scripts/pretraining/bert. We will try to help with any issues you run into. |
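After an installation like the one above, a quick sanity check can save a failed training run. The following is a minimal sketch, not part of the GluonNLP instructions: `check_cmd` is a hypothetical helper name, and the only Horovod feature it relies on is `horovodrun --check-build`, which lists the frameworks and communication backends the installed Horovod was compiled with.

```shell
#!/bin/sh
# check_cmd is a hypothetical helper: it reports whether a command is on PATH.
check_cmd() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1"
    else
        echo "missing: $1"
        return 1
    fi
}

check_cmd python3 || true
check_cmd mpirun || true

# Print the installed MXNet version, if the package can be imported.
if python3 -c "import mxnet" >/dev/null 2>&1; then
    python3 -c "import mxnet; print('mxnet', mxnet.__version__)"
else
    echo "mxnet is not importable from python3"
fi

# List what Horovod was built with; MXNet, MPI and NCCL should be ticked.
if check_cmd horovodrun; then
    horovodrun --check-build
fi
```

If MXNet, MPI or NCCL is not ticked in the `--check-build` output, the `HOROVOD_*` environment variables were probably not in effect when pip built Horovod, and reinstalling with `--no-cache-dir` is the usual fix.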
The previous error was caused by an incorrect installation of horovod, which may not have used the env
|
Description
pip install mxnet-cu102, version is 1.7.0
https://github.com/dmlc/gluon-nlp, branch is 2.0
gluon-nlp/scripts/bert/run_pretraining.py: https://nlp.gluon.ai/model_zoo/bert/index.html#bert-model-zoo
Seek help:
I have read the guidance, but I still don't know how to run it.
Please help me. Could I get the correct instructions or any suggestions? Thanks.