[WIP] Add Megatron-11B #10301
Conversation
That's very neat, @anton-l! Thank you for the port. You demonstrated great creativity by finding a way to recompose the model shards!
As you correctly noticed, studying Megatron-LM's horizontal model parallel sharding is on my TODO list. I think the part that deals with sharding is here in the original. So if the horizontal MP is eventually re-ported (I hope it will be so), the model will need to know when to load the flattened version and when the sharded one. But I'm just thinking aloud here, considering different options, not making any requests ;) The fp32 weights are ~41GB (https://huggingface.co/anton-l/megatron-11b/tree/main), i.e. quite similar to t5-11b, so it should be possible to load it on a 40GB GPU w/ DeepSpeed ZeRO-Offload if some 256GB of RAM are available. Also, FYI, the DeepSpeed team is making a new port of Megatron-LM to work with DeepSpeed: https://github.com/jeffra/DSE/tree/master/megatron-lm
@stas00 you're correct, I didn't port the model-parallel implementation. Fairseq uses an older Megatron-LM version as a submodule here for its MP map-reduce functions. This makes it quite cumbersome to reproduce, since it requires compiling that older Megatron-LM version. However, on the surface it seems like adding support for model parallelism comes down to porting those map-reduce functions. I guess a proper MP implementation should also take care of splitting the checkpointed layers regardless of how many GPUs are available (i.e. 2, 4 or 8). That would remove the requirement to have a full DGX setup if the user is willing to use gradient checkpointing/accumulation instead.
@anton-l, in order not to make your and the reviewers' lives unnecessarily difficult, let's take the discussion of the horizontal MP to a dedicated issue, since it could take some time to figure out and none of it is required for you to complete this PR. I trust @patil-suraj and @patrickvonplaten will support you in completing this awesome effort. So if you could re-post your last comment here: #10321, I will follow up there. Thank you!
This is amazing work, big kudos! The seemingly low text-generation quality surprises me though, because of the crazy good output you get from https://inferkit.com/, which is also just Megatron-11B according to their docs (https://inferkit.com/docs/generation). Their output seems to be much better than GPT2.
@anton-l, would you like to complete this PR? For it to be reviewed it needs to be a normal PR and not a draft. I marked it as WIP so that the stale bot won't try to close it. Thank you.
Pinging @anton-l - let's revisit this? Please let us know what you need. I know that meanwhile someone else ported the original GPT2-345M checkpoint (https://huggingface.co/nvidia/megatron-gpt2-345m), and I see from the docs that they use the straight GPT2 transformers model to operate it. All they have is a conversion script. Please bear with me, I'm just starting to figure out Megatron-LM and its variants (there is also a Deepspeed variant), so I'm just slightly above clueless at the moment - I should have a better understanding in a few days once I've had a chance to work with it.
@stas00 sorry for the late reply! It's great that someone figured out a way to port the original Megatron models. When I was looking into that, it wasn't exactly straightforward due to the differences between the attention block implementations in HF GPT2 and Megatron, which were probably patched/parameterized in the meantime. I chose to implement a separate model for the fairseq Megatron because the model uses the same code as the existing MBART & FSMT, but with only a decoder, without the encoder. However, we could take a different route and convert the fairseq weights to fit GPT2, since it's clearly possible now. I'll try that tomorrow, and if it works out, we can discard this PR and just add a simple conversion script 👍
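For reference, the kind of simple conversion script discussed here might look roughly like the sketch below. The fairseq and GPT2 key names, the fused q/k/v layout, and the layer-norm mapping are assumptions for illustration; positional embeddings (sinusoidal vs. learned), the final layer norm, and the tokenizer would all need separate treatment, and the real mapping would have to be verified against both checkpoints.

```python
import torch

def convert_fairseq_megatron_to_gpt2(fairseq_state_dict, num_layers):
    """Illustrative remapping of fairseq decoder weights onto HF GPT2 key names.

    Key names on both sides are assumptions for this sketch. Note that HF GPT2
    stores projection weights transposed, because it uses Conv1D instead of
    nn.Linear, hence the .t() calls below.
    """
    sd = fairseq_state_dict
    new_sd = {"transformer.wte.weight": sd["decoder.embed_tokens.weight"]}
    for i in range(num_layers):
        src, dst = f"decoder.layers.{i}", f"transformer.h.{i}"
        # GPT2 fuses the q/k/v projections into a single c_attn layer.
        qkv_w = torch.cat([sd[f"{src}.self_attn.{p}_proj.weight"] for p in "qkv"], dim=0)
        qkv_b = torch.cat([sd[f"{src}.self_attn.{p}_proj.bias"] for p in "qkv"], dim=0)
        new_sd[f"{dst}.attn.c_attn.weight"] = qkv_w.t()
        new_sd[f"{dst}.attn.c_attn.bias"] = qkv_b
        new_sd[f"{dst}.attn.c_proj.weight"] = sd[f"{src}.self_attn.out_proj.weight"].t()
        new_sd[f"{dst}.attn.c_proj.bias"] = sd[f"{src}.self_attn.out_proj.bias"]
        # Both models are pre-LN, so the pre-attention and pre-FFN norms line up.
        new_sd[f"{dst}.ln_1.weight"] = sd[f"{src}.self_attn_layer_norm.weight"]
        new_sd[f"{dst}.ln_1.bias"] = sd[f"{src}.self_attn_layer_norm.bias"]
        new_sd[f"{dst}.mlp.c_fc.weight"] = sd[f"{src}.fc1.weight"].t()
        new_sd[f"{dst}.mlp.c_fc.bias"] = sd[f"{src}.fc1.bias"]
        new_sd[f"{dst}.mlp.c_proj.weight"] = sd[f"{src}.fc2.weight"].t()
        new_sd[f"{dst}.mlp.c_proj.bias"] = sd[f"{src}.fc2.bias"]
        new_sd[f"{dst}.ln_2.weight"] = sd[f"{src}.final_layer_norm.weight"]
        new_sd[f"{dst}.ln_2.bias"] = sd[f"{src}.final_layer_norm.bias"]
    return new_sd
```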
This PR seems very promising, and I know the model would be really useful to many. As pointed out earlier, the converted model doesn't seem to have the same generation quality as the model hosted elsewhere. Perhaps the conversion script could have caused it somehow? Just curious if there was any success with converting the fairseq weights to fit GPT2.
What does this PR do?
Fixes #9560
This PR introduces the Megatron model as described in https://github.com/pytorch/fairseq/blob/master/examples/megatron_11b/README.md
This one will probably be fun to test with DeepSpeed; as @stas00 mentioned, it's referenced a lot in its docs 😄
It's important to mention that there are actually two independent implementations of Megatron-LM: NVIDIA's original Megatron-LM and the fairseq port used for Megatron-11B.
After some tinkering I realized that fairseq's checkpoint is already pretty compatible with the existing BART port. So, based on that and the fact that NVIDIA doesn't plan on releasing the 3B and 8B checkpoints, I chose to port only the fairseq version.
NOTE: The original fairseq implementation requires an 8-GPU server to even load the model weights, so I just load the checkpoints manually one by one and merge the model-parallelized tensors into single-model ones.
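Not the actual script from this PR, but a minimal sketch of that merge step, assuming eight fairseq shard files with fairseq-style key names; the filenames and the which-dimension-to-concatenate rule below are illustrative only and would need to be checked against the real checkpoint.

```python
import torch

# Hypothetical shard filenames; the real fairseq release uses its own naming scheme.
shard_paths = [f"model_part-{i}.pt" for i in range(8)]
# fairseq checkpoints keep the weights under the "model" key.
shards = [torch.load(path, map_location="cpu")["model"] for path in shard_paths]

def merge_tensor(name, tensors):
    """Recombine one model-parallel tensor from all shards.

    Column-parallel layers (q/k/v projections, the first FFN projection, the
    vocab-parallel embedding) are split along dim 0, row-parallel layers
    (attention output and second FFN projection) along dim 1; everything else
    (layer norms, row-parallel biases) is replicated, so one copy suffices.
    """
    if any(k in name for k in ("q_proj", "k_proj", "v_proj", "fc1", "embed_tokens")):
        return torch.cat(tensors, dim=0)
    if any(k in name for k in ("out_proj.weight", "fc2.weight")):
        return torch.cat(tensors, dim=1)
    return tensors[0]

merged = {name: merge_tensor(name, [s[name] for s in shards]) for name in shards[0]}
torch.save(merged, "megatron_11b_flat.pt")
```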
How to reproduce the conversion

- The merged `state_dict` can be easily loaded into a CPU-compatible `fairseq.TransformerLanguageModel`. The de-parallelisation is based on ParlAI's conversion script.
- The new model is then initialized and the converted `state_dict` is loaded into it.

Here's how Megatron differs from the existing BART/MBART implementations:

- The unused encoder-related arguments (`encoder_hidden_states`, `encoder_attention_mask`) and the cross-attention are kept for now to simplify the review process on your end.
- Megatron uses `SinusoidalPositionalEmbedding` instead of learned ones, so I just yanked those from FSMT 😄
- There is no `layernorm_embedding`.
- `self_attn_layer_norm` is applied before self-attention (like in MBART) instead of after (like in BART).

Important questions regarding the API:

- The `decoder` variable can be left as is, since it's compatible with the fairseq checkpoint keys, but the `encoder_*` references in the code bother me a lot. We need to somehow strike a balance between `Copied from` and removing the unused parts.
- `self_attn_layer_norm` should be a parameter in the config, similar to `decoder_normalize_before=True` in fairseq. This will close the not-so-obvious difference between BART and MBART.
- `layernorm_embedding` can also be parametrized, similar to `layernorm_embedding=False` in fairseq (see the config sketch below).
Quick LM test

You can test out the model's capabilities like so (again, you'll probably need at least 85GB of RAM; there's some weird memory duplication happening somewhere, this shouldn't need more than ~50GB):
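The original snippet isn't preserved in this copy of the PR, so here is a rough sketch of what such a quick test might look like, assuming the converted checkpoint at https://huggingface.co/anton-l/megatron-11b can be loaded through the standard `from_pretrained`/`generate` API (the exact class this PR wires up may differ).

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Assumes the PR registers the converted checkpoint with the Auto classes;
# loading the fp32 weights needs a lot of CPU RAM, as noted above.
model_name = "anton-l/megatron-11b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

inputs = tokenizer("The Megatron-11B language model is", return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(**inputs, max_length=64, do_sample=True, top_p=0.9)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```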
To be honest, I'm not too impressed with its text-generation power. 😄 I guess either the model was too large to train for enough steps, or I missed something during the conversion. The original implementation doesn't have a text-generation script (or any non-wikitext results, for that matter), so I'm kinda in the dark here.
Before submitting

- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
@patrickvonplaten, @patil-suraj