[WIP][Need help and discussion] : basic llama tensor parallel #32597
What does this PR do?
This PR addresses an issue encountered when running the following command:
The current implementation results in the following error:
Problem Description and Discussion Points
Sequence Length:
It appears that Tensor Parallel requires the sequence length to be evenly divisible by the tensor-parallel degree (the number of GPUs in the TP group), which the existing implementation does not handle (though this doesn't apply in inference mode).
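For training, one possible workaround is to right-pad each batch so that the sequence length becomes divisible by the TP degree and mask out the padded positions. This is only a minimal sketch of that idea, not part of this PR; `pad_to_multiple` is a hypothetical helper:

```python
import torch
import torch.nn.functional as F

def pad_to_multiple(input_ids: torch.Tensor, attention_mask: torch.Tensor,
                    tp_size: int, pad_token_id: int):
    """Right-pad a batch so that seq_len is divisible by the TP degree.

    Illustrative helper only; the padded positions are masked out so they
    do not contribute to attention or the loss.
    """
    seq_len = input_ids.size(1)
    remainder = seq_len % tp_size
    if remainder == 0:
        return input_ids, attention_mask
    pad_len = tp_size - remainder
    input_ids = F.pad(input_ids, (0, pad_len), value=pad_token_id)
    attention_mask = F.pad(attention_mask, (0, pad_len), value=0)
    return input_ids, attention_mask
```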
Potential Solution - Accelerate:
Given the benefits of Tensor Parallel in training, especially compared with data-parallel-based methods such as DeepSpeed and FSDP, I'm considering submitting a PR to the accelerate library. However, the current structure of the transformers models may need to be adjusted to fully realize these benefits.
Root Cause of the Error:
The error seems to stem from the fact that positional embeddings are applied immediately after token embeddings. This results in an incompatibility with Tensor Parallel, causing the misalignment seen in the error.
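To illustrate the kind of misalignment I mean, here is a hedged sketch under the assumption that the sequence dimension is split contiguously across TP ranks: each rank then only holds a local slice of the tokens, so position ids built from the local length no longer match the global token positions unless they are offset per rank. `local_position_ids` is a hypothetical helper, not code from this PR:

```python
import torch

def local_position_ids(global_seq_len: int, tp_size: int, tp_rank: int) -> torch.Tensor:
    """Global position ids for the contiguous sequence shard owned by `tp_rank`,
    assuming `global_seq_len` is divisible by `tp_size`.

    Building positions with a plain `torch.arange(local_len)` on every rank would
    assign 0..local_len-1 everywhere, which is the kind of misalignment with the
    already-sharded token embeddings described above.
    """
    local_len = global_seq_len // tp_size
    start = tp_rank * local_len
    return torch.arange(start, start + local_len)
```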
Request for Assistance:
Addressing this issue might require significant changes to the codebase. As such, I would greatly appreciate any feedback, guidance, or assistance in this matter.
As a recent graduate, I've observed that many teams now train on 2-4 nodes with 8 GPUs each. In these setups, data-parallel (DP) methods like DeepSpeed and FSDP often suffer from high ring-communication latency, and many are limited by the memory constraints of a single device. I believe Tensor Parallel within a node, coupled with DP across nodes, could become a dominant approach in the near future. I'm eager to discuss and collaborate on making this approach compatible with the Transformers library or a similar framework.
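To make that last point concrete, here is a rough sketch of how intra-node TP could be combined with inter-node DP using PyTorch's DeviceMesh. This reflects my assumption about the layout, not the implementation in this PR; the 2 x 8 configuration and launch command are illustrative:

```python
# Sketch only; launch with: torchrun --nnodes 2 --nproc-per-node 8 train.py
import os

import torch
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh

dist.init_process_group("nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

# Assumed layout: 2 nodes x 8 GPUs -> DP across nodes, TP inside each node.
mesh_2d = init_device_mesh("cuda", (2, 8), mesh_dim_names=("dp", "tp"))
tp_mesh = mesh_2d["tp"]  # 8-way tensor-parallel group within one node
dp_mesh = mesh_2d["dp"]  # 2-way data-parallel group across nodes

# The model's linear layers would then be sharded over `tp_mesh`
# (e.g. via torch.distributed.tensor.parallel.parallelize_module), while
# gradients are synchronized over `dp_mesh` (e.g. with DDP/FSDP on that sub-mesh).
```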
Before submitting
Thank you in advance for your time and consideration. I look forward to any suggestions or feedback.
cc. @amyeroberts