[Performance] AllReduce performance debugging #90

Closed
chhwang opened this issue May 30, 2023 · 5 comments

@chhwang
Contributor

chhwang commented May 30, 2023

This issue was suggested by @saeedmaleki for qualifying #83. Desired enhancements:

  1. Improve reduce sum throughput (should achieve near HBM bandwidth)
  2. Communication-computation overlapping
  3. (For LL) changing the loop order, incremental flags
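
One way to judge whether item 1 is met is to convert a measured kernel time into achieved memory bandwidth and compare it with the device's HBM peak. The helpers below are a minimal sketch of that arithmetic, not code from this repository. All names are hypothetical, and the byte count assumes each input buffer is read once and the output written once, which may not match the actual kernel.

```cpp
#include <cassert>
#include <cstddef>

// Bytes moved by an element-wise sum of nInputs buffers into one output
// buffer, ASSUMING every input is read once and the output is written once.
inline double reduceSumBytes(std::size_t nElems, std::size_t elemSize, std::size_t nInputs) {
  return static_cast<double>(nElems) * static_cast<double>(elemSize) *
         static_cast<double>(nInputs + 1);
}

// Achieved bandwidth in GB/s (1 GB = 1e9 bytes) for a measured kernel time.
inline double achievedGBps(double bytes, double seconds) {
  return bytes / seconds / 1e9;
}

// Fraction of the peak HBM bandwidth reached; peakGBps comes from the
// device specification and is supplied by the caller.
inline double fractionOfPeak(double achievedGBpsValue, double peakGBps) {
  return achievedGBpsValue / peakGBps;
}
```

For example, summing two buffers of 2^20 floats moves 12,582,912 bytes under this accounting; dividing by the measured time gives the figure to compare against the HBM peak.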
@chhwang chhwang self-assigned this May 30, 2023
@chhwang
Contributor Author

chhwang commented May 31, 2023

Item 1 is a TODO for #85 (Add packet copy (LL) for AllReduce).

@chhwang
Contributor Author

chhwang commented Jun 6, 2023

@saeedmaleki For LL AllReduce, it seems we cannot avoid signalPacket() even if we use incremental flags. LL does not check these flags before writing, so the writer may overwrite unread data on the remote side. If we made LL check the flags before writing, that would be no different from signalPacket().
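
The hazard described here can be replayed with a small single-threaded model. This is only a sketch under assumptions: `Packet`, `writePacket`, and `tryReadPacket` are hypothetical names, not the project's API, and a real LL implementation runs on the GPU and writes payload and flag together rather than as two plain stores.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for an LL packet: payload and flag travel together,
// so a reader can tell from the flag alone whether the payload is fresh.
struct Packet {
  std::uint32_t data = 0;
  std::uint32_t flag = 0;  // 0 means "never written"
};

// Writer side: stores payload and flag without inspecting the slot first.
inline void writePacket(Packet& slot, std::uint32_t data, std::uint32_t flag) {
  slot.data = data;
  slot.flag = flag;
}

// Reader side: succeeds only if the slot carries the flag of the expected round.
inline bool tryReadPacket(const Packet& slot, std::uint32_t expectedFlag,
                          std::uint32_t& out) {
  if (slot.flag != expectedFlag) return false;
  out = slot.data;
  return true;
}

// Replays the hazard: the writer completes rounds 1 and 2 before the reader
// gets to round 1. Returns true if the reader can still read round 1.
inline bool readerSeesRoundOne() {
  Packet slot;
  writePacket(slot, 111, 1);
  writePacket(slot, 222, 2);  // overwrites the unread round-1 payload
  std::uint32_t out = 0;
  return tryReadPacket(slot, 1, out);
}
```

In this model the incremental flag lets the reader notice that the slot no longer holds round 1, but it cannot bring the overwritten payload back, which is why the writer would otherwise need some signal before reusing the slot.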

@saeedmaleki
Contributor

I have a trick! We can use two different buffers and sets of packets, and alternate between them from one kernel launch to the next. This way we are guaranteed not to overwrite the data.
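
A host-side sketch of the alternating-buffer idea follows. It is an illustration under assumptions rather than the code in #85: `DoubleBufferedScratch`, `writeLaunch`, and `readLaunch` are hypothetical names, the flag is taken to be the launch index plus one, and the writer is assumed never to run more than one launch ahead of its reader.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical packet: payload plus the flag of the launch that wrote it.
struct Packet {
  std::uint32_t data = 0;
  std::uint32_t flag = 0;  // 0 means "never written"
};

// Two packet buffers; launch k uses buffer k % 2 and flag k + 1.
struct DoubleBufferedScratch {
  std::array<std::vector<Packet>, 2> buffers;

  explicit DoubleBufferedScratch(std::size_t nPackets) {
    buffers[0].resize(nPackets);
    buffers[1].resize(nPackets);
  }

  std::vector<Packet>& bufferFor(std::uint32_t launch) { return buffers[launch % 2]; }
  static std::uint32_t flagFor(std::uint32_t launch) { return launch + 1; }
};

// Writer: fills the buffer that belongs to this launch, without checking it first.
inline void writeLaunch(DoubleBufferedScratch& s, std::uint32_t launch,
                        std::uint32_t value) {
  for (Packet& p : s.bufferFor(launch)) {
    p.data = value;
    p.flag = DoubleBufferedScratch::flagFor(launch);
  }
}

// Reader: sums the payloads of this launch; fails if any packet is stale or overwritten.
inline bool readLaunch(DoubleBufferedScratch& s, std::uint32_t launch,
                       std::uint32_t& sum) {
  sum = 0;
  for (const Packet& p : s.bufferFor(launch)) {
    if (p.flag != DoubleBufferedScratch::flagFor(launch)) return false;
    sum += p.data;
  }
  return true;
}
```

With four packets per buffer, writing launches 0 and 1 back to back still lets the reader recover both, because launch 1 lands in the other buffer. Only a writer that is two launches ahead would reuse a buffer that has not been read yet.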

@chhwang
Contributor Author

chhwang commented Jun 8, 2023

This is now implemented in #85. Big latency gain :)

@chhwang
Contributor Author

chhwang commented Jun 8, 2023

All issues tackled in #85.

@chhwang chhwang closed this as completed Jun 8, 2023