feature(wgt): enable DI using torch-rpc to support GPU-p2p and RDMA-rpc #562
base: main
Conversation
1. Add a torchrpc message queue.
2. Implement a buffer based on CUDA shared tensors to optimize the data path of torchrpc.
3. Add a 'bypass_eventloop' arg in Task() and Parallel().
4. Add a thread lock in distributer.py to prevent contention between sender and receiver.
5. Add message queue perf tests for torchrpc, nccl, nng, and shm.
6. Add comm_perf_helper.py to make program timing more convenient.
7. Modify the subscribe() of class MQ, adding 'fn' and 'is_once' parameters.
8. Add new DummyLock and ConditionLock types in lock_helper.py.
9. Add message queue perf tests.
10. Introduce a new self-hosted runner to execute cuda, multiprocess, and torchrpc related tests.
Codecov Report
@@ Coverage Diff @@
## main #562 +/- ##
==========================================
- Coverage 83.60% 82.41% -1.20%
==========================================
Files 565 571 +6
Lines 46375 47198 +823
==========================================
+ Hits 38774 38900 +126
- Misses 7601 8298 +697
1da53e2 to 30b3a73
@@ -1,4 +1,5 @@
[run]
concurrency = multiprocessing,thread
Why add this?
Adding concurrency = multiprocessing lets coverage of subprocesses be counted as well; the default concurrency only covers threading. However, there are some caveats to using it, see: https://pytest-cov.readthedocs.io/en/latest/subprocess-support.html
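For reference, a minimal .coveragerc sketch with this setting (the parallel line is an assumption about typical subprocess setups, not part of this PR's diff):

```ini
[run]
# Measure coverage in multiprocessing subprocesses as well as in threads.
concurrency = multiprocessing,thread
# Assumption, not from this PR: subprocess measurement usually also writes
# per-process data files that are combined afterwards.
parallel = true
```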
codecov.yml
Outdated
# fix me
# The unittests of the torchrpc module are tested by different runners and cannot be included
# in the test_unittest's coverage report. To keep CI happy, we don't count torchrpc related coverage.
ignore:
Should these ignore items also be added to .coveragerc?
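If they were, a hypothetical .coveragerc equivalent might look like this (the file paths are placeholders for illustration, not taken from this PR):

```ini
[run]
# Placeholder paths: exclude torchrpc-related modules from measurement so the
# main unittest runner's report is not penalized for files it cannot exercise.
omit =
    ding/framework/message_queue/torch_rpc.py
    ding/framework/message_queue/perfs/*
```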
ding/data/shm_buffer.py
Outdated
self.shape = shape
self.device = device
# We don't want the buffer to be involved in the computational graph
with torch.no_grad():
Tensor creation operations are not recorded in the computation graph, so we don't need torch.no_grad() here.
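A quick PyTorch illustration of this point (not code from the PR):

```python
import torch

# Freshly allocated tensors do not require grad by default, so wrapping the
# allocation in torch.no_grad() changes nothing about the computation graph.
buf = torch.zeros(1024, 1024, dtype=torch.float32)
print(buf.requires_grad)  # False
print(buf.grad_fn)        # None -- the tensor is not attached to any graph
```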
ding/data/tests/test_shm_buffer.py
Outdated
event_run = ctx.Event()
shm_buf_np = ShmBufferCuda(np.dtype(np.float32), shape=(1024, 1024), copy_on_get=True)
shm_buf_torch = ShmBufferCuda(torch.float32, shape=(1024, 1024), copy_on_get=True)
We should add another unittest for the case copy_on_get=False to validate it.
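A rough sketch of such a test, mirroring the copy_on_get=True setup above; the fill/get method names, the import path, and the cudatest marker are assumptions and may not match the actual ShmBufferCuda API:

```python
import pytest
import torch

from ding.data.shm_buffer import ShmBufferCuda  # assumed import path


@pytest.mark.cudatest  # assumed marker for the self-hosted CUDA runner
def test_shm_buffer_cuda_no_copy_on_get():
    # Same shape/dtype as the existing case, but with copy_on_get=False.
    shm_buf_torch = ShmBufferCuda(torch.float32, shape=(1024, 1024), copy_on_get=False)
    src = torch.rand(1024, 1024, dtype=torch.float32)
    shm_buf_torch.fill(src)    # assumed API: write into the shared buffer
    out = shm_buf_torch.get()  # assumed API: read back without a defensive copy
    assert torch.allclose(out.cpu(), src)
```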
        task.use(eps_greedy_handler(cfg))
        task.use(StepCollector(cfg, policy.collect_mode, collector_env))
        task.use(termination_checker(max_env_step=int(1e7)))
    else:
        raise KeyError("invalid router labels: {}".format(task.router.labels))

    task.run()
Why remove this?
ding/utils/lock_helper.py
Outdated
Overview:
    thread lock decorator.
Arguments:
    - func ([type]): A function that needs to be protected by a lock.
Use Callable as the type here instead of [type].
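For example, the annotation and docstring could read like this (the decorator name is hypothetical; only the Callable hint is the point):

```python
from typing import Callable


def rw_lock_func(func: Callable) -> Callable:  # hypothetical name for illustration
    """
    Overview:
        Thread lock decorator.
    Arguments:
        - func (:obj:`Callable`): A function that needs to be protected by a lock.
    """
    ...
```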
OUTPUT_DICT[func_name] = OUTPUT_DICT[func_name] + str(round(avg_tt, 4)) + ","


def print_timer_result_csv():
Maybe you can use the pretty_print function in ding.utils.
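A rough sketch of the idea; the exact signature of ding.utils.pretty_print and the shape of the timing data are assumptions here:

```python
from ding.utils import pretty_print  # helper the reviewer refers to; exact signature assumed

# Hypothetical stand-in for the module's timing store: {func_name: average seconds}.
AVG_TIME_DICT = {"nccl_send": 0.0123, "torchrpc_send": 0.0045}


def print_timer_result():
    # Let the shared helper handle formatting instead of hand-building a CSV string.
    pretty_print({name: round(avg, 4) for name, avg in AVG_TIME_DICT.items()})
```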
    - args (:obj:`any`): Rest arguments for listeners.
"""
# Check if need to broadcast event to connected nodes, default is True
assert self._running, "Please make sure the task is running before calling the this method, see the task.start"
if only_local:
Why remove this?
@@ -71,8 +71,8 @@ def _train(ctx: Union["OnlineRLContext", "OfflineRLContext"]):

        if ctx.train_data is None:  # no enough data from data fetcher
            return
        data = ctx.train_data.to(policy._device)
        train_output = policy.forward(data)
        # data = ctx.train_data.to(policy._device)
Why comment this out?
# so all data on the cpu side is copied to "cuda:0" here. In fact this
# copy is unnecessary, because torchrpc can support both cpu side and gpu
# side data to communicate using RDMA, but mixing the two transfer types
# will cause a bug, see issue:
Where is the issue?
Commit:
Add a torchrpc message queue.
Implement a buffer based on CUDA shared tensors to optimize the data path of torchrpc.
Add a 'bypass_eventloop' arg in Task() and Parallel().
Add a thread lock in distributer.py to prevent contention between sender and receiver.
Add message queue perf tests for torchrpc, nccl, nng, and shm.
Add comm_perf_helper.py to make program timing more convenient.
Modify the subscribe() of class MQ, adding 'fn' and 'is_once' parameters.
Add new DummyLock and ConditionLock types in lock_helper.py.
Add message queue perf tests.
Introduce a new self-hosted runner to execute cuda, multiprocess, and torchrpc related tests.
Description
DI-engine integrates the torch.distributed.rpc module.
cli-ditask introduces new command-line arguments:
- --mq_type: introduces the torchrpc:cuda and torchrpc:cpu options.
  - torchrpc:cuda: use torchrpc for communication and allow setting device_map; GPU direct RDMA can be used.
  - torchrpc:cpu: use torchrpc for communication, but device_map is not allowed to be set; all data on the GPU side will be copied to the CPU side for transmission.
- --init-method: initialization entry for init_rpc (required if --mq_type is torchrpc).
- --local-cuda-devices: set the ranks of the local GPUs that can be used (optional, default is all visible devices).
- --cuda-device-map: used to set device_map, the format is as follows: (optional, the default is to map all visible GPUs to GPU-0 of the peer).
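For illustration, a hedged example of how these flags might be combined when launching a worker; everything besides the flag names documented above (the elided ditask arguments, the tcp address, and the device-list format) is an assumption:

```bash
# Hypothetical launch line; only the flag names come from this PR.
ditask ... \
    --mq_type torchrpc:cuda \
    --init-method tcp://10.0.0.1:12345 \
    --local-cuda-devices 0,1,2,3
```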
Dynamic GPU communication groups
We create device mappings between all possible devices in advance. This mapping is all-to-all, which covers all communication situations; the purpose is to avoid errors caused by incomplete device_map coverage, and setting redundant mappings has no side effects. The mappings are used to check the validity of the device during transport. Only after a new process joins the communication group will it try to create a channel based on these maps.
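A minimal sketch of what such a mapping looks like at the torch.distributed.rpc level, seen from one process; the worker names, the address, and the identity mapping are illustrative only and do not reproduce the PR's actual map-construction policy:

```python
import torch.distributed.rpc as rpc

# Hypothetical two-process setup seen from "worker0", with two visible GPUs.
LOCAL_GPUS = [0, 1]

options = rpc.TensorPipeRpcBackendOptions(init_method="tcp://10.0.0.1:12345")
# Register the GPU mapping towards the peer up front, so device-to-device
# transfers are valid as soon as the peer joins; torch requires a 1-to-1 map.
options.set_device_map("worker1", {d: d for d in LOCAL_GPUS})

rpc.init_rpc("worker0", rank=0, world_size=2, rpc_backend_options=options)
```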
At the same time, we still expose the --cuda-device-map interface, which allows users to configure the topology between devices; torchrpc will follow user input.

Related Issue
TODO
Load balancing capability: in a time-heterogeneous RL task environment, each worker can run at full capacity.
Check List