Extend rxm and coll providers to be used with offload provider #2
Reviewed all commit messages.
Reviewable status: 0 of 9 files reviewed, all discussions resolved
The FI_PEER flag is sufficient for allocating peer objects. Signed-off-by: Sean Hefty <[email protected]>
This is carry over from the tcp provider but unused. Signed-off-by: Sean Hefty <[email protected]>
Bounce buffer copy overhead is high for ZE device memory. The rendezvous protocol takes advantage of GPU RDMA and performs better even for small messages. Signed-off-by: Jianxin Xiong <[email protected]>
Reviewed all commit messages.
Reviewable status: 0 of 14 files reviewed, all discussions resolved (waiting on @ldorau)
Reviewed 4 of 13 files at r4, 1 of 1 files at r5, 8 of 9 files at r6, 1 of 1 files at r7, all commit messages.
Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @grom72)
prov/rxm/src/rxm_ep.c
line 393 at r7 (raw file):
	rxm_ep = container_of(ep, struct rxm_ep, util_ep.ep_fid);
	//FI_PEER flag is used to force util_coll context
Redundant space at the end
Code quote:
t·
prov/rxm/src/rxm_ep.c
line 393 at r7 (raw file):
	rxm_ep = container_of(ep, struct rxm_ep, util_ep.ep_fid);
	//FI_PEER flag is used to force util_coll context
Use /* */
C-style comment, please
Code quote:
//FI_PEER flag is used to force util_coll context
//where fi_join() is called from offload provider
prov/rxm/src/rxm_ep.c
line 402 at r7 (raw file):
	if (ret)
		goto err_util_coll;
	// It is collective offload provider responsibility to store util_coll provider mc
Add a period at the end of the comment.
Code quote:
// It is collective offload provider responsibility to store util_coll provider mc
prov/rxm/src/rxm_msg.c
line 818 at r7 (raw file):
			rxm_ep_sar_calc_segs_cnt(rxm_ep, data_len));
	} else {
rndv_send:
An empty line should be used before a label, not after.
Code quote:
} else {
rndv_send:
ret = rxm_alloc_rndv_buf(rxm_ep, rxm_conn, context,
src/fabric.c
line 841 at r7 (raw file):
fi_param_define(NULL, "offload_coll_provider", FI_PARAM_STRING, "The name of colective offload provider (default: empty - no provider)");
of a collective offload provider
Code quote:
of colective offload provider
Reviewed all commit messages.
Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @grom72)
Add a call to create a logdir for daos on jfcst-daos during the build stage. Add a parallel summary stage. Signed-off-by: Nikhil Nanal <[email protected]>
Modify the run command to generate logs for daos. Signed-off-by: Nikhil Nanal <[email protected]>
Add a class to summarize daos logs. Modify the check_name, check_pass, check_fail and check_line methods. Signed-off-by: Nikhil Nanal <[email protected]>
Signed-off-by: Tang, Jingyin <[email protected]>
Daos summary
Add configure~ to .gitignore. Signed-off-by: Lukasz Dorau <[email protected]>
Prior to this patch, the provider would ignore the domain name in hints->domain_attr->name even if this was set. This patch fixed this issue. Signed-off-by: Vishwas Dsouza <[email protected]>
This test tests the -d <domain_name> argument of the fabtests. All the EFA domain names are tested with this test with the corresponding fabtest. Signed-off-by: Vishwas Dsouza <[email protected]>
This patch added a call to FI_WARN to print endpoint's address and libfabric version after endpoint is created. Signed-off-by: Wei Zhang <[email protected]>
Signed-off-by: Alexia Ingerson <[email protected]>
Define an explicit union for message headers and replace embedded unions defined in structures. This will allow passing the union to match calls. Signed-off-by: Sean Hefty <[email protected]>
Replace the active_rx parameter with a pointer to the headers, which is all that's needed. This separates match searching from the active_rx. Carry through this change up the call stack. Signed-off-by: Sean Hefty <[email protected]>
Fix the assertion, raised when destroying buffer pools, that not all xfer_entry's have been returned. Move freeing of xfer_entry to immediately after writing a CQ entry. Signed-off-by: Sean Hefty <[email protected]>
fabtests release packages are co-located with libfabric since 1.7.0. Signed-off-by: Jianxin Xiong <[email protected]>
When cloning upstream/main provides newer python files than the Jenkinsfile expects, there will be failures whenever the new python code expects the old Jenkinsfile to call something that it doesn't. Change to checking out the python files from the same branch to avoid this mismatch. Signed-off-by: Zach Dworkin <[email protected]>
The macro is for function parameters only. Signed-off-by: Jianxin Xiong <[email protected]>
coll_cq implementation can be reused by other collective providers. Signed-off-by: Tomasz Gromadzki <[email protected]>
…rity Collective offload capabilities are reported if an offload provider is available; otherwise util collective provider capabilities are reported. Signed-off-by: Tomasz Gromadzki <[email protected]>
…initialization It is the rxm provider's responsibility to initialize the collective offload provider's fabric. Otherwise collective offload functionality will not be available. Signed-off-by: Tomasz Gromadzki <[email protected]>
…IDER The FI_OFFLOAD_PROVIDER environment variable shall be set to the offload provider name to instruct libfabric to set up and use a particular provider. Signed-off-by: Tomasz Gromadzki <[email protected]>
The peer provider must create a peer_eq for the offload provider, to allow the offload provider to report events to the peer provider. Signed-off-by: Tomasz Gromadzki <[email protected]>
An offload provider may execute collective operations via the util_coll provider. It must call the fi_join() operation to get the struct mc required for collective operations. It can only call fi_join() on its peer provider (e.g. rxm). The FI_PEER flag is used to inform the peer provider to call the fi_join() operation for util_coll_ep. Signed-off-by: Tomasz Gromadzki <[email protected]>
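The routing described in this commit message can be illustrated with a minimal, self-contained sketch. It does not use the libfabric headers: every `sketch_` name, the flag value, and the behaviour when the flag is absent (prefer the offload endpoint if one exists) are assumptions made for illustration, not code from this PR.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for FI_PEER; the real value comes from libfabric. */
#define SKETCH_FI_PEER (1ULL << 0)

/* Hypothetical endpoint handle; only its identity matters in this sketch. */
struct sketch_ep {
	const char *name;
};

/* Hypothetical view of the peer (e.g. rxm) endpoint and its two collective
 * endpoints. A NULL name means that no offload provider is configured. */
struct sketch_rxm_ep {
	struct sketch_ep util_coll_ep;
	struct sketch_ep offload_coll_ep;
};

/* Pick the endpoint that should service a join request.
 * With the peer flag set, the call came from the offload provider itself,
 * so it is forced onto the util_coll endpoint. Without the flag, this
 * sketch assumes the offload endpoint is preferred when it exists. */
static struct sketch_ep *sketch_join_target(struct sketch_rxm_ep *ep,
					    uint64_t flags)
{
	if (flags & SKETCH_FI_PEER)
		return &ep->util_coll_ep;

	if (ep->offload_coll_ep.name != NULL)
		return &ep->offload_coll_ep;

	return &ep->util_coll_ep;
}
```

Forcing the flagged call onto util_coll_ep keeps the offload provider from being routed back to itself when it asks its peer for a struct mc.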
The offload_coll_mask value is calculated based on the actual offload capabilities confirmed by fi_query_collective(). Signed-off-by: Tomasz Gromadzki <[email protected]>
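One way to picture the mask calculation is the sketch below. It assumes one bit per collective operation and a query callback that stands in for fi_query_collective(); the `sketch_` names, the operation count, and the stub are hypothetical and are not taken from the PR.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical number of collective operations covered by the mask. */
#define SKETCH_COLL_OP_COUNT 8

/* Hypothetical stand-in for a per-operation fi_query_collective() call:
 * returns true when the provider confirms support for coll_op. */
typedef bool (*sketch_query_fn)(int coll_op, void *provider);

/* Build the mask by asking the provider about every operation and setting
 * the matching bit only for the operations it confirms. */
static uint64_t sketch_build_offload_mask(sketch_query_fn query, void *provider)
{
	uint64_t mask = 0;

	for (int op = 0; op < SKETCH_COLL_OP_COUNT; op++) {
		if (query(op, provider))
			mask |= 1ULL << op;
	}
	return mask;
}

/* Example stub: a provider that only supports even-numbered operations. */
static bool sketch_even_ops_only(int coll_op, void *provider)
{
	(void) provider;
	return (coll_op % 2) == 0;
}
```

Deriving the mask from query results, rather than from a static list, means an operation is only marked as offloaded when the provider has confirmed it.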
…_ONLY is set fi_query_collective() reports only the collective offload provider's capabilities if the OFI_OFFLOAD_PROV_ONLY flag is set. Otherwise the sum of both providers' capabilities is reported. Signed-off-by: Tomasz Gromadzki <[email protected]>
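The reporting rule can be sketched as a small bitmask function. This assumes capabilities are held as bitmasks and that the "sum" of both providers' capabilities means their bitwise union; the flag value and all `sketch_` names are hypothetical.

```c
#include <stdint.h>

/* Hypothetical stand-in for OFI_OFFLOAD_PROV_ONLY; the real definition
 * lives in the PR. */
#define SKETCH_OFFLOAD_PROV_ONLY (1ULL << 0)

/* Return the capabilities to report for a query.
 * offload_caps: bits supported by the collective offload provider.
 * util_caps:    bits supported by the util collective provider. */
static uint64_t sketch_reported_caps(uint64_t offload_caps, uint64_t util_caps,
				     uint64_t flags)
{
	if (flags & SKETCH_OFFLOAD_PROV_ONLY)
		return offload_caps;

	/* Assumption: "sum" is read as the union of both capability sets. */
	return offload_caps | util_caps;
}
```

A caller that wants to know what can truly be offloaded would pass the flag, while a caller that only needs to know whether an operation is available at all would omit it.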
If a posted receive matches with a saved receive, we may need to increment the rx counter. Set the rx counter increment callback to match that of the posted receive. This fixes an assert in xnet_cntr_inc() accessing a NULL cntr_inc function pointer.

Program received signal SIGABRT, Aborted.
0x0000155552d4d37f in raise () from /lib64/libc.so.6
#0 0x0000155552d4d37f in raise () from /lib64/libc.so.6
#1 0x0000155552d37db5 in abort () from /lib64/libc.so.6
#2 0x0000155552d37c89 in __assert_fail_base.cold.0 () from /lib64/libc.so.6
#3 0x0000155552d45a76 in __assert_fail () from /lib64/libc.so.6
#4 0x00001555522967f9 in xnet_cntr_inc (ep=0x6e4c70, xfer_entry=0x6f7a30) at prov/tcp/src/xnet_cq.c:347
#5 0x0000155552296836 in xnet_report_cntr_success (ep=0x6e4c70, cq=0x6ca930, xfer_entry=0x6f7a30) at prov/tcp/src/xnet_cq.c:354
#6 0x000015555229970d in xnet_complete_saved (saved_entry=0x6f7a30) at prov/tcp/src/xnet_progress.c:153
#7 0x0000155552299961 in xnet_recv_saved (saved_entry=0x6f7a30, rx_entry=0x6f7840) at prov/tcp/src/xnet_progress.c:188
#8 0x00001555522946f8 in xnet_srx_tag (srx=0x6dd1c0, recv_entry=0x6f7840) at prov/tcp/src/xnet_srx.c:445
#9 0x0000155552294bb1 in xnet_srx_trecv (ep_fid=0x6dd1c0, buf=0x6990c4, len=4, desc=0x0, src_addr=0, tag=21474836494, ignore=3458764513820540928, context=0x7ffffffeb180) at prov/tcp/src/xnet_srx.c:558
#10 0x000015555228f60e in fi_trecv (ep=0x6dd1c0, buf=0x6990c4, len=4, desc=0x0, src_addr=0, tag=21474836494, ignore=3458764513820540928, context=0x7ffffffeb180) at ./include/rdma/fi_tagged.h:91
#11 0x00001555522900a7 in xnet_rdm_trecv (ep_fid=0x6d9fe0, buf=0x6990c4, len=4, desc=0x0, src_addr=0, tag=21474836494, ignore=3458764513820540928, context=0x7ffffffeb180) at prov/tcp/src/xnet_rdm.c:212

Signed-off-by: Sean Hefty <[email protected]>
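The shape of this fix can be shown with a reduced sketch: copy the counter callback from the posted receive onto the saved entry before completing it. The structures and `sketch_` names below are hypothetical simplifications and do not reproduce the xnet sources.

```c
#include <assert.h>
#include <stddef.h>

struct sketch_xfer_entry;

/* Hypothetical counter-increment callback type. */
typedef void (*sketch_cntr_inc_fn)(struct sketch_xfer_entry *entry);

/* Hypothetical, reduced transfer entry: only the fields the sketch needs. */
struct sketch_xfer_entry {
	sketch_cntr_inc_fn cntr_inc; /* NULL on a saved entry before matching */
	int rx_count;                /* stands in for the real rx counter */
};

/* Example callback that bumps the stand-in rx counter. */
static void sketch_rx_cntr_inc(struct sketch_xfer_entry *entry)
{
	entry->rx_count++;
}

/* Complete a saved receive that has just matched a posted receive.
 * The saved entry inherits the posted receive's callback first, so the
 * completion path never calls through a NULL function pointer. */
static void sketch_recv_saved(struct sketch_xfer_entry *saved,
			      const struct sketch_xfer_entry *posted)
{
	saved->cntr_inc = posted->cntr_inc;

	assert(saved->cntr_inc != NULL);
	saved->cntr_inc(saved);
}
```

Without the assignment, the assert in this sketch would trip for a saved entry whose callback was never set, which mirrors the abort shown in the backtrace.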
Extend rxm and coll providers to work with collective offload providers: