Extend rxm and coll providers to be used with offload provider #2

Merged
merged 27 commits into from
Dec 13, 2022

grom72
Collaborator

@grom72 grom72 commented Nov 28, 2022

Extend rxm and coll providers to work with collective offload providers:

  • collective offload provider is given via FI_OFFLOAD_COLL_PROVIDER environment variable
  • multinode tests do not fail if a provider does not explicitly support the requested collective operation
  • collective offload provider's capabilities override the util coll provider's capabilities

@grom72 grom72 force-pushed the off_coll_util branch 3 times, most recently from 7ee3eba to 748c3a3 Compare November 29, 2022 09:28
@grom72 grom72 changed the title prov/coll - prepare cq to be used in other providers Extend rxm and coll providers to be used with offload provider Nov 29, 2022
Member

@ldorau ldorau left a comment

Reviewed all commit messages.
Reviewable status: 0 of 9 files reviewed, all discussions resolved

@grom72 grom72 force-pushed the off_coll_util branch 6 times, most recently from 47e1560 to 3d57fec Compare December 1, 2022 17:03
shefty and others added 3 commits December 3, 2022 13:04
The FI_PEER flag is sufficient for allocating peer objects.

Signed-off-by: Sean Hefty <[email protected]>
This is carry over from the tcp provider but unused.

Signed-off-by: Sean Hefty <[email protected]>
Bounce buffer copy overhead is high for ZE device memory. The rendezvous
protocol takes advantage of GPU RDMA and performs better even for small
messages.

Signed-off-by: Jianxin Xiong <[email protected]>

@haichangsi haichangsi left a comment

Reviewed all commit messages.
Reviewable status: 0 of 14 files reviewed, all discussions resolved (waiting on @ldorau)

@grom72 grom72 force-pushed the off_coll_util branch 2 times, most recently from 8087de6 to 4fb844e Compare December 5, 2022 12:56
Member

@ldorau ldorau left a comment

Reviewed 4 of 13 files at r4, 1 of 1 files at r5, 8 of 9 files at r6, 1 of 1 files at r7, all commit messages.
Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @grom72)


prov/rxm/src/rxm_ep.c line 393 at r7 (raw file):

	rxm_ep = container_of(ep, struct rxm_ep, util_ep.ep_fid);

	//FI_PEER flag is used to force util_coll context 

Redundant space at the end

Code quote:


prov/rxm/src/rxm_ep.c line 393 at r7 (raw file):

	rxm_ep = container_of(ep, struct rxm_ep, util_ep.ep_fid);

	//FI_PEER flag is used to force util_coll context 

Use /* */ C-style comment, please

Code quote:

	//FI_PEER flag is used to force util_coll context
	//where fi_join() is called from offload provider

prov/rxm/src/rxm_ep.c line 402 at r7 (raw file):

		if (ret)
			goto err_util_coll;
		// It is collective offload provider responsibility to store util_coll provider mc

.

Code quote:

// It is collective offload provider responsibility to store util_coll provider mc

prov/rxm/src/rxm_msg.c line 818 at r7 (raw file):

				   rxm_ep_sar_calc_segs_cnt(rxm_ep, data_len));
	} else {
rndv_send:

An empty line should be used before a label, not after.

Code quote:

	} else {
rndv_send:

		ret = rxm_alloc_rndv_buf(rxm_ep, rxm_conn, context,

src/fabric.c line 841 at r7 (raw file):

	fi_param_define(NULL, "offload_coll_provider", FI_PARAM_STRING,
			"The name of colective offload provider (default: empty - no provider)");

of a collective offload provider

Code quote:

of colective offload provider

Member

@ldorau ldorau left a comment

Reviewed all commit messages.
Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @grom72)

nikhilnanal and others added 5 commits December 5, 2022 15:51
add call to create a logdir for daos on jfcst-daos during build stage
added a parallel summary stage.

Signed-off-by: Nikhil Nanal <[email protected]>
modifying run command to generate logs for daos.

Signed-off-by: Nikhil Nanal <[email protected]>
Added class to summarize daos logs. Modified the check_name,
check_pass, check_fail and check_line methods.

Signed-off-by: Nikhil Nanal <[email protected]>
Add configure~ to .gitignore.

Signed-off-by: Lukasz Dorau <[email protected]>
vidsouza and others added 3 commits December 7, 2022 07:21
Prior to this patch, the provider would ignore the domain
name in hints->domain_attr->name even if this was set.
This patch fixed this issue.

Signed-off-by: Vishwas Dsouza <[email protected]>
This test exercises the -d <domain_name> argument of the fabtests.
Each EFA domain name is tested with the corresponding fabtest.

Signed-off-by: Vishwas Dsouza <[email protected]>
This patch added a call to FI_WARN to print endpoint's address
and libfabric version after endpoint is created.

Signed-off-by: Wei Zhang <[email protected]>
aingerson and others added 15 commits December 7, 2022 18:12
Define an explicit union for message headers and replace
embedded unions defined in structures.  This will allow
passing the union to match calls.

Signed-off-by: Sean Hefty <[email protected]>
Replace the active_rx parameter with a pointer to the headers,
which is all that's needed.  This separates match searching
from the active_rx.

Carry through this change up the call stack.

Signed-off-by: Sean Hefty <[email protected]>
Fix assertion when destroying buffer pools that not all xfer_entry's
have been returned.  Move freeing of xfer_entry to immediately after
writing a CQ entry.

Signed-off-by: Sean Hefty <[email protected]>
fabtests release packages are co-located with libfabric since 1.7.0.

Signed-off-by: Jianxin Xiong <[email protected]>
If git cloning upstream/main provides newer python files than the
Jenkinsfile expects, failures occur when the new python expects the old
Jenkinsfile to call something that it does not. Check out the python files
from the same branch instead to avoid this mismatch.

Signed-off-by: Zach Dworkin <[email protected]>
The macro is for function parameters only.

Signed-off-by: Jianxin Xiong <[email protected]>
coll_cq implementation can be reused by other collective providers.

Signed-off-by: Tomasz Gromadzki <[email protected]>
…rity

Collective offload capabilities are reported if an offload provider is available;
otherwise util collective provider capabilities are reported.

Signed-off-by: Tomasz Gromadzki <[email protected]>
…initialization

It is the rxm provider's responsibility to initialize the collective offload provider's fabric.
Otherwise collective offload functionality will not be available.

Signed-off-by: Tomasz Gromadzki <[email protected]>
…IDER

FI_OFFLOAD_PROVIDER environment variable shall be set to the offload provider name
to instruct libfabric to set up and use a particular provider.

Signed-off-by: Tomasz Gromadzki <[email protected]>
The peer provider must create a peer_eq for the offload provider so that the offload
provider can report events to the peer provider.

Signed-off-by: Tomasz Gromadzki <[email protected]>
Offload provider may execute collective operations via util_coll provider.
It must call fi_join() operation to get struct mc required for collective operations.
It can only call fi_join() on its peer provider (e.g. rxm). FI_PEER flag is used
to inform peer provider to call fi_join() operation for util_coll_ep.

Signed-off-by: Tomasz Gromadzki <[email protected]>
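
The routing described in the commit message above can be sketched roughly as follows. This is only an illustration, not the rxm code: `SKETCH_FI_PEER`, `sketch_join_target` and `sketch_pick_join_target()` are hypothetical names, the flag bit is a placeholder rather than libfabric's real `FI_PEER` value, and the non-peer branch (application joins go to the offload provider when one is present) is an assumption.

```c
#include <stdint.h>

/* Placeholder bit for illustration only; not libfabric's FI_PEER value. */
#define SKETCH_FI_PEER (1ULL << 0)

/* Hypothetical names for the two endpoints a join could be routed to. */
enum sketch_join_target {
	SKETCH_JOIN_UTIL_COLL_EP,
	SKETCH_JOIN_OFFLOAD_COLL_EP,
};

static enum sketch_join_target
sketch_pick_join_target(int offload_available, uint64_t flags)
{
	/* A join issued by the offload provider carries the peer flag and
	 * is routed to the util_coll endpoint. */
	if (flags & SKETCH_FI_PEER)
		return SKETCH_JOIN_UTIL_COLL_EP;

	/* Assumption: joins from the application go to the offload
	 * provider when one is configured. */
	if (offload_available)
		return SKETCH_JOIN_OFFLOAD_COLL_EP;

	return SKETCH_JOIN_UTIL_COLL_EP;
}
```

Checking the peer flag first keeps the offload provider from being handed its own join request again.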
offload_coll_mask value is calculated based on the actual offload capabilities
confirmed by fi_query_collective().

Signed-off-by: Tomasz Gromadzki <[email protected]>
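
One way to picture the mask calculation from the commit message above is a loop that probes each collective operation and sets a bit for every one the offload provider confirms. The sketch below is self-contained and does not call libfabric: the `sketch_*` names, the operation list and the stub are hypothetical, and the stub's convention (0 means supported, nonzero means not supported) is an assumption standing in for fi_query_collective().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical operation list; libfabric defines its own enum. */
enum sketch_coll_op {
	SKETCH_BARRIER,
	SKETCH_BROADCAST,
	SKETCH_ALLREDUCE,
	SKETCH_ALLGATHER,
	SKETCH_COLL_OP_COUNT,
};

/* Stand-in for a capability query: returns 0 when the op is supported. */
typedef int (*sketch_query_fn)(enum sketch_coll_op op);

static uint64_t sketch_offload_coll_mask(sketch_query_fn query)
{
	uint64_t mask = 0;
	int op;

	assert(query != NULL);

	/* Set one bit per operation that the query confirms. */
	for (op = 0; op < SKETCH_COLL_OP_COUNT; op++) {
		if (query((enum sketch_coll_op) op) == 0)
			mask |= 1ULL << op;
	}
	return mask;
}

/* Example stub: pretend only barrier and allreduce are offloaded. */
static int sketch_query_stub(enum sketch_coll_op op)
{
	return (op == SKETCH_BARRIER || op == SKETCH_ALLREDUCE) ? 0 : -1;
}
```

With the stub above, `sketch_offload_coll_mask(sketch_query_stub)` sets only the barrier and allreduce bits, so operations that are merely advertised but not confirmed never reach the mask.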
…_ONLY is set

fi_query_collective() reports only collective offload provider capabilities if
OFI_OFFLOAD_PROV_ONLY flag is set. Otherwise the sum of both providers' capabilities
is reported.

Signed-off-by: Tomasz Gromadzki <[email protected]>
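
The reporting rule in the commit message above can be expressed on capability bit masks. This is a minimal sketch under stated assumptions: `SKETCH_OFFLOAD_PROV_ONLY` is a placeholder bit standing in for `OFI_OFFLOAD_PROV_ONLY`, `sketch_reported_coll_caps()` is a hypothetical helper, and "sum" is read here as the bitwise union of the two masks.

```c
#include <stdint.h>

/* Placeholder bit for illustration; not the real OFI_OFFLOAD_PROV_ONLY value. */
#define SKETCH_OFFLOAD_PROV_ONLY (1ULL << 0)

static uint64_t sketch_reported_coll_caps(uint64_t offload_caps,
					  uint64_t util_caps, uint64_t flags)
{
	/* Caller asked for the offload provider's capabilities only. */
	if (flags & SKETCH_OFFLOAD_PROV_ONLY)
		return offload_caps;

	/* Otherwise report what either provider can do. */
	return offload_caps | util_caps;
}
```

For example, with `offload_caps = 0x5` and `util_caps = 0x3`, the helper returns `0x5` when the flag is set and `0x7` otherwise.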
@grom72 grom72 merged commit 8ad1691 into pmem:devel Dec 13, 2022
grom72 pushed a commit that referenced this pull request Mar 24, 2023
If a posted receive matches with a saved receive, we may need to
increment the rx counter.  Set the rx counter increment callback
to match that of the posted receive.  This fixes an assert in
xnet_cntr_inc() accessing a NULL cntr_inc function pointer.

Program received signal SIGABRT, Aborted.
0x0000155552d4d37f in raise () from /lib64/libc.so.6
#0  0x0000155552d4d37f in raise () from /lib64/libc.so.6
#1  0x0000155552d37db5 in abort () from /lib64/libc.so.6
#2  0x0000155552d37c89 in __assert_fail_base.cold.0 () from /lib64/libc.so.6
#3  0x0000155552d45a76 in __assert_fail () from /lib64/libc.so.6
#4  0x00001555522967f9 in xnet_cntr_inc (ep=0x6e4c70, xfer_entry=0x6f7a30) at prov/tcp/src/xnet_cq.c:347
#5  0x0000155552296836 in xnet_report_cntr_success (ep=0x6e4c70, cq=0x6ca930, xfer_entry=0x6f7a30) at prov/tcp/src/xnet_cq.c:354
#6  0x000015555229970d in xnet_complete_saved (saved_entry=0x6f7a30) at prov/tcp/src/xnet_progress.c:153
#7  0x0000155552299961 in xnet_recv_saved (saved_entry=0x6f7a30, rx_entry=0x6f7840) at prov/tcp/src/xnet_progress.c:188
#8  0x00001555522946f8 in xnet_srx_tag (srx=0x6dd1c0, recv_entry=0x6f7840) at prov/tcp/src/xnet_srx.c:445
#9  0x0000155552294bb1 in xnet_srx_trecv (ep_fid=0x6dd1c0, buf=0x6990c4, len=4, desc=0x0, src_addr=0, tag=21474836494, ignore=3458764513820540928, context=0x7ffffffeb180) at prov/tcp/src/xnet_srx.c:558
#10 0x000015555228f60e in fi_trecv (ep=0x6dd1c0, buf=0x6990c4, len=4, desc=0x0, src_addr=0, tag=21474836494, ignore=3458764513820540928, context=0x7ffffffeb180) at ./include/rdma/fi_tagged.h:91
#11 0x00001555522900a7 in xnet_rdm_trecv (ep_fid=0x6d9fe0, buf=0x6990c4, len=4, desc=0x0, src_addr=0, tag=21474836494, ignore=3458764513820540928, context=0x7ffffffeb180) at prov/tcp/src/xnet_rdm.c:212

Signed-off-by: Sean Hefty <[email protected]>