Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

prov/sharp scaffolding #8

Draft
wants to merge 8 commits into
base: devel
Choose a base branch
from
Draft

Conversation

grom72
Copy link
Collaborator

@grom72 grom72 commented Dec 13, 2022

Prov sharp scaffolding


This change is Reviewable

Copy link
Member

@ldorau ldorau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 4 of 24 files at r1, all commit messages.
Reviewable status: 4 of 24 files reviewed, 2 unresolved discussions (waiting on @grom72)


Makefile.am line 6 at r1 (raw file):

# Copyright (c) 2018 Amazon.com, Inc. or its affiliates. All rights reserved.
# (C) Copyright 2020 Hewlett Packard Enterprise Development LP
# Copyright (c) 2022 Intel Corporation. All right reserved.

There is already Intel Copyright 3 lines above.

Code quote:

# Copyright (c) 2022 Intel Corporation. All right reserved.

README.md line 219 at r1 (raw file):

The `off_sharp` provider is an utility provider that supports collective endpoints utilizing
SHARP protocol for barier and allreduce operations.

... for the barier and the allreduce ...

Code quote:

for barier and allreduce

Copy link
Member

@ldorau ldorau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 24 files at r1.
Reviewable status: 6 of 24 files reviewed, 3 unresolved discussions (waiting on @grom72)


include/ofi_sharp.h line 70 at r1 (raw file):

#endif

#endif /* _OFI_SHM_H_ */

_OFI_SHARP_H_

Code quote:

_OFI_SHM_H_

@grom72 grom72 force-pushed the prov-sharp-scaffolding branch from fa83a1d to 0bb30c8 Compare December 14, 2022 09:58
Copy link
Collaborator Author

@grom72 grom72 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 4 of 24 files reviewed, 3 unresolved discussions (waiting on @ldorau)


README.md line 219 at r1 (raw file):

Previously, ldorau (Lukasz Dorau) wrote…

... for the barier and the allreduce ...

Done.


include/ofi_sharp.h line 70 at r1 (raw file):

Previously, ldorau (Lukasz Dorau) wrote…

_OFI_SHARP_H_

Done.

@grom72 grom72 force-pushed the prov-sharp-scaffolding branch from 0bb30c8 to 6f2c738 Compare December 14, 2022 10:30
Copy link
Collaborator Author

@grom72 grom72 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 3 of 24 files reviewed, 3 unresolved discussions (waiting on @ldorau)


Makefile.am line 6 at r1 (raw file):

Previously, ldorau (Lukasz Dorau) wrote…

There is already Intel Copyright 3 lines above.

Done.

Copy link
Member

@ldorau ldorau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 3 files at r2, 1 of 2 files at r3, all commit messages.
Reviewable status: 6 of 24 files reviewed, 1 unresolved discussion (waiting on @grom72)


Makefile.am line 3 at r3 (raw file):

#
# Copyright (c) 2016 Cisco Systems, Inc.  All rights reserved.
# Copyright (c) 2017-2018, 2022 Intel Corporation, Inc. All right reserved.

I think it should be 2017-2022

Code quote:

2017-2018, 2022

Copy link
Member

@ldorau ldorau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 4 of 24 files at r1.
Reviewable status: 10 of 24 files reviewed, 8 unresolved discussions (waiting on @grom72)


man/fi_sharp.7.md line 15 at r3 (raw file):

The SHARP provider is a collectives offload provider that can be used on Linux
systems supporting SHARP protocol.

the SHARP protocol

Code quote:

SHARP protocol

man/fi_sharp.7.md line 19 at r3 (raw file):

# SUPPORTED FEATURES

This release contains an initial implementation of the SHM provider that

SHARP

Code quote:

SHM

man/fi_sharp.7.md line 23 at r3 (raw file):

*Endpoint types*
: The provider supports only endpoint type *FI_EP_COLLECTIVE*.

only one endpoint type: *FI_EP_COLLECTIVE*.
or
only the *FI_EP_COLLECTIVE* endpoint type.

Code quote:

only endpoint type *FI_EP_COLLECTIVE*.

man/fi_sharp.7.md line 26 at r3 (raw file):

*Endpoint capabilities*
: Endpoints cna support only fi_barrier and fi_allreduce operations.

can

Code quote:

cna

man/fi_sharp.7.md line 41 at r3 (raw file):

*MR registration mode*
  The provider implements FI_MR_VIRT_ADDR memory mode.

the FI_MR_VIRT_ADDR memory mode

Code quote:

FI_MR_VIRT_ADDR memory mode

man/fi_sharp.7.md line 50 at r3 (raw file):

The SHARP provider has hard-coded maximums for supported queue sizes and data
transfers.  These values are reflected in the related fabric attribute
structures

structures. - dot at the end

Code quote:

structures

man/fi_sharp.7.md line 56 at r3 (raw file):

# RUNTIME PARAMETERS

The *SHARP* provider checks for the following environment variables:

Please be consistent: The *SHARP* provider or The SHARP provider - the same in all places

Code quote:

The *SHARP* provider

Copy link
Member

@ldorau ldorau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 24 files at r1.
Reviewable status: 11 of 24 files reviewed, 8 unresolved discussions (waiting on @grom72)

@grom72 grom72 force-pushed the prov-sharp-scaffolding branch 2 times, most recently from 67f4224 to be4aeab Compare December 14, 2022 12:30
Copy link
Collaborator Author

@grom72 grom72 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 9 of 24 files reviewed, 8 unresolved discussions (waiting on @ldorau)


man/fi_sharp.7.md line 15 at r3 (raw file):

Previously, ldorau (Lukasz Dorau) wrote…

the SHARP protocol

Done.

Code quote:

 SHARP protocol.

man/fi_sharp.7.md line 19 at r3 (raw file):

Previously, ldorau (Lukasz Dorau) wrote…

SHARP

Done.


man/fi_sharp.7.md line 23 at r3 (raw file):

Previously, ldorau (Lukasz Dorau) wrote…

only one endpoint type: *FI_EP_COLLECTIVE*.
or
only the *FI_EP_COLLECTIVE* endpoint type.

Done.


man/fi_sharp.7.md line 26 at r3 (raw file):

Previously, ldorau (Lukasz Dorau) wrote…

can

Done.


man/fi_sharp.7.md line 41 at r3 (raw file):

Previously, ldorau (Lukasz Dorau) wrote…

the FI_MR_VIRT_ADDR memory mode

Done.


man/fi_sharp.7.md line 50 at r3 (raw file):

Previously, ldorau (Lukasz Dorau) wrote…

structures. - dot at the end

Done.


man/fi_sharp.7.md line 56 at r3 (raw file):

Previously, ldorau (Lukasz Dorau) wrote…

Please be consistent: The *SHARP* provider or The SHARP provider - the same in all places

Done.

Copy link
Member

@ldorau ldorau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 3 of 24 files at r1, 1 of 2 files at r3, 2 of 2 files at r4, all commit messages.
Reviewable status: 15 of 24 files reviewed, 1 unresolved discussion (waiting on @grom72)

@grom72 grom72 force-pushed the prov-sharp-scaffolding branch from be4aeab to c9ba8f8 Compare December 14, 2022 15:00
@grom72 grom72 force-pushed the prov-sharp-scaffolding branch 2 times, most recently from d168f84 to ac48ad7 Compare December 15, 2022 10:58
Copy link
Member

@ldorau ldorau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 14 of 34 files reviewed, 2 unresolved discussions (waiting on @grom72)


prov/sharp/Makefile.include line 11 at r8 (raw file):

	prov/sharp/src/sharp_eq.c		\
	prov/sharp/src/sharp_ep.c		\
	prov/sharp/src/sharp_cq.c		\

Fix alignment, please (screenshot from vim):
image.png

Code quote:

_sharp_files = \
	include/ofi_sharp.h				\
	prov/sharp/src/sharp.h			\
	prov/sharp/src/sharp_attr.c		\
	prov/sharp/src/sharp_init.c		\
	prov/sharp/src/sharp_fabric.c	\
	prov/sharp/src/sharp_domain.c	\
	prov/sharp/src/sharp_eq.c		\
	prov/sharp/src/sharp_ep.c		\
	prov/sharp/src/sharp_cq.c		\
	prov/sharp/src/sharp_coll.c		\
	prov/sharp/src/sharp_progress.c

@grom72 grom72 force-pushed the prov-sharp-scaffolding branch from ac48ad7 to 2368690 Compare December 16, 2022 09:26
Copy link
Collaborator Author

@grom72 grom72 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 12 of 34 files reviewed, 2 unresolved discussions (waiting on @ldorau)


Makefile.am line 3 at r3 (raw file):

Previously, ldorau (Lukasz Dorau) wrote…

I think it should be 2017-2022

Done.


prov/sharp/Makefile.include line 11 at r8 (raw file):

Previously, ldorau (Lukasz Dorau) wrote…

Fix alignment, please (screenshot from vim):
image.png

Done.

Copy link
Member

@ldorau ldorau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 1 files at r8, 2 of 3 files at r9.
Reviewable status: 14 of 34 files reviewed, all discussions resolved

Copy link
Member

@ldorau ldorau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 13 of 35 files reviewed, 8 unresolved discussions (waiting on @grom72)


prov/sharp/src/sharp_cq.c line 87 at r10 (raw file):

	if (!attr || !(attr->flags & FI_PEER)) {
		FI_WARN(provider, FI_LOG_CORE, "FI_PEER flag required\n");
                return -EINVAL;

FI_EINVAL

Code quote:

EINVAL

prov/sharp/src/sharp_cq.c line 92 at r10 (raw file):

	if (!peer_context || peer_context->size < sizeof(*peer_context)) {
		FI_WARN(provider, FI_LOG_CORE, "invalid peer CQ context\n");
                return -EINVAL;

.


prov/sharp/src/sharp_domain.c line 169 at r10 (raw file):

		FI_WARN(&sharp_prov, FI_LOG_CORE,
			"FI_PEER flag required\n");
		return -EINVAL;

.


prov/sharp/src/sharp_domain.c line 175 at r10 (raw file):

		FI_WARN(&sharp_prov, FI_LOG_CORE,
			"Invalid peer domain context\n");
		return -EINVAL;

.


prov/sharp/src/sharp_ep.c line 188 at r10 (raw file):

		FI_WARN(&sharp_prov, FI_LOG_CORE,
			"FI_PEER_TRANSFER mode required\n");
		return -EINVAL;

.


prov/sharp/src/sharp_ep.c line 194 at r10 (raw file):

		FI_WARN(&sharp_prov, FI_LOG_CORE,
			"Invalid peer transfer context\n");
		return -EINVAL;

.


prov/sharp/src/sharp_eq.c line 76 at r10 (raw file):

	if (!attr || !(attr->flags & FI_PEER)) {
		FI_WARN(&sharp_prov, FI_LOG_CORE, "FI_PEER flag required\n");
                return -EINVAL;

.


prov/sharp/src/sharp_eq.c line 81 at r10 (raw file):

	if (!peer_context || peer_context->size < sizeof(*peer_context)) {
		FI_WARN(&sharp_prov, FI_LOG_CORE, "invalid peer EQ context\n");
                return -EINVAL;

.

@grom72 grom72 changed the title Prov sharp scaffolding prov/sharp scaffolding Dec 22, 2022
@grom72 grom72 force-pushed the prov-sharp-scaffolding branch 3 times, most recently from a18df51 to eca015c Compare December 27, 2022 18:31
@grom72 grom72 force-pushed the prov-sharp-scaffolding branch from 6756031 to 9e6a2ca Compare December 28, 2022 09:35
@grom72 grom72 force-pushed the prov-sharp-scaffolding branch from 9e6a2ca to b86b6cb Compare December 28, 2022 12:41
@grom72 grom72 force-pushed the prov-sharp-scaffolding branch 3 times, most recently from 5af5f80 to 429572a Compare December 29, 2022 10:43
Copy link
Collaborator Author

@grom72 grom72 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 3 of 24 files at r1, 1 of 3 files at r2, 2 of 2 files at r4, 1 of 3 files at r9, 1 of 5 files at r12.
Reviewable status: 12 of 37 files reviewed, 8 unresolved discussions (waiting on @ldorau)


prov/sharp/src/sharp_cq.c line 87 at r10 (raw file):

Previously, ldorau (Lukasz Dorau) wrote…

FI_EINVAL

Done.


prov/sharp/src/sharp_cq.c line 92 at r10 (raw file):

Previously, ldorau (Lukasz Dorau) wrote…

.

Done.


prov/sharp/src/sharp_domain.c line 169 at r10 (raw file):

Previously, ldorau (Lukasz Dorau) wrote…

.

Done.


prov/sharp/src/sharp_domain.c line 175 at r10 (raw file):

Previously, ldorau (Lukasz Dorau) wrote…

.

Done.


prov/sharp/src/sharp_ep.c line 188 at r10 (raw file):

Previously, ldorau (Lukasz Dorau) wrote…

.

Done.


prov/sharp/src/sharp_ep.c line 194 at r10 (raw file):

Previously, ldorau (Lukasz Dorau) wrote…

.

Done.


prov/sharp/src/sharp_eq.c line 76 at r10 (raw file):

Previously, ldorau (Lukasz Dorau) wrote…

.

Done.


prov/sharp/src/sharp_eq.c line 81 at r10 (raw file):

Previously, ldorau (Lukasz Dorau) wrote…

.

Done.

@grom72 grom72 force-pushed the prov-sharp-scaffolding branch 2 times, most recently from 0df5f87 to 5d885a3 Compare December 29, 2022 13:08
Copy link
Collaborator Author

@grom72 grom72 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 14 files at r11.
Reviewable status: 11 of 37 files reviewed, 8 unresolved discussions (waiting on @ldorau)

Copy link
Member

@ldorau ldorau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 3 of 14 files at r11.
Reviewable status: 14 of 37 files reviewed, all discussions resolved (waiting on @grom72)

Copy link
Collaborator Author

@grom72 grom72 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 11 files at r7, 1 of 14 files at r11, 2 of 7 files at r13, 2 of 7 files at r15, 3 of 4 files at r16.
Reviewable status: 22 of 37 files reviewed, all discussions resolved (waiting on @ldorau)

TODO.md files defines next development steps. The list is ordered by priority.

Signed-off-by: Tomasz Gromadzki <[email protected]>
…r is available

* FOR TEST PURPOSE ONLY - TEMPORARY SOLUTION *
To let all collective tests pass let's force rxm to provide only offload collective
domain capabilities. THIS IS ONLY FOR TEST PPURPOSE. In the future, tests for
offload provider shall be executed with OFI_OFFLOAD_PROV_ONLY flag set in
fi_query_collective() call.

Signed-off-by: Tomasz Gromadzki <[email protected]>
Both collective operation implemented ba colling peer collective operation and
transparently passing completion back to peer CQ

Signed-off-by: Tomasz Gromadzki <[email protected]>
Copy link
Collaborator Author

@grom72 grom72 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 3 of 24 files at r1, 1 of 3 files at r2, 1 of 2 files at r3, 1 of 8 files at r6, 6 of 11 files at r7, 4 of 14 files at r11, 1 of 5 files at r12, 1 of 7 files at r15, 5 of 11 files at r17, 6 of 6 files at r18, all commit messages.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @grom72)

grom72 and others added 4 commits January 13, 2023 16:14
fi_join_collective implementation with usage of coll_op's state machine

Signed-off-by: Tomasz Gromadzki <[email protected]>
Add mocks for sharp_coll_init and sharp_coll_finalize.

Signed-off-by: Lukasz Dorau <[email protected]>
In order to mock SHARP functions run:
$ ./configure --enable-sharp=mocked
or
$ ./configure --enable-sharp=auto

Running:
$ ./configure --enable-sharp=yes
make use of SHARP libraries instead of mocks.

Signed-off-by: Lukasz Dorau <[email protected]>
@grom72 grom72 force-pushed the prov-sharp-scaffolding branch from 27fd202 to 7770931 Compare January 13, 2023 15:16
Copy link
Collaborator Author

@grom72 grom72 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 6 files at r19, 5 of 5 files at r20, all commit messages.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @grom72)

grom72 pushed a commit that referenced this pull request Mar 24, 2023
If a posted receive matches with a saved receive, we may need to
increment the rx counter.  Set the rx counter increment callback
to match that of the posted receive.  This fixes an assert in
xnet_cntr_inc() accessing a NULL cntr_inc function pointer.

Program received signal SIGABRT, Aborted.
0x0000155552d4d37f in raise () from /lib64/libc.so.6
#0  0x0000155552d4d37f in raise () from /lib64/libc.so.6
#1  0x0000155552d37db5 in abort () from /lib64/libc.so.6
#2  0x0000155552d37c89 in __assert_fail_base.cold.0 () from /lib64/libc.so.6
#3  0x0000155552d45a76 in __assert_fail () from /lib64/libc.so.6
#4  0x00001555522967f9 in xnet_cntr_inc (ep=0x6e4c70, xfer_entry=0x6f7a30) at prov/tcp/src/xnet_cq.c:347
#5  0x0000155552296836 in xnet_report_cntr_success (ep=0x6e4c70, cq=0x6ca930, xfer_entry=0x6f7a30) at prov/tcp/src/xnet_cq.c:354
#6  0x000015555229970d in xnet_complete_saved (saved_entry=0x6f7a30) at prov/tcp/src/xnet_progress.c:153
#7  0x0000155552299961 in xnet_recv_saved (saved_entry=0x6f7a30, rx_entry=0x6f7840) at prov/tcp/src/xnet_progress.c:188
#8  0x00001555522946f8 in xnet_srx_tag (srx=0x6dd1c0, recv_entry=0x6f7840) at prov/tcp/src/xnet_srx.c:445
#9  0x0000155552294bb1 in xnet_srx_trecv (ep_fid=0x6dd1c0, buf=0x6990c4, len=4, desc=0x0, src_addr=0, tag=21474836494, ignore=3458764513820540928, context=0x7ffffffeb180) at prov/tcp/src/xnet_srx.c:558
ofiwg#10 0x000015555228f60e in fi_trecv (ep=0x6dd1c0, buf=0x6990c4, len=4, desc=0x0, src_addr=0, tag=21474836494, ignore=3458764513820540928, context=0x7ffffffeb180) at ./include/rdma/fi_tagged.h:91
ofiwg#11 0x00001555522900a7 in xnet_rdm_trecv (ep_fid=0x6d9fe0, buf=0x6990c4, len=4, desc=0x0, src_addr=0, tag=21474836494, ignore=3458764513820540928, context=0x7ffffffeb180) at prov/tcp/src/xnet_rdm.c:212

Signed-off-by: Sean Hefty <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants