-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
prov/sharp scaffolding #8
base: devel
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 4 of 24 files at r1, all commit messages.
Reviewable status: 4 of 24 files reviewed, 2 unresolved discussions (waiting on @grom72)
Makefile.am
line 6 at r1 (raw file):
# Copyright (c) 2018 Amazon.com, Inc. or its affiliates. All rights reserved. # (C) Copyright 2020 Hewlett Packard Enterprise Development LP # Copyright (c) 2022 Intel Corporation. All right reserved.
There is already Intel Copyright 3 lines above.
Code quote:
# Copyright (c) 2022 Intel Corporation. All right reserved.
README.md
line 219 at r1 (raw file):
The `off_sharp` provider is an utility provider that supports collective endpoints utilizing SHARP protocol for barier and allreduce operations.
... for the barier and the allreduce ...
Code quote:
for barier and allreduce
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 2 of 24 files at r1.
Reviewable status: 6 of 24 files reviewed, 3 unresolved discussions (waiting on @grom72)
include/ofi_sharp.h
line 70 at r1 (raw file):
#endif #endif /* _OFI_SHM_H_ */
_OFI_SHARP_H_
Code quote:
_OFI_SHM_H_
fa83a1d
to
0bb30c8
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 4 of 24 files reviewed, 3 unresolved discussions (waiting on @ldorau)
README.md
line 219 at r1 (raw file):
Previously, ldorau (Lukasz Dorau) wrote…
... for the barier and the allreduce ...
Done.
include/ofi_sharp.h
line 70 at r1 (raw file):
Previously, ldorau (Lukasz Dorau) wrote…
_OFI_SHARP_H_
Done.
0bb30c8
to
6f2c738
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 3 of 24 files reviewed, 3 unresolved discussions (waiting on @ldorau)
Makefile.am
line 6 at r1 (raw file):
Previously, ldorau (Lukasz Dorau) wrote…
There is already Intel Copyright 3 lines above.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 2 of 3 files at r2, 1 of 2 files at r3, all commit messages.
Reviewable status: 6 of 24 files reviewed, 1 unresolved discussion (waiting on @grom72)
Makefile.am
line 3 at r3 (raw file):
# # Copyright (c) 2016 Cisco Systems, Inc. All rights reserved. # Copyright (c) 2017-2018, 2022 Intel Corporation, Inc. All right reserved.
I think it should be 2017-2022
Code quote:
2017-2018, 2022
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 4 of 24 files at r1.
Reviewable status: 10 of 24 files reviewed, 8 unresolved discussions (waiting on @grom72)
man/fi_sharp.7.md
line 15 at r3 (raw file):
The SHARP provider is a collectives offload provider that can be used on Linux systems supporting SHARP protocol.
the SHARP protocol
Code quote:
SHARP protocol
man/fi_sharp.7.md
line 19 at r3 (raw file):
# SUPPORTED FEATURES This release contains an initial implementation of the SHM provider that
SHARP
Code quote:
SHM
man/fi_sharp.7.md
line 23 at r3 (raw file):
*Endpoint types* : The provider supports only endpoint type *FI_EP_COLLECTIVE*.
only one endpoint type: *FI_EP_COLLECTIVE*.
or
only the *FI_EP_COLLECTIVE* endpoint type.
Code quote:
only endpoint type *FI_EP_COLLECTIVE*.
man/fi_sharp.7.md
line 26 at r3 (raw file):
*Endpoint capabilities* : Endpoints cna support only fi_barrier and fi_allreduce operations.
can
Code quote:
cna
man/fi_sharp.7.md
line 41 at r3 (raw file):
*MR registration mode* The provider implements FI_MR_VIRT_ADDR memory mode.
the FI_MR_VIRT_ADDR memory mode
Code quote:
FI_MR_VIRT_ADDR memory mode
man/fi_sharp.7.md
line 50 at r3 (raw file):
The SHARP provider has hard-coded maximums for supported queue sizes and data transfers. These values are reflected in the related fabric attribute structures
structures.
- dot at the end
Code quote:
structures
man/fi_sharp.7.md
line 56 at r3 (raw file):
# RUNTIME PARAMETERS The *SHARP* provider checks for the following environment variables:
Please be consistent: The *SHARP* provider
or The SHARP provider
- the same in all places
Code quote:
The *SHARP* provider
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 24 files at r1.
Reviewable status: 11 of 24 files reviewed, 8 unresolved discussions (waiting on @grom72)
67f4224
to
be4aeab
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 9 of 24 files reviewed, 8 unresolved discussions (waiting on @ldorau)
man/fi_sharp.7.md
line 15 at r3 (raw file):
Previously, ldorau (Lukasz Dorau) wrote…
the SHARP protocol
Done.
Code quote:
SHARP protocol.
man/fi_sharp.7.md
line 19 at r3 (raw file):
Previously, ldorau (Lukasz Dorau) wrote…
SHARP
Done.
man/fi_sharp.7.md
line 23 at r3 (raw file):
Previously, ldorau (Lukasz Dorau) wrote…
only one endpoint type: *FI_EP_COLLECTIVE*.
or
only the *FI_EP_COLLECTIVE* endpoint type.
Done.
man/fi_sharp.7.md
line 26 at r3 (raw file):
Previously, ldorau (Lukasz Dorau) wrote…
can
Done.
man/fi_sharp.7.md
line 41 at r3 (raw file):
Previously, ldorau (Lukasz Dorau) wrote…
the FI_MR_VIRT_ADDR memory mode
Done.
man/fi_sharp.7.md
line 50 at r3 (raw file):
Previously, ldorau (Lukasz Dorau) wrote…
structures.
- dot at the end
Done.
man/fi_sharp.7.md
line 56 at r3 (raw file):
Previously, ldorau (Lukasz Dorau) wrote…
Please be consistent:
The *SHARP* provider
orThe SHARP provider
- the same in all places
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 3 of 24 files at r1, 1 of 2 files at r3, 2 of 2 files at r4, all commit messages.
Reviewable status: 15 of 24 files reviewed, 1 unresolved discussion (waiting on @grom72)
be4aeab
to
c9ba8f8
Compare
d168f84
to
ac48ad7
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 14 of 34 files reviewed, 2 unresolved discussions (waiting on @grom72)
prov/sharp/Makefile.include
line 11 at r8 (raw file):
prov/sharp/src/sharp_eq.c \ prov/sharp/src/sharp_ep.c \ prov/sharp/src/sharp_cq.c \
Fix alignment, please (screenshot from vim):
Code quote:
_sharp_files = \
include/ofi_sharp.h \
prov/sharp/src/sharp.h \
prov/sharp/src/sharp_attr.c \
prov/sharp/src/sharp_init.c \
prov/sharp/src/sharp_fabric.c \
prov/sharp/src/sharp_domain.c \
prov/sharp/src/sharp_eq.c \
prov/sharp/src/sharp_ep.c \
prov/sharp/src/sharp_cq.c \
prov/sharp/src/sharp_coll.c \
prov/sharp/src/sharp_progress.c
ac48ad7
to
2368690
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 12 of 34 files reviewed, 2 unresolved discussions (waiting on @ldorau)
Makefile.am
line 3 at r3 (raw file):
Previously, ldorau (Lukasz Dorau) wrote…
I think it should be
2017-2022
Done.
prov/sharp/Makefile.include
line 11 at r8 (raw file):
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 1 files at r8, 2 of 3 files at r9.
Reviewable status: 14 of 34 files reviewed, all discussions resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 13 of 35 files reviewed, 8 unresolved discussions (waiting on @grom72)
prov/sharp/src/sharp_cq.c
line 87 at r10 (raw file):
if (!attr || !(attr->flags & FI_PEER)) { FI_WARN(provider, FI_LOG_CORE, "FI_PEER flag required\n"); return -EINVAL;
FI_EINVAL
Code quote:
EINVAL
prov/sharp/src/sharp_cq.c
line 92 at r10 (raw file):
if (!peer_context || peer_context->size < sizeof(*peer_context)) { FI_WARN(provider, FI_LOG_CORE, "invalid peer CQ context\n"); return -EINVAL;
.
prov/sharp/src/sharp_domain.c
line 169 at r10 (raw file):
FI_WARN(&sharp_prov, FI_LOG_CORE, "FI_PEER flag required\n"); return -EINVAL;
.
prov/sharp/src/sharp_domain.c
line 175 at r10 (raw file):
FI_WARN(&sharp_prov, FI_LOG_CORE, "Invalid peer domain context\n"); return -EINVAL;
.
prov/sharp/src/sharp_ep.c
line 188 at r10 (raw file):
FI_WARN(&sharp_prov, FI_LOG_CORE, "FI_PEER_TRANSFER mode required\n"); return -EINVAL;
.
prov/sharp/src/sharp_ep.c
line 194 at r10 (raw file):
FI_WARN(&sharp_prov, FI_LOG_CORE, "Invalid peer transfer context\n"); return -EINVAL;
.
prov/sharp/src/sharp_eq.c
line 76 at r10 (raw file):
if (!attr || !(attr->flags & FI_PEER)) { FI_WARN(&sharp_prov, FI_LOG_CORE, "FI_PEER flag required\n"); return -EINVAL;
.
prov/sharp/src/sharp_eq.c
line 81 at r10 (raw file):
if (!peer_context || peer_context->size < sizeof(*peer_context)) { FI_WARN(&sharp_prov, FI_LOG_CORE, "invalid peer EQ context\n"); return -EINVAL;
.
a18df51
to
eca015c
Compare
6756031
to
9e6a2ca
Compare
9e6a2ca
to
b86b6cb
Compare
5af5f80
to
429572a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 3 of 24 files at r1, 1 of 3 files at r2, 2 of 2 files at r4, 1 of 3 files at r9, 1 of 5 files at r12.
Reviewable status: 12 of 37 files reviewed, 8 unresolved discussions (waiting on @ldorau)
prov/sharp/src/sharp_cq.c
line 87 at r10 (raw file):
Previously, ldorau (Lukasz Dorau) wrote…
FI_EINVAL
Done.
prov/sharp/src/sharp_cq.c
line 92 at r10 (raw file):
Previously, ldorau (Lukasz Dorau) wrote…
.
Done.
prov/sharp/src/sharp_domain.c
line 169 at r10 (raw file):
Previously, ldorau (Lukasz Dorau) wrote…
.
Done.
prov/sharp/src/sharp_domain.c
line 175 at r10 (raw file):
Previously, ldorau (Lukasz Dorau) wrote…
.
Done.
prov/sharp/src/sharp_ep.c
line 188 at r10 (raw file):
Previously, ldorau (Lukasz Dorau) wrote…
.
Done.
prov/sharp/src/sharp_ep.c
line 194 at r10 (raw file):
Previously, ldorau (Lukasz Dorau) wrote…
.
Done.
prov/sharp/src/sharp_eq.c
line 76 at r10 (raw file):
Previously, ldorau (Lukasz Dorau) wrote…
.
Done.
prov/sharp/src/sharp_eq.c
line 81 at r10 (raw file):
Previously, ldorau (Lukasz Dorau) wrote…
.
Done.
0df5f87
to
5d885a3
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 14 files at r11.
Reviewable status: 11 of 37 files reviewed, 8 unresolved discussions (waiting on @ldorau)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 3 of 14 files at r11.
Reviewable status: 14 of 37 files reviewed, all discussions resolved (waiting on @grom72)
5d885a3
to
f9fd745
Compare
f9fd745
to
cd0a9b4
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 11 files at r7, 1 of 14 files at r11, 2 of 7 files at r13, 2 of 7 files at r15, 3 of 4 files at r16.
Reviewable status: 22 of 37 files reviewed, all discussions resolved (waiting on @ldorau)
TODO.md files defines next development steps. The list is ordered by priority. Signed-off-by: Tomasz Gromadzki <[email protected]>
…r is available * FOR TEST PURPOSE ONLY - TEMPORARY SOLUTION * To let all collective tests pass let's force rxm to provide only offload collective domain capabilities. THIS IS ONLY FOR TEST PPURPOSE. In the future, tests for offload provider shall be executed with OFI_OFFLOAD_PROV_ONLY flag set in fi_query_collective() call. Signed-off-by: Tomasz Gromadzki <[email protected]>
Signed-off-by: Tomasz Gromadzki <[email protected]>
Both collective operation implemented ba colling peer collective operation and transparently passing completion back to peer CQ Signed-off-by: Tomasz Gromadzki <[email protected]>
cd0a9b4
to
27fd202
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 3 of 24 files at r1, 1 of 3 files at r2, 1 of 2 files at r3, 1 of 8 files at r6, 6 of 11 files at r7, 4 of 14 files at r11, 1 of 5 files at r12, 1 of 7 files at r15, 5 of 11 files at r17, 6 of 6 files at r18, all commit messages.
Reviewable status: complete! all files reviewed, all discussions resolved (waiting on @grom72)
fi_join_collective implementation with usage of coll_op's state machine Signed-off-by: Tomasz Gromadzki <[email protected]>
Signed-off-by: Tomasz Gromadzki <[email protected]>
Add mocks for sharp_coll_init and sharp_coll_finalize. Signed-off-by: Lukasz Dorau <[email protected]>
In order to mock SHARP functions run: $ ./configure --enable-sharp=mocked or $ ./configure --enable-sharp=auto Running: $ ./configure --enable-sharp=yes make use of SHARP libraries instead of mocks. Signed-off-by: Lukasz Dorau <[email protected]>
27fd202
to
7770931
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 6 files at r19, 5 of 5 files at r20, all commit messages.
Reviewable status: complete! all files reviewed, all discussions resolved (waiting on @grom72)
If a posted receive matches with a saved receive, we may need to increment the rx counter. Set the rx counter increment callback to match that of the posted receive. This fixes an assert in xnet_cntr_inc() accessing a NULL cntr_inc function pointer. Program received signal SIGABRT, Aborted. 0x0000155552d4d37f in raise () from /lib64/libc.so.6 #0 0x0000155552d4d37f in raise () from /lib64/libc.so.6 #1 0x0000155552d37db5 in abort () from /lib64/libc.so.6 #2 0x0000155552d37c89 in __assert_fail_base.cold.0 () from /lib64/libc.so.6 #3 0x0000155552d45a76 in __assert_fail () from /lib64/libc.so.6 #4 0x00001555522967f9 in xnet_cntr_inc (ep=0x6e4c70, xfer_entry=0x6f7a30) at prov/tcp/src/xnet_cq.c:347 #5 0x0000155552296836 in xnet_report_cntr_success (ep=0x6e4c70, cq=0x6ca930, xfer_entry=0x6f7a30) at prov/tcp/src/xnet_cq.c:354 #6 0x000015555229970d in xnet_complete_saved (saved_entry=0x6f7a30) at prov/tcp/src/xnet_progress.c:153 #7 0x0000155552299961 in xnet_recv_saved (saved_entry=0x6f7a30, rx_entry=0x6f7840) at prov/tcp/src/xnet_progress.c:188 #8 0x00001555522946f8 in xnet_srx_tag (srx=0x6dd1c0, recv_entry=0x6f7840) at prov/tcp/src/xnet_srx.c:445 #9 0x0000155552294bb1 in xnet_srx_trecv (ep_fid=0x6dd1c0, buf=0x6990c4, len=4, desc=0x0, src_addr=0, tag=21474836494, ignore=3458764513820540928, context=0x7ffffffeb180) at prov/tcp/src/xnet_srx.c:558 ofiwg#10 0x000015555228f60e in fi_trecv (ep=0x6dd1c0, buf=0x6990c4, len=4, desc=0x0, src_addr=0, tag=21474836494, ignore=3458764513820540928, context=0x7ffffffeb180) at ./include/rdma/fi_tagged.h:91 ofiwg#11 0x00001555522900a7 in xnet_rdm_trecv (ep_fid=0x6d9fe0, buf=0x6990c4, len=4, desc=0x0, src_addr=0, tag=21474836494, ignore=3458764513820540928, context=0x7ffffffeb180) at prov/tcp/src/xnet_rdm.c:212 Signed-off-by: Sean Hefty <[email protected]>
Prov sharp scaffolding
This change is