prov/shm: RMA write failed in fi_ubertest #49 #5659

Closed
zhngaj opened this issue Feb 21, 2020 · 5 comments

@zhngaj
Contributor

zhngaj commented Feb 21, 2020

Failure

RMA write failed in fi_ubertest (test 49)

[shm, latency, write--, FI_EP_RDM, FI_AV_UNSPEC, eq_wait_none, cq_wait_none, cntr_wait_none, comp_queue -- tx (bind-NONE, op-NONE), rx: (bind-NONE, op-NONE),  FI_PROGRESS_MANUAL, FI_THREAD_SAFE (only hints), [FI_MR_VIRT_ADDR], [], [FI_MSG, FI_RMA, FI_READ, FI_WRITE, FI_RECV, FI_SEND, FI_REMOTE_READ, FI_REMOTE_WRITE]]
name                                              bytes   iters   total       time     MB/sec    usec/xfer   Mxfers/sec
lat                                               16      10k     312k        0.04s      8.75       1.83       0.55
lat                                               32      10k     625k        0.03s     19.08       1.68       0.60
lat                                               64      10k     1.2m        0.03s     39.36       1.63       0.61
fi_ubertest: ./include/ofi_mem.h:224: smr_freestack_pop_impl: Assertion `next != NULL' failed.
Aborted (core dumped)

To reproduce

  1. Check out master branch commit 5ef62492e.

  2. Run the server and client commands:
    Server: FI_SHM_DISABLE_CMA=1 /path/to/libfabric/fabtests/install/bin/fi_ubertest -x
    Client: FI_SHM_DISABLE_CMA=1 /path/to/libfabric/fabtests/install/bin/fi_ubertest -u /path/to/libfabric/fabtests/test_configs/shm/all.test -y 49 -z 49 [NODE_IP]

@zhngaj
Contributor Author

zhngaj commented Feb 25, 2020

PR #5667 to fix this issue

@shefty
Member

shefty commented Feb 25, 2020

shm reports that it does not require FI_MR_VIRT_ADDR. The fi_getinfo call should return with this bit cleared. Check the non-cma path to ensure that it is using 0-based offsets for RMA.
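As a rough illustration of the two addressing modes mentioned above (a sketch, not code from the shm provider or fabtests): when the mr_mode returned by the provider has FI_MR_VIRT_ADDR set, the RMA target is the peer's virtual address; when the bit is cleared, the target is a 0-based offset into the registered region. `EX_MR_VIRT_ADDR` and `ex_rma_target_addr` are hypothetical stand-ins so the example compiles without libfabric headers; real code would test `FI_MR_VIRT_ADDR` in `info->domain_attr->mr_mode`.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for libfabric's FI_MR_VIRT_ADDR bit. The value is
 * illustrative only; real code should include <rdma/fabric.h>. */
#define EX_MR_VIRT_ADDR (1u << 4)

/*
 * Compute the address to pass as the RMA target (hypothetical helper).
 *   mr_mode     - mode bits returned by the provider, not the hints
 *   remote_base - peer's virtual address of the start of its registered buffer
 *   offset      - byte offset within that buffer
 */
static uint64_t ex_rma_target_addr(uint64_t mr_mode, uint64_t remote_base,
                                   uint64_t offset)
{
    if (mr_mode & EX_MR_VIRT_ADDR) {
        /* Virtual addressing: guard against wrap-around before adding. */
        assert(offset <= UINT64_MAX - remote_base);
        return remote_base + offset;
    }
    /* Offset addressing: the region starts at 0 from the initiator's view. */
    return offset;
}
```

The point of the sketch is that the decision keys off the mode the provider actually returned. Passing a virtual address to a provider that expects 0-based offsets would address memory outside the registered region.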

@zhngaj
Contributor Author

zhngaj commented Feb 25, 2020

Thanks for your comment, @shefty

shm reports that it does not require FI_MR_VIRT_ADDR. The fi_getinfo call should return with this bit cleared.

You are correct. In this case, I think fi_ubertest 49 needs to use 0-based offsets for RMA even though it requests FI_MR_VIRT_ADDR via hints.
Looking at the function

static void ft_fw_update_info(struct ft_info *test_info, struct fi_info *info)

the mr_mode update is missing. If this looks reasonable, I can update the PR.
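A minimal sketch of what such an mr_mode update might look like, assuming the test framework keeps its own copy of the mode bits. The `ex_` structs and `ex_update_mr_mode` are hypothetical stand-ins, not the real `struct ft_info` or `struct fi_info` definitions, and this is not the actual change in the PR.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, trimmed-down stand-ins for the real libfabric and fabtests
 * structures; only the fields needed for the illustration are present. */
struct ex_domain_attr { int mr_mode; };
struct ex_fi_info { struct ex_domain_attr *domain_attr; };
struct ex_test_info { int mr_mode; };

/* Overwrite the mode bits requested via hints with the bits the provider
 * actually returned, so later RMA setup follows the provider's mode. */
static void ex_update_mr_mode(struct ex_test_info *test_info,
                              const struct ex_fi_info *info)
{
    assert(test_info != NULL);
    if (info != NULL && info->domain_attr != NULL)
        test_info->mr_mode = info->domain_attr->mr_mode;
}
```

With an update along these lines, a test that asked for FI_MR_VIRT_ADDR in its hints but received a cleared bit would carry the cleared value forward.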

In addition, the shm provider implements the FI_MR_VIRT_ADDR memory mode, as per https://ofiwg.github.io/libfabric/master/man/fi_shm.7.html. What would be an appropriate way to test virtual-address-based (FI_MR_VIRT_ADDR) RMA operations, given that the fi_getinfo call clears this bit?

@zhngaj
Contributor Author

zhngaj commented Feb 26, 2020

I opened another PR, #5672, to update test_info's mr_mode bits with the ones returned by the provider.

With this change, the test will use the 0-based offset for RMA.

[ec2-user@ip-172-31-32-230 ~]$ FI_SHM_DISABLE_CMA=1 ./ofiwg-libfabric/fabtests/install/bin/fi_ubertest -u ./ofiwg-libfabric/fabtests/test_configs/shm/all.test -y 49 -z 49 172.31.32.230
Test configurations loaded: 690
Starting test 49 / 690:
[shm, latency, write--, FI_EP_RDM, FI_AV_UNSPEC, eq_wait_none, cq_wait_none, cntr_wait_none, comp_queue -- tx (bind-NONE, op-NONE), rx: (bind-NONE, op-NONE),  FI_PROGRESS_MANUAL, FI_THREAD_SAFE (only hints), [], [], [FI_MSG, FI_RMA, FI_READ, FI_WRITE, FI_RECV, FI_SEND, FI_REMOTE_READ, FI_REMOTE_WRITE]]
name                                              bytes   iters   total       time     MB/sec    usec/xfer   Mxfers/sec
lat                                               16      10k     312k        0.04s      8.04       1.99       0.50
lat                                               32      10k     625k        0.04s     17.53       1.83       0.55
lat                                               64      10k     1.2m        0.04s     36.22       1.77       0.57
lat                                               128     10k     2.4m        0.04s     67.96       1.88       0.53
lat                                               192     10k     3.6m        0.04s    101.51       1.89       0.53
lat                                               256     10k     4.8m        0.04s    128.39       1.99       0.50
lat                                               384     10k     7.3m        0.04s    193.84       1.98       0.50
lat                                               512     10k     9.7m        0.04s    256.01       2.00       0.50
lat                                               768     10k     14m         0.04s    380.38       2.02       0.50
lat                                               1k      10k     19m         0.04s    499.40       2.05       0.49
lat                                               1.5k    10k     29m         0.08s    381.24       4.03       0.25
lat                                               2k      10k     39m         0.05s    899.43       2.28       0.44
lat                                               3k      10k     58m         0.05s   1263.99       2.43       0.41
lat                                               4k      10k     78m         0.05s   1583.85       2.59       0.39
Ending test 49 / 690, result: Success
Success: 1
Skipped: 0
ENODATA: 0
ENOSYS : 0
EIO    : 0
ERROR  : 0

@zhngaj
Contributor Author

zhngaj commented Mar 24, 2020

close

@zhngaj zhngaj closed this as completed Mar 24, 2020