Why did our ib_read_bw and ib_write_bw tests succeed without FFO installed? And why can't drivers be found after we install libibverbs? #10

Open
ling0329 opened this issue Feb 28, 2019 · 8 comments

Comments

@ling0329

Section 4.3, which discusses one-sided operations, identifies two problems in supporting them. The first is that the local FFR does not know the corresponding s-mem on the other side. To solve this, FreeFlow builds a central key-value store in FFO so that all FFRs can learn the mapping between the mem pointer in the application's virtual memory space and the corresponding s-mem pointer in the FFR's virtual memory space. However, our ib_read_bw and ib_write_bw tests all succeeded without FFO installed (we also don't know how to install FFO).
Note that all of our ib_send_bw, ib_read_bw and ib_write_bw tests used rdma_cm mode, because if we install libibverbs we get the warning 'no userspace device-specific driver found'.
image
So we installed only libmlx4 and librdmacm, and all tests used the standard RDMA libibverbs. In that setup, tests in non-rdma_cm mode do not go through the router.
Have you met this problem before? We tried to debug it and found that the function try_driver in init.c fails to find drivers when executing
image
We then suspected driver initialization and traced the failure to the function mlx4_driver_init, defined in mlx4.c in libmlx4. We also noticed that you removed many lines from mlx4.c, which confused us. The problem we finally located is in the following code: it never reaches 'goto found', so it hits 'return NULL' early.
image
But why? Why doesn't rdma_cm mode hit this problem, while with libibverbs installed both modes are affected?
We look forward to your answer!
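
To make the first point of this comment concrete, here is a minimal sketch of the kind of central key-value store the paper describes for one-sided operations. Everything in it is an assumption for illustration: the class name `SMemDirectory`, the choice of `(host, application address)` as the key, and the method names are hypothetical and are not taken from the Freeflow code.

```python
from typing import Dict, Optional, Tuple

# Hypothetical key: (identifier of the host/FFR, address of `mem` in the
# application's virtual memory space).
MemKey = Tuple[str, int]


class SMemDirectory:
    """Toy stand-in for a central store that every FFR could query."""

    def __init__(self) -> None:
        self._table: Dict[MemKey, int] = {}

    def register(self, host: str, app_addr: int, smem_addr: int) -> None:
        # Would be called by the FFR that owns the s-mem when memory is registered.
        self._table[(host, app_addr)] = smem_addr

    def lookup(self, host: str, app_addr: int) -> Optional[int]:
        # Would be called by a remote FFR before a one-sided READ or WRITE.
        # Returns None when no mapping has been published.
        return self._table.get((host, app_addr))
```

If such a store were required at runtime, a remote `lookup` with no prior `register` would return `None`, which is why the successful tests without FFO look surprising.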

@bobzhuyb
Contributor

Did you install the Mellanox OFED driver outside the container and mount the user-space driver path into the container, e.g. with -v /sys/class/:/sys/class/ ? This is in the command line given in README.md. Do you have /sys/class/infiniband_verbs/uverbs0 ?
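
The check asked for above can be scripted from inside the container. This is a small sketch, not part of Freeflow; the function name `list_uverbs_devices` and its `sysfs_class` parameter are made up here so the lookup can also be pointed at a test directory.

```python
from pathlib import Path
from typing import List


def list_uverbs_devices(sysfs_class: str = "/sys/class") -> List[str]:
    """Return the sorted names of uverbs entries under <sysfs_class>/infiniband_verbs."""
    verbs_dir = Path(sysfs_class) / "infiniband_verbs"
    if not verbs_dir.is_dir():
        # Directory missing: the driver is not loaded or /sys/class is not mounted.
        return []
    return sorted(p.name for p in verbs_dir.iterdir() if p.name.startswith("uverbs"))
```

Calling `list_uverbs_devices()` inside the container should include `uverbs0` when the `/sys/class/` mount from the README is in place; an empty list suggests the mount or the host driver is missing.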

@ling0329
Author

Yes, we have installed the Mellanox OFED driver both outside and inside the container. We are sure the user-space driver path is mounted into the container, because we started the application container with the command you provided, without any modification. We also have
image

@ling0329
Author

ling0329 commented Mar 1, 2019

We have found out why it fails to find devices: the abi_version of the Mellanox NICs we use is 1, which is outside the range 3 to 4, so they need to match libmlx5, not libmlx4. However, we have to use newer-generation Mellanox NICs. Anyway, thank you for your attention.
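
The range check described above can be reproduced without stepping through libmlx4. The sketch below assumes the per-device value is exposed as a file named `abi_version` inside the uverbs directory (for example `/sys/class/infiniband_verbs/uverbs0`); the constants simply restate the 3 to 4 range from this comment, and the function names are hypothetical.

```python
from pathlib import Path

# Range quoted in the comment above; treated as an assumption about libmlx4.
MLX4_MIN_ABI = 3
MLX4_MAX_ABI = 4


def read_abi_version(uverbs_dir: str) -> int:
    """Read the integer stored in <uverbs_dir>/abi_version."""
    return int((Path(uverbs_dir) / "abi_version").read_text().strip())


def libmlx4_accepts(abi_version: int) -> bool:
    """True when the version falls inside the range libmlx4 is said to accept."""
    return MLX4_MIN_ABI <= abi_version <= MLX4_MAX_ABI
```

With the value reported here, `libmlx4_accepts(1)` is `False`, which matches the driver returning early instead of claiming the device.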

@ling0329
Author

ling0329 commented Mar 4, 2019

Sorry to bother you again, but we would still like to know why our ib_read_bw and ib_write_bw tests succeeded in rdma_cm mode without FFO installed. According to the analysis in your paper, FFO is indispensable for one-sided operations, so how is that reflected in the open-source release?
Here is our ib_read_bw test case.
On the server side:
image
On the client side:
image
And we got this output:
image
The test above ran between two containers on different hosts, and it succeeded.
Maybe our testing method was wrong, but the traffic really did go through FFR.
We look forward to your reply.
Thanks.

@nilyibo

nilyibo commented Apr 16, 2019

@ling0329 I think this implementation hardcodes the one-sided mapping information in the code.
From the README:

the released implementation hard-codes the host IPs and virtual IP to host IP mapping in https://github.com/Microsoft/Freeflow/blob/master/ffrouter/ffrouter.cpp#L215 and https://github.com/Microsoft/Freeflow/blob/master/ffrouter/ffrouter.h#L76.

Also, it looks like you are trying to run Freeflow with newer NICs. Did you get it to work? And can you share which versions of Ubuntu and OFED you are running (both container and host OS)?

@ling0329
Author

We are still trying to solve this problem, but without success so far. We are not ready to modify libmlx5, because there are many differences between libmlx4 and libmlx5.
This is the Ubuntu version on our host OS:
image
The Ubuntu version in the container is the same:
image
Our OFED is MLNX_OFED_LINUX-4.0-2.0.0.1-ubuntu14.04-x86_64. We use ConnectX-4 40G NICs; the details are here:
image
The MT27700 Family is not listed in libmlx4:
image
We also want to try newer NICs, such as ConnectX-5 25G.
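
For readers following along, the "not listed" check amounts to a table lookup on the PCI vendor and device IDs. The sketch below is illustrative only: the table is a hand-written subset, not the real hca_table from mlx4.c, and the numeric IDs are assumptions that should be verified against `lspci -nn` on the actual hosts.

```python
MELLANOX_VENDOR_ID = 0x15B3  # assumed Mellanox PCI vendor ID

# Illustrative subset only; NOT copied from libmlx4's hca_table.
SUPPORTED_BY_MLX4 = {
    (MELLANOX_VENDOR_ID, 0x1003),  # assumed: ConnectX-3
    (MELLANOX_VENDOR_ID, 0x1007),  # assumed: ConnectX-3 Pro
}


def parse_sysfs_id(text: str) -> int:
    """Parse a hexadecimal ID string such as '0x15b3' followed by a newline."""
    return int(text.strip(), 16)


def mlx4_claims(vendor: int, device: int) -> bool:
    """True when the (vendor, device) pair appears in the illustrative table."""
    return (vendor, device) in SUPPORTED_BY_MLX4
```

Under these assumed IDs, a ConnectX-4 card (assumed device ID 0x1013) is not in the table, so `mlx4_claims(0x15B3, 0x1013)` returns `False`.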

@nilyibo

nilyibo commented Apr 17, 2019

I see. Thanks for sharing your setup.
Yeah, porting these changes to libmlx5 is probably gonna take a lot of effort.
It seems that FreeFlow only works with ConnectX-3. I saw your workaround for the hca_table check and used it to get rdma_client/rdma_server working, although it still hangs 20% of the time.

@bobzhuyb
Contributor

The current architecture of Freeflow works only with libmlx4. It's possible to use the LD_PRELOAD trick to re-implement a cross-driver-version solution by intercepting the relevant calls. However, that requires quite a bit of effort, and all the authors of this project are now busy with something else...
