set $UCX_TLS to 'all' for impi installed on top of UCX #2253

Merged
merged 1 commit into from
Nov 27, 2020

@lexming lexming commented Nov 27, 2020

Fixes issue easybuilders/easybuild-easyconfigs#10899

In impi v2019.5, Intel added the MLX provider, which supports Mellanox HCAs and requires UCX. See: https://software.intel.com/en-us/articles/improve-performance-and-stability-with-intel-mpi-library-on-infiniband

In v2019.6, Intel MPI seems to use something equivalent to UCX_TLS=dc,ud,rc,sm,self internally. As a result, impi only works by default (without setting UCX_TLS) on Mellanox ConnectX-5 and newer HCAs. Everything else fails because the dc transport is missing, and requires explicitly setting UCX_TLS to the transports that are available. This is the workaround Intel recommends in the article linked above.
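The workaround amounts to restricting the transport list to what the system actually offers. A minimal sketch, assuming the transport list quoted above and a hand-supplied set of available transports. The helper name `workaround_ucx_tls` is made up for illustration, and on a real system the available transports would have to be determined separately (for example by inspecting the output of `ucx_info -d`):

```python
# Transport list that impi v2019.6 seems to use internally (as quoted above).
IMPI_DEFAULT_TLS = ["dc", "ud", "rc", "sm", "self"]


def workaround_ucx_tls(available):
    """Return a UCX_TLS value limited to transports present on this system.

    `available` is a hypothetical, user-supplied collection of UCX transport names.
    """
    return ",".join(t for t in IMPI_DEFAULT_TLS if t in available)


# Example: an HCA without dc support (assumed set of transports).
print(workaround_ucx_tls({"ud", "rc", "sm", "self"}))  # prints "ud,rc,sm,self"
```

The resulting string is what would be exported as UCX_TLS on such a system.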

In v2019.8, impi improved in this regard: the MLX provider can now determine the transports available in UCX on its own. It therefore works by default (without setting UCX_TLS) in all hardware configurations where communication goes through MLX, that is, with Mellanox cards. However, it still fails with everything else.

The most effective solution is to explicitly set UCX_TLS=all. This leaves it to UCX to choose the best available transport and avoids errors with impi on all those systems that do not work by default. Explicitly setting UCX_TLS=all will not change the behaviour of systems that were already working well without it.

The only alternative to this solution would be to not install UCX with impi on systems without Mellanox HCAs, but EB cannot provide that option.
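For a single run outside of a module, the same setting can be applied to the job's environment directly. A minimal sketch, assuming a launch from Python; the helper name `env_with_ucx_tls` and the `./a.out` binary are placeholders, not part of this PR:

```python
import os


def env_with_ucx_tls(base_env=None, value="all"):
    """Return a copy of the environment with UCX_TLS explicitly set."""
    env = dict(os.environ if base_env is None else base_env)
    env["UCX_TLS"] = value  # overrides any value that was already set
    return env


# Hypothetical usage (not executed here; './a.out' is a placeholder binary):
#   import subprocess
#   subprocess.run(["mpirun", "-n", "2", "./a.out"], env=env_with_ucx_tls(), check=True)
```

The copy keeps the calling process's own environment untouched, so the setting only affects the launched job.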

@lexming lexming added this to the 4.3.2 (next release) milestone Nov 27, 2020
lexming commented Nov 27, 2020

Test report by @lexming

Overview of tested easyconfigs (in order)

  • SUCCESS impi-2019.7.217-iccifort-2020.1.217.eb

Build succeeded for 1 out of 1 (1 easyconfigs in total)
node303.hydra.os - Linux centos linux 7.9.2009, x86_64, Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz, Python 2.7.5
See https://gist.github.com/c632e648f6662279ddcd11e36bb6b717 for a full test report.

# since impi v2019.8, the MLX provider works without UCX_TLS, but setting it does not hurt
ucx_root = get_software_root('UCX')
if ucx_root:
    txt += self.module_generator.set_environment('UCX_TLS', 'all')
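
The effect of the snippet above on the generated module file can be sketched in isolation. This is an illustration only: `set_environment_lua` is a hypothetical stand-in that imitates the kind of `setenv` line a Lua module file contains, it is not EasyBuild's actual module generator, and the UCX path in the usage note is made up.

```python
def set_environment_lua(key, value):
    """Hypothetical stand-in: build a Lua-style setenv statement for a module file."""
    return 'setenv("%s", "%s")\n' % (key, value)


def make_module_extra(base_txt, ucx_root):
    """Append the UCX_TLS setting only when UCX is among the dependencies.

    `ucx_root` mimics the result of get_software_root('UCX'): an installation
    path when UCX is loaded, or None when it is not.
    """
    txt = base_txt
    if ucx_root:
        txt += set_environment_lua("UCX_TLS", "all")
    return txt
```

With a non-empty `ucx_root` such as `'/apps/UCX/1.9.0'`, the returned text ends with `setenv("UCX_TLS", "all")`; with `None`, the module text is returned unchanged, so installations of impi without UCX are not affected.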

Contributor

Are you sure that all versions of UCX up until 1.9.0 know about "all"?

Contributor

also what is the difference between UCX_TLS=all and not setting UCX_TLS? Does UCX_TLS=all include more transports?

Contributor Author

yes, added ages ago with openucx/ucx#104

Contributor

ok

Contributor Author

> also what is the difference between UCX_TLS=all and not setting UCX_TLS? Does UCX_TLS=all include more transports?

There is no difference: in both cases UCX considers all available transports and chooses the best one. So explicitly setting UCX_TLS=all will not change the behaviour of systems that were already working well without it.

lexming commented Nov 27, 2020

Test report by @lexming

Overview of tested easyconfigs (in order)

  • SUCCESS impi-2019.8.254-iccifort-2020.1.217.eb

Build succeeded for 1 out of 1 (1 easyconfigs in total)
node303.hydra.os - Linux centos linux 7.9.2009, x86_64, Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz, Python 2.7.5
See https://gist.github.com/98af5568335b5f3505f6eef578ece658 for a full test report.

lexming commented Nov 27, 2020

For reference, the previous test ran on a machine without a Mellanox HCA (so no MLX provider). The same test with the current impi easyblock in develop fails the sanity checks. See https://gist.github.com/lexming/4e64523090ddf1c48baf4b47353e918b

@akesandgren akesandgren left a comment

LGTM

@akesandgren

Going in, thanks @lexming!

@akesandgren akesandgren merged commit 4b7f1f0 into easybuilders:develop Nov 27, 2020
@lexming lexming deleted the impi_ucxtls branch November 27, 2020 14:23
lexming commented Nov 27, 2020

Test report by @lexming

Overview of tested easyconfigs (in order)

  • SUCCESS impi-2019.9.304-iccifort-2020.4.304.eb

Build succeeded for 1 out of 1 (1 easyconfigs in total)
node303.hydra.os - Linux centos linux 7.9.2009, x86_64, Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz, Python 2.7.5
See https://gist.github.com/a466d5bd8b549ef4d102f23459cb1630 for a full test report.

@boegel boegel changed the title from "set UCX_TLS in impi with UCX" to "set $UCX_TLS to 'all' for impi installed on top of UCX" Nov 28, 2020