-
Notifications
You must be signed in to change notification settings - Fork 433
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
UCT/MM: Add user namespace handling for posix and cma transports #9213
Conversation
why not just use |
for manual testing using UCX_TLS=posix,cma allows direct confirmation that both lanes are properly setup if namespace is ok. for testing in shell, using more relaxed sysv specifically with each transport in turn tests the transport unreachability + sysv selection for each transport (error will be specific) and success is expected. Expecting success seems better than expecting failure, since failure could be something else happening before is reachable call. |
Hi @tvegas1 -san, I confirmed this PR works with Apptainer. I would like to know what transport was actually used (ie. posix, sysv, etc). This is how I run apptainer with your PR (our test environment has OmniPath).
|
try |
Thank you @hoopoepg -san! It looks like falling back to
|
hmmm, as I can see it uses all tree transports (shm + ib) for on-host communications:
|
@hoopoepg -san, What is the difference between This is when we use user namespace. (
This is when we did not use user namespace. (
I used the same command for both cases.
|
seems it works as intended, self is same process as per my understanding, so same namespaces, intra-node is different processes, when unsharing user namespace, we have to use non posix/cma, meaning ib. If you added sysv, i guess sysv would be used (unless ipc namespace is also unshared). |
Thank you @tvegas1 -san :) user namespace (apptainer) & added
This is great!!
I'm using Intel MPI Benchmark (IMB-MPI1 Allreduce) for this test. |
@tvegas1 -san, without user namespace (
and then, even if user namespace is not used, trasport fallback to
when I didn't add
|
i see in the code that the order for UCX_TLS is not meaningful and i think it might be possible that sysv is actually simply always preferred in your case, you could confirm that this behavior is the same with/without this PR (without using user namespaces). |
without PR, without user namespace,
I got this output
|
tried v1.10.0 stock, no patch, with mpirun on perftest and with UCX_TLS=posix,sysv, I see sysv being selected, it would be interesting if you would be able to try the PR on its current base/version |
Thank you and good morning @tvegas1 -san, I'll test your PR on v1.10.1 and will update here. (note) I used UCX v1.10.1 from GitHub for previous test (without your PR).
This is configuration log for UCX
and this is configuration log for OpenMPI
|
Hi @tvegas1 -san, I applied your PR to v1.10.1.
I got one failure on
test results UCX v1.10.1 + 9213.patch / apptainer without user namespace /
UCX v1.10.1 + 9213.patch / apptainer without user namespace /
for reference, UCX v1.10.1 w/o patch / apptainer without user namespace /
for reference, your PR / apptainer without user namespace /
for reference, master / apptainer without user namespace /
|
I can not run benchmark under following conditions (while your PR does work): UCX v1.10.1 + 9213.patch / apptainer with user namespace / UCX_TLS=posix,sysv,cma,ib
UCX v1.10.1 + 9213.patch / apptainer with user namespace / UCX_TLS=sysv,cma,ib
UCX master + 9213.patch / apptainer with user namespace / UCX_TLS=posix,sysv,cma,ib
I actually thought UCX master + 9213.patch is kind of equivalent to your PR.. Maybe I am missing something. |
Thanks for running tests, will try to setup apptainer on my side but expect delay. for posix, seems it could be ptrace cap missing if you are running without user namespace as i don't see
|
Thank you @tvegas1 -san, Apptainer has two different packages,
both package installs But,,
I can share my definition files that I used for my tests, if you need. To install Apptainer for fresh system: For Ubuntu 22.04/20.04
For RockyLinux 8/9 (EL8/EL9)
Thank you for your support. |
Thanks, could you please share the definition file? |
About KCMP, I checked "your PR", "master+9213.patch" and "v1.10.1+9213.patch". and It looks like all versions havs KCMP. I'll check the other things as well.
|
@tvegas1 -san,
Yes. This is master+9213.patch example. (Our test system has OmniPath.) Definition file (
build image
Host side setup example against above definition file
and corresponding environment variables on host side
run container (I assume you installed w/o user namespace
w/ user namespace
|
Thanks for spec file, I built ucx only container and tested PR code with two |
What error did you met without user namespace? Could you please share the output? The issue we met is due to the following user namespace differences: mpirun -np 2 apptainer run --userns ... (host namespace #0) |
specifying posix TL, manually running one client and one server using bash |
/azp run |
Azure Pipelines successfully started running 2 pipeline(s), but failed to run 1 pipeline(s). |
@tvegas1 -san, I found the patch was not properly applied to The only difference between
|
@tvegas1 apptainer/apptainer#1583 |
Thanks for testing and updating with latest status. With your provided spec file I was also able to confirm that the PR allows transport fallback when using apptainer with |
e0de0cd
to
df7fe93
Compare
contrib/test_jenkins.sh
Outdated
@@ -1014,6 +1015,63 @@ test_memtrack() { | |||
UCX_MEMTRACK_DEST=stdout UCX_HANDLE_ERRORS=none UCX_MEMTRACK_LIMIT=412MB ./test/apps/test_memtrack_limit |& grep -C 100 'reached' | |||
} | |||
|
|||
test_namespace() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is it part of test_jenkins.sh and not a separate script?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
now moved to test_namespace.sh
contrib/test_jenkins.sh
Outdated
|
||
echo "==== Running perftest namespace positive tests ====" | ||
|
||
export UCX_LOG_LEVEL="debug" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed the line (intent was to be verbose in the logs to ease any future issue, by dumping lanes)
contrib/test_jenkins.sh
Outdated
cmd="$perftest $args -p $server_port" | ||
step_server_port | ||
$cmd & | ||
sleep 5; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why need ; ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed
contrib/test_jenkins.sh
Outdated
export UCX_TLS="$tl,sysv" | ||
cmd="$perftest $args -p $server_port" | ||
step_server_port | ||
$cmd & |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why need to use the local var cmd here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
trying to reuse since we use the same cli two time
contrib/test_jenkins.sh
Outdated
export UCX_LOG_LEVEL="debug" | ||
|
||
echo "==== Running perftest default and non-default USER namespace test for posix ====" | ||
export UCX_TLS="$tl,sysv" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can do "UCX_TLS=... command" without export
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added UCX_TLS in front of each command
contrib/test_jenkins.sh
Outdated
|
||
for tl in posix cma; do | ||
echo "==== Running perftest different PID namespace test for $tl ====" | ||
user=$(whoami) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can use the var $USER
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
contrib/test_jenkins.sh
Outdated
"sudo unshare --pid --fork sudo -u $user UCX_TLS=$tl,sysv UCX_LOG_LEVEL=$UCX_LOG_LEVEL $perftest" "$args" "127.0.0.1" 0 0 | ||
done | ||
|
||
for tl in posix cma; do |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"do" in next line
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
contrib/test_jenkins.sh
Outdated
echo "==== Running perftest different PID namespace test for $tl ====" | ||
user=$(whoami) | ||
run_client_server_app \ | ||
"sudo unshare --pid --fork sudo -u $user UCX_TLS=$tl,sysv UCX_LOG_LEVEL=$UCX_LOG_LEVEL $perftest" "$args" "127.0.0.1" 0 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why need sudo?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unshare --pid
needs privileges
src/uct/sm/mm/posix/mm_posix.c
Outdated
has_ns = !ucs_sys_ns_is_default(UCS_SYS_NS_TYPE_PID) || | ||
!ucs_sys_ns_is_default(UCS_SYS_NS_TYPE_USER); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we really have to pass user NS in the address?
If PID ns is the same, couldn't we try to open the remote file and if we could not open (for any reason) consider the remote iface as not reachable? or use access(2) to check if have permissions to open?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the remote file contains the pid and peer_fd, and it is available doing unpack so it seems it arrives after the is reachable check/lane is setup.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as discussed, now checking directly the capability to posix open
src/uct/sm/scopy/cma/cma_iface.c
Outdated
static inline int uct_cma_has_vm_perms(pid_t UCS_V_UNUSED peer_pid) | ||
{ | ||
#ifdef HAVE_LINUX_KCMP_H | ||
return (syscall(SYS_kcmp, getpid(), peer_pid, KCMP_VM, 0, 0) >= 0) || | ||
(errno != EPERM); | ||
#else | ||
return 1; /* Try anyways */ | ||
#endif | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe we can try process_vm_readv of some dummy address?
I would assume that if the process is accessible we would wither be successful or get EFAULT. and if it's not accessible it would be EPERM.
This way would not depend on new kernel to check the readability
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed (actually opening it seems stronger than checking access, and code is reused)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM besides Mikhail's comments
@tvegas1 currently CI looks like this: Can we place the new tests outside of "Tests" list, call them "Namespace Tests", so it will be the same level as "Wire compat" and "io_demo"? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pls squash
f52df8d
to
597105d
Compare
@tvegas1 seems there is a conflict now, can u pls resolve by merge commit? |
2735079
to
3a1951a
Compare
What
Add implicit user namespace reachability checks for posix and cma transports to allow proper transport selection. Failing to detect transport unreachability leads to error, where we should instead fallback on other transport.
Issue on posix transport:
Issue on cma transport:
Why ?
The cma transport needs ptrace capability and the posix transport, when using
/proc/<pid>/fd/
, also requires that capability. In short the ptrace capability is local to the current user namespace (and it's children user namespaces).UCX current reachability checks for relevant intra-node transports:
When two UCX instances are executed in different:
posix transport has two modes:
/proc/<pid>/fd/
: it avoids potential lefoversThe posix non-proc link mode does not need ptrace, so it can work even in different user namespaces.
How ?
Reachability check needs to actually try to establish the connection.
Proposed handling when user namespaces are different:
Tests
Using
UCX_TLS=posix,cma
, both lanes must be usable:Adding user namespaces Azure tests below.
With same non-default namespaces:
With different user namespaces:
With different pid namespaces:
Test output: