Problem with intra-node communication when using Open MPI in a containerized environment #8958

Closed
denisbertini opened this issue Mar 20, 2023 · 59 comments

@denisbertini

When using Open MPI 4.1.5 together with UCX v1.14.0 installed inside a container (Apptainer), it seems that communication via direct address-space access between processes is not permitted:

 [lxbk1012:3549626:0:3549626]      cma_ep.c:84   process_vm_readv(pid=3549634 {0x24ed520,12480}-->{0x1bd42a0,12480}) returned -1: Operation not permitted
[lxbk1012:3549633:0:3549633]      cma_ep.c:84   process_vm_readv(pid=3549634 {0x2561490,12480}-->{0x1bd42a0,12480}) returned -1: Operation not permitted
==== backtrace (tid:3549626) ====
 0  /usr/local/ucx/lib/libucs.so.0(ucs_handle_error+0x294) [0x7f0dab078ba4]
 1  /usr/local/ucx/lib/libucs.so.0(ucs_fatal_error_message+0xb0) [0x7f0dab075bb0]
 2  /usr/local/ucx/lib/libucs.so.0(ucs_log_default_handler+0xf09) [0x7f0dab07a6d9]
 3  /usr/local/ucx/lib/libucs.so.0(ucs_log_dispatch+0xcc) [0x7f0dab07aa7c]
 4  /usr/local/ucx/lib/ucx/libuct_cma.so.0(+0x2683) [0x7f0daa0e7683]
 5  /usr/local/ucx/lib/ucx/libuct_cma.so.0(uct_cma_ep_tx+0x168) [0x7f0daa0e7998]
 6  /usr/local/ucx/lib/libuct.so.0(uct_scopy_ep_progress_tx+0x55) [0x7f0dab2c8955]
 7  /usr/local/ucx/lib/libucs.so.0(ucs_arbiter_dispatch_nonempty+0xa3) [0x7f0dab06d0d3]
 8  /usr/local/ucx/lib/libuct.so.0(uct_scopy_iface_progress+0x59) [0x7f0dab2c84a9]
 9  /usr/local/ucx/lib/libucs.so.0(+0x21dc3) [0x7f0dab06ddc3]
10  /usr/local/ucx/lib/libucp.so.0(ucp_worker_progress+0x22) [0x7f0dab530a62]
11  /usr/local/lib/libopen-pal.so.40(opal_progress+0x2c) [0x7f0db45bdf1c]
12  /usr/local/lib/libmpi.so.40(ompi_request_default_wait+0x3d) [0x7f0db5ad358d]
13  /usr/local/lib/libmpi.so.40(ompi_coll_base_bcast_intra_generic+0x5e4) [0x7f0db5b29b04]
14  /usr/local/lib/libmpi.so.40(ompi_coll_base_bcast_intra_bintree+0xb6) [0x7f0db5b29d26]
15  /usr/local/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_bcast_intra_dec_fixed+0x3c) [0x7f0da2b74c7c]
16  /usr/local/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_allreduce_intra_dec_fixed+0x41) [0x7f0da2b744c1]
17  /usr/local/lib/libmpi.so.40(PMPI_Allreduce+0x169) [0x7f0db5aeb179]
18  /usr/local/lib/libmpi_mpifh.so.40(mpi_allreduce_+0x7b) [0x7f0db5e1802b]
19  /usr/local/bin/epoch3d() [0x435dde]
20  /usr/local/bin/epoch3d() [0x530f88]
21  /usr/local/bin/epoch3d() [0x4d3a76]
22  /usr/local/bin/epoch3d() [0x40318d]
23  /lib64/libc.so.6(__libc_start_main+0xe5) [0x7f0db487bd85]
24  /usr/local/bin/epoch3d() [0x4031ce]

It looks like the optimized communication between processes within one node is not allowed because both processes are launched inside a container. Do you know a way around this limitation?

@shamisp
Contributor

shamisp commented Mar 25, 2023

Can you please elaborate - two processes talking to each other within the same container, or two separate containers talking to each other?

@denisbertini
Author

Two processes, each of them inside its own separate container, talking to each other. In this case direct CMA is actually not possible, I guess ...

@denisbertini
Author

I ran UCX in debug mode:

# OSU MPI Allreduce Latency Test v7.0
# Size       Avg Latency(us)   Min Latency(us)   Max Latency(us)  Iterations
[1680008823.363205] [lxbk1014:1937990:0]       wireup_ep.c:415  UCX  DEBUG ep 0x7ff033d79040: destroy wireup ep 0x1f5b010
[1680008823.363219] [lxbk1014:1937990:0]          ucp_ep.c:1289 UCX  DEBUG ep 0x7ff033d79000: unprogress iface 0x1f2e1d0 ud_mlx5/mlx5_0:1
[1680008823.363226] [lxbk1014:1937990:0]           ud_ep.c:1786 UCX  DEBUG ep 0x1ff3290: disconnect
[1680008823.363245] [lxbk1014:1937990:0]          ucp_ep.c:1289 UCX  DEBUG ep 0x7ff033d79040: unprogress iface 0x1f2e1d0 ud_mlx5/mlx5_0:1
[1680008823.363258] [lxbk1014:1937990:0]           async.c:157  UCX  DEBUG removed async handler 0x1ed9190 [id=1000017 ref 1] ???() from hash
[1680008823.363266] [lxbk1014:1937990:0]           async.c:547  UCX  DEBUG removing async handler 0x1ed9190 [id=1000017 ref 1] ???()
[1680008823.363279] [lxbk1014:1937990:0]           async.c:172  UCX  DEBUG release async handler 0x1ed9190 [id=1000017 ref 0] ???()
[1680008823.363304] [lxbk1014:1937990:0]           ud_ep.c:1786 UCX  DEBUG ep 0x7ff028000e00: disconnect
[1680008823.375784] [lxbk1014:1937990:0]           mpool.c:282  UCX  DEBUG mpool ucp_am_bufs: allocated chunk 0x2047064 of 24660 bytes with 128 elements
4                       2.93              2.92              2.94        1000
8                       2.91              2.91              2.91        1000
16                      2.87              2.86              2.88        1000
32                      3.03              3.03              3.03        1000
64                      2.99              2.98              3.00        1000
[1680008823.422718] [lxbk1014:1937990:0]           mm_ep.c:68   UCX  DEBUG mm_ep 0x1fac630: attached remote segment id 0xd0010 at 0x7ff02c285000 cookie 0x7ff03c79ac80
128                     3.08              3.08              3.08        1000
256                     3.11              3.09              3.14        1000
512                     3.14              3.14              3.15        1000
1024                    3.30              3.30              3.31        1000
2048                    3.86              3.85              3.88        1000
4096                    4.77              4.76              4.78        1000
8192                    5.39              5.39              5.39        1000
16384                   7.96              7.95              7.97        1000
32768                  11.27             11.26             11.28        1000
65536                  18.72             18.69             18.75         100
131072                 33.02             32.97             33.08         100
262144                 62.02             62.00             62.04         100
[1680008823.702993] [lxbk1014:1937990:0]           mpool.c:282  UCX  DEBUG   mpool ucp_rkeys: allocated chunk 0x204d0c0 of 16472 bytes with 128 elements
[1680008823.703017] [lxbk1014:1937990:0]           mpool.c:282  UCX  DEBUG mpool uct_scopy_iface_tx_mp: allocated chunk 0x2051120 of 6232 bytes with 8 elements
[1680008823.703189] [lxbk1014:1937990:0]      ucp_worker.c:531  UCX  DEBUG worker 0x7ff038011010: error handler called for UCT EP 0x7ff034021910: Connection reset by remote peer
[1680008823.703203] [lxbk1014:1937990:0]          ucp_ep.c:1412 UCX  DEBUG ep 0x7ff033d79040: set_ep_failed status Connection reset by remote peer on lane[1]=0x7ff034021910
[1680008823.703210] [lxbk1014:1937990:0]          ucp_ep.c:1373 UCX  DEBUG ep 0x7ff033d79040: discarding lanes
[1680008823.703219] [lxbk1014:1937990:0]          ucp_ep.c:1381 UCX  DEBUG ep 0x7ff033d79040: discard uct_ep[0]=0x1fac630
[1680008823.703230] [lxbk1014:1937990:0]          ucp_ep.c:1381 UCX  DEBUG ep 0x7ff033d79040: discard uct_ep[1]=0x7ff034021910
[1680008823.703237] [lxbk1014:1937990:0]          ucp_ep.c:1381 UCX  DEBUG ep 0x7ff033d79040: discard uct_ep[2]=0x1fa4120
[1680008823.703254] [lxbk1014:1937990:0]           mpool.c:282  UCX  DEBUG mpool send-ops-mpool: allocated chunk 0x2052980 of 16472 bytes with 256 elements
[1680008823.703268] [lxbk1014:1937990:0]          ucp_ep.c:1453 UCX  DIAG  ep 0x7ff033d79040: error 'Connection reset by remote peer' on cma/memory will not be handled since no error callback is installed
[lxbk1014:1937990:0:1937990]      cma_ep.c:84   process_vm_readv(pid=1937991 {0x7ff033af4010,131072}-->{0x7fed3e0a7000,131072}) returned -1: Operation not permitted
==== backtrace (tid:1937990) ====
 0  /usr/local/ucx/lib/libucs.so.0(ucs_handle_error+0x294) [0x7ff032c6da44]
 1  /usr/local/ucx/lib/libucs.so.0(ucs_fatal_error_message+0xb0) [0x7ff032c6aa30]
 2  /usr/local/ucx/lib/libucs.so.0(ucs_log_default_handler+0xf09) [0x7ff032c6f579]
 3  /usr/local/ucx/lib/libucs.so.0(ucs_log_dispatch+0xdc) [0x7ff032c6f92c]
 4  /usr/local/ucx/lib/ucx/libuct_cma.so.0(+0x2683) [0x7ff031cdb683]
 5  /usr/local/ucx/lib/ucx/libuct_cma.so.0(uct_cma_ep_tx+0x168) [0x7ff031cdb998]
 6  /usr/local/ucx/lib/libuct.so.0(uct_scopy_ep_progress_tx+0x55) [0x7ff032ebc9f5]
 7  /usr/local/ucx/lib/libucs.so.0(ucs_arbiter_dispatch_nonempty+0x9c) [0x7ff032c61e8c]
 8  /usr/local/ucx/lib/libuct.so.0(uct_scopy_iface_progress+0x59) [0x7ff032ebc529]
 9  /usr/local/ucx/lib/libucs.so.0(+0x21b94) [0x7ff032c62b94]
10  /usr/local/ucx/lib/libucp.so.0(ucp_worker_progress+0x22) [0x7ff033125232]
11  /usr/local/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_send+0x155) [0x7ff0335bdb35]
12  /usr/local/lib/libmpi.so.40(ompi_coll_base_sendrecv_actual+0x93) [0x7ff03d39ca33]
13  /usr/local/lib/libmpi.so.40(ompi_coll_base_allreduce_intra_redscat_allgather+0x46c) [0x7ff03d3a47bc]
14  /usr/local/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_allreduce_intra_dec_fixed+0x41) [0x7ff02e7da4c1]
15  /usr/local/lib/libmpi.so.40(PMPI_Allreduce+0x169) [0x7ff03d359179]
16  /usr/local/bin/collective/osu_allreduce() [0x4024f0]
17  /lib64/libc.so.6(__libc_start_main+0xe5) [0x7ff03c414d85]
18  /usr/local/bin/collective/osu_allreduce() [0x401aae]

Any idea what the "Connection reset by remote peer" means in that context?

@denisbertini
Author

For isolated processes in a user namespace (container), the following UCX code will detect CMA
and set cma_supported to 1 (enabled).

static int uct_cma_test_ptrace_scope()
{
    static const char *ptrace_scope_file = "/proc/sys/kernel/yama/ptrace_scope";
    const char *extra_info_str;
    int cma_supported;
    char buffer[32];
    ssize_t nread;
    char *value;

    /* Check if ptrace_scope allows using CMA.
     * See https://www.kernel.org/doc/Documentation/security/Yama.txt
     */
    nread = ucs_read_file(buffer, sizeof(buffer) - 1, 1, "%s", ptrace_scope_file);
    if (nread < 0) {
        /* Cannot read file - assume that Yama security module is not enabled */
        ucs_debug("could not read '%s' - assuming Yama security is not enforced",
                  ptrace_scope_file);
        return 1;
    }

    ucs_assert(nread < sizeof(buffer));
    extra_info_str = "";
    cma_supported  = 0;
    buffer[nread]  = '\0';
    value          = ucs_strtrim(buffer);
    if(!strcmp(value, "0")) {
        /* ptrace scope 0 allow attaching within same UID */
        cma_supported = 1;

Just looking at the underlying ptrace setting in /proc/sys/kernel/yama/ptrace_scope is therefore not enough in the case of isolated processes.
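To illustrate the point, here is a minimal standalone probe (my own sketch, not part of UCX) that issues the same process_vm_readv() call as UCX's cma transport against a given peer PID. Built with gcc cma_probe.c -o cma_probe and run from one container against a rank in another container instance, it should reproduce the "Operation not permitted" failure even with ptrace_scope set to 0, assuming the peer PID is visible (i.e. the PID namespace is shared, as in the traces above); when the call is permitted, the deliberately bogus remote address makes it fail with "Bad address" instead.

/* cma_probe.c - hypothetical diagnostic, not from the UCX sources */
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/uio.h>

int main(int argc, char **argv)
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s <peer-pid>\n", argv[0]);
        return 1;
    }

    pid_t peer = (pid_t)atoi(argv[1]);
    char buf[64];
    struct iovec local  = { .iov_base = buf,  .iov_len = sizeof(buf) };
    /* NULL remote address: the kernel performs the ptrace/namespace permission
     * check (EPERM) before it ever dereferences the remote address (EFAULT). */
    struct iovec remote = { .iov_base = NULL, .iov_len = sizeof(buf) };

    if (process_vm_readv(peer, &local, 1, &remote, 1, 0) < 0) {
        printf("process_vm_readv(pid=%d) failed: %s\n", (int)peer, strerror(errno));
    } else {
        printf("process_vm_readv(pid=%d) unexpectedly succeeded\n", (int)peer);
    }
    return 0;
}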

@ptim0626

I came across the same error in an Apptainer container too, and although I am not sure whether it is relevant, I added the flag --ipc and the error is gone.

@denisbertini
Author

You mean that adding the option
apptainer --ipc solved the issue with CMA between isolated processes?

@ptim0626

I run the container with something like apptainer exec --ipc ...... ./my_application and the error from process_vm_readv is gone.
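In a Slurm job this composes roughly like the following (just a sketch; the rank count, image name and benchmark binary are placeholders, not taken from this thread):

srun -N 1 -n 2 --mpi=pmi2 apptainer exec --ipc my_image.sif ./osu_allreduce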

@denisbertini
Author

Interesting! Do you know what this option does - IPC/user namespace sharing?

@ptim0626

From the docs, it runs the container in a new inter-process communication (IPC) namespace, and I am not sure why that helps. I also don't know if there is any performance impact.
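One way to see what the flag actually changes (a small diagnostic, not from the docs): from a shell inside each of the two containers on the same node, with and without --ipc, print the namespace IDs and compare which ones differ between the ranks.

# identical links between the two containers mean that namespace is shared
readlink /proc/$$/ns/ipc /proc/$$/ns/pid /proc/$$/ns/user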

@denisbertini
Author

If CMA is actually used, then one should see quite a performance boost ...
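A quick A/B check would be to run the benchmark once as-is and once with the cma transport excluded via UCX_TLS, and compare the large-message latencies (a sketch; the launch line and image name are placeholders, and it assumes the environment variable is propagated into the container):

# baseline: CMA eligible for intra-node transfers
srun -N 1 -n 2 --mpi=pmi2 apptainer exec --ipc my_image.sif ./osu_allreduce
# comparison: cma transport excluded, falls back to copy-in/copy-out shared memory
UCX_TLS='^cma' srun -N 1 -n 2 --mpi=pmi2 apptainer exec --ipc my_image.sif ./osu_allreduce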

@denisbertini
Author

The solution seems to have been accepted by the Apptainer developers, see:
apptainer/apptainer#1227
I still need to do some performance tests with and without CMA though ...

@ptim0626

Great, good to know that setting APPTAINER_UNSHARE_IPC=true also works.

@tvegas1
Contributor

tvegas1 commented Jul 13, 2023

Started related PR #9213

@tvegas1
Contributor

tvegas1 commented Aug 2, 2023

apptainer: more details on the --ipc option and another workaround: apptainer/apptainer#769 (comment)
apptainer: rfe: apptainer/apptainer#1583

@yosefe
Contributor

yosefe commented Aug 28, 2023

Fixed in UCX master branch.

@panda1100
Contributor

panda1100 commented Sep 7, 2023

@tvegas1 I wrote a report related to this issue and the other one:
https://ciq.com/blog/workaround-for-communication-issue-with-mpi-apps-apptainer-without-setuid/

@denisbertini
Author

Your report is protected

@panda1100
Contributor

@denisbertini My mistake. I updated the link above. The first one was a link to the staging server (^^;

@denisbertini
Author

Very good. I am trying to find a possible solution to this problem using apptainer instance; I think it is definitely the way to go.
Have you already worked out a solution?

@panda1100
Contributor

The current workaround is to use apptainer instance, please see the very end of the report.
For Apptainer v1.3.0 we are planning to implement this workaround as an apptainer runtime option.

@denisbertini
Author

OK, but how can this work together with a scheduler, for example SLURM?

@denisbertini
Author

Is there a timeline for apptainer 1.3.0 ?

@panda1100
Contributor

panda1100 commented Sep 7, 2023

@denisbertini I also showed an example job script for SLURM in the report (please see the section "Multi-node test with Slurm").
There is no timeline for apptainer v1.3.0 yet. The next release will be v1.2.3 and after that we will most likely move on to v1.3.0.
https://github.com/apptainer/apptainer/milestones

@denisbertini
Author

@panda1100 In your multi-node test you used ssh ... this looks like a hack, and we do not allow ssh on the compute nodes.
I tried to use srun directly but it failed to keep the instance running ... any idea?

@panda1100
Contributor

panda1100 commented Sep 7, 2023

@denisbertini Yes, this is kind of an ugly solution right now .. Apptainer apptainer/apptainer#1583 will internally handle this without ssh.

srun doesn't work because each Slurm job step kills its processes when the step completes; in this case, right after the apptainer instance has started successfully.

Without ssh, a simple bash script sample.sh may help (just an idea, not tested yet), something like:

#!/bin/bash
# sample.sh: start the instance once per node, then run the app in it
if apptainer instance list INSTANCE_NAME | grep -q INSTANCE_NAME
then
    apptainer run instance://INSTANCE_NAME YOUR_APP
else
    apptainer instance start YOUR.sif INSTANCE_NAME
    apptainer run instance://INSTANCE_NAME YOUR_APP
fi

and the job script:
#!/bin/bash
#SBATCH --partition=hpc
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=4

srun --mpi=pmi2 sample.sh
srun --nodes=4 --ntasks=4 --ntasks-per-node=1 apptainer instance stop $SLURM_JOB_ID

If the execve syscall is restricted by policy, this doesn't work ...

@denisbertini
Author

It does not work so far:

FATAL:   failed to start instance: while running /usr/libexec/apptainer/bin/starter: exit status 255

@denisbertini
Author

It may be linked to execve permissions.

@denisbertini
Author

Yes, I tried to add it at the end, so:

if [ `flock -x /tmp apptainer instance list $SLURM_JOB_ID | grep $SLURM_JOB_ID | wc -l` -eq 1 ]
then
    apptainer instance stop $SLURM_JOB_ID
fi

@denisbertini
Author

It works though, but with these kinds of errors:

FATAL:   instance 8459273 already exists
FATAL:   instance 8459273 already exists
FATAL:   instance 8459273 already exists
FATAL:   instance 8459273 already exists
FATAL:   instance 8459273 already exists
FATAL:   instance 8459273 already exists
FATAL:   instance 8459273 already exists
FATAL:   instance 8459273 already exists
FATAL:   instance 8459273 already exists
FATAL:   instance 8459273 already exists
FATAL:   instance 8459273 already exists
FATAL:   instance 8459273 already exists
FATAL:   instance 8459273 already exists
INFO:    Stopping 8459273 instance of /lustre/rz/dbertini/containers/prod/rlx8_ompi_ucx.sif (PID=3088010)
/lustre/rz/dbertini/vae23/theory/./instances.sh: line 14: 3088344 Killed                  apptainer exec instance://$SLURM_JOB_ID $DIR/simple.sh
/lustre/rz/dbertini/vae23/theory/./instances.sh: line 14: 3088112 Killed                  apptainer exec instance://$SLURM_JOB_ID $DIR/simple.sh
/lustre/rz/dbertini/vae23/theory/./instances.sh: line 14: 3088140 Killed                  apptainer exec instance://$SLURM_JOB_ID $DIR/simple.sh
/lustre/rz/dbertini/vae23/theory/./instances.sh: line 14: 3088847 Killed                  apptainer exec instance://$SLURM_JOB_ID $DIR/simple.sh

@denisbertini
Author

If I remove the apptainer stop command everything works, but the instance on the node is not cleaned up.

@panda1100
Contributor

If your Slurm doesn't kill processes at the end of a job step,
stopping the instance in the next job step should work.

#!/bin/bash
#SBATCH --partition=hpc
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=40

srun --mpi=pmi2 flock.sh
srun --nodes=2 --ntasks=2 --ntasks-per-node=1 apptainer instance stop $SLURM_JOB_ID

@denisbertini
Author

Do you know a way to tell apptainer instance not to write output logs in the $HOME directory?
On our compute nodes no $HOME directory is mounted ...

@panda1100
Contributor

@denisbertini
Author

You mean you can change the location of the logs and other app info files, which apptainer instance writes by default to

$HOME/.apptainer/instance

by using the mount option -B?
I do not understand that, can you show me an example?

@panda1100
Contributor

@denisbertini To change $HOME/.apptainer to somewhere else, the
APPTAINER_CONFIGDIR environment variable may help in your situation.

This environment variable is for specifying the directory to use for per-user configuration. The default is $HOME/.apptainer.
https://apptainer.org/docs/user/main/appendix.html#c
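For example, near the top of the job script, before anything apptainer-related runs (the path is only a placeholder for a directory that is writable from the compute nodes):

export APPTAINER_CONFIGDIR=/lustre/rz/$USER/.apptainer
mkdir -p "$APPTAINER_CONFIGDIR"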

@denisbertini
Author

denisbertini commented Sep 9, 2023

No, even with this variable set, apptainer still wants to write something in the $HOME/.apptainer directory ...
See:

@panda1100
Contributor

@denisbertini Oh, thank you for the pointer. I'll work with the team on that.

@panda1100
Contributor

@denisbertini How about explicitly changing the HOME environment variable inside the Slurm job script? I haven't tested it yet though.

This is an example; /data is a location where you have write permission.

#!/bin/bash
#SBATCH --partition=hpc
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=40

export HOME=/data/YOUR_USERNAME

srun --mpi=pmi2 flock.sh
srun --nodes=2 --ntasks=2 --ntasks-per-node=1 apptainer instance stop $SLURM_JOB_ID

and bind mount /data/YOUR_USERNAME when the instance starts (not sure whether this is required or not)

#!/bin/sh
if [ `flock -x /tmp/${SLURM_JOB_ID}.lock apptainer instance list $SLURM_JOB_ID | grep $SLURM_JOB_ID | wc -l` -eq 1 ]
then
        apptainer run instance://$SLURM_JOB_ID alltoall
else
        flock -o -x /tmp/${SLURM_JOB_ID}.lock apptainer instance start -B /data/YOUR_USERNAME hpcx-imb.sif $SLURM_JOB_ID
        apptainer run instance://$SLURM_JOB_ID alltoall
fi

@denisbertini
Author

The problem I see, even using this approach, is that the instances are not visible anymore:

INFO:    instance started successfully
INFO:    instance started successfully
FATAL:   no instance found with name 8685190
FATAL:   no instance found with name 8685190

@denisbertini
Author

The problem is linked to apptainer instance list, which somehow does not see the created instance.

@denisbertini
Author

BTW, with the modified script only the log part of the instances is properly redirected; apptainer still wants to write the other app folder in my HOME directory!

ERROR:   container cleanup failed: no instance found with name 8685190
FATAL:   post start process failed: mkdir /u: permission denied

@panda1100
Contributor

Thank you @denisbertini, I also replicated the issue.

@panda1100
Contributor

@denisbertini The fix (apptainer/apptainer#1666) was merged and will be released as Apptainer v1.2.3.

@denisbertini
Author

Great! Thanks ... is there any timeline for the v1.2.3 release?

@panda1100
Contributor

@denisbertini At least judging from the completion status at https://github.com/apptainer/apptainer/milestones, it looks like it will be soon ...

@denisbertini
Author

@panda1100 Instead of just scripts, could one imagine a better integration of apptainer instance via, for example, a SLURM SPANK plugin?
The basic idea: the user would have to type something like

srun --container=mycontainer.sif ...

and then only one instance per compute node would be created in the prolog phase and stopped in the epilog phase ...
What do you think?

@panda1100
Contributor

This workaround is only a temporary solution until a permanent one is released. The permanent solution we are planning is apptainer/apptainer#1583. This RFE will implement a runtime option, something like --mpi, and it will not use apptainer instance.

@denisbertini
Author

@panda1100 Is there already an apptainer patch available which includes the fix for the proper behavior of CONFIGDIR for
instance start?

@denisbertini
Author

I would like to test it on our cluster.

@panda1100
Contributor

@denisbertini Not merged yet, but I tested it with apptainer/apptainer#1672

This is the procedure I used on Rocky Linux 9.2:

dnf install -y epel-release
crb enable
dnf groupinstall -y 'Development Tools'

dnf install -y --enablerepo=epel libseccomp-devel squashfs-tools \
  squashfuse fuse-overlayfs fakeroot \
  /usr/*bin/fuse2fs cryptsetup wget git \
  golang patch

git clone -b main https://github.com/apptainer/apptainer.git
wget https://patch-diff.githubusercontent.com/raw/apptainer/apptainer/pull/1672.patch
cd apptainer/
patch -p1 < ../1672.patch
./mconfig --prefix=/opt/apptainer-1666-main-patch
cd ./builddir
make
make install

@denisbertini
Author

This will not be necessary anymore, since we already have the v1.2.3 release.

@denisbertini
Author

@panda1100 Do you know if there is a recommended Linux kernel version needed to properly run apptainer v1.x in the default non-setuid mode? We are facing a problem on Rocky Linux 8.x running kernel 4.18.x where created user namespaces are not properly cleaned up, and when the kernel's max_user_namespaces limit is reached we get this error:

Container launched: /cvmfs/vae.gsi.de/vae23/containers/vae23-user_container_20230811T0945.sif
INFO   : A system administrator may need to enable user namespaces, install
INFO   :   apptainer-suid, or compile with ./mconfig --with-suid
ERROR  : Failed to create user namespace: maximum number of user namespaces exceeded, check /proc/sys/user/max_user_namespaces
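For what it's worth, the current cap can be inspected and bumped like this (a sketch; raising it needs root and only mitigates the symptom, it does not fix the namespaces not being released):

# inspect the current limit on the node
cat /proc/sys/user/max_user_namespaces
# raise it temporarily (as root); the value here is arbitrary
sysctl -w user.max_user_namespaces=15000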

@panda1100
Contributor

@denisbertini Interesting. Please create an issue on the Apptainer repo; let's work on it there.

@denisbertini
Author

done
