
OSError: [Errno 24] Too many open files #16

Open
keskitalo opened this issue May 17, 2023 · 2 comments

@keskitalo

It seems that moving to POSIX shared memory may have introduced new limitations in pshmem. One of my workflows is failing with:

Process 0: MPIShared_e5ec3957d523 failed MMap of 115168 bytes (14396 elements of 8 bytes each): [Errno 24] Too many open files
TOAST ERROR: Proc 0: Traceback (most recent call last):
Proc 0:   File "/home/reijo/.conda/envs/toastdev/lib/python3.11/site-packages/toast/mpi.py", line 509, in exception_guard
    yield
Proc 0:   File "/home/reijo/.conda/envs/toastdev/bin/toast_so_sim.py", line 934, in <module>
    main()
Proc 0:   File "/home/reijo/.conda/envs/toastdev/bin/toast_so_sim.py", line 913, in main
    data = simulate_data(job, args, toast_comm, telescope, schedule)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Proc 0:   File "/home/reijo/.conda/envs/toastdev/bin/toast_so_sim.py", line 327, in simulate_data
    ops.sim_ground.apply(data)
Proc 0:   File "/home/reijo/.conda/envs/toastdev/lib/python3.11/site-packages/toast/timing.py", line 107, in df
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
Proc 0:   File "/home/reijo/.conda/envs/toastdev/lib/python3.11/site-packages/toast/ops/operator.py", line 107, in apply
    self.exec(data, detectors, use_accel=use_accel, **kwargs)
Proc 0:   File "/home/reijo/.conda/envs/toastdev/lib/python3.11/site-packages/toast/timing.py", line 107, in df
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
Proc 0:   File "/home/reijo/.conda/envs/toastdev/lib/python3.11/site-packages/toast/ops/operator.py", line 47, in exec
    self._exec(
Proc 0:   File "/home/reijo/.conda/envs/toastdev/lib/python3.11/site-packages/toast/timing.py", line 81, in df
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
Proc 0:   File "/home/reijo/.conda/envs/toastdev/lib/python3.11/site-packages/toast/ops/sim_ground.py", line 597, in _exec
    ob.shared.create_column(
Proc 0:   File "/home/reijo/.conda/envs/toastdev/lib/python3.11/site-packages/toast/observation_data.py", line 1352, in create_column
    MPIShared(
Proc 0:   File "/home/reijo/.conda/envs/toastdev/lib/python3.11/site-packages/pshmem/shmem.py", line 195, in __init__
    self._shmap = mmap.mmap(
                  ^^^^^^^^^^
Proc 0: OSError: [Errno 24] Too many open files

When I run against the pshmem version from just before the POSIX branch was merged, there is no problem.

@tskisner
Owner

I guess it is not surprising that the OS limit for shared memory segments (i.e. open files) gets hit eventually. This is set by a kernel config parameter for the whole node, and the per-user limit is shown with ulimit -Sn. Here is a page with more discussion.
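
To make that concrete, here is a minimal sketch of the mechanism, using the stdlib multiprocessing.shared_memory module purely for illustration (pshmem has its own POSIX shm wrapper, but the file-descriptor accounting is the same): each mapped segment keeps one descriptor open for its lifetime, so allocating enough segments trips the per-process limit.

```python
# Illustration only: like any POSIX shared-memory user, the stdlib
# SharedMemory class keeps a file descriptor open for each mapped
# segment.  Creating segments in a loop eventually fails with EMFILE
# (errno 24), the same error seen in the traceback above.
from multiprocessing import shared_memory

segments = []
try:
    while True:
        segments.append(shared_memory.SharedMemory(create=True, size=8))
except OSError as exc:
    print(f"hit the open-files limit after {len(segments)} segments: {exc}")
finally:
    # Release descriptors and remove the segments from /dev/shm.
    for seg in segments:
        seg.close()
        seg.unlink()
```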

The surprising thing is that this per-user/process limit is (apparently) smaller than the limit on MPI shared memory windows, which applies globally across the system.

  • Is this at NERSC or on the simons1 machine?
  • Can you try increasing the limit in the same shell where you are running the workflow? Something like ulimit -n 8192 before mpirun / srun (there is also a sketch below for raising the limit from inside the process).
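
If raising the limit with ulimit does not stick, one hypothetical alternative (a sketch, not something pshmem does itself) is to raise the soft limit programmatically at the top of the script. Note that setrlimit can only raise the soft limit up to the existing hard limit:

```python
# Hypothetical workaround: raise this process's soft open-files limit
# to its hard limit before any shared memory is allocated.
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
print(f"open-files soft limit raised from {soft} to {hard}")
```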

@keskitalo
Author

I tried ulimit -n 8192, but the workflow still failed. I saw the issue both on simons1 and on my laptop running Fedora.
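
One way to confirm what limit the MPI ranks actually see (a diagnostic sketch, assuming mpi4py is available, as it is in TOAST) is to print it per rank, since launchers do not always forward the launch shell's ulimit settings to remote processes:

```python
# Diagnostic sketch: print the effective open-files limit on every MPI
# rank to verify whether the raised ulimit actually propagated.
import resource
from mpi4py import MPI

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"rank {MPI.COMM_WORLD.rank}: soft={soft} hard={hard}")
```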
