
Gateway stopped at 3017 bdevs, gw restarts but returns an empty config #309

Closed
pcuzner opened this issue Nov 2, 2023 · 12 comments
Labels: bug (Something isn't working)

@pcuzner
Contributor

pcuzner commented Nov 2, 2023

My goal was to scale to 128 subsystems and 4096 namespaces, but during configuration of the bdevs for the 95th subsystem the gateway stopped (without a crash) with the following:

INFO:control.grpc:Received request to create bdev subsys-95-disk-9 from nvmeof/subsys-95-disk-9 with block size 512
INFO:control.grpc:Allocating cluster name='cluster_context_377'
terminate called after throwing an instance of 'std::system_error'
what():  Too many open files
ERROR:control.server:spdk_get_version failed with:
[Errno 104] Connection reset by peer
INFO:control.server:Aborting SPDK(client.nvmeof.test.d30-h11-000-r750.hnobgu) pid 3...
ERROR:control.grpc:create_bdev failed with:
Connection closed with partial response:

INFO:control.server:Stopping the server...
INFO:control.server:Exiting the gateway process.

The gw pod has the following ulimits set:

sh-5.1# ulimit -a
real-time non-blocking time  (microseconds, -R) unlimited
core file size              (blocks, -c) 0
data seg size               (kbytes, -d) unlimited
scheduling priority                 (-e) 0
file size                   (blocks, -f) unlimited
pending signals                     (-i) 2059423
max locked memory           (kbytes, -l) unlimited
max memory size             (kbytes, -m) unlimited
open files                          (-n) 10240
pipe size                (512 bytes, -p) 8
POSIX message queues         (bytes, -q) 819200
real-time priority                  (-r) 0
stack size                  (kbytes, -s) 8192
cpu time                   (seconds, -t) unlimited
max user processes                  (-u) 1048576
virtual memory              (kbytes, -v) unlimited
file locks                          (-x) unlimited
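
For reference, the actual usage can be checked against that limit from inside the pod; a minimal diagnostic sketch (assuming Python is available in the pod, and using the SPDK pid 3 reported in the abort message above):

import os

def fds_vs_limit(pid: int) -> None:
    """Compare a process's open fd count with its soft 'Max open files' limit."""
    open_fds = len(os.listdir(f"/proc/{pid}/fd"))
    soft_limit = None
    with open(f"/proc/{pid}/limits") as f:
        for line in f:
            if line.startswith("Max open files"):
                # 'Max open files' is three words, so index 3 is the soft limit column
                soft_limit = int(line.split()[3])
                break
    print(f"pid {pid}: {open_fds} open fds, soft limit {soft_limit}")

fds_vs_limit(3)  # 3 is the SPDK pid shown in the abort message above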

After a "systemctl reset-failed" and "systemd restart" for the gateway, the log shows

[2023-11-02 21:57:45.484313] app.c: 712:spdk_app_start: *NOTICE*: Total cores available: 8
[2023-11-02 21:57:45.547479] reactor.c: 926:reactor_run: *NOTICE*: Reactor started on core 1
[2023-11-02 21:57:45.547524] reactor.c: 926:reactor_run: *NOTICE*: Reactor started on core 3
[2023-11-02 21:57:45.547601] reactor.c: 926:reactor_run: *NOTICE*: Reactor started on core 7
[2023-11-02 21:57:45.547538] reactor.c: 926:reactor_run: *NOTICE*: Reactor started on core 4
[2023-11-02 21:57:45.547563] reactor.c: 926:reactor_run: *NOTICE*: Reactor started on core 5
[2023-11-02 21:57:45.547578] reactor.c: 926:reactor_run: *NOTICE*: Reactor started on core 6
[2023-11-02 21:57:45.547498] reactor.c: 926:reactor_run: *NOTICE*: Reactor started on core 2
[2023-11-02 21:57:45.547605] reactor.c: 926:reactor_run: *NOTICE*: Reactor started on core 0
[2023-11-02 21:57:45.568357] accel_sw.c: 681:sw_accel_module_init: *NOTICE*: Accel framework software module initialized.
DEBUG:control.server:create_transport: tcp options: {"in_capsule_data_size": 8192, "max_io_qpairs_per_ctrlr": 8}
[2023-11-02 21:57:45.737874] tcp.c: 629:nvmf_tcp_create: *NOTICE*: *** TCP Transport Init ***
INFO:control.state:nvmeof.None.state omap object already exists.
Exception in thread Thread-1:
Traceback (most recent call last):
File "/usr/lib64/python3.9/threading.py", line 980, in _bootstrap_inner
self.run()
File "/usr/lib64/python3.9/threading.py", line 917, in run
self._target(*self._args, **self._kwargs)
File "/remote-source/ceph-nvmeof/app/control/state.py", line 420, in _update_caller
self.update()
File "/remote-source/ceph-nvmeof/app/control/state.py", line 434, in update
omap_version = int(omap_state_dict[self.omap.OMAP_VERSION_KEY])
KeyError: 'omap_version'

However, looking at the object and key, it looks OK:

# rados -p nvmeof getomapval nvmeof.None.state omap_version
value (4 bytes) :
00000000  36 32 31 33                                       |6213|
00000004

At this point the API is up, but it returns nothing:

[root@d30-h11-000-r750 nvmeof.test.d30-h11-000-r750.hnobgu]# nvmeof-cli get_subsystems
INFO:__main__:Get subsystems:
[]

But I can see all the keys present on the omap object...

@github-project-automation github-project-automation bot moved this to 🆕 New in NVMe-oF Nov 2, 2023
@pcuzner pcuzner changed the title Gateway stopped at 3017 bdevs, gw restarts but present an empty config Gateway stopped at 3017 bdevs, gw restarts but returns an empty config Nov 2, 2023
@pcuzner pcuzner added the bug Something isn't working label Nov 2, 2023
@pcuzner
Contributor Author

pcuzner commented Nov 3, 2023

Tried 0.0.5 and hit the same issue. It looks like a ulimit issue for the pod, but looking at the fds for the nvmt pid, there are 6500 socket entries... maybe sockets aren't getting closed?

@gbregman
Contributor

gbregman commented Nov 9, 2023

@pcuzner the OMAP version key error was fixed. This was a problem whenever we had more than 1024 entries in the OMAP file: the code only read the file once, but Ceph limits the number of entries returned per call to 1024, so some entries were never read. When one of the missing entries was omap_version, you got an exception, because this key is referred to explicitly in the code. The current code gets the OMAP values in a loop, so we read the entire file.
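
For illustration, a minimal sketch of that looped read using the python-rados API (pool and object names are taken from the logs above; this is not the exact gateway code):

import rados

POOL = "nvmeof"             # pool used in the rados command above
OBJ = "nvmeof.None.state"   # gateway state object from the logs above
BATCH = 1024                # Ceph returns at most this many omap entries per call

def read_full_omap(ioctx: rados.Ioctx) -> dict:
    """Page through the omap until a short batch signals the end."""
    omap = {}
    start_after = ""
    while True:
        with rados.ReadOpCtx() as read_op:
            it, _ = ioctx.get_omap_vals(read_op, start_after, "", BATCH)
            ioctx.operate_read_op(read_op, OBJ)
            batch = list(it)
        omap.update(batch)
        if len(batch) < BATCH:
            return omap
        start_after = batch[-1][0]   # resume after the last key read

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
ioctx = cluster.open_ioctx(POOL)
try:
    state = read_full_omap(ioctx)
    print(int(state["omap_version"]))   # e.g. 6213, matching getomapval above
finally:
    ioctx.close()
    cluster.shutdown()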

@gbregman
Contributor

gbregman commented Nov 9, 2023

@pcuzner in which process did you see the 6500 sockets? You wrote "nvmt pid"; it's not clear to me which process that is.

@gbregman
Contributor

gbregman commented Nov 9, 2023

I'm running a loop creating 4096 bdevs. While it's running, I can see that the SPDK process opens more and more files. Looking at the /proc FS, I see a lot of entries like:

.
.
.
lrwx------. 1 root root 64 Nov  9 13:09 90 -> '/dev/hugepages/spdk_pid22map_75 (deleted)'
lrwx------. 1 root root 64 Nov  9 13:09 91 -> '/dev/hugepages/spdk_pid22map_76 (deleted)'
lrwx------. 1 root root 64 Nov  9 13:09 92 -> '/dev/hugepages/spdk_pid22map_77 (deleted)'
lrwx------. 1 root root 64 Nov  9 13:09 93 -> '/dev/hugepages/spdk_pid22map_78 (deleted)'
lrwx------. 1 root root 64 Nov  9 13:09 94 -> '/dev/hugepages/spdk_pid22map_79 (deleted)'
lrwx------. 1 root root 64 Nov  9 13:09 95 -> '/dev/hugepages/spdk_pid22map_80 (deleted)'
lrwx------. 1 root root 64 Nov  9 13:09 96 -> '/dev/hugepages/spdk_pid22map_81 (deleted)'
lrwx------. 1 root root 64 Nov  9 13:09 97 -> '/dev/hugepages/spdk_pid22map_82 (deleted)'
lrwx------. 1 root root 64 Nov  9 13:09 98 -> '/dev/hugepages/spdk_pid22map_83 (deleted)'
lrwx------. 1 root root 64 Nov  9 13:09 99 -> '/dev/hugepages/spdk_pid22map_84 (deleted)'
.
.
.

So, it seems that the file descriptors are kept open after the map was deleted.
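
A quick way to quantify that (a rough diagnostic sketch, not gateway code; 22 is the SPDK pid from the listing above):

import os

def fd_summary(pid: int) -> tuple[int, int]:
    """Count a process's open fds and how many still point at deleted files."""
    fd_dir = f"/proc/{pid}/fd"
    total = deleted = 0
    for fd in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, fd))
        except OSError:
            continue   # fd was closed while we were scanning
        total += 1
        if target.endswith(" (deleted)"):
            deleted += 1
    return total, deleted

total, deleted = fd_summary(22)
print(f"{total} open fds, {deleted} of them pointing at deleted files")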

@pcuzner
Contributor Author

pcuzner commented Nov 9, 2023

Apologies for the cryptic "nvmt"... the pid started with /usr/local/bin/nvmf_tgt holds the sockets, and a lot of them correspond to the number of librbd clients that get created as per the bdevs_per_cluster setting.
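
For what it's worth, the numbers in the original log line up with that explanation; a back-of-the-envelope sketch (bdevs_per_cluster = 8 is an assumed value here, not taken from the config):

import math

# 'cluster_context_377' was being allocated while creating the 3017th bdev,
# which is consistent with bdevs_per_cluster = 8 (assumed): contexts 0..376
# hold 377 * 8 = 3016 bdevs, so bdev 3017 needs context 377.
bdevs = 3017
bdevs_per_cluster = 8
cluster_contexts = math.ceil(bdevs / bdevs_per_cluster)
print(cluster_contexts)   # 378 rados cluster contexts, each holding its own sockets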

@gbregman
Contributor

gbregman commented Nov 9, 2023

Yes, this is SPDK. I asked Jonas about it and he said this is how SPDK works: these files only get freed when you exit the application.

@gbregman
Contributor

@pcuzner I see that in your environment, the limit on open files is 10,240. In our environment the limit is 20,480, which is set in docker-compose.yaml. Did you do anything to reduce the limit?

@gbregman
Contributor

@caroav the limit on the max number of open files was set to 20,480 by Ernesto. He did it to increase the limit we had back then. But when I remove the limit added by Ernesto and just use the default values, I get a limit of 1,048,576, which is much bigger.

@gbregman
Contributor

We build the nvmeof container with a limit of 20,480 open files. But looking at the source of cephadm, I see:

    elif daemon_type == CephNvmeof.daemon_type:
        name = ident.daemon_name
        container_args.extend(['--ulimit', 'memlock=-1:-1'])
        container_args.extend(['--ulimit', 'nofile=10240'])
        container_args.extend(['--cap-add=SYS_ADMIN', '--cap-add=CAP_SYS_NICE'])

So, it seems that cephadm overrides the value we use and decreases it to 10,240.

@gbregman
Contributor

@pcuzner I used a modified version of your script to create 2048 bdevs, 2048 namespaces, 64 subsystems and 64 listeners. After doing all that, I see that SPDK has 3877 open files. As far as I could see, the increase in the SPDK open file count was linear in the number of bdevs created: each new bdev added about 1.78 open files. Assuming this ratio holds, getting to 4096 bdevs should give us about 7500 open files, which is way under the limit, even the cephadm limit, which is half of the one we use. I also ran get_subsystems and it was OK. As I wrote above, the OMAP version key issue was resolved in PR #272. So I'm not sure if we still have a problem here. Can you repeat your test using the latest code, that is, the code committed to git?
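
Spelling out that extrapolation with the numbers quoted above (a rough estimate; the per-bdev figure is an observed average, not a fixed cost):

observed_bdevs, observed_fds = 2048, 3877    # measured after the scripted setup above
fds_per_bdev = 1.78                          # average increase seen per new bdev
baseline = observed_fds - observed_bdevs * fds_per_bdev   # ~232 fds not tied to bdevs
projected = baseline + 4096 * fds_per_bdev                # ~7500 fds at 4096 bdevs
print(round(baseline), round(projected))     # well under both 10,240 and 20,480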

@gbregman
Contributor

@caroav maybe it's time to update the binaries in the repository? We have received several issues that were already solved in our version of the code.

@gbregman
Contributor

@pcuzner I couldn't reproduce the issue, and in the meantime there was a major change to the CLI code, so I'm closing this for now. Please reopen if you see it again with the current code.

@github-project-automation github-project-automation bot moved this from 🆕 New to ✅ Done in NVMe-oF Jan 15, 2024