
Gateway stopped at 3017 bdevs, gw restarts but returns an empty config #309

Closed
pcuzner opened this issue Nov 2, 2023 · 12 comments
Labels: bug (Something isn't working)

@pcuzner
Contributor

pcuzner commented Nov 2, 2023

My goal was to scale to 128 subsystems and 4096 namespaces, but during configuration of the bdevs for the 95th subsystem the gateway stopped (without a crash) with the following:

INFO:control.grpc:Received request to create bdev subsys-95-disk-9 from nvmeof/subsys-95-disk-9 with block size 512
INFO:control.grpc:Allocating cluster name='cluster_context_377'
terminate called after throwing an instance of 'std::system_error'
what():  Too many open files
ERROR:control.server:spdk_get_version failed with:
[Errno 104] Connection reset by peer
INFO:control.server:Aborting SPDK(client.nvmeof.test.d30-h11-000-r750.hnobgu) pid 3...
ERROR:control.grpc:create_bdev failed with:
Connection closed with partial response:

INFO:control.server:Stopping the server...
INFO:control.server:Exiting the gateway process.

The gw pod has the following ulimits set:

sh-5.1# ulimit -a
real-time non-blocking time  (microseconds, -R) unlimited
core file size              (blocks, -c) 0
data seg size               (kbytes, -d) unlimited
scheduling priority                 (-e) 0
file size                   (blocks, -f) unlimited
pending signals                     (-i) 2059423
max locked memory           (kbytes, -l) unlimited
max memory size             (kbytes, -m) unlimited
open files                          (-n) 10240
pipe size                (512 bytes, -p) 8
POSIX message queues         (bytes, -q) 819200
real-time priority                  (-r) 0
stack size                  (kbytes, -s) 8192
cpu time                   (seconds, -t) unlimited
max user processes                  (-u) 1048576
virtual memory              (kbytes, -v) unlimited
file locks                          (-x) unlimited
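
For reference, the actual usage can be checked against that limit from inside the pod; a minimal diagnostic sketch (assuming Python is available in the pod, and using the SPDK pid 3 reported in the abort message above):

import os

def fds_vs_limit(pid: int) -> None:
    """Compare a process's open fd count with its soft 'Max open files' limit."""
    open_fds = len(os.listdir(f"/proc/{pid}/fd"))
    soft_limit = None
    with open(f"/proc/{pid}/limits") as f:
        for line in f:
            if line.startswith("Max open files"):
                # 'Max open files' is three words, so index 3 is the soft limit column
                soft_limit = int(line.split()[3])
                break
    print(f"pid {pid}: {open_fds} open fds, soft limit {soft_limit}")

fds_vs_limit(3)  # 3 is the SPDK pid shown in the abort message above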

After a "systemctl reset-failed" and "systemd restart" for the gateway, the log shows

[2023-11-02 21:57:45.484313] app.c: 712:spdk_app_start: *NOTICE*: Total cores available: 8
[2023-11-02 21:57:45.547479] reactor.c: 926:reactor_run: *NOTICE*: Reactor started on core 1
[2023-11-02 21:57:45.547524] reactor.c: 926:reactor_run: *NOTICE*: Reactor started on core 3
[2023-11-02 21:57:45.547601] reactor.c: 926:reactor_run: *NOTICE*: Reactor started on core 7
[2023-11-02 21:57:45.547538] reactor.c: 926:reactor_run: *NOTICE*: Reactor started on core 4
[2023-11-02 21:57:45.547563] reactor.c: 926:reactor_run: *NOTICE*: Reactor started on core 5
[2023-11-02 21:57:45.547578] reactor.c: 926:reactor_run: *NOTICE*: Reactor started on core 6
[2023-11-02 21:57:45.547498] reactor.c: 926:reactor_run: *NOTICE*: Reactor started on core 2
[2023-11-02 21:57:45.547605] reactor.c: 926:reactor_run: *NOTICE*: Reactor started on core 0
[2023-11-02 21:57:45.568357] accel_sw.c: 681:sw_accel_module_init: *NOTICE*: Accel framework software module initialized.
DEBUG:control.server:create_transport: tcp options: {"in_capsule_data_size": 8192, "max_io_qpairs_per_ctrlr": 8}
[2023-11-02 21:57:45.737874] tcp.c: 629:nvmf_tcp_create: *NOTICE*: *** TCP Transport Init ***
INFO:control.state:nvmeof.None.state omap object already exists.
Exception in thread Thread-1:
Traceback (most recent call last):
File "/usr/lib64/python3.9/threading.py", line 980, in _bootstrap_inner
self.run()
File "/usr/lib64/python3.9/threading.py", line 917, in run
self._target(*self._args, **self._kwargs)
File "/remote-source/ceph-nvmeof/app/control/state.py", line 420, in _update_caller
self.update()
File "/remote-source/ceph-nvmeof/app/control/state.py", line 434, in update
omap_version = int(omap_state_dict[self.omap.OMAP_VERSION_KEY])
KeyError: 'omap_version'

However, looking at the object and key, it looks OK:

# rados -p nvmeof getomapval nvmeof.None.state omap_version
value (4 bytes) :
00000000  36 32 31 33                                       |6213|
00000004

At this point the API is up, but it returns nothing:

[root@d30-h11-000-r750 nvmeof.test.d30-h11-000-r750.hnobgu]# nvmeof-cli get_subsystems
INFO:__main__:Get subsystems:
[]

But I can see all the keys present on the omap object...

@github-project-automation github-project-automation bot moved this to 🆕 New in NVMe-oF Nov 2, 2023
@pcuzner pcuzner changed the title Gateway stopped at 3017 bdevs, gw restarts but present an empty config Gateway stopped at 3017 bdevs, gw restarts but returns an empty config Nov 2, 2023
@pcuzner pcuzner added the bug Something isn't working label Nov 2, 2023
@pcuzner
Contributor Author

pcuzner commented Nov 3, 2023

Tried 0.0.5 and hit the same issue. It looks like a ulimit issue for the pod, but looking at the fds for the nvmt pid, there are 6500 socket entries... maybe sockets aren't getting closed?

@gbregman
Contributor

gbregman commented Nov 9, 2023

@pcuzner the OMAP version key error was fixed. This was a problem whenever we had more than 1024 entries in the OMAP file: the code only read the file once, but Ceph limits the number of entries returned per call to 1024, so some entries were never read. When one of the missing entries was omap_version, you got an exception, because this key is referred to explicitly in the code. The current code gets the OMAP values in a loop, so we read the entire file.
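
For illustration, a minimal sketch of that looped read using the python-rados API (pool and object names are taken from the logs above; this is not the exact gateway code):

import rados

POOL = "nvmeof"             # pool used in the rados command above
OBJ = "nvmeof.None.state"   # gateway state object from the logs above
BATCH = 1024                # Ceph returns at most this many omap entries per call

def read_full_omap(ioctx: rados.Ioctx) -> dict:
    """Page through the omap until a short batch signals the end."""
    omap = {}
    start_after = ""
    while True:
        with rados.ReadOpCtx() as read_op:
            it, _ = ioctx.get_omap_vals(read_op, start_after, "", BATCH)
            ioctx.operate_read_op(read_op, OBJ)
            batch = list(it)
        omap.update(batch)
        if len(batch) < BATCH:
            return omap
        start_after = batch[-1][0]   # resume after the last key read

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
ioctx = cluster.open_ioctx(POOL)
try:
    state = read_full_omap(ioctx)
    print(int(state["omap_version"]))   # e.g. 6213, matching getomapval above
finally:
    ioctx.close()
    cluster.shutdown()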

@gbregman
Contributor

gbregman commented Nov 9, 2023

@pcuzner in which process did you see the 6500 sockets? You wrote "nvmt pid"; it's not clear to me which process that is.

@gbregman
Contributor

gbregman commented Nov 9, 2023

I'm running a loop creating 4096 bdevs. While it's running, I can see that the SPDK process opens more and more files. Looking at the /proc FS, I see a lot of entries like:

.
.
.
lrwx------. 1 root root 64 Nov  9 13:09 90 -> '/dev/hugepages/spdk_pid22map_75 (deleted)'
lrwx------. 1 root root 64 Nov  9 13:09 91 -> '/dev/hugepages/spdk_pid22map_76 (deleted)'
lrwx------. 1 root root 64 Nov  9 13:09 92 -> '/dev/hugepages/spdk_pid22map_77 (deleted)'
lrwx------. 1 root root 64 Nov  9 13:09 93 -> '/dev/hugepages/spdk_pid22map_78 (deleted)'
lrwx------. 1 root root 64 Nov  9 13:09 94 -> '/dev/hugepages/spdk_pid22map_79 (deleted)'
lrwx------. 1 root root 64 Nov  9 13:09 95 -> '/dev/hugepages/spdk_pid22map_80 (deleted)'
lrwx------. 1 root root 64 Nov  9 13:09 96 -> '/dev/hugepages/spdk_pid22map_81 (deleted)'
lrwx------. 1 root root 64 Nov  9 13:09 97 -> '/dev/hugepages/spdk_pid22map_82 (deleted)'
lrwx------. 1 root root 64 Nov  9 13:09 98 -> '/dev/hugepages/spdk_pid22map_83 (deleted)'
lrwx------. 1 root root 64 Nov  9 13:09 99 -> '/dev/hugepages/spdk_pid22map_84 (deleted)'
.
.
.

So, it seems that the file descriptors are kept open after the map was deleted.
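
A quick way to quantify that (a rough diagnostic sketch, not gateway code; 22 is the SPDK pid from the listing above):

import os

def fd_summary(pid: int) -> tuple[int, int]:
    """Count a process's open fds and how many still point at deleted files."""
    fd_dir = f"/proc/{pid}/fd"
    total = deleted = 0
    for fd in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, fd))
        except OSError:
            continue   # fd was closed while we were scanning
        total += 1
        if target.endswith(" (deleted)"):
            deleted += 1
    return total, deleted

total, deleted = fd_summary(22)
print(f"{total} open fds, {deleted} of them pointing at deleted files")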

@pcuzner
Contributor Author

pcuzner commented Nov 9, 2023

Apologies for the cryptic "nvmt"... the pid started with /usr/local/bin/nvmf_tgt holds the sockets, and a lot of them correspond to the number of librbd clients that get created as per the bdevs_per_cluster setting.
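
For what it's worth, the numbers in the original log line up with that explanation; a back-of-the-envelope sketch (bdevs_per_cluster = 8 is an assumed value here, not taken from the config):

import math

# 'cluster_context_377' was being allocated while creating the 3017th bdev,
# which is consistent with bdevs_per_cluster = 8 (assumed): contexts 0..376
# hold 377 * 8 = 3016 bdevs, so bdev 3017 needs context 377.
bdevs = 3017
bdevs_per_cluster = 8
cluster_contexts = math.ceil(bdevs / bdevs_per_cluster)
print(cluster_contexts)   # 378 rados cluster contexts, each holding its own sockets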

@gbregman
Contributor

gbregman commented Nov 9, 2023

Yes, this is SPDK. I asked Jonas about it and he said this is how SPDK works: these files only get freed when you exit the application.

@gbregman
Contributor

@pcuzner I see that in your environment, the limit on open files is 10,240. In our environment the limit is 20,480, which is set in docker-compose.yaml. Did you do anything to reduce the limit?

@gbregman
Contributor

@caroav the limit on the max number of open files was set to 20,480 by Ernesto. He did it to increase the limit we had back then. But when I remove the limit added by Ernesto and just use the default values, I get a limit of 1,048,576, which is much bigger.

@gbregman
Contributor

We build the nvmeof container with a limit of 20,480 open files. But looking at the source of cephadm, I see:

    elif daemon_type == CephNvmeof.daemon_type:
        name = ident.daemon_name
        container_args.extend(['--ulimit', 'memlock=-1:-1'])
        container_args.extend(['--ulimit', 'nofile=10240'])
        container_args.extend(['--cap-add=SYS_ADMIN', '--cap-add=CAP_SYS_NICE'])

So, it seems that cephadm overrides the value we use and decreases it to 10,240.

@gbregman
Contributor

@pcuzner I used a modified version of your script to create 2048 bdevs, 2048 namespaces, 64 subsystems and 64 listeners. After doing all that, I see that SPDK has 3877 open files. As far as I could see, the increase in the SPDK open file count was linear in the number of bdevs created: each new bdev added about 1.78 open files. Assuming this ratio holds, getting to 4096 bdevs should give us about 7500 open files, which is way under the limit, even the cephadm limit, which is half of the one we use. I also ran get_subsystems and it was OK. As I wrote above, the OMAP version key issue was resolved in PR #272. So I'm not sure if we still have a problem here. Can you repeat your test using the latest code, that is, the code committed to git?
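
Spelling out that extrapolation with the numbers quoted above (a rough estimate; the per-bdev figure is an observed average, not a fixed cost):

observed_bdevs, observed_fds = 2048, 3877    # measured after the scripted setup above
fds_per_bdev = 1.78                          # average increase seen per new bdev
baseline = observed_fds - observed_bdevs * fds_per_bdev   # ~232 fds not tied to bdevs
projected = baseline + 4096 * fds_per_bdev                # ~7500 fds at 4096 bdevs
print(round(baseline), round(projected))     # well under both 10,240 and 20,480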

@gbregman
Contributor

@caroav maybe it's time to update the binaries in the repository? We have received several issues that were already solved in our version of the code.

@gbregman
Contributor

@pcuzner I couldn't reproduce the issue, and in the meantime there was a major change to the CLI code, so I'm closing this for now. Please reopen if you see it again with the current code.

@github-project-automation github-project-automation bot moved this from 🆕 New to ✅ Done in NVMe-oF Jan 15, 2024