With 64 subsystems and 2048 namespaces, get_subsystems throws a traceback in the gateway log #306

Closed
pcuzner opened this issue Nov 2, 2023 · 9 comments
Assignees
Labels
bug Something isn't working productization

Comments

@pcuzner
Contributor

pcuzner commented Nov 2, 2023

At this scale, when I issue get_subsystems, I see the following traceback in the gateway log:

[Screenshot: Screenshot_20231102_172749]

@pcuzner pcuzner added the bug Something isn't working label Nov 2, 2023
@github-project-automation github-project-automation bot moved this to 🆕 New in NVMe-oF Nov 2, 2023
@gbregman
Contributor

gbregman commented Nov 2, 2023

As far as I can see the exception was just in displaying a log message. No need to fail the command for that. We can just move the log message a little forward in the code, enclose it in a "try" block and ignore such errors.
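A minimal sketch of the suggestion above, assuming the command result is computed first and the log call is treated as best-effort. The names `get_subsystems_reply`, `fetch`, and `RaisingHandler` are hypothetical and are not taken from the gateway's grpc.py; the handler only exists to simulate a log write that fails.

```python
import errno
import logging


def get_subsystems_reply(fetch, logger):
    """Return the result of fetch(); a failure while logging it is ignored."""
    ret = fetch()  # hypothetical callable standing in for the real SPDK call
    try:
        logger.info(f"get_subsystems: {ret}")
    except Exception:
        # Logging is best-effort here: the command must not fail because of it.
        pass
    return ret


class RaisingHandler(logging.Handler):
    """Hypothetical handler that lets a write error escape to the caller."""

    def emit(self, record):
        raise BlockingIOError(errno.EAGAIN, "write could not complete without blocking")


logger = logging.getLogger("sketch_guarded_log")
logger.setLevel(logging.INFO)
logger.propagate = False
logger.addHandler(RaisingHandler())

result = get_subsystems_reply(lambda: ["subsystem-a", "subsystem-b"], logger)
```

Here `result` still holds the list returned by `fetch`, even though the log call raised.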

@gbregman
Contributor

gbregman commented Nov 2, 2023

@pcuzner could you send us the grpc.py file used here?

@baum
Collaborator

baum commented Nov 2, 2023

The BlockingIOError message might be relevant. @pcuzner 🖖 could you provide the exact container version used and the reproduction steps? Specifically, does the gateway codebase include #258?

@gbregman
Contributor

gbregman commented Nov 2, 2023

@baum, note that the exception was in:

self.logger.info(f"get_subsystems: {ret}")

It's not really I/O, just writing to the log. The strange thing is that this line is inside a "try" block, yet we still didn't reach the "except" clause.
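One possible explanation, not confirmed anywhere in this thread: Python's standard `logging.StreamHandler` catches exceptions raised while writing a record and passes them to `Handler.handleError`, which prints a traceback to stderr (while `logging.raiseExceptions` is true) and does not re-raise. In that case the caller's "except" clause is never entered and the command still returns its output. The sketch below assumes a stream whose write always fails; `FailingStream` is a hypothetical stand-in, not the gateway's actual log target.

```python
import errno
import io
import logging


class FailingStream(io.StringIO):
    """Hypothetical stream whose write always fails, like a blocked stdout."""

    def write(self, s):
        raise BlockingIOError(errno.EAGAIN, "write could not complete without blocking")


logger = logging.getLogger("sketch_swallowed_error")
logger.setLevel(logging.INFO)
logger.propagate = False
logger.addHandler(logging.StreamHandler(FailingStream()))

reached_except = False
try:
    # The handler fails inside emit(), reports the error on stderr,
    # and returns normally, so nothing propagates to this try block.
    logger.info("get_subsystems: %s", "x" * 1000)
except Exception:
    reached_except = True
```

After this runs, `reached_except` is still `False`, while a "--- Logging error ---" traceback appears on stderr.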

@pcuzner
Contributor Author

pcuzner commented Nov 2, 2023

This is against registry-proxy.engineering.redhat.com/rh-osbs/ceph-nvmeof:0.0.4-1
The traceback is thrown, but the output is received. I don't know whether it's complete, though, since it's so large.

@pcuzner
Contributor Author

pcuzner commented Nov 5, 2023

Note that this is also seen with 0.0.5, and with gateway configurations that have higher numbers of smaller subsystems. For example, I noticed the error at 94 subsystems and 376 namespaces, so this issue is not limited to configurations with large numbers of namespaces.

@gbregman
Contributor

gbregman commented Nov 9, 2023

@pcuzner can you specify which CLI command exactly you used to create all these subsystems and namespaces? How many bdevs did you create?

@pcuzner
Contributor Author

pcuzner commented Nov 9, 2023

I added the script I used to the downstream BZ: https://bugzilla.redhat.com/show_bug.cgi?id=2247718. The attachment is called "scaling script".

I've been running the scale tests in multiple dimensions, from a small subsystem count with many namespaces per subsystem through to a high subsystem count with few namespaces per subsystem. The memory and open files issue is the same in each case, due to the librbd client creation.

@gbregman
Contributor

@pcuzner I couldn't reproduce the issue, and in the meantime there was a major change to the CLI code, so I'm closing this for now. Please re-open it if you see the problem again with the current code.

@github-project-automation github-project-automation bot moved this from 🆕 New to ✅ Done in NVMe-oF Jan 15, 2024