roachprod: Increase file num limit to fix various roachprod issues #74320
Conversation
We should try to identify the root cause before bumping it. While the additional OS memory (to store more FDs) is negligible, it's nice to have a reasonable upper bound on how many files/sockets a process should be able to open.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @miretskiy, @srosenberg, @tbg, and @xun-cockroachlabs)
pkg/roachprod/install/scripts/start.sh, line 96 at r1 (raw file):
-p "MemoryMax=${MEMORY_MAX}" \ -p LimitCORE=infinity \ -p LimitNOFILE=655360 \
I'm not very familiar with systemd, but I'm completely failing to see how adjusting the file limit for CRDB can possibly affect whether `roachprod get` works or not. `roachprod get` is essentially a wrapper around `scp` and does not in any way touch CRDB. As far as I can tell, this change should have zero interaction with `roachprod get`. Am I completely missing something?

If you can reliably get `roachprod get` to fail, then my recommendation is to add some debugging to `roachprod get`. For example, you could add a `roachprod get --verbose` flag that passes through to `scp -v`.
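For what it's worth, that kind of pass-through is straightforward; a minimal sketch (the flag name and wiring here are hypothetical, not roachprod's actual code):

```go
package main

import (
	"flag"
	"os"
	"os/exec"
)

func main() {
	verbose := flag.Bool("verbose", false, "pass -v through to scp")
	flag.Parse() // remaining args: source and destination paths

	args := []string{"-r"}
	if *verbose {
		// scp -v surfaces the underlying ssh debug output, which is
		// usually enough to see why a transfer went wrong.
		args = append(args, "-v")
	}
	args = append(args, flag.Args()...)

	cmd := exec.Command("scp", args...)
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	if err := cmd.Run(); err != nil {
		os.Exit(1)
	}
}
```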
I have some understanding of the roachprod repo because of the custom label work I had done, but don't completely understand the code base. I couldn't get a consistent repro of the …
CRDB certainly uses lots of file descriptors, but that is completely immaterial. `scp` doesn't talk to CRDB, it talks to `sshd`. The file descriptor usage of `sshd` is completely separate from the file descriptor usage of CRDB. Adjusting the file descriptor limits of CRDB should have absolutely no effect on `sshd`, and therefore no effect on `scp` and `roachprod get`.
@xun-cockroachlabs I think the issues you were seeing were with the tpcc run -- and those do need a higher limit.
Maybe, although the number of open fds seems to scale linearly with the warehouses. E.g., while I was running …
Pebble maintains a cache of open sstables in order to avoid the overhead of reopening the sstable on every read access. The size of this cache is controlled by the process limit on the number of open file descriptors, but is also affected by the data set size. With smaller warehouse counts you probably don't have sufficient data size to create a large number of sstables.
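To make that coupling concrete, here is a rough sketch of deriving an sstable-cache capacity from the process FD limit, keeping headroom for sockets, logs, and the WAL. The numbers and policy are illustrative assumptions, not Pebble's actual code:

```go
package main

import (
	"fmt"

	"golang.org/x/sys/unix"
)

// tableCacheSize derives an sstable-cache capacity from RLIMIT_NOFILE,
// reserving some descriptors for non-sstable use. Illustrative only.
func tableCacheSize() (int, error) {
	var lim unix.Rlimit
	if err := unix.Getrlimit(unix.RLIMIT_NOFILE, &lim); err != nil {
		return 0, err
	}
	const reserved = 1000 // headroom for sockets, logs, WAL, etc.
	n := int(lim.Cur) - reserved
	if n < 128 {
		n = 128 // keep a usable floor even under tiny limits
	}
	return n, nil
}

func main() {
	n, err := tableCacheSize()
	if err != nil {
		panic(err)
	}
	fmt.Println("sstable cache capacity:", n)
}
```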
I think we finally found the root cause (at least one of them). While running … From …
NOTE: …
bors r+

Build failed (retrying...)
Doesn't the worker typically run on a different machine from the …? Also, as I've noted in the past, … Am I just missing something about why bumping the per-process file limit would affect …?
Build failed (retrying...)
bors r-

Broke CI
Canceled. |
@petermattis #76076 might be a better fix. I think the thing that got lost here is that we are running "workload" using the cockroach binary. So, … The above linked PR makes … configurable, without changing the default we have set for roachprod.
@petermattis Also, I have no idea how this change is related to the `roachprod get` command, so I completely understand the confusion -- I'm confused myself.
Superseded by #74320 |
It has nothing to do with …
When running this year's cloud report, we began to hit two major issues:
1. In the tpcc test we frequently hit various KV errors; some of them cause the test to emit COMMAND_ERROR and end the test well before a machine type could reach its peak performance.
2. This limit also caused a random issue in `roachprod get`, which various microbench tests use to copy result files from the host to the client. When parallel microbench tests are running and result files have been successfully generated on the host, the copy op will randomly fail, resulting in all-empty files (length=0) under the target directory.

Both of the above issues happened at random, but pretty frequently.
Increasing the file number limit seemed to fix the above issues: for a hundred runs after applying this fix, not a single copy issue has happened so far, and runs near the real limit are much more stable.
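Even with the higher limit in place, callers of `roachprod get` could guard against a recurrence of the empty-file failure by validating the copied artifacts; a hypothetical sketch (not part of this PR):

```go
package main

import (
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// checkNonEmpty walks dir and reports any zero-length files -- the
// failure signature described above.
func checkNonEmpty(dir string) ([]string, error) {
	var empty []string
	err := filepath.WalkDir(dir, func(path string, d fs.DirEntry, err error) error {
		if err != nil || d.IsDir() {
			return err
		}
		info, err := d.Info()
		if err != nil {
			return err
		}
		if info.Size() == 0 {
			empty = append(empty, path)
		}
		return nil
	})
	return empty, err
}

func main() {
	empty, err := checkNonEmpty(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	for _, p := range empty {
		fmt.Println("empty:", p)
	}
}
```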
Release note: none