
roachprod: Increase file num limit to fix various roachprod issues #74320

Closed
wants to merge 1 commit from increase_file_num_limit

Conversation

xun-cockroachlabs
Contributor

@xun-cockroachlabs xun-cockroachlabs commented Dec 30, 2021

When running this year's cloud report, we began to hit two major issues:
1. In the tpcc test we frequently hit various KV errors; some of them cause the test to emit COMMAND_ERROR and end well before a machine type can reach its peak performance.
2. The limit also caused an intermittent failure in roachprod get, which various microbench tests use to copy result files from the host to the client. When parallel microbench tests are running and result files have been successfully generated on the host, the copy operation randomly fails, leaving only empty files (length=0) under the target directory.
Both issues occurred at random but fairly frequently.
Increasing the file number limit appears to fix them: in the hundreds of runs since applying this change, not a single copy failure has occurred, and runs near the real limit are much more stable.

Release note: none

@cockroach-teamcity
Member

This change is Reviewable

@srosenberg
Member

We should try to identify the root cause before bumping it. While the additional OS memory (to store more FDs) is negligible, it's nice to have a reasonable upper bound on how many files/sockets a process should be able to open.

Collaborator

@petermattis petermattis left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @miretskiy, @srosenberg, @tbg, and @xun-cockroachlabs)


pkg/roachprod/install/scripts/start.sh, line 96 at r1 (raw file):

  -p "MemoryMax=${MEMORY_MAX}" \
  -p LimitCORE=infinity \
  -p LimitNOFILE=655360 \

I'm not very familiar with systemd, but I'm completely failing to see how adjusting the file limit for CRDB can possibly affect whether roachprod get works or not. roachprod get is essentially a wrapper around scp and does not in any way touch CRDB. As far as I can tell, this change should have zero interaction with roachprod get. Am I completely missing something?

If you can reliably get roachprod get to fail, then my recommendation is to add some debugging to roachprod get. For example, you could add a roachprod get --verbose flag that passes through to scp -v.
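
As a rough illustration of that suggestion, a verbose passthrough could look something like the sketch below, which simply shells out to scp with -v; this is not roachprod's actual code, and the function and flag names are hypothetical.

// get copies a remote file via scp, optionally with ssh-level debug output.
package main

import (
	"os"
	"os/exec"
)

func get(verbose bool, remote, local string) error {
	args := []string{}
	if verbose {
		// scp -v prints the underlying ssh debug messages, which is usually
		// enough to see why a transfer produced empty or missing files.
		args = append(args, "-v")
	}
	args = append(args, remote, local)
	cmd := exec.Command("scp", args...)
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	return cmd.Run()
}

func main() {
	// Hypothetical usage: copy a result file with verbose diagnostics.
	if err := get(true, "ubuntu@10.0.0.1:/mnt/data1/result.json", "./result.json"); err != nil {
		os.Exit(1)
	}
}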

@xun-cockroachlabs
Contributor Author

xun-cockroachlabs commented Jan 2, 2022

I have some understanding of the roachprod repo from the custom label work I did earlier, but I don't fully understand the code base. I couldn't get a consistent repro of the roachprod get issue either: out of the ~150 first-round results I collected, about two dozen hit the issue and had to be rerun. Although we still don't completely understand the root cause, this change appears to eliminate the scp copy issue, and it also fixed the various KV failures that caused the tpcc test to end abruptly around the 2500-3500 store-number level (last year's setting) on vcpu32 machine types. Since the change, several hundred tests have run without a single scp copy issue, all machine types show significantly better perf numbers than last year, and certain machine types can reach roughly the 5500 level.
The tpcc test involves a huge amount of data; several billion rows are imported at the beginning of a run, and I guess running those expensive queries makes the database open a lot of files internally. Right now I'm busy handling the various issues we've seen during tpcc runs and getting a reasonably stable version to produce optimal tpcc results for round one. Once that wraps up, I'll turn on verbose logging, rerun the tests with the original limit, and try to capture more information for root-causing.

@tbg tbg removed their request for review January 3, 2022 07:21
@petermattis
Collaborator

I guess running those expensive queries makes the database open a lot of files internally

CRDB certainly uses lots of file descriptors, but that is completely immaterial. scp doesn't talk to CRDB; it talks to sshd. The file descriptor usage of sshd is completely separate from the file descriptor usage of CRDB. Adjusting the file descriptor limits of CRDB should have absolutely no effect on sshd, and therefore no effect on scp and roachprod get. We need a theory for why this change can possibly help the roachprod get flakiness before merging it.

@miretskiy
Contributor

@xun-cockroachlabs I think the issues you were seeing were with the tpcc run -- and those do need a higher limit.
I don't think ssh is to blame here. Though if crdb uses up enough file descriptors, even connecting via ssh could fail. Are you sure the user (ubuntu) also doesn't have some silly limits configured?

@srosenberg
Member

I think the issues you were seeing were with the tpcc run -- and those do need a higher limit.

Maybe, although the number of open fds seems to scale linearly with the warehouses. E.g., while I was running blktrace on tpcc load and run, I monitored open fds. With 1000 warehouses, load never exceeded 5k and run never exceeded 7k (the majority of open files are SSTs). With 4000 warehouses I'd expect <= 30k in theory (haven't tested yet).
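
For reference, a minimal sketch of how one could watch the open-fd count of a running process on Linux by counting entries in /proc/<pid>/fd. This only illustrates the kind of monitoring described above; it is not the tooling that produced these numbers, and the helper name is made up.

// fdcount prints the number of file descriptors currently open by a process.
package main

import (
	"fmt"
	"os"
	"strconv"
)

// openFDCount counts entries in /proc/<pid>/fd (Linux only).
func openFDCount(pid int) (int, error) {
	entries, err := os.ReadDir(fmt.Sprintf("/proc/%d/fd", pid))
	if err != nil {
		return 0, err
	}
	return len(entries), nil
}

func main() {
	if len(os.Args) != 2 {
		fmt.Fprintln(os.Stderr, "usage: fdcount <pid>")
		os.Exit(1)
	}
	pid, err := strconv.Atoi(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	n, err := openFDCount(pid)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("pid %d has %d open file descriptors\n", pid, n)
}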

@petermattis
Collaborator

Maybe, although the number of open fds seems to scale linearly with the warehouses. E.g., while I was running blktrace on tpcc load and run, I monitored open fds. With 1000 warehouses, load never exceeded 5k and run never exceeded 7k (the majority of open files are SSTs). With 4000 warehouses I'd expect <= 30k in theory (haven't tested yet).

Pebble maintains a cache of open sstables in order to avoid the overhead of reopening an sstable on every read access. The size of this cache is controlled by the process limit on the number of open file descriptors, but it is also affected by the data set size. With smaller warehouse counts you probably don't have a sufficient data size to create a large number of sstables.
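
To make the dependence on the fd limit concrete, here is a minimal sketch of sizing an sstable-cache budget from RLIMIT_NOFILE; the 25% headroom and the helper name are illustrative assumptions, not Pebble's actual policy.

// Sketch: derive an sstable-cache budget from the process fd limit,
// leaving headroom for sockets, logs, and other files.
package main

import (
	"fmt"
	"syscall"
)

func tableCacheBudget() (uint64, error) {
	var rlim syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &rlim); err != nil {
		return 0, err
	}
	// Assumed policy: reserve a quarter of the limit for non-sstable
	// descriptors and give the rest to the sstable cache.
	budget := rlim.Cur - rlim.Cur/4
	if budget < 128 {
		budget = 128
	}
	return budget, nil
}

func main() {
	n, err := tableCacheBudget()
	if err != nil {
		panic(err)
	}
	fmt.Printf("sstable cache budget: %d open files\n", n)
}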

@srosenberg
Member

I think we finally found the root cause (at least one of them). When running the tpcc workload without waiting, the number of TCP connections is determined by the logic below, which is tantamount to 10 * activeWarehouses. Thus, for a high number of warehouses, say 6,000, the load generator needs 60,000 file descriptors just for the TCP connections; this doesn't leave much breathing room for the server.

From tpcc/tpcc.go,

if w.workers == 0 {
	w.workers = w.activeWarehouses * NumWorkersPerWarehouse
}

if w.numConns == 0 {
	// If we're not waiting, open up a connection for each worker. If we are
	// waiting, we only use up to a set number of connections per warehouse.
	// This isn't mandated by the spec, but opening a connection per worker
	// when they each spend most of their time waiting is wasteful.
	if w.waitFraction == 0 {
		w.numConns = w.workers
	} else {
		w.numConns = w.activeWarehouses * numConnsPerWarehouse
	}
}

NOTE: NumWorkersPerWarehouse defaults to 10.
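
A quick back-of-the-envelope check of that math for a no-wait run (the constant below is copied from the note above; this is just illustrative arithmetic, not code from the repo):

package main

import "fmt"

// numWorkersPerWarehouse mirrors NumWorkersPerWarehouse, which defaults to 10.
const numWorkersPerWarehouse = 10

func main() {
	for _, warehouses := range []int{1000, 4000, 6000} {
		// With waitFraction == 0, numConns == workers, i.e. one TCP
		// connection (one fd) on the load generator per worker.
		conns := warehouses * numWorkersPerWarehouse
		fmt.Printf("%5d warehouses -> %6d connections\n", warehouses, conns)
	}
}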

@srosenberg
Member

bors r+

@srosenberg srosenberg added the O-cloudreport Originated from CloudReport label Feb 4, 2022
@craig
Contributor

craig bot commented Feb 4, 2022

Build failed (retrying...):

@petermattis
Collaborator

Doesn't the worker typically run on a different machine from the cockroach nodes?

Also, as I've noted in the past, LimitNOFILE limits the number of file descriptors per process. So even if the worker and cockroach were running on the same machine we should be fine. And both the worker and cockroach processes are independent from sshd, which is what roachprod get relies on.
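
As an aside, a minimal sketch (assuming the Linux /proc/<pid>/limits format; not part of this PR) that makes the per-process nature visible: run it with the pids of the cockroach process and an sshd process and compare their "Max open files" rows.

// limits prints the "Max open files" row from /proc/<pid>/limits for each pid.
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

func maxOpenFiles(pid string) (string, error) {
	f, err := os.Open("/proc/" + pid + "/limits")
	if err != nil {
		return "", err
	}
	defer f.Close()
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		if strings.HasPrefix(sc.Text(), "Max open files") {
			return sc.Text(), nil
		}
	}
	if err := sc.Err(); err != nil {
		return "", err
	}
	return "", fmt.Errorf("pid %s: no Max open files row found", pid)
}

func main() {
	// e.g. go run limits.go <cockroach-pid> <sshd-pid>
	for _, pid := range os.Args[1:] {
		row, err := maxOpenFiles(pid)
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			continue
		}
		fmt.Printf("pid %s: %s\n", pid, row)
	}
}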

Am I just missing something about why bumping the per-process file limit would affect roachprod get?

@craig
Contributor

craig bot commented Feb 4, 2022

Build failed (retrying...):

@tbg
Member

tbg commented Feb 4, 2022

bors r-

Broke CI

@craig
Contributor

craig bot commented Feb 4, 2022

Canceled.

@miretskiy
Contributor

@petermattis #76076 might be a better fix.

I think the thing that got lost here is that we are running "workload" using the cockroach binary. So cockroach workload tpcc ... could easily go over the 65K limit.

The above-linked PR makes the limit configurable, without changing the default we have set for roachprod.

@miretskiy
Contributor

@petermattis Also, I have no idea how this change is related to the roachprod get command, so I completely understand the confusion -- I'm confused myself.

@srosenberg
Member

Superseded by #76076

@srosenberg srosenberg closed this Feb 4, 2022
@srosenberg
Member

Am I just missing something about why bumping the per-process file limit would affect roachprod get?

It has nothing to do with roachprod get.

@rafiss rafiss deleted the increase_file_num_limit branch June 22, 2022 15:24