
roachprod: Increase file num limit to fix various roachprod issues #74320

Closed
wants to merge 1 commit from increase_file_num_limit

Conversation

xun-cockroachlabs
Contributor

@xun-cockroachlabs xun-cockroachlabs commented Dec 30, 2021

When running this year's cloud report, we began to hit two major issues:
1. In the tpcc test we frequently hit various KV errors; some of them cause the test to emit COMMAND_ERROR and end well before a machine type can reach its peak performance.
2. The limit also caused an intermittent failure in roachprod get, which various microbench tests use to copy result files from the host to the client. When parallel microbench tests are running and result files have been successfully generated on the host, the copy operation randomly fails, leaving only empty files (length=0) under the target directory.
Both issues occurred at random but fairly frequently.
Increasing the file number limit appears to fix them: in the hundreds of runs since applying this change, not a single copy failure has occurred, and runs near the real limit are much more stable.

Release note: none

@cockroach-teamcity
Member

This change is Reviewable

@srosenberg
Member

We should try to identify the root cause before bumping it. While the additional OS memory (to store more FDs) is negligible, it's nice to have a reasonable upper bound on how many files/sockets a process should be able to open.

Collaborator

@petermattis petermattis left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @miretskiy, @srosenberg, @tbg, and @xun-cockroachlabs)


pkg/roachprod/install/scripts/start.sh, line 96 at r1 (raw file):

  -p "MemoryMax=${MEMORY_MAX}" \
  -p LimitCORE=infinity \
  -p LimitNOFILE=655360 \

I'm not very familiar with systemd, but I'm completely failing to see how adjusting the file limit for CRDB can possibly affect whether roachprod get works or not. roachprod get is essentially a wrapper around scp and does not in any way touch CRDB. As far as I can tell, this change should have zero interaction with roachprod get. Am I completely missing something?

If you can reliably get roachprod get to fail, then my recommendation is to add some debugging to roachprod get. For example, you could add a roachprod get --verbose flag that passes through to scp -v.
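
As a rough illustration of that suggestion, a verbose passthrough could look something like the sketch below, which simply shells out to scp with -v; this is not roachprod's actual code, and the function and flag names are hypothetical.

// get copies a remote file via scp, optionally with ssh-level debug output.
package main

import (
	"os"
	"os/exec"
)

func get(verbose bool, remote, local string) error {
	args := []string{}
	if verbose {
		// scp -v prints the underlying ssh debug messages, which is usually
		// enough to see why a transfer produced empty or missing files.
		args = append(args, "-v")
	}
	args = append(args, remote, local)
	cmd := exec.Command("scp", args...)
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	return cmd.Run()
}

func main() {
	// Hypothetical usage: copy a result file with verbose diagnostics.
	if err := get(true, "ubuntu@10.0.0.1:/mnt/data1/result.json", "./result.json"); err != nil {
		os.Exit(1)
	}
}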

@xun-cockroachlabs
Contributor Author

xun-cockroachlabs commented Jan 2, 2022

I have some understanding of the roachprod repo from the custom label work I did earlier, but I don't fully understand the code base. I couldn't get a consistent repro of the roachprod get issue either: out of the ~150 first-round results I collected, about two dozen hit the issue and had to be rerun. Although we still don't completely understand the root cause, this change appears to eliminate the scp copy issue, and it also fixed the various KV failures that caused the tpcc test to end abruptly around the 2500-3500 store-number level (last year's setting) on vcpu32 machine types. Since the change, several hundred tests have run without a single scp copy issue, all machine types show significantly better perf numbers than last year, and certain machine types can reach roughly the 5500 level.
The tpcc test involves a huge amount of data; several billion rows are imported at the beginning of a run, and I guess running those expensive queries makes the database open a lot of files internally. Right now I'm busy handling the various issues we've seen during tpcc runs and getting a reasonably stable version to produce optimal tpcc results for round one. Once that wraps up, I'll turn on verbose logging, rerun the tests with the original limit, and try to capture more information for root-causing.

@tbg tbg removed their request for review January 3, 2022 07:21
@petermattis
Collaborator

I guess running those expensive queries makes the database open a lot of files internally

CRDB certainly uses lots of file descriptors, but that is completely immaterial. scp doesn't talk to CRDB; it talks to sshd. The file descriptor usage of sshd is completely separate from the file descriptor usage of CRDB. Adjusting the file descriptor limits of CRDB should have absolutely no effect on sshd, and therefore no effect on scp and roachprod get. We need a theory for why this change can possibly help the roachprod get flakiness before merging it.

@miretskiy
Contributor

@xun-cockroachlabs I think the issues you were seeing were with the tpcc run -- and those do need a higher limit.
I don't think ssh is to blame here. Though if crdb uses up enough file descriptors, even connecting via ssh could fail. Are you sure the user (ubuntu) also doesn't have some silly limits configured?

@srosenberg
Member

I think the issues you were seeing were with the tpcc run -- and those do need a higher limit.

Maybe, although the number of open fds seems to scale linearly with the warehouses. E.g., while I was running blktrace on tpcc load and run, I monitored open fds. With 1000 warehouses, load never exceeded 5k and run never exceeded 7k (the majority of open files are SSTs). With 4000 warehouses I'd expect <= 30k in theory (haven't tested yet).
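
For reference, a minimal sketch of how one could watch the open-fd count of a running process on Linux by counting entries in /proc/<pid>/fd. This only illustrates the kind of monitoring described above; it is not the tooling that produced these numbers, and the helper name is made up.

// fdcount prints the number of file descriptors currently open by a process.
package main

import (
	"fmt"
	"os"
	"strconv"
)

// openFDCount counts entries in /proc/<pid>/fd (Linux only).
func openFDCount(pid int) (int, error) {
	entries, err := os.ReadDir(fmt.Sprintf("/proc/%d/fd", pid))
	if err != nil {
		return 0, err
	}
	return len(entries), nil
}

func main() {
	if len(os.Args) != 2 {
		fmt.Fprintln(os.Stderr, "usage: fdcount <pid>")
		os.Exit(1)
	}
	pid, err := strconv.Atoi(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	n, err := openFDCount(pid)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("pid %d has %d open file descriptors\n", pid, n)
}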

@petermattis
Collaborator

Maybe, although the number of open fds seems to scale linearly with the warehouses. E.g., while I was running blktrace on tpcc load and run, I monitored open fds. With 1000 warehouses, load never exceeded 5k and run never exceeded 7k (the majority of open files are SSTs). With 4000 warehouses I'd expect <= 30k in theory (haven't tested yet).

Pebble maintains a cache of open sstables in order to avoid the overhead of reopening an sstable on every read access. The size of this cache is controlled by the process limit on the number of open file descriptors, but it is also affected by the data set size. With smaller warehouse counts you probably don't have a sufficient data size to create a large number of sstables.
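
To make the dependence on the fd limit concrete, here is a minimal sketch of sizing an sstable-cache budget from RLIMIT_NOFILE; the 25% headroom and the helper name are illustrative assumptions, not Pebble's actual policy.

// Sketch: derive an sstable-cache budget from the process fd limit,
// leaving headroom for sockets, logs, and other files.
package main

import (
	"fmt"
	"syscall"
)

func tableCacheBudget() (uint64, error) {
	var rlim syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &rlim); err != nil {
		return 0, err
	}
	// Assumed policy: reserve a quarter of the limit for non-sstable
	// descriptors and give the rest to the sstable cache.
	budget := rlim.Cur - rlim.Cur/4
	if budget < 128 {
		budget = 128
	}
	return budget, nil
}

func main() {
	n, err := tableCacheBudget()
	if err != nil {
		panic(err)
	}
	fmt.Printf("sstable cache budget: %d open files\n", n)
}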

@srosenberg
Member

I think we finally found the root cause (at least one of them). When running the tpcc workload without waiting, the number of TCP connections is determined by the logic below, which is tantamount to 10 * activeWarehouses. Thus, for a high number of warehouses, say 6,000, the load generator needs 60,000 file descriptors just for the TCP connections; this doesn't leave much breathing room for the server.

From tpcc/tpcc.go,

if w.workers == 0 {
	w.workers = w.activeWarehouses * NumWorkersPerWarehouse
}

if w.numConns == 0 {
	// If we're not waiting, open up a connection for each worker. If we are
	// waiting, we only use up to a set number of connections per warehouse.
	// This isn't mandated by the spec, but opening a connection per worker
	// when they each spend most of their time waiting is wasteful.
	if w.waitFraction == 0 {
		w.numConns = w.workers
	} else {
		w.numConns = w.activeWarehouses * numConnsPerWarehouse
	}
}

NOTE: NumWorkersPerWarehouse defaults to 10.
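
A quick back-of-the-envelope check of that math for a no-wait run (the constant below is copied from the note above; this is just illustrative arithmetic, not code from the repo):

package main

import "fmt"

// numWorkersPerWarehouse mirrors NumWorkersPerWarehouse, which defaults to 10.
const numWorkersPerWarehouse = 10

func main() {
	for _, warehouses := range []int{1000, 4000, 6000} {
		// With waitFraction == 0, numConns == workers, i.e. one TCP
		// connection (one fd) on the load generator per worker.
		conns := warehouses * numWorkersPerWarehouse
		fmt.Printf("%5d warehouses -> %6d connections\n", warehouses, conns)
	}
}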

@srosenberg
Member

bors r+

@srosenberg srosenberg added the O-cloudreport Originated from CloudReport label Feb 4, 2022
@craig
Contributor

craig bot commented Feb 4, 2022

Build failed (retrying...):

@petermattis
Collaborator

Doesn't the worker typically run on a different machine from the cockroach nodes?

Also, as I've noted in the past, LimitNOFILE limits the number of file descriptors per process. So even if the worker and cockroach were running on the same machine we should be fine. And both the worker and cockroach processes are independent from sshd, which is what roachprod get relies on.
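
As an aside, a minimal sketch (assuming the Linux /proc/<pid>/limits format; not part of this PR) that makes the per-process nature visible: run it with the pids of the cockroach process and an sshd process and compare their "Max open files" rows.

// limits prints the "Max open files" row from /proc/<pid>/limits for each pid.
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

func maxOpenFiles(pid string) (string, error) {
	f, err := os.Open("/proc/" + pid + "/limits")
	if err != nil {
		return "", err
	}
	defer f.Close()
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		if strings.HasPrefix(sc.Text(), "Max open files") {
			return sc.Text(), nil
		}
	}
	if err := sc.Err(); err != nil {
		return "", err
	}
	return "", fmt.Errorf("pid %s: no Max open files row found", pid)
}

func main() {
	// e.g. go run limits.go <cockroach-pid> <sshd-pid>
	for _, pid := range os.Args[1:] {
		row, err := maxOpenFiles(pid)
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			continue
		}
		fmt.Printf("pid %s: %s\n", pid, row)
	}
}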

Am I just missing something about why bumping the per-process file limit would affect roachprod get?

@craig
Contributor

craig bot commented Feb 4, 2022

Build failed (retrying...):

@tbg
Member

tbg commented Feb 4, 2022

bors r-

Broke CI

@craig
Contributor

craig bot commented Feb 4, 2022

Canceled.

@miretskiy
Contributor

@petermattis #76076 might be a better fix.

I think the thing that got lost here is that we are running "workload" using the cockroach binary. So cockroach workload tpcc ... could easily go over the 65K limit.

The above-linked PR makes the limit configurable, without changing the default we have set for roachprod.

@miretskiy
Contributor

@petermattis Also, I have no idea how this change is related to the roachprod get command, so I completely understand the confusion -- I'm confused myself.

@srosenberg
Member

Superseded by #76076

@srosenberg srosenberg closed this Feb 4, 2022
@srosenberg
Member

Am I just missing something about why bumping the per-process file limit would affect roachprod get?

It has nothing to do with roachprod get.

@rafiss rafiss deleted the increase_file_num_limit branch June 22, 2022 15:24