
slow 8tb restore #99206

Closed
tbg opened this issue Mar 22, 2023 · 1 comment
Labels
A-disaster-recovery · C-bug · T-disaster-recovery

Comments

tbg (Member) commented Mar 22, 2023

Describe the problem

I ran #98767, i.e. the 8tb restore test, but without the unusual RAID0 setup (using gp3 at 125MB/s instead). It passed 7 out of 10 runs; the three failures are documented here for further investigation.

Full artifacts: https://drive.google.com/drive/folders/18HBSubR9dTZG48gcW6AdU_PSG3rX-RVs?usp=sharing (extract with tar -xjvf <file>)

Run 6

Everything looks balanced.

The nodes are consuming quite a bit of CPU: [image]

RAM, leaseholders, etc., are balanced. No LSM inversion. Admission control is not active as far as I can tell.

The ingest seems to hit some sort of blocker; you can see that live bytes just stops growing at the previous rate:

[image]

At the same time as this change, the log commit latencies drop from the ~200ms range to mostly single-digit milliseconds. So we're just not throwing much at the cluster any more.

This seems like the kind of investigation the DR team could take the lead on; I'm unsure how to glean from the artifacts what progress the restore job was making. It's possible we just ended up distributing the load poorly (or something like that), or that the S3 data source slowed down, etc.
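
One way to capture that progress on future runs is to poll SHOW JOBS while the restore is running. A minimal Go sketch, assuming an insecure node reachable at localhost:26257 and the lib/pq driver (the connection details are assumptions, not taken from this test setup):

```go
// Poll restore job progress via SHOW JOBS. Sketch only; the connection
// string and node address are assumptions, not part of the test harness.
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq" // Postgres-wire driver; works against CockroachDB.
)

func main() {
	db, err := sql.Open("postgres",
		"postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// SHOW JOBS can be used as a data source; filter down to RESTORE jobs.
	rows, err := db.Query(
		`SELECT job_id, status, fraction_completed FROM [SHOW JOBS] WHERE job_type = 'RESTORE'`)
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	for rows.Next() {
		var id int64
		var status string
		var frac float64
		if err := rows.Scan(&id, &status, &frac); err != nil {
			log.Fatal(err)
		}
		fmt.Printf("restore job %d: %s, %.1f%% complete\n", id, status, frac*100)
	}
	if err := rows.Err(); err != nil {
		log.Fatal(err)
	}
}
```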

I glanced through the logs and the cluster looks happy; we just stop loading it at the same rate at some point:

[image]

Run 8

run_003404.366381060_n1_cockroach-sql-insecu: 00:40:18 cluster.go:1991: > result: ./cockroach sql --insecure -e "RESTORE  FROM LATEST IN 's3://cockroach-fixtures/backups/tpc-e/customers=500000/v22.2.1/inc-count=48?AUTH=implicit' AS OF SYSTEM TIME '2023-01-05T17:30:00Z' " returned: COMMAND_PROBLEM: ssh verbose log retained in ssh_003404.373614968_n1_cockroach-sql-insecu.log: exit status 1
(1) ./cockroach sql --insecure -e "RESTORE  FROM LATEST IN 's3://cockroach-fixtures/backups/tpc-e/customers=500000/v22.2.1/inc-count=48?AUTH=implicit' AS OF SYSTEM TIME '2023-01-05T17:30:00Z' " returned
  | stderr:
  | ERROR: importing 14408 ranges: splitting key /Table/140/1/200001370685517/"CMPT": unable to find store 4 in range r14863:/Table/140/1/2000013{07912147/"CMPT"-76653510/"SBMT"} [(n6,s6):1, (n1,s1):7, (n10,s10):8, next=9, gen=453, sticky=1679448977.911505452,0]
  | Failed running "sql"

Run 9

This looks much like run 6; I did not investigate further.

[image]

Jira issue: CRDB-25755

tbg added the C-bug and T-kv (KV Team) labels Mar 22, 2023
tbg changed the title from "kvserver: imbalanced 8tb restore" to "kvserver: slow 8tb restore" Mar 22, 2023
tbg added the T-disaster-recovery label and removed the T-kv (KV Team) label Mar 22, 2023
blathers-crl bot commented Mar 22, 2023

cc @cockroachdb/disaster-recovery

tbg changed the title from "kvserver: slow 8tb restore" to "slow 8tb restore" Mar 22, 2023
craig bot pushed a commit that referenced this issue Apr 3, 2023
97589: backupccl: send chunks with fail scatters to random node in generative ssp r=rhu713 a=rhu713

For chunks that have failed to scatter, this patch routes the chunk to a random node instead of the current node. This is necessary because, prior to the generative version, split and scatter processors ran on every node, so routing failed-to-scatter chunks to the current node introduced no imbalance. The new generative split and scatter processor runs on only one node, and would therefore cause that same node to process all chunks that failed to scatter.

Addresses runs 6 and 9 of #99206

Release note: None

98978: upgrades: add a missing unit test r=adityamaru a=knz

My previous change in this area failed to add the unit test.

Epic: CRDB-23559
Release note: None

Co-authored-by: Rui Hu <[email protected]>
Co-authored-by: Raphael 'kena' Poss <[email protected]>
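
The routing idea in #97589 can be illustrated with a small sketch; the names and types below are hypothetical and not the actual backupccl code. When a chunk's scatter fails, the destination is picked uniformly at random from the candidate nodes instead of defaulting to the node running the single generative processor:

```go
// Hypothetical sketch of the routing change described in #97589; names and
// signatures are illustrative, not the real backupccl implementation.
package main

import (
	"fmt"
	"math/rand"
)

type nodeID int

// routeChunk returns the node that should ingest the chunk. If the scatter
// succeeded, the chunk follows the range to the node it was scattered to.
// If the scatter failed, the old behavior sent the chunk to the current
// (single, generative) processor node, concentrating all failed-scatter work
// there; the new behavior picks a random candidate node instead.
func routeChunk(scatteredTo nodeID, scatterErr error, current nodeID, candidates []nodeID) nodeID {
	if scatterErr == nil {
		return scatteredTo
	}
	if len(candidates) == 0 {
		return current // nothing better to do; fall back to the local node
	}
	return candidates[rand.Intn(len(candidates))]
}

func main() {
	candidates := []nodeID{1, 2, 3, 4, 5}
	// Simulate a failed scatter: the chunk is routed to a random node rather
	// than always to node 1, which runs the generative processor.
	dst := routeChunk(0, fmt.Errorf("scatter failed"), 1, candidates)
	fmt.Println("routing failed-scatter chunk to node", dst)
}
```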
dt closed this as not planned Oct 17, 2023