slow 8tb restore #99206
Labels: A-disaster-recovery, C-bug (Code not up to spec/doc, specs & docs deemed correct; solution expected to change code/behavior), T-disaster-recovery
Comments

tbg added the C-bug and T-kv (KV Team) labels on Mar 22, 2023
cc @cockroachdb/disaster-recovery
craig bot pushed a commit that referenced this issue on Apr 3, 2023:

97589: backupccl: send chunks with fail scatters to random node in generative ssp r=rhu713 a=rhu713

For chunks that have failed to scatter, this patch routes the chunk to a random node instead of the current node. This is necessary as, prior to the generative version, split and scatter processors were on every node, so there was no imbalance introduced from routing chunks that have failed to scatter to the current node. The new generative split and scatter processor is only on one node, and thus would cause the same node to process all chunks that have failed to scatter.

Addresses runs 6 and 9 of #99206

Release note: None

98978: upgrades: add a missing unit test r=adityamaru a=knz

My previous change in this area failed to add the unit test.

Epic: CRDB-23559
Release note: None

Co-authored-by: Rui Hu <[email protected]>
Co-authored-by: Raphael 'kena' Poss <[email protected]>
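The routing change described in #97589 can be illustrated with a small sketch. This is not the actual backupccl code; the `chunk` type, the node IDs, and the `routeChunk` helper below are simplified stand-ins for the real split-and-scatter processor types, and only show the idea of sending failed-scatter chunks to a random node rather than always keeping them on the coordinator.

```go
// Minimal sketch (assumed types, not the CockroachDB implementation) of the
// routing change in #97589: when a chunk's scatter fails, pick a random
// destination node instead of the current node, so a single generative
// split-and-scatter processor does not funnel every failed-scatter chunk
// onto its own node.
package main

import (
	"fmt"
	"math/rand"
)

// chunk is a simplified stand-in for a restore span chunk.
type chunk struct {
	id           int
	scatteredTo  int  // node chosen by a successful scatter
	scatterError bool // true if the scatter attempt failed
}

// routeChunk decides which node should ingest the chunk. curNode is the node
// running the split-and-scatter processor; allNodes is the set of nodes
// participating in the restore.
func routeChunk(c chunk, curNode int, allNodes []int) int {
	if !c.scatterError {
		// Scatter succeeded: route to the node the scatter picked.
		return c.scatteredTo
	}
	// Scatter failed: the pre-#97589 behavior effectively returned curNode
	// here, which overloads that node when only one processor exists.
	// Instead, spread the load by picking a random node.
	return allNodes[rand.Intn(len(allNodes))]
}

func main() {
	nodes := []int{1, 2, 3, 4}
	failed := chunk{id: 7, scatterError: true}
	fmt.Println("failed-scatter chunk routed to node", routeChunk(failed, 1, nodes))
}
```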
Describe the problem
I ran #98767, i.e. the 8TB restore test, but not using the weird RAID0 setup (instead using gp3 at 125MB/s). This passed 7/10, and the three failures are documented here for further investigation.
Full artifacts: https://drive.google.com/drive/folders/18HBSubR9dTZG48gcW6AdU_PSG3rX-RVs?usp=sharing (extract with `tar -xjv <file>`)

Run 6
Everything looks balanced.
The nodes are consuming quite a bit of CPU.
RAM, leaseholders, etc. are balanced. There is no LSM inversion, and admission control does not appear to be active as far as I can tell.
The ingest seems to hit some sort of blocker; you can see that live bytes simply stops growing at the previous rate.
At the same time as this change, the log commit latencies go from the ~200ms range to mostly single-digit milliseconds, so we're just not throwing much at the cluster anymore.
This seems like the kind of investigation the DR team could take the lead on. I'm unsure how to glean from the artifacts what progress the restore job is making; it's possible we just ended up distributing the load poorly (or something like that), or that the S3 data source slowed down, etc.
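As a hedged sketch (not something taken from the test artifacts), one way to watch restore job progress from the cluster itself is to poll SHOW JOBS over the SQL interface. The connection string, the assumption that a node is reachable at localhost:26257, and the use of the lib/pq driver below are illustrative.

```go
// Sketch: poll restore job progress via SHOW JOBS. Column names
// (job_id, status, fraction_completed, job_type) follow the SHOW JOBS
// output; adjust if your CockroachDB version differs.
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq" // Postgres-compatible driver for CockroachDB
)

func main() {
	// Assumed connection string; point it at any node of the test cluster.
	db, err := sql.Open("postgres",
		"postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	rows, err := db.Query(
		`SELECT job_id, status, COALESCE(fraction_completed, 0.0)
		   FROM [SHOW JOBS]
		  WHERE job_type = 'RESTORE'`)
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	for rows.Next() {
		var id int64
		var status string
		var frac float64
		if err := rows.Scan(&id, &status, &frac); err != nil {
			log.Fatal(err)
		}
		fmt.Printf("restore job %d: %s, %.1f%% complete\n", id, status, frac*100)
	}
	if err := rows.Err(); err != nil {
		log.Fatal(err)
	}
}
```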
I glanced through the logs and the cluster looks happy; we just stop loading it at the same rate at some point.
Run 8
Run 9
This looks much like run 6; I did not investigate further.
Jira issue: CRDB-25755