slow 8tb restore #99206
Labels: A-disaster-recovery, C-bug (Code not up to spec/doc, specs & docs deemed correct; solution expected to change code/behavior), T-disaster-recovery
Comments

tbg added the C-bug and T-kv (KV Team) labels on Mar 22, 2023
cc @cockroachdb/disaster-recovery
craig bot pushed a commit that referenced this issue on Apr 3, 2023:

97589: backupccl: send chunks with fail scatters to random node in generative ssp r=rhu713 a=rhu713

For chunks that have failed to scatter, this patch routes the chunk to a random node instead of the current node. This is necessary as, prior to the generative version, split and scatter processors were on every node, so there was no imbalance introduced from routing chunks that have failed to scatter to the current node. The new generative split and scatter processor is only on one node, and thus would cause the same node to process all chunks that have failed to scatter.

Addresses runs 6 and 9 of #99206

Release note: None

98978: upgrades: add a missing unit test r=adityamaru a=knz

My previous change in this area failed to add the unit test.

Epic: CRDB-23559
Release note: None

Co-authored-by: Rui Hu <[email protected]>
Co-authored-by: Raphael 'kena' Poss <[email protected]>
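The routing change described in #97589 can be illustrated with a small sketch. This is not the actual backupccl code; the `chunk` type, the node IDs, and the `routeChunk` helper below are simplified stand-ins for the real split-and-scatter processor types, and only show the idea of sending failed-scatter chunks to a random node rather than always keeping them on the coordinator.

```go
// Minimal sketch (assumed types, not the CockroachDB implementation) of the
// routing change in #97589: when a chunk's scatter fails, pick a random
// destination node instead of the current node, so a single generative
// split-and-scatter processor does not funnel every failed-scatter chunk
// onto its own node.
package main

import (
	"fmt"
	"math/rand"
)

// chunk is a simplified stand-in for a restore span chunk.
type chunk struct {
	id           int
	scatteredTo  int  // node chosen by a successful scatter
	scatterError bool // true if the scatter attempt failed
}

// routeChunk decides which node should ingest the chunk. curNode is the node
// running the split-and-scatter processor; allNodes is the set of nodes
// participating in the restore.
func routeChunk(c chunk, curNode int, allNodes []int) int {
	if !c.scatterError {
		// Scatter succeeded: route to the node the scatter picked.
		return c.scatteredTo
	}
	// Scatter failed: the pre-#97589 behavior effectively returned curNode
	// here, which overloads that node when only one processor exists.
	// Instead, spread the load by picking a random node.
	return allNodes[rand.Intn(len(allNodes))]
}

func main() {
	nodes := []int{1, 2, 3, 4}
	failed := chunk{id: 7, scatterError: true}
	fmt.Println("failed-scatter chunk routed to node", routeChunk(failed, 1, nodes))
}
```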
Describe the problem
I ran #98767, i.e. the 8TB restore test, but not using the weird RAID0 setup (instead using gp3 at 125MB/s). This passed 7/10, and the three failures are documented here for further investigation.
Full artifacts: https://drive.google.com/drive/folders/18HBSubR9dTZG48gcW6AdU_PSG3rX-RVs?usp=sharing (extract with `tar -xjv <file>`)

Run 6
Everything looks balanced.
The nodes are consuming quite a bit of CPU.
RAM, leaseholders, etc. are balanced. There is no LSM inversion, and admission control does not appear to be active as far as I can tell.
The ingest seems to hit some sort of blocker; you can see that live bytes simply stops growing at the previous rate.
At the same time as this change, the log commit latencies go from the ~200ms range to mostly single-digit milliseconds, so we're just not throwing much at the cluster anymore.
This seems like the kind of investigation the DR team could take the lead on. I'm unsure how to glean from the artifacts what progress the restore job is making; it's possible we just ended up distributing the load poorly (or something like that), or that the S3 data source slowed down, etc.
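As a hedged sketch (not something taken from the test artifacts), one way to watch restore job progress from the cluster itself is to poll SHOW JOBS over the SQL interface. The connection string, the assumption that a node is reachable at localhost:26257, and the use of the lib/pq driver below are illustrative.

```go
// Sketch: poll restore job progress via SHOW JOBS. Column names
// (job_id, status, fraction_completed, job_type) follow the SHOW JOBS
// output; adjust if your CockroachDB version differs.
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq" // Postgres-compatible driver for CockroachDB
)

func main() {
	// Assumed connection string; point it at any node of the test cluster.
	db, err := sql.Open("postgres",
		"postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	rows, err := db.Query(
		`SELECT job_id, status, COALESCE(fraction_completed, 0.0)
		   FROM [SHOW JOBS]
		  WHERE job_type = 'RESTORE'`)
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	for rows.Next() {
		var id int64
		var status string
		var frac float64
		if err := rows.Scan(&id, &status, &frac); err != nil {
			log.Fatal(err)
		}
		fmt.Printf("restore job %d: %s, %.1f%% complete\n", id, status, frac*100)
	}
	if err := rows.Err(); err != nil {
		log.Fatal(err)
	}
}
```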
I glanced through the logs and the cluster looks happy; we just stop loading it at the same rate at some point.
Run 8
Run 9
This looks much like run 6; I did not investigate further.
Jira issue: CRDB-25755