
sql,kv: column backfills can get stuck in a failed state when ranges are too large and unsplittable #51949

Closed
ajwerner opened this issue Jul 27, 2020 · 2 comments
Labels
A-disaster-recovery A-schema-changes C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. no-issue-activity T-disaster-recovery X-stale

Comments

@ajwerner (Contributor) commented Jul 27, 2020

Describe the problem

When a range gets too large (2x its range_max_bytes by default, controlled by kv.range.backpressure_range_size_multiplier), we enable a backpressure mechanism to prevent writes from making it larger, under the assumption that the split queue will come around and split it. Split boundaries are not allowed to split up the versions of a key, so in cases where the entire range is attributable to a single key and its many versions, we won't split, and we will not alleviate the backpressure until GC occurs. Even in normal cases this isn't great, but the error is relatively clear and the mitigation is generally to lower the GC TTL. Furthermore, much work has been done in v20.1 and later to make GC of many versions less problematic.
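For reference, a minimal sketch of the knobs involved (the table name `t` is hypothetical):

```sql
-- The multiplier that determines when backpressure kicks in (2x by default).
SHOW CLUSTER SETTING kv.range.backpressure_range_size_multiplier;

-- range_max_bytes comes from the zone configuration for the table's zone.
SHOW ZONE CONFIGURATION FOR TABLE t;
```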

One situation where this backpressure is particularly problematic, and not particularly visible, is when the writes to the range come from a column backfill, as may occur when adding a computed column or a column with a default value. In these cases, the job creating the column will fail and then trigger a rollback job, which will get caught retrying until the backpressure lets up. This isn't great.
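For example, either of these statements kicks off a column backfill that rewrites every existing row (table and column names are illustrative):

```sql
-- Both of these write a value into every existing row, so all of the
-- backfill's writes are subject to the range's backpressure.
ALTER TABLE t ADD COLUMN c INT AS (k + 1) STORED;
ALTER TABLE t ADD COLUMN d INT DEFAULT 0;
```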

To Reproduce

  • Create a range with a single key.
  • Set range_max_bytes to just less than half the size of the range.
  • Attempt to add a computed column to the table (see the sketch below).
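A rough sketch of those steps, assuming a table `t`; the sizes and the method of piling up MVCC versions are illustrative, not exact:

```sql
-- 1. A range whose data is a single key.
CREATE TABLE t (k INT PRIMARY KEY, v STRING);
INSERT INTO t VALUES (1, repeat('x', 1 << 20));

-- 2. Accumulate MVCC versions of that one key; run this repeatedly
--    until the range is well past 2x the range_max_bytes chosen below.
UPDATE t SET v = repeat('y', 1 << 20) WHERE k = 1;

-- 3. Shrink range_max_bytes to just under half the range's size.
ALTER TABLE t CONFIGURE ZONE USING
    range_min_bytes = 1048576, range_max_bytes = 67108864;

-- 4. This backfill should now get stuck behind backpressure.
ALTER TABLE t ADD COLUMN w INT AS (k + 1) STORED;
```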

Expected behavior

Not really clear; I suppose we could try to avoid this situation in the first place. This problem will go away if we adopt #47989, which is my preferred approach. In the meantime, the best resolution is to shorten the GC TTL and clear out some of those versions. In a real pinch, one can increase kv.range.backpressure_range_size_multiplier to allow the job to proceed, at the hazard of creating a very large range. In 20.1 and later we've ensured that GC of these large ranges does not buffer keys in RAM, and it has been tested up to many gigabytes of keys; in 19.2 and earlier it buffers all of the keys in RAM.
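A sketch of those mitigations, with illustrative values (not recommendations):

```sql
-- Shorten the GC TTL so old versions of the oversized key become
-- collectible sooner (the default is 25 hours).
ALTER TABLE t CONFIGURE ZONE USING gc.ttlseconds = 600;

-- Last resort: raise the backpressure threshold so the backfill can
-- proceed, at the hazard of letting the range grow even larger.
SET CLUSTER SETTING kv.range.backpressure_range_size_multiplier = 4;
```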

Epic CRDB-8816

Jira issue: CRDB-3995

@blathers-crl bot commented Jul 27, 2020

Hi @ajwerner, please add a C-ategory label to your issue. Check out the label system docs.

While you're here, please consider adding an A- label to help keep our repository tidy.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

@ajwerner ajwerner added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-disaster-recovery A-schema-changes labels Aug 14, 2020
@github-actions

We have marked this issue as stale because it has been inactive for 18 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 10 days to keep the issue queue tidy. Thank you for your contribution to CockroachDB!
