Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gc: paginate GC of local key ranges #45444

Merged
merged 2 commits into from
Feb 28, 2020
Merged

Conversation

tbg
Copy link
Member

@tbg tbg commented Feb 26, 2020

Touches #40815.

I'm not calling it fixed yet because there are two remaining issues:

  1. GC will not trigger automatically based on a large abort span. This
    means that manual intervention is needed in such a case.
  2. The local key ranges are processed after the user key ranges.
    Even if no user keys are GC'able, all of it has to be scanned
    regardless (to arrive at that conclusion). With a large enough
    range size, this scan could theoretically consume the entirety
    of the time budget.

However, in practice we expect operators to be able to "fix" any
range via a manual GC once this PR is merged. Follow-up work needs
to be done to address 1) and 2) above.

Release note (bug fix): Improved the ability of garbage collection
process to make process through ranges exhibiting abnormally large
numbers of transaction records and/or abort span entries.

@tbg tbg requested a review from ajwerner February 26, 2020 15:41
@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link
Contributor

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: nice and simple

One approach for 2 is to scale the timeout by the range size. We could conservatively estimate that we can scan and gc 10 MB/s or something like that.

Have any ideas for 1?

Reviewed 1 of 1 files at r1, 3 of 3 files at r2.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained

@tbg
Copy link
Member Author

tbg commented Feb 26, 2020

re:2, I like your suggestion but I was also thinking that we should process the abort span first, at least if it's large (but then why not always).

re 1), at least heuristically we could just say that "if sysbytes > 64mb then run GC". This backfires if we can get >64mb of sysbytes for any other reason than a large gc'able abort span (or tons of old aborted txns). Tracking the abort span and txn record count in the stats should be doable, though, and given how much trouble we've had with them it's probably overdue: #27055

@tbg
Copy link
Member Author

tbg commented Feb 26, 2020

bors r=ajwerner
TFTR!

@ajwerner
Copy link
Contributor

Thanks for picking this up so quickly!

re:2, I like your suggestion but I was also thinking that we should process the abort span first, at least if it's large (but then why not always).

SGTM. I could see doing both.

re 1), at least heuristically we could just say that "if sysbytes > 64mb then run GC". This backfires if we can get >64mb of sysbytes for any other reason than a large gc'able abort span (or tons of old aborted txns). Tracking the abort span and txn record count in the stats should be doable, though, and given how much trouble we've had with them it's probably overdue: #27055

It's probably worth giving the more complete solution a bit of time and only reverting to this heuristic if doing something better proves difficult.

@craig
Copy link
Contributor

craig bot commented Feb 26, 2020

Build failed (retrying...)

@tbg
Copy link
Member Author

tbg commented Feb 27, 2020

bors r=ajwerner

@tbg
Copy link
Member Author

tbg commented Feb 27, 2020

bors r=ajwerner

tbg added a commit to tbg/cockroach that referenced this pull request Feb 27, 2020
Partial (manual) backport of cockroachdb#45444.

Release note: None
tbg added a commit to tbg/cockroach that referenced this pull request Feb 27, 2020
Partial (manual) backport of cockroachdb#45444.

Release note: None
@tbg
Copy link
Member Author

tbg commented Feb 27, 2020

bors r=ajwerner

one of these days

tbg added 2 commits February 28, 2020 13:01
This will come in handy in a future commit.

Release note: None
Touches cockroachdb#40815.

I'm not calling it fixed yet because there are two remaining issues:

1. GC will not trigger automatically based on a large abort span. This
   means that manual intervention is needed in such a case.
2. The local key ranges are processed after the user key ranges.
   Even if no user keys are GC'able, all of it has to be scanned
   regardless (to arrive at that conclusion). With a large enough
   range size, this scan could theoretically consume the entirety
   of the time budget.

However, in practice we expect operators to be able to "fix" any
range via a manual GC once this PR is merged. Follow-up work needs
to be done to address 1) and 2) above.

Release note (bug fix): Improved the ability of garbage collection
process to make process through ranges exhibiting abnormally large
numbers of transaction records and/or abort span entries.
@tbg tbg force-pushed the gcqueue-paginate branch from b09485e to 4c86748 Compare February 28, 2020 12:01
@tbg
Copy link
Member Author

tbg commented Feb 28, 2020

bors r=ajwerner

@craig
Copy link
Contributor

craig bot commented Feb 28, 2020

Build succeeded

@craig craig bot merged commit 22544bf into cockroachdb:master Feb 28, 2020
tbg added a commit to tbg/cockroach that referenced this pull request Mar 2, 2020
Partial (manual) backport of cockroachdb#45444.

Release note: None
@tbg tbg deleted the gcqueue-paginate branch March 2, 2020 13:40
tbg added a commit to tbg/cockroach that referenced this pull request Apr 3, 2020
Partial (manual) backport of cockroachdb#45444.

Release note: None
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants