storage: range data is not being GCed in a timely manner #38251

Closed
andreimatei opened this issue Jun 17, 2019 · 15 comments · Fixed by #92118
Labels
A-kv Anything in KV that doesn't belong in a more specific category. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-kv KV Team

Comments

@andreimatei
Contributor

andreimatei commented Jun 17, 2019

A customer has a cluster with a humongous range (range 6 - the "system config" range) that seems to constantly hold around 3 days' worth of old values instead of 25h worth. Those extra two days amount to gigabytes of dead values.
The GC queue does not seem interested in processing the range. When enqueued manually, it says low score:
[screenshot: range6_enqueue, showing the low score reported when range 6 is manually enqueued]

It also says "likely last GC: 43h" ago, which I think is wrong (the real figure is more like 72h, which is also confirmed by the gcThreshold from the range stats).

Why the low score, I don't know yet. Perhaps something's busted with the scoring heuristics?

I've got the range data, all 7GB of it but highly compressed, and the range stats (internal only).
https://drive.google.com/drive/folders/1zgM9h5m3-voqV_Usm6fHxnuM8p4dK0LK?usp=sharing
There you can see, for example, all the versions of key /Table/3/1/4191/2/1, which range from June 14, 2019 11:17:25 AM to June 17, 2019 7:57:27 AM.

This is a spin-off of #38088.

gz#4843

Jira issue: CRDB-5642

@andreimatei andreimatei added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. S-1-stability Severe stability issues that can be fixed by upgrading, but usually don’t resolve by restarting A-kv Anything in KV that doesn't belong in a more specific category. labels Jun 17, 2019
@andreimatei
Contributor Author

One possibly interesting fact is that the range has only 7 keys with tons and tons of old versions. The other keys are well behaved.

@tbg
Member

tbg commented Jun 18, 2019

The shouldQueue method uses GCBytesAge to determine whether there are enough old keys to justify a GC run (= a full scan of the range). Unfortunately, GCBytesAge is only loosely related to actually being able to GC something: depending on how the old versions are distributed, GCBytesAge could be relatively high even though nothing can be deleted yet. The math involved in working out when we're guaranteed to be able to delete a significant fraction of data implies that we won't be deleting values eagerly when they go out of scope. See

// makeGCQueueScoreImpl is used to compute when to trigger the GC Queue.
for details, but the bottom line is that setting some TTL doesn't tell you that things will be deleted soon thereafter, just that they'll stick around at least until then.
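For readers following along, here is a deliberately simplified sketch of the relationship the score is built on (this is not the real makeGCQueueScoreImpl, which also accounts for intents, fuzzing, and other details; the function and numbers below are made up for illustration):

```go
package main

import (
	"fmt"
	"time"
)

// gcScore illustrates the core idea behind the GC queue's key score:
// every non-live byte accrues one byte-second of GCBytesAge per second,
// and GC only looks worthwhile once that accumulated age is a multiple
// of GCBytes * TTL. Hypothetical helper, not the real implementation.
func gcScore(gcBytesAge, gcBytes float64, ttl time.Duration) float64 {
	if gcBytes == 0 {
		return 0
	}
	return gcBytesAge / (gcBytes * ttl.Seconds())
}

func main() {
	const mb = float64(1 << 20)
	ttl := 25 * time.Hour
	// 500MB of garbage whose versions have been dead for 50h on average.
	age := 500 * mb * (50 * time.Hour).Seconds() // byte-seconds
	fmt.Printf("score = %.2f\n", gcScore(age, 500*mb, ttl)) // score = 2.00
}
```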

On the plus side, the screenshot above indicates that the range is very close to being GC'ed (it needs a score of 2).

I agree that none of this seems great, especially since it affects the system config range.

As a quick fix, I wonder if we can lower the TTL, which should linearly affect the score (i.e. halve the TTL, double the score) and thus the frequency at which GC runs.

@tbg
Member

tbg commented Jun 18, 2019

PS: even if we fixed the GC queue to be super aggressive, it seems this range would still be too big. You say there are "gigabytes" for three days, so there will still be vastly more than 64MB for one day, i.e. even with optimal GC the range seems to end up in the 1GB ballpark because data is written so quickly.

@andreimatei
Contributor Author

I now understand how the thing works; thanks for the pointers.

In my opinion, it's not a good thing that we only trigger GC based on a score normalized to the range's live data size. For example, imagine you've got two ranges: one configured to be 1GB in size, the other 64MB. Now imagine the 1GB range has 500MB of collectible garbage. We'd like to collect it even though it's relatively smaller than the live data, wouldn't we?

It's not clear to me how to improve the algorithm, though. How about if we maintained more granular statistics than a single GCBytesAge? For example, we could maintain stats for every hour - how many bytes died during that hour.

@andreimatei andreimatei added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) and removed S-1-stability Severe stability issues that can be fixed by upgrading, but usually don’t resolve by restarting C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. labels Jun 18, 2019
@andreimatei
Contributor Author

Also a question - given that gcKeyScoreThreshold is 2, do I understand correctly that when you drop a table it'll actually take twice the TTL until that range is collected? I don't quite follow how DeadFraction factors into the computation.

@tbg
Member

tbg commented Jun 19, 2019

When you drop a table, hopefully it gets collected via ClearRange which sidesteps GC completely. If it doesn't (perhaps there's some interleaving going on) yes, it'll take roughly 2x the TTL. Say you have 1GB of data and say it was all written at t=0 and it gets dropped at t=24h. That contributes a GCBytesAge of 24 GB*h, so Age/(Junk*TTL) = 1. Another 24h later and you'll hit 2.
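Spelling that arithmetic out with the same simplified score formula as in the earlier sketch (illustrative only, hypothetical code):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	const gb = float64(1 << 30)
	ttl := 24 * time.Hour

	// 1GB written at t=0 and dropped at t=24h. From the drop onward every
	// byte is non-live and accrues GCBytesAge at one byte-second per second.
	for _, sinceDrop := range []time.Duration{12 * time.Hour, 24 * time.Hour, 48 * time.Hour} {
		gcBytesAge := gb * sinceDrop.Seconds()     // byte-seconds of dead data
		score := gcBytesAge / (gb * ttl.Seconds()) // Age / (Junk * TTL)
		fmt.Printf("%.0fh after the drop: score = %.1f\n", sinceDrop.Hours(), score)
	}
	// 12h after the drop: score = 0.5
	// 24h after the drop: score = 1.0
	// 48h after the drop: score = 2.0 (only now does the threshold of 2 fire)
}
```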

I think you know this, but just to make sure: the reason we don't set up GC so that it runs right at 1 (or 1.001) is that in the worst case of data distribution, we only get to delete a tiny fraction of the rows, but have to scan all of the data.

We'd like to collect it even though it's relatively smaller than the live data, wouldn't we?

The example is missing a comparison. Yeah, we'd want to collect this at some point, and we will. But we also have to scan 1GB to do it.

In any alternative strategy that you come up with, you have to handle the common case in which there is very little garbage. Say a 1GB range and 1MB of garbage. You'll have to collect that eventually; how do you determine when? You want to make absolutely sure you're not scanning 1GB for nothing (perhaps repeatedly), so you need some way to confirm that you're going to collect a significant fraction of that 1MB when you invest your CPU time. That's the main contribution of the current mechanism, and in turn we get the shortcoming that even though there may be something worth collecting, it might take ~an extra TTL until we do.

Agree that it's worth thinking about. At the moment, we are scanning each range once per day anyway (consistency checker), so there's an opportunity there to compute for sure what we can delete (and perhaps even do it right there).

@andreimatei
Contributor Author

When you drop a table, hopefully it gets collected via ClearRange which sidesteps GC completely

Aha, I forgot about ClearRange.

The example is missing a comparison. Yeah, we'd want to collect this at some point, and we will. But we also have to scan 1GB to do it.

The point is that we'd like to delete sooner, not later. So the example is a 1GB range with 500MB worth of garbage. As it stands, it will take 4 days until this garbage is collected (2 days until GCBytesAge=Junk*TTL and another two days to reach the factor of 2 threshold). I understand the cost of scanning, but still, if you conduct a poll I'm sure everybody will say collect the sucker sooner.

What really bothers me is that it takes an arbitrary amount of time to collect some garbage. If I delete a tiny row and nothing else, it'll be collected much later. And I don't have any knobs for it. Besides the resource usage questions, sometimes I really want my data wiped because of contracts or whatnot - think PCI compliance requiring that credit card numbers be deleted within some time window.
And I'm also unhappy about the reverse :) - I don't have a knob to say that if I can only collect one byte out of a 1GB range, I never want to collect that.

Basically, I find it hard to say when the thing that we currently do works well and when it doesn't. The only guarantee the current mechanism gives us is that when we collect, we'll collect something. And it has the reasonable behavior that, if there's more garbage relative to live data, it'll collect sooner than otherwise. But it's just vague because the statistic (the average age of a dead byte) is so coarse. It doesn't provide us with an estimate of how much garbage we can collect.

So thinking fundamentally about the problem, the quantity of non-live bytes in a range is a monotonic function of time, and we want the area under the curve for a prefix of it (up until the TTL). And we need to summarize that function somehow.
One simple engineering thing to do is keep a series of buckets representing time intervals, each maintaining the amount of bytes that died during that interval (say, 10 of them, each representing TTL/10). The left-most bucket is special, representing all the dead bytes older than the TTL. As time flows, we'd shift buckets towards the left and create new ones at the right.
With this data structure, we'd have an approximation of the garbage that's currently collectible, and we could build policies on top of it.
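A rough sketch of what such a bucketed "dead bytes histogram" could look like (purely illustrative; none of these types exist in the codebase, and a real version would have to be maintained incrementally as part of the MVCC stats):

```go
package gcsketch

import "time"

// deadBytesHistogram approximates "how many bytes died during each time
// interval". Bucket 0 is special: it accumulates all dead bytes older than
// the TTL, i.e. everything that is already collectible. Buckets 1..10 each
// cover TTL/10 of recent history, newest on the right.
type deadBytesHistogram struct {
	ttl      time.Duration
	buckets  [11]int64
	lastTick time.Time
}

// recordDeath is called when a version becomes non-live "now"; the bytes
// land in the newest bucket.
func (h *deadBytesHistogram) recordDeath(bytes int64) {
	h.buckets[len(h.buckets)-1] += bytes
}

// tick shifts buckets to the left as time passes; bytes falling off the
// left edge fold into bucket 0 (older than TTL, hence collectible).
func (h *deadBytesHistogram) tick(now time.Time) {
	width := h.ttl / time.Duration(len(h.buckets)-1)
	for h.lastTick.Add(width).Before(now) {
		h.buckets[0] += h.buckets[1]
		copy(h.buckets[1:], h.buckets[2:])
		h.buckets[len(h.buckets)-1] = 0
		h.lastTick = h.lastTick.Add(width)
	}
}

// collectible returns the approximate amount of garbage GC could remove
// right now, which is what a policy could act on directly.
func (h *deadBytesHistogram) collectible() int64 { return h.buckets[0] }
```

With something like this, the queue could trigger when collectible() exceeds some absolute or relative threshold, instead of inferring collectibility from a single average age.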

At the moment, we are scanning each range once per day anyway (consistency checker), there's an opportunity there to compute for sure what we can delete (and perhaps even doing it right there).

Yeah, I guess that would also shut me up, although it's not perfectly satisfying.

tbg added a commit to tbg/cockroach that referenced this issue Jun 26, 2019
This is a WIP because it doesn't actually implement anything it claims
yet. Instead, I wrote a big comment explaining how everything worked and
fit together, and left some TODOs in the code detailing what we'd want
to improve. Let's discuss and polish this on this PR first, and I'll
then amend it to actually implement the suggestions, updating/adding
tests in the process.

----

We noticed some unfortunate GC queue behavior in [38251] and I went back
to the code and found that there are several good improvements we can
make without completely switching up the way things work.

The point of contention was essentially that configuring a given TTL
does not imply that keys gc'able by that TTL will be GC'ed right when
they become GC'able. This is sort of intrinsic in the information we
have available when deciding when to GC, but we were leaving a factor
of two on the table. As a result, users should reasonably be able to
expect that data is deleted approximately after the TTL for large
deletions and approximately after twice the TTL for "spread out"
deletions (such as repeatedly updating the same key). Previously
these were 2x and 4x, respectively.

Additionally (though this did not matter for the present issue), the
impact of the fraction of dead bytes vs live bytes into the score (and
by extension the scheduling of GC) wasn't very principled and has been
reworked, the goal being to delay GC until the expected amount of freed
bytes represents a certain fraction of the live data left in the range.
This is to avoid repeatedly scanning large amounts of live data only
to remove very small amounts of gc'able data.

[38251]: cockroachdb#38251

Release note: None
@BramGruneir
Member

@tbg , any update on this issue?

@tbg
Member

tbg commented Dec 8, 2020

No. There is some work-in-progress here but I am not working on it: #38432
Somewhat independently, @sumeerbhola has been exploring improving when disk space is actually reclaimed.

@jlinder jlinder added the T-kv KV Team label Jun 16, 2021
@tbg tbg unassigned andreimatei and tbg May 31, 2022
@daniel-crlabs
Contributor

I accidentally found this GH issue and noticed I have a ticket from a customer about exactly this behavior. They have a table with 360GB+ of data spread over 2300+ ranges, and half of the data in these ranges (as per the range JSON files) is considered garbage, yet it is not removed, which is bringing the cluster dangerously close to running out of disk.

Linking the support issue here for visibility, and hoping this somehow gets improved. Based on the explanations above this appears to be expected behavior (which should probably be documented more clearly so customers are aware of it):

Cluster storage remains high #1882

@tbg
Member

tbg commented Nov 7, 2022

I think we can set this value here to 1:

mvccGCKeyScoreThreshold = 2

Since this issue has been open for a while, a small refresher on the problem: we want to avoid the GC triggering before there's work to do, to avoid spinning.

So we need an invariant: if the GC score indicates a GC, carrying out that GC must reduce the score sufficiently to not trigger another GC. We compute such a score (valScore here). The invariant associated with that score is that GC will always bring it to <1. So if we queue only when it's >= 1, we have what we want (no tight loop of GC). But we actually only trigger when the score hits 2!

This was done out of an abundance of caution when I introduced this score and in hindsight has done more harm than good. The "worst case" for this score is a large range where everything is recently deleted (i.e. not gc'able), but there is one outlier KV pair that was deleted a very, very long time ago. This would drive the score up above the threshold of 1 and yet, when you GC, all you get to delete is that one pair. But, doesn't matter, the resulting score will be << 1. The hope was that a factor of 2 would give a better balance, where we delay GC until there's more work to do (GC has to scan the entire range, i.e. it's expensive).

However, at this point we have a GC cooldown period (per replica) of 1h, so tight loops aren't as much of an issue. The original reasoning for picking 2 over the optimal 1 just isn't very convincing at all. There is every indication that 1 makes a fine cut-off.

With a cut-off of 1, if at time t=0, a customer deletes the entire contents of a range, then once the TTL passes, it will become GC'able (and will be GC'ed subject to the queue coming around to take a look). This is what customers expect and we should give it to them.

Another interesting example to think through is a range in which there's only one row that gets overwritten once per second. At a 1h TTL and assuming GC as soon as the score hits 1.0, how much history will we keep?

Let's assume a GC run happened at t=0 and it's t=T now. The range will have 1h+T worth of writes, timestamped t=T, ..., t=0, t=-1, t=-2, ..., t=-3600, contributing GCBytesAge of 1, 2, ..., 3600+T. The size of the writes cancels out in GCBytesAge/(ttl*GCBytes), which comes out to roughly 0.5+T/2h. In other words, the next GC will happen around the T=1h mark again (when the score hits 1), but of course it will only get to remove half of the values (even though all are deleted). With the previous factor of two, we'd make GC more worth our while, but of course we can't ever hope to remove all old versions since at most an hour's worth of them is GC'able.
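A quick brute-force check of that estimate (standalone snippet, not CockroachDB code; unit-sized writes, 1h TTL):

```go
package main

import "fmt"

func main() {
	const ttl = 3600 // seconds

	// scoreAt computes Age/(TTL*GCBytes) at T seconds after a GC run for a
	// single key that is overwritten once per second. The surviving versions
	// are timestamped -ttl..T; all but the newest are garbage, and a version
	// written at t is superseded (starts aging) at t+1.
	scoreAt := func(T int) float64 {
		var age, gcBytes float64
		for t := -ttl; t < T; t++ {
			gcBytes++
			age += float64(T - (t + 1))
		}
		return age / (ttl * gcBytes)
	}

	for _, T := range []int{0, 1800, 3600, 7200} {
		fmt.Printf("T=%5ds: score=%.2f\n", T, scoreAt(T))
	}
	// Prints scores of roughly 0.50, 0.75, 1.00, 1.50: the score crosses 1
	// around T=1h, matching the 0.5 + T/2h approximation above.
}
```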

Long story short, I think we can change the above factor from 2 to 1.

@erikgrinaker
Contributor

at this point we have a GC cooldown period (per replica) of 1h, so tight loops aren't as much of an issue

If you're referring to the cooldown I added a while back, that was only for the intent score, not the overall score:

// mvccGCQueueIntentCooldownDuration is the duration to wait between MVCC GC
// attempts of the same range when triggered solely by intents. This is to
// prevent continually spinning on intents that belong to active transactions,
// which can't be cleaned up.
mvccGCQueueIntentCooldownDuration = 2 * time.Hour

But generalizing it should be straightforward.
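As a sketch of what "generalizing it" could look like (names invented for illustration; the real shouldQueue plumbing is different), the same cooldown idea can be applied regardless of what drove the score:

```go
package gcsketch

import "time"

// Hypothetical generalized cooldown: skip a replica entirely if we attempted
// MVCC GC on it recently, no matter whether intents or the key/value score
// would trigger it now. Neither name below exists in the actual code.
const mvccGCQueueCooldownDuration = 2 * time.Hour

func shouldQueueForMVCCGC(now, lastGCAttempt time.Time, score float64) bool {
	if now.Sub(lastGCAttempt) < mvccGCQueueCooldownDuration {
		return false // attempted GC recently; back off to avoid tight loops
	}
	// With the cooldown in place, the threshold can safely sit at (or just
	// above) 1 instead of 2, as discussed above.
	return score >= 1
}
```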

@aliher1911
Contributor

Long story short, I think we can change the above factor from 2 to 1.

Wouldn't that cause large deletions to always require two GC runs? If any data was deleted before the large deletion, we'll have GC triggered just before we reach the threshold and then have another go after the cooldown. Maybe having it at 1.1 would be better?

@erikgrinaker
Contributor

Something slightly larger than 1 makes sense. Although there is also a random fuzz factor of up to 5%, so 1.05 might be sufficient. Let's run a few experiments to verify.

@blathers-crl

blathers-crl bot commented Nov 8, 2022

cc @cockroachdb/replication
