storage: range data is not being GCed in a timely manner #38251
One possibly interesting fact is that the range has only 7 keys with tons and tons of old versions. The other keys are well behaved.
The shouldQueue method uses the GCBytesAge to determine whether there are enough old keys to justify a GC run (i.e., a full scan of the range). Unfortunately, the GCBytesAge is only loosely related to actually being able to GC something: depending on how the old versions are distributed, the GCBytesAge can be relatively high even though nothing can be deleted yet. The math involved in working out when we're guaranteed to be able to delete a significant fraction of the data implies that we won't be deleting values eagerly when they go out of scope. See cockroach/pkg/storage/gc_queue.go, line 179 (at 7e2ceae).
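To make the discussion easier to follow, here is a minimal sketch of a GCBytesAge-based score. This is not the actual cockroach implementation; the function name and the exact normalization are illustrative, based only on the description in this thread (age of dead data normalized by the amount of dead data and the TTL):

```go
package main

import "fmt"

// gcByteAgeScore is a simplified model of the value score discussed in this
// thread: GCBytesAge (byte-seconds accumulated by non-live data) divided by
// the amount of non-live data and by the TTL. Intuitively, a score of 1 means
// the average dead byte has already been dead for one full TTL; the queue as
// described waits for a score of 2 before processing the range.
func gcByteAgeScore(gcBytesAge, gcBytes, ttlSeconds float64) float64 {
	if gcBytes == 0 || ttlSeconds == 0 {
		return 0
	}
	return gcBytesAge / (gcBytes * ttlSeconds)
}

func main() {
	// 1GB of data whose average "dead time" is 24h, with a 25h TTL: the
	// average dead byte is just shy of one TTL old, so the score is ~0.96.
	const gb = float64(1 << 30)
	fmt.Println(gcByteAgeScore(24*3600*gb, gb, 25*3600))
}
```

Note that the TTL sits in the denominator, so halving the TTL doubles the score, which is the quick fix mentioned just below.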
On the plus side, the screenshot above indicates that the range is very close to being GC'ed (it needs a score of 2). I agree that none of this seems great, especially since it affects the system config range. As a quick fix, I wonder if we can lower the TTL, which should linearly affect the score (i.e. halve the TTL, double the score) and thus the frequency at which GC runs.
PS: even if we made the GC queue super aggressive, it seems this range would still be too big. You say there are "gigabytes" for three days, so there will still be vastly more than 64MB for one day, i.e. even with optimal GC the range ends up around 1GB in size because data is written so quickly.
I now understand how the thing works; thanks for the pointers. In my opinion, it's not a good thing that we only trigger GC based on a score normalized to the range's live data size. For example, imagine you've got two ranges: one configured to be 1GB in size, the other 64MB. Now imagine the 1GB one has 500MB of collectible garbage. We'd like to collect it even though it's small relative to the live data, wouldn't we? It's not clear to me how to improve the algorithm, though. How about if we maintained more granular statistics than a single GCBytesAge? For example, we could maintain stats for every hour - how many bytes died during that hour (see the sketch below).
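A rough sketch of what such a per-hour statistic could look like. This is hypothetical (no such type exists in cockroach); the point is that summing the buckets older than the TTL yields an actual estimate of collectible bytes, which a single GCBytesAge number cannot provide:

```go
package main

import (
	"fmt"
	"time"
)

// deadBytesHistogram tracks, per hour, how many bytes became non-live during
// that hour. This is the "more granular statistics" idea from the comment above.
type deadBytesHistogram struct {
	buckets map[int64]int64 // unix hour -> bytes that died during that hour
}

func newDeadBytesHistogram() *deadBytesHistogram {
	return &deadBytesHistogram{buckets: make(map[int64]int64)}
}

// recordDeadBytes is called when a version becomes non-live (superseded or deleted).
func (h *deadBytesHistogram) recordDeadBytes(deadAt time.Time, bytes int64) {
	h.buckets[deadAt.Unix()/3600] += bytes
}

// collectibleBytes estimates how many bytes a GC run at `now` could remove:
// everything that became non-live more than one TTL ago.
func (h *deadBytesHistogram) collectibleBytes(now time.Time, ttl time.Duration) int64 {
	cutoff := now.Add(-ttl).Unix() / 3600
	var total int64
	for hour, bytes := range h.buckets {
		if hour < cutoff {
			total += bytes
		}
	}
	return total
}

func main() {
	h := newDeadBytesHistogram()
	now := time.Now()
	h.recordDeadBytes(now.Add(-30*time.Hour), 500<<20) // 500MB died 30h ago
	h.recordDeadBytes(now.Add(-2*time.Hour), 1<<20)    // 1MB died 2h ago
	// With a 25h TTL, only the 500MB is old enough to be collectible.
	fmt.Println(h.collectibleBytes(now, 25*time.Hour))
}
```

The obvious cost is that such buckets would presumably have to live in the replicated MVCC stats to stay consistent across replicas, which may be why the existing stats stick to a single GCBytesAge.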
Also a question - given that
When you drop a table, hopefully it gets collected via ClearRange, which sidesteps GC completely. If it doesn't (perhaps there's some interleaving going on), then yes, it'll take roughly 2x the TTL. Say you have 1GB of data, it was all written at t=0, and it gets dropped at t=24h. That contributes a GCBytesAge of 24 GB*h […]. I think you know this, but just to make sure: the reason we don't set up GC so that it runs right at 1 (or 1.001) is that, in the worst case of data distribution, we only get to delete a tiny fraction of the rows but have to scan all of the data.
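Spelling out the arithmetic of that example (a back-of-the-envelope sketch, assuming the score is GCBytesAge divided by the amount of non-live data times the TTL, with a trigger threshold of 2 as described earlier in the thread):

```latex
% 1GB written at t = 0 and dropped at t = 24h, TTL = 25h. Once dropped, the
% data accrues GCBytesAge at 1 GB*h per hour, so for t > 24h:
\mathrm{GCBytesAge}(t) = 1\,\mathrm{GB}\cdot(t - 24\,\mathrm{h}),
\qquad
\mathrm{score}(t) = \frac{\mathrm{GCBytesAge}(t)}{1\,\mathrm{GB}\cdot 25\,\mathrm{h}}
                  = \frac{t - 24\,\mathrm{h}}{25\,\mathrm{h}}.
% The score reaches 1 at t = 49h, exactly when the data becomes GC'able, but
% only reaches the trigger of 2 at t = 74h, i.e. roughly one extra TTL of
% waiting, which is where the "roughly 2x the TTL" above comes from.
```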
The example is missing a comparison. Yeah, we'd want to collect this at some point, and we will. But we also have to scan 1GB to do it. In any alternative strategy that you come up with, you have to handle the common case in which there is very little garbage. Say a 1GB range and 1MB of garbage. You'll have to collect that eventually; how do you determine when? You want to make absolutely sure you're not scanning 1GB for nothing (perhaps repeatedly), so you need some way to confirm that you're going to collect a significant fraction of that 1MB when you invest your CPU time. That's the main contribution of the current mechanism, and in turn we get the shortcoming that even though there may be something worth collecting, it might take ~an extra TTL until we do. Agree that it's worth thinking about. At the moment we scan each range once per day anyway (consistency checker); there's an opportunity there to compute for sure what we can delete (and perhaps even do it right there).
Aha, I forgot about
The point is that we'd like to delete sooner, not later. So the example is a 1GB range with 500MB worth of garbage. As it stands, it will take 4 days until this garbage is collected (2 days until […]).

What really bothers me is that it takes an arbitrary amount of time to collect some garbage. If I delete a tiny row and nothing else, it'll be collected much later, and I don't have any knobs for it. Besides the resource usage questions, sometimes I really want my data wiped because of contracts or what not - think PCI compliance saying that you need to delete credit card numbers within some time window.

Basically, I find it hard to say when the thing that we currently do works well and when it doesn't. The only guarantee the current mechanism gives us is that when we collect, we'll collect something. And it has the reasonable behavior that, if there's more garbage relative to live data, it'll collect sooner than otherwise. But it's just vague because the statistic - the average age of a dead byte - is so coarse. It doesn't provide us with an estimate of how much garbage we can collect.

So, thinking fundamentally about the problem: the quantity of non-live bytes in a range is a monotonic function of time, and we want the area under the curve for a prefix of it (up until the TTL). And we need to summarize the function somehow.
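One way to make that last paragraph concrete (my own formalization of the argument, not anything from the code): let N(τ) be the bytes in the range that were already non-live at time τ, counting only versions still physically present. Then:

```latex
% GCBytesAge is the area under the whole curve N up to now:
\mathrm{GCBytesAge}(t) = \int_{-\infty}^{t} N(\tau)\,d\tau,
% whereas what a GC run at time t can actually remove is a single prefix value
% of that monotonic function:
\mathrm{collectible}(t) = N(t - \mathrm{TTL}).
% No scalar summary of the area pins down that prefix value, which is why a
% more granular summary (e.g. the per-hour buckets sketched earlier) would be
% needed to estimate collectible garbage.
```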
Yeah, I guess that would also shut me up, although it's not perfectly satisfying.
This is a WIP because it doesn't actually implement anything it claims yet. Instead, I wrote a big comment explaining how everything works and fits together, and left some TODOs in the code detailing what we'd want to improve. Let's discuss and polish this on this PR first, and I'll then amend it to actually implement the suggestions, updating/adding tests in the process.

----

We noticed some unfortunate GC queue behavior in [38251] and I went back to the code and found that there are several good improvements we can make without completely switching up the way things work. The point of contention was essentially that configuring a given TTL does not imply that keys gc'able by that TTL will be GC'ed right when they become gc'able. This is sort of intrinsic in the information we have available when deciding when to GC, but we were leaving a factor of two on the table. As a result, users should reasonably be able to expect that data is deleted approximately after the TTL for large deletions and approximately after twice the TTL for "spread out" deletions (such as repeatedly updating the same key). Previously these were 2x and 4x, respectively.

Additionally (though this did not matter for the present issue), the way the fraction of dead vs. live bytes feeds into the score (and by extension the scheduling of GC) wasn't very principled and has been reworked, the goal being to delay GC until the expected amount of freed bytes represents a certain fraction of the live data left in the range. This is to avoid repeatedly scanning large amounts of live data only to remove very small amounts of gc'able data.

[38251]: cockroachdb#38251

Release note: None
@tbg, any update on this issue?
No. There is some work-in-progress here but I am not working on it: #38432
I accidentally found this GH issue and noticed I have a ticket from a customer about exactly this behavior. They have a table with 360GB+ of data spread over 2300+ ranges, and half of the data in all these ranges (as per the range JSON files) is considered garbage, yet it is not removed, which is bringing the cluster dangerously close to running out of disk. Commenting on this GH issue for visibility and hoping this somehow gets improved, as it appears (based on the explanations above) that this is expected behavior (which should probably be documented more clearly so customers are aware of it).
I think we can set this value here to 1:
Since this issue has been open for a while, a small refresher on the problem: we want to avoid the GC triggering before there's work to do, to avoid spinning. So we need an invariant: if the GC score indicates a GC, carrying out that GC must reduce the score sufficiently to not trigger another GC. We compute such a score (valScore here). The invariant associated to that score is that GC will always bring that score to <1. So if we queue only when it's >= 1, we have what we want (no tight loop of GC). But we actually only trigger when the score hits 2! This was done out of an abundance of caution when I introduced this score and in hindsight has done more harm than good.

The "worst case" for this score is a large range where everything is recently deleted (i.e. not gc'able), but there is one outlier KV pair that was deleted very, very long ago. This would drive the score up above the threshold of 1 and yet, when you GC, all you get to delete is that one pair. But it doesn't matter: the resulting score will be << 1. The hope was that a factor of 2 would give a better balance, where we delay GC until there's more work to do (GC has to scan the entire range, i.e. it's expensive). However, at this point we have a GC cooldown period (per replica) of 1h, so tight loops aren't as much of an issue. The original reasoning for picking the factor of 2 […].

With a cut-off of 1, if at time t=0 a customer deletes the entire contents of a range, then once the TTL passes it will become GC'able (and will be GC'ed subject to the queue coming around to take a look). This is what customers expect and we should give it to them.

Another interesting example to think through is a range in which there's only one row that gets overwritten once per second, at a 1h TTL and assuming GC as soon as the score hits the threshold. Let's assume a GC run happened at t=0 and it's t=T now. The range will have 1h+T worth of writes, timestamped t=T, ..., t=0, t=-1, t=-2, ..., t=-3600, contributing GCBytesAge of 1, 2, ..., 3600+T. The size of the writes cancels out of the score (the arithmetic is sketched below).

Long story short, I think we can change the above factor from 2 to 1.
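Working through that single-row example (a sketch under the same assumption as before, namely score = GCBytesAge / (GCBytes × TTL); b is the size of one write):

```latex
% T seconds after the last GC, the dead versions have ages 1, 2, ..., 3600+T
% seconds, each of size b, so:
\mathrm{GCBytesAge} = b\sum_{i=1}^{3600+T} i = b\,\frac{(3600+T)(3601+T)}{2},
\qquad
\mathrm{GCBytes} = b\,(3600+T),
% and b cancels out of the score:
\mathrm{score}(T) = \frac{\mathrm{GCBytesAge}}{\mathrm{GCBytes}\cdot 3600}
                  = \frac{3601+T}{7200}.
% With a cut-off of 1 the score triggers at T ~ 1h (GC roughly every hour,
% each run collecting about an hour's worth of garbage, after which the score
% drops back to ~0.5); with a cut-off of 2 it only triggers at T ~ 3h.
```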
If you're referring to the cooldown I added a while back, that was only for the intent score, not the overall score: see cockroach/pkg/kv/kvserver/mvcc_gc_queue.go, lines 56 to 60 (at 3414b03).
But generalizing it should be straightforward.
Wouldn't that cause large deletions to always cause two GC runs? If any data was deleted before the large deletion, GC will trigger just before we reach the threshold and then have another go after the cooldown. Maybe having it at 1.1 would be better?
Something slightly larger than 1 makes sense. Although there is also a random fuzz factor of up to 5%, so 1.05 might be sufficient. Let's run a few experiments to verify.
cc @cockroachdb/replication
A customer has a cluster with a humongous range (range 6 - the "system config" range) that seems to constantly hold around 3 days' worth of old values instead of 25h worth. Those extra two days amount to gigabytes of dead values.
The GC queue does not seem interested in processing the range. When enqueued manually, it says low score:
It also says "likely last GC: 43h ago", which I think is wrong (the real value is more like 72h, which is also confirmed by the gcThreshold from the range stats).
Why the low score, I don't know yet. Perhaps something's busted with the scoring heuristics?
I've got the range data, all 7GB of it but highly compressed, and the range stats (internal only).
https://drive.google.com/drive/folders/1zgM9h5m3-voqV_Usm6fHxnuM8p4dK0LK?usp=sharing
There you can see, for example, all the versions of key /Table/3/1/4191/2/1, which range from June 14, 2019 11:17:25 AM to June 17, 2019 7:57:27 AM.
This is a spin-off of #38088.
gz#4843
Jira issue: CRDB-5642