storage: Performance degradation caused by kv tombstones #17229
#16252 as well, and I remember there was another instance (but can't find it now). Suffice it to say it's a relatively frequent problem.
An idea mentioned by @tschottdorf is to move historical data to a separate column family. Concretely, we could have a background process that periodically moved old versions and deletion tombstones to a separate column family. This could be done independently on each replica as the logical data would remain unchanged. Each replica would maintain a timestamp that would control whether the separate column family needs to be accessed or not.
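A minimal sketch of what that background process could look like, with invented names and plain integer timestamps standing in for HLC timestamps (an illustration of the idea, not actual CockroachDB code):

```go
package archivesweep

// version is one MVCC version of a key; within a key's slice, versions are
// ordered newest first.
type version struct {
	ts      int64
	value   []byte
	deleted bool // deletion tombstone
}

// sweep is the hypothetical periodic background pass: for every key it keeps
// only the newest non-tombstone version in the "current" column family and
// moves everything else (older versions and tombstones) into the
// "historical" one. It returns the newest timestamp at which a reader could
// still observe something in the historical column family; each replica
// would track that value to decide whether a read needs to open it at all.
func sweep(current, historical map[string][]version) (histMaxTS int64) {
	for key, versions := range current {
		var keep []version
		for i, v := range versions {
			if i == 0 && !v.deleted {
				keep = append(keep, v)
				continue
			}
			historical[key] = append(historical[key], v)
			if i > 0 && versions[i-1].ts-1 > histMaxTS {
				// A read just below the shadowing version's timestamp would
				// still see v, so it bounds the historical max timestamp.
				histMaxTS = versions[i-1].ts - 1
			}
		}
		current[key] = keep
	}
	return histMaxTS
}
```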
I'm tagging this issue for 1.3 because it keeps coming up and I want to make sure it gets into the next prioritization cycle. (If we could squeeze one more thing into 1.2, this would be a candidate.) I'd also like to propose index-level zone configs as a potential solution for this. Normally, the solution to slow reads is to add an index, but that doesn't work here because the index itself is getting filled up with garbage. An index with lower GC retention would work. We'd just need to ensure that the query planner takes this into account and won't use an index that may have been GC'd at the query's timestamp.
As part of the partitioning work, I expect we'll update our usage of zone configs internally to be based on byte ranges instead of database/table IDs. Targeting indexes falls out naturally from that work. I imagine the hard part is actually the user interface; I heard @BramGruneir might have a zone config redesign in the pipeline?
@BramGruneir is interested in providing SQL statements to set and get zone configs. I'd imagine this would be some sort of
What if your index is partitioned, though? Then you might end up with $ ./config zone set -f - [email protected] and there'd be four potential zone configs that could take effect (db, tbl, idx1, partition1). We'll need to decide if they inherit or not. If they don't inherit it's simpler to understand what's happening—the first zone config that exists is used verbatim—but without inheritance updating a table-wide setting (say, replication factor) requires updating every partition zone config for every index. That could mean manually running several dozen

Anyway, my point is that if partitioning doesn't motivate us to redesign the zone config interface, then partitioning + index zone configs should definitely motivate us.
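To make the two options concrete, here is a tiny Go sketch of the simpler, no-inheritance lookup described above ("the first zone config that exists is used verbatim"); the types and level names are invented for illustration only:

```go
package zonecfg

// ZoneConfig stands in for the real zone config; only illustrative fields.
type ZoneConfig struct {
	NumReplicas  int
	GCTTLSeconds int
}

// lookupFn reports whether a zone config is set at exactly the given level.
type lookupFn func(level string) (*ZoneConfig, bool)

// resolveWithoutInheritance walks from the most specific level to the least
// specific one and returns the first config that exists, verbatim. The cost
// is the one described above: changing a table-wide setting means touching
// every more specific config that shadows it.
func resolveWithoutInheritance(get lookupFn) *ZoneConfig {
	for _, level := range []string{"partition", "index", "table", "database"} {
		if zc, ok := get(level); ok {
			return zc
		}
	}
	return nil // fall back to the cluster-wide default
}
```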
@spencerkimball This was the issue I mentioned earlier.
It looks like SELECT COUNT(*) can take 1400ms even on an empty table even after lowering the TTL and waiting several hours after the TTL
Could be scanning over thousands of empty ranges since we don't merge.
In the report of 1400ms for an empty table (or nearly empty: the stats I have show 106KB of live data), there were 25 ranges (with all the live data in the last one). 17 of them were empty; the other 8 were awaiting GC with 32MB of non-live data. So I think in this case, the problem is not a bunch of empty ranges, it's that the GC queue appears to have fallen behind and hasn't deleted all this data that is past its TTL. We've made some improvements here recently (this test was using version 1.1.3; at least some of the GC queue improvements will be in 1.1.4). In the longer term, though, the ever-growing number of ranges due to the lack of merging is the bigger problem for this workload. Insertions to this table (with a unique_id primary key) are producing splits that are five minutes apart.
We've also encountered a workload recently where a small set of rows is updated very frequently, leaving thousands or tens of thousands of versions per key per day. Any scan of that table is going to have bad performance. Setting a lower GC TTL will help, but not completely mitigate the problem. And we can't set the TTL too low without affecting backups (i.e. backups need to happen more frequently than the TTL).

I've been thinking more about using a separate column family for old versions. Rather than having a queue which moves old versions to a separate column family, we could do this maintenance incrementally during MVCC operations. For instance, if an MVCC put creates a new version, we could do a quick check to see how many versions exist and if it is greater than N (e.g. 5 or 10), we could move a chunk of versions to the separate column family. The MVCC read operations would need a separate iterator for this column family, but I think this could be structured such that it is only accessed if you're doing a historical read.
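A rough sketch of that incremental maintenance, in Go with invented types (maps standing in for the two column families); not the actual MVCC code:

```go
package mvccarchive

// version is a single MVCC version: commit timestamp plus value. Within a
// key's slice, versions are ordered oldest first.
type version struct {
	ts    int64
	value []byte
}

const maxInlineVersions = 10 // "N" above, e.g. 5 or 10

// putWithArchiving appends a new version for key into the primary column
// family and, if the key now has more than maxInlineVersions versions, moves
// the oldest chunk into the archive column family. The cost of the move is
// amortized over the write that pushed the key over the threshold; a
// historical read would need a second iterator over archive[key].
func putWithArchiving(primary, archive map[string][]version, key string, v version) {
	versions := append(primary[key], v)
	if n := len(versions); n > maxInlineVersions {
		cut := n - maxInlineVersions
		archive[key] = append(archive[key], versions[:cut]...)
		versions = versions[cut:]
	}
	primary[key] = versions
}
```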
@petermattis I have thought about this too, but was worried about the efficiency of this approach. It's the natural way to implement flexible GC policies, such as "keep only X versions" (however useful that is in practice I don't know, but it's essentially the semantics you're suggesting) or the familiar "keep versions needed for reads at NOW()-TTLSeconds". This latter one may have to deal with lots of versions still, so we'd likely have to store (or at least cache, and that will often thrash) the number of versions for every key. One special case is that in which we only keep the latest version (i.e. no time travel, or a TTL of zero). In that case,

A step up from that is moving into an old column family and multiplexing reads accordingly (which isn't completely trivial). It gets really interesting when you throw #16132 in the mix -- You could have two column families that get rotated every
@tschottdorf We'd definitely have to benchmark the efficiency of using two column families. On the surface, it would primarily affect updates. An update that was previously the addition of a single new key would translate into the addition of a new key, the deletion of a historic version and the insertion of the historic version in the secondary column family. With a little sophistication, we could do this in batches to amortize the cost. Reads become interesting, but I'm assuming the common case is reading the most recent timestamp. Trying to throw #16132 into the mix seems challenging. I don't see any obvious way to achieve that as you can't just rotate the current column family to be the historic one as it still contains the most recent values.
Tentatively self-assigning for some prototyping. Concretely, I want to not even bother with column families right now, but it should be relatively straightforward to make

That should give us the correct perf numbers for the (common) read best case in which the archive keyspace is never accessed. For write workloads that need to move to the archive and reads that need to access it, I see no reason to believe that column families will make things faster, and so the benchmark numbers there should be useful as well. I'm not sure that we need column families unless we want to configure the archive differently or drop whole column families quickly.

@petermattis, any obvious flaws in the above plan?
Two more observations (mostly out of scope for a first prototype):
That's a good point. We really just need a separate keyspace.
Depending on where you implement the parallel keyspace logic, you probably won't be able to snapshot/rebalance so you'd be limited to a single-node cluster. That seems fine for a prototype and to explore the problem space.
For a queue, could the deletion tombstone be placed in the "archive" keyspace? That would leave the "live" keyspace containing only the latest version of visible values.
I've been imagining we'd implement this down in the C++ code and have some sort of "merged iterator". The upper layers of the code could be completely unaware of what was happening.
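The merged-iterator idea could look roughly like this; the real proposal is to implement it in the C++ storage layer, so this Go fragment (with invented types) only illustrates the shape: callers see a single ordered stream and never learn that two keyspaces exist.

```go
package mergediter

// kv is a minimal key/value pair for the illustration.
type kv struct{ key, value string }

// mergedIter yields the union of a "live" and an "archive" stream in key
// order. Both inputs are assumed sorted by key; on a tie, the live entry is
// returned first.
type mergedIter struct {
	live, archive []kv
	i, j          int
}

func (m *mergedIter) next() (kv, bool) {
	switch {
	case m.i < len(m.live) && (m.j >= len(m.archive) || m.live[m.i].key <= m.archive[m.j].key):
		m.i++
		return m.live[m.i-1], true
	case m.j < len(m.archive):
		m.j++
		return m.archive[m.j-1], true
	default:
		return kv{}, false
	}
}
```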
We can do that. The downside is that it penalizes all writes as those have to check the archive.
If you think that that can be made efficient enough, that seems preferable to me too.
Good point. That kind of stinks.
Do you have an alternative in mind? Doing this in C++ seems good both for efficiency and to keep most of the system unaware of what is going on.
No, I'm perfectly happy doing this in C++.
@tschottdorf To clarify: I think the reading side can be done entirely in C++. The writing side will probably have to be done in Go because all of our logic for MVCC writes is in Go. Anyways, I'm eager to see what you prototype. I think we need to address the performance degradation from large numbers of versions in 2.1.
@petermattis we're on the same page.
Some more concrete thoughts on this. Riffing on the terminology from Ressi, the key space would be (logically) divided into an active and a passive generation (potentially simulating a later separate column family). Practically (to reduce migration concerns), this means that the active generation is the keyspace we use now (i.e. userspace keys are mapped directly into MVCC-encoded keys), and the passive generation has some store-local key prefix (i.e. we prepend a store-local key prefix to the "active" sibling key).

The basic strategy is straightforward: we want the newest data in the active generation and most old versions and deletion tombstones in the passive one. If we additionally have a criterion by which a read can decide when it is safe to skip the passive generation, we get the desired benefits as long as the criterion is usually satisfied for reads at current timestamps.

A kv pair is called "live" if it is the most recent version of that key, unless it is a deletion tombstone, in which case it is never live. In other words, where a read at timestamp infinity would see a value, the corresponding kv pair is live. Writes always go to the active generation. Live data is always in the active generation. Non-live kv pairs may live in the passive generation (but requiring that they always do is likely too strict a requirement, see below).

When does a kv pair that is non-live migrate from the active to the passive generation? One straightforward idea is to amortize this cost over the write that makes the version non-live: when a new key is written, the writer also moves the previous version (which it reads anyway) into the passive generation (with an important tweak for tombstones, see below).

As for the "read criterion", a simple way to decide whether to check the passive generation on a read is to keep track of a maximum timestamp[1] at which any of the keys in the passive generation would be readable. For example, if a kv pair at t=1 is overwritten at t=5, we would have to bump the max timestamp to t=4 (a read at t=4 would see the value at t=1; a read at t=5 wouldn't). If the new write at t=5 is a deletion tombstone, this works the same way, but we also get to move the deletion tombstone itself into the passive generation[2].

This seems straightforward, but note that every (overwriting) write updates the max timestamp to roughly the current time. This means that a majority of writes won't get the benefit of skipping the passive generation, and we're back fairly close to square one (though something has been won -- in the absence of overwrites, everything works as expected).

To address this, the strategy that comes to mind is to not move versions immediately after they are shadowed. Instead, defer this until a few versions accumulate (or until some versions are at least T seconds old) and then rewrite. This makes it quite unattractive to amortize the work with the writes themselves, as we're likely not willing to a) burden the write with the large overhead of reading and rewriting potentially many versions or b) keep more metadata for each write. Instead, what can be done is to trigger the conversion through reads. Reading is where these non-live versions really cause pain, so a read is naturally good at discovering that a cleanup is necessary. This has some desirable properties (though not only desirable ones).
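A compact sketch of the "read criterion" bookkeeping, with integer timestamps and invented names (not the real storage code):

```go
package generations

// GenerationState tracks, per store, the newest timestamp at which a read
// could still observe something that lives in the passive generation.
type GenerationState struct {
	PassiveMaxTS int64
}

// CanSkipPassive is the read criterion: a read strictly above PassiveMaxTS
// is fully served by the active generation.
func (g *GenerationState) CanSkipPassive(readTS int64) bool {
	return readTS > g.PassiveMaxTS
}

// OnShadowedVersionMoved records that a version was shadowed by a new write
// at newTS and rewritten into the passive generation. A read at newTS-1
// would still see the old version (as in the t=1 overwritten-at-t=5 example,
// which bumps the max timestamp to t=4).
func (g *GenerationState) OnShadowedVersionMoved(newTS int64) {
	if ts := newTS - 1; ts > g.PassiveMaxTS {
		g.PassiveMaxTS = ts
	}
}
```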
Interestingly, there is a nice way to at least prototype this: let the scan construct a WriteBatch that, when applied, would rewrite the offending kv pairs. Whenever a WriteBatch is returned from a Scan (with appropriate metadata about the new max timestamp), simply apply it to the underlying engine after returning to the client. (To make this 100% correct, some synchronization is needed between bumping the max_timestamp and in-flight reads still using the previous one, but that's not required for prototyping.) A simple policy is to move values only when that wouldn't bump max_timestamp within (say) 5s of the current time.

[2]: are there any issues with this regarding newer writes that could write at t<=5 now? I don't think so, because of the timestamp cache, assuming that gets populated on a deletion -- which it may currently not.
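For the prototype flow, the scan-built WriteBatch could be modeled like this (all types invented; a real version would use the engine's WriteBatch and MVCC key encoding):

```go
package scanmigrate

// move describes one kv pair the scan wants to rewrite from the active
// keyspace into the passive (archive) keyspace.
type move struct {
	activeKey, passiveKey string
	value                 []byte
}

// migrationBatch plays the role of the WriteBatch returned by the scan,
// together with the metadata about the proposed new max timestamp.
type migrationBatch struct {
	moves    []move
	newMaxTS int64
}

// applyAfterResponse is invoked once the scan result has been returned to
// the client: it rewrites the offending kv pairs under the passive key
// prefix. Bumping the shared max timestamp (and synchronizing with in-flight
// reads that still use the previous one) is omitted, as in the prototype
// description above.
func applyAfterResponse(engine map[string][]byte, b migrationBatch) {
	for _, m := range b.moves {
		engine[m.passiveKey] = m.value
		delete(engine, m.activeKey)
	}
}
```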
PS my starting point for this will likely be a
Nice writeup!
Would you send that WriteBatch to the other replicas? Or would you let each replica migrate keys from active to passive individually? Because the data is logically the same, I don't think we need to synchronize the movement on the replicas, though not doing so might cause performance blips when the leaseholder changes. Ah, re-reading I see you're thinking about this in the context of a prototype. Doing something easy for a prototype seems appropriate.
The MVCCScan benchmark also clearly shows the problem with large numbers of versions, though perhaps that is too low of a level to work at.
we should have something about this in the output of |
@andreimatei it was slow to get tracing info? Adding this information to |
We're in the process of adding this information to
cc @sumeerbhola, not sure if this one's useful to keep around any longer but just FYI. |
There has been some thinking specifically on separating older versions inside Pebble, for this purpose of not degrading reads with a large number of versions, but it isn't captured in an issue. I'll take the TODO to create a separate issue for it. Though if we could do compaction-time GC (#57260), it isn't clear whether this would be needed.
Has any thought been given to using a metric along the lines of the MVCC step count from #64503 as a signal into the MVCC GC queue's scoring mechanism? I had an issue recently where there were a lot of rows with primary keys of the form

Those SELECTs got very expensive because each had to scan over tens of thousands of garbage MVCC entries, but MVCC GC never kicked in automatically on the range because even though the overwhelming majority of these entries were older than the

This particular instance is also obviously fixable with better application schema/query design, which I'll take care of, but it does seem like the amount of MVCC garbage being encountered by queries on a range could be a very valuable signal to the GC queue. That way, the garbage that's actually being encountered most often and wasting the most resources could be prioritized first.

On the other hand, if cockroachdb/pebble#1170 actually lands in the near future, then I suppose this will be a moot point.

And on a less important but related note, I found that manually triggering an MVCC GC run on the range wasn't enough to reduce the CPU usage, because the MVCC garbage was still in lower levels of the LSM. I had to also trigger a manual LSM compaction to actually speed up the queries. I assume that cockroachdb/pebble#918 would have helped with this latter point.
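To illustrate the suggestion (purely hypothetical, with invented names and weights; not how the MVCC GC queue actually scores ranges today), a "garbage encountered by reads" signal could be folded into the score roughly like this:

```go
package gcscore

// rangeGCInputs is an invented, simplified view of the signals discussed
// above: how much data in the range is past the GC TTL, and how much MVCC
// garbage reads have actually had to step over recently (e.g. derived from
// the step counts in #64503).
type rangeGCInputs struct {
	GarbageBytes        int64
	LiveBytes           int64
	GarbageStepsPerRead float64
}

// score combines a space-based component with a read-impact bonus so that
// ranges whose garbage is actively slowing down queries are collected first,
// even when they wouldn't clear a purely space-based threshold.
func score(in rangeGCInputs) float64 {
	space := float64(in.GarbageBytes) / float64(in.GarbageBytes+in.LiveBytes+1)
	readImpact := in.GarbageStepsPerRead / (in.GarbageStepsPerRead + 100) // saturates below 1
	return space + readImpact
}
```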
Thanks @a-robinson! I filed #98561 to track your idea. |
This is currently expected in v23.1, but is mainly useful for scans where keys are small compared to the values. Any rough estimates for the key and value sizes in the index being scanned in your query? |
~128 byte keys (as represented in the KV/storage layer) and ~512 bytes for each value. |
I strongly suspect that we just ran into this problem at @MaterializeInc. Whenever we scale up our production cluster (in this case from 6 to 12 nodes), one or two nodes will take a disproportionate share of the CPU load, despite small (10m) GC thresholds. I don't have hard proof, but CPU profiles of the affected nodes show a surprising amount of time in

Internal support ticket for the curious: https://support.cockroachlabs.com/hc/en-us/requests/16470
I wanted to give this a try to validate the theory, but it seems there's no way to trigger this on a Cockroach Cloud cluster? |
@benesch drive by comment, once #99726 is picked up it will be a bit easier to "conclusively" see whether there are more tombstone accesses on one store vs another. Also, if you can get a stmt trace that you suspect is slow because of tombstones, it will have iterator stats printed into it which contain similar information today already. |
Awesome, thanks @tbg! We're running another load test in a month that I expect to tickle this. Will report back if we manage to get a repro. |
If a workload deletes or updates rows frequently and also scans over those rows, the tombstones left behind by our MVCC layer cause the scans to get slower and slower over time. This is surprising to users and can render workloads that are perfectly reasonable on a new database impractically slow on a database that's been under load for 24 hours.
This was hit by a user whose workload involved two table scans in a three-way join in #17075, and has received some previous discussion on our forum. Consider this a tracking issue for the user-visible problem.
Jira issue: CRDB-6045