storage: unexpected IO overload in admission/follower-overload/presplit-control #82109
I turned the kv0 workload off to let n1 and n2 get healthy again (they were so unhealthy that they'd periodically miss heartbeats and lose all leases). Doing so was instructive because, uh, what's pebble doing with its newfound freedom? It's certainly not cleaning up L0. The workload stopped at 8:37.

My perhaps naive reading is that pebble is working as fast as it can, but it isn't clearing out L0, presumably because it first needs to clear out the lower levels of the LSM to make room (while still receiving a small amount of writes to the LSM from the kv-50 workload, which writes at 10kb/s, i.e. a negligible amount, via raft appends to n1 and n2).

A few minutes later, L0 did clear out, and interestingly, for the minutes prior to this we see compaction throughput rise significantly. My (again, naive) interpretation of this is that possibly some larger compactions were blocking L0->Lbase compactions that could have used more of the concurrency slots. Somehow the LSMs on n1 and n2 spent too much time in a regime where they weren't able to fully use the disks. I assume someone more knowledgeable can see things like that from the LSM log inspections linked in the initial post.
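As an aside for anyone re-running this: the signals discussed above (L0 files/sublevels piling up, compaction debt, compactions in flight) are all exposed through Pebble's metrics. Below is a minimal, hypothetical Go sketch of polling them; it is not part of the test, and the field names assume a reasonably recent Pebble release, so they may differ slightly between versions.

```
// lsmwatch is a hypothetical helper (not part of the roachtest) that polls a
// Pebble DB's metrics and logs the signals discussed above: L0 file and
// sublevel counts, and the estimated compaction debt.
package main

import (
	"fmt"
	"log"
	"time"

	"github.com/cockroachdb/pebble"
)

func watchLSM(db *pebble.DB, every time.Duration) {
	for range time.Tick(every) {
		m := db.Metrics()
		l0 := m.Levels[0]
		fmt.Printf("L0: files=%d sublevels=%d size=%s; compactions=%d in-progress=%d debt=%s\n",
			l0.NumFiles, l0.Sublevels, humanize(uint64(l0.Size)),
			m.Compact.Count, m.Compact.NumInProgress, humanize(m.Compact.EstimatedDebt),
		)
	}
}

func humanize(b uint64) string {
	const mb = 1 << 20
	return fmt.Sprintf("%.1f MiB", float64(b)/mb)
}

func main() {
	// Open a throwaway store just to demonstrate the call; in practice you'd
	// hold the *pebble.DB that the process already has open.
	db, err := pebble.Open("/tmp/lsmwatch-demo", &pebble.Options{})
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
	watchLSM(db, 10*time.Second)
}
```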
Update: a few hours in we're back to IO overload. Notably, this is also starting to affect n3, which had previously been unaffected. This makes some sense, since we doubled the IOPS for n1 and n2, so maybe they're slowing down "less" due to their own overload and manage to push n3 over the edge as well.
Dropping in some comments based on some observations yesterday.
Having all nodes on 250MB/s seemed like it helped for a while, though it sounds like we ended up in a throttling regime again. Understanding the throughput characteristics of the gp3 EBS volumes seems prudent. Current observations seem to indicate that we can drive significantly more throughput on the device than provisioned.
I've since run multiple iterations of this experiment. For the minute details, I will refer to these slack threads: number one (this one has more stumbling, so probably better skipped by most) and number two, which just "sort of" wrapped up as I'm typing this. There were more precursor slacks over the last couple of weeks, but I think they are less important, so I'm not going to dig them up.

The high-level narrative that seems to emerge is that we were really dealing with gp3 behavior that we didn't properly understand. The experiment initially starts out with a 500gb gp3 disk provisioned to 125mb/s and 3k iops. It turns out, though, that you really get a RAID0 of two EBS volumes:
Now why the first one is 93.1G and the second one is 500G, don't ask me. I would love to learn if this setup is somehow accidental or whether that is how it always works. This is how the VMs are created, looks pretty vanilla:
But there is also this, so the 93.1G device is going to be the local NVMe SSD attached to the instance. That probably already explains everything: we shouldn't be lumping the local SSD into the RAID0, because in doing so we won't reliably get a speed-up unless both volumes have been provisioned up, and we can't provision up the local SSD (I think).
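The quickest way to confirm this on a node is `lsblk` or `cat /proc/mdstat`. For illustration, here is a small hypothetical Go helper (not part of the test) that lists each md array's member devices and sizes via sysfs, which makes a RAID0 spanning an EBS volume and a local SSD obvious at a glance:

```
// mdcheck: list software-RAID arrays and their member devices via sysfs.
// Hypothetical diagnostic helper; assumes a Linux host where the store is
// (or might be) an mdadm RAID0, as on the nodes discussed above.
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

func main() {
	mds, _ := filepath.Glob("/sys/block/md*")
	if len(mds) == 0 {
		fmt.Println("no md arrays found; the store is likely a single device")
		return
	}
	for _, md := range mds {
		fmt.Printf("%s:\n", filepath.Base(md))
		members, _ := os.ReadDir(filepath.Join(md, "slaves"))
		for _, m := range members {
			// /sys/class/block/<dev>/size is the device size in 512-byte sectors.
			raw, err := os.ReadFile(filepath.Join("/sys/class/block", m.Name(), "size"))
			if err != nil {
				fmt.Printf("  %s (size unknown)\n", m.Name())
				continue
			}
			sectors, _ := strconv.ParseInt(strings.TrimSpace(string(raw)), 10, 64)
			fmt.Printf("  %s  %.1f GiB\n", m.Name(), float64(sectors)*512/(1<<30))
		}
	}
}
```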
I modified your roachtest from #81516 to do something like:
Using that configuration, I confirmed that we only get a single EBS vol on the node, so no RAID in the mix to complicate things. I've had the roachtest running for an hour or so, and I'm starting to see the effects of maxing out the EBS volume. Here's a look at read, write and combined throughput. The ceiling is 125 MB/s, as expected. I'm not going to dig in too deep at this point. Just wanted to confirm that we can at least work around the mixed SSD / EBS volume issue by updating the instance type.
Closing this issue since we understand what's going on, and since #82423 tracks fixing this footgun.
cc @cockroachdb/replication
Unfortunately this still occurs on an EBS-only volume. I have been running a cluster which started out on the 125mb/s / 3k iops volumes and experienced IO overload, as expected. I reprovisioned the volumes to 500mb/s and 6k iops at 7:38am UTC, which is 8 hours ago. The volumes moved to the "optimizing" state (where their performance is somewhere between the old and new provisioned values) and have been in the regular "in use" state for a few hours. However, there is IO overload again at a throughput ceiling of 144mb/s (matching precisely the old ceiling), and additionally I am unable to squeeze out additional throughput.
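For reference, the reprovisioning step and the wait for the volume modification to leave the "optimizing" state can be scripted. The sketch below is hypothetical and uses aws-sdk-go (v1) from memory; the volume ID is a placeholder, so treat it as illustrative rather than the exact procedure used here.

```
// reprovision bumps a gp3 volume to 500 MiB/s / 6000 IOPS and then polls the
// modification state until it is no longer "modifying"/"optimizing".
// Hypothetical sketch; vol-0123... is a placeholder, not a real volume.
package main

import (
	"fmt"
	"log"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

func main() {
	volID := "vol-0123456789abcdef0" // placeholder
	svc := ec2.New(session.Must(session.NewSession()))

	if _, err := svc.ModifyVolume(&ec2.ModifyVolumeInput{
		VolumeId:   aws.String(volID),
		Iops:       aws.Int64(6000),
		Throughput: aws.Int64(500), // target throughput in MiB/s (gp3 only)
	}); err != nil {
		log.Fatal(err)
	}

	for {
		out, err := svc.DescribeVolumesModifications(&ec2.DescribeVolumesModificationsInput{
			VolumeIds: []*string{aws.String(volID)},
		})
		if err != nil {
			log.Fatal(err)
		}
		if len(out.VolumesModifications) == 0 {
			fmt.Println("no modification in flight")
			break
		}
		m := out.VolumesModifications[0]
		fmt.Printf("state=%s progress=%d%%\n",
			aws.StringValue(m.ModificationState), aws.Int64Value(m.Progress))
		if aws.StringValue(m.ModificationState) == "completed" {
			break
		}
		time.Sleep(30 * time.Second)
	}
}
```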
Stopping CRDB and running fio showed something interesting - we were bursting at 500mb/s (the provisioned throughput), but only for a few seconds:
then we immediately dropped back to the baseline we also saw CRDB operate at:
Just tried reads - I seem to be getting 500 mb/s read throughput on n1, and have been for 7min:
so I tried the writes again and it seems that we're getting it there now, too:
In the meantime, n2 (which wasn't down) is still constrained. Maybe these volumes need some rest before they can crank out the 500m? Going to stop n2 and restart n1.
No, n2 is only getting 138MiB/s on either reads or writes.
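For anyone reproducing these measurements: fio with --direct=1 is the right tool, but a rough, self-contained Go stand-in for the same kind of sustained-write probe is sketched below. It assumes the volume under test is mounted at /mnt/data1 (roachprod's usual mount point) and goes through the page cache with explicit syncs rather than O_DIRECT, so the absolute numbers will be a bit off; the burst-then-throttle shape still shows up. Run it until interrupted.

```
// writeprobe: writes 1 MiB blocks with explicit syncs and prints MB/s once
// per second, which makes a burst-then-throttle pattern easy to spot.
// Assumes /mnt/data1 is the mount point of the volume under test.
package main

import (
	"fmt"
	"io"
	"os"
	"time"
)

func main() {
	const path = "/mnt/data1/writeprobe.dat" // assumption: store mount point
	f, err := os.OpenFile(path, os.O_CREATE|os.O_WRONLY|os.O_TRUNC, 0644)
	if err != nil {
		panic(err)
	}
	defer f.Close()

	buf := make([]byte, 1<<20) // 1 MiB per write
	const fileLimit = 4 << 30  // rewrite the same 4 GiB region so we don't fill the disk
	var written, bytesThisTick int64
	tick := time.NewTicker(time.Second)
	defer tick.Stop()

	for {
		select {
		case <-tick.C:
			fmt.Printf("%6.1f MB/s\n", float64(bytesThisTick)/1e6)
			bytesThisTick = 0
		default:
			if _, err := f.Write(buf); err != nil {
				panic(err)
			}
			// Sync so we measure device throughput rather than page-cache fill.
			if err := f.Sync(); err != nil {
				panic(err)
			}
			written += int64(len(buf))
			bytesThisTick += int64(len(buf))
			if written >= fileLimit {
				if _, err := f.Seek(0, io.SeekStart); err != nil {
					panic(err)
				}
				written = 0
			}
		}
	}
}
```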
82656: prometheus: improve UX, add grafana, node_exporter, custom dashboards r=erikgrinaker a=tbg

We already had the ability to deploy a prometheus instance to a node in the cluster. However, to run experiments / long investigations[^1] we often need a Grafana instance with the dashboards du jour. This commit dramatically cuts down on the manual steps needed to get this set up. All it takes is adding setup like this to the roachtest:

```
clusNodes := c.Range(1, c.Spec().NodeCount-1)
workloadNode := c.Node(c.Spec().NodeCount)
promNode := workloadNode

cfg := (&prometheus.Config{}).
	WithPrometheusNode(promNode).
	WithCluster(clusNodes).
	WithGrafanaDashboard("https://gist.githubusercontent.com/tbg/f238d578269143187e71a1046562225f/raw").
	WithNodeExporter(clusNodes).
	WithWorkload(workloadNode, 2112).
	WithWorkload(workloadNode, 2113)

p, saveSnap, err := prometheus.Init(
	ctx, *cfg, c, t.L(), repeatRunner{C: c, T: t}.repeatRunE,
)
require.NoError(t, err)
defer saveSnap(ctx, t.ArtifactsDir())
```

There has been talk[^2] of adding some of this tooling to `roachprod`. Probably a good idea, but we can pour an infinite amount of work into this, and for now I think this is a good stepping stone and satisfies my immediate needs.

[^1]: #82109
[^2]: [internal slack](https://cockroachlabs.slack.com/archives/CAC6K3SLU/p1654267035695569?thread_ts=1654153265.215669&cid=CAC6K3SLU)

Release note: None

Co-authored-by: Tobias Grieger <[email protected]>
Following up on the throughput limit we were hitting above: the reason we were hitting this limit is that the test was using an instance type whose EBS-optimized throughput is burst-only (see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-optimized.html), so its baseline throughput sits well below what the volume itself is provisioned for. To achieve a consistent throughput of 500MB/s, we should rerun the test with an instance type whose baseline EBS bandwidth is at least that high.
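As a quick sanity check on the numbers: the per-instance figures in that table are listed in Mbps, so a baseline in the neighborhood of 1,150 Mbps (a hypothetical value chosen only to illustrate the conversion, not taken from the issue) works out to roughly the ~144 MB/s ceiling observed above:

```
// Convert a per-instance EBS baseline from Mbps to MB/s.
package main

import "fmt"

func main() {
	const baselineMbps = 1150.0 // assumed example figure, not from the issue
	fmt.Printf("%.1f MB/s\n", baselineMbps/8) // ≈ 143.8 MB/s, i.e. the ~144 MB/s ceiling
}
```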
Closing this one out, now that we understand how EBS works :)
The admission/follower-overload/presplit-control test is introduced in #81516.
Despite the name, no overload should occur in it. We run two workloads: a kv0 workload whose write volume lands on n1 and n2, and a small kv-50 workload that writes only ~10kb/s and reaches n1 and n2 via raft appends.
The expectation is that this "just runs fine until out of disk". The reality is that a couple of hours in, n1 and n2 end up in admission control due to their LSM inverting.
I've dug at this quite a bit but think that I have ruled out (or at least not seen evidence of)
I recorded my detailed analyses as looms. They're a couple minutes each. There's also a slack thread that might pick up some more minute details as the day goes by but anything major will be recorded here.
part 1: finding out that the cluster is in bad health and looking at graphs https://www.loom.com/share/5d594ac594f64dd3bd6b50b8b653ca33
look at the lsm visualizer for https://gist.github.com/tbg/7d47cc3b6bfc5d9721579822a372447e, https://gist.github.com/tbg/c1552f8d92583c91f9996323608c647e, and https://gist.github.com/tbg/83d49ce2c205121b17b32948de1720b8: https://www.loom.com/share/2e236668c1bb4a67b52df5b64e0c231f
with both the IOPS upped and n1,n2 restarted with compaction concurrency of 8, both of them still seem overloaded: https://www.loom.com/share/8ac8b33d082645ce9aff780eeedd00cb
It's too early to really tell, though, since the leases still have to balance out. I will comment in a few hours when a new steady state has been reached.
Related to #79215
Jira issue: CRDB-16213
Epic CRDB-15069