Queries take >10s when updating same row concurrently #17883
Comments
@jordanlewis can you have a look?
My original report of no forward progress appears to be incorrect; I adjusted the reporting in the test program (also edited above) and I'm now seeing at least one query per minute. Edit: maybe I'm seeing no forward progress after all, just with more concurrent processes.
Tried again with 1.0.5, hoping #17385 would help, but no improvement.
I followed your instructions with the latest alpha and couldn't reproduce what you were seeing - running the program with concurrency 60 and num_updates 10 finished in under a minute. Same with v1.0.5, actually. Did you run the program for a long time before being able to reproduce the behavior you saw?
I've found it very easy to reproduce on different systems. Have you tried increasing the concurrency? On a 2-core system 60 is all that's needed, but on a 4-core system I doubled it to 120. I can probably set up an AMI that exhibits this behavior; would that help? If so, send me your AWS account ID.
Doubling the concurrency to 120 does produce some interesting behavior. It's pretty easy to reproduce this with our own load generator as well. Latencies creep up to 10 seconds and never improve. A CPU profile of both workloads shows that the lion's share of time is spent in the same place. This workload is so heavily contended that it's not surprising that query latency is quite bad. It's possible that there are some improvements we can make, but since CockroachDB runs in serializable mode by default, such a contended workload inevitably means that all of the operations must run serially and will encounter a lot of waiting and/or retries.
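(For readers landing here: the usual client-side mitigation for those retries is a retry loop keyed on CockroachDB's retryable serialization error. A minimal sketch in Go, assuming the lib/pq driver; ExecWithRetry is a hypothetical helper, not code from this thread:)

```go
// Package retryexample is a hypothetical sketch, not code from this thread:
// a client-side retry loop keyed on CockroachDB's retryable serialization
// error (SQLSTATE 40001), which this kind of contention makes frequent.
package retryexample

import (
	"database/sql"

	"github.com/lib/pq"
)

// ExecWithRetry runs stmt in its own transaction, retrying whenever the
// server aborts it with the retryable error code. On a heavily contended
// row, callers should expect many iterations.
func ExecWithRetry(db *sql.DB, stmt string) error {
	for {
		tx, err := db.Begin()
		if err != nil {
			return err
		}
		if _, err = tx.Exec(stmt); err == nil {
			if err = tx.Commit(); err == nil {
				return nil
			}
		}
		tx.Rollback() // best effort; the interesting error is already in err
		// Only SQLSTATE 40001 is safe to retry; surface anything else.
		if pqErr, ok := err.(*pq.Error); !ok || pqErr.Code != "40001" {
			return err
		}
	}
}
```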
Haven't looked at this in detail, but perhaps related to #17121.
Just as another data point, we accidentally caused exactly this performance degradation during a restore. A goroutine dump showed about 60 routines hammering on the same row.
Reassigning to @nvanbenschoten as this is likely a core performance issue, but feel free to kick it back. |
Yeah, this looks very similar to #20448. Hopefully the changes we make there will translate cleanly to wins on this issue. If not, we'll have to investigate further. |
I was able to reproduce exactly what @jethrogb saw on an older build. As expected, recent changes to better handle high contention (notably #25014) have greatly improved the situation on newer builds.
Original issue description:

I have a table like this:
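(The actual CREATE TABLE didn't survive this copy of the issue; a minimal stand-in, assuming only the testtable name from the SELECT below and a single contended row, might look like:)

```sql
-- Hypothetical reconstruction; only the table name is known from the report.
CREATE TABLE testtable (
    id      INT PRIMARY KEY,
    counter INT NOT NULL DEFAULT 0
);
INSERT INTO testtable (id, counter) VALUES (1, 0);
```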
I have tens of concurrent processes running this in a loop:
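(The looped statement was likewise not preserved; given that every process contends on the same row, a plausible stand-in is:)

```sql
-- Hypothetical reconstruction: any single-row write reproduces the contention.
UPDATE testtable SET counter = counter + 1 WHERE id = 1;
```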
This starts out relatively fast, but queries can take tens of seconds after a bit. After killing these processes, operations on this table (even something as simple as `SELECT count(*) FROM testtable`) remain slow for a very long time. I'm running single-node, but I've also observed similar behavior on clusters. Config header:
Test program below; a reconstructed sketch with an example invocation follows at the end of this report. If it doesn't get slow enough, try increasing the num_updates variable in the code or the max-processes number on the command line. With num_updates=1 it usually finishes in a couple of minutes; with num_updates=10 it can take a very long time to finish.
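(The original test program wasn't preserved in this copy. Below is a sketch of an equivalent reproducer in Go, using goroutines in place of the separate processes the report describes; the connection string, flag name, and num_updates value are assumptions:)

```go
// Hypothetical reconstruction of the reproduction program described above;
// the original source was not preserved. Goroutines stand in for the
// separate processes in the report, and the connection string and flag
// name are assumptions.
package main

import (
	"database/sql"
	"flag"
	"log"
	"sync"

	_ "github.com/lib/pq"
)

// numUpdates mirrors the num_updates variable mentioned in the report.
const numUpdates = 10

func main() {
	maxProcesses := flag.Int("max-processes", 60, "number of concurrent workers")
	flag.Parse()

	db, err := sql.Open("postgres",
		"postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	var wg sync.WaitGroup
	for i := 0; i < *maxProcesses; i++ {
		wg.Add(1)
		go func(worker int) {
			defer wg.Done()
			for j := 0; j < numUpdates; j++ {
				// Every worker writes the same row, producing the
				// contention the issue describes.
				if _, err := db.Exec(
					`UPDATE testtable SET counter = counter + 1 WHERE id = 1`,
				); err != nil {
					log.Printf("worker %d: %v", worker, err)
					return
				}
				log.Printf("worker %d: update %d done", worker, j+1)
			}
		}(i)
	}
	wg.Wait()
}
```

An invocation like `go run repro.go --max-processes=120` would mirror the concurrency doubling discussed above.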