INSERT ON CONFLICT causes huge latency when executed in parallel #28487
Comments
just saw: #28461 related? |
Hey @petermattis, I updated and scaled the pods up again manually. The first part is one pod. Adding the second pod did not have much of an impact, but when I added the third one the latency went over 2 seconds again. Finally I removed one pod, and the latency went back down to an acceptable level. While running 3 pods I got the following TX error:
A performance increase is definitely noticeable. |
@nvanbenschoten for triage. @andreimatei, you might be interested in the TransactionStatusError. |
Log excerpt; the number is the record count to be inserted (let me know if you need more details).
|
Hi @cgebe, thanks for the report. First off, do you mind sharing an example of what the queries you're issuing look like? You mentioned that you're running One way that we can try to isolate where the issues lie is to perform the exact same experiment but without any conflicts. Peter mentioned that we have made large improvements in 2.1 to address contention, but I still suspect that it's causing issues here. Another thing we should try is reducing each statement to operate on only a single data point. Batching is a recommended optimization to improve throughput when there are no conflicts, but it's less clearly a win when there are. Why don't we start by reducing every statement to a single data point and see if we still observe large latency spikes when adding the third pod. The |
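The batch-size experiment suggested above could be set up with a small helper like the one below. This is only a sketch: `chunk` is a hypothetical name, not part of the reporter's code or of any CockroachDB client library, and it only groups the data points. Each group would then be sent as its own statement.

```go
package main

import "fmt"

// chunk splits items into consecutive groups of at most size elements.
// A size below 1 is treated as 1, so every statement gets one data point.
// (Hypothetical helper for illustration; not taken from the issue.)
func chunk[T any](items []T, size int) [][]T {
	if size < 1 {
		size = 1
	}
	var groups [][]T
	for start := 0; start < len(items); start += size {
		end := start + size
		if end > len(items) {
			end = len(items)
		}
		groups = append(groups, items[start:end])
	}
	return groups
}

func main() {
	points := []int{1, 2, 3, 4, 5}

	// Size 1: one statement per data point, as proposed for the experiment.
	fmt.Println(len(chunk(points, 1)))

	// Size 50: the whole slice fits into a single batch.
	fmt.Println(len(chunk(points, 50)))
}
```

Running the same workload with size 1 and then with the original batch size of about 50 would show whether the latency spikes depend on the batch size.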
Hey @nvanbenschoten, the query looks like the following, with a composite primary key:
I am building batch queries this way (map due to #21360):
Data usually looks like Currently, I work around the error with the following func in a deferred call. Interestingly, the func is executed directly after the error with exactly the same insert, and it does not fail at all (it has not reached the single-insert block yet). It seems the error can only occur in a very small time window.
I observed the error being thrown later and less often when fewer pods are running concurrently and inserting the same data (so instead of 3 workers inserting 6 data points each, 2 workers inserting 9 data points each). If I start more pods, it is likely to occur earlier and more often. The error first appeared after the update to Concerning the actual latency issue: since the latency got better with the update, I am not sure whether this error is related. Do you suggest doing single inserts instead, despite the rising number of data points? I would rather leave conflict resolution to the database, but I might be better off doing it upfront. Right now I am fine, but I fear it will get worse. |
I experienced a sudden drop in latency, possibly due to the growing number of ranges? With two pods, I currently have a P99 < 150ms and a P50 of 50ms. |
From #30164. The latest beta version is
|
I tried out the following:
All strategies lead to query times between 3s and 15s for a single query. I stopped all my parallel insert jobs and tried to execute a single insert with a single client, and still got a query time of 4s. This can only mean that my cluster is in a corrupted state. Is there a way to repair my cluster state? My debug zip: https://transfer.sh/bzm6Q/log.zip looking at |
Turning |
Is this a bug report or a feature request?
bug
BUG REPORT
I have a CockroachDB cluster running inside Kubernetes with 3 nodes. They are running in the same data center on 3 different machines.
Now, I have 3 pods (distributed across the machines), each running 3 workers. Each worker inserts data points via batch inserts into 1 specific database (3 databases in total) every second (usually about 50 data points per INSERT). The data points inserted by 1 worker into 1 database can conflict with the same data points inserted by a worker of another pod. In addition, a single worker should be able to insert duplicate data points (I already solved that by discarding duplicates upfront due to #21360). However, there are still conflicts across workers from different pods, which push the insert time up to ~10 seconds.
This issue only occurs when I start the pods at the very same time via the deployment. When I manually scale up the deployment, the insert windows of the pods are shifted relative to each other, so they do not conflict and the latency stays within acceptable bounds.
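Discarding duplicates upfront, as described above, might look roughly like the following. This is a minimal sketch under assumptions: the `DataPoint` fields and the composite key `(SeriesID, Timestamp)` are hypothetical, because the actual schema and batching code are not shown in this report.

```go
package main

import "fmt"

// DataPoint is a hypothetical row shape; the real schema is not shown in the issue.
type DataPoint struct {
	SeriesID  string
	Timestamp int64
	Value     float64
}

// pointKey mirrors an assumed composite primary key (series ID, timestamp).
type pointKey struct {
	SeriesID  string
	Timestamp int64
}

// dedupe keeps the last value seen for each key and preserves the order in
// which keys first appeared, so one batch never touches the same row twice.
func dedupe(points []DataPoint) []DataPoint {
	index := make(map[pointKey]int, len(points))
	out := make([]DataPoint, 0, len(points))
	for _, p := range points {
		k := pointKey{SeriesID: p.SeriesID, Timestamp: p.Timestamp}
		if i, seen := index[k]; seen {
			out[i] = p // a later duplicate overwrites the earlier value
			continue
		}
		index[k] = len(out)
		out = append(out, p)
	}
	return out
}

func main() {
	batch := []DataPoint{
		{SeriesID: "a", Timestamp: 1, Value: 1.0},
		{SeriesID: "b", Timestamp: 1, Value: 2.0},
		{SeriesID: "a", Timestamp: 1, Value: 3.0}, // duplicate key of the first point
	}
	for _, p := range dedupe(batch) {
		fmt.Printf("%s %d %.1f\n", p.SeriesID, p.Timestamp, p.Value)
	}
}
```

This only removes duplicates within one worker's batch. Conflicts between workers in different pods still reach the database.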
v2.0.4
Started INSERT ON CONFLICT batch queries with ~50 data points from different pods.
The latency staying normal and conflicts not having that much of an impact.
Until 10:00 I had the three pods running, which I had scaled up manually. At 10:00 I restarted the whole deployment, which led to all workers inserting at nearly the same point in time every second.
Queries taking too long