-
Notifications
You must be signed in to change notification settings - Fork 3.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
storage: replication reports generation takes 20% of a node's CPU for 200k ranges #41609
Labels
A-kv-replication
Relating to Raft, consensus, and coordination.
C-performance
Perf of queries or internals. Solution not expected to change functional behavior.
Comments
andreimatei
added
C-performance
Perf of queries or internals. Solution not expected to change functional behavior.
A-kv-replication
Relating to Raft, consensus, and coordination.
labels
Oct 15, 2019
I hope to have a fix for this tomorrow. |
andreimatei
added a commit
to andreimatei/cockroach
that referenced
this issue
Oct 18, 2019
This patch is expected to speedup reports generation considerably by not unmarshalling zone config protos for every range. Instead, the report-geneating visitors now keep state around the zone that a visited range is in and reuse that state if they're told that the following range is in the same zone. The range iteration infrastructure was enhanced to figure out when zones change from range to range and tell that to the visitors. Without this patch, generating the reports was pinning a core for minutes for a cluster with partitioning and 200k ranges. The profiles showed that it's all zone config unmarshalling. Fixes cockroachdb#41609 Release note (performance improvement): The performance of generating the system.replication_* reports was greatly improved for large clusters.
andreimatei
added a commit
to andreimatei/cockroach
that referenced
this issue
Oct 18, 2019
This patch is expected to speedup reports generation considerably by not unmarshalling zone config protos for every range. Instead, the report-geneating visitors now keep state around the zone that a visited range is in and reuse that state if they're told that the following range is in the same zone. The range iteration infrastructure was enhanced to figure out when zones change from range to range and tell that to the visitors. visitRanges() now figures out that runs of consecutive ranges belong to the same zone. This is done through a new zoneResolver struct, that's optimized for the case where it's asked to resolve ranges in key order. Without this patch, generating the reports was pinning a core for minutes for a cluster with partitioning and 200k ranges. The profiles showed that it's all zone config unmarshalling. Fixes cockroachdb#41609 Release note (performance improvement): The performance of generating the system.replication_* reports was greatly improved for large clusters.
andreimatei
added a commit
to andreimatei/cockroach
that referenced
this issue
Oct 21, 2019
This patch is expected to speedup reports generation considerably by not unmarshalling zone config protos for every range. Instead, the report-geneating visitors now keep state around the zone that a visited range is in and reuse that state if they're told that the following range is in the same zone. The range iteration infrastructure was enhanced to figure out when zones change from range to range and tell that to the visitors. visitRanges() now figures out that runs of consecutive ranges belong to the same zone. This is done through a new zoneResolver struct, that's optimized for the case where it's asked to resolve ranges in key order. Without this patch, generating the reports was pinning a core for minutes for a cluster with partitioning and 200k ranges. The profiles showed that it's all zone config unmarshalling. Fixes cockroachdb#41609 Release note (performance improvement): The performance of generating the system.replication_* reports was greatly improved for large clusters.
craig bot
pushed a commit
that referenced
this issue
Oct 21, 2019
41684: importccl: add 19.2 version gate check r=miretskiy a=miretskiy Ensure the cluster is fully upgraded when running import. Release note: ensure cluster fully upgraded when running import. 41711: storage/report: don't deserialize zone configs over and over r=andreimatei a=andreimatei This patch is expected to speedup reports generation considerably by not unmarshalling zone config protos for every range. Instead, the report-geneating visitors now keep state around the zone that a visited range is in and reuse that state if they're told that the following range is in the same zone. The range iteration infrastructure was enhanced to figure out when zones change from range to range and tell that to the visitors. Without this patch, generating the reports was pinning a core for minutes for a cluster with partitioning and 200k ranges. The profiles showed that it's all zone config unmarshalling. Fixes #41609 Release note (performance improvement): The performance of generating the system.replication_* reports was greatly improved for large clusters. 41761: storage: write to local storage before updating liveness r=bdarnell a=irfansharif Previously a disk stall could allow a node to continue heartbeating its liveness record and prevent other nodes from taking over its leases, despite being completely unresponsive. This was first addressed in #24591 (+ #33122). This was undone in #32978 (which introduced a stricter version of a similar check). #32978 was later disabled by default in #36484, leaving us without the protections first introduced in #24591. This PR re-adds the logic from #24591. Part of #41683. Release note: None. Co-authored-by: Yevgeniy Miretskiy <[email protected]> Co-authored-by: Andrei Matei <[email protected]> Co-authored-by: irfan sharif <[email protected]>
arulajmani
pushed a commit
to arulajmani/cockroach
that referenced
this issue
Oct 29, 2019
This patch is expected to speedup reports generation considerably by not unmarshalling zone config protos for every range. Instead, the report-geneating visitors now keep state around the zone that a visited range is in and reuse that state if they're told that the following range is in the same zone. The range iteration infrastructure was enhanced to figure out when zones change from range to range and tell that to the visitors. visitRanges() now figures out that runs of consecutive ranges belong to the same zone. This is done through a new zoneResolver struct, that's optimized for the case where it's asked to resolve ranges in key order. Without this patch, generating the reports was pinning a core for minutes for a cluster with partitioning and 200k ranges. The profiles showed that it's all zone config unmarshalling. Fixes cockroachdb#41609 Release note (performance improvement): The performance of generating the system.replication_* reports was greatly improved for large clusters.
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Labels
A-kv-replication
Relating to Raft, consensus, and coordination.
C-performance
Perf of queries or internals. Solution not expected to change functional behavior.
On a TPCC-100k run, with partitioning, the following profile shows report generation (at the left) taking a bunch of CPU. The generation of each report seems to take minutes.
We're unmarshalling the same zone protos over and over again.
The text was updated successfully, but these errors were encountered: