storage/report: don't deserialize zone configs over and over #41711
Conversation
Force-pushed from 6ada7a5 to e1741bc
I still need to write some new tests, but PTAL
Force-pushed from e1741bc to 4072dad
One message was using an ugly &rngDesc. Getting that pointer is no longer required since I made rangeDesc values implement the Stringer interface. Release note: None
Before this patch, errors encountered by the report-producing range visitors were either swallowed or fataled. This patch makes the error bubble up to the reporting infra, which then stops using a visitor that returned an error and skips persisting its report. This is particularly useful for an upcoming commit: visitors will become more stateful and so continuing iteration after an error is increasingly dubious. Release note: None
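For readers following along, here is a minimal, stand-alone sketch of that shape: iteration continues, but a visitor that has returned an error is no longer fed ranges and its report would not be persisted. All names here (rangeVisitor, countingVisitor, visitRanges) are illustrative, not the actual pkg/storage/reports types.

```go
package main

import (
	"errors"
	"fmt"
)

// rangeVisitor is a stand-in for the report-producing visitors.
type rangeVisitor interface {
	visit(rangeID int) error
	failed() bool
}

// countingVisitor counts ranges and records whether any visit errored.
type countingVisitor struct {
	count    int
	visitErr bool
}

func (v *countingVisitor) visit(rangeID int) error {
	if rangeID < 0 {
		v.visitErr = true
		return errors.New("negative range id")
	}
	v.count++
	return nil
}

func (v *countingVisitor) failed() bool { return v.visitErr }

// visitRanges keeps iterating over ranges, but once a visitor has failed it
// stops being fed ranges; the caller would also skip persisting its report.
func visitRanges(ranges []int, visitors ...rangeVisitor) error {
	var errs []error
	for _, r := range ranges {
		for _, v := range visitors {
			if v.failed() {
				continue
			}
			if err := v.visit(r); err != nil {
				errs = append(errs, err)
			}
		}
	}
	if len(errs) > 0 {
		return fmt.Errorf("%d visitor(s) failed: %v", len(errs), errs)
	}
	return nil
}

func main() {
	v := &countingVisitor{}
	err := visitRanges([]int{1, 2, -1, 3}, v)
	fmt.Println(v.count, v.failed(), err) // 2 true 1 visitor(s) failed: ...
}
```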
Force-pushed from 4072dad to a704750
Looks good.
Seems fine to me. Mostly just nits.
Reviewed 2 of 5 files at r3.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @andreimatei)
pkg/storage/reports/constraint_stats_report.go, line 445 at r3 (raw file):
defer func() { if retErr != nil {
super nit: v.visitErr = retErr != nil ?
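For clarity, a hedged stand-alone sketch of the two shapes being compared; the field name follows the quoted snippet, the surrounding function is made up.

```go
package sketch

import "errors"

type visitor struct {
	visitErr bool
}

// visitVerbose mirrors the shape in the diff: the flag is set only when an
// error actually occurred.
func (v *visitor) visitVerbose(fail bool) (retErr error) {
	defer func() {
		if retErr != nil {
			v.visitErr = true
		}
	}()
	if fail {
		return errors.New("boom")
	}
	return nil
}

// visitCompact is the suggested one-liner.
func (v *visitor) visitCompact(fail bool) (retErr error) {
	defer func() { v.visitErr = retErr != nil }()
	if fail {
		return errors.New("boom")
	}
	return nil
}
```

One subtle difference: the one-liner also resets visitErr to false on a successful visit, whereas the original only ever sets it to true.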
pkg/storage/reports/critical_localities_report.go, line 318 at r3 (raw file):
defer func() { if retErr != nil {
super nit: v.visitErr = retErr != nil
pkg/storage/reports/critical_localities_report.go, line 349 at r3 (raw file):
) (retErr error) { defer func() { if retErr != nil {
...
pkg/storage/reports/reporter.go, line 345 at r3 (raw file):
} // Check that the subzone, if any, is the same. if r.lastZone == nil {
I don't really get this case. Are we saying that sometimes we call setZone with nil?
pkg/storage/reports/reporter.go, line 555 at r3 (raw file):
// visitorError is returned by visitRanges when one or more visitors failed. type visitorError struct {
style question which I could go either way on: how about type visitorError []error
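For reference, a rough sketch of the two shapes under discussion; the actual definition and field names in the PR may differ.

```go
package sketch

import "strings"

// Shape the PR uses (roughly): a struct wrapping the collected errors.
type visitorErrorStruct struct {
	errs []error
}

func (e *visitorErrorStruct) Error() string {
	parts := make([]string, len(e.errs))
	for i, err := range e.errs {
		parts[i] = err.Error()
	}
	return strings.Join(parts, "; ")
}

// Alternative suggested above: a named slice type.
type visitorError []error

func (e visitorError) Error() string {
	parts := make([]string, len(e))
	for i, err := range e {
		parts[i] = err.Error()
	}
	return strings.Join(parts, "; ")
}
```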
pkg/storage/reports/reporter.go, line 635 at r3 (raw file):
// Sanity check - v.failed() should return an error now (the same as err above). if !v.failed() { return errors.Errorf("expected visitor %T to have failed() after error: %s", v, err)
Would this be a reasonable use case for errors.NewAssertionErrorWithWrappedErrf?
Force-pushed from a704750 to de72424
I've got a better structure now and added a bunch of tests. PTAL.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner)
pkg/storage/reports/constraint_stats_report.go, line 445 at r3 (raw file):
Previously, ajwerner wrote…
super nit: v.visitErr = retErr != nil ?
done
pkg/storage/reports/critical_localities_report.go, line 318 at r3 (raw file):
Previously, ajwerner wrote…
super nit:
v.visitErr = retErr != nil
done
pkg/storage/reports/critical_localities_report.go, line 349 at r3 (raw file):
Previously, ajwerner wrote…
...
done
pkg/storage/reports/reporter.go, line 345 at r3 (raw file):
Previously, ajwerner wrote…
I don't really get this case. Are we saying that sometimes we call setZone with nil?
this was some nonsense
pkg/storage/reports/reporter.go, line 555 at r3 (raw file):
Previously, ajwerner wrote…
style question which I could go either way on: how about
type visitorError []error
meh
pkg/storage/reports/reporter.go, line 635 at r3 (raw file):
Previously, ajwerner wrote…
Would this be a reasonable use case for errors.NewAssertionErrorWithWrappedErrf?
I don't think so. Here the original err is not a "cause" of the assertion failure; this is not a usual wrapping situation
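To illustrate the distinction, a hedged paraphrase of the check (not the exact reporter.go code), using the github.com/cockroachdb/errors package:

```go
package sketch

import "github.com/cockroachdb/errors"

// checkFailed is a paraphrase of the sanity check: err is the error the
// visitor just returned, used only as context for the message.
func checkFailed(v interface{}, failed bool, err error) error {
	if failed {
		return nil
	}
	// What the PR does: format err into the message.
	return errors.Errorf("expected visitor %T to have failed() after error: %s", v, err)

	// The reviewer's alternative would treat err as a wrapped cause:
	//   return errors.NewAssertionErrorWithWrappedErrf(
	//       err, "expected visitor %T to have failed()", v)
}
```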
Force-pushed from 44ee9b3 to 52bd230
There's something unsatisfying about the copy-pasta on each of the visitor implementations but I can live with it.
Reviewed 4 of 9 files at r4.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @andreimatei)
pkg/storage/reports/reporter.go, line 354 at r4 (raw file):
) (ZoneKey, error) { objectID, _ := config.DecodeKeyIntoZoneIDAndSuffix(rd.StartKey) first := true
super nit: add commentary to make it clearer to the reader that visitZones walks from the bottom up. The logic is clear once you know that fact.
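For readers without the diff in front of them, a hypothetical sketch of what a bottom-up walk looks like: start at the most specific zone for the object (e.g. a partition or index subzone) and move up through parents until the visitor says to stop. The types here are illustrative, not the ones in pkg/storage/reports.

```go
package sketch

// zone is a node in a simplified zone hierarchy
// (subzone -> table -> database -> default).
type zone struct {
	name   string
	parent *zone // nil at the root (default zone)
}

// visitZones walks from the given (most specific) zone up through its
// parents, stopping early once visit returns true.
func visitZones(leaf *zone, visit func(*zone) bool) {
	for z := leaf; z != nil; z = z.parent {
		if visit(z) {
			return
		}
	}
}
```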
pkg/storage/reports/reporter.go, line 389 at r4 (raw file):
// example, if the zoneChecker was previously configured for a range starting at // /Table/51 and is now queried for /Table/52, it will say that the zones don't // match even if in fact they do ( because neither table defines its own zone
nit: extra space.
This patch is expected to speed up report generation considerably by not unmarshalling zone config protos for every range. Instead, the report-generating visitors now keep state around the zone that a visited range is in and reuse that state if they're told that the following range is in the same zone. The range iteration infrastructure was enhanced to figure out when zones change from range to range and tell that to the visitors. visitRanges() now figures out that runs of consecutive ranges belong to the same zone. This is done through a new zoneResolver struct that's optimized for the case where it's asked to resolve ranges in key order.

Without this patch, generating the reports was pinning a core for minutes for a cluster with partitioning and 200k ranges. The profiles showed that it's all zone config unmarshalling.

Fixes cockroachdb#41609

Release note (performance improvement): The performance of generating the system.replication_* reports was greatly improved for large clusters.
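A minimal sketch of the caching idea, under assumptions: the resolver remembers the key span covered by the zone it last resolved, so resolving the next range in key order is usually two key comparisons instead of a proto unmarshal. The names (zoneResolver, resolveRange, decodeZoneForKey) are illustrative, not the exact ones from the PR.

```go
package sketch

import "bytes"

// zoneKey identifies a zone config; in the real code it also carries a
// subzone index.
type zoneKey struct {
	zoneID uint32
}

// zoneResolver caches the key span covered by the last zone it resolved.
type zoneResolver struct {
	init     bool
	startKey []byte // inclusive start of the cached zone's span
	endKey   []byte // exclusive end of the cached zone's span
	curZone  zoneKey
}

// decodeZoneForKey stands in for the expensive path: decode the key and
// unmarshal zone config protos to find the zone (and its span) covering it.
func decodeZoneForKey(key []byte) (zoneKey, []byte, []byte) {
	end := append(append([]byte(nil), key...), 0xff)
	return zoneKey{zoneID: 1}, key, end
}

// resolveRange returns the zone for a range starting at startKey.
func (zr *zoneResolver) resolveRange(startKey []byte) zoneKey {
	if zr.init &&
		bytes.Compare(startKey, zr.startKey) >= 0 &&
		bytes.Compare(startKey, zr.endKey) < 0 {
		return zr.curZone // fast path: same zone as the previous range
	}
	zr.curZone, zr.startKey, zr.endKey = decodeZoneForKey(startKey)
	zr.init = true
	return zr.curZone
}
```

Because reports visit ranges in key order, consecutive ranges usually fall inside the cached span, so the expensive decode runs roughly once per zone rather than once per range.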
Force-pushed from 52bd230 to e854f99
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @ajwerner)
pkg/storage/reports/reporter.go, line 354 at r4 (raw file):
Previously, ajwerner wrote…
super nit: add commentary to make it clearer to the reader that visitZones walks from the bottom up. The logic is clear once you know that fact.
done
pkg/storage/reports/reporter.go, line 389 at r4 (raw file):
Previously, ajwerner wrote…
nit: extra space.
done
bors r+
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @ajwerner)
41684: importccl: add 19.2 version gate check r=miretskiy a=miretskiy

Ensure the cluster is fully upgraded when running import. Release note: ensure cluster fully upgraded when running import.

41711: storage/report: don't deserialize zone configs over and over r=andreimatei a=andreimatei

This patch is expected to speed up report generation considerably by not unmarshalling zone config protos for every range. Instead, the report-generating visitors now keep state around the zone that a visited range is in and reuse that state if they're told that the following range is in the same zone. The range iteration infrastructure was enhanced to figure out when zones change from range to range and tell that to the visitors.

Without this patch, generating the reports was pinning a core for minutes for a cluster with partitioning and 200k ranges. The profiles showed that it's all zone config unmarshalling.

Fixes #41609

Release note (performance improvement): The performance of generating the system.replication_* reports was greatly improved for large clusters.

41761: storage: write to local storage before updating liveness r=bdarnell a=irfansharif

Previously a disk stall could allow a node to continue heartbeating its liveness record and prevent other nodes from taking over its leases, despite being completely unresponsive. This was first addressed in #24591 (+ #33122). This was undone in #32978 (which introduced a stricter version of a similar check). #32978 was later disabled by default in #36484, leaving us without the protections first introduced in #24591. This PR re-adds the logic from #24591. Part of #41683. Release note: None.

Co-authored-by: Yevgeniy Miretskiy <[email protected]>
Co-authored-by: Andrei Matei <[email protected]>
Co-authored-by: irfan sharif <[email protected]>
Build succeeded
This patch is expected to speed up report generation considerably by not
unmarshalling zone config protos for every range. Instead, the
report-generating visitors now keep state around the zone that a visited
range is in and reuse that state if they're told that the following
range is in the same zone. The range iteration infrastructure was
enhanced to figure out when zones change from range to range and tell
that to the visitors.

Without this patch, generating the reports was pinning a core for
minutes for a cluster with partitioning and 200k ranges. The profiles
showed that it's all zone config unmarshalling.

Fixes #41609

Release note (performance improvement): The performance of generating
the system.replication_* reports was greatly improved for large
clusters.
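From the visitor's side, "keep state around the zone and reuse it" could look roughly like the sketch below: the iteration layer tells the visitor whether the zone changed, and only then does the visitor recompute its per-zone state. The method names (visitNewZone, visitSameZone) and types are illustrative of the split described in the commit message, not necessarily the PR's actual interface.

```go
package sketch

// rangeDesc is a stand-in for roachpb.RangeDescriptor.
type rangeDesc struct{ startKey string }

// perZoneState is whatever the visitor derives from a zone config (replica
// counts, constraints, ...); computing it is the expensive part.
type perZoneState struct{ numReplicas int }

type constraintVisitor struct {
	cur perZoneState
}

// visitNewZone is called by the iteration layer when the range's zone differs
// from the previous range's; only here is the zone config consulted.
func (v *constraintVisitor) visitNewZone(r rangeDesc) {
	v.cur = perZoneState{numReplicas: 3} // placeholder for reading the zone config
	v.visitSameZone(r)
}

// visitSameZone reuses the cached per-zone state for ranges in the same zone.
func (v *constraintVisitor) visitSameZone(r rangeDesc) {
	_ = v.cur // update report counters for r using v.cur
}
```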