
storage/report: don't deserialize zone configs over and over #41711

Merged · 3 commits merged into cockroachdb:master on Oct 21, 2019

Conversation

@andreimatei (Contributor) commented Oct 17, 2019

This patch is expected to speed up report generation considerably by not
unmarshalling zone config protos for every range. Instead, the
report-generating visitors now keep state about the zone that a visited
range is in and reuse that state when they're told that the following
range is in the same zone. The range iteration infrastructure was
enhanced to figure out when zones change from range to range and to tell
the visitors.
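
A minimal sketch of that visitor contract, with hypothetical names and signatures rather than the exact interface in pkg/storage/reports:

```go
package reports

import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/roachpb"
)

// rangeVisitor sketches the contract described above: the iteration
// code tells a visitor whether the current range is still in the zone
// of the previous range, so the visitor can reuse its cached,
// already-unmarshalled zone state instead of decoding it again.
type rangeVisitor interface {
	// visitNewZone is called for the first range of a zone; the visitor
	// unmarshals and caches the zone config here.
	visitNewZone(ctx context.Context, r *roachpb.RangeDescriptor) error
	// visitSameZone is called for subsequent ranges in the same zone;
	// the cached state is reused.
	visitSameZone(ctx context.Context, r *roachpb.RangeDescriptor) error
	// failed reports whether a previous visit call returned an error.
	failed() bool
}
```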

Without this patch, generating the reports was pinning a core for
minutes on a cluster with partitioning and 200k ranges. Profiles showed
that virtually all of that time was spent unmarshalling zone configs.

Fixes #41609

Release note (performance improvement): The performance of generating
the system.replication_* reports was greatly improved for large
clusters.

@andreimatei requested a review from @darinpp on October 17, 2019 22:36
@cockroach-teamcity (Member) commented: This change is Reviewable

@andreimatei (Contributor, Author) commented:

I still need to write some new tests, but PTAL

One message was using an ugly &rngDesc. Getting that pointer is no
longer required since I made rangeDesc values implement the fmt.Stringer
interface.

Release note: None
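
For context, implementing fmt.Stringer with a value receiver is what makes the &rngDesc unnecessary; a standalone sketch with a stand-in type:

```go
package main

import "fmt"

// rangeDesc is a stand-in for the type in the patch.
type rangeDesc struct {
	startKey, endKey string
}

// String implements fmt.Stringer with a value receiver, so both
// rangeDesc values and pointers format cleanly with %s.
func (r rangeDesc) String() string {
	return fmt.Sprintf("[%s, %s)", r.startKey, r.endKey)
}

func main() {
	rd := rangeDesc{startKey: "/Table/51", endKey: "/Table/52"}
	fmt.Printf("range: %s\n", rd) // no &rd needed
}
```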
Before this patch, errors encountered by the report-producing range
visitors were either swallowed or treated as fatal. This patch makes
such errors bubble up to the reporting infrastructure, which then stops
using a visitor that returned an error and skips persisting its report.
This is particularly useful for an upcoming commit: visitors will become
more stateful, so continuing iteration after an error is increasingly
dubious.

Release note: None
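
A rough sketch of this policy, continuing the hypothetical rangeVisitor interface above (error aggregation and zone-change detection simplified away):

```go
// visitRanges sketches the new error policy: a visitor that returns an
// error is taken out of rotation for the rest of the iteration (it is
// assumed to mark itself failed), and the caller later skips persisting
// reports for failed visitors.
func visitRanges(
	ctx context.Context, ranges []roachpb.RangeDescriptor, visitors []rangeVisitor,
) error {
	var errs []error
	for i := range ranges {
		for _, v := range visitors {
			if v.failed() {
				// This visitor errored earlier; stop feeding it ranges.
				continue
			}
			if err := v.visitNewZone(ctx, &ranges[i]); err != nil {
				errs = append(errs, err)
			}
		}
	}
	if len(errs) > 0 {
		return errs[0] // simplified; the patch aggregates into visitorError
	}
	return nil
}
```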
@darinpp (Contributor) left a comment:


Looks good.

@ajwerner (Contributor) left a comment:


Seems fine to me. Mostly just nits.

Reviewed 2 of 5 files at r3.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @andreimatei)


pkg/storage/reports/constraint_stats_report.go, line 445 at r3 (raw file):

	defer func() {
		if retErr != nil {

super nit: v.visitErr = retErr != nil?
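
Spelled out, the suggestion replaces a conditional assignment inside the defer with a direct boolean assignment (surrounding code assumed from the quoted fragment):

```go
// visitorSketch stands in for the visitor type under review.
type visitorSketch struct {
	visitErr bool
}

// visit shows the suggested simplification: instead of
//
//	defer func() {
//		if retErr != nil {
//			v.visitErr = true
//		}
//	}()
//
// assign the boolean directly in the defer.
func (v *visitorSketch) visit() (retErr error) {
	defer func() {
		v.visitErr = retErr != nil
	}()
	return nil // actual visiting work elided
}
```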


pkg/storage/reports/critical_localities_report.go, line 318 at r3 (raw file):

	defer func() {
		if retErr != nil {

super nit: v.visitErr = retErr != nil


pkg/storage/reports/critical_localities_report.go, line 349 at r3 (raw file):

) (retErr error) {
	defer func() {
		if retErr != nil {

...


pkg/storage/reports/reporter.go, line 345 at r3 (raw file):

	}
	// Check that the subzone, if any, is the same.
	if r.lastZone == nil {

I don't really get this case. Are we saying that sometimes we call setZone with nil?


pkg/storage/reports/reporter.go, line 555 at r3 (raw file):

// visitorError is returned by visitRanges when one or more visitors failed.
type visitorError struct {

style question which I could go either way on: how about type visitorError []error
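
For illustration, the two shapes under discussion; the struct's field is assumed, since only the first line of the declaration is quoted:

```go
import "fmt"

// As written in the patch (field name assumed). Constructed only with
// at least one error, so Error() can safely index errs[0].
type visitorError struct {
	errs []error
}

func (e *visitorError) Error() string {
	return fmt.Sprintf("%d visitor(s) failed; first error: %v", len(e.errs), e.errs[0])
}

// The alternative floated in review: a named slice type with the same
// Error() method hung off it.
//
//	type visitorError []error
```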


pkg/storage/reports/reporter.go, line 635 at r3 (raw file):

				// Sanity check - v.failed() should return an error now (the same as err above).
				if !v.failed() {
					return errors.Errorf("expected visitor %T to have failed() after error: %s", v, err)

Would this be a reasonable use case for errors.NewAssertionErrorWithWrappedErrf?

@andreimatei (Contributor, Author) left a comment:


I've got a better structure now and added a bunch of tests. PTAL.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner)


pkg/storage/reports/constraint_stats_report.go, line 445 at r3 (raw file):

Previously, ajwerner wrote…

super nit: v.visitErr = retErr != nil?

done


pkg/storage/reports/critical_localities_report.go, line 318 at r3 (raw file):

Previously, ajwerner wrote…

super nit: v.visitErr = retErr != nil

done


pkg/storage/reports/critical_localities_report.go, line 349 at r3 (raw file):

Previously, ajwerner wrote…

...

done


pkg/storage/reports/reporter.go, line 345 at r3 (raw file):

Previously, ajwerner wrote…

I don't really get this case. Are we saying that sometimes we call setZone with nil?

this was some nonsense


pkg/storage/reports/reporter.go, line 555 at r3 (raw file):

Previously, ajwerner wrote…

style question which I could go either way on: how about type visitorError []error

meh


pkg/storage/reports/reporter.go, line 635 at r3 (raw file):

Previously, ajwerner wrote…

Would this be a reasonable use case for errors.NewAssertionErrorWithWrappedErrf?

I don't think so. Here the original err is not a "cause" of the assertion failure; this is not a usual wrapping situation
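
For reference, the two options being weighed, sketched with the cockroachdb/errors API (whether wrapping is appropriate here is exactly the point of disagreement):

```go
import "github.com/cockroachdb/errors"

// checkVisitorFailed illustrates both variants.
func checkVisitorFailed(v rangeVisitor, err error) error {
	if !v.failed() {
		// Wrapping variant: marks err as the *cause* of the assertion
		// failure, which the author argues it is not:
		//
		//	return errors.NewAssertionErrorWithWrappedErrf(err,
		//		"expected visitor %T to have failed()", v)
		//
		// Non-wrapping variant, matching the author's reading, where
		// err is merely context:
		return errors.AssertionFailedf(
			"expected visitor %T to have failed() after error: %s", v, err)
	}
	return nil
}
```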

@andreimatei force-pushed the reports.speedup branch 2 times, most recently from 44ee9b3 to 52bd230 on October 21, 2019 14:27
@ajwerner (Contributor) left a comment:


:lgtm:

There's something unsatisfying about the copy-pasta on each of the visitor implementations but I can live with it.

Reviewed 4 of 9 files at r4.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @andreimatei)


pkg/storage/reports/reporter.go, line 354 at r4 (raw file):

) (ZoneKey, error) {
	objectID, _ := config.DecodeKeyIntoZoneIDAndSuffix(rd.StartKey)
	first := true

super nit: add commentary to make it clearer to the reader that visitZones walks from the bottom up. The logic is clear once you know that fact.
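
A toy illustration of that bottom-up precedence, hypothetical names only (the real walk operates on the zone hierarchy starting from config.DecodeKeyIntoZoneIDAndSuffix and handles subzones):

```go
// ZoneConfigSketch is a stand-in for the real zone config proto.
type ZoneConfigSketch struct{ NumReplicas int32 }

// zoneForRange illustrates the bottom-up walk the comment asks to
// document: the most specific level that defines a config wins, i.e.
// partition subzone first, then table, then database, then default.
func zoneForRange(subzone, table, database, def *ZoneConfigSketch) *ZoneConfigSketch {
	for _, z := range []*ZoneConfigSketch{subzone, table, database, def} {
		if z != nil {
			return z
		}
	}
	return def // unreachable if the default zone is always set
}
```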


pkg/storage/reports/reporter.go, line 389 at r4 (raw file):

// example, if the zoneChecker was previously configured for a range starting at
// /Table/51 and is now queried for /Table/52, it will say that the zones don't
// match even if in fact they do ( because neither table defines its own zone

nit: extra space.

This patch is expected to speed up report generation considerably by not
unmarshalling zone config protos for every range. Instead, the
report-generating visitors now keep state about the zone that a visited
range is in and reuse that state when they're told that the following
range is in the same zone. The range iteration infrastructure was
enhanced to figure out when zones change from range to range and to tell
the visitors.

visitRanges() now figures out that runs of consecutive ranges belong to
the same zone. This is done through a new zoneResolver struct that's
optimized for the case where it's asked to resolve ranges in key order.
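
A minimal sketch of what such a key-ordered resolver could look like (hypothetical names; subzones ignored for brevity):

```go
package reports

import (
	"bytes"

	"github.com/cockroachdb/cockroach/pkg/roachpb"
)

// zoneResolverSketch caches the zone of the last resolved range plus
// the end of the key span that zone covers. For ranges visited in key
// order, resolving the next range is then usually a cheap span check
// rather than a zone config lookup and unmarshal.
type zoneResolverSketch struct {
	init       bool
	curZoneID  uint32       // zone of the last resolved range
	curRootEnd roachpb.RKey // exclusive end of that zone's key span
}

// resolveRange returns the zone ID for rd. lookup does the expensive
// full resolution and reports the span end of the zone it finds.
func (z *zoneResolverSketch) resolveRange(
	rd *roachpb.RangeDescriptor,
	lookup func(roachpb.RKey) (zoneID uint32, spanEnd roachpb.RKey),
) uint32 {
	if z.init && bytes.Compare(rd.StartKey, z.curRootEnd) < 0 {
		return z.curZoneID // fast path: same zone as the previous range
	}
	// Slow path: a new zone starts at or before this range.
	z.curZoneID, z.curRootEnd = lookup(rd.StartKey)
	z.init = true
	return z.curZoneID
}
```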

Without this patch, generating the reports was pinning a core for
minutes on a cluster with partitioning and 200k ranges. Profiles showed
that virtually all of that time was spent unmarshalling zone configs.

Fixes cockroachdb#41609

Release note (performance improvement): The performance of generating
the system.replication_* reports was greatly improved for large
clusters.
@andreimatei (Contributor, Author) left a comment:


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @ajwerner)


pkg/storage/reports/reporter.go, line 354 at r4 (raw file):

Previously, ajwerner wrote…

super nit: add commentary to make it clearer to the reader that visitZones walks from the bottom up. The logic is clear once you know that fact.

done


pkg/storage/reports/reporter.go, line 389 at r4 (raw file):

Previously, ajwerner wrote…

nit: extra space.

done

@andreimatei (Contributor, Author) left a comment:


bors r+

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @ajwerner)

craig bot pushed a commit that referenced this pull request Oct 21, 2019
41684: importccl: add 19.2 version gate check r=miretskiy a=miretskiy

Ensure the cluster is fully upgraded when running import.

Release note: ensure cluster fully upgraded when running import.

41711: storage/report: don't deserialize zone configs over and over r=andreimatei a=andreimatei

This patch is expected to speed up report generation considerably by not
unmarshalling zone config protos for every range. Instead, the
report-generating visitors now keep state about the zone that a visited
range is in and reuse that state when they're told that the following
range is in the same zone. The range iteration infrastructure was
enhanced to figure out when zones change from range to range and to tell
the visitors.

Without this patch, generating the reports was pinning a core for
minutes on a cluster with partitioning and 200k ranges. Profiles showed
that virtually all of that time was spent unmarshalling zone configs.

Fixes #41609

Release note (performance improvement): The performance of generating
the system.replication_* reports was greatly improved for large
clusters.

41761: storage: write to local storage before updating liveness r=bdarnell a=irfansharif

Previously a disk stall could allow a node to continue heartbeating its
liveness record and prevent other nodes from taking over its leases,
despite being completely unresponsive.

This was first addressed in #24591 (+ #33122). This was undone in #32978
(which introduced a stricter version of a similar check). #32978 was
later disabled by default in #36484, leaving us without the protections
first introduced in #24591. This PR re-adds the logic from #24591.

Part of #41683.

Release note: None.

Co-authored-by: Yevgeniy Miretskiy <[email protected]>
Co-authored-by: Andrei Matei <[email protected]>
Co-authored-by: irfan sharif <[email protected]>
@craig bot (Contributor) commented Oct 21, 2019

Build succeeded

@craig craig bot merged commit e854f99 into cockroachdb:master Oct 21, 2019
@andreimatei andreimatei deleted the reports.speedup branch October 21, 2019 20:19
Successfully merging this pull request may close these issues:

storage: replication reports generation takes 20% of a node's CPU for 200k ranges (#41609)