
storage/report: don't deserialize zone configs over and over #41711

Merged · 3 commits merged into cockroachdb:master on Oct 21, 2019

Conversation

@andreimatei (Contributor) commented Oct 17, 2019

This patch is expected to speed up report generation considerably by not
unmarshalling zone config protos for every range. Instead, the
report-generating visitors now keep state about the zone that a visited
range is in and reuse that state when they're told that the following
range is in the same zone. The range iteration infrastructure was
enhanced to figure out when zones change from range to range and to tell
the visitors.
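
A minimal sketch of that visitor contract, with hypothetical names and signatures rather than the exact interface in pkg/storage/reports:

```go
package reports

import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/roachpb"
)

// rangeVisitor sketches the contract described above: the iteration
// code tells a visitor whether the current range is still in the zone
// of the previous range, so the visitor can reuse its cached,
// already-unmarshalled zone state instead of decoding it again.
type rangeVisitor interface {
	// visitNewZone is called for the first range of a zone; the visitor
	// unmarshals and caches the zone config here.
	visitNewZone(ctx context.Context, r *roachpb.RangeDescriptor) error
	// visitSameZone is called for subsequent ranges in the same zone;
	// the cached state is reused.
	visitSameZone(ctx context.Context, r *roachpb.RangeDescriptor) error
	// failed reports whether a previous visit call returned an error.
	failed() bool
}
```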

Without this patch, generating the reports was pinning a core for
minutes on a cluster with partitioning and 200k ranges. Profiles showed
that virtually all of that time was spent unmarshalling zone configs.

Fixes #41609

Release note (performance improvement): The performance of generating
the system.replication_* reports was greatly improved for large
clusters.

@andreimatei requested a review from @darinpp on October 17, 2019 22:36
@cockroach-teamcity (Member) commented: This change is Reviewable

@andreimatei (Contributor, Author) commented:

I still need to write some new tests, but PTAL

One message was using an ugly &rngDesc. Getting that pointer is no
longer required since I made rangeDesc values implement the fmt.Stringer
interface.

Release note: None
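
For context, implementing fmt.Stringer with a value receiver is what makes the &rngDesc unnecessary; a standalone sketch with a stand-in type:

```go
package main

import "fmt"

// rangeDesc is a stand-in for the type in the patch.
type rangeDesc struct {
	startKey, endKey string
}

// String implements fmt.Stringer with a value receiver, so both
// rangeDesc values and pointers format cleanly with %s.
func (r rangeDesc) String() string {
	return fmt.Sprintf("[%s, %s)", r.startKey, r.endKey)
}

func main() {
	rd := rangeDesc{startKey: "/Table/51", endKey: "/Table/52"}
	fmt.Printf("range: %s\n", rd) // no &rd needed
}
```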
Before this patch, errors encountered by the report-producing range
visitors were either swallowed or treated as fatal. This patch makes
such errors bubble up to the reporting infrastructure, which then stops
using a visitor that returned an error and skips persisting its report.
This is particularly useful for an upcoming commit: visitors will become
more stateful, so continuing iteration after an error is increasingly
dubious.

Release note: None
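
A rough sketch of this policy, continuing the hypothetical rangeVisitor interface above (error aggregation and zone-change detection simplified away):

```go
// visitRanges sketches the new error policy: a visitor that returns an
// error is taken out of rotation for the rest of the iteration (it is
// assumed to mark itself failed), and the caller later skips persisting
// reports for failed visitors.
func visitRanges(
	ctx context.Context, ranges []roachpb.RangeDescriptor, visitors []rangeVisitor,
) error {
	var errs []error
	for i := range ranges {
		for _, v := range visitors {
			if v.failed() {
				// This visitor errored earlier; stop feeding it ranges.
				continue
			}
			if err := v.visitNewZone(ctx, &ranges[i]); err != nil {
				errs = append(errs, err)
			}
		}
	}
	if len(errs) > 0 {
		return errs[0] // simplified; the patch aggregates into visitorError
	}
	return nil
}
```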
@darinpp (Contributor) left a comment:


Looks good.

@ajwerner (Contributor) left a comment:


Seems fine to me. Mostly just nits.

Reviewed 2 of 5 files at r3.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @andreimatei)


pkg/storage/reports/constraint_stats_report.go, line 445 at r3 (raw file):

	defer func() {
		if retErr != nil {

super nit: v.visitErr = retErr != nil?
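
Spelled out, the suggestion replaces a conditional assignment inside the defer with a direct boolean assignment (surrounding code assumed from the quoted fragment):

```go
// visitorSketch stands in for the visitor type under review.
type visitorSketch struct {
	visitErr bool
}

// visit shows the suggested simplification: instead of
//
//	defer func() {
//		if retErr != nil {
//			v.visitErr = true
//		}
//	}()
//
// assign the boolean directly in the defer.
func (v *visitorSketch) visit() (retErr error) {
	defer func() {
		v.visitErr = retErr != nil
	}()
	return nil // actual visiting work elided
}
```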


pkg/storage/reports/critical_localities_report.go, line 318 at r3 (raw file):

	defer func() {
		if retErr != nil {

super nit: v.visitErr = retErr != nil


pkg/storage/reports/critical_localities_report.go, line 349 at r3 (raw file):

) (retErr error) {
	defer func() {
		if retErr != nil {

...


pkg/storage/reports/reporter.go, line 345 at r3 (raw file):

	}
	// Check that the subzone, if any, is the same.
	if r.lastZone == nil {

I don't really get this case. Are we saying that sometimes we call setZone with nil?


pkg/storage/reports/reporter.go, line 555 at r3 (raw file):

// visitorError is returned by visitRanges when one or more visitors failed.
type visitorError struct {

style question which I could go either way on: how about type visitorError []error
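
For illustration, the two shapes under discussion; the struct's field is assumed, since only the first line of the declaration is quoted:

```go
import "fmt"

// As written in the patch (field name assumed). Constructed only with
// at least one error, so Error() can safely index errs[0].
type visitorError struct {
	errs []error
}

func (e *visitorError) Error() string {
	return fmt.Sprintf("%d visitor(s) failed; first error: %v", len(e.errs), e.errs[0])
}

// The alternative floated in review: a named slice type with the same
// Error() method hung off it.
//
//	type visitorError []error
```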


pkg/storage/reports/reporter.go, line 635 at r3 (raw file):

				// Sanity check - v.failed() should return an error now (the same as err above).
				if !v.failed() {
					return errors.Errorf("expected visitor %T to have failed() after error: %s", v, err)

Would this be a reasonable use case for errors.NewAssertionErrorWithWrappedErrf?

@andreimatei (Contributor, Author) left a comment:


I've got a better structure now and added a bunch of tests. PTAL.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner)


pkg/storage/reports/constraint_stats_report.go, line 445 at r3 (raw file):

Previously, ajwerner wrote…

super nit: v.visitErr = retErr != nil?

done


pkg/storage/reports/critical_localities_report.go, line 318 at r3 (raw file):

Previously, ajwerner wrote…

super nit: v.visitErr = retErr != nil

done


pkg/storage/reports/critical_localities_report.go, line 349 at r3 (raw file):

Previously, ajwerner wrote…

...

done


pkg/storage/reports/reporter.go, line 345 at r3 (raw file):

Previously, ajwerner wrote…

I don't really get this case. Are we saying that sometimes we call setZone with nil?

this was some nonsense


pkg/storage/reports/reporter.go, line 555 at r3 (raw file):

Previously, ajwerner wrote…

style question which I could go either way on: how about type visitorError []error

meh


pkg/storage/reports/reporter.go, line 635 at r3 (raw file):

Previously, ajwerner wrote…

Would this be a reasonable use case for errors.NewAssertionErrorWithWrappedErrf?

I don't think so. Here the original err is not a "cause" of the assertion failure; this is not a usual wrapping situation
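
For reference, the two options being weighed, sketched with the cockroachdb/errors API (whether wrapping is appropriate here is exactly the point of disagreement):

```go
import "github.com/cockroachdb/errors"

// checkVisitorFailed illustrates both variants.
func checkVisitorFailed(v rangeVisitor, err error) error {
	if !v.failed() {
		// Wrapping variant: marks err as the *cause* of the assertion
		// failure, which the author argues it is not:
		//
		//	return errors.NewAssertionErrorWithWrappedErrf(err,
		//		"expected visitor %T to have failed()", v)
		//
		// Non-wrapping variant, matching the author's reading, where
		// err is merely context:
		return errors.AssertionFailedf(
			"expected visitor %T to have failed() after error: %s", v, err)
	}
	return nil
}
```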

@andreimatei force-pushed the reports.speedup branch 2 times, most recently from 44ee9b3 to 52bd230 on October 21, 2019 14:27
@ajwerner (Contributor) left a comment:


:lgtm:

There's something unsatisfying about the copy-pasta on each of the visitor implementations but I can live with it.

Reviewed 4 of 9 files at r4.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @andreimatei)


pkg/storage/reports/reporter.go, line 354 at r4 (raw file):

) (ZoneKey, error) {
	objectID, _ := config.DecodeKeyIntoZoneIDAndSuffix(rd.StartKey)
	first := true

super nit: add commentary to make it clearer to the reader that visitZones walks from the bottom up. The logic is clear once you know that fact.
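
A toy illustration of that bottom-up precedence, hypothetical names only (the real walk operates on the zone hierarchy starting from config.DecodeKeyIntoZoneIDAndSuffix and handles subzones):

```go
// ZoneConfigSketch is a stand-in for the real zone config proto.
type ZoneConfigSketch struct{ NumReplicas int32 }

// zoneForRange illustrates the bottom-up walk the comment asks to
// document: the most specific level that defines a config wins, i.e.
// partition subzone first, then table, then database, then default.
func zoneForRange(subzone, table, database, def *ZoneConfigSketch) *ZoneConfigSketch {
	for _, z := range []*ZoneConfigSketch{subzone, table, database, def} {
		if z != nil {
			return z
		}
	}
	return def // unreachable if the default zone is always set
}
```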


pkg/storage/reports/reporter.go, line 389 at r4 (raw file):

// example, if the zoneChecker was previously configured for a range starting at
// /Table/51 and is now queried for /Table/52, it will say that the zones don't
// match even if in fact they do ( because neither table defines its own zone

nit: extra space.

This patch is expected to speed up report generation considerably by not
unmarshalling zone config protos for every range. Instead, the
report-generating visitors now keep state about the zone that a visited
range is in and reuse that state when they're told that the following
range is in the same zone. The range iteration infrastructure was
enhanced to figure out when zones change from range to range and to tell
the visitors.

visitRanges() now figures out that runs of consecutive ranges belong to
the same zone. This is done through a new zoneResolver struct that's
optimized for the case where it's asked to resolve ranges in key order.
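
A minimal sketch of what such a key-ordered resolver could look like (hypothetical names; subzones ignored for brevity):

```go
package reports

import (
	"bytes"

	"github.com/cockroachdb/cockroach/pkg/roachpb"
)

// zoneResolverSketch caches the zone of the last resolved range plus
// the end of the key span that zone covers. For ranges visited in key
// order, resolving the next range is then usually a cheap span check
// rather than a zone config lookup and unmarshal.
type zoneResolverSketch struct {
	init       bool
	curZoneID  uint32       // zone of the last resolved range
	curRootEnd roachpb.RKey // exclusive end of that zone's key span
}

// resolveRange returns the zone ID for rd. lookup does the expensive
// full resolution and reports the span end of the zone it finds.
func (z *zoneResolverSketch) resolveRange(
	rd *roachpb.RangeDescriptor,
	lookup func(roachpb.RKey) (zoneID uint32, spanEnd roachpb.RKey),
) uint32 {
	if z.init && bytes.Compare(rd.StartKey, z.curRootEnd) < 0 {
		return z.curZoneID // fast path: same zone as the previous range
	}
	// Slow path: a new zone starts at or before this range.
	z.curZoneID, z.curRootEnd = lookup(rd.StartKey)
	z.init = true
	return z.curZoneID
}
```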

Without this patch, generating the reports was pinning a core for
minutes on a cluster with partitioning and 200k ranges. Profiles showed
that virtually all of that time was spent unmarshalling zone configs.

Fixes cockroachdb#41609

Release note (performance improvement): The performance of generating
the system.replication_* reports was greatly improved for large
clusters.
@andreimatei (Contributor, Author) left a comment:


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @ajwerner)


pkg/storage/reports/reporter.go, line 354 at r4 (raw file):

Previously, ajwerner wrote…

super nit: add commentary to make it clearer to the reader that visitZones walks from the bottom up. The logic is clear once you know that fact.

done


pkg/storage/reports/reporter.go, line 389 at r4 (raw file):

Previously, ajwerner wrote…

nit: extra space.

done

@andreimatei (Contributor, Author) left a comment:


bors r+

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @ajwerner)

craig bot pushed a commit that referenced this pull request Oct 21, 2019
41684: importccl: add 19.2 version gate check r=miretskiy a=miretskiy

Ensure the cluster is fully upgraded when running import.

Release note: ensure cluster fully upgraded when running import.

41711: storage/report: don't deserialize zone configs over and over r=andreimatei a=andreimatei

This patch is expected to speed up report generation considerably by not
unmarshalling zone config protos for every range. Instead, the
report-generating visitors now keep state about the zone that a visited
range is in and reuse that state when they're told that the following
range is in the same zone. The range iteration infrastructure was
enhanced to figure out when zones change from range to range and to tell
the visitors.

Without this patch, generating the reports was pinning a core for
minutes on a cluster with partitioning and 200k ranges. Profiles showed
that virtually all of that time was spent unmarshalling zone configs.

Fixes #41609

Release note (performance improvement): The performance of generating
the system.replication_* reports was greatly improved for large
clusters.

41761: storage: write to local storage before updating liveness r=bdarnell a=irfansharif

Previously a disk stall could allow a node to continue heartbeating its
liveness record and prevent other nodes from taking over its leases,
despite being completely unresponsive.

This was first addressed in #24591 (+ #33122). This was undone in #32978
(which introduced a stricter version of a similar check). #32978 was
later disabled by default in #36484, leaving us without the protections
first introduced in #24591. This PR re-adds the logic from #24591.

Part of #41683.

Release note: None.

Co-authored-by: Yevgeniy Miretskiy <[email protected]>
Co-authored-by: Andrei Matei <[email protected]>
Co-authored-by: irfan sharif <[email protected]>
@craig bot (Contributor) commented Oct 21, 2019

Build succeeded

@craig craig bot merged commit e854f99 into cockroachdb:master Oct 21, 2019
@andreimatei andreimatei deleted the reports.speedup branch October 21, 2019 20:19
Successfully merging this pull request may close these issues:

storage: replication reports generation takes 20% of a node's CPU for 200k ranges (#41609)