server: make the span stats fan-out more fault tolerant #108456

Merged

@zachlite zachlite commented Aug 9, 2023

This commit adds improved fault tolerance to the span stats fan-out:

  1. Errors encountered during the fan-out will not invalidate the
    entire request. Now, a span stats fan-out will always return a
    roachpb.SpanStatsResponse that has been updated by values from nodes that
    service their requests without error. In the extreme case where there's
    a failure encountered on every node, an empty response is returned.

    Errors that are encountered are logged, and then appended to the response
    in the newly added Errors field.

  2. Nodes must service requests within the timeout passed to iterateNodes.
    For span stats, the value comes from a new cluster setting:
    'server.span_stats.node.timeout', with a default value of 1 minute.

Resolves #106097
Epic: none
Release note (ops change): Span stats requests will return a partial
result if the request encounters any errors. Errors that would have
previously terminated the request are now included in the response.

@zachlite zachlite requested a review from a team as a code owner August 9, 2023 16:11
@zachlite zachlite requested a review from a team August 9, 2023 16:11
@zachlite zachlite requested a review from a team as a code owner August 9, 2023 16:11
@zachlite zachlite requested review from a team and nkodali and removed request for a team August 9, 2023 16:11
@cockroach-teamcity

This change is Reviewable

@zachlite zachlite added the backport-23.1.x Flags PRs that need to be backported to 23.1 label Aug 9, 2023
@zachlite zachlite force-pushed the 230809.spanstats-fanout-fault-tolerance branch from cc968fd to bc57125 Compare August 9, 2023 16:28

@j82w j82w left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nkodali and @zachlite)


-- commits line 9 at r1:
What is the reason for removing the error from the response? The error in the response lets users see what the issue is and avoids ambiguity with a missing node.

@j82w j82w left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nkodali and @zachlite)


-- commits line 11 at r1:
How about: "Errors that are encountered are logged, and also returned on a property in the response along with the successful requests."


pkg/roachpb/span_stats.go line 41 at r1 (raw file):

	time.Minute,
	settings.NonNegativeDuration,
).WithPublic()

Does this really need to be public? We shouldn't make a cluster setting public unless we expect users to change it.

@zachlite zachlite force-pushed the 230809.spanstats-fanout-fault-tolerance branch from bc57125 to 30c0898 Compare August 10, 2023 17:00

@zachlite zachlite left a comment

I've updated the response to include the errors encountered on each node. I've also updated the test to make sure that, in the extreme case of failure on every node, stats for each requested span are still accessible.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @j82w and @nkodali)


-- commits line 11 at r1:

Previously, j82w (Jake) wrote…

How about: "Errors that are encountered are logged, and also returned on a property in the response along with the successful requests."

Done.


pkg/roachpb/span_stats.go line 41 at r1 (raw file):

Previously, j82w (Jake) wrote…

Does this really need to be public? We shouldn't make a cluster setting public unless we expect users to change it.

Done.

@knz knz left a comment

Reviewed 8 of 14 files at r1, 5 of 7 files at r2, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @j82w, @nkodali, and @zachlite)


pkg/server/span_stats_server.go line 137 at r2 (raw file):

	}

	res.Errors = strings.Join(errorMessages, ",")

Not in love with this. Why not make the proto a repeated string errors?

@zachlite zachlite force-pushed the 230809.spanstats-fanout-fault-tolerance branch from 30c0898 to e8d6e16 Compare August 11, 2023 13:46

@zachlite zachlite left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @j82w, @knz, and @nkodali)


pkg/server/span_stats_server.go line 137 at r2 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

Not in love with this. Why not make the proto a repeated string errors?

You're right. Done!

@maryliag maryliag left a comment

Reviewed 6 of 14 files at r1, 2 of 7 files at r2, 4 of 5 files at r3, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @j82w, @knz, @nkodali, and @zachlite)


pkg/server/span_stats_test.go line 200 at r3 (raw file):

	defer log.Scope(t).Close(t)
	ctx := context.Background()
	const numNodes = 5

For the test to be more complete, at least one of the nodes should return something valid, to make sure that it's working and we get some info.

@zachlite zachlite left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @j82w, @knz, @maryliag, and @nkodali)


pkg/server/span_stats_test.go line 200 at r3 (raw file):

Previously, maryliag (Marylia Gutierrez) wrote…

For the test to be more complete, at least one of the nodes should return something valid, to make sure that it's working and we get some info.

I see what you're saying. This test is meant to verify the specific code path where there's never a valid result returned from any node. All of the other tests in this file already test the error-free code paths.

I can add an additional test that expects a mix of some errors, some valid results.

@zachlite zachlite force-pushed the 230809.spanstats-fanout-fault-tolerance branch from e8d6e16 to 5713064 Compare August 14, 2023 15:57

@zachlite zachlite left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @j82w, @knz, @maryliag, and @nkodali)


pkg/server/span_stats_test.go line 200 at r3 (raw file):

Previously, zachlite wrote…

I see what you're saying. This test is meant to verify the specific code path where there's never a valid result returned from any node. All of the other tests in this file already test the error-free code paths.

I can add an additional test that expects a mix of some errors, some valid results.

Done.

@maryliag maryliag left a comment

one nit, otherwise :lgtm:

Reviewed 1 of 1 files at r4, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @j82w, @knz, @nkodali, and @zachlite)


pkg/server/span_stats_test.go line 290 at r4 (raw file):

				require.Equal(t, true, containsError(res.Errors, "node 4 timed out"))

				// There should not be any errors for node 2 or node 5.

You already tested that the number of error messages is 3 and checked each of them, so there is no need to check these 2, especially because this does not check the right thing: there could have been a different type of error on node 2 that is not a dialing error (same for node 5).

@zachlite zachlite left a comment

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @j82w, @knz, @maryliag, and @nkodali)


pkg/server/span_stats_test.go line 290 at r4 (raw file):

Previously, maryliag (Marylia Gutierrez) wrote…

You already tested that the number of error messages is 3 and checked each of them, so there is no need to check these 2, especially because this does not check the right thing: there could have been a different type of error on node 2 that is not a dialing error (same for node 5).

Discussed offline. Keeping for documentation purposes.
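
The body of the `containsError` helper is not shown in this thread. A minimal sketch, assuming it is a plain substring search over the response's error list:

```go
package main

import "strings"

// containsError reports whether any message in errs contains substr.
// Hypothetical reconstruction; the helper in span_stats_test.go may differ.
func containsError(errs []string, substr string) bool {
	for _, e := range errs {
		if strings.Contains(e, substr) {
			return true
		}
	}
	return false
}
```

Under that assumption the reviewer's point holds: asserting that `containsError` is false for one specific message about node 2 only rules out that message, not every possible error from node 2. The length check on the error list is what bounds the total number of failures.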

@zachlite

bors r+

@j82w

j82w commented Aug 14, 2023

bors r-

@craig

craig bot commented Aug 14, 2023

Canceled.

@yuzefovich

FYI the failure on Extended CI:

* ERROR: a panic has occurred!
* use of Span after Finish. Span: /cockroach.roachpb.Internal/Batch. Finish previously called at: <stack not captured. Set debugUseAfterFinish>
* (1) attached stack trace
*   -- stack trace:
*   | runtime.gopanic
*   | 	GOROOT/src/runtime/panic.go:890
*   | [...repeated from below...]
* Wraps: (2) assertion failure
* Wraps: (3) attached stack trace
*   -- stack trace:
*   | github.com/cockroachdb/cockroach/pkg/util/tracing.(*Span).detectUseAfterFinish
*   | 	github.com/cockroachdb/cockroach/pkg/util/tracing/span.go:182
*   | github.com/cockroachdb/cockroach/pkg/util/tracing.(*Span).Tracer
*   | 	github.com/cockroachdb/cockroach/pkg/util/tracing/span.go:225
*   | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*replicaDecoder).createTracingSpans
*   | 	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_application_decoder.go:154
*   | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*replicaDecoder).DecodeAndBind
*   | 	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_application_decoder.go:63
*   | github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply.(*Task).Decode
*   | 	github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply/task.go:142
*   | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).handleRaftReadyRaftMuLocked
*   | 	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_raft.go:878
*   | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).handleRaftReady
*   | 	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_raft.go:728
*   | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).processReady
*   | 	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store_raft.go:689
*   | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*raftSchedulerShard).worker
*   | 	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/scheduler.go:418
*   | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*raftScheduler).Start.func2
*   | 	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/scheduler.go:321
*   | github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2
*   | 	github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:484
*   | runtime.goexit
*   | 	src/runtime/asm_amd64.s:1594
* Wraps: (4) use of Span after Finish. Span: /cockroach.roachpb.Internal/Batch. Finish previously called at: <stack not captured. Set debugUseAfterFinish>
* Error types: (1) *withstack.withStack (2) *assert.withAssertionFailure (3) *withstack.withStack (4) *errutil.leafError
*
panic: use of Span after Finish. Span: /cockroach.roachpb.Internal/Batch. Finish previously called at: <stack not captured. Set debugUseAfterFinish> [recovered]
	panic: use of Span after Finish. Span: /cockroach.roachpb.Internal/Batch. Finish previously called at: <stack not captured. Set debugUseAfterFinish>

is a dup of #108534.

@zachlite zachlite force-pushed the 230809.spanstats-fanout-fault-tolerance branch from 5713064 to b742181 Compare August 14, 2023 18:38
@zachlite zachlite force-pushed the 230809.spanstats-fanout-fault-tolerance branch from b742181 to 5ddcdd1 Compare August 14, 2023 21:35
@zachlite

bors r+

@craig

craig bot commented Aug 15, 2023

Build succeeded:

@craig craig bot merged commit abf61bb into cockroachdb:master Aug 15, 2023
@blathers-crl

blathers-crl bot commented Aug 15, 2023

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from 5ddcdd1 to blathers/backport-release-23.1-108456: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 23.1.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

Successfully merging this pull request may close these issues.

sql: SHOW RANGES does not work with offline node
6 participants