-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
cli: make --global flag for demo public #62435
Conversation
b3a9fb5
to
95a1ae4
Compare
not sure how to debug that test failure, can't repro it locally or on roachprod :\ |
ah got one! @knz would you know what this is about:
i'll try look at this more tomorrow, but curious to have pointers as to where these goroutines get created |
0d78d29
to
d08ef61
Compare
here are the goroutines during the leak:
is it...
? |
I wonder if this is because the leak detector takes 5s to detect leaks, but due to the RPCs being slowed down & this being longer, that's not enough time. in which case, we should ignore the leak detector. do you reckon this is feasible or nah? i'm still curious why this only reproduces under |
changing the leaktest timeout to 60s seems to unblock some things, but introduce a panic during the stop phase: https://gist.github.com/otan/12e1e6d7aab9236e3590660ef570aee8 |
Yeah spent too long on this. Maybe one day... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 1 files at r1, 2 of 2 files at r2, 3 of 3 files at r3, 1 of 1 files at r4, 1 of 1 files at r5, 2 of 4 files at r6.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @knz and @otan)
pkg/cli/demo_cluster.go, line 129 at r1 (raw file):
latencyMapWaitCh := make(chan struct{}) // serverReadyCh is used .
nit: sentence in comment
pkg/cli/demo_cluster.go, line 190 at r1 (raw file):
// We force a wait for all servers until they are ready. nodeReadyCh := make(chan struct{}, 1)
the channel handling throughout this function is super confusing (and probably confused)
I would encourage you to do the following. For each channel:
-
document at the point the channel is created, what are the readers and what are the writers, and the exact conditions under which channels are read and written
-
where it makes sense, use
close(chan)
on the writer to ensure that the number of readers doesn't matter afterwards.
pkg/cli/demo_cluster.go, line 202 at r1 (raw file):
err := serv.Start() nodeErrCh <- err errCh <- err
this channel handling is messed up. when err !=nil, the write to errCh
will block and this goroutine will remain running.
pkg/cli/demo_cluster.go, line 407 at r3 (raw file):
func (c *transientCluster) DrainAndShutdown(nodeID roachpb.NodeID) error { if demoCtx.simulateLatency { return errors.Errorf("shutting down nodes is not supported in --global configurations")
here and throughout the code:
-
it's super confusing to have the field in the ctx struct named differently than the flag. I may encourage you to find a way to make them match.
-
please factor the check and the definition of the error message in a single function and call it from the multiple places
-
the proper way to construct an error message is
errors.Errorf("... --%s ...", cliflags.YourFlag.Name)
(orerrors.Newf
)
pkg/cli/demo_test.go, line 127 at r6 (raw file):
defer log.Scope(t).Close(t) // Set the leak test timeout to be 60s, as due to the delay
That should not be necessary, if all the close / cleanup calls are performed properly. If you think you need this, that means that a close function is not being called, or a goroutine is left to hang for longer than necessary.
As I'm done with this for now, hints for the next person.
|
pkg/cli/demo_cluster.go, line 202 at r1 (raw file): Previously, knz (kena) wrote…
both channels are buffered, so i don't think we ever block. |
Ok thank you for all your work and your investigative steps. |
this is now resolved. just the leaking goroutine left. |
There are multiple goroutines leaked, but this is clearly the smoking gun:
In other words, there's this goroutine launched under I already mentioned that the pattern of interlocking channels was problematic in my previous review. I will now investigate this further—exactly like I suggested in my previous review: by analyzing exactly where they get written and read, and discover which synchronization point is missing—and report my findings. |
@knz --insecure is deprecated in demo, so I don't understand why you are insisting we should try to fix the global latencies for it, since global latencies is a feature only for demo. |
Oops, I will re-create this comment on the issue you created. |
See this comment: #63033 (comment) We can now make progress on this, thanks to #63853 being merged. |
thanks for the heads up! do we still want to fix SHOW ALL CLUSTER QUERIES in --insecure before making it public? FWIW, from experimentation it's just that one that doesn't respect latency but i'm not sure why :\ |
I think we can make |
We previously had no tests simulating the latency in demo using the --global flag. Release note: None
In addition to the release note, reworded the add node (which already errored beforehand), and adding a clause in RestartNode. Note there is no code path that allows a restart, since shutdown is required before a node restart. Release note (cli change): Previously, `\demo shutdown <node_idx>` would error if `--global` was set. This will now error gracefully as an unsupported behavior.
Release note (cli change): The --global flag for cockroach demo is now advertised. This flag simulates latencies in multi-node demo clusters when the nodes are set in different regions to simulate real-life global latencies.
ok, i've pulled your commit and added one on top @knz! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok so this is now on the finish line.
Let's discuss some specifics:
-
I'm not too fond of the last commit. Even if we accept the command-line flag's name for the sake of PR and marketing, we still want our code to be descriptive. A boolean field should indicate what the boolean is about. I don't believe that the word "global" is descriptive of anything. In comparison, "simulate latencies" is better. I'd say "simulateGeoLatencies" would even be better.
-
I'm curious if there is a way to announce the fact the cluster is not "local" when the session starts. Maybe via the prompt? or a banner when the shell starts?
Reviewed 15 of 15 files at r16.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @knz and @otan)
i'd be happy to rename it if we also changed the flag as appropriate, but that would also ideally involve some PM input. (the original complaint i was addressing was the flag name and variable name did not match) |
I may have missed that - where was this complaint voiced? (my apologies if I was the person who voiced it, I likely have forgotten) We have this situation in a couple of other places, and I agree we probably want to do something about it, but maybe we can do this in a principled manner all at once.
if there's an issue filed for it, yes that would make sense. |
Release note (cli change): There will now be a message upon start-up on cockroach demo --global indicating latencies between nodes will simulate real world latencies.
would be from #62435 (review) here and throughout the code: it's super confusing to have the field in the ctx struct named differently than the flag. I may encourage you to find a way to make them match.
alrighty. #64269, removed the commit.
I've added some help text. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 5 of 5 files at r17.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @knz and @otan)
thanks for the guidance and all the help! bors r=knz |
Build succeeded: |
See individual commits for details.
Refs: #62025