roachtest: roachprod: Remove check for Pebble CURRENT file in favour of marker.* file #101254
6d5928b to ce69f9d
There is also a
Reviewed 5 of 5 files at r2, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @srosenberg)
Is there any way that we can verify the dead node detector ran? I'm looking at the logs for the run you did on this branch and couldn't easily find anything.
IMO, it would be nice to make this step explicit by writing to teardown.log and also writing progress/output of the monitor script to `roachprod.log`.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @smg260 and @srosenberg)
pkg/roachprod/install/cluster_synced.go
line 643 at r2 (raw file):
snippet := ` {{ if .IgnoreEmpty }} if ls {{.Store}}/marker.* 1> /dev/null 2>&1; then
Any particular reason we're using `ls` and redirecting output instead of using `[ -f ...]` like before?
ce69f9d to 9960672
The `NodeMonitorInfo` for each node can be logged from `assertDeadNode`. It's being ignored at the moment.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @herkolategan, @renatolabs, and @srosenberg)
pkg/roachprod/install/cluster_synced.go
line 643 at r2 (raw file):
Previously, renatolabs (Renato Costa) wrote…
Any particular reason we're using `ls` and redirecting output instead of using `[ -f ...]` like before?
`-f` expects only a single filename and will fail when presented with a glob or prefix like above.
I was trying to see if there was a single file that could be used; the second comment on this PR shows the other files created by Pebble, but I don't have enough insight into knowing whether they would suffice (e.g. `LOCK`).
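The difference between the two checks can be reproduced outside roachprod. The following is a minimal sketch, not the code from `cluster_synced.go`: `store_has_marker` is a hypothetical helper, and the marker file names are placeholders rather than the names Pebble actually writes. It assumes bash with default glob settings.

```shell
#!/usr/bin/env bash
# Sketch only: store_has_marker is a hypothetical helper, not roachprod code.

# Succeeds if at least one file matching marker.* exists in the given directory.
store_has_marker() {
  local store="$1"
  ls "${store}"/marker.* 1> /dev/null 2>&1
}

# Throwaway store directory with two placeholder marker files (names are made up).
store="$(mktemp -d)"
touch "${store}/marker.a" "${store}/marker.b"

# The glob expands to two paths, so `[ -f path1 path2 ]` is a usage error
# and returns a non-zero status even though both files exist.
if [ -f "${store}"/marker.* ] 2> /dev/null; then
  echo "[ -f ]: marker found"
else
  echo "[ -f ]: check failed"
fi

# ls accepts any number of paths and only fails if the glob matches nothing.
if store_has_marker "${store}"; then
  echo "ls: marker found"
else
  echo "ls: no marker"
fi

rm -rf "${store}"
```

With a single matching file, `[ -f ... ]` would happen to work, which is why the glob form only becomes a problem once more than one `marker.*` file is present.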
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @herkolategan, @smg260, and @srosenberg)
pkg/roachprod/install/cluster_synced.go
line 643 at r2 (raw file):
Previously, smg260 (Miral Gadani) wrote…
`-f` expects only a single filename and will fail when presented with a glob or prefix like above. I was trying to see if there was a single file that could be used; the second comment on this PR shows the other files created by Pebble, but I don't have enough insight into knowing whether they would suffice (e.g. `LOCK`).
Ah, got it.
I think it makes sense to use the marker files. I still think we should provide some visibility that this check has run -- is there already some side effect that indicates the check passed that I'm missing?
of marker.* files Epic: none Fixes: cockroachdb#95170 Release note: None
9960672 to f299c7f
Logging will now show the result of checking whether node(s) are alive in test teardown.
Note: There is an issue in the case where a node has died; the post validations will hang (within the
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @herkolategan, @renatolabs, and @srosenberg)
pkg/roachprod/install/cluster_synced.go
line 643 at r2 (raw file):
Previously, renatolabs (Renato Costa) wrote…
Ah, got it.
I think it makes sense to use the marker files. I still think we should provide some visibility that this check has run -- is there already some side effect that indicates the check passed that I'm missing?
Done.
Logging of node statuses added.
Nice to see the output of the dead node detector!
IMO, we should merge just the first commit in this PR. The post-validation checks hanging is a known issue and it's not introduced by the changes here, so they can be fixed separately.
As a side-comment, as noted in #99684, one thing to try is to set a connection timeout instead of a statement timeout.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @herkolategan and @srosenberg)
We should also run some roachtests on this branch (in TC) now that we're actually looking for dead nodes, to be sure.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @herkolategan and @srosenberg)
Yes, it is a known issue, but it is also now replicable (I'm not sure it was before). I don't see why this would not be useful, since `statement_timeout` works in the case of the query hanging, propagates the cancellation correctly, and is also used in the consistency check directly after it. Without it, the code is still prone to hang and we miss out on the artifacts.
Also, the code doesn't hang when trying to connect to a live node; it hangs directly after, when attempting to execute SQL.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @herkolategan and @srosenberg)
I'm not saying it wouldn't be useful (it definitely would, the lack of artifacts is very frustrating). I guess I misunderstood your previous message when you said "This PR does not address that as investigation is ongoing"; I thought the `statement_timeout` wasn't enough to fix the hanging, and you were still looking into it. All I was saying was for us to merge this sooner while that investigation happened.
If we're in the clear re: hanging issues, then let's merge this!
Nit: update PR description to close both issues at the same time. (`Fixes A, B` doesn't work, unfortunately. It needs to be `Fixes A, fixes B`).
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @herkolategan and @srosenberg)
Poorly worded on my part. What I meant is the investigation to find the root cause is ongoing.
Will update description, thanks!
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @herkolategan and @srosenberg)
Epic: none Fixes: cockroachdb#99684 Release note: None
f299c7f to 831e830
Acceptance roachtests. Teardown logs show node checks.
TFTR
bors r=renatolabs
Build succeeded:
@erikgrinaker yep thanks. I'm adding an opt out for dead node validation.
roachtest: roachprod: Remove check for Pebble CURRENT file in favour of marker.* files
roachtest: `set_timeout` for `CheckInvalidDescriptors`. Without the `set_timeout`, a cancellation due to timeout is not respected and can cause the step to hang.
Epic: none
Fixes: #95170
Fixes: #99684
Release note: None
Combined with #101222 TC run here