Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roachtest: roachprod: Remove check for Pebble CURRENT file in favour of marker.* file #101254

Merged
merged 2 commits into from
Apr 21, 2023

Conversation

smg260
Copy link
Contributor

@smg260 smg260 commented Apr 11, 2023

roachtest: roachprod: Remove check for Pebble CURRENT file in favour
of marker.* files

roachtest: set_timeout for CheckInvalidDescriptors. Without the set_timeout, a cancellation due to timeout is not respected and can cause the step to hang.

Epic: none
Fixes: #95170
Fixes: #99684

Release note: None

Combined with #101222 TC run here

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@smg260 smg260 changed the title 95170 remove pebble current check roachtest: roachprod: Remove check for Pebble CURRENT file in favour of marker.* file Apr 11, 2023
@smg260 smg260 force-pushed the 95170_remove_pebble_CURRENT_check branch 2 times, most recently from 6d5928b to ce69f9d Compare April 14, 2023 15:37
@smg260 smg260 marked this pull request as ready for review April 14, 2023 15:38
@smg260 smg260 requested a review from a team as a code owner April 14, 2023 15:38
@smg260 smg260 requested review from herkolategan and srosenberg and removed request for a team April 14, 2023 15:38
@smg260
Copy link
Contributor Author

smg260 commented Apr 14, 2023

There is also a LOCK file created on DB open [1], which could be feasible to check instead.

-rw-r----- 1 ubuntu ubuntu 60359944 Apr 11 17:53 000916.log
-rw-r----- 1 ubuntu ubuntu 60433928 Apr 11 17:57 000922.log
-rw-r----- 1 ubuntu ubuntu 60417732 Apr 11 18:01 000928.log
-rw-r----- 1 ubuntu ubuntu  4128012 Apr 11 17:57 000930.sst
-rw-r----- 1 ubuntu ubuntu  4150121 Apr 11 17:57 000931.sst
-rw-r----- 1 ubuntu ubuntu  4172418 Apr 11 17:57 000932.sst
-rw-r----- 1 ubuntu ubuntu  2096265 Apr 11 17:57 000933.sst
-rw-r----- 1 ubuntu ubuntu       16 Apr 11 13:56 CURRENT
-rw-r----- 1 ubuntu ubuntu        0 Apr 11 13:56 LOCK
-rw-r----- 1 ubuntu ubuntu    74104 Apr 11 17:57 MANIFEST-000001
-rw-r----- 1 ubuntu ubuntu     2539 Apr 11 13:56 OPTIONS-000003
-rw-r----- 1 ubuntu ubuntu        8 Apr 11 13:56 STORAGE_MIN_VERSION
drwxr-x--- 3 ubuntu ubuntu     4096 Apr 11 13:56 auxiliary
drwxr-xr-x 3 ubuntu ubuntu     4096 Apr 11 13:56 cockroach-temp2897837119
-rw-r----- 1 ubuntu ubuntu       17 Apr 11 13:56 cockroach.advertise-addr
-rw-r----- 1 ubuntu ubuntu       17 Apr 11 13:56 cockroach.advertise-sql-addr
-rw-r----- 1 ubuntu ubuntu       17 Apr 11 13:56 cockroach.http-addr
-rw-r----- 1 ubuntu ubuntu       10 Apr 11 13:56 cockroach.listen-addr
-rw-r----- 1 ubuntu ubuntu       10 Apr 11 13:56 cockroach.sql-addr
-rw-r----- 1 ubuntu ubuntu        0 Apr 11 13:56 marker.format-version.000012.013
-rw-r----- 1 ubuntu ubuntu        0 Apr 11 13:56 marker.manifest.000001.MANIFEST-000001
-rw-r----- 1 ubuntu ubuntu       46 Apr 11 13:56 temp-dirs-record.txt

[1] - https://github.com/cockroachdb/pebble/blob/124486b0f85d75a0c695cb65703c5495eec66c99/open.go#L90

Copy link
Collaborator

@herkolategan herkolategan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 5 of 5 files at r2, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @srosenberg)

Copy link
Contributor

@renatolabs renatolabs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any way that we can verify the dead node detector ran? I'm looking at the logs for the run you did on this branch and couldn't easily find anything.

IMO, it would be nice to make this step explicit by writing to teardown.log and also writing progress/output of the monitor script to roachprod.log.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @smg260 and @srosenberg)


pkg/roachprod/install/cluster_synced.go line 643 at r2 (raw file):

			snippet := `
{{ if .IgnoreEmpty }}
if ls {{.Store}}/marker.* 1> /dev/null 2>&1; then

Any particular reason we're using ls and redirecting output instead of using [ -f ...] like before?

@smg260 smg260 force-pushed the 95170_remove_pebble_CURRENT_check branch from ce69f9d to 9960672 Compare April 17, 2023 20:08
Copy link
Contributor Author

@smg260 smg260 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The NodeMonitorInfo for each node can be logged from assertDeadNode. It's being ignored at the moment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @herkolategan, @renatolabs, and @srosenberg)


pkg/roachprod/install/cluster_synced.go line 643 at r2 (raw file):

Previously, renatolabs (Renato Costa) wrote…

Any particular reason we're using ls and redirecting output instead of using [ -f ...] like before?

-f expects only a single filename and will fail when presented with a glob or prefix like above.

I was trying to see if there was a single file that could be used; the second comment on this PR shows the other files created by Pebble, but I don't have enough insight into knowing whether they would suffice (e.g. LOCK)

Copy link
Contributor

@renatolabs renatolabs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @herkolategan, @smg260, and @srosenberg)


pkg/roachprod/install/cluster_synced.go line 643 at r2 (raw file):

Previously, smg260 (Miral Gadani) wrote…

-f expects only a single filename and will fail when presented with a glob or prefix like above.

I was trying to see if there was a single file that could be used; the second comment on this PR shows the other files created by Pebble, but I don't have enough insight into knowing whether they would suffice (e.g. LOCK)

Ah, got it.

I think it makes sense to use the marker files. I still think we should provide some visibility that this check has run -- is there already some side effect that indicates the check passed that I'm missing?

of marker.* files

Epic: none
Fixes: cockroachdb#95170

Release note: None
@smg260 smg260 force-pushed the 95170_remove_pebble_CURRENT_check branch from 9960672 to f299c7f Compare April 20, 2023 20:58
@smg260
Copy link
Contributor Author

smg260 commented Apr 20, 2023

Logging will now show the result of checking whether node(s) are alive in test teardown.

teardown: 20:09:33 cluster.go:1342: checking for dead nodes
teardown: 20:09:33 cluster.go:1367: node 1: err=<nil>,msg=dead (exit status unknown)
teardown: 20:09:33 cluster.go:1367: node 3: err=<nil>,msg=3839243
teardown: 20:09:33 cluster.go:1367: node 2: err=<nil>,msg=3839178
teardown: 20:09:33 test_impl.go:344: test failure #1: full stack retained in failure_1.log: (test_runner.go:1087).func1: 1 dead node(s) detected
teardown: 20:09:33 test_runner.go:1101: running validation checks on node 2 (<10m)
teardown: 20:09:33 cluster.go:1412: checking for invalid descriptors

@smg260
Copy link
Contributor Author

smg260 commented Apr 20, 2023

Note: There is a issue inn the case where a node has died; the post validations will hang (within the set_timeout). This is likely the cause of various timeouts resulting in no TC artifacts. This PR does not address that as investigation is ongoing.

Copy link
Contributor Author

@smg260 smg260 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @herkolategan, @renatolabs, and @srosenberg)


pkg/roachprod/install/cluster_synced.go line 643 at r2 (raw file):

Previously, renatolabs (Renato Costa) wrote…

Ah, got it.

I think it makes sense to use the marker files. I still think we should provide some visibility that this check has run -- is there already some side effect that indicates the check passed that I'm missing?

Done.
Logging of node statuses added.

Copy link
Contributor

@renatolabs renatolabs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice to see the output of the dead node detector!

IMO, we should merge just the first commit in this PR. The post-validation checks hanging is a known issue and it's not introduced by the changes here, so they can be fixed separately.

As a side-comment, as noted in #99684, one thing to try is to set a connection timeout instead of a statement timeout.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @herkolategan and @srosenberg)

Copy link
Contributor

@renatolabs renatolabs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should also run some roachtests on this branch (in TC) now that we're actually looking for dead nodes, to be sure.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @herkolategan and @srosenberg)

Copy link
Contributor Author

@smg260 smg260 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it is a known issue, but is also now replicable (I'm not sure it was before). I don't see why this would not be useful, since

  • statement_timeout works in the case of the query hanging, and propagates the cancellation correctly and
  • is also used in the consistency check directly after it
  • without it, the code is still prone to hang and we miss out on the artifacts.

Also, the code doesn't hang when trying to connect to a live node, it's directly after when attempting to execute SQL.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @herkolategan and @srosenberg)

Copy link
Contributor

@renatolabs renatolabs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not saying it wouldn't be useful (it definitely would, the lack of artifacts is very frustrating). I guess I misunderstood your previous message when you said "This PR does not address that as investigation is ongoing"; I thought the statement_timeout wasn't enough to fix the hanging, and you were still looking into it. All I was saying was for us to merge this sooner while that investigation happened.

If we're in the clear re: hanging issues, then let's merge this!

Nit: update PR description to close both issues at the same time. (Fixes A, B doesn't work, unfortunately. It needs to be Fixes A, fixes B).

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @herkolategan and @srosenberg)

Copy link
Contributor Author

@smg260 smg260 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Poorly worded on my part. What I meant is the investigation to find the root cause is ongoing.

Will update description- thanks!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @herkolategan and @srosenberg)

Epic: none
Fixes: cockroachdb#99684

Release note: None
@smg260 smg260 force-pushed the 95170_remove_pebble_CURRENT_check branch from f299c7f to 831e830 Compare April 21, 2023 17:40
@smg260
Copy link
Contributor Author

smg260 commented Apr 21, 2023

Acceptance roachtests. Teardown logs show node checks.

@smg260
Copy link
Contributor Author

smg260 commented Apr 21, 2023

TFTR

bors r=renatolabs

@craig
Copy link
Contributor

craig bot commented Apr 21, 2023

Build succeeded:

@craig craig bot merged commit bc5db23 into cockroachdb:master Apr 21, 2023
@erikgrinaker
Copy link
Contributor

I think this may have broken a bunch of roachtests. These tests all succeed, but then fail with dead node(s) detected. The tests stop nodes on their own, on purpose. See e.g. #102054 and #102057. Haven't dug into these to confirm though, but they all started failing when this merged so it lines up.

@smg260
Copy link
Contributor Author

smg260 commented Apr 24, 2023

@erikgrinaker yep thanks. I'm adding an opt out for dead node validation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
5 participants