Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roachtest: add another error to jepsen allowlist #50333

Closed
wants to merge 1 commit into from

Conversation

tbg
Copy link
Member

@tbg tbg commented Jun 17, 2020

Release note: None

@tbg tbg requested a review from andreimatei June 17, 2020 13:11
@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link
Contributor

@andreimatei andreimatei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @tbg)


pkg/cmd/roachtest/jepsen.go, line 253 at r1 (raw file):

				// lock=true stat=PENDING rts=1592384065.123218152 ,0 wto=false
				// max=1592384065.128230720,0
				`-e "TransactionRetryWithProtoRefreshError"`,

mmm for this one it might be worth asking ben if it's easy to put a retry to whatever schema change causes it

@tbg
Copy link
Member Author

tbg commented Jun 17, 2020

cc @bdarnell

Copy link
Contributor

@bdarnell bdarnell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei)


pkg/cmd/roachtest/jepsen.go, line 253 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

mmm for this one it might be worth asking ben if it's easy to put a retry to whatever schema change causes it

It's really important that simple schema changes like the ones jepsen does not be subject to random retry errors like this. We've put a lot of work into this in the past. Have we had a regression in this area? (ABORT_REASON_NEW_LEASE_PREVENTS_TXN is new to me)

@andreimatei
Copy link
Contributor

I was complaining about some apparent regression in the area in #37751 (comment)
Don't know if it's exactly what we're seeing here or not... Maybe not.
The question here is why didn't automatic retries kick in, if Jepsen only does simple things.

The underlying reasons for ABORT_REASON_NEW_LEASE_PREVENTS_TXN we've always had. The reason code is new because I've carved out the case where the timestamp cache reset was caused by a new lease, as opposed to other reasons.

@bdarnell
Copy link
Contributor

The question here is why didn't automatic retries kick in, if Jepsen only does simple things.

Yeah. Specifically, I'm pretty sure jepsen only does its schema changes in implicit transactions (or maybe as the first and only statement in an explicit transaction) so it should be retried on the server side.

Assuming this was discovered in #49360, multiregister's DDL statements are trivial

@tbg
Copy link
Member Author

tbg commented Jun 19, 2020

Yes, it was discovered there. Not frequently - a few times over >400 runs.

@tbg
Copy link
Member Author

tbg commented Jun 19, 2020

Do we even know that this was a schema change? Could the error have come from a regular txn operation in jepsen or would they not surface to that level?

@bdarnell
Copy link
Contributor

During normal operation jepsen catches nearly all errors. Most of the time when an error bubbles up to this level, it's from a schema change. But the jepsen logs should tell us exactly where it's coming from. Do you have a pointer to a log showing this problem?

@tbg tbg closed this Oct 1, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants