cli: TestHalfOnlineLossOfQuorumRecovery failed [data race in asyncWriteToOtelAndSystemEventsTable]
#103698
Comments
Data race in the event log:
@knz This seems to be a legit race in cockroach/pkg/sql/event_log.go Line 533 in 89d69ef, concurrent with cockroach/pkg/sql/event_log.go Line 711 in 89d69ef.
We'll probably have to pass owned struct copies rather than pointers to avoid races here.
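To illustrate the "owned struct copies rather than pointers" idea, here is a minimal sketch. The `Event` type and `recordAll` function are hypothetical stand-ins, not the actual code in event_log.go; the point is only that each goroutine receives the struct by value and so mutates its own private copy.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// Event is a hypothetical stand-in for an event payload.
type Event struct {
	Timestamp int64
	NodeID    int
}

// recordAll hands each asynchronous writer its own copy of the event.
// Because Event is passed by value, a write inside one goroutine cannot
// be observed by the caller or by any other goroutine.
func recordAll(template Event, nodeIDs []int) []Event {
	var wg sync.WaitGroup
	out := make(chan Event, len(nodeIDs))
	for _, id := range nodeIDs {
		ev := template // struct copy, owned by this iteration
		ev.NodeID = id
		wg.Add(1)
		go func(ev Event) {
			defer wg.Done()
			// Mutates only the goroutine's private copy.
			ev.Timestamp = int64(1000 + ev.NodeID)
			out <- ev
		}(ev)
	}
	wg.Wait()
	close(out)

	var res []Event
	for ev := range out {
		res = append(res, ev)
	}
	// Goroutines finish in arbitrary order; sort for a stable result.
	sort.Slice(res, func(i, j int) bool { return res[i].NodeID < res[j].NodeID })
	return res
}

func main() {
	for _, ev := range recordAll(Event{}, []int{3, 1, 2}) {
		fmt.Printf("node=%d ts=%d\n", ev.NodeID, ev.Timestamp)
	}
}
```

Running this under `go run -race` should report no race, since no memory is shared between the writers.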
I have investigated this and sadly it is surprisingly hard to troubleshoot. But maybe there's something else. I'm running a stress test to see.
Ok I got it - the same event object is used across multiple calls from server/decommission.go. I'll send a PR.
106396: server: de-flake a decommission test race condition r=abarganier a=knz

Informs #103698 (will fix when backported to 23.1).
Epic: CRDB-28893

When a Decommission request is sent that addresses multiple nodes simultaneously, a race condition existed in the code that logs the decommission event to the event log.

This is because the `sql.InsertEventRecords` API expects to take ownership over the events. The `Decommission` handler was violating the expectation by passing the same event references to multiple subsequent calls.

This was not visible in practice however, because the racy writes were always overwriting the same value to the same field.

This patch fixes it by allocating a new event for each subsequent node decommission.

Release note: None

Co-authored-by: Raphael 'kena' Poss <[email protected]>
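The shape of that fix can be sketched as follows. This is an illustration under stated assumptions, not the code from the PR: `DecommissionEvent`, `insertEventRecord`, and `logDecommission` are hypothetical names, and the asynchronous write is simulated with a goroutine that fills in a timestamp field on an event it has taken ownership of.

```go
package main

import (
	"fmt"
	"sync"
)

// DecommissionEvent is a hypothetical stand-in for the logged event.
type DecommissionEvent struct {
	Timestamp    int64
	TargetNodeID int
}

// insertEventRecord is a hypothetical ownership-taking API: after the call,
// the caller must not touch ev again, because a background goroutine may
// still be writing to it.
func insertEventRecord(wg *sync.WaitGroup, ev *DecommissionEvent) {
	wg.Add(1)
	go func() {
		defer wg.Done()
		ev.Timestamp = 1 // the asynchronous writer fills in fields
	}()
}

// logDecommission allocates a fresh event for every node. The racy variant
// would allocate ev once before the loop and reassign ev.TargetNodeID on
// each iteration while earlier goroutines may still be writing to it.
func logDecommission(nodeIDs []int) []*DecommissionEvent {
	var wg sync.WaitGroup
	events := make([]*DecommissionEvent, 0, len(nodeIDs))
	for _, id := range nodeIDs {
		ev := &DecommissionEvent{TargetNodeID: id} // new allocation per node
		events = append(events, ev)
		insertEventRecord(&wg, ev)
	}
	wg.Wait() // after Wait returns, reading the events is safe
	return events
}

func main() {
	for _, ev := range logDecommission([]int{1, 2, 3}) {
		fmt.Printf("node=%d ts=%d\n", ev.TargetNodeID, ev.Timestamp)
	}
}
```

In this sketch each pointer is handed to exactly one background writer, which is what an ownership-taking API expects from its callers.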
fixed in #106401
Epic: CRDB-28893
cli.TestHalfOnlineLossOfQuorumRecovery failed with artifacts on release-23.1 @ 1443aa2b0a4b906ba2b252da53e04097ed75051a:
Parameters:
TAGS=bazel,gss,race
Jira issue: CRDB-28142