-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
kv: general test flakiness due to Pebble close error #51544
Comments
Hi @knz, please add a C-ategory label to your issue. Check out the label system docs. While you're here, please consider adding an A- label to help keep our repository tidy. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan. |
cc @tbg @petermattis for triage. |
I am going to bisect this and see when it came from. |
edit: that selection was incorrect |
I have tried very hard to use It may be that our Makefile / Go dependency tracking is not right so that when I switch branches, I am not re-building the vendor dir properly. |
I will not investigate this further myself. |
(I'm re-running my bisect session with 'make clean' in-between each step, just to be sure) |
Here's my latest bisect log:
This points to #51168 as the culprit, but I don't understand how it causes the problem. Maybe someone else needs to re-run the bisect to confirm. |
Raphael I have a PR out about this. It doesn't fix all of the causes but
provides a tool for doing so.
…On Fri, Jul 17, 2020, 13:18 kena ***@***.***> wrote:
The problem with this bisect result is the following:
- the merge commit a9e9c9c
<a9e9c9c>
triggers the bug
- however the only commit inside that branch, d86e265
<d86e265>,
does not.
(This is why I wrote it's incosnsitent)
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#51544 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ABGXPZGTGUCGKKLVIYWUTOLR4AXP5ANCNFSM4O54A7BQ>
.
|
Looking at this with Andrei. Like you suggested it's probably https://github.com/cockroachdb/cockroach/pull/51310/files exposing an existing problem. |
Hopefully this was fixed enough by #51413 |
I hit this on a freshly rebased branch here: https://teamcity.cockroachdb.com/viewLog.html?buildId=2161675&tab=buildResultsDiv&buildTypeId=Cockroach_UnitTests_Test |
One thing I'll note is that we seem to lack synchronization that all of our outstanding grpc requests have concluded when the stopper stops. This seems like a problem. It seems like we should add something to the server shutdown code that prevents new requests and wait for outstanding requests on certain grpc services. In particular |
I'll get new energy on #51566 |
This patch makes the Store reject requests once its stopper is quiescing. Before this patch, we didn't seem to have good protection against requests not running after the stopper has been stopped. We've seen this in some tests, where requests were racing with the engine closing. Running after the stopper has stopped is generally pretty undefined behavior, so let's avoid it. I think the reason why we didn't see a lot of errors from such races is that we're stopping the gRPC server even before we start quiescing, so at least for requests coming from remote nodes we had some draining window. This is a lighter version of cockroachdb#51566. That patch was trying to run requests as tasks, so that they properly synchronize with server shutdown. This patch still allows races between requests that started evaluating the server started quiescing and server shutdown. Touches cockroachdb#51544 Release note: None
This patch runs some infrequent operations that might use the storage engine as tasks, and thus synchronizes them with server shutdown. In cockroachdb#51544 we've seen one of these cause a crash when executing after Pebble was shut down. Release note: None
…tasks Prior to this change, it was possible for a rangefeed request to be issued concurrently with shutting down which could lead to an iterator being constructed after the engine has been closed. Touches cockroachdb#51544 Release note: None
…eation This commit optimizes the Stopper for task creation by ripping out the existing heavyweight task tracking in production builds. I realized that my biggest concern with most of the proposals (cockroachdb#52843 and cockroachdb#51566) being floated to address cockroachdb#51544 was that they bought more into the inefficient tracking in the Stopper, not that they were doing anything inherently wrong themselves. Before this change, creating a task acquired an exclusive mutex and then wrote to a hashmap. At high levels of concurrency, this would have become a performance chokepoint. After this change, the cost of launching a Task is three atomic increments – one to acquire a read lock, one to register with a WaitGroup, and one to release the read lock. When no one is draining the Stopper, these are all wait-free operations, which means that task creation becomes wait-free. With a change like this, I would feel much more comfortable pushing on Stopper tasks to solve cockroachdb#51544.
…tasks Prior to this change, it was possible for a rangefeed request to be issued concurrently with shutting down which could lead to an iterator being constructed after the engine has been closed. Touches cockroachdb#51544 Release note: None
52844: kvserver,rangefeed: ensure that iterators are only constructed under tasks r=andreimatei a=ajwerner Prior to this change, it was possible for a rangefeed request to be issued concurrently with shutting down which could lead to an iterator being constructed after the engine has been closed. Touches #51544 Release note: None 52996: partialidx: prove implication for comparisons with two variables r=RaduBerinde a=mgartner This commit adds support for proving partial index predicates are implied by query filters when they contain comparison expressions with two variables and they are not identical expressions. Below are some examples where the left expression implies (=>) the right expression. The right is guaranteed to contain the left despite both expressions having no constant values. a > b => a >= b a = b => a >= b b < a => a >= b a > b => a != b Release note: None 53113: roachprod: introduce --skip-init to `roachprod start` r=irfansharif a=irfansharif ..and `roachprod init`. I attempted to originally introduce this flag in \#51329, and ultimately abandoned it. I still think it's a good idea to have such a thing, especially given now we're writing integration tests that want to control `init` behaviour. It's much easier to write them with this --skip-init flag than it is to work around roachprod's magical auto-init behavior. To do what's skipped when using --skip-init, we introduce a `roachprod init` sub command. Release note: None Co-authored-by: Andrew Werner <[email protected]> Co-authored-by: Marcus Gartner <[email protected]> Co-authored-by: irfan sharif <[email protected]>
…tasks Prior to this change, it was possible for a rangefeed request to be issued concurrently with shutting down which could lead to an iterator being constructed after the engine has been closed. Touches cockroachdb#51544 Release note: None
The vast majority of these seem to have to do with range merges happening after shutdown. |
We are likely going to invest more in the stopper-conferred observability in the near future as part of initiatives such as cockroachdb#58164, but the task tracking that has been a part of the stopper since near its conception has not proven to be useful in practice, while at the same time raising concern about stopper use in hot paths. When shutting down a running server, we don't particularly care about leaking goroutines (as the process will end anyway). In tests, we want to ensure goroutine hygiene, but if a test hangs during `Stop`, it is easier to look at the stacks to find out why than to consult the task map. Together, this left little reason to do anything more complicated than what's left after this commit: we keep track of the running number of tasks, and wait until this drops to zero. With this change in, we should feel comfortable using the stopper extensively and, for example, ensuring that any CRDB goroutine is anchored in a Stopper task; this is the right approach for test flakes such as in cockroachdb#51544 and makes sense for all of the reasons mentioned in issue cockroachdb#58164 as well. In a future change, we should make the Stopper more configurable and, through this configurability, we could in principle bring a version of the task map back (in debug builds) without backing it into the stopper, though I don't anticipate that we'll want to. Closes cockroachdb#52894. Release note: None
We are likely going to invest more in the stopper-conferred observability in the near future as part of initiatives such as cockroachdb#58164, but the task tracking that has been a part of the stopper since near its conception has not proven to be useful in practice, while at the same time raising concern about stopper use in hot paths. When shutting down a running server, we don't particularly care about leaking goroutines (as the process will end anyway). In tests, we want to ensure goroutine hygiene, but if a test hangs during `Stop`, it is easier to look at the stacks to find out why than to consult the task map. Together, this left little reason to do anything more complicated than what's left after this commit: we keep track of the running number of tasks, and wait until this drops to zero. With this change in, we should feel comfortable using the stopper extensively and, for example, ensuring that any CRDB goroutine is anchored in a Stopper task; this is the right approach for test flakes such as in cockroachdb#51544 and makes sense for all of the reasons mentioned in issue cockroachdb#58164 as well. In a future change, we should make the Stopper more configurable and, through this configurability, we could in principle bring a version of the task map back (in debug builds) without backing it into the stopper, though I don't anticipate that we'll want to. Closes cockroachdb#52894. Release note: None
We are likely going to invest more in the stopper-conferred observability in the near future as part of initiatives such as cockroachdb#58164, but the task tracking that has been a part of the stopper since near its conception has not proven to be useful in practice, while at the same time raising concern about stopper use in hot paths. When shutting down a running server, we don't particularly care about leaking goroutines (as the process will end anyway). In tests, we want to ensure goroutine hygiene, but if a test hangs during `Stop`, it is easier to look at the stacks to find out why than to consult the task map. Together, this left little reason to do anything more complicated than what's left after this commit: we keep track of the running number of tasks, and wait until this drops to zero. With this change in, we should feel comfortable using the stopper extensively and, for example, ensuring that any CRDB goroutine is anchored in a Stopper task; this is the right approach for test flakes such as in cockroachdb#51544 and makes sense for all of the reasons mentioned in issue cockroachdb#58164 as well. In a future change, we should make the Stopper more configurable and, through this configurability, we could in principle bring a version of the task map back (in debug builds) without backing it into the stopper, though I don't anticipate that we'll want to. Closes cockroachdb#52894. Release note: None
We are likely going to invest more in the stopper-conferred observability in the near future as part of initiatives such as cockroachdb#58164, but the task tracking that has been a part of the stopper since near its conception has not proven to be useful in practice, while at the same time raising concern about stopper use in hot paths. When shutting down a running server, we don't particularly care about leaking goroutines (as the process will end anyway). In tests, we want to ensure goroutine hygiene, but if a test hangs during `Stop`, it is easier to look at the stacks to find out why than to consult the task map. Together, this left little reason to do anything more complicated than what's left after this commit: we keep track of the running number of tasks, and wait until this drops to zero. With this change in, we should feel comfortable using the stopper extensively and, for example, ensuring that any CRDB goroutine is anchored in a Stopper task; this is the right approach for test flakes such as in cockroachdb#51544 and makes sense for all of the reasons mentioned in issue cockroachdb#58164 as well. In a future change, we should make the Stopper more configurable and, through this configurability, we could in principle bring a version of the task map back (in debug builds) without backing it into the stopper, though I don't anticipate that we'll want to. Closes cockroachdb#52894. Release note: None
59647: stop: rip out expensive task tracking r=knz a=tbg First commit was put up for PR separately, ignore it here. ---- We are likely going to invest more in the stopper-conferred observability in the near future as part of initiatives such as #58164, but the task tracking that has been a part of the stopper since near its conception has not proven to be useful in practice, while at the same time raising concern about stopper use in hot paths. When shutting down a running server, we don't particularly care about leaking goroutines (as the process will end anyway). In tests, we want to ensure goroutine hygiene, but if a test hangs during `Stop`, it is easier to look at the stacks to find out why than to consult the task map. Together, this left little reason to do anything more complicated than what's left after this commit: we keep track of the running number of tasks, and wait until this drops to zero. With this change in, we should feel comfortable using the stopper extensively and, for example, ensuring that any CRDB goroutine is anchored in a Stopper task; this is the right approach for test flakes such as in #51544 and makes sense for all of the reasons mentioned in issue #58164 as well. In a future change, we should make the Stopper more configurable and, through this configurability, we could in principle bring a version of the task map back (in debug builds) without backing it into the stopper, though I don't anticipate that we'll want to. Closes #52894. Release note: None 59732: backupccl: add an owner column behind the WITH PRIVILEGES option r=pbardea a=Elliebababa Previously, when users perform RESTORE, they are ignorant of the original owner. This PR gives ownership data as a column behind privileges. Resolves: #57906. Release note: None. 59746: opt: switch checks to use CrdbTestBuild instead of RaceEnabled r=RaduBerinde a=RaduBerinde The RaceEnabled flag is not very useful for checks; e.g. apparently execbuilder tests aren't run routinely in race mode. These checks are now "live" in any test build, using the crdb_test build tag. Release note: None 59747: tree: correct StatementTag of ALTER TABLE ... LOCALITY r=ajstorm a=otan Release note: None Co-authored-by: Tobias Grieger <[email protected]> Co-authored-by: elliebababa <[email protected]> Co-authored-by: Radu Berinde <[email protected]> Co-authored-by: Oliver Tan <[email protected]>
Closing out as unactionable. I think most instances have been fixed throughout the cockroach repo. |
Describe the problem
make stress PKG=./pkg/kv/kvclient/kvcoord
is failing reliably with the following stack trace:To Reproduce
Jira issue: CRDB-4032
The text was updated successfully, but these errors were encountered: