roachtest: tpce/c=5000/nodes=3 failed #61065
Doesn't seem like a blocker. |
@pbardea how should we think about this |
(roachtest).tpce/c=5000/nodes=3 failed on master@6601d827b814d4e85a1081b03bf2562d8ac2a4ab:
|
Yes, that would indicate that either the statement is running before the import completed, or the import is failing to transition the table to PUBLIC. Either way, this is not expected. I can take a pass at diagnosing what's going on here. (Didn't have bandwidth today, will look early next week.) |
(roachtest).tpce/c=5000/nodes=3 failed on master@9595a158f0233e1c3d86786ec4462dd39c7beb20:
|
(roachtest).tpce/c=5000/nodes=3 failed on master@7d1324fa42732f482329a524b0166db8dd7365e6:
|
Starting to take a look at this now. This error appears when something tries to access an OFFLINE table. Looking at the logs for the test I see:
This indicates that the error was hit while we were still importing the dataset. So, it doesn't look like the IMPORT is finished (which is when it marks the tables PUBLIC again). The only other table accesses that I see are the schema changes issued during schema initialization. I wonder if the schema change jobs are executing asynchronously and racing with the IMPORT INTO when accessing the table descriptor, with some schema change jobs trying to access the table descriptor after IMPORT has taken the tables offline. Going to try and see if this repros in a smaller test case. |
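To make the hypothesized race concrete, here is a minimal Python sketch. It is a toy model and not CockroachDB code: `TableDescriptor`, `OfflineTableError`, and `run_interleaving` are hypothetical names, and the error text is invented. It only assumes what is described above, namely that IMPORT takes the table OFFLINE while it runs, marks it PUBLIC again when it finishes, and that any other access in between fails.

```python
from enum import Enum


class State(Enum):
    PUBLIC = "PUBLIC"
    OFFLINE = "OFFLINE"


class OfflineTableError(Exception):
    """Hypothetical stand-in for the error raised when an OFFLINE table is accessed."""


class TableDescriptor:
    """Toy descriptor: only a name and a state."""

    def __init__(self, name):
        self.name = name
        self.state = State.PUBLIC

    def access(self, who):
        # Any access from outside the import is rejected while the table is OFFLINE.
        if self.state is State.OFFLINE:
            raise OfflineTableError(f"{who}: table {self.name} is offline")
        return f"{who}: ok"


def run_interleaving(steps):
    """Apply the steps in order and collect each outcome as a string."""
    table = TableDescriptor("trade")
    outcomes = []
    for step in steps:
        if step == "import_start":
            table.state = State.OFFLINE
            outcomes.append("import: table taken OFFLINE")
        elif step == "import_finish":
            table.state = State.PUBLIC
            outcomes.append("import: table back to PUBLIC")
        else:
            try:
                outcomes.append(table.access(step))
            except OfflineTableError as err:
                outcomes.append(f"error: {err}")
    return outcomes


# The schema change completes before the import starts: no error.
safe = run_interleaving(["schema_change", "import_start", "import_finish"])

# The schema change is still running asynchronously when the import starts: error.
racy = run_interleaving(["import_start", "schema_change", "import_finish"])
```

In this model the only difference between `safe` and `racy` is ordering, which is why a smaller test case may not reproduce the failure unless it forces the schema change to overlap with the import.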
(roachtest).tpce/c=5000/nodes=3 failed on master@9ba48738bc511ad6954682cab41e23b8492facd8:
|
(roachtest).tpce/c=5000/nodes=3 failed on master@b703e663da8ededaee2e28fc39a24e3880ae54cf:
|
(roachtest).tpce/c=5000/nodes=3 failed on master@15a185606d5e80b47d9fdd0ed4f54cfe29c527c6:
|
(roachtest).tpce/c=5000/nodes=3 failed on master@a69e6549a71f5a0e83eb13509001f4d7351050fb:
|
I just reproduced this and found that the IMPORT is finishing. We're then hitting the issue afterward. I'm not sure if this helps, but it does indicate that this is a different issue than the slowdowns we've been seeing on tpc-c recently. |
Thanks! That is very useful. To confirm, you're seeing the IMPORT go to a successful status and then we're seeing the error? I would have only expected this error to happen when trying to read from the table before the IMPORT completes. Given that description I'm adding the ga-blocker tag for now. Looking at the failures above, I see most of them log |
Yes, that's what I'm seeing. However, I haven't been able to determine exactly which statement is returning the error. I'll try to determine that. |
I was able to confirm that the IMPORT statements themselves are returning these errors. For instance, in my most recent test, I saw the following three statements all return errors:
One potentially interesting thing to note is that we are performing the IMPORTs in parallel. Another thing to note is that we perform these imports immediately after performing a large series of schema changes on the tables to install duplicate indexes. |
That's very useful! That narrows down what it could be. I'm still not quite able to repro on a smaller scale, but it's interesting that the failure comes from some descriptor access inside IMPORT, and that it's failing to access the same descriptor it's importing into. I'll double check the PRs that merged around the time we started seeing this for any erroneous descriptor accesses. Given that it's the IMPORTs that are hitting the error, I don't suspect the schema changes affect this, but that's good to keep in mind. |
It's also interesting that you saw the imports finish and then this error being hit, but it's also the import statements themselves that are returning the error. How did you determine that the import statements were finishing? Were you seeing all of them finish or only some of them? I would be very surprised if they were returning an error after being marked as successful. |
(roachtest).tpce/c=5000/nodes=3 failed on master@4b98115dfda02a9498f566958bd915c45ec7e449:
Artifacts: /tpce/c=5000/nodes=3
See this test on roachdash |
(roachtest).tpce/c=5000/nodes=3 failed on master@4d44ddf24153d8ef8e0a996fdbe75ac5607f9574:
Artifacts: /tpce/c=5000/nodes=3
See this test on roachdash |
I determined that the imports were succeeding by checking the admin UI's jobs page. |
(roachtest).tpce/c=5000/nodes=3 failed on master@bdff5338ca725bf1cfddf7e3f648bbf02ab42999:
Artifacts: /tpce/c=5000/nodes=3
See this test on roachdash |
(roachtest).tpce/c=5000/nodes=3 failed on master@e09b93fe62541c3a94f32a723778660b528a0792:
Artifacts: /tpce/c=5000/nodes=3
See this test on roachdash |
Looking at the latest failures again.
State:
Will continue to stress this today. |
Update: a few of my runs have resulted in import jobs failing with |
(roachtest).tpce/c=5000/nodes=3 failed on master@e9387a6e5dfdad71c74ccd0a07c907632613fa3e:
Artifacts: /tpce/c=5000/nodes=3
See this test on roachdash |
I've been able to reproduce this most recent failure. The error is that the table descriptor was unexpectedly modified before setting the descriptors OFFLINE. Adding a bit of logging to the error message reveals the following diff between the tables:
Both tables were in the Note that IMPORT INTO invalidates FKs, but does so when transitioning the tables from OFFLINE to PUBLIC. |
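One generic way a descriptor can end up "unexpectedly modified" is a stale write: a job plans against a snapshot, another writer bumps the descriptor, and the first job's later write no longer matches. The Python sketch below illustrates that pattern with a version check. It is an assumption for illustration only and not how IMPORT actually compares descriptors; `DescriptorStore`, `DescriptorChangedError`, and `write_if_unchanged` are hypothetical names.

```python
class DescriptorChangedError(Exception):
    """Hypothetical stand-in for the 'unexpectedly modified' failure."""


class DescriptorStore:
    """Toy store holding one versioned descriptor per table."""

    def __init__(self):
        self._descs = {}

    def create(self, name):
        self._descs[name] = {"version": 1, "state": "PUBLIC"}

    def read(self, name):
        # Return a copy so callers hold a snapshot, not the live record.
        return dict(self._descs[name])

    def write_if_unchanged(self, name, snapshot, new_state):
        # Reject the write if someone else changed the descriptor since the snapshot.
        current = self._descs[name]
        if current["version"] != snapshot["version"]:
            raise DescriptorChangedError(
                f"{name}: expected version {snapshot['version']}, "
                f"found {current['version']}"
            )
        self._descs[name] = {"version": current["version"] + 1, "state": new_state}


store = DescriptorStore()
store.create("trade")

# Two writers plan against the same snapshot of the descriptor.
snap_a = store.read("trade")
snap_b = store.read("trade")

# Writer A goes first and succeeds, bumping the version.
store.write_if_unchanged("trade", snap_a, "OFFLINE")

# Writer B still holds the old snapshot, so its write is rejected.
try:
    store.write_if_unchanged("trade", snap_b, "OFFLINE")
    job_b_result = "ok"
except DescriptorChangedError as err:
    job_b_result = f"failed: {err}"
```

The check itself is doing its job here; the question it raises is who the second writer is.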
The most recent failure is explained by there being two import jobs importing into the same table. For example, in https://teamcity.cockroachdb.com/repository/download/Cockroach_Nightlies_WorkloadNightly/2780695:id/tpce/c%3D5000/nodes%3D3/run_1/debug.zip!/debug/system.jobs.txt, two IMPORT INTO tpce.public.trade jobs have been created and are being run in parallel. The first job modifies the table during the execution of the second. Looking at https://github.com/cockroachlabs/tpc-e/blob/master/tier-a/src/schema.rs#L822, I'm not yet seeing why the test is issuing multiple import requests. |
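For anyone checking other runs for the same pattern, a small Python sketch like the following could flag tables targeted by more than one IMPORT INTO job. It assumes the job records have already been parsed into `(job_id, description)` pairs; the real layout of `system.jobs.txt` is not reproduced here, and the sample rows, job IDs, and the `duplicate_import_targets` helper are made up for illustration.

```python
import re
from collections import defaultdict

# Captures the target table of an "IMPORT INTO <table> ..." job description.
IMPORT_INTO = re.compile(r"^\s*IMPORT\s+INTO\s+([\w.]+)", re.IGNORECASE)


def duplicate_import_targets(jobs):
    """Return {table: [job_ids]} for tables targeted by more than one IMPORT INTO job.

    `jobs` is an iterable of (job_id, description) pairs.
    """
    by_table = defaultdict(list)
    for job_id, description in jobs:
        match = IMPORT_INTO.match(description)
        if match:
            by_table[match.group(1).lower()].append(job_id)
    return {table: ids for table, ids in by_table.items() if len(ids) > 1}


# Made-up rows for illustration; not taken from the real debug.zip.
sample_jobs = [
    (101, "IMPORT INTO tpce.public.trade (...)"),
    (102, "IMPORT INTO tpce.public.customer (...)"),
    (103, "IMPORT INTO tpce.public.trade (...)"),
    (104, "CREATE INDEX ON tpce.public.trade (...)"),
]

duplicates = duplicate_import_targets(sample_jobs)
```

With the sample rows, only `tpce.public.trade` is reported, since it is the only table with two IMPORT INTO jobs; the schema change job on the same table is ignored.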
(roachtest).tpce/c=5000/nodes=3 failed on master@597e4a8c487e3c23d64885563d608a692b59055c:
Artifacts: /tpce/c=5000/nodes=3
See this test on roachdash |
Had a bit more time to look at this this afternoon. My hypothesis is that both of these failures can be explained if IMPORT jobs are being double-created. The latest failures that we're seeing are explained by:
The earlier failure mode can be explained by:
There were a few jobs-related changes that went in around the time we started seeing this failure (e.g. fe6377c#diff-bbc44b1b8225066d6d73cb8b4efce341bfec316008c10d42cc53dd58010ad781), which are now suspect. I'm going through those changes to see if one explains the double-creation of the job records (and thus these races we're seeing). |
(roachtest).tpce/c=5000/nodes=3 failed on master@ee9f47b9ec9476a693464e2dcd09a01bf9d39ad2:
Artifacts: /tpce/c=5000/nodes=3
See this test on roachdash |
(roachtest).tpce/c=5000/nodes=3 failed on master@893643b63ea0b1cfa4888c6b73b5c68a9c100c3a:
Artifacts: /tpce/c=5000/nodes=3
See this test on roachdash |
(roachtest).tpce/c=5000/nodes=3 failed on master@ec011620c7cf299fdbb898db692b36454defc4a2:
Artifacts: /tpce/c=5000/nodes=3
See this test on roachdash
powered by pkg/cmd/internal/issues