sql/logictest: TestLogic/local/aggregate/other times out under race #54685
Comments
Took an initial cursory look. It ran with a 2h timeout and some extraordinary repeated log messages:
@ajwerner / @nvanbenschoten what do these log messages mean? |
This seems alarming. We need the rangefeed to catch up in order to get updates for schema leases and what not.
This one I don't know about |
The rangefeed slowdown is pretty easily reproducible with:
I'm not sure if this is expected or not during race. Who would be able to look further into this? |
slowdown isn't necessarily a problem but getting stuck is. does the amount that the rangefeed is behind keep increasing? |
It does increase:
|
Alright, this is bad and indicates that the closed timestamp is stuck |
or rather, the resolved timestamp |
cc @nvanbenschoten can you triage? This failure is happening very often |
@aayushshah15 do you have free cycles to take a look at this? It looks right up your alley. |
@andreimatei just want to raise the visibility on this as you proceed to work on understanding some other closed timestamp issues. |
FWIW, I was wondering if the logictest executes test files in parallel, but it looks like that's disabled under race (cockroach/pkg/sql/logictest/logic.go, line 2513 in 5025c70).
I'd be curious to understand why the slow tests report that I mucked with back in the day doesn't seem to work right here. For example in #55246 it just says
Maybe there are some particular culprit files that we should be aware of... |
I can bump it to 24h if we think that will fix things... otherwise we could just try to disable this specific test under race.
Why do you think it's not working? |
Oh... And it's also the first test to run? |
It's not -- we've had up to ~4000 tests passing with this one failing: https://teamcity.cockroachdb.com/buildConfiguration/Cockroach_Nightlies_SqlRaceLogicTests/2341042 |
Ah, I wish that report would tell us how many non-slow tests passed. |
Discussed with @andreimatei offline. Summarizing my thoughts/findings:
We should try to understand why this test takes so long under race, and if it makes sense, raise the timeout or skip this particular test. I'm removing the blocker label and passing it back to someone from Exec. |
Do you all know about this stack trace, which I saw in another timeout?
|
@tbg hey, it looks like your name is on the git blame for this error - any idea what it's about? |
This looks like fallout from 8585416. We're calling
There are two things here:
|
Opened #55310 to bump the test timeout to 24 hours. If that doesn't help I can try to disable this particular test under race. |
@nvanbenschoten jogging my memory to remind myself what's going on here. Say we're GC'ing a key. Say the versions are these (sorted in ascending engine order):
We'll look at the meta first and then step forward a few times hoping to get to the first gc'able version, but here there are too many non-gc'able ones, so we'll end up somewhere here:
We'll then
Then we step the iterator forward one and begin removing keys.
Ok, so what does the SpanSet not like? Your comment indicates that its
What I don't fully understand is why we're catching the assertion here. We're GC'ing a single key, so the iterator is bounded on
Ah, looking into c2ecc15 a bit more I see why. We enforce the spans against the seeked key as long as the iter is valid after the seek. It seems that what we want here is to check the current position of the iterator in that case. If we actually seeked to below k, that's a problem. If it's
(cockroach/pkg/kv/kvserver/spanset/batch.go, lines 88 to 93 in 0236638)
|
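To illustrate the difference between asserting on the seek key and asserting on where the iterator actually lands, here is a toy sketch. It is not the spanset implementation: `Span`, `sliceIter`, `checkSeekKey`, and `checkPosition` are hypothetical names, keys are plain byte strings without MVCC versions, and the reverse seek to the span's exclusive end key is an assumed scenario chosen only to make the two checks disagree.

```go
package main

import (
	"bytes"
	"fmt"
	"sort"
)

// Span is a hypothetical half-open key range [Start, End).
type Span struct{ Start, End []byte }

// Contains reports whether key lies in [Start, End).
func (s Span) Contains(key []byte) bool {
	return bytes.Compare(key, s.Start) >= 0 && bytes.Compare(key, s.End) < 0
}

// sliceIter is a toy iterator over keys sorted in ascending order.
type sliceIter struct {
	keys [][]byte
	pos  int // -1 when the iterator is not positioned on a key
}

// SeekLT positions the iterator on the last key strictly less than target.
func (it *sliceIter) SeekLT(target []byte) {
	// sort.Search returns the first index whose key is >= target.
	i := sort.Search(len(it.keys), func(i int) bool {
		return bytes.Compare(it.keys[i], target) >= 0
	})
	it.pos = i - 1
}

func (it *sliceIter) Valid() bool { return it.pos >= 0 && it.pos < len(it.keys) }
func (it *sliceIter) Key() []byte { return it.keys[it.pos] }

// checkSeekKey asserts on the key that was passed to the seek, whenever the
// iterator is valid afterwards.
func checkSeekKey(s Span, it *sliceIter, seekKey []byte) error {
	if it.Valid() && !s.Contains(seekKey) {
		return fmt.Errorf("seek key %q outside span [%q, %q)", seekKey, s.Start, s.End)
	}
	return nil
}

// checkPosition asserts on the key the iterator actually landed on.
func checkPosition(s Span, it *sliceIter) error {
	if it.Valid() && !s.Contains(it.Key()) {
		return fmt.Errorf("iterator at %q outside span [%q, %q)", it.Key(), s.Start, s.End)
	}
	return nil
}

func main() {
	span := Span{Start: []byte("k"), End: []byte("l")}
	it := &sliceIter{keys: [][]byte{[]byte("k1"), []byte("k2"), []byte("k3")}, pos: -1}

	// Seek backwards from the span's exclusive end: the seek key is outside
	// the span, but the iterator lands on "k3", which is inside it.
	seekKey := []byte("l")
	it.SeekLT(seekKey)
	fmt.Println("seek-key check:", checkSeekKey(span, it, seekKey))
	fmt.Println("position check:", checkPosition(span, it))
}
```

In this toy model the position-based check still rejects an iterator that lands below the span's start, which is the case the comment above calls a real problem.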
Friendly ping, @nvanbenschoten. |
(sql/logictest).TestLogic/local/aggregate/other failed on master@d438437fe2b9dbec3980f9265a36c9dfb527ba8b:
Related:
#54608 sql/logictest: TestLogic/local/aggregate/other timed out [C-test-failure O-robot branch-release-20.2]
#54601 sql/logictest: TestLogic/local/aggregate/other timed out [C-test-failure O-robot branch-release-20.1]
See this test on roachdash
powered by pkg/cmd/internal/issues