roachtest: alterpk-tpcc failed #45812
Comments
(roachtest).alterpk-tpcc failed on master@752dea867f3aeb142e98c22f8d320ce19041aa8d:
|
(roachtest).alterpk-tpcc failed on master@dfa5bd527ae7d7373dd03c62118df87a87a77130:
|
cc @lucy-zhang is this failure related to the new sc + jobs PR? Looking at the logs, it seems like there is some deadlock while trying to run the schema change. Each node has these repeated logs:
and no progress is made for 10 hours |
954fe69 doesn't have those changes (and I rolled them back on master anyway, for the time being, to deal with a problem with migrations). That log message is coming from here: cockroach/pkg/sql/schema_changer.go, line 1026, at 954fe69
So it's a problem with trying to acquire a schema change lease, which hopefully should be solved once I do merge that PR. (I don't know what caused the deadlock, though.) Let's stress this test again once that PR is merged. |
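For context on what "trying to acquire a schema change lease" involves, here is a minimal sketch of the shape of the retry loop the schema changer can get stuck in; the `acquireLease` helper, sentinel error, and backoff constants are hypothetical stand-ins for illustration, not the actual code at schema_changer.go line 1026.

```go
package schemachange

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// errExistingLease is a hypothetical sentinel meaning another node already
// holds the schema change lease for this table.
var errExistingLease = errors.New("an outstanding schema change lease exists")

// acquireLease stands in for the real lease acquisition, which records the
// lease on the table descriptor; here it always fails so the loop is visible.
func acquireLease(ctx context.Context, tableID int64) error {
	return errExistingLease
}

// runWithLease shows the retry shape: back off and retry while another lease
// is outstanding. If the holder never releases it (e.g. its schema change is
// stuck), every node loops here indefinitely, which would match the repeated
// log lines and the 10 hours of no progress.
func runWithLease(ctx context.Context, tableID int64) error {
	backoff := time.Second
	for {
		err := acquireLease(ctx, tableID)
		if err == nil {
			return nil // lease acquired; proceed with the schema change
		}
		if !errors.Is(err, errExistingLease) {
			return err // unexpected error, give up
		}
		fmt.Printf("another schema change lease exists for table %d, retrying in %s\n", tableID, backoff)
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(backoff):
		}
		if backoff < time.Minute {
			backoff *= 2
		}
	}
}
```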
(roachtest).alterpk-tpcc failed on master@c473f40078994551cebcbe00fdbf1fa388957658:
|
I'm investigating this issue a little more and don't understand what's happening in the logs. We see here
that a backfill has started for our index. However, after that we see that all the nodes start trying to grab the lease: Then, we see that validation for the index backfills has completed:
However, it then doesn't look like we make it past
But it also doesn't look like we started to roll back the schema change either. |
@rohany Now that the new schema change job is merged, I think we should wait a few days and see if we get any more failures. The thing that's happening here (being unable to acquire a schema change lease) shouldn't be possible anymore, since schema change leases won't exist, though maybe there's some deeper underlying problem that would also cause problems with the new schema change job. |
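With the job-based schema changes, the progress (or lack of it) of an ALTER PRIMARY KEY should be visible through the jobs table rather than through lease messages. As a rough sketch of how to poll for that from Go, assuming a standard `database/sql` connection over pgwire and the `crdb_internal.jobs` virtual table (whose column set varies between versions):

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq" // CockroachDB speaks the Postgres wire protocol
)

func main() {
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// List schema change jobs with their status and progress.
	rows, err := db.Query(`
		SELECT job_id, status, fraction_completed, description
		FROM crdb_internal.jobs
		WHERE job_type = 'SCHEMA CHANGE'
		ORDER BY created DESC`)
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	for rows.Next() {
		var id int64
		var status, desc string
		var frac sql.NullFloat64 // may be NULL for jobs that never started
		if err := rows.Scan(&id, &status, &frac, &desc); err != nil {
			log.Fatal(err)
		}
		fmt.Printf("%d %-10s %5.1f%% %s\n", id, status, frac.Float64*100, desc)
	}
	if err := rows.Err(); err != nil {
		log.Fatal(err)
	}
}
```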
(roachtest).alterpk-tpcc failed on master@72c4a1bd411f2f82bf9aaa22883821a946614148:
|
I think the failure above still doesn't have the new sc jobs commit yet. |
(roachtest).alterpk-tpcc failed on master@5570c01402796edb7cd06eb8ce7f615371f22d42:
|
This PR adds some extra logging to the backfiller and alterpk roachtest to have more insight on where cockroachdb#45812 is failing. Release justification: non production code change Release note: None
It's unclear from these logs what exactly is going wrong, though the situation still seems similar to the logs from before the schema change jobs PR. One thing is that 500 warehouses is probably too big to run on a 3 node 4 CPU cluster, but that doesn't explain the hang. |
46001: roachtest: add extra logging to alterpk-tpcc roachtest r=yuzefovich a=rohany This PR adds some extra logging to the backfiller and alterpk roachtest to have more insight on where #45812 is failing. Release justification: non production code change Release note: None Co-authored-by: Rohan Yadav <[email protected]>
(roachtest).alterpk-tpcc failed on master@793a9200c16693aff32aa6a4dd9d8bbcbddb30aa:
|
(roachtest).alterpk-tpcc failed on master@69dc87d68addedf2fabfb2b14c098cfb35b5f3d0:
|
Hmm, latest failure just looks like a random node failure. Haven't seen the deadlock in the past 2 runs... |
Eh, the OOM can be reliably reproduced on a 3 node roachprod cluster with default
Here is an explain analyze (on an occasion when it didn't crash) of the query, for context on the amount of data flying around:
We seem to be estimating the memory usage of the aggregators relatively well now. For example, here is one of the heap profiles taken when the first stage aggregator finishes and the second one is about to:
In all of the crashes I've observed, we successfully get to the point where the hash join is being performed and actually get pretty far into the join's execution. However, RSS jumps quite significantly (say from
I'm tired of looking at this issue and am wondering whether someone else should take a stab at it. Here are a few suggestions:
|
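The gap described above, where the accounted memory looks reasonable in the heap profiles but RSS still jumps during the hash join, is the kind of thing that can be narrowed down by logging Go runtime memory stats next to the accounted totals. A rough sketch follows; `accountedBytes` is a hypothetical hook for whatever the SQL memory monitors report, not a real API.

```go
package memwatch

import (
	"log"
	"runtime"
	"time"
)

// accountedBytes is a hypothetical hook returning the total currently
// registered with the SQL memory monitors; in a real investigation this
// would be wired up to the root BytesMonitor.
var accountedBytes = func() int64 { return 0 }

// watch periodically logs Go heap usage next to the accounted total. A large,
// growing gap between HeapInuse (or Sys) and the accounted number points at
// allocations the monitors never see, e.g. temporary copies or fragmentation.
func watch(stop <-chan struct{}) {
	ticker := time.NewTicker(5 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-stop:
			return
		case <-ticker.C:
			var m runtime.MemStats
			runtime.ReadMemStats(&m)
			log.Printf("heap_inuse=%dMiB sys=%dMiB accounted=%dMiB",
				m.HeapInuse>>20, m.Sys>>20, accountedBytes()>>20)
		}
	}
}
```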
Hi @yuzefovich -- sorry for my slow reply to you earlier -- since you've disabled auto stats, does this mean you've ruled out histograms as the cause? (You can also explicitly disable histograms using |
I still think there is memory misaccounting while performing automatic stats collection, but the crashes occur (seemingly with lower likelihood) with auto stats disabled, so I don't think they are the root cause of these OOMs. |
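For reference, the experiment above (running with auto stats off, and optionally histogram collection off) is driven by cluster settings. The setting names below are my best guess and should be verified with SHOW CLUSTER SETTINGS; this is only a sketch of how one might toggle them from Go.

```go
package stats

import (
	"context"
	"database/sql"
)

// disableAutoStats turns off automatic stats collection and histogram
// collection for the whole cluster. The setting names are assumptions,
// not confirmed anywhere in this thread.
func disableAutoStats(ctx context.Context, db *sql.DB) error {
	for _, stmt := range []string{
		`SET CLUSTER SETTING sql.stats.automatic_collection.enabled = false`,
		`SET CLUSTER SETTING sql.stats.histogram_collection.enabled = false`,
	} {
		if _, err := db.ExecContext(ctx, stmt); err != nil {
			return err
		}
	}
	return nil
}
```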
Ok thanks! I'll investigate the memory accounting of stats collection on Monday (unless there is some urgency to do it sooner...). |
Thanks! I don't think it's urgent since it seems to me that generally we're accounting for "permanently" used memory while automatically collecting stats, but I'm concerned about "temporary" allocations that could spike - I just don't know how big the spikes can be. |
(roachtest).alterpk-tpcc failed on master@82fec00c83d4bfe35b906264ccb568568cec15b7:
Artifacts: /alterpk-tpcc
See this test on roachdash |
(roachtest).alterpk-tpcc failed on master@b1a0b989bbfef500075a485edc762fe42ca7b32a:
Artifacts: /alterpk-tpcc
See this test on roachdash |
(roachtest).alterpk-tpcc failed on master@beac4a53e0e2e2236eb5957f67abc1bf476ad1b6:
Artifacts: /alterpk-tpcc
See this test on roachdash |
Prior to this commit, we did not account for the memory used in the sampleAggregator when we copy all samples into a new slice before generating a histogram. This commit adds some additional memory accounting for this overhead. Informs cockroachdb#45812 Release note: None
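The fix described in the commit message above amounts to growing a memory account by the size of the extra slice before copying the samples out for histogram generation. Below is a minimal sketch of that pattern, using a hypothetical `memAccount` interface in the style of a BoundAccount rather than the real pkg/util/mon types or the actual sampleAggregator code.

```go
package stats

import (
	"context"
	"unsafe"
)

// sample stands in for the row samples kept by the sample aggregator.
type sample struct {
	rank uint64
	row  []byte
}

// memAccount is a hypothetical memory account: Grow reserves bytes (and
// errors out if the budget is exceeded), Shrink returns them.
type memAccount interface {
	Grow(ctx context.Context, n int64) error
	Shrink(ctx context.Context, n int64)
}

// copySamplesForHistogram shows the accounting pattern: reserve memory for
// the new slice before allocating it, and release it once the histogram has
// been built. Without the Grow call, this temporary copy is invisible to the
// memory monitor, which is exactly the kind of unaccounted spike discussed
// earlier in this thread.
func copySamplesForHistogram(ctx context.Context, acc memAccount, samples []sample) ([]sample, func(), error) {
	extra := int64(len(samples)) * int64(unsafe.Sizeof(sample{}))
	if err := acc.Grow(ctx, extra); err != nil {
		return nil, nil, err // budget exceeded: fail instead of OOMing
	}
	out := make([]sample, len(samples))
	copy(out, samples)
	release := func() { acc.Shrink(ctx, extra) }
	return out, release, nil
}
```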
Unfortunately, this crash seems like a regression. Using 19.2.5, I created a 3 node cluster with default node sizes, imported 500 warehouses, waited for auto stats to be collected, and then ran the first query from the 3.3.2.6 check manually twice; both times it succeeded in about 4 minutes. Then I stopped the cluster and restarted it using the current master, but when I executed the query, it crashed on the first try. This issue deserves more investigation. cc @jordanlewis |
(roachtest).alterpk-tpcc failed on master@2032dafccfa311c7538960e974953cb9dc1d4e50:
Artifacts: /alterpk-tpcc
See this test on roachdash |
47106: rowexec: account for some additional memory used by stats collection r=rytaft a=rytaft Prior to this commit, we did not account for the memory used in the `sampleAggregator` when we copy all samples into a new slice before generating a histogram. This commit adds some additional memory accounting for this overhead. Informs #45812 Release note: None Co-authored-by: Rebecca Taft <[email protected]>
48342: release-19.2: rowexec: account for some additional memory used by stats collection r=rytaft a=rytaft Backport 1/1 commits from #47106. /cc @cockroachdb/release --- Prior to this commit, we did not account for the memory used in the `sampleAggregator` when we copy all samples into a new slice before generating a histogram. This commit adds some additional memory accounting for this overhead. Informs #45812 Release note: None Co-authored-by: Rebecca Taft <[email protected]>
(roachtest).alterpk-tpcc failed on master@954fe69d554162aec0fbc001aad1fe5103d8df13:
Artifacts: /alterpk-tpcc
See this test on roachdash
powered by pkg/cmd/internal/issues