sql: stack overflow in InternalExecutor.Exec{Ex} #109197
Closing as unactionable.
We saw this problem again on a CCT 23.2 cluster running alpha.6. Here are the stack dumps of the 7 nodes that crashed due to the stack overflow: overflow.zip. Interestingly, on n1 we have 5 goroutines whose stacks were elided; on n6, two (one of which was unrelated); and on all other nodes, just one. The unfortunate thing is that only in Go 1.21 do we get the tail of the stacks, so we currently don't know what query triggered this, and 23.2 uses Go 1.20. My hypothesis is that it's …
I have a somewhat far-fetched theory that the stack overflow is caused not by an infinite recursion but by a bounded recursion whose depth depends on whether we hit too many retries. I have been focusing on the node 1 crash since it was the first one. In that stack trace I fished out the following goroutine,
which almost matches the query plan for …
The only difference is that in this plan (which I got in a demo) we have …
The interesting bit is that the utility function uses …
Now, both of these reproductions happened during cluster-to-cluster streaming (TODO: confirm that the CCT cluster had this), and - I'm speculating a bit here - perhaps C2C needs to read …
I still don't see a fundamental problem with c09860b; however, it seems prudent to introduce a limit on the number of internal retries performed by …
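For illustration, here is a minimal, standalone Go sketch (not CockroachDB code; all names are made up) of why a retry mechanism that re-enters itself recursively can exhaust the stack even without infinite recursion: every retry keeps the previous frames pinned on the goroutine stack, unlike an iterative retry loop, so enough retries eventually trip the runtime's goroutine stack limit.

```go
package main

import (
	"errors"
	"fmt"
)

var errRetryable = errors.New("retryable")

// execWithRecursiveRetry retries by calling itself, so every retry adds a new
// set of frames to the goroutine stack instead of unwinding back to a loop.
// With enough retries this eventually hits Go's goroutine stack limit
// ("goroutine stack exceeds 1000000000-byte limit" on 64-bit platforms).
func execWithRecursiveRetry(run func() error) error {
	if err := run(); err != nil {
		if errors.Is(err, errRetryable) {
			return execWithRecursiveRetry(run) // recursion instead of a loop
		}
		return err
	}
	return nil
}

// execWithLoopRetry is the equivalent iterative form: the stack depth stays
// constant no matter how many times the operation is retried.
func execWithLoopRetry(run func() error) error {
	for {
		err := run()
		if err == nil || !errors.Is(err, errRetryable) {
			return err
		}
	}
}

func main() {
	attempts := 0
	run := func() error {
		attempts++
		if attempts < 5 {
			return errRetryable
		}
		return nil
	}
	fmt.Println(execWithRecursiveRetry(run)) // succeeds after 5 attempts
	attempts = 0
	fmt.Println(execWithLoopRetry(run)) // same result, constant stack depth
}
```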
Owing to this Go 1.21 runtime change?
In addition to the above enhancement, would it make sense to have an explicit upper bound on the depth of …?
Yes, precisely.
Yeah, I like that, thanks. I'll replace the limit on the number of retries with the depth limit (the former can be expressed via the latter). EDIT: replacing it seems a bit tricky, since a single retry corresponds to several levels of recursion, so I'll just add a constant limit on the depth.
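A rough sketch of that constant depth limit, again with invented names: thread a depth counter through the recursive call and return an assertion-style error once it crosses a fixed bound, so a runaway (or merely very deep) recursion fails cleanly instead of overflowing the stack.

```go
package main

import (
	"errors"
	"fmt"
)

// maxRecursionDepth is an arbitrary illustrative bound; the real value would
// be chosen so that legitimate retries never get anywhere near it.
const maxRecursionDepth = 1000

var errRetryable = errors.New("retryable")

// execInternal recurses on retryable errors but refuses to go deeper than
// maxRecursionDepth, returning an error instead of overflowing the stack.
func execInternal(depth int, run func() error) error {
	if depth > maxRecursionDepth {
		return fmt.Errorf("internal executor recursion depth %d exceeds limit %d",
			depth, maxRecursionDepth)
	}
	if err := run(); err != nil {
		if errors.Is(err, errRetryable) {
			// A single retry can add several levels of recursion in the real
			// code, which is why the bound is on depth rather than on the
			// number of retries.
			return execInternal(depth+1, run)
		}
		return err
	}
	return nil
}

func main() {
	// An always-retryable operation trips the depth limit instead of crashing.
	fmt.Println(execInternal(0, func() error { return errRetryable }))
}
```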
Over in https://cockroachlabs.slack.com/archives/C04U1BTF8/p1692639696886179, Lidor and Steven observed that, when running cluster-to-cluster replication, after two out of five nodes crashed due to running out of disk, the remaining three nodes soon crashed with the following:
This was on 9849680, and it seems likely that c09860b has some bug in it.
Jira issue: CRDB-30817