roachtest: tpchvec/disk failed #87510
From teardown.log in the artifacts:
From test.log in the artifacts:
Example from node 1 stack trace:
@yuzefovich do you think it could be related to any of the refactoring you've done recently?
87563: colmem: fix some issues with the memory limiting r=yuzefovich a=yuzefovich

**colexec: some fixes of the external sort**

This commit fixes some of the relatively benign issues in the external sort:
- previously we forgot to unset the partition info for all partitions that are being merged as part of the repeated merging process (we would only reset the first one because it is overwritten by the newly created "merged" partition)
- we incorrectly estimated "min output batch size" for the repeated merging (the calculation was as if all current partitions were being merged rather than `n`)
- we incorrectly computed the memory size of the enqueued batch. It is possible that the batch is a "window" or doesn't use the whole capacity, and previously we were using the total memory footprint. However, we need to only include the "proportional" size according to the length of the batch.

The issues are relatively benign since they would mostly make the verbose logging incorrect as well as over-estimate the "max batch mem size" (which would mean that we'd merge the partitions sooner or with a smaller output batch size).

Release justification: low-risk bug fix.

Release note: None

**colmem: fix some issues with the memory limiting**

This commit fixes a couple of issues with how we do memory-limiting of batches by the footprint.

In particular, the allocator will now estimate the memory footprint of a batch before allocating a new one and will clamp the capacity so that the batch stays under the limit. Previously, we could allocate a batch that would exceed the limit even when all types are fixed length. This behavior has been present since long time ago.

Additionally, this commit fixes a recent regression in how `SetAccountingHelper` uses the capacity of the batch. Previously, if a new batch is allocated (when variable-width types are present) and exceeds the memory limit, then the first call to `AccountForSet` would artificially clamp the used capacity at 1, so the batch might have a lot of unused capacity. Now the helper will memorize the full capacity right after the batch is allocated.

This regression was introduced in a recent refactor of the `SetAccountingHelper`. In particular, it could lead to the external sort (which uses the ordered synchronizer internally which uses the `SetAccountingHelper`) becoming excruciatingly slow with "unlucky" low memory limits. The limit would be "unlucky" if it is such that a batch with capacity `c` doesn't exceed it, but the batch with capacity `2 * c` would exceed the limit by less than a factor of two. In such a scenario previously we would allocate the batch of capacity `2 * c` yet would always use only a single row (because in `AccountForSet` we would set the max capacity at 1).

Addresses: #87510.

Release justification: bug fix.

Release note: None

Co-authored-by: Yahor Yuzefovich <[email protected]>
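The two size calculations described in the commit messages above can be illustrated with a small standalone sketch. This is not the CockroachDB implementation: `clampCapacity`, `proportionalSize`, and the fixed per-row footprint are hypothetical names and simplifications, assuming all columns are fixed length so that the footprint grows linearly with capacity.

```go
package main

import "fmt"

// clampCapacity is a hypothetical helper: it returns the largest capacity
// not exceeding desired whose estimated footprint (capacity * perRowBytes)
// stays within limitBytes. It never returns less than 1 so that a batch
// can always hold at least one row.
func clampCapacity(desired int, perRowBytes, limitBytes int64) int {
	if desired < 1 {
		return 1
	}
	if perRowBytes <= 0 {
		// Nothing to estimate; keep the requested capacity.
		return desired
	}
	maxRows := int(limitBytes / perRowBytes)
	if maxRows < 1 {
		maxRows = 1
	}
	if desired > maxRows {
		return maxRows
	}
	return desired
}

// proportionalSize is a hypothetical helper: it scales the total footprint
// of a batch by the fraction of its capacity that is actually in use, which
// is the "proportional" size mentioned for windowed or partially filled
// batches.
func proportionalSize(totalBytes int64, length, capacity int) int64 {
	if capacity <= 0 || length <= 0 {
		return 0
	}
	if length > capacity {
		length = capacity
	}
	return totalBytes * int64(length) / int64(capacity)
}

func main() {
	// An "unlucky" limit in the sense of the commit message: capacity
	// c = 512 fits (4096 bytes), while 2*c = 1024 needs 8192 bytes, which
	// exceeds the 6000-byte limit by less than a factor of two.
	const perRow, limit = int64(8), int64(6000)
	fmt.Println(clampCapacity(512, perRow, limit))  // 512
	fmt.Println(clampCapacity(1024, perRow, limit)) // 750
	// A batch with capacity 1024 holding only 256 rows counts for a quarter
	// of its 8192-byte footprint.
	fmt.Println(proportionalSize(8192, 256, 1024)) // 2048
}
```

With the clamp applied before allocation, the doubled batch in the "unlucky" case is sized to 750 rows instead of being allocated at 1024 rows and then limited to a single usable row.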
@yuzefovich should this have been closed by #87563?
I think only after the backports are merged because the regression from #85440 has already been backported.
roachtest.tpchvec/disk failed with artifacts on release-22.2 @ 1bfe9bcda653f55ed3b4216610433b51b2ef0d8f:
Parameters:
The backports have been merged, and no new issues came up, closing.
roachtest.tpchvec/disk failed with artifacts on master @ 2372698da1dfacb90f60c6a63f2c1298d1db16b8:
Parameters: ROACHTEST_cloud=gce, ROACHTEST_cpu=4, ROACHTEST_ssd=0
Help
See: roachtest README
See: How To Investigate (internal)
This test on roachdash | Improve this report!
Jira issue: CRDB-19386