
release-22.1.0: colexec: fix sort chunks with disk spilling in very rare circumstances #80715

Merged: 1 commit merged into release-22.1.0 on May 2, 2022

Conversation


@blathers-crl blathers-crl bot commented Apr 28, 2022

Backport 1/1 commits from #80679 on behalf of @yuzefovich.

/cc @cockroachdb/release


This commit fixes a long-standing but very rare bug which could result
in some rows being dropped when sort chunks ("segmented sort") spills
to disk.

The root cause is that a deselector operator is placed on top of the
input to the sort chunks op (because its `chunker` spooler assumes no
selection vector on batches), and that deselector uses the same
allocator as the sort chunks. If the allocator's budget is used up, an
error is thrown, and it is caught by the disk-spilling infrastructure
that wraps this whole `sort chunks -> chunker -> deselector` chain; the
error is then suppressed, and spilling to disk occurs. However,
crucially, it was always assumed that the error occurred in `chunker`,
so only that component knows how to properly perform the fallback. If
the error occurs in the deselector, the deselector might end up losing
a single input batch.

We worked around this by making a fake allocation in the deselector
before reading the input batch. However, if the stars align, and the
error occurs after reading the input batch in the deselector, that
input batch will be lost, and we might get incorrect results.
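The failure mode above can be illustrated with a self-contained Go sketch. This is not the actual colexec code: the `allocator`, `input`, and `deselectBuggy` types below are hypothetical stand-ins for `colmem.Allocator` and the operator chain, reduced to the essentials of "read a batch, then allocate, then lose the batch when the allocation fails and the wrapper retries".

```go
package main

import (
	"errors"
	"fmt"
)

// errBudgetExceeded simulates colexec's "memory budget exceeded" error.
var errBudgetExceeded = errors.New("memory budget exceeded")

// allocator tracks usage against a fixed budget (a hypothetical
// stand-in for the shared colmem.Allocator).
type allocator struct{ used, budget int }

func (a *allocator) alloc(n int) error {
	if a.used+n > a.budget {
		return errBudgetExceeded
	}
	a.used += n
	return nil
}

// input hands out each batch exactly once, like a real operator chain.
type input struct{ batches []string }

func (in *input) next() (string, bool) {
	if len(in.batches) == 0 {
		return "", false
	}
	b := in.batches[0]
	in.batches = in.batches[1:]
	return b, true
}

// deselectBuggy reads a batch first and allocates afterwards; if the
// allocation fails, the batch it just read has nowhere to go.
func deselectBuggy(in *input, a *allocator) (string, error) {
	b, ok := in.next()
	if !ok {
		return "", nil
	}
	if err := a.alloc(10); err != nil {
		return "", err // the batch held in b is silently dropped
	}
	return b, nil
}

// run drives the chain the way the disk-spilling wrapper would: on a
// budget error it "spills" (here: just frees the budget) and retries.
func run(batches []string, budget int) []string {
	in := &input{batches: batches}
	a := &allocator{budget: budget}
	var out []string
	for {
		b, err := deselectBuggy(in, a)
		if err != nil {
			a.used = 0 // pretend we spilled to disk, then retry
			continue
		}
		if b == "" {
			return out
		}
		out = append(out, b)
	}
}

func main() {
	// Two batches, but the budget covers only one allocation: the retry
	// after the simulated spill has already consumed the second batch.
	fmt.Println(run([]string{"batch-1", "batch-2"}, 10)) // [batch-1]
}
```

The key property the sketch captures is that the retry path resets the budget but cannot resurrect the batch that `next` already handed out, which is exactly how rows went missing.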

For the bug to occur, a couple of conditions need to be met:

1. The "memory budget exceeded" error must occur for the sort chunks
   operation. It is far more likely to occur in the `chunker`, because
   that component can buffer an arbitrarily large number of tuples and
   because we did make that fake allocation.
2. The input operator to the chain must be producing batches with
   selection vectors on top; if this is not the case, the deselector is
   a no-op. An example of such an input is a table reader with a filter
   on top.
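To make condition 2 concrete, here is a minimal Go sketch of what a deselector does with a selection vector. The `batch` type is an illustrative simplification of `coldata.Batch`, not the real structure: a filter leaves the original column data in place and records the surviving row indices in `sel`, and deselection densifies that into a batch with no selection vector.

```go
package main

import "fmt"

// batch loosely mimics coldata.Batch: one int64 column plus an
// optional selection vector naming which rows are live.
type batch struct {
	col []int64
	sel []int // nil means "no selection vector"
}

// deselect materializes the selection vector into a dense batch. With
// sel == nil it is a pure pass-through, which is why the bug requires
// an input that actually produces selection vectors.
func deselect(b batch) batch {
	if b.sel == nil {
		return b // no-op path
	}
	out := make([]int64, len(b.sel))
	for i, idx := range b.sel {
		out[i] = b.col[idx]
	}
	return batch{col: out}
}

func main() {
	// A filter kept rows 1 and 3 of the scanned batch.
	filtered := batch{col: []int64{10, 20, 30, 40}, sel: []int{1, 3}}
	fmt.Println(deselect(filtered).col) // [20 40]
}
```

Note that the dense output batch is a fresh allocation, which is precisely the allocation whose failure the buggy code path mishandled.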

The fix is quite simple: use a separate allocator for the deselector
that has an unlimited budget. This allows us to still properly track
the memory usage of the extra batch created in the deselector without
running into these difficulties with disk spilling. It also means that
if a "memory budget exceeded" error does occur in the deselector (which
is possible if `--max-sql-memory` has been used up), it will not be
caught by the disk-spilling infrastructure and will be propagated to
the user, which is the expected and desired behavior in such a
scenario.
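The shape of the fix can be sketched in the same simplified Go model as above (again, illustrative names, not the real colexec API): the deselector is handed its own allocator whose budget is effectively unlimited, so a budget error can no longer originate there and be misattributed to the `chunker` by the spilling wrapper.

```go
package main

import (
	"errors"
	"fmt"
	"math"
)

var errBudgetExceeded = errors.New("memory budget exceeded")

// allocator tracks usage against a budget (hypothetical stand-in for
// colmem.Allocator).
type allocator struct{ used, budget int }

func (a *allocator) alloc(n int) error {
	if a.used+n > a.budget {
		return errBudgetExceeded
	}
	a.used += n
	return nil
}

// input hands out each batch exactly once.
type input struct{ batches []string }

func (in *input) next() (string, bool) {
	if len(in.batches) == 0 {
		return "", false
	}
	b := in.batches[0]
	in.batches = in.batches[1:]
	return b, true
}

// deselectFixed allocates from its own allocator: usage is still
// tracked, but a failure here would come only from the process-wide
// limit and would propagate to the user instead of being swallowed by
// the disk-spilling retry path.
func deselectFixed(in *input, own *allocator) (string, error) {
	b, ok := in.next()
	if !ok {
		return "", nil
	}
	if err := own.alloc(10); err != nil {
		return "", err
	}
	return b, nil
}

func run(batches []string) []string {
	in := &input{batches: batches}
	// The deselector's allocator has an effectively unlimited budget.
	own := &allocator{budget: math.MaxInt}
	var out []string
	for {
		b, err := deselectFixed(in, own)
		if err != nil || b == "" {
			return out
		}
		out = append(out, b)
	}
}

func main() {
	fmt.Println(run([]string{"batch-1", "batch-2"})) // [batch-1 batch-2]
}
```

With the deselector's allocations decoupled from the sort's quota, every batch that is read is also delivered, while the memory it occupies remains accounted for.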

There is no explicit regression test for this since our existing unit
tests already exercise this scenario once the fake allocation in the
deselector is removed.

Fixes: #80645.

Release note (bug fix): Previously, in very rare circumstances,
CockroachDB could incorrectly evaluate queries with an ORDER BY clause
when the prefix of the ordering was already provided by the index
ordering of the scanned table.


Release justification: low risk bug fix.

@blathers-crl blathers-crl bot force-pushed the blathers/backport-release-22.1.0-80679 branch from b02399a to de35ca1 Compare April 28, 2022 15:57
@blathers-crl blathers-crl bot requested review from cucaroach, msirek and rytaft April 28, 2022 15:57
@blathers-crl blathers-crl bot added blathers-backport This is a backport that Blathers created automatically. O-robot Originated from a bot. labels Apr 28, 2022
@blathers-crl
Author

blathers-crl bot commented Apr 28, 2022

Thanks for opening a backport.

Please check the backport criteria before merging:

  • Patches should only be created for serious issues or test-only changes.
  • Patches should not break backwards-compatibility.
  • Patches should change as little code as possible.
  • Patches should not change on-disk formats or node communication protocols.
  • Patches should not add new functionality.
  • Patches must not add, edit, or otherwise modify cluster versions; or add version gates.
If some of the basic criteria cannot be satisfied, ensure that the exceptional criteria are satisfied within.
  • There is a high priority need for the functionality that cannot wait until the next release and is difficult to address in another way.
  • The new functionality is additive-only and only runs for clusters which have specifically “opted in” to it (e.g. by a cluster setting).
  • New code is protected by a conditional check that is trivial to verify and ensures that it only runs for opt-in clusters.
  • The PM and TL on the team that owns the changed code have signed off that the change obeys the above rules.

Add a brief release justification to the body of your PR to justify this backport.

Some other things to consider:

  • What did we do to ensure that a user that doesn’t know & care about this backport, has no idea that it happened?
  • Will this work in a cluster of mixed patch versions? Did we test that?
  • If a user upgrades a patch version, uses this feature, and then downgrades, what happens?

@cockroach-teamcity
Member

This change is Reviewable

Collaborator

@rytaft rytaft left a comment


:lgtm: (and also add to #release-backports)

Reviewed 7 of 7 files at r1, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @cucaroach and @msirek)

Contributor

@msirek msirek left a comment


:lgtm:

Reviewable status: :shipit: complete! 2 of 0 LGTMs obtained (waiting on @cucaroach)

@yuzefovich yuzefovich merged commit 120fbbb into release-22.1.0 May 2, 2022
@yuzefovich yuzefovich deleted the blathers/backport-release-22.1.0-80679 branch May 2, 2022 16:27