release-21.2: colflow: release disk resources in hash router in all cases #81555

Merged 1 commit on May 22, 2022

@blathers-crl blathers-crl bot commented May 19, 2022

Backport 1/1 commits from #81491 on behalf of @yuzefovich.

/cc @cockroachdb/release


Previously, it was possible for the disk-backed spilling queue used
by the hash router outputs to not be closed when the hash router exited.
Namely, this could occur if the router output was not fully exhausted
(i.e. it could still produce more batches, but the consumer of the
router output was satisfied and called `DrainMeta`). In such a scenario,
`routerOutput.closeLocked` was never called because a zero-length batch
was never given to `addBatch` and the output was never canceled due to
an error. The flow cleanup also didn't save us because the router
outputs are not added to the `ToClose` slice.

The bug is now fixed by closing the router output in `DrainMeta`. This
behavior is acceptable because the caller is not interested in any more
data, and closing the output can be done multiple times (it is a no-op
on all calls except for the first one). There is no regression test
since one is quite tricky to come up with, given that the behavior of
router outputs is non-deterministic, and I don't think it's worth
introducing special knobs inside of `DrainMeta` / `Next` for this.
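
The idea behind the fix can be sketched in Go as follows. This is a simplified model rather than the actual colflow code: `spillingQueue`, the `closed` flag, and the `int` batch length are hypothetical stand-ins, and only the names `routerOutput`, `closeLocked`, `addBatch`, and `DrainMeta` come from the description above.

```go
package main

import (
	"fmt"
	"sync"
)

// spillingQueue is a hypothetical stand-in for the disk-backed queue; it only
// counts how many times its resources were released.
type spillingQueue struct {
	closeCalls int
}

func (q *spillingQueue) Close() { q.closeCalls++ }

// routerOutput is a simplified model, not the actual CockroachDB type.
type routerOutput struct {
	mu     sync.Mutex
	queue  *spillingQueue
	closed bool
}

// closeLocked releases the queue exactly once; later calls are no-ops.
// The caller must hold o.mu.
func (o *routerOutput) closeLocked() {
	if o.closed {
		return
	}
	o.closed = true
	o.queue.Close()
}

// addBatch closes the output only when it receives the zero-length batch
// that signals the end of the input.
func (o *routerOutput) addBatch(length int) {
	o.mu.Lock()
	defer o.mu.Unlock()
	if length == 0 {
		o.closeLocked()
	}
}

// DrainMeta is called once the consumer wants no more data, so closing here
// covers the case where the output was never fully exhausted.
func (o *routerOutput) DrainMeta() {
	o.mu.Lock()
	defer o.mu.Unlock()
	o.closeLocked()
}

func main() {
	q := &spillingQueue{}
	o := &routerOutput{queue: q}
	o.addBatch(1024) // non-empty batch: the output stays open
	o.DrainMeta()    // consumer is satisfied early
	o.DrainMeta()    // repeated close is a no-op
	fmt.Println("queue close calls:", q.closeCalls)
}
```

Because the close is guarded by a flag, it does not matter whether `DrainMeta`, a zero-length batch, or both reach the output, or in which order.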

Not closing the spilling queue can leak a file descriptor until the
node restarts. Although the temporary directory is deleted during flow
cleanup, the bug also leaks disk space, which is likewise only "fixed"
by a node restart.
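
One plausible mechanism for the disk space leak, which the PR does not spell out and is assumed here, is that on Unix-like systems a deleted file keeps its storage for as long as some process holds it open. The sketch below uses only the Go standard library; the function name, directory prefix, and file name are hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// sizeAfterDirRemoval creates a file in a fresh temporary directory, deletes
// the directory while the file is still open, writes through the open handle,
// and returns the size that the handle reports afterwards.
// Assumes Unix-like semantics; removing an open file may fail on Windows.
func sizeAfterDirRemoval(payload []byte) (int64, error) {
	dir, err := os.MkdirTemp("", "spill-demo")
	if err != nil {
		return 0, err
	}
	f, err := os.Create(filepath.Join(dir, "queue.data"))
	if err != nil {
		os.RemoveAll(dir)
		return 0, err
	}
	// Closing the handle is what finally lets the OS reclaim the space.
	defer f.Close()

	// Simulates the flow cleanup deleting the temporary directory.
	if err := os.RemoveAll(dir); err != nil {
		return 0, err
	}
	// The file no longer has a path, but the open handle still accepts data.
	if _, err := f.Write(payload); err != nil {
		return 0, err
	}
	info, err := f.Stat()
	if err != nil {
		return 0, err
	}
	return info.Size(), nil
}

func main() {
	n, err := sizeAfterDirRemoval([]byte("spilled batch"))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("bytes still held after directory removal:", n)
}
```

In this sketch the `defer f.Close()` plays the role of the missing `closeLocked` call: without it, the space stays allocated until the process exits.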

Fixes: #81490.

Release note: None


Release justification: bug fix.

@blathers-crl blathers-crl bot force-pushed the blathers/backport-release-21.2-81491 branch from 131794d to 39944c3 Compare May 19, 2022 21:55

blathers-crl bot commented May 19, 2022

Thanks for opening a backport.

Please check the backport criteria before merging:

  • Patches should only be created for serious issues or test-only changes.
  • Patches should not break backwards-compatibility.
  • Patches should change as little code as possible.
  • Patches should not change on-disk formats or node communication protocols.
  • Patches should not add new functionality.
  • Patches must not add, edit, or otherwise modify cluster versions; or add version gates.
If some of the basic criteria cannot be satisfied, ensure that the exceptional criteria are satisfied within.
  • There is a high priority need for the functionality that cannot wait until the next release and is difficult to address in another way.
  • The new functionality is additive-only and only runs for clusters which have specifically “opted in” to it (e.g. by a cluster setting).
  • New code is protected by a conditional check that is trivial to verify and ensures that it only runs for opt-in clusters.
  • The PM and TL on the team that owns the changed code have signed off that the change obeys the above rules.

Add a brief release justification to the body of your PR to justify this backport.

Some other things to consider:

  • What did we do to ensure that a user that doesn’t know & care about this backport, has no idea that it happened?
  • Will this work in a cluster of mixed patch versions? Did we test that?
  • If a user upgrades a patch version, uses this feature, and then downgrades, what happens?

@blathers-crl blathers-crl bot requested review from cucaroach and michae2 May 19, 2022 21:56
@blathers-crl blathers-crl bot added the blathers-backport (This is a backport that Blathers created automatically.) and O-robot (Originated from a bot.) labels May 19, 2022

@michae2 michae2 left a comment

:lgtm:

Reviewed 1 of 1 files at r1, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @cucaroach)

@yuzefovich yuzefovich merged commit 4e40ba1 into release-21.2 May 22, 2022
@yuzefovich yuzefovich deleted the blathers/backport-release-21.2-81491 branch May 22, 2022 19:17