release-24.2.3-rc: changefeedccl: fix memory leak in cloud storage sink with fast gzip #130625

rharding6373
Collaborator

Backport 3/3 commits from #130204.

/cc @cockroachdb/release


When using the cloud storage sink with fast gzip and async flush
enabled, changefeeds could leak memory from the pgzip library if a write
error to the sink occurred. This was due to a race condition when
flushing, if the goroutine calling Flush cleared the files before the
async flusher had cleaned up the compression codec and received the
error from the sink.

This fix clears the files after waiting for the async flusher to finish
flushing the files, so that if an error occurs the files can be closed
when the sink is closed.

See individual commits for more info.

Co-authored by: wenyihu6

Epic: none
Fixes: #129947

Release note(bug fix): Fixes a potential memory leak in changefeeds using a
cloud storage sink. The memory leak could occur if both
changefeed.fast_gzip.enabled and
changefeed.cloudstorage.async_flush.enabled are true and the changefeed
received an error while attempting to write to the cloud storage sink.

Release justification: Fixes a bug that could cause OOMs in changefeed cloud storage sinks.

This commit adds a new changefeed testing knob, AsyncFlushSync, which
can be used to introduce a synchronization point between goroutines
during an async flush. It's currently only used in the cloud storage
sink.

Epic: none

Release note: none

Adds a test that reproduces a memory leak from pgzip, the library used
for fast gzip compression for changefeeds using cloud storage sinks. The
leak was caused by a race condition between Flush/flushTopicVersions and
the async flusher: if Flush clears the files before the async flusher
closes the compression codec as part of flushing the files, and the
flush returns an error, the compression codec is never closed
properly. This test uses the AsyncFlushSync testing knob to introduce
synchronization points between these two goroutines to trigger the
regression.

Co-authored by: wenyihu6

Epic: none

Release note: none
@rharding6373 rharding6373 requested a review from a team as a code owner September 12, 2024 22:39

blathers-crl bot commented Sep 12, 2024

Thanks for opening a backport.

Please check the backport criteria before merging:

  • Backports should only be created for serious issues or test-only changes.
  • Backports should not break backwards-compatibility.
  • Backports should change as little code as possible.
  • Backports should not change on-disk formats or node communication protocols.
  • Backports should not add new functionality (except as defined here).
  • Backports must not add, edit, or otherwise modify cluster versions; or add version gates.
  • All backports must be reviewed by the owning area's TL. For more information as to how that review should be conducted, please consult the backport policy.
If your backport adds new functionality, please ensure that the following additional criteria are satisfied:
  • There is a high priority need for the functionality that cannot wait until the next release and is difficult to address in another way.
  • The new functionality is additive-only and only runs for clusters which have specifically “opted in” to it (e.g. by a cluster setting).
  • New code is protected by a conditional check that is trivial to verify and ensures that it only runs for opt-in clusters. State changes must be further protected such that nodes running old binaries will not be negatively impacted by the new state (with a mixed version test added).
  • The PM and TL on the team that owns the changed code have signed off that the change obeys the above rules.
  • Your backport must be accompanied by a post to the appropriate Slack channel (#db-backports-point-releases or #db-backports-XX-X-release) for awareness and discussion.

Also, please add a brief release justification to the body of your PR to justify this backport.

@blathers-crl blathers-crl bot added the backport Label PR's that are backports to older release branches label Sep 12, 2024
@cockroach-teamcity
Member

This change is Reviewable

@rharding6373 rharding6373 force-pushed the backport24.2.3-rc-130204 branch from b93ae37 to 0be2878 Compare September 20, 2024 17:40
When using the cloud storage sink with fast gzip and async flush
enabled, changefeeds could leak memory from the pgzip library if a write
error to the sink occurred. This was due to a race condition when
flushing, if the goroutine initiating the flush cleared the files before
the async flusher had cleaned up the compression codec and received the
error from the sink.

This fix clears the files after waiting for the async flusher to finish
flushing the files, so that if an error occurs the files can be closed
when the sink is closed.

Co-authored by: wenyihu6

Epic: none
Fixes: cockroachdb#129947

Release note(bug fix): Fixes a potential memory leak in changefeeds using
a cloud storage sink. The memory leak could occur if both
changefeed.fast_gzip.enabled and
changefeed.cloudstorage.async_flush.enabled are true and the changefeed
received an error while attempting to write to the cloud storage sink.
@rharding6373 rharding6373 force-pushed the backport24.2.3-rc-130204 branch from 0be2878 to 1ef47a9 Compare September 20, 2024 17:48
Contributor

@wenyihu6 wenyihu6 left a comment

preemptive lgtm

@rharding6373
Collaborator Author

TFTR! I confirmed that the data race seen in tests is pre-existing, see #130651 (comment). I added a skip race to the backport.

@rharding6373 rharding6373 merged commit 4764661 into cockroachdb:release-24.2.3-rc Sep 20, 2024
19 of 20 checks passed
@rharding6373 rharding6373 deleted the backport24.2.3-rc-130204 branch September 20, 2024 21:33