Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cdc/cloudstorage: potential memory leak in cloud storage #129947

Closed
wenyihu6 opened this issue Aug 30, 2024 · 4 comments · Fixed by #130204
Closed

cdc/cloudstorage: potential memory leak in cloud storage #129947

wenyihu6 opened this issue Aug 30, 2024 · 4 comments · Fixed by #130204
Assignees
Labels
A-cdc Change Data Capture branch-release-23.1 Used to mark GA and release blockers, technical advisories, and bugs for 23.1 branch-release-23.2 Used to mark GA and release blockers, technical advisories, and bugs for 23.2 branch-release-23.2.12-rc branch-release-24.1 Used to mark GA and release blockers, technical advisories, and bugs for 24.1 branch-release-24.1.5-rc branch-release-24.2 Used to mark GA and release blockers, technical advisories, and bugs for 24.2 branch-release-24.2.2-rc C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. O-support Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs T-cdc

Comments

@wenyihu6
Copy link
Contributor

wenyihu6 commented Aug 30, 2024

Describe the problem

We clear s.files right after we call s.flushFile on every files (no errors returned yet since we were able to send all files into the asyncflusherCh)

s.files.Ascend(func(i btree.Item) (wantMore bool) {
err = s.flushFile(ctx, i.(*cloudStorageSinkFile))
return err == nil
})
if err != nil {
return err
}
s.files.Clear(true /* addNodesToFreeList */)
. We then clear all files s.Clear() since no error returned
case s.asyncFlushCh <- flushRequest{file: file, dest: dest}:
return nil
. But this is async_flush and it is possible that they later return an error
if err != nil {
log.Errorf(ctx, "error flushing file to storage: %s", err)
s.asyncFlushErr = err
return err
}
. We then left with some of the files uncleaned. And s.Close() were not able to clean the files properly since files were removed in s.Clear() already.

See more in https://github.com/cockroachlabs/support/issues/3068.
Repro: #129935

Jira issue: CRDB-41791

@wenyihu6 wenyihu6 added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Aug 30, 2024
Copy link

blathers-crl bot commented Aug 30, 2024

Hi @wenyihu6, please add branch-* labels to identify which branch(es) this C-bug affects.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@wenyihu6 wenyihu6 changed the title cdc/cloudstorage: cdc/cloudstorage: potential memory leak in cloud storage Aug 30, 2024
@rharding6373 rharding6373 added A-cdc Change Data Capture T-cdc labels Aug 30, 2024
Copy link

blathers-crl bot commented Aug 30, 2024

cc @cockroachdb/cdc

@rharding6373 rharding6373 added the O-support Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs label Aug 30, 2024
@rharding6373 rharding6373 added branch-release-23.1 Used to mark GA and release blockers, technical advisories, and bugs for 23.1 branch-release-24.1 Used to mark GA and release blockers, technical advisories, and bugs for 24.1 branch-release-23.2 Used to mark GA and release blockers, technical advisories, and bugs for 23.2 branch-release-24.2 Used to mark GA and release blockers, technical advisories, and bugs for 24.2 labels Sep 3, 2024
@rharding6373 rharding6373 assigned rharding6373 and unassigned wenyihu6 Sep 4, 2024
rharding6373 added a commit to rharding6373/cockroach that referenced this issue Sep 6, 2024
When using the cloud storage sink with fast gzip and async flush
enabled, changefeeds could leak memory from the pgzip library if a write
error to the sink occurred. This was due to a race condition when
flushing, if the goroutine calling Flush cleared the files before the
async flusher had cleaned up the compression codec and received the
error from the sink.

This fix clears the files after waiting for the async flusher to finish
flushing the files, so that if an error occurs the files can be closed
when the sink is closed.

Co-authored by: wenyihu6

Epic: none
Fixes: cockroachdb#129947

Release note(bug fix): Fixes a potential memory leak in changefeeds using a
cloud storage sink. The memory leak could occur if both
changefeed.fast_gzip.enabled and
changefeed.cloudstorage.async_flush.enabled are true and the changefeed
received an error while attempting to write to the cloud storage sink.
rharding6373 added a commit to rharding6373/cockroach that referenced this issue Sep 6, 2024
When using the cloud storage sink with fast gzip and async flush
enabled, changefeeds could leak memory from the pgzip library if a write
error to the sink occurred. This was due to a race condition when
flushing, if the goroutine calling Flush cleared the files before the
async flusher had cleaned up the compression codec and received the
error from the sink.

This fix clears the files after waiting for the async flusher to finish
flushing the files, so that if an error occurs the files can be closed
when the sink is closed.

Co-authored by: wenyihu6

Epic: none
Fixes: cockroachdb#129947

Release note(bug fix): Fixes a potential memory leak in changefeeds using a
cloud storage sink. The memory leak could occur if both
changefeed.fast_gzip.enabled and
changefeed.cloudstorage.async_flush.enabled are true and the changefeed
received an error while attempting to write to the cloud storage sink.
rharding6373 added a commit to rharding6373/cockroach that referenced this issue Sep 11, 2024
When using the cloud storage sink with fast gzip and async flush
enabled, changefeeds could leak memory from the pgzip library if a write
error to the sink occurred. This was due to a race condition when
flushing, if the goroutine initiating the flush cleared the files before
the async flusher had cleaned up the compression codec and received the
error from the sink.

This fix clears the files after waiting for the async flusher to finish
flushing the files, so that if an error occurs the files can be closed
when the sink is closed.

Co-authored by: wenyihu6

Epic: none
Fixes: cockroachdb#129947

Release note(bug fix): Fixes a potential memory leak in changefeeds using
a cloud storage sink. The memory leak could occur if both
changefeed.fast_gzip.enabled and
changefeed.cloudstorage.async_flush.enabled are true and the changefeed
received an error while attempting to write to the cloud storage sink.
rharding6373 added a commit to rharding6373/cockroach that referenced this issue Sep 12, 2024
When using the cloud storage sink with fast gzip and async flush
enabled, changefeeds could leak memory from the pgzip library if a write
error to the sink occurred. This was due to a race condition when
flushing, if the goroutine initiating the flush cleared the files before
the async flusher had cleaned up the compression codec and received the
error from the sink.

This fix clears the files after waiting for the async flusher to finish
flushing the files, so that if an error occurs the files can be closed
when the sink is closed.

Co-authored by: wenyihu6

Epic: none
Fixes: cockroachdb#129947

Release note(bug fix): Fixes a potential memory leak in changefeeds using
a cloud storage sink. The memory leak could occur if both
changefeed.fast_gzip.enabled and
changefeed.cloudstorage.async_flush.enabled are true and the changefeed
received an error while attempting to write to the cloud storage sink.
craig bot pushed a commit that referenced this issue Sep 12, 2024
130204: changefeedccl: fix memory leak in cloud storage sink with fast gzip r=wenyihu6,msbutler a=rharding6373

When using the cloud storage sink with fast gzip and async flush
enabled, changefeeds could leak memory from the pgzip library if a write
error to the sink occurred. This was due to a race condition when
flushing, if the goroutine calling Flush cleared the files before the
async flusher had cleaned up the compression codec and received the
error from the sink.
    
This fix clears the files after waiting for the async flusher to finish
flushing the files, so that if an error occurs the files can be closed
when the sink is closed.

See individual commits for more info.
    
Co-authored by: wenyihu6
    
Epic: none
Fixes: #129947
    
Release note(bug fix): Fixes a potential memory leak in changefeeds using a
cloud storage sink. The memory leak could occur if both
changefeed.fast_gzip.enabled and
changefeed.cloudstorage.async_flush.enabled are true and the changefeed
received an error while attempting to write to the cloud storage sink.


Co-authored-by: rharding6373 <[email protected]>
@craig craig bot closed this as completed in fbf19a0 Sep 12, 2024
Copy link

blathers-crl bot commented Sep 12, 2024

Based on the specified backports for linked PR #130204, I applied the following new label(s) to this issue: branch-release-24.1.5-rc, branch-release-24.2.2-rc. Please adjust the labels as needed to match the branches actually affected by this issue, including adding any known older branches.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

blathers-crl bot pushed a commit that referenced this issue Sep 12, 2024
When using the cloud storage sink with fast gzip and async flush
enabled, changefeeds could leak memory from the pgzip library if a write
error to the sink occurred. This was due to a race condition when
flushing, if the goroutine initiating the flush cleared the files before
the async flusher had cleaned up the compression codec and received the
error from the sink.

This fix clears the files after waiting for the async flusher to finish
flushing the files, so that if an error occurs the files can be closed
when the sink is closed.

Co-authored by: wenyihu6

Epic: none
Fixes: #129947

Release note(bug fix): Fixes a potential memory leak in changefeeds using
a cloud storage sink. The memory leak could occur if both
changefeed.fast_gzip.enabled and
changefeed.cloudstorage.async_flush.enabled are true and the changefeed
received an error while attempting to write to the cloud storage sink.
Copy link

blathers-crl bot commented Sep 12, 2024

Based on the specified backports for linked PR #130204, I applied the following new label(s) to this issue: branch-release-23.2.12-rc. Please adjust the labels as needed to match the branches actually affected by this issue, including adding any known older branches.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

rharding6373 added a commit to rharding6373/cockroach that referenced this issue Sep 20, 2024
When using the cloud storage sink with fast gzip and async flush
enabled, changefeeds could leak memory from the pgzip library if a write
error to the sink occurred. This was due to a race condition when
flushing, if the goroutine initiating the flush cleared the files before
the async flusher had cleaned up the compression codec and received the
error from the sink.

This fix clears the files after waiting for the async flusher to finish
flushing the files, so that if an error occurs the files can be closed
when the sink is closed.

Co-authored by: wenyihu6

Epic: none
Fixes: cockroachdb#129947

Release note(bug fix): Fixes a potential memory leak in changefeeds using
a cloud storage sink. The memory leak could occur if both
changefeed.fast_gzip.enabled and
changefeed.cloudstorage.async_flush.enabled are true and the changefeed
received an error while attempting to write to the cloud storage sink.
rharding6373 added a commit to rharding6373/cockroach that referenced this issue Sep 20, 2024
When using the cloud storage sink with fast gzip and async flush
enabled, changefeeds could leak memory from the pgzip library if a write
error to the sink occurred. This was due to a race condition when
flushing, if the goroutine initiating the flush cleared the files before
the async flusher had cleaned up the compression codec and received the
error from the sink.

This fix clears the files after waiting for the async flusher to finish
flushing the files, so that if an error occurs the files can be closed
when the sink is closed.

Co-authored by: wenyihu6

Epic: none
Fixes: cockroachdb#129947

Release note(bug fix): Fixes a potential memory leak in changefeeds using
a cloud storage sink. The memory leak could occur if both
changefeed.fast_gzip.enabled and
changefeed.cloudstorage.async_flush.enabled are true and the changefeed
received an error while attempting to write to the cloud storage sink.
rharding6373 added a commit to rharding6373/cockroach that referenced this issue Sep 20, 2024
When using the cloud storage sink with fast gzip and async flush
enabled, changefeeds could leak memory from the pgzip library if a write
error to the sink occurred. This was due to a race condition when
flushing, if the goroutine initiating the flush cleared the files before
the async flusher had cleaned up the compression codec and received the
error from the sink.

This fix clears the files after waiting for the async flusher to finish
flushing the files, so that if an error occurs the files can be closed
when the sink is closed.

Co-authored by: wenyihu6

Epic: none
Fixes: cockroachdb#129947

Release note(bug fix): Fixes a potential memory leak in changefeeds using
a cloud storage sink. The memory leak could occur if both
changefeed.fast_gzip.enabled and
changefeed.cloudstorage.async_flush.enabled are true and the changefeed
received an error while attempting to write to the cloud storage sink.
rharding6373 added a commit to rharding6373/cockroach that referenced this issue Sep 20, 2024
When using the cloud storage sink with fast gzip and async flush
enabled, changefeeds could leak memory from the pgzip library if a write
error to the sink occurred. This was due to a race condition when
flushing, if the goroutine initiating the flush cleared the files before
the async flusher had cleaned up the compression codec and received the
error from the sink.

This fix clears the files after waiting for the async flusher to finish
flushing the files, so that if an error occurs the files can be closed
when the sink is closed.

Co-authored by: wenyihu6

Epic: none
Fixes: cockroachdb#129947

Release note(bug fix): Fixes a potential memory leak in changefeeds using
a cloud storage sink. The memory leak could occur if both
changefeed.fast_gzip.enabled and
changefeed.cloudstorage.async_flush.enabled are true and the changefeed
received an error while attempting to write to the cloud storage sink.
rharding6373 added a commit to rharding6373/cockroach that referenced this issue Sep 20, 2024
When using the cloud storage sink with fast gzip and async flush
enabled, changefeeds could leak memory from the pgzip library if a write
error to the sink occurred. This was due to a race condition when
flushing, if the goroutine initiating the flush cleared the files before
the async flusher had cleaned up the compression codec and received the
error from the sink.

This fix clears the files after waiting for the async flusher to finish
flushing the files, so that if an error occurs the files can be closed
when the sink is closed.

Co-authored by: wenyihu6

Epic: none
Fixes: cockroachdb#129947

Release note(bug fix): Fixes a potential memory leak in changefeeds using
a cloud storage sink. The memory leak could occur if both
changefeed.fast_gzip.enabled and
changefeed.cloudstorage.async_flush.enabled are true and the changefeed
received an error while attempting to write to the cloud storage sink.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-cdc Change Data Capture branch-release-23.1 Used to mark GA and release blockers, technical advisories, and bugs for 23.1 branch-release-23.2 Used to mark GA and release blockers, technical advisories, and bugs for 23.2 branch-release-23.2.12-rc branch-release-24.1 Used to mark GA and release blockers, technical advisories, and bugs for 24.1 branch-release-24.1.5-rc branch-release-24.2 Used to mark GA and release blockers, technical advisories, and bugs for 24.2 branch-release-24.2.2-rc C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. O-support Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs T-cdc
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants