CDC as Export / Job Never Completes #86828
Logs showed changefeed processors waiting hours for their memory quota, a high count of ranges whose backfill never completed, and some messages still in the buffer. The issue goes away if the memory quota is increased.
cc @cockroachdb/cdc
I think I know the underlying root cause; I suspect the issue is pretty serious, and most likely …
Ensure that out-of-quota events are not lost and are propagated to the consumer when necessary. Prior to this change, it was possible for an out-of-quota notification to be "lost" because the "blocked" bit would be cleared when an event was enqueued. Instead of relying on a boolean bit, we now keep track of the number of consumers currently blocked, and issue a flush request if there are blocked consumers with zero events currently queued.

Fixes cockroachdb#86828

Release justification: bug fix
Release note: None
86734: kvserver: avoid race in preSplitApply r=erikgrinaker a=tbg

When `splitPreApply` has to handle a right-hand side replica that is newer than the split, the split needs to throw away the "snapshot" it was going to install into the right-hand side. It does so by deleting all data in the RHS and replacing the raft state bits. It was using the RHS replica's stateloader to that effect, but didn't actually hold the raftMu to make this safe. The mutex acquisition has been added.

Fixes #86669.
Fixes #86734.

No release note since the bug shouldn't be visible to end users (it is very rare in the first place, and having a noticeable effect even rarer), and if it did, it would likely look like unspecific Raft corruption that would be hard to trace back to this race.

Release justification: this will merge on master only after branch cut.
Release note: None

87385: roachtest: update a comment r=renatolabs a=tbg

Release justification: changes a comment in testing code.
Release note: None

87464: kvevent: Ensure out of quota events correctly handled r=miretskiy a=miretskiy

Ensure that out-of-quota events are not lost and are propagated to the consumer when necessary. Prior to this change, it was possible for an out-of-quota notification to be "lost" because the "blocked" bit would be cleared when an event was enqueued. Instead of relying on a boolean bit, we now keep track of the number of consumers currently blocked, and issue a flush request if there are blocked consumers with zero events currently queued.

Fixes #86828

Release justification: bug fix
Release note: None

87511: authors: add angeladietz to authors r=angeladietz a=angeladietz

Release note: None
Release justification: non-production code change

Co-authored-by: Tobias Grieger <[email protected]>
Co-authored-by: Yevgeniy Miretskiy <[email protected]>
Co-authored-by: Angela Dietz <[email protected]>
Describe the problem
Attempting to perform CDC as an EXPORT with format=csv, initial_scan_only, compression=gzip. The job is created, output is produced in the S3 bucket, but the job never completes.
To Reproduce
Create a table:
Imported data
Created a CDC Change feed:
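The changefeed statement itself did not survive in this report. Based on the options described in the issue (format=csv, initial_scan_only, compression=gzip), it presumably looked something like the following; the table name and S3 URI are placeholders, not the reporter's actual values:

```sql
-- Hypothetical reconstruction of the reporter's changefeed; the table
-- name and bucket URI below are placeholders.
CREATE CHANGEFEED FOR TABLE my_table
  INTO 's3://my-bucket/export?AUTH=implicit'
  WITH format = 'csv', initial_scan_only, compression = 'gzip';
```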
The change feed runs successfully and produces the S3 output:
The size and number of files in the S3 bucket have stopped changing, which indicates to me that the export of data is complete.
Size of S3 Exports: 205.2MB
Number of Files: 89
Size of table in CRDB: 83.3GiB
Number of Ranges: 692
I have attempted the process several times, and the number of files and the size of the bucket are similar between executions. I am able to view the data in the CSV files in S3 (using S3 Query) and the data is valid.
Expected behavior
I expect the "CDC export" to write the contents of the table, in CSV format, to my S3 bucket. Since the `initial_scan_only` option was supplied, I expect the CDC job to complete when the data has been output to the S3 bucket.

Environment:
cockroach sql
A debug zip is being uploaded to Google Drive.
Jira issue: CRDB-18949
Epic: CRDB-19123