Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

streamingccl: don't complete producer job on cutover #117515

Merged
merged 1 commit into from
Jan 9, 2024

Conversation

msbutler
Copy link
Collaborator

@msbutler msbutler commented Jan 8, 2024

Previously on cutover, the producer job would complete, removing the producer
side protected timestamp. To ensure a smooth fast failback to any timestamp
greater or equal to the cutover timestamp, however, the original producer side
should keep the pts around.

This patch retains the producer side timestamp by delaying the completion of
the producer job until stream_replication.job_liveness_timeout (default 3
days) elapses after the final heartbeat during the replication job. A future PR
will deprecate this setting and allow users to directly set this producer side
retention period.

Epic: none

Release note: none

Copy link

blathers-crl bot commented Jan 8, 2024

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@msbutler msbutler force-pushed the butler-pts-c2c branch 9 times, most recently from 8c3a7f2 to 4440ff8 Compare January 9, 2024 00:06
Previously on cutover, the producer job would complete, removing the producer
side protected timestamp. To ensure a smooth fast failback to any timestamp
greater or equal to the cutover timestamp, however, the original producer side
should keep the pts around.

This patch retains the producer side timestamp by delaying the completion of
the producer job until `stream_replication.job_liveness_timeout` (default 3
days) elapses after the final heartbeat during the replication job. A future PR
will deprecate this setting and allow users to directly set this producer side
retention period.

Epic: none

Release note: none
@msbutler msbutler marked this pull request as ready for review January 9, 2024 14:08
@msbutler msbutler requested a review from a team as a code owner January 9, 2024 14:08
@msbutler msbutler requested review from adityamaru, dt and stevendanna and removed request for a team and adityamaru January 9, 2024 14:08
@dt
Copy link
Member

dt commented Jan 9, 2024

default 3 days

This seems like a long time to me to keep accumulating history on a source cluster if the dev/test/other cluster decides it is done replicating and goes off on its way (it also seems like a long time if it just vanishes, but I don't need to litigate the default here).

Should we, in this success case, move the expiration to t+12h instead?

@msbutler
Copy link
Collaborator Author

msbutler commented Jan 9, 2024

The timeout is currently set by the cluster setting stream_replication.job_liveness_timeout. I was planning to punt manipulating the default timeout once i deprecated this setting in point 3 described here: https://cockroachlabs.slack.com/archives/C03JCUUSCD6/p1704745263201289?thread_ts=1704733221.079619&cid=C03JCUUSCD6

@msbutler
Copy link
Collaborator Author

msbutler commented Jan 9, 2024

TFTR!

bors r=stevendanna

@craig
Copy link
Contributor

craig bot commented Jan 9, 2024

Build succeeded:

@craig craig bot merged commit 2c3b07e into cockroachdb:master Jan 9, 2024
8 of 10 checks passed
msbutler added a commit to msbutler/cockroach that referenced this pull request Jan 11, 2024
A previous pr cockroachdb#117515 lowered the producer job liveness timout in the test to
allow the producer job to succeed quickly on completion. The PR lowered it to
100ms which is too low, causing the job to fail on liveness timeouts. This
patch bumps the timout to 1s, preventing the liveness timeout.

Fixes cockroachdb#117605

Release note: none
craig bot pushed a commit that referenced this pull request Jan 11, 2024
117258: sql: use high priority for populating RoleMemberCache r=Xiang-Gu a=Xiang-Gu

Previously, when the RoleMemberCache is invalid, it launches a new txn in a singleflight to read from `system.role_members` table to populate the cache. If, however, the original txn has previously laid a write intent on the same system table, then we end up having a deadlock: original txn waits for this new txn; this new txn waits for original txn. 

Fixes #117144
Release note (bug fix): Fixed a bug where concurrent GRANTs can cause deadlocks.

117571: build,bazci: add `pkg/build/engflow` package r=rail a=rickystewart

... and extract logic for extracting test results into a helper function.

Release note: None
Epic: CRDB-8308

117687: streamingccl: deflake TestPartitionedStreamClient r=dt a=msbutler

A previous pr #117515 lowered the producer job liveness timout in the test to allow the producer job to succeed quickly on completion. The PR lowered it to 100ms which is too low, causing the job to fail on liveness timeouts. This patch bumps the timout to 1s, preventing the liveness timeout.

Fixes #117605

Release note: none

117691: roachtest: pull tpc-e image from GAR r=srosenberg,renatolabs a=rail

Previously, we pulled the tpc-e image from Docker Hub. Now that we move most of our CI images to GAR, this image will be pulled from GAR as well

Epic: RE-539
Release note: None

Co-authored-by: Xiang Gu <[email protected]>
Co-authored-by: Ricky Stewart <[email protected]>
Co-authored-by: Michael Butler <[email protected]>
Co-authored-by: Rail Aliiev <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants