Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor: enable time travel by default #18854

Merged
merged 8 commits into from
Oct 11, 2024
Merged

Conversation

zwang28
Copy link
Contributor

@zwang28 zwang28 commented Oct 11, 2024

I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.

What's changed and what's your intention?

This PR modifies several configurations to enable time travel by default, ensuring regular batch query remains functional even if there is a missing recent version in compute node.

  • min_sst_retention_time_sec is reduced from 86400 to 10800 (3 hour). Since after enabling time travel, the deletion of SSTs is entirely dependent on full GC. A smaller min_sst_retention_time_sec permits the earlier deletion of SST. However it cannot be too small, because it also determines the maximum permitted barrier latency and compaction latency.
  • full_gc_interval_sec is reduced from 86400 to 600, aiming to trigger full gc more frequently.
  • The system param time_travel_retention_ms is increased from 0 to 600000 (10 minutes).
    • Since min_sst_retention_time_sec will keep recent SSTs anyway, time_travel_retention_ms can actually be as large as min_sst_retention_time_sec without introducing object store overhead. However it will introduce meta store overhead.
    • Later PR will limit the minimum value of min_sst_retention_time_sec.

A new configuration time_travel_version_cache_capacity is added.

Checklist

  • I have written necessary rustdoc comments
  • I have added necessary unit tests and integration tests
  • I have added test labels as necessary. See details.
  • I have added fuzzing tests or opened an issue to track them. (Optional, recommended for new SQL features Sqlsmith: Sql feature generation #7934).
  • My PR contains breaking changes. (If it deprecates some features, please create a tracking issue to remove them in the future).
  • All checks passed in ./risedev check (or alias, ./risedev c)
  • My PR changes performance-critical code. (Please run macro/micro-benchmarks and show the results.)
  • My PR contains critical fixes that are necessary to be merged into the latest release. (Please check out the details)

Documentation

  • My PR needs documentation updates. (Please use the Release note section below to summarize the impact on users)

Release note

If this PR includes changes that directly affect users or other significant modifications relevant to the community, kindly draft a release note to provide a concise summary of these changes. Please prioritize highlighting the impact these changes will have on users.

@zwang28
Copy link
Contributor Author

zwang28 commented Oct 11, 2024

This PR can only be merged after etcd is deprecated.

@hzxa21
Copy link
Collaborator

hzxa21 commented Oct 11, 2024

With min_sst_retention_time_sec=3600, the maximum permitted barrier latency is 1 hour.

Given that SST sync is triggered after collecting barriers, the main issue here is for spilled SSTs, which can happen at any time before barrier collection. In other words, if a spilling happens 1hour before the barrier is completed, commit_epoch will fail. Otherwise if the streaming pipeline is stuck by compaction lag or backpressure with no spilling happens, commit_epoch can still succeed even if barrier latency is high.

However, we do see in production that when join amp is high for some use cases, spilling can happen and barrier latency can go higher than 1 hour occasionally. Prior to this PR, the occasional spilling won't be an issue but after this PR, I believe their clusters will enter a recovery loop, which makes me a little bit concerned.

Brainstorm idea: now that we have bidi-streaming RPC for barrier injection/collection and the spilled SST information is maintained in a single place in UploaderData. I am thinking we can simply mark the spilled SST illegible for GC in-memory (reset on recovery). Not sure whether there are other complexities here.

@zwang28
Copy link
Contributor Author

zwang28 commented Oct 11, 2024

Brainstorm idea: now that we have bidi-streaming RPC for barrier injection/collection and the spilled SST information is maintained in a single place in UploaderData. I am thinking we can simply mark the spilled SST illegible for GC in-memory (reset on recovery)

Sounds good to me. The marked SSTs are filtered out for both GC and timestamp verification of commit_epoch. I'll work on it.

Before that is ready, what about increasing min_sst_retention_time_sec to 3 hours? Note that it can still be modified on demand for exisiting clusters, when larger barrier latency is expected.

@hzxa21
Copy link
Collaborator

hzxa21 commented Oct 11, 2024

Brainstorm idea: now that we have bidi-streaming RPC for barrier injection/collection and the spilled SST information is maintained in a single place in UploaderData. I am thinking we can simply mark the spilled SST illegible for GC in-memory (reset on recovery)

Sounds good to me. The marked SSTs are filtered out for both GC and timestamp verification of commit_epoch. I'll work on it.

Before that is ready, what about increasing min_sst_retention_time_sec to 3 hours? Note that it can still be modified on demand for exisiting clusters, when larger barrier latency is expected.

SGTM

@hzxa21
Copy link
Collaborator

hzxa21 commented Oct 11, 2024

This PR can only be merged after etcd is deprecated.

The time_travel_enabled check will ignore time_travel_retention_ms if the meta backend is etcd. Does it mean that this PR should not be blocked by etcd deprecation?

Copy link

gitguardian bot commented Oct 11, 2024

⚠️ GitGuardian has uncovered 2 secrets following the scan of your pull request.

Please consider investigating the findings and remediating the incidents. Failure to do so may lead to compromising the associated services or software components.

🔎 Detected hardcoded secrets in your pull request
GitGuardian id GitGuardian status Secret Commit Filename
9425213 Triggered Generic Password 6940309 e2e_test/source/tvf/postgres_query.slt View secret
9425213 Triggered Generic Password 6940309 e2e_test/source/tvf/postgres_query.slt View secret
🛠 Guidelines to remediate hardcoded secrets
  1. Understand the implications of revoking this secret by investigating where it is used in your code.
  2. Replace and store your secrets safely. Learn here the best practices.
  3. Revoke and rotate these secrets.
  4. If possible, rewrite git history. Rewriting git history is not a trivial act. You might completely break other contributing developers' workflow and you risk accidentally deleting legitimate data.

To avoid such incidents in the future consider


🦉 GitGuardian detects secrets in your source code to help developers and security teams secure the modern development process. You are seeing this because you or someone else with access to this repository has authorized GitGuardian to scan your pull request.

@zwang28 zwang28 requested a review from hzxa21 October 11, 2024 08:04
@zwang28 zwang28 enabled auto-merge October 11, 2024 08:05
@zwang28 zwang28 added this pull request to the merge queue Oct 11, 2024
Merged via the queue into main with commit b1ea9e4 Oct 11, 2024
31 of 33 checks passed
@zwang28 zwang28 deleted the wangzheng/enable_time_travel branch October 11, 2024 09:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants