Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Propose WAL format versioning and change strategy. #40

Open
wants to merge 8 commits into
base: main
Choose a base branch
from
Open

Conversation

bwplotka
Copy link
Member

@bwplotka bwplotka commented Nov 26, 2024

Fixes prometheus/prometheus#15200

Also join #prometheus-wal-dev on Slack for the sync discussion!

bwplotka and others added 2 commits December 2, 2024 10:31
proposals/2024-11-25_changing_wal_format.md Outdated Show resolved Hide resolved
proposals/2024-11-25_changing_wal_format.md Outdated Show resolved Hide resolved
proposals/2024-11-25_changing_wal_format.md Outdated Show resolved Hide resolved
proposals/2024-11-25_changing_wal_format.md Outdated Show resolved Hide resolved
* changes that merge records
* sharding?

### Maintain two WALs (well four, with WBL)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we have separate WAL and WBL ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because WBL is a separate WAL on disk

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a re-statement of the fact. I am asking what the reason behind it is.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question, not sure if there's any efficiency reason (e.g. easier to find OOO records) cc @codesome

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When implementing out-of-order, the WAL had all the out-of-order samples in them (both the ingested and uningested out-of-order samples). So we needed a way to know exactly what out-of-order samples were ingested, so that we can restore only those samples. We could only know if a sample went into a chunk at the time of writing to the chunk (i.e. after writing to WAL), so we introduced WBL to log the out-of-order samples ingested, since it was logged after samples went into the chunk. Data in WBL only goes into the out-of-order head.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could probably be merged together by writing separate OOO records in WAL. But the replay needs to be changed carefully so that things don't break.

proposals/2024-11-25_changing_wal_format.md Outdated Show resolved Hide resolved
proposals/2024-11-25_changing_wal_format.md Outdated Show resolved Hide resolved
proposals/2024-11-25_changing_wal_format.md Outdated Show resolved Hide resolved
proposals/2024-11-25_changing_wal_format.md Outdated Show resolved Hide resolved
@bwplotka
Copy link
Member Author

bwplotka commented Dec 4, 2024

Ok, so from the comments so far and slack seems to me like there is a sentiment to:

  1. Switch to versioning segments, not directories.
  2. Force last LTS to support the new version BEFORE changing the default to write new version; consider backports if the risk is small.

WDYT @krajorama @bboreham should I switch my proposal to those?

bwplotka added a commit to prometheus/prometheus that referenced this pull request Dec 10, 2024
bwplotka added a commit to prometheus/prometheus that referenced this pull request Dec 10, 2024
bwplotka added a commit to prometheus/prometheus that referenced this pull request Dec 10, 2024
@krajorama
Copy link
Member

Ok, so from the comments so far and slack seems to me like there is a sentiment to:

1. Switch to versioning segments, not directories.

Yes, segments already provide ordering and would help us in upgrade->downgrade->upgrade scenarios.
On a practical side, segments don't have a header, so I don't know how we'd do it, but that's another question.

2. Force last LTS to support the new version BEFORE changing the default to write new version; consider backports if the risk is small.

Yes. I think there's a class of user that would very much expect this.

WDYT @krajorama @bboreham should I switch my proposal to those?

@bwplotka
Copy link
Member Author

bwplotka commented Dec 10, 2024

Updated to segment based versioning using filename semantics e.g. 00000001-v2. I started implementation to see how it looks like and it looks promising. Changes are relatively trivial, but have a large surface in the WAL code (~1500 LOC).

Also updated to now ensure LTS release before switching.

This proposal is ready for a second look! cc @bboreham @krajorama @carrieedwards - thanks for the prompt reviews so far!

bwplotka added a commit to prometheus/prometheus that referenced this pull request Dec 11, 2024
bwplotka and others added 3 commits December 11, 2024 13:32
Co-authored-by: Nicolas Takashi <[email protected]>
Signed-off-by: Bartlomiej Plotka <[email protected]>
Signed-off-by: bwplotka <[email protected]>
Copy link
Member

@codesome codesome left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for writing this! I just have a few questions for clarification

Historically, we didn't hit major problems because we were only adding new semantic data (e.g. exemplars, metadata, new native histograms) as new records.
However, these days, we need to add features to existing data (e.g. custom buckets that will replace classic histograms, cleanup of histogram records or created timestamps to samples). Even if we create a new record for those and use it for new samples, any rollback will **lose that information as they appear unknown in the old version**.

For the TSDB changes (see the [context](#context-tsdb-format-changes), tribally, we use an undocumented ["2-fold" migration strategy](#how-two-fold-migration-strategy). However, WAL data is typically significantly smaller, around ~30m worth of samples (time to gather 120 samples for a chunk, for 15s intervals), plus 2h series records in WAL.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
For the TSDB changes (see the [context](#context-tsdb-format-changes), tribally, we use an undocumented ["2-fold" migration strategy](#how-two-fold-migration-strategy). However, WAL data is typically significantly smaller, around ~30m worth of samples (time to gather 120 samples for a chunk, for 15s intervals), plus 2h series records in WAL.
For the TSDB changes (see the [context](#context-tsdb-format-changes)), tribally, we use an undocumented ["2-fold" migration strategy](#how-two-fold-migration-strategy). However, WAL data is typically significantly smaller, around ~30m worth of samples (time to gather 120 samples for a chunk, for 15s intervals), plus 2h series records in WAL.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, WAL data spans up to 3hrs right before head compaction. When m-mapped chunks are all fine (i.e. not corrupted), only then the last 30mins is the relevant bit with the given 15s interval.

There are two reasons for the flag:

* Allows users to get the new features sooner and skip the safety mechanism.
* It simplifies the compatibility guarantees as the flag default mechanism guides users and devs in the rollout and revert procedures e.g:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be good to have a quick static check on startup if the binary supports all the WAL files present on the disk.

* Mentioning Write-Before-Log (WBL) or checkpoints, both uses WAL format internally.
* To reduce the scope we don't mention [memory snapshot format](https://github.com/prometheus/prometheus/blob/fd5ea4e0b590d57df1e2ef41d027f2a4f640a46c/tsdb/docs/format/memory_snapshot.md#L1) for now.

## How
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since you mentioned development velocity as one of the goal, I see going from say v3 to v6 between two LTS releases (say X and Y) is a possibility. Where LTS X supports v1 v2 v3, writes v3, and LTS Y writes v3 and supports reading v1-6, whereas release Y+1 will default to writing v6 according to your example. So if a user wanted to enable any of the v4 v5 v6 between releases X and Y, is it the responsibility of the user to make sure they can rollback the version using the documented WAL versions?

We recommend the [Two-Fold Migration Strategy](#how-two-fold-migration-strategy) with the two important additions:

* A new flag that tells Prometheus what WAL version to write.
* There can be multiple "forward compatible" versions, but the official minimum is **the previous [Long-Term-Support (LTS) release](https://prometheus.io/docs/introduction/release-cycle/)**. The forward compatibility can be optionally backported to the LTS, instead of waiting up to a 1y, depending on the risk.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we backport the forward compatibility to the last LTS, will we also bump the default write version in the upcoming release before the next LTS?

If no: then why backport in the first place.

If yes: what about other "breaking" features that are in the latest releases that is not in the last LTS - you can't really rollback to the last LTS anymore because of that. Maybe this is the first step in the long term goal of all breaking additions can have forward compatibility in the last LTS, but I don't see it practically happening soon (maybe we enjoy this phase until we decide to break some other part of TSDB).

Copy link

@machine424 machine424 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

came across this, thanks for putting this together.
Left some comments.

* [metadata fields are arbitrary labels in metadata record](https://github.com/prometheus/prometheus/blob/e410a215fbe89b67f0e8edef9de25ede503ea4e0/tsdb/record/record.go#L608).

Historically, we didn't hit major problems because we were only adding new semantic data (e.g. exemplars, metadata, new native histograms) as new records.
However, these days, we need to add features to existing data (e.g. custom buckets that will replace classic histograms, cleanup of histogram records or created timestamps to samples). Even if we create a new record for those and use it for new samples, any rollback will **lose that information as they appear unknown in the old version**.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see how we can do better than Even if we create a new record for those and use it for new samples, any rollback will **lose that information as they appear unknown in the old version**, even with the proposed changes, an old Prometheus version cannot do better than treating the new format data/records as unknown. as part of the forward compatibility.
So, in all cases, the old Prometheus version will lose that information as it doesn't know how to interpret it.

* Demotivating for format changes (long feedback loop)
* Harder to communicate what exactly changed in each Prometheus version or even implement backward compatibility?

### Alternative: Use feature flag instead

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reviewed the linked issues and efforts motivating the segments versioning, and I believe all of them can be managed with the two-fold strategy/migration and a feature flag, so maybe I'm missing sth or maybe more concrete scenarios are missing in the proposal.

Unless we intend to change the record's header format (which I think we can also do without having to verson the WAL segments), I think the current format is generic enough to handle all other changes. It's a map of record type/code -> how to handle it, I think we can easily customize that mapping (and improve it if needed) without having to version the segments.

For example, If I were to propose an optimized encoding for samples, I would add a SampleOptimized record type (with its reader/writer) and a --feature-flag=optimized-wal-encoding or sth to trigger reading and then writing in another version. With the two-fold strategy and proper documentation/assistance on potential downgrades [1], as suggested above, and on feature gate deactivation [2], I believe we could roll out and test that SampleOptimized. This approach aligns with how we've proceeded so far, as you mentioned.

While versioning WAL segments provides more visibility into the version in use, what does a v3 segment version really mean to a user? Does it mean native histograms are supported? Does it mean optimized encoding is enabled? I think it would only add confusion.

Also, having such vague versions would create additional confusion and necessitate more synchronization between developers working concurrently on changes that require "incrementing the WAL segment version".


[1] In addition to the documentation, we could implement a built-in safeguard mechanism for downgrades: store the minimum downgrade version on disk and refuse to start up unless the file is manually deleted or another action is taken.
Also, if we want to raise awareness about the "dangers of unsupported downgrading", we may need to reconsider how record.Unknown is handled.

[2] We could leverage this proposal to improve the discoverability of built-in feature gates (I think there is an issue for this somewhere).

Copy link
Member Author

@bwplotka bwplotka Dec 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, fair points. Sounds like you propose a feature flags with feature names, which semantically might be close to "record versioning" instead of a single "segment version" proposed right now. It seems the main difference is organizing features into version umbrella or deliver them in separation. Definitely something to consider, let's start unpacking pros & cons.

While versioning WAL segments provides more visibility into the version in use, what does a v3 segment version really mean to a user? Does it mean native histograms are supported? Does it mean optimized encoding is enabled? I think it would only add confusion. Also, having such vague versions would create additional confusion and necessitate more synchronization between developers working concurrently on changes that require "incrementing the WAL segment version".

Yes. It will means immediately nothing what it changes to user and that's on purpose. This versioning is optimized for telling users about important data compatibility and discovering what version of Prometheus supports what versions of WAL, that's it.

Imagine we change WAL and finally use proper nativehistogram records (instead of 4 different records now) as a wal-nhcleanup feature, then new sample with CT with wal-ctsample feature, finally let's say structured summary in future with wal-summary feature. Now you deploy new Prometheus version that (after a fold) defaults to wal-nhcleanup, then user manually opt-in into wal-ctsample and wal-summary is still opt-out. Let's say user reverts to the previous Prometheus release it was running (e.g. LTS or let's say 3 versions back) and some data are completely missing because of WAL incompatibility. Finding what feature actually is incompatible and what version of Prometheus supports what features is not easy. Talking in segment versions is a bit more consistent and obvious when ls-ing your directory or logging on unknown segment version.

But.. maybe that user experience is not needed. Maybe asking those tricky features to be in LTS version before switching to opt-in would help (I would avoid having to wait for v4.x Prometheus etc), better docs could be added too.

WDYT?

Copy link

@machine424 machine424 Dec 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and some data are completely missing because of WAL incompatibility

As I mentioned earlier, segment versioning isn't a magic solution for this issue. If a user downgrades to a Prometheus version that can't decode v2 segments, the data within those segments will be lost. However, if users follow the recommended downgrade paths, the feature flag approach should be able to handle those scenarios without data loss.

Finding what feature actually is incompatible and what version of Prometheus supports what features is not easy. Talking in segment versions is a bit more consistent and obvious when ls-ing your directory or logging on unknown segment version.

I think adding a new dimension (segment version) to the mix would just add to the confusion here, especially since users will still need to consult documentation of other tools to figure out the differences between segment v3 and v2.

I think well-documented feature flags and a well-documented compatibility matrix should be enough. The most important thing, as you mentioned, is to provide a well-tested fallback (through the two-fold strategy).

So, what I think we could still do through this proposal is:

  • Officially document the two-fold strategy, with examples and schemas ;)
  • Start requiring such "invasive/critical" (or maybe all of them) feature flags to:
    • Ensure a fallback/LTS version exists and is mentioned in the docs before shipping the "invasive/feature flag in write mode".
    • Explicitly document what happens when the feature flag is disabled in the relevant versions.
  • Add the needed utils/infra to allow e2e tests concerning those scenarios, in addition to those e2e tests, the feature flag can always be tested on long running stage envs even though an LTS fallback version isn't ready yet, as long as the user is aware of that...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

WAL/WBL: Make iterating on format schema easier; consider versioning & forward compatibility
7 participants