Propose WAL format versioning and change strategy. #40
base: main
Conversation
* changes that merge records
* sharding?

### Maintain two WALs (well four, with WBL)
Why do we have a separate WAL and WBL?
Because WBL is a separate WAL on disk
That's a re-statement of the fact. I am asking what the reason behind it is.
Good question, not sure if there's any efficiency reason (e.g. easier to find OOO records). cc @codesome
When implementing out-of-order support, the WAL had all the out-of-order samples in it (both the ingested and un-ingested out-of-order samples). So we needed a way to know exactly which out-of-order samples were ingested, so that we could restore only those samples. We could only know whether a sample went into a chunk at the time of writing to the chunk (i.e. after writing to the WAL), so we introduced the WBL to log the out-of-order samples that were ingested, since it is logged after samples go into the chunk. Data in the WBL only goes into the out-of-order head.
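For illustration only, here is a minimal sketch of that ordering with hypothetical, simplified types; none of these names match the real Prometheus appender code:

```go
package main

import "fmt"

// Hypothetical, simplified types to illustrate the ordering described above.
type sample struct {
	ref uint64
	t   int64
	v   float64
}

// walLog stands in for a WAL (or WBL) writer that appends records.
type walLog struct{ records [][]sample }

func (l *walLog) log(rec []sample) { l.records = append(l.records, rec) }

// oooHead stands in for the out-of-order head; accept decides whether a
// sample actually lands in an OOO chunk (e.g. it is inside the OOO window).
type oooHead struct{ minValidTime int64 }

func (h *oooHead) accept(s sample) bool { return s.t >= h.minValidTime }

func commitOOO(wal, wbl *walLog, head *oooHead, samples []sample) {
	// 1. All out-of-order samples, ingested or not, are logged to the WAL first.
	wal.log(samples)

	// 2. Only while appending to the chunk do we learn which samples were
	//    actually ingested.
	var ingested []sample
	for _, s := range samples {
		if head.accept(s) {
			ingested = append(ingested, s)
		}
	}

	// 3. The WBL logs only the ingested samples, after the chunk write, so
	//    replay can rebuild exactly what went into the out-of-order head.
	wbl.log(ingested)
}

func main() {
	wal, wbl := &walLog{}, &walLog{}
	head := &oooHead{minValidTime: 100}
	commitOOO(wal, wbl, head, []sample{{1, 90, 1.0}, {1, 150, 2.0}})
	fmt.Println(len(wal.records[0]), len(wbl.records[0])) // 2 1
}
```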
This could probably be merged together by writing separate OOO records in WAL. But the replay needs to be changed carefully so that things don't break.
Ok, so from the comments so far and Slack, it seems to me like there is a sentiment to:
WDYT @krajorama @bboreham, should I switch my proposal to those?
Implementation for prometheus/proposals#40
Yes, segments already provide ordering and would help us in upgrade->downgrade->upgrade scenarios.
Yes. I think there's a class of user that would very much expect this.
Updated to segment-based versioning using filename semantics. Also updated to now ensure an LTS release before switching. This proposal is ready for a second look! cc @bboreham @krajorama @carrieedwards - thanks for the prompt reviews so far!
Thanks for writing this! I just have a few questions for clarification.
Historically, we didn't hit major problems because we were only adding new semantic data (e.g. exemplars, metadata, new native histograms) as new records.
However, these days, we need to add features to existing data (e.g. custom buckets that will replace classic histograms, cleanup of histogram records or created timestamps to samples). Even if we create a new record for those and use it for new samples, any rollback will **lose that information as they appear unknown in the old version**.

For the TSDB changes (see the [context](#context-tsdb-format-changes), tribally, we use an undocumented ["2-fold" migration strategy](#how-two-fold-migration-strategy). However, WAL data is typically significantly smaller, around ~30m worth of samples (time to gather 120 samples for a chunk, for 15s intervals), plus 2h series records in WAL.
Suggested change (adding the missing closing parenthesis):

For the TSDB changes (see the [context](#context-tsdb-format-changes)), tribally, we use an undocumented ["2-fold" migration strategy](#how-two-fold-migration-strategy). However, WAL data is typically significantly smaller, around ~30m worth of samples (time to gather 120 samples for a chunk, for 15s intervals), plus 2h series records in WAL.
Also, WAL data spans up to 3hrs right before head compaction. Only when the m-mapped chunks are all fine (i.e. not corrupted) is the last ~30mins the relevant bit, with the given 15s interval.
There are two reasons for the flag:

* Allows users to get the new features sooner and skip the safety mechanism.
* It simplifies the compatibility guarantees as the flag default mechanism guides users and devs in the rollout and revert procedures e.g:
It would be good to have a quick static check on startup if the binary supports all the WAL files present on the disk.
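As an illustration of what such a check could look like, here is a rough sketch assuming a hypothetical versioned segment filename scheme (e.g. `00000001-v2`) and a hard-coded supported set; the names and the filename format are made up for this example and do not reflect the proposal or Prometheus code:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"regexp"
	"strconv"
)

// Hypothetical versioned segment filename scheme (e.g. "00000001-v2") and a
// hard-coded supported set; both are assumptions for this sketch only.
var segmentVersionRe = regexp.MustCompile(`^\d{8}-v(\d+)$`)

var supportedWALVersions = map[int]bool{1: true, 2: true}

// checkWALVersions fails fast if any segment on disk uses a WAL version this
// binary does not know how to read.
func checkWALVersions(walDir string) error {
	entries, err := os.ReadDir(walDir)
	if err != nil {
		return err
	}
	for _, e := range entries {
		m := segmentVersionRe.FindStringSubmatch(e.Name())
		if m == nil {
			continue // unversioned (v1) segments, checkpoints, etc.
		}
		v, _ := strconv.Atoi(m[1])
		if !supportedWALVersions[v] {
			return fmt.Errorf("segment %s uses WAL version v%d, which this binary cannot read",
				filepath.Join(walDir, e.Name()), v)
		}
	}
	return nil
}

func main() {
	if err := checkWALVersions("data/wal"); err != nil {
		fmt.Fprintln(os.Stderr, "startup check failed:", err)
		os.Exit(1)
	}
}
```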
* Mentioning Write-Before-Log (WBL) or checkpoints, both use the WAL format internally.
* To reduce the scope we don't mention [memory snapshot format](https://github.com/prometheus/prometheus/blob/fd5ea4e0b590d57df1e2ef41d027f2a4f640a46c/tsdb/docs/format/memory_snapshot.md#L1) for now.

## How
Since you mentioned development velocity as one of the goals, I see going from, say, v3 to v6 between two LTS releases (say X and Y) as a possibility: LTS X supports v1, v2, v3 and writes v3; LTS Y writes v3 and supports reading v1-v6; and release Y+1 will default to writing v6, according to your example. So if a user wanted to enable any of v4, v5, v6 between releases X and Y, is it the responsibility of the user to make sure they can roll back the version using the documented WAL versions?
We recommend the [Two-Fold Migration Strategy](#how-two-fold-migration-strategy) with the two important additions:

* A new flag that tells Prometheus what WAL version to write.
* There can be multiple "forward compatible" versions, but the official minimum is **the previous [Long-Term-Support (LTS) release](https://prometheus.io/docs/introduction/release-cycle/)**. The forward compatibility can be optionally backported to the LTS, instead of waiting up to a 1y, depending on the risk.
If we backport the forward compatibility to the last LTS, will we also bump the default write version in the upcoming release before the next LTS?
If no: then why backport in the first place?
If yes: what about other "breaking" features in the latest releases that are not in the last LTS - you can't really roll back to the last LTS anymore because of those. Maybe this is the first step towards the long-term goal that all breaking additions have forward compatibility in the last LTS, but I don't see it practically happening soon (maybe we enjoy this phase until we decide to break some other part of TSDB).
came across this, thanks for putting this together.
Left some comments.
* [metadata fields are arbitrary labels in metadata record](https://github.com/prometheus/prometheus/blob/e410a215fbe89b67f0e8edef9de25ede503ea4e0/tsdb/record/record.go#L608).

Historically, we didn't hit major problems because we were only adding new semantic data (e.g. exemplars, metadata, new native histograms) as new records.
However, these days, we need to add features to existing data (e.g. custom buckets that will replace classic histograms, cleanup of histogram records or created timestamps to samples). Even if we create a new record for those and use it for new samples, any rollback will **lose that information as they appear unknown in the old version**.
I don't see how we can do better than "Even if we create a new record for those and use it for new samples, any rollback will **lose that information as they appear unknown in the old version**": even with the proposed changes, an old Prometheus version cannot do better than treating the new-format data/records as unknown as part of the forward compatibility.
So, in all cases, the old Prometheus version will lose that information as it doesn't know how to interpret it.
* Demotivating for format changes (long feedback loop)
* Harder to communicate what exactly changed in each Prometheus version or even implement backward compatibility?

### Alternative: Use feature flag instead
I reviewed the linked issues and efforts motivating the segments versioning, and I believe all of them can be managed with the two-fold strategy/migration and a feature flag, so maybe I'm missing sth or maybe more concrete scenarios are missing in the proposal.
Unless we intend to change the record's header format (which I think we can also do without having to version the WAL segments), I think the current format is generic enough to handle all other changes. It's a map of record type/code -> how to handle it, and I think we can easily customize that mapping (and improve it if needed) without having to version the segments.
For example, if I were to propose an optimized encoding for samples, I would add a SampleOptimized record type (with its reader/writer) and a --feature-flag=optimized-wal-encoding or sth to trigger reading and then writing in another version. With the two-fold strategy and proper documentation/assistance on potential downgrades [1], as suggested above, and on feature gate deactivation [2], I believe we could roll out and test that SampleOptimized. This approach aligns with how we've proceeded so far, as you mentioned.
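To make the idea concrete, a rough sketch of that record-type-to-handler mapping, with hypothetical type codes and decoders (the real ones live in Prometheus' tsdb/record package and look different):

```go
package main

import "fmt"

// Hypothetical record type codes and decoders for this sketch only.
type recordType uint8

const (
	recSeries           recordType = 1
	recSamples          recordType = 2
	recSamplesOptimized recordType = 100 // new record, written only behind --feature-flag=optimized-wal-encoding
)

type decoder func(payload []byte) error

// decoders is the "record type/code -> how to handle it" map: supporting a new
// record is adding an entry, without touching the segment format itself.
var decoders = map[recordType]decoder{
	recSeries:           func(b []byte) error { fmt.Println("decode series record"); return nil },
	recSamples:          func(b []byte) error { fmt.Println("decode samples record"); return nil },
	recSamplesOptimized: func(b []byte) error { fmt.Println("decode optimized samples record"); return nil },
}

func replayRecord(typ recordType, payload []byte) error {
	dec, ok := decoders[typ]
	if !ok {
		// Binaries without the entry treat the record as unknown and skip it,
		// which is how old Prometheus versions handle records they don't know.
		fmt.Printf("unknown record type %d, skipping\n", typ)
		return nil
	}
	return dec(payload)
}

func main() {
	_ = replayRecord(recSamplesOptimized, []byte{0x1})
	_ = replayRecord(recordType(200), nil) // unknown on a binary without the entry
}
```

The point being: adding SampleOptimized is just a new map entry plus a write-side gate, while older binaries fall into the unknown-record branch.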
While versioning WAL segments provides more visibility into the version in use, what does a v3 segment version really mean to a user? Does it mean native histograms are supported? Does it mean optimized encoding is enabled? I think it would only add confusion.
Also, having such vague versions would create additional confusion and necessitate more synchronization between developers working concurrently on changes that require "incrementing the WAL segment version".
[1] In addition to the documentation, we could implement a built-in safeguard mechanism for downgrades: store the minimum downgrade version on disk and refuse to start up unless the file is manually deleted or another action is taken (a rough sketch follows after these notes). Also, if we want to raise awareness about the "dangers of unsupported downgrading", we may need to reconsider how record.Unknown is handled.
[2] We could leverage this proposal to improve the discoverability of built-in feature gates (I think there is an issue for this somewhere).
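For illustration, a rough sketch of the safeguard from [1], with a made-up marker-file name and format; nothing here reflects actual Prometheus behaviour:

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// Illustrative marker file and version constant; both are made up for this sketch.
const minVersionFile = "data/wal/min_read_version"

const thisBinaryWALVersion = 2

// enforceMinDowngradeVersion refuses to start if the WAL on disk declares a
// minimum read version newer than what this binary supports.
func enforceMinDowngradeVersion() error {
	b, err := os.ReadFile(minVersionFile)
	if os.IsNotExist(err) {
		return nil // nothing recorded yet, e.g. a fresh or old-format WAL
	} else if err != nil {
		return err
	}
	minVer, err := strconv.Atoi(strings.TrimSpace(string(b)))
	if err != nil {
		return err
	}
	if thisBinaryWALVersion < minVer {
		return fmt.Errorf("WAL requires version >= v%d; delete %s only if you accept losing that data",
			minVer, minVersionFile)
	}
	return nil
}

func main() {
	if err := enforceMinDowngradeVersion(); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```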
Thanks, fair points. Sounds like you propose feature flags with feature names, which semantically might be closer to "record versioning" than the single "segment version" proposed right now. It seems the main difference is organizing features under a version umbrella versus delivering them separately. Definitely something to consider; let's start unpacking pros & cons.
> While versioning WAL segments provides more visibility into the version in use, what does a v3 segment version really mean to a user? Does it mean native histograms are supported? Does it mean optimized encoding is enabled? I think it would only add confusion. Also, having such vague versions would create additional confusion and necessitate more synchronization between developers working concurrently on changes that require "incrementing the WAL segment version".
Yes. It will immediately mean nothing to the user about what changed, and that's on purpose. This versioning is optimized for telling users about important data compatibility and for discovering which version of Prometheus supports which versions of WAL, that's it.
Imagine we change the WAL to finally use proper nativehistogram records (instead of the 4 different records now) as a wal-nhcleanup feature, then add a new sample with CT via a wal-ctsample feature, and finally, say, structured summaries in the future with a wal-summary feature. Now you deploy a new Prometheus version that (after a fold) defaults to wal-nhcleanup, then the user manually opts in to wal-ctsample, while wal-summary is still opt-out. Let's say the user reverts to the previous Prometheus release they were running (e.g. the LTS or, say, 3 versions back) and some data is completely missing because of WAL incompatibility. Finding which feature is actually incompatible and which version of Prometheus supports which features is not easy. Talking in segment versions is a bit more consistent and obvious when ls-ing your directory or logging an unknown segment version.
But… maybe that user experience is not needed. Maybe asking those tricky features to be in an LTS version before switching to opt-in would help (I would avoid having to wait for v4.x Prometheus etc.), and better docs could be added too.
WDYT?
> and some data are completely missing because of WAL incompatibility
As I mentioned earlier, segment versioning isn't a magic solution for this issue. If a user downgrades to a Prometheus version that can't decode v2 segments, the data within those segments will be lost. However, if users follow the recommended downgrade paths, the feature flag approach should be able to handle those scenarios without data loss.
> Finding what feature actually is incompatible and what version of Prometheus supports what features is not easy. Talking in segment versions is a bit more consistent and obvious when ls-ing your directory or logging on unknown segment version.
I think adding a new dimension (segment version) to the mix would just add to the confusion here, especially since users will still need to consult documentation of other tools to figure out the differences between segment v3 and v2.
I think well-documented feature flags and a well-documented compatibility matrix should be enough. The most important thing, as you mentioned, is to provide a well-tested fallback (through the two-fold strategy).
So, what I think we could still do through this proposal is:
- Officially document the two-fold strategy, with examples and schemas ;)
- Start requiring such "invasive/critical" (or maybe all of them) feature flags to:
  - Ensure a fallback/LTS version exists and is mentioned in the docs before shipping the "invasive" feature flag in write mode.
  - Explicitly document what happens when the feature flag is disabled in the relevant versions.
- Add the needed utils/infra to allow e2e tests covering those scenarios. In addition to those e2e tests, the feature flag can always be tested on long-running staging envs even though an LTS fallback version isn't ready yet, as long as the user is aware of that...
Fixes prometheus/prometheus#15200
Also join #prometheus-wal-dev on Slack for the sync discussion!