Propose WAL format versioning and change strategy. #40
base: main
Conversation
* changes that merge records
* sharding?

### Maintain two WALs (well four, with WBL)
Why do we have a separate WAL and WBL?
Because WBL is a separate WAL on disk
That's a re-statement of the fact. I am asking what the reason behind it is.
Good question, not sure if there's any efficiency reason (e.g. easier to find OOO records). cc @codesome
When implementing out-of-order support, the WAL had all the out-of-order samples in it (both the ingested and un-ingested out-of-order samples). So we needed a way to know exactly which out-of-order samples were ingested, so that we could restore only those samples. We could only know whether a sample went into a chunk at the time of writing to the chunk (i.e. after writing to the WAL), so we introduced the WBL to log the out-of-order samples that were ingested, since it is logged after samples go into the chunk. Data in the WBL only goes into the out-of-order head.
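For illustration only, here is a minimal sketch of that ordering with hypothetical, simplified types; none of these names match the real Prometheus appender code:

```go
package main

import "fmt"

// Hypothetical, simplified types to illustrate the ordering described above.
type sample struct {
	ref uint64
	t   int64
	v   float64
}

// walLog stands in for a WAL (or WBL) writer that appends records.
type walLog struct{ records [][]sample }

func (l *walLog) log(rec []sample) { l.records = append(l.records, rec) }

// oooHead stands in for the out-of-order head; accept decides whether a
// sample actually lands in an OOO chunk (e.g. it is inside the OOO window).
type oooHead struct{ minValidTime int64 }

func (h *oooHead) accept(s sample) bool { return s.t >= h.minValidTime }

func commitOOO(wal, wbl *walLog, head *oooHead, samples []sample) {
	// 1. All out-of-order samples, ingested or not, are logged to the WAL first.
	wal.log(samples)

	// 2. Only while appending to the chunk do we learn which samples were
	//    actually ingested.
	var ingested []sample
	for _, s := range samples {
		if head.accept(s) {
			ingested = append(ingested, s)
		}
	}

	// 3. The WBL logs only the ingested samples, after the chunk write, so
	//    replay can rebuild exactly what went into the out-of-order head.
	wbl.log(ingested)
}

func main() {
	wal, wbl := &walLog{}, &walLog{}
	head := &oooHead{minValidTime: 100}
	commitOOO(wal, wbl, head, []sample{{1, 90, 1.0}, {1, 150, 2.0}})
	fmt.Println(len(wal.records[0]), len(wbl.records[0])) // 2 1
}
```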
This could probably be merged together by writing separate OOO records in WAL. But the replay needs to be changed carefully so that things don't break.
Ok, so from the comments so far and Slack, it seems to me like there is a sentiment to:
WDYT @krajorama @bboreham, should I switch my proposal to those?
Implementation for prometheus/proposals#40
Yes, segments already provide ordering and would help us in upgrade->downgrade->upgrade scenarios.
Yes. I think there's a class of user that would very much expect this.
Updated to segment-based versioning using filename semantics. Also updated to now ensure an LTS release before switching. This proposal is ready for a second look! cc @bboreham @krajorama @carrieedwards - thanks for the prompt reviews so far!
Thanks for writing this! I just have a few questions for clarification.
Historically, we didn't hit major problems because we were only adding new semantic data (e.g. exemplars, metadata, new native histograms) as new records.
However, these days, we need to add features to existing data (e.g. custom buckets that will replace classic histograms, cleanup of histogram records or created timestamps to samples). Even if we create a new record for those and use it for new samples, any rollback will **lose that information as they appear unknown in the old version**.

For the TSDB changes (see the [context](#context-tsdb-format-changes), tribally, we use an undocumented ["2-fold" migration strategy](#how-two-fold-migration-strategy). However, WAL data is typically significantly smaller, around ~30m worth of samples (time to gather 120 samples for a chunk, for 15s intervals), plus 2h series records in WAL.
Suggested change (adding the missing closing parenthesis):

For the TSDB changes (see the [context](#context-tsdb-format-changes)), tribally, we use an undocumented ["2-fold" migration strategy](#how-two-fold-migration-strategy). However, WAL data is typically significantly smaller, around ~30m worth of samples (time to gather 120 samples for a chunk, for 15s intervals), plus 2h series records in WAL.
Also, WAL data spans up to 3hrs right before head compaction. Only when the m-mapped chunks are all fine (i.e. not corrupted) is the last ~30mins the relevant bit, with the given 15s interval.
There are two reasons for the flag:

* Allows users to get the new features sooner and skip the safety mechanism.
* It simplifies the compatibility guarantees as the flag default mechanism guides users and devs in the rollout and revert procedures e.g:
It would be good to have a quick static check on startup if the binary supports all the WAL files present on the disk.
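As an illustration of what such a check could look like, here is a rough sketch assuming a hypothetical versioned segment filename scheme (e.g. `00000001-v2`) and a hard-coded supported set; the names and the filename format are made up for this example and do not reflect the proposal or Prometheus code:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"regexp"
	"strconv"
)

// Hypothetical versioned segment filename scheme (e.g. "00000001-v2") and a
// hard-coded supported set; both are assumptions for this sketch only.
var segmentVersionRe = regexp.MustCompile(`^\d{8}-v(\d+)$`)

var supportedWALVersions = map[int]bool{1: true, 2: true}

// checkWALVersions fails fast if any segment on disk uses a WAL version this
// binary does not know how to read.
func checkWALVersions(walDir string) error {
	entries, err := os.ReadDir(walDir)
	if err != nil {
		return err
	}
	for _, e := range entries {
		m := segmentVersionRe.FindStringSubmatch(e.Name())
		if m == nil {
			continue // unversioned (v1) segments, checkpoints, etc.
		}
		v, _ := strconv.Atoi(m[1])
		if !supportedWALVersions[v] {
			return fmt.Errorf("segment %s uses WAL version v%d, which this binary cannot read",
				filepath.Join(walDir, e.Name()), v)
		}
	}
	return nil
}

func main() {
	if err := checkWALVersions("data/wal"); err != nil {
		fmt.Fprintln(os.Stderr, "startup check failed:", err)
		os.Exit(1)
	}
}
```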
* Mentioning Write-Before-Log (WBL) or checkpoints, both use the WAL format internally.
* To reduce the scope we don't mention [memory snapshot format](https://github.com/prometheus/prometheus/blob/fd5ea4e0b590d57df1e2ef41d027f2a4f640a46c/tsdb/docs/format/memory_snapshot.md#L1) for now.

## How
Since you mentioned development velocity as one of the goals, I see going from, say, v3 to v6 between two LTS releases (say X and Y) as a possibility: LTS X supports v1, v2, v3 and writes v3; LTS Y writes v3 and supports reading v1-v6; and release Y+1 will default to writing v6, according to your example. So if a user wanted to enable any of v4, v5, v6 between releases X and Y, is it the responsibility of the user to make sure they can roll back the version using the documented WAL versions?
We recommend the [Two-Fold Migration Strategy](#how-two-fold-migration-strategy) with the two important additions:

* A new flag that tells Prometheus what WAL version to write.
* There can be multiple "forward compatible" versions, but the official minimum is **the previous [Long-Term-Support (LTS) release](https://prometheus.io/docs/introduction/release-cycle/)**. The forward compatibility can be optionally backported to the LTS, instead of waiting up to a 1y, depending on the risk.
If we backport the forward compatibility to the last LTS, will we also bump the default write version in the upcoming release before the next LTS?
If no: then why backport in the first place?
If yes: what about other "breaking" features in the latest releases that are not in the last LTS - you can't really roll back to the last LTS anymore because of those. Maybe this is the first step towards the long-term goal that all breaking additions have forward compatibility in the last LTS, but I don't see it practically happening soon (maybe we enjoy this phase until we decide to break some other part of TSDB).
came across this, thanks for putting this together.
Left some comments.
* [metadata fields are arbitrary labels in metadata record](https://github.com/prometheus/prometheus/blob/e410a215fbe89b67f0e8edef9de25ede503ea4e0/tsdb/record/record.go#L608).

Historically, we didn't hit major problems because we were only adding new semantic data (e.g. exemplars, metadata, new native histograms) as new records.
However, these days, we need to add features to existing data (e.g. custom buckets that will replace classic histograms, cleanup of histogram records or created timestamps to samples). Even if we create a new record for those and use it for new samples, any rollback will **lose that information as they appear unknown in the old version**.
I don't see how we can do better than "Even if we create a new record for those and use it for new samples, any rollback will **lose that information as they appear unknown in the old version**": even with the proposed changes, an old Prometheus version cannot do better than treating the new-format data/records as unknown as part of the forward compatibility.
So, in all cases, the old Prometheus version will lose that information as it doesn't know how to interpret it.
* Demotivating for format changes (long feedback loop)
* Harder to communicate what exactly changed in each Prometheus version or even implement backward compatibility?

### Alternative: Use feature flag instead
I reviewed the linked issues and efforts motivating the segments versioning, and I believe all of them can be managed with the two-fold strategy/migration and a feature flag, so maybe I'm missing sth or maybe more concrete scenarios are missing in the proposal.
Unless we intend to change the record's header format (which I think we can also do without having to version the WAL segments), I think the current format is generic enough to handle all other changes. It's a map of record type/code -> how to handle it, and I think we can easily customize that mapping (and improve it if needed) without having to version the segments.
For example, if I were to propose an optimized encoding for samples, I would add a SampleOptimized record type (with its reader/writer) and a --feature-flag=optimized-wal-encoding or sth to trigger reading and then writing in another version. With the two-fold strategy and proper documentation/assistance on potential downgrades [1], as suggested above, and on feature gate deactivation [2], I believe we could roll out and test that SampleOptimized. This approach aligns with how we've proceeded so far, as you mentioned.
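To make the idea concrete, a rough sketch of that record-type-to-handler mapping, with hypothetical type codes and decoders (the real ones live in Prometheus' tsdb/record package and look different):

```go
package main

import "fmt"

// Hypothetical record type codes and decoders for this sketch only.
type recordType uint8

const (
	recSeries           recordType = 1
	recSamples          recordType = 2
	recSamplesOptimized recordType = 100 // new record, written only behind --feature-flag=optimized-wal-encoding
)

type decoder func(payload []byte) error

// decoders is the "record type/code -> how to handle it" map: supporting a new
// record is adding an entry, without touching the segment format itself.
var decoders = map[recordType]decoder{
	recSeries:           func(b []byte) error { fmt.Println("decode series record"); return nil },
	recSamples:          func(b []byte) error { fmt.Println("decode samples record"); return nil },
	recSamplesOptimized: func(b []byte) error { fmt.Println("decode optimized samples record"); return nil },
}

func replayRecord(typ recordType, payload []byte) error {
	dec, ok := decoders[typ]
	if !ok {
		// Binaries without the entry treat the record as unknown and skip it,
		// which is how old Prometheus versions handle records they don't know.
		fmt.Printf("unknown record type %d, skipping\n", typ)
		return nil
	}
	return dec(payload)
}

func main() {
	_ = replayRecord(recSamplesOptimized, []byte{0x1})
	_ = replayRecord(recordType(200), nil) // unknown on a binary without the entry
}
```

The point being: adding SampleOptimized is just a new map entry plus a write-side gate, while older binaries fall into the unknown-record branch.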
While versioning WAL segments provides more visibility into the version in use, what does a v3 segment version really mean to a user? Does it mean native histograms are supported? Does it mean optimized encoding is enabled? I think it would only add confusion.
Also, having such vague versions would create additional confusion and necessitate more synchronization between developers working concurrently on changes that require "incrementing the WAL segment version".
[1] In addition to the documentation, we could implement a built-in safeguard mechanism for downgrades: store the minimum downgrade version on disk and refuse to start up unless the file is manually deleted or another action is taken (a rough sketch follows after these notes). Also, if we want to raise awareness about the "dangers of unsupported downgrading", we may need to reconsider how record.Unknown is handled.
[2] We could leverage this proposal to improve the discoverability of built-in feature gates (I think there is an issue for this somewhere).
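For illustration, a rough sketch of the safeguard from [1], with a made-up marker-file name and format; nothing here reflects actual Prometheus behaviour:

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// Illustrative marker file and version constant; both are made up for this sketch.
const minVersionFile = "data/wal/min_read_version"

const thisBinaryWALVersion = 2

// enforceMinDowngradeVersion refuses to start if the WAL on disk declares a
// minimum read version newer than what this binary supports.
func enforceMinDowngradeVersion() error {
	b, err := os.ReadFile(minVersionFile)
	if os.IsNotExist(err) {
		return nil // nothing recorded yet, e.g. a fresh or old-format WAL
	} else if err != nil {
		return err
	}
	minVer, err := strconv.Atoi(strings.TrimSpace(string(b)))
	if err != nil {
		return err
	}
	if thisBinaryWALVersion < minVer {
		return fmt.Errorf("WAL requires version >= v%d; delete %s only if you accept losing that data",
			minVer, minVersionFile)
	}
	return nil
}

func main() {
	if err := enforceMinDowngradeVersion(); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```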
Thanks, fair points. Sounds like you propose feature flags with feature names, which semantically might be closer to "record versioning" than the single "segment version" proposed right now. It seems the main difference is organizing features under a version umbrella versus delivering them separately. Definitely something to consider; let's start unpacking pros & cons.
> While versioning WAL segments provides more visibility into the version in use, what does a v3 segment version really mean to a user? Does it mean native histograms are supported? Does it mean optimized encoding is enabled? I think it would only add confusion. Also, having such vague versions would create additional confusion and necessitate more synchronization between developers working concurrently on changes that require "incrementing the WAL segment version".
Yes. It will immediately mean nothing to the user about what changed, and that's on purpose. This versioning is optimized for telling users about important data compatibility and for discovering which version of Prometheus supports which versions of WAL, that's it.
Imagine we change the WAL to finally use proper nativehistogram records (instead of the 4 different records now) as a wal-nhcleanup feature, then add a new sample with CT via a wal-ctsample feature, and finally, say, structured summaries in the future with a wal-summary feature. Now you deploy a new Prometheus version that (after a fold) defaults to wal-nhcleanup, then the user manually opts in to wal-ctsample, while wal-summary is still opt-out. Let's say the user reverts to the previous Prometheus release they were running (e.g. the LTS or, say, 3 versions back) and some data is completely missing because of WAL incompatibility. Finding which feature is actually incompatible and which version of Prometheus supports which features is not easy. Talking in segment versions is a bit more consistent and obvious when ls-ing your directory or logging an unknown segment version.
But… maybe that user experience is not needed. Maybe asking those tricky features to be in an LTS version before switching to opt-in would help (I would avoid having to wait for v4.x Prometheus etc.), and better docs could be added too.
WDYT?
> and some data are completely missing because of WAL incompatibility
As I mentioned earlier, segment versioning isn't a magic solution for this issue. If a user downgrades to a Prometheus version that can't decode v2 segments, the data within those segments will be lost. However, if users follow the recommended downgrade paths, the feature flag approach should be able to handle those scenarios without data loss.
> Finding what feature actually is incompatible and what version of Prometheus supports what features is not easy. Talking in segment versions is a bit more consistent and obvious when ls-ing your directory or logging on unknown segment version.
I think adding a new dimension (segment version) to the mix would just add to the confusion here, especially since users will still need to consult documentation of other tools to figure out the differences between segment v3 and v2.
I think well-documented feature flags and a well-documented compatibility matrix should be enough. The most important thing, as you mentioned, is to provide a well-tested fallback (through the two-fold strategy).
So, what I think we could still do through this proposal is:
- Officially document the two-fold strategy, with examples and schemas ;)
- Start requiring such "invasive/critical" (or maybe all of them) feature flags to:
  - Ensure a fallback/LTS version exists and is mentioned in the docs before shipping the "invasive" feature flag in write mode.
  - Explicitly document what happens when the feature flag is disabled in the relevant versions.
- Add the needed utils/infra to allow e2e tests covering those scenarios. In addition to those e2e tests, the feature flag can always be tested on long-running staging envs even though an LTS fallback version isn't ready yet, as long as the user is aware of that...
Fixes prometheus/prometheus#15200
Also join #prometheus-wal-dev on Slack for the sync discussion!