Optionally compress the WAL using Snappy #609

csmarchbanks · 2019-05-18T20:42:46Z

Compress the WAL records using snappy. Running in test clusters shows that the WAL is half the size with minimal increase in CPU.

Benchmarks:
2012 macbook pro, SATA flash drive

go test ./wal -bench . -benchmem -benchtime=5s
goos: darwin
goarch: amd64
pkg: github.com/prometheus/tsdb/wal
BenchmarkWAL_LogBatched/compress=true-8                   500000             13234 ns/op         154.74 MB/s           0 B/op          0 allocs/op
BenchmarkWAL_LogBatched/compress=false-8                 1000000              5278 ns/op         387.96 MB/s           0 B/op          0 allocs/op
BenchmarkWAL_Log/compress=true-8                          500000             13209 ns/op         155.04 MB/s           0 B/op          0 allocs/op
BenchmarkWAL_Log/compress=false-8                         500000             19728 ns/op         103.81 MB/s           0 B/op          0 allocs/op

5th Gen Lenovo X1 Carbon, NVMe flash drive

go test ./wal -bench . -benchmem -benchtime=5s
goos: linux
goarch: amd64
pkg: github.com/prometheus/tsdb/wal
BenchmarkWAL_LogBatched/compress=true-4                  5000000              1590 ns/op        1287.93 MB/s           0 B/op          0 allocs/op
BenchmarkWAL_LogBatched/compress=false-4                 3000000              6327 ns/op         323.66 MB/s           0 B/op          0 allocs/op
BenchmarkWAL_Log/compress=true-4                         5000000              1555 ns/op        1316.45 MB/s           0 B/op          0 allocs/op
BenchmarkWAL_Log/compress=false-4                        1000000              5862 ns/op         349.36 MB/s           0 B/op          0 allocs/op
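
As a reading aid for the tables above: the ns/op and MB/s columns together imply how many bytes each iteration writes. The sketch below does that arithmetic, assuming Go's usual convention that MB/s means 10^6 bytes per second. The helper name `bytesPerOp` is made up for illustration and is not part of the PR. The rows it uses all work out to roughly 2 KiB per operation.

```go
package main

import "fmt"

// bytesPerOp recovers the payload size of one benchmark iteration from the
// ns/op and MB/s columns, assuming MB/s means 1e6 bytes per second.
func bytesPerOp(nsPerOp, mbPerSec float64) float64 {
	return mbPerSec * 1e6 * nsPerOp * 1e-9
}

func main() {
	// Values copied from the benchmark output above.
	fmt.Printf("macbook, LogBatched, compress=true:    %.0f bytes/op\n", bytesPerOp(13234, 154.74))
	fmt.Printf("macbook, LogBatched, compress=false:   %.0f bytes/op\n", bytesPerOp(5278, 387.96))
	fmt.Printf("x1 carbon, LogBatched, compress=true:  %.0f bytes/op\n", bytesPerOp(1590, 1287.93))
	fmt.Printf("x1 carbon, LogBatched, compress=false: %.0f bytes/op\n", bytesPerOp(6327, 323.66))
}
```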

Closes #599

krasi-georgiev · 2019-05-18T20:49:43Z

I would say let's have it optional and enabled by default, so that people who run into some sort of performance issue can always disable it. I guess some people would prefer high performance while others would prefer lower disk usage.

csmarchbanks · 2019-05-20T16:32:45Z

My only concern with having it enabled by default is that a user will lose any compressed WAL data if they roll back their Prometheus version. Perhaps have one release of Prometheus with default false, and then default it to true?

brian-brazil · 2019-05-20T17:32:55Z

Thus far we've enabled this sort of thing by default, we only promise that you can roll forward.

beorn7 · 2019-05-22T12:09:17Z

> Thus far we've enabled this sort of thing by default, we only promise that you can roll forward.

Actually, we were very conservative at times with changes that prevent a roll-back. At other times, we were not. The latter has caused some headaches for users. At least initially, I'd always leave the default at the old (roll-back-able) setting. Once a minor release has worked fine with the new setting, we can consider flipping the default.

I do get the promise of not necessarily supporting a roll-back. However, we can still be conservative and introduce a new feature as an opt-in first, simply to avoid the scenario that a bug (which might or might not be related to the new feature) is discovered in the new minor release but then you cannot roll back.

csmarchbanks · 2019-05-23T01:38:14Z

First test of this in Prometheus is going well. Only ingesting 8k samples per second on this instance, but I see no significant increase in CPU usage. The compression ratio for my data is steady at 2.1x.

krasi-georgiev · 2019-05-28T13:37:23Z

In this case a potential bug would cause the user to delete the WAL folder (3h data loss) and start Prometheus with the compression disabled, so I think this should be enough of an option to not need to roll back to a previous Prometheus version.

I am leaning towards having it enabled by default with an option to disable it, so that as many people as possible test it and report bugs.
Once it is considered stable (after 1-2 releases), make it permanently enabled and remove the option to disable it.

@csmarchbanks is this ready for a review?

csmarchbanks · 2019-05-28T13:41:59Z

This is ready for a review, thanks in advance!

krasi-georgiev

Just a few nits; overall a very good PR and ready to be merged.

The RSS RAM usage has increased, so it may be worth running some profiling. Maybe the buf is escaping to the heap.
http://prombench.prometheus.io/grafana/d/7gmLoNDmz/prombench?orgId=1&var-RuleGroup=All&var-pr-number=5592&from=1558629647622&to=1558690279046

wal/wal.go

wal/reader.go

wal/wal.go

head_test.go

wal/live_reader.go

beorn7 · 2019-05-29T17:21:41Z

> In this case a potential bug would cause the user to delete the WAL folder (3h data loss) and start Prometheus with the compression disabled, so I think this should be enough of an option to not need to roll back to a previous Prometheus version.
>
> I am leaning towards having it enabled by default with an option to disable it, so that as many people as possible test it and report bugs.
> Once it is considered stable (after 1-2 releases), make it permanently enabled and remove the option to disable it.

Sorry for beating this seemingly dead horse again.

I agree that the stakes are not very high in this case. But that's perhaps a good opportunity to practice how we introduce un-rollback-able changes. We can also re-build the trust that we might have lost during the problematic v2.1 to v2.3 releases.

How does the following look to you:

v2.11: Introduce WAL compression as opt-in via flag. If v2.11 has a possibly unrelated bug that requires a rollback, only those who explicitly opted in are affected. We'll still get a fair number of users testing it, but the blast radius is limited (and in particular, nobody will feel treated unfairly).
v2.12: If things look good in v2.11, make compression the new default value of the flag. If v2.12 has an unrelated bug, rolling back is not a problem as long as you set the flag explicitly in v2.11. We will now get almost everybody using it, to reach the final level of confidence.
v2.13: If there is no reason to ever switch off compression (which we should know by now), make the flag a no-op. (I wouldn't be surprised if some people don't want compression. We have seen users being CPU-bound during WAL replaying, complaining about the long time it took. They will prefer an uncompressed WAL, while those being I/O bound will prefer the compressed WAL.)

This is not much slower (if at all) than what you suggested (make it default and remove flag two releases later).
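
The three-release plan can be sketched as a function of the release and the flag. This is only an illustration under stated assumptions: the flag name `wal-compression`, the helpers `defaultCompression` and `resolveCompression`, and the use of Go's standard `flag` package are all hypothetical and not taken from the PR. It also assumes the v2.13 condition holds, that is, that no reason to switch compression off has turned up.

```go
package main

import (
	"flag"
	"fmt"
)

// defaultCompression returns the hypothetical default of the compression flag
// for a given v2.x minor release under the proposed rollout.
func defaultCompression(minor int) bool {
	// v2.11: opt-in, so the default stays at the roll-back-able setting.
	// v2.12 and later: compression becomes the default.
	return minor >= 12
}

// resolveCompression parses args and reports whether the WAL would be
// compressed in the given minor release.
func resolveCompression(minor int, args []string) (bool, error) {
	fs := flag.NewFlagSet("rollout-sketch", flag.ContinueOnError)
	// "wal-compression" is a made-up flag name used only in this sketch.
	compress := fs.Bool("wal-compression", defaultCompression(minor), "compress the WAL")
	if err := fs.Parse(args); err != nil {
		return false, err
	}
	// v2.13 and later: the flag is still accepted but has become a no-op.
	if minor >= 13 {
		return true, nil
	}
	return *compress, nil
}

func main() {
	for _, minor := range []int{11, 12, 13} {
		on, err := resolveCompression(minor, nil)
		if err != nil {
			fmt.Println("parse error:", err)
			return
		}
		fmt.Printf("v2.%d with no flag given: compressed=%t\n", minor, on)
	}
}
```

The point of the middle step is visible in the sketch: a user who passes the flag explicitly in v2.11 gets the same on-disk format in v2.12, so moving between those two releases does not change how the WAL is written.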

krasi-georgiev · 2019-05-29T22:11:57Z

yep looks like a good plan.

> We'll still get a fair amount of users testing it, but the blast radius is limited

I hope enough people really enable it to report bugs.

> I wouldn't be surprised if some people don't want compression. We have seen users being CPU-bound during WAL replaying, complaining about the long time it took. They will prefer an uncompressed WAL, while those being I/O bound will prefer the compressed WAL.

So far the tests suggest that CPU usage hasn't increased. It looks like Snappy is a very lightweight compression.

krasi-georgiev

LGTM

ping @gouthamve

krasi-georgiev · 2019-05-30T10:12:46Z

Actually, do you think you could also add a test to ensure that the compression actually works?
Maybe:

Open a db with uncompressed WAL.
Write some samples.
Take a reading of the WAL.
Open a db with compressed WAL.
Write some samples.
Compare the size of the WAL.
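
The steps above could look roughly like the sketch below. It is not the test that was added to the PR. To stay within the standard library it writes to an in-memory buffer instead of opening a db, and it uses `compress/flate` as a stand-in for Snappy. The helpers `writeRecords` and `sampleRecords` and the sample payloads are made up for illustration.

```go
package main

import (
	"bytes"
	"compress/flate"
	"fmt"
)

// writeRecords appends every record to an in-memory "WAL" and returns the
// resulting size in bytes. compress/flate stands in for Snappy here; it is
// not the codec the PR uses.
func writeRecords(records [][]byte, compress bool) (int, error) {
	var wal bytes.Buffer
	for _, rec := range records {
		if !compress {
			wal.Write(rec)
			continue
		}
		// Compress each record on its own, one stream per record.
		fw, err := flate.NewWriter(&wal, flate.BestSpeed)
		if err != nil {
			return 0, err
		}
		if _, err := fw.Write(rec); err != nil {
			return 0, err
		}
		if err := fw.Close(); err != nil {
			return 0, err
		}
	}
	return wal.Len(), nil
}

// sampleRecords builds repetitive, label-like payloads (made-up data).
func sampleRecords(n int) [][]byte {
	records := make([][]byte, 0, n)
	for i := 0; i < n; i++ {
		series := fmt.Sprintf(`{job="node",instance="host-%d"}`, i)
		records = append(records, bytes.Repeat([]byte(series), 20))
	}
	return records
}

func main() {
	records := sampleRecords(100)
	plain, err := writeRecords(records, false)
	if err != nil {
		fmt.Println("write failed:", err)
		return
	}
	compressed, err := writeRecords(records, true)
	if err != nil {
		fmt.Println("write failed:", err)
		return
	}
	fmt.Printf("uncompressed=%d bytes, compressed=%d bytes\n", plain, compressed)
	if compressed >= plain {
		fmt.Println("FAIL: compression did not shrink the WAL")
	}
}
```

A real test would write the same samples through the WAL with compression off and on, then compare the on-disk segment sizes; the comparison at the end is the part that carries over.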

csmarchbanks · 2019-05-30T15:44:47Z

Added a test and removed the metrics. Let me know if there is anything else you would like me to do!

wal/wal_test.go

SuperQ · 2019-05-31T15:04:42Z

👍 to @beorn7's timeline proposal. Having a quick rollback for users that hit other bugs is a valid reason to not jump right to default compression.

krasi-georgiev · 2019-05-31T22:32:14Z

LGTM
ping @gouthamve @codesome

csmarchbanks · 2019-06-05T17:18:35Z

Rebased this to fix conflicts.

Bump for a second set of eyes: @gouthamve @codesome

codesome

I gave a quick look at the main implementation and it looks good. Just some nits.
I trust @krasi-georgiev's review on the things that I didn't touch.

wal/reader.go

wal/wal.go

csmarchbanks · 2019-06-06T14:00:45Z

Thanks for the review @codesome! I made the changes, and rebased once more due to conflicts in head_test.go.

csmarchbanks · 2019-06-10T17:10:20Z

Ping for one final look from @codesome or @gouthamve since my last updates.

csmarchbanks · 2019-06-18T15:16:36Z

Rebased again. @codesome and @krasi-georgiev, I would love to get this merged soon!

krasi-georgiev · 2019-06-18T15:44:14Z

@csmarchbanks yes, I reminded @codesome to have a final look, and we should get it in within the next few days.

codesome · 2019-06-19T06:15:14Z

Need to add an entry in CHANGELOG.

csmarchbanks · 2019-06-19T13:14:33Z

Thanks @codesome! I have added a CHANGELOG entry.

krasi-georgiev · 2019-06-19T13:47:11Z

Thanks, hopefully we didn't break anything with this compression 😺

krasi-georgiev · 2019-07-20T11:06:51Z

@csmarchbanks I just remembered that we didn't add a test to check the behaviour when starting Prometheus with an existing uncompressed WAL. In other words, make sure it correctly handles a WAL file that contains both compressed and uncompressed records.

csmarchbanks · 2019-07-20T14:45:40Z

@krasi-georgiev Since #608 I don't think there will ever be a mixed WAL file. Also, since the reader is the same for a compressed or uncompressed WAL file (it looks at whether each record is compressed or not), the code path should be tested whenever we test with compress=false.

If you think it is worth an explicit test I can write one soon though.
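
For illustration, a reader that decides per record could look like the sketch below. The one-byte header, the `compressedBit` constant, and the helper names are hypothetical and are not claimed to match the actual WAL layout. The byte-reversing `reverse` function is only a placeholder for a real codec such as Snappy.

```go
package main

import (
	"errors"
	"fmt"
)

// compressedBit is a hypothetical header bit marking a compressed record.
const compressedBit byte = 1 << 3

// reverse is a toy placeholder for a real codec; it returns a reversed copy.
func reverse(b []byte) []byte {
	out := make([]byte, len(b))
	for i := range b {
		out[len(b)-1-i] = b[i]
	}
	return out
}

// unreverse undoes reverse and has the shape of a decoder that can fail.
func unreverse(b []byte) ([]byte, error) {
	return reverse(b), nil
}

// encodeRecord prepends a one-byte header and applies pack only when
// compress is set, so both kinds of record can sit in the same log.
func encodeRecord(payload []byte, compress bool, pack func([]byte) []byte) []byte {
	header := byte(0)
	if compress {
		header |= compressedBit
		payload = pack(payload)
	}
	return append([]byte{header}, payload...)
}

// decodeRecord inspects the header bit of each record, so a single reader
// handles compressed and uncompressed records alike.
func decodeRecord(rec []byte, unpack func([]byte) ([]byte, error)) ([]byte, error) {
	if len(rec) == 0 {
		return nil, errors.New("empty record")
	}
	if rec[0]&compressedBit != 0 {
		return unpack(rec[1:])
	}
	return rec[1:], nil
}

func main() {
	// A mixed log: one uncompressed and one "compressed" record.
	log := [][]byte{
		encodeRecord([]byte("series-1"), false, reverse),
		encodeRecord([]byte("series-2"), true, reverse),
	}
	for _, rec := range log {
		payload, err := decodeRecord(rec, unreverse)
		if err != nil {
			fmt.Println("decode failed:", err)
			return
		}
		fmt.Printf("compressed=%t payload=%s\n", rec[0]&compressedBit != 0, payload)
	}
}
```

Because the decision is made per record, the read path exercised with compress=false is the same one that would handle a mixed file, which is the argument made above for not needing a separate test.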

krasi-georgiev · 2019-07-21T10:55:11Z

You are right, #608 is enough. Let's leave it as is.

csmarchbanks mentioned this pull request May 18, 2019

Investigate WAL compression #599

Closed

csmarchbanks changed the title ~~Optionally compress the WAL using Snappy~~ WIP: Optionally compress the WAL using Snappy May 18, 2019

csmarchbanks force-pushed the compressed-WAL branch from ee4a2c5 to 238f946 Compare May 18, 2019 20:54

csmarchbanks force-pushed the compressed-WAL branch from 1ae1f89 to 83ac0d2 Compare May 20, 2019 19:51

csmarchbanks force-pushed the compressed-WAL branch from 83ac0d2 to f3af604 Compare May 22, 2019 17:52

csmarchbanks force-pushed the compressed-WAL branch from 2ac1397 to 0c6993f Compare May 23, 2019 03:06

csmarchbanks mentioned this pull request May 23, 2019

Compress WAL prometheus/prometheus#5592

Merged

csmarchbanks force-pushed the compressed-WAL branch 3 times, most recently from 8ea640e to e95e806 Compare May 24, 2019 20:47

csmarchbanks changed the title ~~WIP: Optionally compress the WAL using Snappy~~ Optionally compress the WAL using Snappy May 24, 2019

krasi-georgiev reviewed May 29, 2019

csmarchbanks force-pushed the compressed-WAL branch from e95e806 to 0d94777 Compare May 29, 2019 16:29

krasi-georgiev reviewed May 30, 2019

csmarchbanks force-pushed the compressed-WAL branch from cb4dfe6 to 8627a8e Compare May 30, 2019 15:41

krasi-georgiev reviewed May 31, 2019

csmarchbanks force-pushed the compressed-WAL branch from 8627a8e to 63101a0 Compare May 31, 2019 15:40

csmarchbanks force-pushed the compressed-WAL branch from 63101a0 to e1eba5f Compare June 5, 2019 17:15

codesome reviewed Jun 6, 2019

csmarchbanks force-pushed the compressed-WAL branch from e1eba5f to a5ff5d0 Compare June 6, 2019 13:58

csmarchbanks force-pushed the compressed-WAL branch 2 times, most recently from 6ea65a3 to f32a76f Compare June 18, 2019 15:15

codesome approved these changes Jun 19, 2019

Provide option to compress WAL records

86c905c

In running Prometheus instances, compressing the records was shown to reduce disk usage by half while incurring a negligible CPU cost. Signed-off-by: Chris Marchbanks <[email protected]>

csmarchbanks force-pushed the compressed-WAL branch from f32a76f to 86c905c Compare June 19, 2019 13:13

krasi-georgiev merged commit b40cc43 into prometheus-junkyard:master Jun 19, 2019

csmarchbanks deleted the compressed-WAL branch June 19, 2019 14:17

This was referenced Jul 3, 2019

Updated TSDB dependency to 0.9.0. prometheus/prometheus#5733

Closed

Cut v2.11.0-rc.0 prometheus/prometheus#5729

Merged

csmarchbanks mentioned this pull request Sep 15, 2020

Add metadata to WAL, expose to WAL reader and send with remote write prometheus/prometheus#7771

Closed
