-
Notifications
You must be signed in to change notification settings - Fork 179
Optionally compress the WAL using Snappy #609
Optionally compress the WAL using Snappy #609
Conversation
I would say lets have it optional and enabled by default so that if people run into some sort of performance issues can always disable. I guess some people would prefer high performance where others would prefer lower disk usage. |
ee4a2c5
to
238f946
Compare
My only concern with having it enabled by default is that a user will lose any compressed WAL data if they roll back their Prometheus version. Perhaps have one release of Prometheus with default false, and then default it to true? |
Thus far we've enabled this sort of thing by default, we only promise that you can roll forward. |
1ae1f89
to
83ac0d2
Compare
Actually, we were very conservative at times with changes that prevent a roll-back. At other time, we were not. The latter has caused some headaches with users. At least initially, I'd always leave the default at the old (roll-back-able) setting. Once a minor release has worked fine with the new setting, we can consider flipping the default. I do get the promise of not necessarily supporting a roll-back. However, we can still be conservative and introduce a new feature as an opt-in first, simply to avoid the scenario that a bug (which might or might not be related to the new feature) is discovered in the new minor release but then you cannot roll back. |
83ac0d2
to
f3af604
Compare
First test of this in Prometheus is going well. Only ingesting 8k samples per second on this instance, but I see no significant increase in CPU usage. The compression ratio for my data is steady at 2.1x. |
2ac1397
to
0c6993f
Compare
8ea640e
to
e95e806
Compare
In this case a potential bug would cause the user to delete the WAL folder(3h data loss) and starting Prometheus with the the compression disabled so I think this should be enough of an option to not need to roll back to a previous Prometheus version. I am leaning towards having it enabled by default with an option to disable so that we can have as many people as possible to test it and report bugs. @csmarchbanks is this ready for a review? |
This is ready for a review, thanks in advance! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just few nits, overall very good PR and ready to be merged.
the RSS RAM usage has increased so maybe worth it to run some profiling. Maybe the buf is escaping in the heap.
http://prombench.prometheus.io/grafana/d/7gmLoNDmz/prombench?orgId=1&var-RuleGroup=All&var-pr-number=5592&from=1558629647622&to=1558690279046
e95e806
to
0d94777
Compare
Sorry for beating this seemingly dead horse again. I agree that the stakes are not very high in this case. But that's perhaps a good opportunity to practice how we introduce un-rollback-able changes. We can also re-build the trust that we might have lost during the problematic v2.1 to v2.3 releases. How does the following look to you:
This is not much slower (if at all) than what you suggested (make it default and remove flag two releases later). |
yep looks like a good plan.
I hope enough people really enable it to report bugs
so far the tests suggest that CPU usage hasn't increased. Looks like snappy is a very light weight compression. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
ping @gouthamve
actually do you think you can also add a test to ensure that the compression actually works.
|
cb4dfe6
to
8627a8e
Compare
Added a test and removed the metrics. Let me know if there is anything else you would like me to do! |
👍 to @beorn7's timeline proposal. Having a quick rollback for users that hit other bugs is a valid reason to not jump right to default compression. |
8627a8e
to
63101a0
Compare
LGTM |
63101a0
to
e1eba5f
Compare
Rebased this to fix conflicts. Bump for a second set of eyes: @gouthamve @codesome |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I gave a quick look at the main implementation and it looks good. Just some nits.
I trust @krasi-georgiev's review on the things that I didn't touch.
e1eba5f
to
a5ff5d0
Compare
Thanks for the review @codesome! I made the changes, and rebased once more due to conflicts in |
Ping for one final look from @codesome or @gouthamve since my last updates. |
6ea65a3
to
f32a76f
Compare
Rebased again, @codesome and @krasi-georgiev would love to get this merged soon! |
@csmarchbanks yes I reminder @codesome to have a final look and should get it in in the next few days. |
Need to add an entry in CHANGELOG. |
In running Prometheus instances, compressing the records was shown to reduce disk usage by half while incurring a negligible CPU cost. Signed-off-by: Chris Marchbanks <[email protected]>
f32a76f
to
86c905c
Compare
Thanks @codesome! I have added a CHANGELOG entry. |
Thanks, hopefully we didn't brake anything with this compression 😺 |
@csmarchbanks I just remembered that we didn't add a test to check the behaviour when starting Prometheus with an existing uncompressed wal. In other words make sure it handles well a WAL file that contains compressed and uncompressed records |
@krasi-georgiev Since #608 I don't think there will ever be a mixed WAL file. Also, since the reader is the same for a compressed or uncompressed WAL file (it looks at whether each record is compressed or not), the code path should be tested whenever we test with compress=false. If you think it is worth an explicit test I can write one soon though. |
you are right #608 is enough. Lets leave as is. |
Compress the WAL records using snappy. Running in test clusters shows that the WAL is half the size with minimal increase in CPU.
Benchmarks:
2012 macbook pro, SATA flash drive
5th Gen Lenovo X1 Carbon, NVMe flash drive
Closes #599