Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Post restore reset #545

Merged
merged 11 commits into from
Mar 17, 2023
Merged

Post restore reset #545

merged 11 commits into from
Mar 17, 2023

Conversation

mpalmi
Copy link
Contributor

@mpalmi mpalmi commented Mar 10, 2023

This PR introduces an interface which acts as a handler for a leaky
abstraction in the structure of underlying log stores. In order to
properly handle post-snapshot-restore cleanup for log stores
generically, we need some awareness of whether the underlying store
permits gaps.

Boltdb allows for gaps in log store indexes, but to handle them it
requires a freelist, which is written on every commit. This is costly,
particularly when the freelist is large. By completely resetting the
LogStore after snapshot restore, we grow the size of the freelist, which would result in performance degradation.

The MonotonicLogStore interface is implemented by LogStores with
guarantees of sequential/monotonic indexes, like raft-wal, but reverts
to the old behavior for for LogStores with index holes, like Boltdb.

The interface also requires special handling within LogStore wrappers (like LogCache), to ensure that the type assertion is passed to the underlying store.

We then use the MonotonicLogStore type assertion to delete all
entries from the LogStore after snapshot restore.

log_cache.go Outdated Show resolved Hide resolved
testing.go Outdated Show resolved Hide resolved
raft.go Outdated Show resolved Hide resolved
raft.go Outdated Show resolved Hide resolved
raft_test.go Outdated Show resolved Hide resolved
snapshot.go Outdated Show resolved Hide resolved
snapshot.go Outdated Show resolved Hide resolved
snapshot.go Outdated Show resolved Hide resolved
raft_test.go Outdated Show resolved Hide resolved
This commit introduces an interface which acts as a handler for a leaky
abstraction in the structure of underlying log stores. In order to
properly handle post-snapshot-restore cleanup for log stores
generically, we need some awareness of whether the underlying store
permits gaps.

Boltdb allows for gaps in log store indices, but to handle them it
requires a freelist, which is written on every commit. This is costly,
particularly when the freelist is large. By completely resetting the
LogStore after snapshot, we grow the size of the freelist, which would
result in performance degradation.

The MonotonicLogStore interface is implemented by LogStores with
guarantees of sequential/monotonic indices, like raft-wal, but reverts
to the old behavior for boltdb.

This also requires special handling within LogStore wrappers (like
LogCache), to ensure that the type assertion is passed to the underlying
store.
This commit makes use MonotonicLogStore type assertion to delete all
entries from the LogStore after snapshot restore.
Copy link
Member

@banks banks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mpalmi Nice. This is also super close, but as it stands it has a huge issue we should fix!

In the last changes we added the log clearing code into the startup restore which is wrong and will delete loose committed data!

Let me know if it's not clear why that is or if I've misunderstood something here - restoring the on-disk snapshot is very different and not at all problematic. We are only trying to fix the problem of restoring from an external snapshot which has the effect of making all the current state invalid and useless!

log.go Outdated Show resolved Hide resolved
log.go Outdated Show resolved Hide resolved
raft_test.go Outdated Show resolved Hide resolved
api.go Outdated Show resolved Hide resolved
@jmurret jmurret marked this pull request as ready for review March 15, 2023 21:11
@jmurret jmurret requested a review from a team as a code owner March 15, 2023 21:11
Copy link
Member

@jmurret jmurret left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work here by all!

raft_test.go Outdated Show resolved Hide resolved
Copy link
Contributor Author

@mpalmi mpalmi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! Just a couple of questions that are not showstoppers, but one is related to a comment from @banks, so might be worth addressing.

raft_test.go Outdated Show resolved Hide resolved
raft_test.go Outdated Show resolved Hide resolved
Copy link
Member

@banks banks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks @jmurret.

I had some minor comment suggestions and a test naming suggestion in line but nothing blocking.

@dhiaayachi do you want to do a final pass at this?

I had this branch running against Consul's restore integration test yesterday so I feel happy we resolved the issues here (in combination with hashicorp/raft-wal#24),

log.go Outdated Show resolved Hide resolved
log.go Outdated Show resolved Hide resolved
raft_test.go Outdated Show resolved Hide resolved
Copy link
Contributor

@dhiaayachi dhiaayachi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great Job All!!
I added a nit but I'm ok with merging this as is.

raft.go Show resolved Hide resolved
@jmurret jmurret merged commit d3128ae into main Mar 17, 2023
@jmurret jmurret deleted the post-restore-reset branch March 17, 2023 18:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants