fix(tsm1): "snapshot in progress" error during backup #19869
When an InfluxDB database is very busy writing new points, the backup process can fail because it cannot write a new snapshot. The error is: operation timed out with error: create snapshot: snapshot in progress. This happens because the high rate of ingested points causes InfluxDB to snapshot the cache almost continuously.
The fix for this was #16627, but it was for OSS only and was not in the code path for backups in clusters.
This fix adds a skipCacheOk flag to tsdb.Engine.CreateSnapshot(). A value of true allows the backup to proceed even if a cache snapshot cannot be taken. This flag is set to true in tsm1.Engine.Backup(), the OSS backup code path, and in tsdb.Shard.CreateSnapshot(), the cluster backup code path. This flag is set to false in tsm1.Engine.Export().
influxdata/plutonium#3227
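The behavior described above can be illustrated with a minimal, self-contained sketch. It is not the code from this PR: the `engine` type, the `cacheBusy` field, the `writeCacheSnapshot` helper, and the returned path are hypothetical stand-ins, and only the `skipCacheOk` parameter name and the "snapshot in progress" error text come from the description.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrSnapshotInProgress mirrors the error text quoted in the description.
var ErrSnapshotInProgress = errors.New("snapshot in progress")

// engine is a hypothetical stand-in for the storage engine.
type engine struct {
	cacheBusy bool // true while another cache snapshot is already running
}

// writeCacheSnapshot is a hypothetical helper that fails when a cache
// snapshot is already underway.
func (e *engine) writeCacheSnapshot() error {
	if e.cacheBusy {
		return ErrSnapshotInProgress
	}
	return nil
}

// CreateSnapshot tries to snapshot the cache first. When skipCacheOk is
// true, a busy cache is tolerated and the snapshot proceeds without the
// cache contents; any other error is still returned.
func (e *engine) CreateSnapshot(skipCacheOk bool) (string, error) {
	if err := e.writeCacheSnapshot(); err != nil {
		if !(skipCacheOk && errors.Is(err, ErrSnapshotInProgress)) {
			return "", fmt.Errorf("create snapshot: %w", err)
		}
		// Cache is busy, but the caller accepts a snapshot without it.
	}
	return "/tmp/snapshot", nil // hypothetical snapshot directory
}

func main() {
	busy := &engine{cacheBusy: true}

	// Export-style caller: passes false, so a busy cache is an error.
	_, err := busy.CreateSnapshot(false)
	fmt.Println("skipCacheOk=false:", err)

	// Backup-style caller: passes true, so the snapshot proceeds.
	dir, err := busy.CreateSnapshot(true)
	fmt.Println("skipCacheOk=true:", dir, err)
}
```

In this sketch the trade-off is explicit at each call site: a caller that passes `true` accepts that points still in the cache may be missing from the snapshot, while a caller that passes `false` keeps the stricter behavior.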
fb0c42f to 23be20b
Great first PR. I'd just reconsider passing a bool to CreateSnapshot().
On the right track. One nit and one question to consider.
Also need to add a unit test for this new behavior.
This fix adds a skipCacheOk flag to tsdb.Store.CreateShardSnapshot() and tsdb.Shard.CreateSnapshot(), which pass it to tsdb.Engine.CreateSnapshot(). A value of true allows the backup to proceed even if a cache snapshot cannot be taken. This flag is set to true in tsm1.Engine.Backup(), the OSS backup code path. This flag is set to false in tsm1.Engine.Export(). influxdata/plutonium#3227
Test the skipCacheOk flag to tsdb.Shard.CreateSnapshot() and tsdb.Engine.CreateSnapshot(). A value of true allows the backup to proceed even if a cache snapshot cannot be taken. influxdata/plutonium#3227
LGTM!
lgtm