Support for numbered filesets #1720
Conversation
Could you add a sanity check to the seeker/reader so that when they read the info file they make sure the volume number in the info file matches what we expect it to be? That would give us some extra assurance that our logic is right.
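For illustration, a minimal sketch of what such a sanity check could look like when the reader/seeker decodes the info file; the function name, parameter names, and error message below are assumptions, not the PR's actual code:

```go
package fs

import "fmt"

// Hypothetical sanity check: compare the volume index decoded from the info
// file against the volume the caller asked to open. Names are illustrative.
func validateInfoVolumeIndex(infoVolumeIndex, expectedVolumeIndex int) error {
	if infoVolumeIndex != expectedVolumeIndex {
		return fmt.Errorf(
			"info file volume index (%d) does not match expected volume index (%d)",
			infoVolumeIndex, expectedVolumeIndex)
	}
	return nil
}
```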
// and error if neither exist.
func isFirstVolumeLegacy(prefix string, t time.Time, suffix string) (bool, error) {
	path := filesetPathFromTimeAndIndex(prefix, t, 0, suffix)
	_, err := os.Stat(path)
You need to handle the os.IsNotExist error separately from other types of errors (see https://stackoverflow.com/questions/12518876/how-to-check-if-a-file-exists-in-go), i.e.
if os.IsNotExist(err) {
	return false, nil
}
if err != nil {
	return false, err
}
...
Yeah, I saw that, but I'd argue that we don't really care about the other err cases. Either one of the two types of files exists or neither of them does, in which case that's the error we propagate down.
Yeah, I guess it's fine because of how you're using the function (there is an implicit assumption that a file should exist, regardless of whether it's legacy or not). Maybe just add that assumption as a comment above the function.
Will do 👍
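For reference, a sketch of how the function could read once that assumption is documented, combining the diff quoted above with the agreed-upon comment (the doc-comment wording is an assumption; the path helpers and os/time imports come from the quoted diff):

```go
// isFirstVolumeLegacy returns whether the first volume for the given block
// start uses the legacy (non-indexed) filename. It assumes that a first
// volume exists in one format or the other; if neither file exists, the
// os.Stat error from the legacy path is returned to the caller.
func isFirstVolumeLegacy(prefix string, t time.Time, suffix string) (bool, error) {
	// Prefer the new, volume-indexed filename if it exists on disk.
	path := filesetPathFromTimeAndIndex(prefix, t, 0, suffix)
	if _, err := os.Stat(path); err == nil {
		return false, nil
	}

	// Otherwise fall back to the legacy filename, propagating its Stat error
	// when neither file exists.
	legacyPath := filesetPathFromTimeLegacy(prefix, t, suffix)
	_, err := os.Stat(legacyPath)
	if err == nil {
		return true, nil
	}
	return false, err
}
```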
Codecov Report
@@ Coverage Diff @@
## master #1720 +/- ##
========================================
+ Coverage 71.9% 71.9% +<.1%
========================================
Files 982 982
Lines 82097 82216 +119
========================================
+ Hits 59093 59189 +96
- Misses 19107 19124 +17
- Partials 3897 3903 +6
Continue to review full report at Codecov.
	legacyPath := filesetPathFromTimeLegacy(prefix, t, suffix)
	_, err = os.Stat(legacyPath)
	if err == nil {
		return true, nil
Do you also need to check for a completed checkpoint file? (Not sure who relies on this path.)
This is a good point. This function should be used solely to figure out whether the first volume is legacy or not before we actually use the files, so checking for a complete checkpoint file should come after this function is called. I'll add a comment.
I assume it's fine because this is only used by the reader to determine what code path to follow, and if the checkpoint file is not complete the reader will figure that out pretty quickly and error out?
Yep, correct.
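To make the sequencing concrete: the caller first picks the naming scheme, then relies on a checkpoint-completeness check before trusting the fileset. A purely illustrative sketch of such a check; the helper name and the size-based heuristic are assumptions, not the repo's actual validation:

```go
package fs

import "os"

// checkpointFileLooksComplete is a hypothetical helper: it treats a checkpoint
// file as complete only when it exists and has the expected size. The real
// reader performs its own validation and errors out on incomplete checkpoints.
func checkpointFileLooksComplete(checkpointPath string, expectedSize int64) (bool, error) {
	info, err := os.Stat(checkpointPath)
	if os.IsNotExist(err) {
		return false, nil
	}
	if err != nil {
		return false, err
	}
	return info.Size() == expectedSize, nil
}
```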
// files does not use the volume index provided in the prepareOpts.
// Instead, the new volume index is determined by looking at what files
// exist on disk. This means that there can never be a conflict when
// trying to write new snapshot files.
Kind of strange, but why do snapshots use one mechanism (i.e. increment the new volume index by looking at files on disk) while fileset volumes use a different one? Richie alluded that it was "hard", but it would be interesting to hear what the difficulty was with just making them use similar/same code paths.
Yeah, I did kind of a poor job explaining this. There are a couple of subtle complexities we encountered, but basically we realized that we need to maintain some kind of relationship between the flush version and the volume number, for a variety of reasons. The one that immediately comes to mind: the SeekerManager does a lot of checks to see whether specific fileset files exist, and in order to do that efficiently it needs to be able to go from namespace/shard/blockStart/volume -> expected filename.
Once we decided that we wanted to maintain that relationship, the "simplest" thing is to ensure that the flush version and the volume number are always the same. Technically we could do the +1 logic and still maintain this guarantee, but that would make it implicit instead of explicit, and a brittle assumption. Basically, instead of having two sources of truth (flush version in memory and volume numbers on disk), we only have one, and if they're ever out of sync we'll at least get errors as opposed to silent successes.
So we decided to just tell the PersistManager / writer which volume we intend to create, and that has some nice properties: if our logic is wrong, the PersistManager can error out and tell us that the volume already exists (as opposed to it just incrementing by 1 and subtly breaking our assumptions). This is also nice because it forces users of the PersistManager to think about what volume they're writing and whether or not that's valid.
Since we took this route we do need to bootstrap the flush version back into the shard, but that's a trivial change since the shard already calls ForEachInfoFile, so all we need to do is pull the volume number out of the info metadata and into the shard's in-memory state.
@justinjc Do you remember any other details? We spent a few hours working through all this and I feel like I may be missing some of the other reasons we couldn't do the incrementing.
@robskillington Thinking about this now though, it seems like it might be good to add a check in the persist manager that the volume being written is at least as high as the highest existing complete fileset file. I.e. if complete versions 1, 2, and 3 exist, then writing 3 (if deleteIfExist=true) and 4 should be allowed, but writing 2 should not. What do you think @justinjc? Could also add it to your list of follow-ups.
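To make the "caller names the volume explicitly" idea concrete, a sketch under assumed names (the option struct, helper, and error message below are illustrative, not the exact m3db API):

```go
package fs

import (
	"fmt"
	"os"
)

// prepareFlushOptions is an illustrative stand-in for the real prepare options:
// the caller states exactly which volume it intends to write.
type prepareFlushOptions struct {
	VolumeIndex    int
	DeleteIfExists bool
}

// prepareVolume errors out if the requested volume already exists on disk and
// the caller did not ask for it to be replaced, instead of silently bumping
// the volume number. filesetPath is a hypothetical pre-computed path.
func prepareVolume(filesetPath string, opts prepareFlushOptions) error {
	if _, err := os.Stat(filesetPath); err == nil && !opts.DeleteIfExists {
		return fmt.Errorf("volume %d already exists at %s", opts.VolumeIndex, filesetPath)
	}
	return nil
}
```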
We could also support a flag in the PersistManager that basically tells it to figure out the volume number itself, which we would probably need if we wanted to allow incremental peer bootstrapping and flushing to occur in parallel, but I'm not sure we want to go with that approach long term (see my other comment below).
@justinjc just pinging for your thoughts on that potential follow-up.
Thanks for the explanation here @richardartoul, you summed it up pretty well. In short, we wanted an efficient way to figure out what the correct volume is for a particular namespace/shard/block, and having to go to disk, parse filenames, and then order them each time seemed significantly slower.
I'm not a big fan of the persist manager having the volume number check. The check would similarly involve looking for files in the specified directory, parsing file names, and ordering them in order to figure that out, which amounts to additional logic that's essentially double-checking that other logic isn't buggy. Also, from a separation of concerns point of view, I don't think the persist manager should need to know that fileset volumes are always increasing. Its job should just be to persist things you tell it to persist. It should be the caller of the persist manager that ensures the volume number it passes is correct.
DeleteIfExists bool
// This volume index is only used when preparing for a flush fileset type.
// When opening a snapshot, the new volume index is determined by looking
// at what files exist on disk.
Again, it does seem strange that they don't both use the same strategy.
Yeah, agreed it's a little weird. Justin's initial implementation did do the +1 strategy, but we ran into some sharp edges with that approach that pushed us in the other direction.
Looking mostly good!
// bootstrap, then the bootstrap volume index will be >0. In order
// to support this, bootstrapping code will need to incorporate
// merging logic and flush version/volume index will need to be
// synchronized between processes.
Just a thought: I assume if we ever want to do repairs we might also need to support volume index >0 for bootstrapping. (If an "online" peer bootstrapping is the process that repairs).
@robskillington The designed repair feature won't use the peers bootstrapper at all, right? Re: incremental bootstrap, what I've been thinking about recently is to eventually move away from having incremental in the peers bootstrapper and instead combine cold flushing + streaming bootstrap to get the same effect.
The alternative approach (where we support index > 0 and then basically do merging from there) would also work, but I think it would make allowing M3DB to flush while it's bootstrapping really difficult (which is a place I think we want to get to long term) because there would have to be coordination of the flush versions there.
LGTM aside from the nits.
Let's try to get a stamp from @robskillington too.
src/dbnode/persist/fs/files.go
Outdated
// delimeters found, to be used in conjunction with the intComponentFromFilename
// function to extract filename components. This function is deliberately
// optimized for speed and lack of allocations, since filename parsing is known
// to account for a significant proportion of system resources.
"since allocation-heavy filename parsing can quickly become the large source of allocations in the entire system, especially when namespaces with long retentions are configured."
// function to extract filename components. This function is deliberately
// optimized for speed and lack of allocations, since filename parsing is known
// to account for a significant proportion of system resources.
func delimiterPositions(baseFilename string) ([maxDelimNum]int, int) {
niiiice
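For readers without the full diff, a sketch of the shape such a function could take; the '-' separator and the maxDelimNum value are assumptions based on the quoted comment, and the real implementation in files.go may differ:

```go
package fs

const (
	separator   = '-' // assumed filename component delimiter
	maxDelimNum = 4   // assumed maximum number of delimiters of interest
)

// delimiterPositions records the index of each delimiter in baseFilename into
// a fixed-size array (avoiding allocations) and returns how many were found.
func delimiterPositions(baseFilename string) ([maxDelimNum]int, int) {
	var (
		positions [maxDelimNum]int
		numDelims int
	)
	for i := 0; i < len(baseFilename); i++ {
		if baseFilename[i] == separator {
			positions[numDelims] = i
			numDelims++
			if numDelims == maxDelimNum {
				break
			}
		}
	}
	return positions, numDelims
}
```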
src/dbnode/persist/fs/files.go
Outdated
// delimeters. Our only use cases for this involve extracting numeric
// components, so this function assumes this and returns the component as an
// int64.
func intComponentFromFilename(
How do you feel about "componentAtIndex" as the name? The current name gave me the wrong impression until I read the comment.
Sure, I'll call it intComponentAtIndex, because I do want to convey that it deals with ints only.
src/dbnode/persist/fs/files.go
Outdated
	componentPos int,
	delimPos [maxDelimNum]int,
) (int64, error) {
	var start int
Can you do var start = 0? That would have given me a nudge in the right direction to understand how you're using the positions.
Yeah makes sense.
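Continuing the sketch above (so it reuses maxDelimNum from there and strconv from the standard library), here is how the positions might be consumed with the explicit start initialisation; the parameter order and parsing details are assumptions, not the final code:

```go
// intComponentAtIndex extracts the numeric component that sits after the
// delimiter at componentPos-1 (or at the start of the filename for the first
// component) and before the delimiter at componentPos, returning it as int64.
// It assumes componentPos indexes a delimiter that was actually found.
func intComponentAtIndex(
	baseFilename string,
	componentPos int,
	delimPos [maxDelimNum]int,
) (int64, error) {
	var start = 0 // explicit so it is clear the first component starts at offset 0
	if componentPos > 0 {
		start = delimPos[componentPos-1] + 1
	}
	end := delimPos[componentPos]
	return strconv.ParseInt(baseFilename[start:end], 10, 64)
}
```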
src/dbnode/storage/shard.go
Outdated
// when we attempt to flush and a fileset already exists unless there is
// racing competing processes.
// filesets exist at bootstrap time so we should never encounter a time
// when we attempt to flush and a fileset that already exists unless
This extra "that" seems wrong.
Reworded this.
src/dbnode/storage/shard.go
Outdated
// racing competing processes.
// filesets exist at bootstrap time so we should never encounter a time
// when we attempt to flush and a fileset that already exists unless
// there are racing competing processes.
"there is a bug in the code" or something
Haha sure.
src/dbnode/storage/shard.go
Outdated
@@ -2006,7 +2019,7 @@ func (s *dbShard) ColdFlush(

	// After writing the full block successfully, update the cold version
	// in the flush state.
-	nextVersion := s.RetrievableBlockColdVersion(startTime) + 1
+	nextVersion := coldVersion + 1
Yeah, so there is +1 logic both here and in the Merge() function. I'd prefer we figure out the next version here and then tell the Merge() function to use that.
Sure.
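A rough sketch of that change: compute the next cold version once in ColdFlush and hand it to the merge step, rather than repeating the +1 in both places. merger.Merge and setFlushStateColdVersion below are stand-ins for the real calls, not the actual method names:

```go
// Sketch only: one source of the +1 logic. coldVersion is the current cold
// version for this block start; the merger receives the incremented value.
nextVersion := coldVersion + 1
if err := merger.Merge(startTime, nextVersion); err != nil {
	return err
}
// After a successful merge, record the version we actually wrote.
s.setFlushStateColdVersion(startTime, nextVersion)
```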
blockStates := s.BlockStatesSnapshot()
// Filter without allocating by using same backing array.
compacted := filesets[:0]
for _, datafile := range filesets {
Maybe add a comment saying this is safe because cleanup and flushing never happen in parallel, so the snapshot will never get stale.
Also, do we have both BlockStateSnapshot() and FlushState()? Do those two duplicate each other?
Sure, I'll add a comment. Also, since versions are always increasing, even if we get a stale snapshot we just don't clean up some files that could have been cleaned (but we'll clean them up eventually next time round).
BlockStatesSnapshot gets the flush states for all block starts, while FlushState is for a specific block start. We wouldn't want to loop through all blocks and keep acquiring and releasing locks (mentioned in a comment above).
src/dbnode/storage/shard.go
Outdated
// locks in a tight loop below.
blockStates := s.BlockStatesSnapshot()
// Filter without allocating by using same backing array.
compacted := filesets[:0]
I'd probably call this toDelete or something.
Also, this is sketching me out lol. Can we just allocate a new slice? This only happens once per shard, so it's not so bad.
Alllllright.
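Putting the two suggestions together (a freshly allocated slice and the clearer name), the cleanup filter might end up looking something like this. This is a sketch based on the snippet quoted above; fs.FileSetFile follows the repo's types as far as I can tell, while isSafeToDelete is a hypothetical stand-in for the real block-state check:

```go
// Snapshot the flush states once so we don't repeatedly acquire and release
// locks in the loop below. This is safe because cleanup and flushing never
// run in parallel; at worst a stale snapshot defers some deletions to the
// next cleanup pass.
blockStates := s.BlockStatesSnapshot()

// Allocate a fresh slice instead of filtering in place over `filesets`.
toDelete := make([]fs.FileSetFile, 0, len(filesets))
for _, datafile := range filesets {
	// isSafeToDelete is a hypothetical stand-in for the real check against
	// blockStates (e.g. the fileset's volume has been superseded by a newer
	// cold flush version).
	if isSafeToDelete(blockStates, datafile) {
		toDelete = append(toDelete, datafile)
	}
}
```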
What this PR does / why we need it:
With cold flushes, we need to support having multiple filesets per block. This PR introduces volume numbers to filesets so that new filesets will be written with an index as part of the filename. In order to facilitate a smooth migration, there is additional logic for gracefully handling legacy (non-indexed) filesets.
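As a purely hypothetical illustration of the naming change (the exact paths are not taken from the PR; the real layout comes from filesetPathFromTimeLegacy / filesetPathFromTimeAndIndex in files.go):

```go
// Legacy (no volume index):  .../data/<namespace>/<shard>/fileset-<blockStart>-data.db
// Numbered (volume index 0): .../data/<namespace>/<shard>/fileset-<blockStart>-0-data.db
// Numbered (volume index 1): .../data/<namespace>/<shard>/fileset-<blockStart>-1-data.db
```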
Special notes for your reviewer:
Does this PR introduce a user-facing and/or backwards incompatible change?:
Does this PR require updating code package or user-facing documentation?: