-
Notifications
You must be signed in to change notification settings - Fork 471
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
internal/manifest: Introduce L0SubLevels data structure #614
Conversation
fee6a52
to
cbbeb50
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am going to mainly lurk in this review, since I know the code. I suspect it will need more code comments to be easier to digest. Here are some initial comments.
Reviewed 1 of 5 files at r1.
Reviewable status: 0 of 5 files reviewed, 17 unresolved discussions (waiting on @itsbilal, @jbowens, @petermattis, and @sumeerbhola)
internal/manifest/l0_sublevels.go, line 17 at r1 (raw file):
// TODO(sbhola): work items: // - Integration with Pebble (probably as a followup to #546) // - TPCC import experiments
can you change the first todo to TODO(bilal) and remove the second one.
internal/manifest/l0_sublevels.go, line 23 at r1 (raw file):
// occurs because the union of file boundary keys is used to pick intervals. // However the largest key in a file is not inclusive, so when it is used as // an interval, the actual key is ImmediateSuccessor(key). We don't have the
This is likely to be confusing, so before the We don't have ...
sentence can you add an example like the following:
Consider two files with boundaries, f1 [a, c], f2 [b, f]. The interval boundary keys are a, b, IS(c), IS(f), which makes intervals [a, b), [b, IS(c)), [IS(c), IS(f)), where f1 perfectly overlaps with the the first two intervals and f2 with the second two intervals.
internal/manifest/l0_sublevels.go, line 85 at r1 (raw file):
} // A wrapper around FileMetadata.
I think Peter wanted these extra fields folded into FileMetadata
, because when integrating this we'll want L0SubLevels
to return [][]*FileMetadata
( for integrating with the read path). Unlike the prototype which is maintaining both in Version
.
internal/manifest/l0_sublevels.go, line 104 at r1 (raw file):
// A key interval of the form [start, end). The end is not represented here since it // is implicit in the start of the next interval. The last interval is an exception // but we don't need to ever lookup the end of that interval. The set of intervals
I don't think we even need that last interval -- it will have 0 files. That will get rid of this confusing comment.
internal/manifest/l0_sublevels.go, line 115 at r1 (raw file):
// sublevels in this interval than files undergoing compaction, which may // allow for this interval to participate in another compaction, but for now // we eschew such complexity).
as far as I know, this seqnum 0 case only occurs in RocksDB, so perhaps this comment is not relevant, since ongoing compactions are irrelevant for compatibility.
internal/manifest/l0_sublevels.go, line 131 at r1 (raw file):
// to true. Note that this is a pessimistic filter: the file that widened // [internalFilesMinIntervalIndex, internalFilesMaxIntervalIndex] may be at a high // sub-level and may not need to be included in the compaction.
I think this comment could be rephrased and one can include an example like the one I used in the walkthrough. The issue here is that we stop once we have a viable candidate (we really can't afford to consider all intervals as candidates, at every picking request), so we want to first consider candidates further away from ongoing compactions, since they are more likely to give us a deeper compaction.
internal/manifest/l0_sublevels.go, line 179 at r1 (raw file):
_ = log.Output(2, fmt.Sprintf(format, args...)) os.Exit(1) }
I just copy-pasted this logging stuff quickly in the prototype -- there may be a more principled solution.
internal/manifest/l0_sublevels.go, line 187 at r1 (raw file):
// sublevels having newer keys as opposed to lower indexed levels. // // There is no limit to the number of sublevels that can exist in L0 at any time
nit: comma after "time".
internal/manifest/l0_sublevels.go, line 209 at r1 (raw file):
flushSplitUserKeys [][]byte flushSplitFraction []float64 flushSplitIntervalIndex []int
I was using flushSplitFraction
and flushSplitIntervalIndex
to print to log, to figure out the root cause of wide files. They didn't indicate anything out of the ordinary, so I'd suggest removing them.
internal/manifest/l0_sublevels.go, line 212 at r1 (raw file):
// for debugging CompactingFilesAtCreation []*FileMetadata
This can be removed to -- it was a temporary debugging aid.
internal/manifest/l0_sublevels.go, line 469 at r1 (raw file):
// Only for temporary debugging in the absence of proper tests. // // TODO(itsbilal): Remove this after integration is complete.
I suspect we'll want to run with checking turned on in production -- it is better to fail before a compaction corrupts the LSM invariants.
internal/manifest/l0_sublevels.go, line 585 at r1 (raw file):
// UpdateStateForStartedCompaction updates internal L0SubLevels state for a // recently started compaction. isBase specifies if this is a base compaction; // if false, this is assumed to be an L0 compaction. The specified compaction
s/L0/intra-L0/
internal/manifest/l0_sublevels.go, line 701 at r1 (raw file):
// first interval we can construct a compaction for is the compaction that is // started. There can be multiple heuristics in choosing the ordering of the // intervals -- the code uses one heuristic, but experimentation is probably
... uses one heuristic that worked well for a large TPCC import, but more experimentation is probably needed.
internal/manifest/l0_sublevels.go, line 721 at r1 (raw file):
// shape. But this non-overlapping sequence number is already relaxed in // RocksDB -- sstables are primarily ordered by their largest sequence
do we need a TODO to confirm this assertion?
internal/manifest/l0_sublevels.go, line 892 at r1 (raw file):
// optional activity so when it fails we can fallback to the last // successful candidate. Currently the code keeps adding until it can't // add more, but we could optionally stop based on
the "Currently ..." comment is stale.
internal/manifest/l0_sublevels.go, line 934 at r1 (raw file):
if lastCandidate.seedIntervalStackDepthReduction >= minCompactionDepth && cFiles.fileBytes > 100<<20 && (float64(cFiles.fileBytes)/float64(lastCandidate.fileBytes) > 1.5 || cFiles.fileBytes > 500<<20) {
the heuristic in this line could use a comment -- I pulled the 1.5 constant out of thin air. The goal was to grow if the increase rate was low enough (which is the cost of decreasing stack depth by 1), until we hit the high threshold.
internal/manifest/l0_sublevels.go, line 1315 at r1 (raw file):
// We pick the longest sequence between firstIndex // and lastIndex of non-compacting files -- this is represented by // [candidateNonCompactingFirst, candidateNonCompactingLast].
The logic here is tricky and can probably use more comments. Specifically, the "longest sequence" comment is incomplete -- if we have already picked some files in this sublevel for this compaction, the sequence we use has to be the one that includes those already picked files, even if it is not the longest one. This is the reason for candidateHasAlreadyPickedFiles
below.
I wouldn't be surprised if there is still a bug lurking in extendCandidateToRectangle
-- it was a source of multiple bugs during my experiments.
Split off the compaction picking / generating methods into #615. Thanks for the review @sumeerbhola , I'll update both PRs with that soon. |
cd82ce8
to
263dccc
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 5 files reviewed, 16 unresolved discussions (waiting on @jbowens, @petermattis, and @sumeerbhola)
internal/manifest/l0_sublevels.go, line 17 at r1 (raw file):
Previously, sumeerbhola wrote…
can you change the first todo to TODO(bilal) and remove the second one.
Done.
internal/manifest/l0_sublevels.go, line 23 at r1 (raw file):
Previously, sumeerbhola wrote…
This is likely to be confusing, so before the
We don't have ...
sentence can you add an example like the following:
Consider two files with boundaries, f1 [a, c], f2 [b, f]. The interval boundary keys are a, b, IS(c), IS(f), which makes intervals [a, b), [b, IS(c)), [IS(c), IS(f)), where f1 perfectly overlaps with the the first two intervals and f2 with the second two intervals.
Done.
internal/manifest/l0_sublevels.go, line 85 at r1 (raw file):
Previously, sumeerbhola wrote…
I think Peter wanted these extra fields folded into
FileMetadata
, because when integrating this we'll wantL0SubLevels
to return[][]*FileMetadata
( for integrating with the read path). Unlike the prototype which is maintaining both inVersion
.
Done. That doesn't involve removing this struct from Version
though, right? Since we'd still need extra info here (that's not in FileMetadata
) to pick compactions.
internal/manifest/l0_sublevels.go, line 104 at r1 (raw file):
Previously, sumeerbhola wrote…
I don't think we even need that last interval -- it will have 0 files. That will get rid of this confusing comment.
We still need it to know the end of the interval before it. I clarified this commend with that info.
internal/manifest/l0_sublevels.go, line 115 at r1 (raw file):
Previously, sumeerbhola wrote…
as far as I know, this seqnum 0 case only occurs in RocksDB, so perhaps this comment is not relevant, since ongoing compactions are irrelevant for compatibility.
Done.
internal/manifest/l0_sublevels.go, line 131 at r1 (raw file):
Previously, sumeerbhola wrote…
I think this comment could be rephrased and one can include an example like the one I used in the walkthrough. The issue here is that we stop once we have a viable candidate (we really can't afford to consider all intervals as candidates, at every picking request), so we want to first consider candidates further away from ongoing compactions, since they are more likely to give us a deeper compaction.
Done.
internal/manifest/l0_sublevels.go, line 179 at r1 (raw file):
Previously, sumeerbhola wrote…
I just copy-pasted this logging stuff quickly in the prototype -- there may be a more principled solution.
Done. Moved to returning errors instead, unless it's a good candidate for a panic.
internal/manifest/l0_sublevels.go, line 209 at r1 (raw file):
Previously, sumeerbhola wrote…
I was using
flushSplitFraction
andflushSplitIntervalIndex
to print to log, to figure out the root cause of wide files. They didn't indicate anything out of the ordinary, so I'd suggest removing them.
Done.
internal/manifest/l0_sublevels.go, line 212 at r1 (raw file):
Previously, sumeerbhola wrote…
This can be removed to -- it was a temporary debugging aid.
Done.
internal/manifest/l0_sublevels.go, line 469 at r1 (raw file):
Previously, sumeerbhola wrote…
I suspect we'll want to run with checking turned on in production -- it is better to fail before a compaction corrupts the LSM invariants.
Done. Updated the TODO with this instead. (All comments after this point were addressed in #615, not here).
internal/manifest/l0_sublevels.go, line 585 at r1 (raw file):
Previously, sumeerbhola wrote…
s/L0/intra-L0/
Done.
internal/manifest/l0_sublevels.go, line 701 at r1 (raw file):
Previously, sumeerbhola wrote…
... uses one heuristic that worked well for a large TPCC import, but more experimentation is probably needed.
Done.
internal/manifest/l0_sublevels.go, line 721 at r1 (raw file):
Previously, sumeerbhola wrote…
// shape. But this non-overlapping sequence number is already relaxed in // RocksDB -- sstables are primarily ordered by their largest sequence
do we need a TODO to confirm this assertion?
That's a separate, general task that I have planned for later on - confirm if RocksDB is able to read a split flush under this sort of a setup. Don't think it warrants a todo in Pebble.
internal/manifest/l0_sublevels.go, line 892 at r1 (raw file):
Previously, sumeerbhola wrote…
the "Currently ..." comment is stale.
Done.
internal/manifest/l0_sublevels.go, line 934 at r1 (raw file):
Previously, sumeerbhola wrote…
the heuristic in this line could use a comment -- I pulled the 1.5 constant out of thin air. The goal was to grow if the increase rate was low enough (which is the cost of decreasing stack depth by 1), until we hit the high threshold.
Done.
internal/manifest/l0_sublevels.go, line 1315 at r1 (raw file):
Previously, sumeerbhola wrote…
The logic here is tricky and can probably use more comments. Specifically, the "longest sequence" comment is incomplete -- if we have already picked some files in this sublevel for this compaction, the sequence we use has to be the one that includes those already picked files, even if it is not the longest one. This is the reason for
candidateHasAlreadyPickedFiles
below.I wouldn't be surprised if there is still a bug lurking in
extendCandidateToRectangle
-- it was a source of multiple bugs during my experiments.
This is a general TODO for the next PR (#615) then, I guess. Ill go through this code again and add more comments, thanks for the suggestion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 5 files reviewed, 33 unresolved discussions (waiting on @itsbilal, @jbowens, @petermattis, and @sumeerbhola)
internal/manifest/l0_sublevels.go, line 1 at r2 (raw file):
package manifest
Missing copyright notice.
internal/manifest/l0_sublevels.go, line 19 at r2 (raw file):
// file overlaps perfectly with a sequence of intervals. This perfect overlap // occurs because the union of file boundary keys is used to pick intervals. // However the largest key in a file is not inclusive, so when it is used as
s/not inclusive/exclusive/g
internal/manifest/l0_sublevels.go, line 37 at r2 (raw file):
// being the index of the start key of the interval. // // In addition to helping with compaction choosing, we use interval indices
s/compaction choosing/compaction picking/g
?
internal/manifest/l0_sublevels.go, line 86 at r2 (raw file):
} } return keys[0 : j+1]
You can omit the 0
here: return keys[:j+1]
.
internal/manifest/l0_sublevels.go, line 207 at r2 (raw file):
) (*L0SubLevels, error) { s := &L0SubLevels{cmp: cmp, format: formatter} s.filesByAge = make([]*FileMetadata, len(files))
It isn't obvious to me why you have to make a copy of files
here. You don't seem to be mutating filesByAge
. You are mutating the *FileDescriptor
. I believe this line could be s.filesByAge = files
. If it can't, please add a comment above the filesByAge
field explaining why.
internal/manifest/l0_sublevels.go, line 211 at r2 (raw file):
for i := range files { s.filesByAge[i] = files[i] files[i].subLevelIndex = i
The subLevelIndex
field isn't used after being set here. Is it needed?
The name of this field is a bit confusing as it isn't the sublevel, but the index of the file in the filesByAge
slice.
internal/manifest/l0_sublevels.go, line 245 at r2 (raw file):
} f.maxIntervalIndex-- interpolatedBytes := f.Size / uint64(f.maxIntervalIndex-f.minIntervalIndex+1)
This estimation is basic. Would we get a benefit from a more accurate estimation of the overlapping bytes in an sstable. This already exists in the sstable/Reader.EstimatedDiskUsage()
function, though there would be a small challenge in plumbing access to this method here and we might need to be wary of the cost of that method.
internal/manifest/l0_sublevels.go, line 249 at r2 (raw file):
subLevel := 0 if f.Compacting { if !f.IsBaseCompacting && !f.IsIntraL0Compacting {
If f.Compacting
is true, f.IsBaseCompacting == !f.IsIntraL0Compacting
. Perhaps you can get rid of one of Is{Base,IntraL0}Compacting
to eliminate the possibility of this erroneous state from ever occurring.
internal/manifest/l0_sublevels.go, line 275 at r2 (raw file):
interval.topOfStackNonCompactingFileCount++ } interval.fileBytes += interpolatedBytes
Perhaps s/fileBytes/estimatedBytes
to remind ourselves that this is an estimate.
internal/manifest/l0_sublevels.go, line 313 at r2 (raw file):
} if cumulativeBytes > flushSplitMaxBytes && (len(s.flushSplitUserKeys) == 0 || !bytes.Equal(interval.startKey.key.UserKey, s.flushSplitUserKeys[len(s.flushSplitUserKeys)-1])) {
I believe the bytes.Equal
comparison here is done to ensure that we're never adding a duplicate key to flushSplitUserKeys
. Could we achieve the same by ensuring that orderedIntervals[i].startKey.key.UserKey
do not have duplicates?
internal/manifest/l0_sublevels.go, line 329 at r2 (raw file):
len(s.filesByAge), len(s.subLevels), len(s.orderedIntervals), len(s.flushSplitUserKeys)) for i := range s.flushSplitUserKeys { fmt.Fprintf(&buf, "%s,",s.flushSplitUserKeys[i])
I think you want to use s.format
for formatting the user keys.
internal/manifest/l0_sublevels_test.go, line 1 at r2 (raw file):
package manifest
Missing copyright comment.
internal/manifest/l0_sublevels_test.go, line 183 at r2 (raw file):
func BenchmarkL0SubLevelsInit(b *testing.B) { v, err := readManifest("testdata/MANIFEST_import")
I suppose this is the primary reason for adding testdata/MANIFEST_import
. That is a hefty file to be adding for benchmark purposes. Could we instead construct a MANIFEST
with similar properties?
internal/manifest/version.go, line 67 at r2 (raw file):
// For L0 files only. Protected by DB.mu. Used to generate L0 sublevels and // pick L0 compactions. IsBaseCompacting bool
Does this field need to be exported?
These fields need some comments. How does IsBaseCompacting
differ from Compacting
? Ditto for IsIntraL0Compacting
?
internal/manifest/testdata/l0_sublevels, line 10 at r2 (raw file):
0009:f.SET.10-i.SET.10 0010:f.SET.11-g.SET.11 L6
What is the point of defining L6
in these tests? As far as I can tell it is never used. Am I missing something?
internal/manifest/testdata/l0_sublevels, line 45 at r2 (raw file):
---- file count: 5, sublevels: 4, intervals: 9, flush keys: 4 split:f,g,h,p,
Nit: add a space after split:
and there shouldn't be a trailing comma.
It is mildly confusing to me to have flush keys
on the previous line and then split
here. I think it might be mildly clearer to use the Go slice format:
split: [f, g, h, p]
If we want to indicate the cardinality, we could do something like: split(4): [f, g, h, p]
.
internal/manifest/testdata/l0_sublevels, line 51 at r2 (raw file):
0.0: file count: 2, bytes: 512, width (mean, max): 1, 2, interval range: [3, 7] compacting file count: 0, base compacting intervals: Sublevel 0: 000006:f#4,1-g#5,1, 000011:n#8,1-p#10,1
There are two inconsistencies here. Above we display the sublevel as 0.3
, 0.2
, etc. Here it is named Sublevel 0
. I think we should use the 0.<sublevel>
format in both places.
The second inconsistency is the order in which the sublevels are displayed.
Should the details of the files in each sublevel be interleaved with the overview above?
Formatting all of the file details in a single line per sublevel will be unreadable if there are a large number of files. Combining all of these points, how about something like:
0.3: file count: 1, bytes: 256, width (mean, max): 4, 4, interval range: [0, 3]
000010:f#11,1-g#11,1
0.2: file count: 1, bytes: 256, width (mean, max): 7, 7, interval range: [1, 7]
000009:f#10,1-p#10,1
0.1: file count: 1, bytes: 256, width (mean, max): 4, 4, interval range: [2, 5]
000005:f#6,1-h#9,1
0.0: file count: 2, bytes: 512, width (mean, max): 1, 2, interval range: [3, 7]
000006:f#4,1-g#5,1
000011:n#8,1-p#10,1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 5 files reviewed, 33 unresolved discussions (waiting on @itsbilal, @jbowens, @petermattis, and @sumeerbhola)
internal/manifest/l0_sublevels.go, line 207 at r2 (raw file):
Previously, petermattis (Peter Mattis) wrote…
It isn't obvious to me why you have to make a copy of
files
here. You don't seem to be mutatingfilesByAge
. You are mutating the*FileDescriptor
. I believe this line could bes.filesByAge = files
. If it can't, please add a comment above thefilesByAge
field explaining why.
filesByAge
used to be []fileMeta
which is why the prototype code needed to copy.
internal/manifest/l0_sublevels.go, line 211 at r2 (raw file):
Previously, petermattis (Peter Mattis) wrote…
The
subLevelIndex
field isn't used after being set here. Is it needed?The name of this field is a bit confusing as it isn't the sublevel, but the index of the file in the
filesByAge
slice.
It used to be called fileMeta.index
. It is extensively used in the compaction picking logic.
internal/manifest/l0_sublevels.go, line 245 at r2 (raw file):
Previously, petermattis (Peter Mattis) wrote…
This estimation is basic. Would we get a benefit from a more accurate estimation of the overlapping bytes in an sstable. This already exists in the
sstable/Reader.EstimatedDiskUsage()
function, though there would be a small challenge in plumbing access to this method here and we might need to be wary of the cost of that method.
Maybe a TODO? I think we would need to keep an incremental byte estimates data-structure and only adjust for added and removed files.
internal/manifest/l0_sublevels.go, line 313 at r2 (raw file):
Previously, petermattis (Peter Mattis) wrote…
I believe the
bytes.Equal
comparison here is done to ensure that we're never adding a duplicate key toflushSplitUserKeys
. Could we achieve the same by ensuring thatorderedIntervals[i].startKey.key.UserKey
do not have duplicates?
IMO, that would be messier since we'd change the invariant that a file exactly overlaps with a sequence of intervals. We've contained special handling of userKeys
to the split keys logic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 5 files reviewed, 33 unresolved discussions (waiting on @itsbilal, @jbowens, @petermattis, and @sumeerbhola)
internal/manifest/l0_sublevels.go, line 211 at r2 (raw file):
Previously, sumeerbhola wrote…
It used to be called
fileMeta.index
. It is extensively used in the compaction picking logic.
Would l0Index
be a better name?
internal/manifest/l0_sublevels.go, line 245 at r2 (raw file):
Previously, sumeerbhola wrote…
Maybe a TODO? I think we would need to keep an incremental byte estimates data-structure and only adjust for added and removed files.
A TODO is good. I didn't mean to suggest changing this here.
internal/manifest/l0_sublevels.go, line 313 at r2 (raw file):
Previously, sumeerbhola wrote…
IMO, that would be messier since we'd change the invariant that a file exactly overlaps with a sequence of intervals. We've contained special handling of
userKeys
to the split keys logic.
Ack. Ok.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 5 files reviewed, 33 unresolved discussions (waiting on @itsbilal, @jbowens, @petermattis, and @sumeerbhola)
internal/manifest/l0_sublevels.go, line 245 at r2 (raw file):
Previously, petermattis (Peter Mattis) wrote…
A TODO is good. I didn't mean to suggest changing this here.
This interpolation worked ok because the ingested sstables were usually narrow. But it has been in the back of my mind that L0SubLevels
(and the compaction picker changes) should also be able to handle a burst in write throughput coming through normal flushes. Say we have accumulated 10 files in L0 that all overlap with each other, and want the 11th file being flushed to be properly split -- the interpolated estimate from the first 10 files is going to be wildly wrong if they all pair-wise overlap with each other. So we will probably require a better estimate to handle this case.
263dccc
to
47f48c9
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the detailed review!
Reviewable status: 0 of 5 files reviewed, 32 unresolved discussions (waiting on @jbowens, @petermattis, and @sumeerbhola)
internal/manifest/l0_sublevels.go, line 1 at r2 (raw file):
Previously, petermattis (Peter Mattis) wrote…
Missing copyright notice.
Done.
internal/manifest/l0_sublevels.go, line 19 at r2 (raw file):
Previously, petermattis (Peter Mattis) wrote…
s/not inclusive/exclusive/g
Done.
internal/manifest/l0_sublevels.go, line 37 at r2 (raw file):
Previously, petermattis (Peter Mattis) wrote…
s/compaction choosing/compaction picking/g
?
Done.
internal/manifest/l0_sublevels.go, line 86 at r2 (raw file):
Previously, petermattis (Peter Mattis) wrote…
You can omit the
0
here:return keys[:j+1]
.
Done.
internal/manifest/l0_sublevels.go, line 207 at r2 (raw file):
Previously, sumeerbhola wrote…
filesByAge
used to be[]fileMeta
which is why the prototype code needed to copy.
Ah right, good catch. It's an artifact from a past iteration.
internal/manifest/l0_sublevels.go, line 211 at r2 (raw file):
Previously, petermattis (Peter Mattis) wrote…
Would
l0Index
be a better name?
l0Index
makes more sense, and has more meaning outside of this data structure. Done.
internal/manifest/l0_sublevels.go, line 245 at r2 (raw file):
Previously, sumeerbhola wrote…
This interpolation worked ok because the ingested sstables were usually narrow. But it has been in the back of my mind that
L0SubLevels
(and the compaction picker changes) should also be able to handle a burst in write throughput coming through normal flushes. Say we have accumulated 10 files in L0 that all overlap with each other, and want the 11th file being flushed to be properly split -- the interpolated estimate from the first 10 files is going to be wildly wrong if they all pair-wise overlap with each other. So we will probably require a better estimate to handle this case.
Added a TODO. I agree that this estimation is simple and overlooks cases like the one Sumeer mentioned.
internal/manifest/l0_sublevels.go, line 249 at r2 (raw file):
Previously, petermattis (Peter Mattis) wrote…
If
f.Compacting
is true,f.IsBaseCompacting == !f.IsIntraL0Compacting
. Perhaps you can get rid of one ofIs{Base,IntraL0}Compacting
to eliminate the possibility of this erroneous state from ever occurring.
Done.
internal/manifest/l0_sublevels.go, line 275 at r2 (raw file):
Previously, petermattis (Peter Mattis) wrote…
Perhaps
s/fileBytes/estimatedBytes
to remind ourselves that this is an estimate.
Done.
internal/manifest/l0_sublevels.go, line 329 at r2 (raw file):
Previously, petermattis (Peter Mattis) wrote…
I think you want to use
s.format
for formatting the user keys.
Done.
internal/manifest/l0_sublevels_test.go, line 1 at r2 (raw file):
Previously, petermattis (Peter Mattis) wrote…
Missing copyright comment.
Done.
internal/manifest/l0_sublevels_test.go, line 183 at r2 (raw file):
Previously, petermattis (Peter Mattis) wrote…
I suppose this is the primary reason for adding
testdata/MANIFEST_import
. That is a hefty file to be adding for benchmark purposes. Could we instead construct aMANIFEST
with similar properties?
There's another test that the next PR adds, that relies on being able to pick base and intra-L0 compactions repeatedly from this large file. Agree that it is a pretty hefty file though. Maybe I could write a method to dynamically generate a smaller file for this benchmark?
internal/manifest/version.go, line 67 at r2 (raw file):
Previously, petermattis (Peter Mattis) wrote…
Does this field need to be exported?
These fields need some comments. How does
IsBaseCompacting
differ fromCompacting
? Ditto forIsIntraL0Compacting
?
It needs to be exported since it'll be used in addInProgressCompaction
/removeInProgressCompaction
in compaction.go
. Added a comment for IsIntraL0Compacting
, and removed IsBaseCompacting
altogether.
internal/manifest/testdata/l0_sublevels, line 10 at r2 (raw file):
Previously, petermattis (Peter Mattis) wrote…
What is the point of defining
L6
in these tests? As far as I can tell it is never used. Am I missing something?
It's used in the next PR, since the compaction picking logic relies on knowing the state of Lbase files.
internal/manifest/testdata/l0_sublevels, line 45 at r2 (raw file):
Previously, petermattis (Peter Mattis) wrote…
Nit: add a space after
split:
and there shouldn't be a trailing comma.It is mildly confusing to me to have
flush keys
on the previous line and thensplit
here. I think it might be mildly clearer to use the Go slice format:split: [f, g, h, p]
If we want to indicate the cardinality, we could do something like:
split(4): [f, g, h, p]
.
Done
internal/manifest/testdata/l0_sublevels, line 51 at r2 (raw file):
Previously, petermattis (Peter Mattis) wrote…
There are two inconsistencies here. Above we display the sublevel as
0.3
,0.2
, etc. Here it is namedSublevel 0
. I think we should use the0.<sublevel>
format in both places.The second inconsistency is the order in which the sublevels are displayed.
Should the details of the files in each sublevel be interleaved with the overview above?
Formatting all of the file details in a single line per sublevel will be unreadable if there are a large number of files. Combining all of these points, how about something like:
0.3: file count: 1, bytes: 256, width (mean, max): 4, 4, interval range: [0, 3] 000010:f#11,1-g#11,1 0.2: file count: 1, bytes: 256, width (mean, max): 7, 7, interval range: [1, 7] 000009:f#10,1-p#10,1 0.1: file count: 1, bytes: 256, width (mean, max): 4, 4, interval range: [2, 5] 000005:f#6,1-h#9,1 0.0: file count: 2, bytes: 512, width (mean, max): 1, 2, interval range: [3, 7] 000006:f#4,1-g#5,1 000011:n#8,1-p#10,1
That's a pretty good formatting suggestion. Implemented it as-is. The old one had the duplicated sublevel output since String was supposed to be less verbose than what we wanted here, but I fixed that by adding a describe(verbose bool)
method that is called here instead. Thanks!
47f48c9
to
ad67b0e
Compare
ad67b0e
to
a95224c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 5 files reviewed, 18 unresolved discussions (waiting on @itsbilal, @jbowens, and @sumeerbhola)
internal/manifest/l0_sublevels.go, line 19 at r2 (raw file):
Previously, itsbilal (Bilal Akhtar) wrote…
Done.
I'm confused reading this comment. FileMetadata.Largest
is inclusive. Is the intention to make the end key of an interval exclusive? Why is this important? This might be called out somewhere, but I'm not seeing it right now.
internal/manifest/l0_sublevels.go, line 29 at r3 (raw file):
// comparisons in the following manner: // - intervalKey{k, false} < intervalKey{k, true} // - k1 < k2 => intervalKey{k1, _} < intervalKey{k2, _}.
This reminds me of the construction performed in compaction.go
to determine the largest key. Not quite the same, though.
It is important for these intervals to be on internal keys? Given atomic compaction boundary constraints if a user-key straddles two adjacent tables we have to compact both at the same time, right? If that is the case, perhaps we can use MakeSearchKey(f.Smallest.UserKey)
and MakeInternalKey(f.Largest.UserKey, 0, 0)
.
If it is important to keep these intervals on internal keys, it would be great to spell out why.
internal/manifest/l0_sublevels_test.go, line 183 at r2 (raw file):
Previously, itsbilal (Bilal Akhtar) wrote…
There's another test that the next PR adds, that relies on being able to pick base and intra-L0 compactions repeatedly from this large file. Agree that it is a pretty hefty file though. Maybe I could write a method to dynamically generate a smaller file for this benchmark?
I think this is fine for now. We can revisit in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 5 files reviewed, 19 unresolved discussions (waiting on @itsbilal, @jbowens, @petermattis, and @sumeerbhola)
internal/manifest/l0_sublevels.go, line 30 at r2 (raw file):
// keys produced would be intervalKey{a, false}, intervalKey{b, false}, // intervalKey{e, true} and intervalKey{g, true}, resulting in intervals // [a, b], [b, (e, true)], [(e, true), (g, true)]. The first file overlaps
this example needs some correction. The file bounds need to be inclusive at the end. And the interval bounds need be exclusive at the end.
And can you include something like [e, f] to show e
as both (e, false) and (e, true).
internal/manifest/l0_sublevels.go, line 29 at r3 (raw file):
Previously, petermattis (Peter Mattis) wrote…
This reminds me of the construction performed in
compaction.go
to determine the largest key. Not quite the same, though.It is important for these intervals to be on internal keys? Given atomic compaction boundary constraints if a user-key straddles two adjacent tables we have to compact both at the same time, right? If that is the case, perhaps we can use
MakeSearchKey(f.Smallest.UserKey)
andMakeInternalKey(f.Largest.UserKey, 0, 0)
.If it is important to keep these intervals on internal keys, it would be great to spell out why.
I don't think it is necessary for the intervals to be on internal keys. Regarding atomic compaction boundaries: compaction.expandInputs
is a no-op for level 0, and we happen to get the atomic compaction boundaries for level 0 because Version.Overlaps
uses user keys, but I don't think atomic compaction boundaries are necessary for L0. I think they are only necessary for levels where we have split a range tombstone across output sstables, which is not the case with L0 (even after the future flush changes outputting multiple sstables). That said, there doesn't seem to be any harm in the stronger invariant unless we have severe hotspots on individual keys.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 5 files reviewed, 19 unresolved discussions (waiting on @itsbilal, @jbowens, and @sumeerbhola)
internal/manifest/l0_sublevels.go, line 29 at r3 (raw file):
I think they are only necessary for levels where we have split a range tombstone across output sstables, which is not the case with L0 (even after the future flush changes outputting multiple sstables).
Are we going to restrict sstable splits during L0 flushing to never split within a range tombstone? Perhaps I'm misunderstanding your statement here.
That said, there doesn't seem to be any harm in the stronger invariant unless we have severe hotspots on individual keys.
My preference is to stay with the same invariant in each level unless we find a strong reason not to. Sounds like we're not seeing a strong reason here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 5 files reviewed, 19 unresolved discussions (waiting on @itsbilal, @jbowens, @petermattis, and @sumeerbhola)
internal/manifest/l0_sublevels.go, line 19 at r2 (raw file):
Previously, petermattis (Peter Mattis) wrote…
I'm confused reading this comment.
FileMetadata.Largest
is inclusive. Is the intention to make the end key of an interval exclusive? Why is this important? This might be called out somewhere, but I'm not seeing it right now.
The comment was incorrect, that's probably what tripped you here. The largest key in a file's metadata is inclusive. The exclusive end keys in intervals allow for intervals with no gaps in between.
internal/manifest/l0_sublevels.go, line 30 at r2 (raw file):
Previously, sumeerbhola wrote…
this example needs some correction. The file bounds need to be inclusive at the end. And the interval bounds need be exclusive at the end.
And can you include something like [e, f] to showe
as both (e, false) and (e, true).
Done. Good catch, also corrected the "file bounds are exclusive" comment above.
internal/manifest/l0_sublevels.go, line 29 at r3 (raw file):
Previously, petermattis (Peter Mattis) wrote…
I think they are only necessary for levels where we have split a range tombstone across output sstables, which is not the case with L0 (even after the future flush changes outputting multiple sstables).
Are we going to restrict sstable splits during L0 flushing to never split within a range tombstone? Perhaps I'm misunderstanding your statement here.
That said, there doesn't seem to be any harm in the stronger invariant unless we have severe hotspots on individual keys.
My preference is to stay with the same invariant in each level unless we find a strong reason not to. Sounds like we're not seeing a strong reason here.
I changed it to user keys. Uncovered a bug on Friday in the iterator PR that was occurring due to sublevels being determined by interval keys. Consider this scenario:
0.1:
00007:[b.SET.6-j.SET.8]
0.0:
00009:[a.SET.10-b.SET.10]
00003:[e.SET.5-j.SET.7]
In this case, b.SET.10
sorts before b.SET.6
due to the higher trailer, so SSTs 00009 and 00007 have no overlap according to our (old) sublevel logic. But the existence of 00003
pushes 00007
higher up in the sublevels, and now b.SET.6
appears to shadow b.SET.10
. Also added a test with this example.
a95224c
to
69a8fb6
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 5 files reviewed, 19 unresolved discussions (waiting on @itsbilal, @jbowens, and @sumeerbhola)
internal/manifest/l0_sublevels.go, line 29 at r3 (raw file):
Are we going to restrict sstable splits during L0 flushing to never split within a range tombstone? Perhaps I'm misunderstanding your statement here.
I didn't say it correctly. I meant to say that should probably not "improperly" truncate tombstones (as defined in https://github.com/cockroachdb/pebble/blob/master/docs/range_deletions.md#improperly-truncated-range-deletes -- if reference needed by @jbowens).
The prototype code does split range tombstones, but a user key does not straddle two output sstables produced by the same flush, so they are properly truncated
(https://github.com/cockroachdb/pebble/pull/563/files#diff-ebe3c6610b799c65b7e2b444dc20510a for the Fragmenter change, and the caller in compaction.go
Line 1549 in 57e44da
c.rangeDelFrag.FlushToNoStraddling(key.UserKey) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 5 files reviewed, 20 unresolved discussions (waiting on @itsbilal, @jbowens, and @sumeerbhola)
internal/manifest/l0_sublevels.go, line 29 at r3 (raw file):
I didn't say it correctly. I meant to say that should probably not "improperly" truncate tombstones (as defined in https://github.com/cockroachdb/pebble/blob/master/docs/range_deletions.md#improperly-truncated-range-deletes -- if reference needed by @jbowens).
Got it. I haven't reviewed the compaction picking part of this prototype, so I was missing FlushToNoStraddling
.
internal/manifest/l0_sublevels.go, line 48 at r4 (raw file):
// numbers and so avoid expensive byte slice comparisons. type intervalKey struct { key []byte
Now that key
is a user key, we could replace isLargest
by making the type of key
a base.InternalKey
and using a defined trailer to indicate isLargest
. The advantage of doing this is that we get the comparison routine "for free", along with formatting routines. I probably wouldn't simply replace intervalKey
with base.InternalKey
everywhere as I feel that might be more confusing. Concretely, I'm proposing:
type intervalKey struct {
// key.Trailer != 0 indicates !isLargest
// key.Trailer == 0 indicates isLargest
key base.InternalKey
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 5 files reviewed, 20 unresolved discussions (waiting on @itsbilal, @jbowens, @petermattis, and @sumeerbhola)
internal/manifest/l0_sublevels.go, line 48 at r4 (raw file):
Previously, petermattis (Peter Mattis) wrote…
Now that
key
is a user key, we could replaceisLargest
by making the type ofkey
abase.InternalKey
and using a defined trailer to indicateisLargest
. The advantage of doing this is that we get the comparison routine "for free", along with formatting routines. I probably wouldn't simply replaceintervalKey
withbase.InternalKey
everywhere as I feel that might be more confusing. Concretely, I'm proposing:type intervalKey struct { // key.Trailer != 0 indicates !isLargest // key.Trailer == 0 indicates isLargest key base.InternalKey }
I'm less sold on reusing the internal key comparison routine, because it would clutter sortAndDedup
with a lot of != 0
and == 0
logic. It feels cleaner to be able to define a comparator that does just what we want it to do here., and then our dedup logic can just rely on our own comparator's equality. That said, I don't feel strongly either way; happy to implement that too if you feel like that'd be better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few small additional comments, but I think this is good to merge.
Reviewable status: 0 of 5 files reviewed, 24 unresolved discussions (waiting on @itsbilal, @jbowens, and @sumeerbhola)
internal/manifest/l0_sublevels.go, line 48 at r4 (raw file):
Previously, itsbilal (Bilal Akhtar) wrote…
I'm less sold on reusing the internal key comparison routine, because it would clutter
sortAndDedup
with a lot of!= 0
and== 0
logic. It feels cleaner to be able to define a comparator that does just what we want it to do here., and then our dedup logic can just rely on our own comparator's equality. That said, I don't feel strongly either way; happy to implement that too if you feel like that'd be better.
I don't have a strong preference. I'm ok with leaving this as-is for now. If the alternative approach becomes more promising for some reason, we can change this in the future.
internal/manifest/l0_sublevels.go, line 120 at r4 (raw file):
// True if another interval that has a file extending into this interval is // undergoing a compaction into Lbase. In other words, this bool is true // if any interval in [internalFilesMinIntervalIndex,
Should this be s/internalFiles/files/g
?
internal/manifest/l0_sublevels.go, line 362 at r4 (raw file):
} fmt.Fprintf(&buf, "0.%d: file count: %d, bytes: %d, width (mean, max): %d, %d, interval range: [%d, %d]\n", i, len(s.Files[i]), totalBytes, sumIntervals/len(s.Files[i]), maxIntervals, s.Files[i][0].minIntervalIndex,
Nit: should the mean be formatted with %0.1f
?
internal/manifest/testdata/l0_sublevels, line 9 at r4 (raw file):
file count: 3, sublevels: 3, intervals: 5 flush split keys(3): [b, e, j] 0.2: file count: 1, bytes: 256, width (mean, max): 2, 2, interval range: [0, 1]
file count
is simply the number of files listed below, right? I could see leaving it out. Ditto with the summary file count
above and the sublevel
count. Doesn't have to happen in this PR.
What is the benefit of displaying the mean and max interval width? Is this to look for cases where there are wide intervals which are "bad".
internal/manifest/testdata/l0_sublevels, line 10 at r4 (raw file):
flush split keys(3): [b, e, j] 0.2: file count: 1, bytes: 256, width (mean, max): 2, 2, interval range: [0, 1] 000009:a#10,1-b#10,1
Should we include the width and interval range for each file? If there is a file which covers a large number of intervals, this would highlight that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 5 files reviewed, 24 unresolved discussions (waiting on @itsbilal, @jbowens, @petermattis, and @sumeerbhola)
internal/manifest/testdata/l0_sublevels, line 10 at r4 (raw file):
Previously, petermattis (Peter Mattis) wrote…
Should we include the width and interval range for each file? If there is a file which covers a large number of intervals, this would highlight that.
IMO the verbose option that prints every file is going to spam the log when we have large numbers of files in L0 -- we had > 2000 files in L0 during the large import. The prototype only printed the wide files, if any, with each sublevel. There were usually none, and the maximum I saw was 4 wide files at any point in time in L0.
69a8fb6
to
a128988
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TFTR! Addressed most of your comments.
Reviewable status: 0 of 5 files reviewed, 23 unresolved discussions (waiting on @itsbilal, @jbowens, @petermattis, and @sumeerbhola)
internal/manifest/l0_sublevels.go, line 120 at r4 (raw file):
Previously, petermattis (Peter Mattis) wrote…
Should this be
s/internalFiles/files/g
?
Good catch. Done.
internal/manifest/l0_sublevels.go, line 362 at r4 (raw file):
Previously, petermattis (Peter Mattis) wrote…
Nit: should the mean be formatted with
%0.1f
?
Done.
internal/manifest/testdata/l0_sublevels, line 9 at r4 (raw file):
Previously, petermattis (Peter Mattis) wrote…
file count
is simply the number of files listed below, right? I could see leaving it out. Ditto with the summaryfile count
above and thesublevel
count. Doesn't have to happen in this PR.What is the benefit of displaying the mean and max interval width? Is this to look for cases where there are wide intervals which are "bad".
File counts are useful for when verbose = false
and we don't print out all the files. The mean / max widths help narrow down sublevels that have the wider files we'd need to investigate more.
I'll merge this, barring no more objections. Also turns out the Github issues robbed the latest commit of a travis build, so I'll do a no-change force push. |
This change introduces L0SubLevels, a data structure to encaptulate all information to compute L0 Sublevels, flush split keys, and base / intra-L0 compactions. This data structure will be generated every time there's a file addition/deletion in L0, the integration bits will come in future PRs. The methods to pick base / intra-L0 compactions will come in a separate PR. Captures some of the largest code pieces of cockroachdb#563, and effectively replaces a part of that PR. Sumeer wrote the bulk of this code as part of his prototype, so thanks Sumeer for his work. Most of these functions are dead code at the moment, only invoked by the provided datadriven and non-datadriven tests.
a128988
to
f712f8b
Compare
Merging cockroachdb#614 and cockroachdb#635 in quick succession broke the master build. This change updates L0SubLevels with the type rename.
This change introduces L0SubLevels, a data structure to encaptulate
all information to compute L0 Sublevels, flush split keys, and
base / intra-L0 compactions. This data structure will be generated
every time there's a file addition/deletion in L0, the integration
bits will come in future PRs.
The methods to pick base / intra-L0 compactions will come in a
separate PR.
Captures some of the largest code pieces of #563, and effectively
replaces a part of that PR. Sumeer wrote the bulk of this code as
part of his prototype, so thanks Sumeer for his work.
Most of these functions are dead code at the moment, only invoked
by the provided datadriven and non-datadriven tests.