-
Notifications
You must be signed in to change notification settings - Fork 15
Simplify getCandidateCellsForSweeping() #2439
Simplify getCandidateCellsForSweeping() #2439
Conversation
Initially I hoped to take advantage of in-database filtering if we get to implement the "transaction table sweeping" feature. However, seems like that wasn't a great decision on my part - unclear when (and if) we get that feature done, and also how much of improvement we would actually get. The extra logic makes the code significantly more complex, so I think we need to back off and give up the idea. This doesn't invalidate the sweep rewrite project. The new impls still bring significant performance improvements.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks fine to me (just a few nits) - @hsaraogi will also take a look today.
candidateBatch.add(ImmutableCandidateCellForSweeping.builder() | ||
.cell(cell) | ||
.sortedTimestamps(candidate ? timestampArr : EMPTY_LONG_ARRAY) | ||
.sortedTimestamps(timestampArr) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had to check if we should still be returning EMPTY_LONG_ARRAY
if there is just one timestamp - I see that case excluded at SweepableCellFilter.getCellToSweep
, so looks ok to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even if there is one timestamp, you still might need to sweep it (if that timestamp is uncommitted)
assertThat(getAllCandidates(conservativeRequest(PtBytes.EMPTY_BYTE_ARRAY, 40L, 30L))).isEmpty(); | ||
public void singleCellSpanningSeveralPages() { | ||
new TestDataBuilder() | ||
.put(10, 1, 1000) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: 1000L
(and similar below)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
@@ -125,7 +120,7 @@ public void returnCellsInOrder() { | |||
.putEmpty(3, 2, 10L) | |||
.putEmpty(3, 3, 10L) | |||
.store(); | |||
assertThat(getAllCandidates(conservativeRequest(PtBytes.EMPTY_BYTE_ARRAY, 30L, 5L)) | |||
assertThat(getAllCandidates(conservativeRequest(PtBytes.EMPTY_BYTE_ARRAY, 30L, 100)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: 100L
(repeated on line 138 below)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that argument is actually an int (the batch size)
+ " FROM (" | ||
+ " SELECT row_name, col_name, ts" | ||
+ " FROM " + prefixedTableName | ||
+ " WHERE ts < ? " + bounds.predicates + getIgnoredTimestampPredicate() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should thorough have the getIgnoredTimestampPredicate? This brings us to the question if shouldCheckIfLatestValueIsEmpty
and timestampsToIgnore
should be two separate values or grouped as SweepStrategyParams (maybe an enum with types CONSERVATIVE/ThOROUGH).
timestampsToIgnore will be -1 for Conservative and empty for THOROUGH and we can add binds in the conservative query as well..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can discuss what the API should be (and in fact we could have discussed this a few months back when I introduced it), but in this PR I'm trying to keep changes minimal.
This is a question of the level of abstraction. In the current design, KVS doesn't really know anything about sweep strategies; we just tell it whether to ignore timestamps and/or load empty values. In this design, the two concepts are orthogonal, and that's why you see getIgnoredTimestampPredicate() in both "if" branches. We could change the design and make KVS aware of sweep strategies, but that's beyond the scope of this pull request.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am trying to point out that shouldCheckIfLatestValueIsEmpty
and timestampsToIgnore
are not independent concepts but are correlated based on the sweep strategy, seems like a bad design to allow many more cases here when there are exactly two possibilities.
Open to the idea of doing this in a separate PR, but I think allowing exactly two sets of values will simplify the queries here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general, I think that the KVS should not know about sweep - its one job is to make the appropriate calls to the DB (via the clientPool, in the C* case) in a correct and efficient way. It shouldn't have to know why the user wants to get these timestamps - +1 to keeping as-is.
However, the boundaries are blurred somewhat as we're specifically satisfying a getCandidateCellsForSweeping
request, implying that it's reasonable to know something about what being a CandidateCellForSweeping might entail.
But also, the branching here is no more than it was before this PR, and I don't think it's an egregious combinatorial explosion.
On balance, I'd like this to merge as is - any refactor to try and cut down on the branching should be in a separate PR, and should also be a low priority, given the limited potential for simplification and the urgency of getting this + Oracle sweep merged in to develop.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regarding removing branching - I want to benchmark a different query later that would eliminate this branching (for thorough, check values for all rows rather than only load the latest values). But of course I need to get the Oracle part done first - we can get back to this later.
} | ||
|
||
@Test | ||
public void returnFirstAndLastCellOfThePage() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test still seems valuable, why did we get rid of this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This tested implementation-specific behavior that doesn't exist anymore
Codecov Report
@@ Coverage Diff @@
## feature/sweep-rewrite #2439 +/- ##
===========================================================
- Coverage 60.36% 60.32% -0.04%
- Complexity 4470 4690 +220
===========================================================
Files 843 843
Lines 39797 39782 -15
Branches 4096 4083 -13
===========================================================
- Hits 24023 24000 -23
- Misses 14284 14290 +6
- Partials 1490 1492 +2
Continue to review full report at Codecov.
|
* The big sweep rewrite, part 3: Postgres impl (#2021) * The big sweep rewrite, part 3: Postgres impl - Implement KeyValueService.getCandidateCellsForSweeping() for Postgres - Add AgnosticResultRow.getBoolean() and LightAgnosticResultRow.getArray() * Address CR comments Also change the stopping condition in computeNext() * Add a release note * Add the ORDER BY that I forgot * Rewrite a fragile test * Join pages based on the number of (cell, ts) pairs * Move release note to develop * The big sweep rewrite, part 4: Cassandra impl (#2231) * The big sweep rewrite, part 4: Cassandra impl Iterate over cells in a table without ever loading more than a given number at once regardless of how wide the rows are, but still perform efficiently on narrow tables. This should finally allow us to run sweep safely on Cassandra with decent performance. * Address clockfort's comments + checkstyle + add release notes * Address code review comments * Split up test cases in SingleRowColumnPagerTest * Tests for KeyRanges * Update locks * Return Optional instead of passing in @output list * Update dependency locks * Fix TestDataBuilder after merge * Make tests green - Fix a small bug in CassandraGetCandidateCellsForSweepingImpl - Remove an irrelevant test (should_not_oom_when_...) and make CassKVSeSweepTaskRunnerIntegrationTest non-parameterized. (getRangeOfTimestamps() is not used in the sweep code path anymore) - Relax assertions in testSweepBatchesUpToDeleteBatchSize * Bring back the old variable name to make checkstyle happy * Run ./gradlew generateLock saveLock * Reapply changes * Fix compile errors * Simplify getCandidateCellsForSweeping() (#2439) * Simplify getCandidateCellsForSweeping() Initially I hoped to take advantage of in-database filtering if we get to implement the "transaction table sweeping" feature. However, seems like that wasn't a great decision on my part - unclear when (and if) we get that feature done, and also how much of improvement we would actually get. The extra logic makes the code significantly more complex, so I think we need to back off and give up the idea. This doesn't invalidate the sweep rewrite project. The new impls still bring significant performance improvements. * The big sweep rewrite, part 5: Oracle impl - Extract the paging logic from the Postgres impl into a separate class named CandidatePagingState so that we could reuse it in the Oracle impl (and also test the logic with unit tests) - Add a new benchmark fullTableScanOneWideRowThorough that simulates sweep of a very wide row. This is important for Oracle - a naive implementation would perform extremely poorly in that scenario. * A bunch of refactors - Make a builder for FullQuery - Rename RangeBoundPredicates -> RangePredicateHelper and make it append to a FullQuery.Builder instead of returning a string + args. - Make DbKvsGetCandidateCellsForSweeping a concrete class with common logic. Now, database-specific logic is behind the CellTsPairLoader interface. - Add full test coverage for CandidatePagingState * Tautological tests for RangePredicateHelper * Import the "correct" assertThat * Static classes * Refactors * CandidatePagingState: extract state-updating methods Also, reorganised class * Expand the comment in OracleCellTsPageLoaderFactory Also filter out empty pages * Fix release notes * Fix release notes for real
Initially I hoped to take advantage of in-database filtering if we get
to implement the "transaction table sweeping" feature. However, seems
like that wasn't a great decision on my part - unclear when (and if) we
get that feature done, and also how much of improvement we would
actually get. The extra logic makes the code significantly more complex,
so I think we need to back off and give up the idea.
This doesn't invalidate the sweep rewrite project. The new impls still
bring significant performance improvements.
This change is