feature: Expose API to let client define number of expected values in a Get. #6655

LucasIME · 2023-07-04T15:08:57Z

General

Before this PR:
Only users with schemas that expected all cells to be present could actually make use of the performance improvement provided by #6624.

The idea is to let clients specify how many values they expect at most, so they can benefit from skipping the immutable timestamp lock check if they know, for example, that only one of two cells is ever going to be present.

After this PR:

==COMMIT_MSG==
Expose API letting clients specify how many values they expect at most from a Get. This lets us skip the immutable timestamp lock check even when we've done some empty reads, if the client knows it's safe to do so.
==COMMIT_MSG==
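
To make the intended contract concrete, here is a minimal, self-contained sketch. It is not the AtlasDB implementation: `ExpectedCellsSketch`, `ReadResult`, the `String` keys standing in for `Cell`, and the `lockCheckNeeded` flag are all hypothetical. The only things taken from this PR are the idea of a caller-supplied upper bound on present cells and the fact that exceeding it is treated as an error.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

final class ExpectedCellsSketch {
    /** Hypothetical result holder: the values read, plus whether a lock check would still be needed. */
    static final class ReadResult {
        final Map<String, byte[]> values;
        final boolean lockCheckNeeded;

        ReadResult(Map<String, byte[]> values, boolean lockCheckNeeded) {
            this.values = values;
            this.lockCheckNeeded = lockCheckNeeded;
        }
    }

    static ReadResult getWithExpectedNumberOfCells(
            Map<String, byte[]> store, Set<String> cells, int expectedNumberOfPresentCells) {
        Map<String, byte[]> found = new HashMap<>();
        for (String cell : cells) {
            byte[] value = store.get(cell);
            if (value != null) {
                found.put(cell, value);
            }
        }
        if (found.size() > expectedNumberOfPresentCells) {
            // The caller's promise was wrong, so fail loudly rather than return data.
            throw new IllegalStateException("more cells present than expected");
        }
        // Assumption of this sketch: once every value the caller expects has been read,
        // the remaining empty reads cannot be hiding data, so the check can be skipped.
        boolean lockCheckNeeded = found.size() < expectedNumberOfPresentCells;
        return new ReadResult(found, lockCheckNeeded);
    }
}
```

With two requested cells and an expectation of one, a read that finds one value reports that no lock check is needed, while the same read with an expectation of two still requires it.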

Priority:

Concerns / possible downsides (what feedback would you like?): We probably need to pay closer attention to how this interacts with the caches we have. Are we caching empty values? We also run the risk of returning empty results when we shouldn't if the client misuses the API.

Is documentation needed?:

Compatibility

Does this PR create any API breaks (e.g. at the Java or HTTP layers) - if so, do we have compatibility?:

Does this PR change the persisted format of any data - if so, do we have forward and backward compatibility?:

The code in this PR may be part of a blue-green deploy. Can upgrades from previous versions safely coexist? (Consider restarts of blue or green nodes.):

Does this PR rely on statements being true about other products at a deployment - if so, do we have correct product dependencies on these products (or other ways of verifying that these statements are true)?:

Does this PR need a schema migration?

Testing and Correctness

What, if any, assumptions are made about the current state of the world? If they change over time, how will we find out?:

What was existing testing like? What have you done to improve it?:

If this PR contains complex concurrent or asynchronous code, is it correct? The onus is on the PR writer to demonstrate this.:

If this PR involves acquiring locks or other shared resources, how do we ensure that these are always released?:

Execution

How would I tell this PR works in production? (Metrics, logs, etc.):

Has the safety of all log arguments been decided correctly?:

Will this change significantly affect our spending on metrics or logs?:

How would I tell that this PR does not work in production? (monitors, etc.):

If this PR does not work as expected, how do I fix that state? Would rollback be straightforward?:

If the above plan is more complex than “recall and rollback”, please tag the support PoC here (if it is the end of the week, tag both the current and next PoC):

Scale

Would this PR be expected to pose a risk at scale? Think of the shopping product at our largest stack.:

Would this PR be expected to perform a large number of database calls, and/or expensive database calls (e.g., row range scans, concurrent CAS)?:

Would this PR ever, with time and scale, become the wrong thing to do - and if so, how would we know that we need to do something differently?:

Development Process

Where should we start reviewing?:

If this PR is in excess of 500 lines excluding versions lock-files, why does it not make sense to split it?:

Please tag any other people who should be aware of this PR:
@jeremyk-91
@sverma30
@raiju

LucasIME · 2023-07-04T15:12:30Z

...sdb-impl-shared/src/main/java/com/palantir/atlasdb/transaction/impl/SnapshotTransaction.java

+        return getCache().get(tableRef, cells, uncached -> {
+            int cachedCells = cells.size() - uncached.size();
+            int numberOfCellsExpectingValuePostCache = expectedNumberOfPresentCells - cachedCells;


Are we actually storing empty values in this cache (and in the one in CachingTransaction)?

If so, this might not be the correct thing to do, as we'd be wrongly lowering the number of expected present cells, leading to exceptions being thrown down the line or potentially returning empty values when we shouldn't have.

Good spot! It's a bit messy:

This cache uses lock watches, and it does cache empty reads.

The caching transaction (which I don't think is actually used in normal atlasdb usage) does also cache empty reads.

So yeah we might need a bit of refactoring here; as you note, the expression as written isn't quite right

Refactored! Let me know how you think things look now.
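
The over-subtraction discussed in this thread can be illustrated with a small sketch. It assumes, as the later `CachingTransaction` snippet suggests, that a cached empty read is represented by a zero-length byte array; `CachedExpectationSketch` and both method names are hypothetical, and `String` keys stand in for `Cell`.

```java
import java.util.Map;

final class CachedExpectationSketch {
    /** The expression as first written: every cache hit lowers the expectation. */
    static long naiveRemaining(int expected, int requestedCells, int uncachedCells) {
        int cachedCells = requestedCells - uncachedCells;
        return expected - cachedCells;
    }

    /** Only cache hits that actually hold a value lower the expectation. */
    static long remainingExpected(long expected, Map<String, byte[]> cacheHits) {
        long nonEmptyHits = cacheHits.values().stream()
                .filter(value -> value.length > 0)
                .count();
        return expected - nonEmptyHits;
    }
}
```

For three requested cells with two cache hits, one of which is an empty read, and an expectation of two present cells, the naive expression leaves zero cells expected from the underlying read, while counting only non-empty hits leaves one.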

...ests-shared/src/test/java/com/palantir/atlasdb/transaction/impl/SnapshotTransactionTest.java

...sdb-impl-shared/src/main/java/com/palantir/atlasdb/transaction/impl/SnapshotTransaction.java

jeremyk-91 · 2023-07-04T18:08:44Z

...sdb-impl-shared/src/main/java/com/palantir/atlasdb/transaction/impl/SnapshotTransaction.java

jeremyk-91 · 2023-07-04T18:15:43Z

...impl-shared/src/main/java/com/palantir/atlasdb/transaction/impl/SerializableTransaction.java

+                            (tableReference, toRead) -> Futures.immediateFuture(super.getWithExpectedNumberOfCells(
+                                    tableReference, toRead, expectedNumberOfPresentCells)))
+                    .get();
+        } catch (InterruptedException | ExecutionException e) {


Might be missing something: usually we need to reset the interrupted flag on catching InterruptedException - is this different for some reason?

Hm... not sure? I just assumed the pattern used on the standard get method was correct and repeated it here?

I see. So Throwables handles re-interruption if an InterruptedException is passed... though this is probably actually an old bug, as InterruptedException isn't guaranteed to have a cause (and/or one of type InterruptedException).

I think (though you should check) that in the normal get and here, this is wrong, and should be

    } catch (InterruptedException e) {
        throw Throwables.rewrapAndThrowUncheckedException(e); // This does the InterruptedException magic
    } catch (ExecutionException e) {
        throw Throwables.rewrapAndThrowUncheckedException(e.getCause());
    }

I think we still need to do the above?

InterruptedException::getCause should return null

Oops, you're right! Fixed the CachingTransaction but forgot to fix the instance here.
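
For readers unfamiliar with the pattern discussed in this thread, here is a sketch using only the JDK. It deliberately does not use `Throwables.rewrapAndThrowUncheckedException`, whose exact behaviour is not shown in this conversation; `FutureUnwrapSketch` and `getUnchecked` are hypothetical names. It shows the two points raised above: the interrupted flag is restored when `InterruptedException` is caught, and only `ExecutionException` is unwrapped via `getCause()`.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;

final class FutureUnwrapSketch {
    static <T> T getUnchecked(Future<T> future) {
        try {
            return future.get();
        } catch (InterruptedException e) {
            // Restore the flag so code further up the stack can still observe the interruption.
            Thread.currentThread().interrupt();
            // Pass the exception itself: its cause is typically null.
            throw new RuntimeException(e);
        } catch (ExecutionException e) {
            // ExecutionException is only a wrapper; the real failure is its cause.
            Throwable cause = e.getCause();
            if (cause instanceof RuntimeException) {
                throw (RuntimeException) cause;
            }
            throw new RuntimeException(cause);
        }
    }
}
```

Catching both exception types in one block and calling `getCause()` on the result would lose the `InterruptedException` in the interrupted case, which is the old bug suspected above.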

atlasdb-client/src/main/java/com/palantir/atlasdb/transaction/impl/CachingTransaction.java

jeremyk-91 · 2023-07-04T18:16:50Z

atlasdb-api/src/main/java/com/palantir/atlasdb/transaction/api/Transaction.java

+     */
+    @Idempotent
+    Map<Cell, byte[]> getWithExpectedNumberOfCells(
+            TableReference tableRef, Set<Cell> cells, int expectedNumberOfPresentCells);


For this one: the docs read well. But I'm not sure we want to expose this to Atlas clients in general: we should see if there's a way to restrict this to the AtlasDB proxy only.

Do we have any such features that we hide from general AtlasDB users today? If so, are there any examples of how it's done?

atlasdb-api/src/main/java/com/palantir/atlasdb/transaction/api/Transaction.java

...ests-shared/src/test/java/com/palantir/atlasdb/transaction/impl/SnapshotTransactionTest.java

LucasIME · 2023-07-20T16:05:00Z

atlasdb-api/src/main/java/com/palantir/atlasdb/keyvalue/api/cache/TransactionScopedCache.java

+    Map<Cell, byte[]> getWithCachedRef(
+            TableReference tableReference,
+            Set<Cell> cells,
+            Function<CacheLookupResult, ListenableFuture<Map<Cell, byte[]>>> valueLoader);


I could have also just replaced the existing get and getAsync to receive the CacheLookupResult instead of creating two new separate methods. Let me know what you prefer.

How bad would this change have been? Basically I think it's nicer to just have one get/getAsync, but I understand refactoring this might be very difficult/costly

Not very, tbh. Although this PR is already getting quite big, so I'll try to do the refactor in a separate one.

atlasdb-api/src/main/java/com/palantir/atlasdb/keyvalue/api/cache/TransactionScopedCache.java

...sdb-impl-shared/src/main/java/com/palantir/atlasdb/transaction/impl/SnapshotTransaction.java

LucasIME · 2023-07-20T16:08:12Z

atlasdb-client/src/main/java/com/palantir/atlasdb/transaction/impl/CachingTransaction.java

+                        long nonEmptyValuesInCache = cachedCells.values().stream()
+                                .filter(value -> value != PtBytes.EMPTY_BYTE_ARRAY)
+                                .count();
+                        long numberOfCellsExpectingValuePostCache =
+                                expectedNumberOfPresentCells - nonEmptyValuesInCache;
+
+                        return Futures.immediateFuture(super.getWithExpectedNumberOfCells(
+                                tableReference, toRead, numberOfCellsExpectingValuePostCache));


I wanted to add some tests to this class verifying that we call super... with fewer expected values if we have things cached, but couldn't think of a good way to do it... any ideas?

Hmm. It is tangential, but if we pass in a mock delegate to the constructor we can verify stuff on it right? Admittedly this kind of tests ForwardingTransaction itself too but I think that's fine
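
A sketch of the suggestion above, under stated assumptions: instead of a mocking library it uses a hand-written recording delegate so that the example stays dependency-free, and `Reader`, `RecordingReader`, and `CachingReader` are hypothetical stand-ins for the transaction types, with `String` keys in place of `Cell`. The point is only that a delegate passed into the constructor lets a test observe which expectation was forwarded.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class DelegateVerificationSketch {
    interface Reader {
        Map<String, byte[]> getWithExpectedNumberOfCells(Set<String> cells, long expected);
    }

    /** Hand-written stand-in for a mock: remembers the arguments of the last call. */
    static final class RecordingReader implements Reader {
        Set<String> lastCells;
        long lastExpected = -1;

        @Override
        public Map<String, byte[]> getWithExpectedNumberOfCells(Set<String> cells, long expected) {
            lastCells = cells;
            lastExpected = expected;
            return new HashMap<>();
        }
    }

    /** Toy caching layer: serves hits locally and lowers the expectation by non-empty hits. */
    static final class CachingReader implements Reader {
        private final Reader delegate;
        private final Map<String, byte[]> cache;

        CachingReader(Reader delegate, Map<String, byte[]> cache) {
            this.delegate = delegate;
            this.cache = cache;
        }

        @Override
        public Map<String, byte[]> getWithExpectedNumberOfCells(Set<String> cells, long expected) {
            Map<String, byte[]> result = new HashMap<>();
            Set<String> toRead = new HashSet<>();
            long nonEmptyHits = 0;
            for (String cell : cells) {
                byte[] cached = cache.get(cell);
                if (cached == null) {
                    toRead.add(cell); // cache miss: must be read from the delegate
                } else if (cached.length > 0) {
                    result.put(cell, cached);
                    nonEmptyHits++;
                }
                // A cached empty read is neither returned nor counted.
            }
            result.putAll(delegate.getWithExpectedNumberOfCells(toRead, expected - nonEmptyHits));
            return result;
        }
    }
}
```

A test can then construct `CachingReader` around a `RecordingReader`, issue one read, and assert on `lastCells` and `lastExpected`.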

jeremyk-91

Looks pretty solid! Just a couple of smaller bits left

jeremyk-91 · 2023-07-21T19:21:30Z

atlasdb-api/src/main/java/com/palantir/atlasdb/keyvalue/api/cache/TransactionScopedCache.java

atlasdb-api/src/main/java/com/palantir/atlasdb/transaction/api/Transaction.java

atlasdb-client/src/main/java/com/palantir/atlasdb/transaction/impl/CachingTransaction.java

jeremyk-91 · 2023-07-21T19:34:39Z

...impl-shared/src/main/java/com/palantir/atlasdb/transaction/impl/SerializableTransaction.java

...ests-shared/src/test/java/com/palantir/atlasdb/transaction/impl/SnapshotTransactionTest.java

...sdb-impl-shared/src/main/java/com/palantir/atlasdb/transaction/impl/SnapshotTransaction.java

...ests-shared/src/test/java/com/palantir/atlasdb/transaction/impl/SnapshotTransactionTest.java

jeremyk-91 · 2023-07-21T20:14:07Z

atlasdb-client/src/main/java/com/palantir/atlasdb/transaction/impl/CachingTransaction.java

LucasIME · 2023-07-25T13:50:19Z

...shared/src/main/java/com/palantir/atlasdb/keyvalue/api/cache/TransactionScopedCacheImpl.java

-        if (cacheLookup.missedCells().isEmpty()) {
-            return Futures.immediateFuture(filterEmptyValues(cacheLookup.cacheHits()));
-        } else {
-            return Futures.transform(
-                    valueLoader.apply(cacheLookup.missedCells()),
-                    uncachedValues -> processUncachedCells(
-                            tableReference, cacheLookup.cacheHits(), cacheLookup.missedCells(), uncachedValues),
-                    MoreExecutors.directExecutor());
-        }


Potentially controversial change here: before, if we had no missed cells, we'd just return early. But now we delegate to the value loader, so it can decide if too many cells were cached and throw. No extra cells should be loaded, though, since we pass an empty Set as the missed cells.

Modified the test for the existing behaviour here: https://github.com/palantir/atlasdb/pull/6655/files#diff-0205f833db5fc2b2032f684954c530b494e176e210ab17d9bdd0713869ef1981R152-R163
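
The behavioural change described here can be sketched as follows. This is a simplified, synchronous illustration rather than the real `TransactionScopedCacheImpl`: `AlwaysDelegateSketch` is a hypothetical name, `String` keys stand in for `Cell`, and the loader takes the missed cells directly instead of a `CacheLookupResult`.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

final class AlwaysDelegateSketch {
    static Map<String, byte[]> get(
            Map<String, byte[]> cacheHits,
            Set<String> missedCells,
            Function<Set<String>, Map<String, byte[]>> valueLoader) {
        // No early return when missedCells is empty: the loader still runs, so it
        // gets the chance to validate and throw. With an empty set it has nothing to load.
        Map<String, byte[]> result = new HashMap<>(cacheHits);
        result.putAll(valueLoader.apply(missedCells));
        return result;
    }
}
```

The cost of this design is one extra loader invocation on a full cache hit; the benefit is that validation lives in a single place.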

Sam-Kramer · 2023-08-02T09:08:22Z

...a/com/palantir/atlasdb/transaction/api/exceptions/MoreCellsPresentThanExpectedException.java

+        return arguments;
+    }
+
+    @Unsafe


Why do we need the @Unsafe marker here?

Due to the SafeLoggingPropagation check. Otherwise we fail to compile with the following message:

/Volumes/git/Projects/atlasdb/atlasdb-api/src/main/java/com/palantir/atlasdb/transaction/api/exceptions/MoreCellsPresentThanExpectedException.java:59: error: [SafeLoggingPropagation] Safe logging annotations should be propagated to encapsulating elements to allow static analysis tooling to work with as much information as possible. This check can be auto-fixed using `./gradlew classes testClasses -PerrorProneApply=SafeLoggingPropagation`
    private static List<Arg<?>> argsFrom(Map<Cell, byte[]> retrievedCells, long expectedNumberOfCells) {
    ^
    (see https://github.com/palantir/gradle-baseline#baseline-error-prone-checks)
  Did you mean '@Unsafe private static List<Arg<?>> argsFrom(Map<Cell, byte[]> retrievedCells, long expectedNumberOfCells) {'?

Sam-Kramer

lgtm, just a style nit on the new tests

...src/test/java/com/palantir/atlasdb/transaction/impl/expectations/CellCountValidatorTest.java

Sam-Kramer · 2023-08-02T09:21:11Z

...a/com/palantir/atlasdb/transaction/api/exceptions/MoreCellsPresentThanExpectedException.java

+import com.palantir.logsafe.exceptions.SafeIllegalStateException;
+import java.util.Map;
+
+public class MoreCellsPresentThanExpectedException extends IllegalStateException {


I think the entire message that is passed to super will be unrendered.

jeremyk-91

👍 Thanks for updating the PR! Looks great

...a/com/palantir/atlasdb/transaction/api/exceptions/MoreCellsPresentThanExpectedException.java

atlasdb-api/src/main/java/com/palantir/atlasdb/transaction/api/Transaction.java

jeremyk-91 · 2023-08-30T12:15:20Z

...red/src/main/java/com/palantir/atlasdb/transaction/impl/expectations/CellCountValidator.java

+            long expectedNumberOfPresentCellsToFetch, Map<Cell, CacheValue> cachedLookup) {
+        Map<Cell, byte[]> cachedCellsWithNonEmptyValue = EntryStream.of(cachedLookup)
+                .filterValues(value -> value.value().isPresent()
+                        && !Arrays.equals(value.value().get(), PtBytes.EMPTY_BYTE_ARRAY))


I'm guessing this was the bit we missed the last time?

Not really! That was correct from the beginning.

We were over-subtracting from the number of expected cells here, which was fixed in commit f11d972.

Ah, okay. I see, yep thanks
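
As background for the filter quoted in this thread: it compares contents with `Arrays.equals`, whereas the earlier `CachingTransaction` snippet compared references with `!=`. The sketch below shows the difference using a local `EMPTY` constant as a stand-in for `PtBytes.EMPTY_BYTE_ARRAY`; `EmptyValueCheckSketch` and its method names are hypothetical. Whether AtlasDB ever produces an empty array that is a distinct instance from the constant is not stated in this conversation.

```java
import java.util.Arrays;

final class EmptyValueCheckSketch {
    // Stand-in for PtBytes.EMPTY_BYTE_ARRAY.
    static final byte[] EMPTY = new byte[0];

    /** Reference comparison: any array other than the constant itself counts as non-empty. */
    static boolean isNonEmptyByReference(byte[] value) {
        return value != EMPTY;
    }

    /** Content comparison: every zero-length array counts as empty. */
    static boolean isNonEmptyByContent(byte[] value) {
        return !Arrays.equals(value, EMPTY);
    }
}
```

A freshly allocated `new byte[0]` is reported as non-empty by the reference check but as empty by the content check, so only the latter is safe when empty values can come from more than one source.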

...ests-shared/src/test/java/com/palantir/atlasdb/transaction/impl/SnapshotTransactionTest.java

svc-autorelease · 2023-08-30T14:40:14Z

Released 0.920.0

LucasIME and others added 2 commits July 4, 2023 16:02

Expose API to let client define number of expected values

6c2aa5f

Add generated changelog entries

99f2125

LucasIME commented Jul 4, 2023

accept revapi change

7e76c60

jeremyk-91 reviewed Jul 4, 2023

LucasIME added 10 commits July 6, 2023 10:48

T -> Timestamp

e3224bd

Extract validation to private method and unsafe log the cells

8d5320d

fix documentation typo

d3afa8d

Add cache test

37ac526

fix test

b557bc7

test for not reducing

90477bd

description

ffaa25c

Add methods with cachedLookupRef to TransactionScopedCache gets

648e54e

Add test verifying we don't decrease number of expected cells if cached value is empty

f273357

verify no lock is ever called

773e1ad

LucasIME commented Jul 20, 2023

LucasIME marked this pull request as ready for review July 20, 2023 16:10

LucasIME added 2 commits July 20, 2023 17:11

Merge branch 'develop' into lmeireles/expose-get-with-expected-size

34c0b25

# Conflicts: # .palantir/revapi.yml

rev api

0a13148

LucasIME requested a review from jeremyk-91 July 20, 2023 16:20

jeremyk-91 reviewed Jul 21, 2023

LucasIME added 7 commits July 24, 2023 15:47

catch interrupted properly

d54908e

skip lock check and not lock

3d913ff

fix test description

797e635

Array equals instead of !=

2926f79

testing all fetched cached

e8c08c7

Extract validor to separate class

f48fd14

update expected exception class

89cb066

LucasIME commented Jul 25, 2023

LucasIME force-pushed the lmeireles/expose-get-with-expected-size branch from 0155cb8 to a25666a Compare July 25, 2023 13:56

LucasIME force-pushed the lmeireles/expose-get-with-expected-size branch from 2ad1b7a to 2ab80f9 Compare August 1, 2023 13:06

LucasIME added 5 commits August 1, 2023 14:41

assert args in exception

a42100a

cell count validator tests

f91d0ba

add test to verify we don't filter down the cells we pass to the cell loader

8f0a05d

import

e147c14

spotless

9d61cc7

Sam-Kramer reviewed Aug 2, 2023

Sam-Kramer approved these changes Aug 2, 2023

LucasIME added the do not merge label Aug 2, 2023

LucasIME added 13 commits August 2, 2023 10:42

fix test style

b2292ce

Changing getWithExpectedSize to return a Result type (#6686)

82da62a

* rename validator * changing API to return result type * rev api * isOk final * dry * spotless * Result refactor (#6692) * implement result on children * method reference

Merge branch 'develop' into lmeireles/expose-get-with-expected-size

ad4d984

rev api

3227916

Autorelease 0.908.0-rc1

806abe0

[skip ci]

store number of expected cells on exception too

b8d981f

Autorelease 0.908.0-rc2

9e25ffa

[skip ci]

Merge branch 'develop' into lmeireles/expose-get-with-expected-size

11951f1

fix ambiguous reference

f1912e1

don't count deleted local writes as expected cells

f11d972

Merge branch 'develop' into lmeireles/expose-get-with-expected-size

1f68638

Autorelease 0.919.0-rc1

cad58d2

[skip ci]

Autorelease 0.919.0-rc2

b3f7284

[skip ci]

jeremyk-91 approved these changes Aug 30, 2023

missing space

86831da

LucasIME added autorelease merge when ready and removed do not merge labels Aug 30, 2023

bulldozer-bot bot merged commit 954092e into develop Aug 30, 2023

bulldozer-bot bot deleted the lmeireles/expose-get-with-expected-size branch August 30, 2023 14:39
