Fix SweepBatchConfig values to properly decrease to 1 with each failure and increase with each success #2630

tboam · 2017-11-03T18:47:37Z

Goals (and why): Two problems with how the batchSizeMultiplier was decreased meant that the config values never went down as far as 1. On Cassandra (where we had set one parameter to 1) the values actually increased when things failed. Even if all config values were 1000, we never went below 30. The goal is for the parameters to halve on each failure until they all reach 1, and then to increase again as runs succeed.

Implementation Description (bullets): Removed the complicated max expression and replaced it with clearer logic.

Concerns (what feedback would you like?): We're now going to shrink all the way down to a multiplier of 0.001. At a 1% increase per success, this takes nearly 700 iterations to get back to 1. This should probably be faster, but I'd like to do that in another PR. Is this OK?
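
For reviewers who want to sanity-check the "nearly 700 iterations" figure, here is a minimal sketch. The class and method names are hypothetical and not part of AtlasDB; it assumes a purely multiplicative increase per success and simply counts steps until the multiplier is back at 1.0.

```java
// Hypothetical helper, not AtlasDB code: counts how many successful runs a
// multiplicative increase needs to bring a multiplier back up to 1.0.
final class RecoveryIterations {
    private RecoveryIterations() {}

    static int successesToRecover(double startMultiplier, double growthFactor) {
        if (startMultiplier <= 0 || growthFactor <= 1.0) {
            throw new IllegalArgumentException("need startMultiplier > 0 and growthFactor > 1");
        }
        int iterations = 0;
        double multiplier = startMultiplier;
        // Apply one increase per successful run until we are back at (or above) 1.0.
        while (multiplier < 1.0) {
            multiplier *= growthFactor;
            iterations++;
        }
        return iterations;
    }
}
```

Calling `RecoveryIterations.successesToRecover(0.001, 1.01)` lands just under 700, which matches the concern above; with a growth factor of 2.0 the same climb takes 10 steps.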

Where should we start reviewing?: The test describes the behaviour we want.

Priority (whenever / two weeks / yesterday): Monday as sweep is currently not backing off on failures.

hsaraogi

LGTM overall, with some small comments. We should revamp the rate of increase of the sweep batch configs in a separate PR. Let's discuss whether we should hold off on merging this until the faster rate of increase is ready.

hsaraogi · 2017-11-05T18:41:52Z

...b-impl-shared/src/main/java/com/palantir/atlasdb/sweep/AdjustableSweepBatchConfigSource.java

+
+        double newBatchSizeMultiplier = batchSizeMultiplier / 2;
+        if (newBatchSizeMultiplier < smallestSensibleBatchSizeMultiplier) {
+            batchSizeMultiplier = smallestSensibleBatchSizeMultiplier;


We should have a warn here, saying something along the lines of: sweep tried to decrease the multiplier, but couldn't as it was already at the minimum.

Agree that logging here would be useful. I have concerns that this might log too much like the clock skew monitor though.

Out of interest, do we know offhand what happens if C* is unreachable or down? For a timeout exception you get automatic log throttling, but uncertain about connect exceptions that arise more quickly.

Done, I've also added a guard so if the batchSizeMultiplier is already set to the smallestSensibleValue then we early out to avoid excessive logging.
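
A minimal sketch of the halve, clamp, and early-out behaviour discussed in this thread. This is not the actual `AdjustableSweepBatchConfigSource` code: the class name is hypothetical, the floor is passed in by the caller, and a counter stands in for the `log.warn` call so the effect of the guard is easy to observe.

```java
// Illustrative only: mirrors the shape of the diff above with assumed names.
final class MultiplierBackoffSketch {
    private final double smallestSensibleMultiplier;
    private double batchSizeMultiplier = 1.0;
    private int warningsLogged = 0; // stand-in for log.warn(...) calls

    MultiplierBackoffSketch(double smallestSensibleMultiplier) {
        this.smallestSensibleMultiplier = smallestSensibleMultiplier;
    }

    void decreaseMultiplier() {
        // Guard: already at the floor, so return early and avoid repeated warnings.
        if (batchSizeMultiplier == smallestSensibleMultiplier) {
            return;
        }
        double halved = batchSizeMultiplier / 2;
        if (halved < smallestSensibleMultiplier) {
            batchSizeMultiplier = smallestSensibleMultiplier;
            warningsLogged++; // warn once, at the moment we clamp to the floor
        } else {
            batchSizeMultiplier = halved;
        }
    }

    double getBatchSizeMultiplier() {
        return batchSizeMultiplier;
    }

    int getWarningsLogged() {
        return warningsLogged;
    }
}
```

With a floor of 0.001, repeated failures halve the multiplier until it is clamped, and every later call is a no-op, so the warning fires once rather than on every failed run.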

hsaraogi · 2017-11-05T18:42:34Z

...b-impl-shared/src/main/java/com/palantir/atlasdb/sweep/AdjustableSweepBatchConfigSource.java

+        if (newBatchSizeMultiplier < smallestSensibleBatchSizeMultiplier) {
+            batchSizeMultiplier = smallestSensibleBatchSizeMultiplier;
+        } else {
+            batchSizeMultiplier = newBatchSizeMultiplier;


We can have a metric/log for the new batchsize multiplier value. I think we have some metric for this already, we should update that one here.

The metric is passed as a gauge which is a Supplier<Long>, so we actually get the updates for free here :)

yep, we already have this metric and it just reads the value here so no need to update it.
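
To illustrate why no explicit metric update is needed, here is a small sketch using only `java.util.function.Supplier`. The class name and the `Double` type are assumptions for illustration; the real gauge wiring and value type in AtlasDB may differ.

```java
import java.util.function.Supplier;

// Illustrative only: a gauge modelled as a Supplier reads the field lazily,
// so every read observes the latest multiplier without any push from us.
final class LiveGaugeSketch {
    private volatile double batchSizeMultiplier = 1.0;

    Supplier<Double> multiplierGauge() {
        // The lambda captures "this", not a snapshot of the current value.
        return () -> batchSizeMultiplier;
    }

    void halve() {
        batchSizeMultiplier = batchSizeMultiplier / 2;
    }
}
```

A gauge obtained before a call to `halve()` returns the halved value on its next read, which is the "updates for free" behaviour described above.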

hsaraogi · 2017-11-05T18:43:57Z

...pl-shared/src/test/java/com/palantir/atlasdb/sweep/AdjustableSweepBatchConfigSourceTest.java

+    @Test
+    public void batchSizeMultiplierDecreasesOnFailure() {
+        // Given
+        configWithValues(1000, 1000, 1000);


nit: adjustableConfig should be a local variable here. It's strange to see methods operating via side effects only.

That would mean passing the item under test to every method in the class, which I think would be messy and more complicated. The side effects would be unusual in production code, but here they hopefully make the tests easier to read.

hsaraogi · 2017-11-05T18:47:01Z

...pl-shared/src/test/java/com/palantir/atlasdb/sweep/AdjustableSweepBatchConfigSourceTest.java

+    }
+
+    private void whenDecreasingTheMultiplier_thenAdjustedConfigValuesDecrease() {
+        for (int i = 0; i < 10_000; i++) {


The multiplier is generally stuck at 0.001 after about 10 iterations, so 10_000 seems like too many here. Let's make this 100?

I wanted the test to be agnostic of the implementation details. At the moment we back off aggressively, so a value of 1000 reaches 1 quickly. But we may not always want to do this. Keeping the larger bound makes the test more flexible, and it won't need updating when we change behaviour that we don't care about here.

hsaraogi · 2017-11-05T18:49:15Z

...pl-shared/src/test/java/com/palantir/atlasdb/sweep/AdjustableSweepBatchConfigSourceTest.java

+    }
+
+    private void configWithValues(int maxCellTsPairsToExamine, int candidateBatchSize, int deleteBatchSize) {
+        adjustableConfig = AdjustableSweepBatchConfigSource.create(() -> new SweepBatchConfig() {


Use ImmutableSweepBatchConfig.builder to create the sweepBatchConfig.

Done, thanks

hsaraogi · 2017-11-05T19:11:39Z

...pl-shared/src/test/java/com/palantir/atlasdb/sweep/AdjustableSweepBatchConfigSourceTest.java

+        if (newValue == 1) {
+            return;
+        }
+        assertThat(newValue, is(lessThan(previousCandidateBatchSize)));


Nit: replace the three lines with: assertThat(newValue, is(anyOf(lessThan(previousMaxCellTsPairsToExamine), equalTo(1))));

Same for the other two batch config value asserts.

Done, had forgotten about anyOf

hsaraogi · 2017-11-05T19:15:14Z

...pl-shared/src/test/java/com/palantir/atlasdb/sweep/AdjustableSweepBatchConfigSourceTest.java

+
+            // Then
+            batchSizeMultiplierDecreases();
+            maxCellTsPairsToExamineDecreasesToAMinimumOfOne();


nit coding-style: These three asserts are repetitive. We might be able to coalesce them into one function and/or use lambdas.

I've applied some lambdas to the assertion methods to tidy them up and reduce duplication
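
As a rough sketch of what coalescing the three assertions with a lambda could look like: the helper below is hypothetical and uses plain Java rather than the Hamcrest matchers in the real test, purely so it stays self-contained.

```java
import java.util.function.IntSupplier;

// Hypothetical helper: one check shared by all three batch config values.
// The getter for the value under test is passed in as a lambda.
final class DecreasingValueCheck {
    private DecreasingValueCheck() {}

    static void assertDecreasesToAMinimumOfOne(String name, int previousValue, IntSupplier currentValue) {
        int newValue = currentValue.getAsInt();
        // Passes if the value went down, or if it has already bottomed out at 1.
        if (!(newValue < previousValue || newValue == 1)) {
            throw new AssertionError(
                    name + " should have decreased from " + previousValue + " but was " + newValue);
        }
    }
}
```

Each of the three values would then be checked with a single call, for example passing `() -> config.candidateBatchSize()` as the supplier, where `config` is whatever object the test holds.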

hsaraogi · 2017-11-05T19:17:42Z

docs/source/release_notes/release-notes.rst

@@ -48,6 +48,12 @@ develop
    *    - Type
         - Change

+    *    - |fix|
+         - ``SweepBatchConfig`` values are now decayed correctly when there's an error.
+           ``SweepBatchConfig`` should be decreased until sweep succeeds, however on Cassandra the config was actually multiplied by 1.5.  This was caused by us fixing one of the values at 1.


In 0.65.1 it wasn't being multiplied by 1.5, because the default candidateBatchSize had been increased to 1024. So, technically, this wasn't the behaviour in the last version.

If I'm not wrong this behaviour can still happen with any value of the params, just that it might not happen immediately. Perhaps we could reword this to ...the config actually oscillated between small values that could be larger than 1.

Even if we start with something big, it will be cut down in the first few iterations until the candidate batch size becomes small.

e.g. running decreases with a batch of 1024, you get something like:

0;0.5;512
1;0.25;256
2;0.125;128
3;0.0625;64
4;0.03125;32
5;0.046875;48
6;0.03125;32
7;0.046875;48
8;0.03125;32
9;0.046875;48
10;0.03125;32

and with a batch config of 123456789:

0;0.5;61728394
1;0.25;30864197
2;0.125;15432098
3;0.0625;7716049
4;0.03125;3858024
5;0.015625;1929012
6;0.0078125;964506
7;0.00390625;482253
8;0.001953125;241126
9;9.765625E-4;120563
10;4.8828125E-4;60281
11;2.44140625E-4;30140
12;1.220703125E-4;15070
13;9.953550099535501E-5;12288
14;1.220703125E-4;15070

There's probably a proof we can write that the original decreases until it oscillates between K and 1.5K for some K (not going to do it in the interest of time 😅).

jeremyk-91

@hsaraogi already reviewed this in detail. I had a look through and have a couple of small additions. The general idea makes sense though, and as far as I can tell looks like a correct "halve or increase by 1%, but don't halve the multiplier into oblivion".

jeremyk-91 · 2017-11-06T10:06:24Z

docs/source/release_notes/release-notes.rst

+         - ``SweepBatchConfig`` values are now decayed correctly when there's an error.
+           ``SweepBatchConfig`` should be decreased until sweep succeeds, however on Cassandra the config was actually multiplied by 1.5.  This was caused by us fixing one of the values at 1.
+           ``SweepBatchConfig`` values will now be halved with each failure until they reach 1 (previously they only went to about 30% due to another bug).  This ensures we fully backoff and gives us the best possible chance of success.  Values will slowly increase with each successful run until they are back to their default level.
+           (`Pull Request <https://github.com/palantir/atlasdb/pull/XXXX>`__)


(we should remember to update this)

Thanks, done
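
For readers of the release note, a minimal sketch of how a halved multiplier could translate into config values that bottom out at 1. The helper name is hypothetical, and the truncate-then-clamp rule is an assumption rather than something taken from the diff.

```java
// Illustrative only: applies a multiplier to a raw config value, truncating
// towards zero and never returning less than 1.
final class AdjustedBatchSize {
    private AdjustedBatchSize() {}

    static int adjust(int rawValue, double multiplier) {
        return Math.max(1, (int) (rawValue * multiplier));
    }
}
```

Under this assumption a raw value of 1000 becomes 500 after one failure, and a value that is already 1 stays at 1 no matter how far the multiplier drops.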

jeremyk-91 · 2017-11-06T10:08:31Z

...pl-shared/src/test/java/com/palantir/atlasdb/sweep/AdjustableSweepBatchConfigSourceTest.java

+
+    private void batchSizeMultiplierIncreases() {
+        assertThat(adjustableConfig.getBatchSizeMultiplier(), is(greaterThanOrEqualTo(previousMultiplier)));
+    }


nit: This seems like a ...DoesNotDecrease(), following consistency with the method for max-cell-ts-pairs. I'm guessing we need the equality case for when it's 1 - in which case we could do increasesToAMaximumOfOne() and something analogous to the batch sizes needing to be at least 1.

Refactored this to be clearer

codecov-io · 2017-11-06T15:17:40Z

Codecov Report

Merging #2630 into develop will decrease coverage by <.01%.
The diff coverage is 100%.

@@              Coverage Diff              @@
##             develop    #2630      +/-   ##
=============================================
- Coverage      60.32%   60.31%   -0.01%     
+ Complexity      4706     4705       -1     
=============================================
  Files            865      865              
  Lines          39938    39956      +18     
  Branches        4018     4021       +3     
=============================================
+ Hits           24093    24101       +8     
- Misses         14367    14374       +7     
- Partials        1478     1481       +3

| Impacted Files | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| ...tlasdb/sweep/AdjustableSweepBatchConfigSource.java | `100% <100%> (+6.89%)` | `10 <3> (+1)` | ⬆️ |
| .../java/com/palantir/util/debug/StackTraceUtils.java | `51.12% <0%> (-3.14%)` | `22% <0%> (ø)` | |
| ...ain/java/com/palantir/paxos/PaxosAcceptorImpl.java | `75% <0%> (-1.79%)` | `0% <0%> (ø)` | |
| ...ain/java/com/palantir/paxos/PaxosStateLogImpl.java | `85.59% <0%> (-1.7%)` | `0% <0%> (ø)` | |
| ...a/com/palantir/common/base/BatchingVisitables.java | `75.12% <0%> (-1.04%)` | `18% <0%> (ø)` | |
| ...tion/impl/AbstractSerializableTransactionTest.java | `85.24% <0%> (-0.55%)` | `35% <0%> (ø)` | |
| ...n/java/com/palantir/lock/impl/LockServiceImpl.java | `82.95% <0%> (ø)` | `91% <0%> (ø)` | ⬇️ |
| ...ain/java/com/palantir/paxos/PaxosProposerImpl.java | `88.33% <0%> (+3.33%)` | `0% <0%> (ø)` | ⬇️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

…ease by 1% for each successive success, so if we had reduced a value to 1 it would be 70 iterations before we got to 2 and 700 iterations before we got back to 1000. Now we always run 25 iterations with the lower batch size and then try increasing the rate by doubling each time. This means that when sweep has to back off, it should speed up again quickly.

tboam · 2017-11-06T15:25:59Z

New version up with much tidier tests and faster return to normal after backing off.
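
One possible reading of the faster recovery, sketched under stated assumptions: after a back-off, the multiplier is held steady for 25 successive successes and then doubled on each further success until it is back at 1.0. The class, constant, and method names are hypothetical, and the real implementation may count or reset differently.

```java
// Illustrative only: hold the reduced multiplier for a fixed number of
// successes, then double per success, capped at 1.0.
final class FastRecoverySketch {
    static final int STEADY_SUCCESSES_BEFORE_DOUBLING = 25; // figure from the PR description

    private double multiplier;
    private int successiveSuccesses = 0;

    FastRecoverySketch(double multiplierAfterBackoff) {
        this.multiplier = multiplierAfterBackoff;
    }

    void onSuccess() {
        if (multiplier >= 1.0) {
            return; // fully recovered, nothing to do
        }
        successiveSuccesses++;
        if (successiveSuccesses > STEADY_SUCCESSES_BEFORE_DOUBLING) {
            multiplier = Math.min(1.0, multiplier * 2);
        }
    }

    void onFailure() {
        // A failure restarts the steady period; the halving itself is omitted here.
        successiveSuccesses = 0;
    }

    double getMultiplier() {
        return multiplier;
    }
}
```

Starting from 0.001, this reading gets back to 1.0 after 35 successes (25 steady, then 10 doublings), compared with roughly 700 at 1% per success.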

hsaraogi · 2017-11-07T10:31:54Z

...b-impl-shared/src/main/java/com/palantir/atlasdb/sweep/AdjustableSweepBatchConfigSource.java

@@ -29,6 +30,7 @@
    private final Supplier<SweepBatchConfig> rawSweepBatchConfig;

    private static volatile double batchSizeMultiplier = 1.0;
+    private int successiveIncreases = 0;


Do we want to make this atomic/volatile?

I suspect we're safe here because we usually read or update this in sweep, which is guaranteed to run in a single thread. But you're right that there's a risk.

Changed to AtomicInteger - volatile doesn't necessarily work as expected with variable++
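
A short sketch of why `AtomicInteger` is the safer choice here. `variable++` on a volatile field is a separate read, add, and write, so concurrent increments can be lost, whereas `incrementAndGet()` is a single atomic operation. The class and method names below are hypothetical.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative only: a counter that stays correct under concurrent increments.
final class SuccessCounterSketch {
    private final AtomicInteger successiveIncreases = new AtomicInteger(0);

    int recordIncrease() {
        return successiveIncreases.incrementAndGet(); // atomic read-modify-write
    }

    void reset() {
        successiveIncreases.set(0);
    }

    int get() {
        return successiveIncreases.get();
    }

    // Starts several threads that each increment the counter, then returns the total.
    static int incrementConcurrently(int threads, int incrementsPerThread) {
        SuccessCounterSketch counter = new SuccessCounterSketch();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < incrementsPerThread; i++) {
                    counter.recordIncrease();
                }
            });
            workers[t].start();
        }
        try {
            for (Thread worker : workers) {
                worker.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        return counter.get();
    }
}
```

With four threads doing 10,000 increments each, the atomic counter always ends at 40,000; the same loop over a plain `volatile int` with `++` can come up short.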

hsaraogi

LGTM, thanks for fixing this!

* Fix SweepBatchConfig values to properly decrease to 1 with each failure and increase with each success (#2630) * Fix SweepBatchConfig values to properly decrease to 1 with each failure and increase with each success * add logging when we stop reducing the batch size multiplier * further improve the tests * Allow sweep to recover faster after backing off. Before we would increase by 1% for each successive success, if we had reduced a value to 1 it would be 70 iterations before we got 2 and 700 iterations before we got back to 1000. Now we always 25 iterations with the lower batch size and then try increasing the rate by doubling each time. This means that when sweep has to back off it should speed up again quickly. * Use an AtomicInteger to handle concurrent updates * SweeperService logging improvements (#2618) * SweeperServiceImpl now logs when it starts sweeping make it clear if it is running full sweep or not * Added sweep parameters to the log lines * no longer default the service parameter in the interface, this way we can see when the parameter isn't provided and we are defaulting to true. Behaviour is unchanged but we can log a message when defaulting. 
* Refactor TracingKVS (#2643) * Wrap next() and hasNext() in traces * Use span names as safe * Remove iterator wrappings * checkstyle * refactor methods and remove misleading traces * Fix unit tests * release notes * Final nits * fix java arrays usage * Delete docs (#2657) * [20 minute tasks] Add test for when a batch is full (#2655) * [no release notes] Drive-by add test for when a batch is full * MetricRegistry log level downgrade + multiple timestamp tracker tests (#2636) * change metrics manager to warn plus log the metric name * more timestamp tracker tests * release notes * Extract interface for Cassandra client (#2660) * Create a CassandraClient * Propagate CassandraClient to all classes but CKVS * Use CassandraClient on CKVS * Propagate CassandraClient to remaining Impl classes * Use CassandraClient in tests * [no release notes] * client -> namespace [no release notes] (#2654) * 0.65.2 and 0.66.0 release notes (#2663) * Release notes banners * fix pr numbers * [QoS] Add getNamespace to AtlasDBConfig (#2661) * Add getNamespace [no release notes] * Timelock client config cannot be empty * Make it explicit that unspecified namespace is only possible for InMemoryKVS * CR comments * Live Reloading the TimeLock Block, Part 1: Pull to Push (#2621) * thoughts * More tests for RIH * Paranoid logging * statics * javadoc part 1 * polling refreshable * Unit tests * Remove the old RIH * lock lock * Tests that test how we deal with exceptions * logging * [no release notes] * CR comments part 1 * Make interval configurable * Standard nasty time edge cases * lastSeenValue does not need to be volatile * Live Reloading the TimeLock Block, Part 2: TransactionManagers Plumbing (#2622) * ServiceCreator.applyDynamic() * Propagate config through TMs * Json Serialization fixes * Some refactoring * lock/lock * Fixed checkstyle * CR comments part 1 * Switch to RPIH * add test * [no release notes] forthcoming in part 4 * checkstyle * [TTT] [no release notes] Document behaviour 
regarding index rows (#2658) * [no release notes] Document behaviour regarding index rows * fix compile bug * ``List`` * Refactor and Instrument CassandraClient api (#2665) * Sanitize Client API * Instrument CassandraClient * checkstyle * Address comment * [no release notes] * checkstyle * Fix cas * Live Reloading the TimeLock Block, Part 3: Working with 0 Nodes (#2647) * 0 nodes part 1 * add support for 0 servers in a ServerListConfig * extend deserialization tests * More tests * code defensively * [no release notes] defer to 2648 * Fixed CR nits * singleton server list * check immutable ts (#2406) * check immutable ts * checkstyle * release notes * Fix TM creation * checkstyle * Propagate top-level KVS method names to CassandraClient (#2669) * Propagate method names down to multiget_slice * Add the corresponding KVS method to remaining methods * Add TODO * [no release notes] * nit * Extract cql executor interface (#2670) * Instrument CqlExecutor * [no release notes] * bump awaitility (#2668) * Upgrade to newer Awaitility. 
* locks [no release notes] * unused import * Bump Atlas on Tritium 0.8.4 to fix dependency conflicts (#2662) * Bump Atlas on Tritium 0.8.4 to fix dependency conflicts * Add changes into missing file * Doc changes * Exclude Tracing and HdrHistogram from Tritium dependencies * update locks * Add excluded dependencies explicitly * Fix merge conflict in relase notes * Uncomment dependencies * Regenerate locks * Correctly log Paxos events (#2674) * Log out Paxos values when recording Paxos events * Updated release notes * Checkstyle * Pull request number * Address comments * fix docs * Slow log and tracing (#2673) * Trace and instrument the thrift client * Instrument CqlExecutor * Fix metric names of IntrumentedCassandraClient * Fix nit * Also log internal table references * Checkstyle * simplify metric names * Address comments * add slow logging to the cassandra thrift client * add slow logging to cqlExecutor * fix typos * Add tracing to the CassandraClient * trace cqlExecutor queries * Add slow-logging in the CassandraClient * Delete InstrumentedCC and InstrumentedCqlExec * Fix small nits * Checkstyle * Add kvs method names to slow logs * Fix wrapping of exception * Extract CqlQuery * Move kvs-slow-log and tracing of CqlExecutor to CCI * Propagate execute_cql3_query api breaks * checkstyle * delete unused string * checkstyle * fix number of mutations on batch_mutate * some refactors * fix compile * Refactor cassandra client (#2676) * Extract TracingCassandraClient Extract ProfilingCassandraClient Move todos and some cleanup Cherry-pick QoS metrics to develop (#2679) * [QoS] Feature/qos meters (#2640) * Metrics for bytes and counts in each read/write * Refactors, dont throw if recordMetrics throws * Use meters instead of histograms * Multiget bytes * Batch mutate exact size * Cqlresult size * Calculate exact byte sizes for all thrift objects * tests and bugfixes - partial * More tests and bugs fixed * More tests and cr comments * byte buffer size * Remove register 
histogram * checkstyle * checkstyle * locks and license * Qos metrics CassandraClient * Exclude unused classes * fix cherry pick * use supplier for object size [no release notes] * fix merge in AtlasDbConfig * rate limiting * total-time * qos config * respect max backoff itme * query weights * extra tests * num rows * checkstyle * fix tests * no int casting * Qos ete tests * shouldFailIfWritingTooManyBytes * fix test * rm file * Remove metrics * Test shouldFailIfReadingTooManyBytes * canBeWritingLargeNumberOfBytesConcurrently * checkstyle * cannotWriteLargeNumberOfBytesConcurrently * fix tests * create tm in test * More read tests (after writing a lot of data at once) * WIP * Tests that should pas * Actually update the rate * Add another test * More tests and address comments * Dont extend etesetup * Make dumping data faster * cleanup * wip * Add back lost file * Cleanup * Write tests * numReadsPerThread -> numThreads * More write tests, cleanup, check style fixes * Refactor to avoid code duplication * Cleanup * cr comments * Small read/write after a rate-limited read/write * annoying no new linw at eof * Uniform parameters for hard limiting * Don't consume any estimated bytes for a _transaction or metadata table query * Add tests * cr comments

* Extremely basic QosServiceResource * Make resource an interface * Add client PathParam * Clean up javax.ws.rs dependencies * Create stub for AtlasDbQosClient * Calls to checkLimit use up a credit; throw when out of credits * Add QosServiceResourceImpl + test * AutoDelegate for Cassandra.Client * Rename QosService stuff * Pass AtlasDbQosClient to CassandraClient * Check limit on multiget_slice * Check limit on batch_mutate * Don't test we aren't soft-limited while we can never be soft-limited * Check limit on remaining CassandraClient methods * Scheduled refresh of AtlasDbQosClient.credits * Refresh every second Once we have configurable quotas on the QoS service, they will be more understandable (per second rather than per-10-seconds). * Mount qos-service on Timelock * Checkstyle * Update dependency locks * Dont throw limitExceededException * Move client param around * Comment * Qos Service config (#2644) * Service config * Allow clients to run without configuring limits * simpler tests * [QoS] qos ete test (#2652) * checkpoint * checkpoint * working test * check passing * unused deps * [QoS] rate limiter (#2653) * rate limiting * update license and docs * [QoS] Feature/qos client (#2650) * Create one qosCLient for each service QosClientBuilder hooked up to KVS create Create the QosClient in CassandraClientPoolImpl if the config is specified. 
Create FakeQosClient if the config is not specified Cleanup get broken tests to pass * Locks * Fix failing tests * Add getNamespace [no release notes] * Create QosClient at the Top level * fix test * test and checkstyle fixes * locks * deps * fix tests * [QoS] Feature/qos meters (#2640) * Metrics for bytes and counts in each read/write * Refactors, dont throw if recordMetrics throws * Use meters instead of histograms * Multiget bytes * Batch mutate exact size * Cqlresult size * Calculate exact byte sizes for all thrift objects * tests and bugfixes - partial * More tests and bugs fixed * More tests and cr comments * byte buffer size * Remove register histogram * checkstyle * checkstyle * locks and license * [QoS] QosClient with ratelimiter (#2667) * QosClient with ratelimiter * Checkstyle * locks * [QoS] Create a jaxrs-client for the integ tests (#2675) * Create a jaxrs-client for the integ tests * build fix * clean up * Nziebart/merge develop into qos (#2683) * Fix SweepBatchConfig values to properly decrease to 1 with each failure and increase with each success (#2630) * Fix SweepBatchConfig values to properly decrease to 1 with each failure and increase with each success * add logging when we stop reducing the batch size multiplier * further improve the tests * Allow sweep to recover faster after backing off. Before we would increase by 1% for each successive success, if we had reduced a value to 1 it would be 70 iterations before we got 2 and 700 iterations before we got back to 1000. Now we always 25 iterations with the lower batch size and then try increasing the rate by doubling each time. This means that when sweep has to back off it should speed up again quickly. 
* Use an AtomicInteger to handle concurrent updates * SweeperService logging improvements (#2618) * SweeperServiceImpl now logs when it starts sweeping make it clear if it is running full sweep or not * Added sweep parameters to the log lines * no longer default the service parameter in the interface, this way we can see when the parameter isn't provided and we are defaulting to true. Behaviour is unchanged but we can log a message when defaulting. * Refactor TracingKVS (#2643) * Wrap next() and hasNext() in traces * Use span names as safe * Remove iterator wrappings * checkstyle * refactor methods and remove misleading traces * Fix unit tests * release notes * Final nits * fix java arrays usage * Delete docs (#2657) * [20 minute tasks] Add test for when a batch is full (#2655) * [no release notes] Drive-by add test for when a batch is full * MetricRegistry log level downgrade + multiple timestamp tracker tests (#2636) * change metrics manager to warn plus log the metric name * more timestamp tracker tests * release notes * Extract interface for Cassandra client (#2660) * Create a CassandraClient * Propagate CassandraClient to all classes but CKVS * Use CassandraClient on CKVS * Propagate CassandraClient to remaining Impl classes * Use CassandraClient in tests * [no release notes] * client -> namespace [no release notes] (#2654) * 0.65.2 and 0.66.0 release notes (#2663) * Release notes banners * fix pr numbers * [QoS] Add getNamespace to AtlasDBConfig (#2661) * Add getNamespace [no release notes] * Timelock client config cannot be empty * Make it explicit that unspecified namespace is only possible for InMemoryKVS * CR comments * Live Reloading the TimeLock Block, Part 1: Pull to Push (#2621) * thoughts * More tests for RIH * Paranoid logging * statics * javadoc part 1 * polling refreshable * Unit tests * Remove the old RIH * lock lock * Tests that test how we deal with exceptions * logging * [no release notes] * CR comments part 1 * Make interval configurable * 
Standard nasty time edge cases * lastSeenValue does not need to be volatile * Live Reloading the TimeLock Block, Part 2: TransactionManagers Plumbing (#2622) * ServiceCreator.applyDynamic() * Propagate config through TMs * Json Serialization fixes * Some refactoring * lock/lock * Fixed checkstyle * CR comments part 1 * Switch to RPIH * add test * [no release notes] forthcoming in part 4 * checkstyle * [TTT] [no release notes] Document behaviour regarding index rows (#2658) * [no release notes] Document behaviour regarding index rows * fix compile bug * ``List`` * Refactor and Instrument CassandraClient api (#2665) * Sanitize Client API * Instrument CassandraClient * checkstyle * Address comment * [no release notes] * checkstyle * Fix cas * Live Reloading the TimeLock Block, Part 3: Working with 0 Nodes (#2647) * 0 nodes part 1 * add support for 0 servers in a ServerListConfig * extend deserialization tests * More tests * code defensively * [no release notes] defer to 2648 * Fixed CR nits * singleton server list * check immutable ts (#2406) * check immutable ts * checkstyle * release notes * Fix TM creation * checkstyle * Propagate top-level KVS method names to CassandraClient (#2669) * Propagate method names down to multiget_slice * Add the corresponding KVS method to remaining methods * Add TODO * [no release notes] * nit * Extract cql executor interface (#2670) * Instrument CqlExecutor * [no release notes] * bump awaitility (#2668) * Upgrade to newer Awaitility. 
* locks [no release notes] * unused import * Bump Atlas on Tritium 0.8.4 to fix dependency conflicts (#2662) * Bump Atlas on Tritium 0.8.4 to fix dependency conflicts * Add changes into missing file * Doc changes * Exclude Tracing and HdrHistogram from Tritium dependencies * update locks * Add excluded dependencies explicitly * Fix merge conflict in relase notes * Uncomment dependencies * Regenerate locks * Correctly log Paxos events (#2674) * Log out Paxos values when recording Paxos events * Updated release notes * Checkstyle * Pull request number * Address comments * fix docs * Slow log and tracing (#2673) * Trace and instrument the thrift client * Instrument CqlExecutor * Fix metric names of IntrumentedCassandraClient * Fix nit * Also log internal table references * Checkstyle * simplify metric names * Address comments * add slow logging to the cassandra thrift client * add slow logging to cqlExecutor * fix typos * Add tracing to the CassandraClient * trace cqlExecutor queries * Add slow-logging in the CassandraClient * Delete InstrumentedCC and InstrumentedCqlExec * Fix small nits * Checkstyle * Add kvs method names to slow logs * Fix wrapping of exception * Extract CqlQuery * Move kvs-slow-log and tracing of CqlExecutor to CCI * Propagate execute_cql3_query api breaks * checkstyle * delete unused string * checkstyle * fix number of mutations on batch_mutate * some refactors * fix compile * Refactor cassandra client (#2676) * Extract TracingCassandraClient Extract ProfilingCassandraClient Move todos and some cleanup Cherry-pick QoS metrics to develop (#2679) * [QoS] Feature/qos meters (#2640) * Metrics for bytes and counts in each read/write * Refactors, dont throw if recordMetrics throws * Use meters instead of histograms * Multiget bytes * Batch mutate exact size * Cqlresult size * Calculate exact byte sizes for all thrift objects * tests and bugfixes - partial * More tests and bugs fixed * More tests and cr comments * byte buffer size * Remove register 
histogram * checkstyle * checkstyle * locks and license * Qos metrics CassandraClient * Exclude unused classes * fix cherry pick * use supplier for object size [no release notes] * fix merge in AtlasDbConfig * qos rate limiting (#2709) * rate limiting * [QoS] total time spent talking to Cassandra (#2687) * total-time * [QoS] Client config (#2690) * qos config * respect max backoff itme * [QoS] [Refactor] Query Weights (#2697) * query weights * extra tests * [QoS] Number of rows per query (#2698) * num rows * checkstyle * fix tests * no int casting * fix numRows calculation on batch_mutate * [QoS] CAS metrics (#2705) * cas metrics * exceptions (#2706) * [QoS] Guava license (#2703) * guava license * Cleanup: class reference * [QoS] live reload (#2710) * live reload and logging * millis * checkpoint * fix tests * comments * checkstyle * [QoS] Don't rate limit CAS (#2711) * dont limit cas * Remove tests of deleted method * Cherrypick/qos exception mapping (#2715) * very simple ratelimitexceededexception * Need to be able to throw RLEE directly from Cass, rather than ADDE(RLEE)s * fix bug with ADDE(RLEE) * Exception Mapper * unravel a bad javadoc * CR comments part 1 * lock lock * split qos aware throwables * visibility * fix compile break * checkstyle * handle exceptions properly * [QoS] Estimate the number of read bytes w/ number of rows (#2717) * Refactor the name of the functions * Estimate based on the number of rows * Fix modifiers on ThriftQueryWeighers * Add unit tests to estimation logic * ThriftQueryWeighers.multigetSlice takes a List, not number of rows * getRangeSlices takes KeyRange, not count * weight estimates (#2725) * [QoS] Fix exceptions thrown on CqlExecutor (#2696) * Address #2683 comments * Clarify query and add cause * Add just the cqlQuery.queryFormat * checkstyle * Update test We changed the error message... 
* [QoS] Qos ete test (#2708) * Fix SweepBatchConfig values to properly decrease to 1 with each failure and increase with each success (#2630) * Fix SweepBatchConfig values to properly decrease to 1 with each failure and increase with each success * add logging when we stop reducing the batch size multiplier * further improve the tests * Allow sweep to recover faster after backing off. Before we would increase by 1% for each successive success, if we had reduced a value to 1 it would be 70 iterations before we got 2 and 700 iterations before we got back to 1000. Now we always 25 iterations with the lower batch size and then try increasing the rate by doubling each time. This means that when sweep has to back off it should speed up again quickly. * Use an AtomicInteger to handle concurrent updates * SweeperService logging improvements (#2618) * SweeperServiceImpl now logs when it starts sweeping make it clear if it is running full sweep or not * Added sweep parameters to the log lines * no longer default the service parameter in the interface, this way we can see when the parameter isn't provided and we are defaulting to true. Behaviour is unchanged but we can log a message when defaulting. 
* Refactor TracingKVS (#2643) * Wrap next() and hasNext() in traces * Use span names as safe * Remove iterator wrappings * checkstyle * refactor methods and remove misleading traces * Fix unit tests * release notes * Final nits * fix java arrays usage * Delete docs (#2657) * [20 minute tasks] Add test for when a batch is full (#2655) * [no release notes] Drive-by add test for when a batch is full * MetricRegistry log level downgrade + multiple timestamp tracker tests (#2636) * change metrics manager to warn plus log the metric name * more timestamp tracker tests * release notes * Extract interface for Cassandra client (#2660) * Create a CassandraClient * Propagate CassandraClient to all classes but CKVS * Use CassandraClient on CKVS * Propagate CassandraClient to remaining Impl classes * Use CassandraClient in tests * [no release notes] * client -> namespace [no release notes] (#2654) * 0.65.2 and 0.66.0 release notes (#2663) * Release notes banners * fix pr numbers * [QoS] Add getNamespace to AtlasDBConfig (#2661) * Add getNamespace [no release notes] * Timelock client config cannot be empty * Make it explicit that unspecified namespace is only possible for InMemoryKVS * CR comments * Live Reloading the TimeLock Block, Part 1: Pull to Push (#2621) * thoughts * More tests for RIH * Paranoid logging * statics * javadoc part 1 * polling refreshable * Unit tests * Remove the old RIH * lock lock * Tests that test how we deal with exceptions * logging * [no release notes] * CR comments part 1 * Make interval configurable * Standard nasty time edge cases * lastSeenValue does not need to be volatile * Live Reloading the TimeLock Block, Part 2: TransactionManagers Plumbing (#2622) * ServiceCreator.applyDynamic() * Propagate config through TMs * Json Serialization fixes * Some refactoring * lock/lock * Fixed checkstyle * CR comments part 1 * Switch to RPIH * add test * [no release notes] forthcoming in part 4 * checkstyle * [TTT] [no release notes] Document behaviour 
regarding index rows (#2658) * [no release notes] Document behaviour regarding index rows * fix compile bug * ``List`` * Refactor and Instrument CassandraClient api (#2665) * Sanitize Client API * Instrument CassandraClient * checkstyle * Address comment * [no release notes] * checkstyle * Fix cas * Live Reloading the TimeLock Block, Part 3: Working with 0 Nodes (#2647) * 0 nodes part 1 * add support for 0 servers in a ServerListConfig * extend deserialization tests * More tests * code defensively * [no release notes] defer to 2648 * Fixed CR nits * singleton server list * check immutable ts (#2406) * check immutable ts * checkstyle * release notes * Fix TM creation * checkstyle * Propagate top-level KVS method names to CassandraClient (#2669) * Propagate method names down to multiget_slice * Add the corresponding KVS method to remaining methods * Add TODO * [no release notes] * nit * Extract cql executor interface (#2670) * Instrument CqlExecutor * [no release notes] * bump awaitility (#2668) * Upgrade to newer Awaitility. 
* locks [no release notes] * unused import * Bump Atlas on Tritium 0.8.4 to fix dependency conflicts (#2662) * Bump Atlas on Tritium 0.8.4 to fix dependency conflicts * Add changes into missing file * Doc changes * Exclude Tracing and HdrHistogram from Tritium dependencies * update locks * Add excluded dependencies explicitly * Fix merge conflict in relase notes * Uncomment dependencies * Regenerate locks * Correctly log Paxos events (#2674) * Log out Paxos values when recording Paxos events * Updated release notes * Checkstyle * Pull request number * Address comments * fix docs * Slow log and tracing (#2673) * Trace and instrument the thrift client * Instrument CqlExecutor * Fix metric names of IntrumentedCassandraClient * Fix nit * Also log internal table references * Checkstyle * simplify metric names * Address comments * add slow logging to the cassandra thrift client * add slow logging to cqlExecutor * fix typos * Add tracing to the CassandraClient * trace cqlExecutor queries * Add slow-logging in the CassandraClient * Delete InstrumentedCC and InstrumentedCqlExec * Fix small nits * Checkstyle * Add kvs method names to slow logs * Fix wrapping of exception * Extract CqlQuery * Move kvs-slow-log and tracing of CqlExecutor to CCI * Propagate execute_cql3_query api breaks * checkstyle * delete unused string * checkstyle * fix number of mutations on batch_mutate * some refactors * fix compile * Refactor cassandra client (#2676) * Extract TracingCassandraClient Extract ProfilingCassandraClient Move todos and some cleanup Cherry-pick QoS metrics to develop (#2679) * [QoS] Feature/qos meters (#2640) * Metrics for bytes and counts in each read/write * Refactors, dont throw if recordMetrics throws * Use meters instead of histograms * Multiget bytes * Batch mutate exact size * Cqlresult size * Calculate exact byte sizes for all thrift objects * tests and bugfixes - partial * More tests and bugs fixed * More tests and cr comments * byte buffer size * Remove register 
histogram * checkstyle * checkstyle * locks and license * Qos metrics CassandraClient * Exclude unused classes * fix cherry pick * use supplier for object size [no release notes] * fix merge in AtlasDbConfig * rate limiting * total-time * qos config * respect max backoff itme * query weights * extra tests * num rows * checkstyle * fix tests * no int casting * Qos ete tests * shouldFailIfWritingTooManyBytes * fix test * rm file * Remove metrics * Test shouldFailIfReadingTooManyBytes * canBeWritingLargeNumberOfBytesConcurrently * checkstyle * cannotWriteLargeNumberOfBytesConcurrently * fix tests * create tm in test * More read tests (after writing a lot of data at once) * WIP * Tests that should pas * Actually update the rate * Add another test * More tests and address comments * Dont extend etesetup * Make dumping data faster * cleanup * wip * Add back lost file * Cleanup * Write tests * numReadsPerThread -> numThreads * More write tests, cleanup, check style fixes * Refactor to avoid code duplication * Cleanup * cr comments * Small read/write after a rate-limited read/write * annoying no new linw at eof * Uniform parameters for hard limiting * [QoS] Fix/qos system table rate limiting (#2739) * Fix SweepBatchConfig values to properly decrease to 1 with each failure and increase with each success (#2630) * Fix SweepBatchConfig values to properly decrease to 1 with each failure and increase with each success * add logging when we stop reducing the batch size multiplier * further improve the tests * Allow sweep to recover faster after backing off. Before we would increase by 1% for each successive success, if we had reduced a value to 1 it would be 70 iterations before we got 2 and 700 iterations before we got back to 1000. Now we always 25 iterations with the lower batch size and then try increasing the rate by doubling each time. This means that when sweep has to back off it should speed up again quickly. 
* Use an AtomicInteger to handle concurrent updates * SweeperService logging improvements (#2618) * SweeperServiceImpl now logs when it starts sweeping make it clear if it is running full sweep or not * Added sweep parameters to the log lines * no longer default the service parameter in the interface, this way we can see when the parameter isn't provided and we are defaulting to true. Behaviour is unchanged but we can log a message when defaulting. * Refactor TracingKVS (#2643) * Wrap next() and hasNext() in traces * Use span names as safe * Remove iterator wrappings * checkstyle * refactor methods and remove misleading traces * Fix unit tests * release notes * Final nits * fix java arrays usage * Delete docs (#2657) * [20 minute tasks] Add test for when a batch is full (#2655) * [no release notes] Drive-by add test for when a batch is full * MetricRegistry log level downgrade + multiple timestamp tracker tests (#2636) * change metrics manager to warn plus log the metric name * more timestamp tracker tests * release notes * Extract interface for Cassandra client (#2660) * Create a CassandraClient * Propagate CassandraClient to all classes but CKVS * Use CassandraClient on CKVS * Propagate CassandraClient to remaining Impl classes * Use CassandraClient in tests * [no release notes] * client -> namespace [no release notes] (#2654) * 0.65.2 and 0.66.0 release notes (#2663) * Release notes banners * fix pr numbers * [QoS] Add getNamespace to AtlasDBConfig (#2661) * Add getNamespace [no release notes] * Timelock client config cannot be empty * Make it explicit that unspecified namespace is only possible for InMemoryKVS * CR comments * Live Reloading the TimeLock Block, Part 1: Pull to Push (#2621) * thoughts * More tests for RIH * Paranoid logging * statics * javadoc part 1 * polling refreshable * Unit tests * Remove the old RIH * lock lock * Tests that test how we deal with exceptions * logging * [no release notes] * CR comments part 1 * Make interval configurable * 
Standard nasty time edge cases * lastSeenValue does not need to be volatile * Live Reloading the TimeLock Block, Part 2: TransactionManagers Plumbing (#2622) * ServiceCreator.applyDynamic() * Propagate config through TMs * Json Serialization fixes * Some refactoring * lock/lock * Fixed checkstyle * CR comments part 1 * Switch to RPIH * add test * [no release notes] forthcoming in part 4 * checkstyle * [TTT] [no release notes] Document behaviour regarding index rows (#2658) * [no release notes] Document behaviour regarding index rows * fix compile bug * ``List`` * Refactor and Instrument CassandraClient api (#2665) * Sanitize Client API * Instrument CassandraClient * checkstyle * Address comment * [no release notes] * checkstyle * Fix cas * Live Reloading the TimeLock Block, Part 3: Working with 0 Nodes (#2647) * 0 nodes part 1 * add support for 0 servers in a ServerListConfig * extend deserialization tests * More tests * code defensively * [no release notes] defer to 2648 * Fixed CR nits * singleton server list * check immutable ts (#2406) * check immutable ts * checkstyle * release notes * Fix TM creation * checkstyle * Propagate top-level KVS method names to CassandraClient (#2669) * Propagate method names down to multiget_slice * Add the corresponding KVS method to remaining methods * Add TODO * [no release notes] * nit * Extract cql executor interface (#2670) * Instrument CqlExecutor * [no release notes] * bump awaitility (#2668) * Upgrade to newer Awaitility. 
* locks [no release notes] * unused import * Bump Atlas on Tritium 0.8.4 to fix dependency conflicts (#2662) * Bump Atlas on Tritium 0.8.4 to fix dependency conflicts * Add changes into missing file * Doc changes * Exclude Tracing and HdrHistogram from Tritium dependencies * update locks * Add excluded dependencies explicitly * Fix merge conflict in relase notes * Uncomment dependencies * Regenerate locks * Correctly log Paxos events (#2674) * Log out Paxos values when recording Paxos events * Updated release notes * Checkstyle * Pull request number * Address comments * fix docs * Slow log and tracing (#2673) * Trace and instrument the thrift client * Instrument CqlExecutor * Fix metric names of IntrumentedCassandraClient * Fix nit * Also log internal table references * Checkstyle * simplify metric names * Address comments * add slow logging to the cassandra thrift client * add slow logging to cqlExecutor * fix typos * Add tracing to the CassandraClient * trace cqlExecutor queries * Add slow-logging in the CassandraClient * Delete InstrumentedCC and InstrumentedCqlExec * Fix small nits * Checkstyle * Add kvs method names to slow logs * Fix wrapping of exception * Extract CqlQuery * Move kvs-slow-log and tracing of CqlExecutor to CCI * Propagate execute_cql3_query api breaks * checkstyle * delete unused string * checkstyle * fix number of mutations on batch_mutate * some refactors * fix compile * Refactor cassandra client (#2676) * Extract TracingCassandraClient Extract ProfilingCassandraClient Move todos and some cleanup Cherry-pick QoS metrics to develop (#2679) * [QoS] Feature/qos meters (#2640) * Metrics for bytes and counts in each read/write * Refactors, dont throw if recordMetrics throws * Use meters instead of histograms * Multiget bytes * Batch mutate exact size * Cqlresult size * Calculate exact byte sizes for all thrift objects * tests and bugfixes - partial * More tests and bugs fixed * More tests and cr comments * byte buffer size * Remove register 
histogram * checkstyle * checkstyle * locks and license * Qos metrics CassandraClient * Exclude unused classes * fix cherry pick * use supplier for object size [no release notes] * fix merge in AtlasDbConfig * rate limiting * total-time * qos config * respect max backoff itme * query weights * extra tests * num rows * checkstyle * fix tests * no int casting * Qos ete tests * shouldFailIfWritingTooManyBytes * fix test * rm file * Remove metrics * Test shouldFailIfReadingTooManyBytes * canBeWritingLargeNumberOfBytesConcurrently * checkstyle * cannotWriteLargeNumberOfBytesConcurrently * fix tests * create tm in test * More read tests (after writing a lot of data at once) * WIP * Tests that should pas * Actually update the rate * Add another test * More tests and address comments * Dont extend etesetup * Make dumping data faster * cleanup * wip * Add back lost file * Cleanup * Write tests * numReadsPerThread -> numThreads * More write tests, cleanup, check style fixes * Refactor to avoid code duplication * Cleanup * cr comments * Small read/write after a rate-limited read/write * annoying no new linw at eof * Uniform parameters for hard limiting * Don't consume any estimated bytes for a _transaction or metadata table query * Add tests * cr comments * Merge develop to the feature branch (#2741) * Merge develop * Re-delete CqlQueryUtils * Nziebart/cell timestamps qos (#2745) * handle qos exceptions in cell timestamp loader [no release notes] * actually just remove checked exception * Remove the throws in the method signature * Differentiate between read and write limits when logging (#2751) * Differentiate between read and write limits when logging * Type -> name * Use longs in the rate limiter and handle negative adjustments. (#2758) * Differentiate between read and write limits when logging * handle negative adjustments * More tests * pr comments
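
The recovery figures quoted in the #2630 commit message above (roughly 70 successes to get from 1 to 2, and roughly 700 to get back to 1000, when growing by 1% per success) follow from compound growth. A minimal sketch of that arithmetic, assuming a hypothetical helper `iterationsToReach` that is not part of AtlasDB:

```java
/** Hypothetical helper for illustration only; not part of AtlasDB. */
final class RecoveryIterations {
    private RecoveryIterations() {}

    /** Smallest n such that start * factor^n >= target, for factor > 1. */
    static long iterationsToReach(double start, double target, double factor) {
        return (long) Math.ceil(Math.log(target / start) / Math.log(factor));
    }

    public static void main(String[] args) {
        // Old behaviour: grow by 1% per successful iteration.
        System.out.println(iterationsToReach(1, 2, 1.01));    // 70
        System.out.println(iterationsToReach(1, 1000, 1.01)); // 695, the "700" quoted above
        // New behaviour: double per step once increases resume.
        System.out.println(iterationsToReach(1, 1000, 2.0));  // 10
    }
}
```

The doubling count excludes the 25 iterations that the commit message says are spent at the lower batch size before any increase is attempted.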

Fix SweepBatchConfig values to properly decrease to 1 with each failure and increase with each success

e229745

tboam requested review from jeremyk-91 and hsaraogi November 3, 2017 18:47

tboam assigned hsaraogi Nov 3, 2017

hsaraogi suggested changes Nov 5, 2017

jeremyk-91 reviewed Nov 6, 2017

tboam added 4 commits November 6, 2017 13:56

add logging when we stop reducing the batch size multiplier

99670ca

tidy up tests

b904202

tidy release notes

4ad2c95

further improve the tests

f0b6372

Merge develop into fix/sweep-backoff

541df97

hsaraogi reviewed Nov 7, 2017

tboam added 2 commits November 7, 2017 16:28

Use an AtomicInteger to handle concurrent updates

5ecf806

Merge branch 'develop' into fix/sweep-backoff

41f7654

hsaraogi approved these changes Nov 7, 2017

tboam merged commit 2e8c960 into develop Nov 8, 2017

tboam deleted the fix/sweep-backoff branch November 8, 2017 10:32
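
The behaviour this PR's commits describe (halve the batch size multiplier on each failure until it hits a floor, hold the reduced value for 25 successful iterations, then double on each further success) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the code in `AdjustableSweepBatchConfigSource`: the class name, method names, and the exact moment at which doubling resumes are hypothetical, the 0.001 floor is taken from the PR description, and `synchronized` is used here for brevity where the PR's commit mentions an `AtomicInteger`.

```java
/** Hypothetical sketch of the backoff policy; names and details are assumptions. */
final class BackoffMultiplierSketch {
    // Assumed floor: with config values of 1000, a 0.001 multiplier yields 1.
    static final double MIN_MULTIPLIER = 0.001;
    // The commit message says 25 iterations run at the lower size before increasing.
    static final int SUCCESSES_BEFORE_INCREASE = 25;

    private double multiplier = 1.0;
    private int successesSinceLastFailure = 0;

    /** Halve the multiplier on failure, never going below the floor. */
    synchronized void onFailure() {
        successesSinceLastFailure = 0;
        multiplier = Math.max(MIN_MULTIPLIER, multiplier / 2);
    }

    /** After 25 successes at the reduced size, double on each further success, capped at 1. */
    synchronized void onSuccess() {
        successesSinceLastFailure++;
        if (successesSinceLastFailure > SUCCESSES_BEFORE_INCREASE) {
            multiplier = Math.min(1.0, multiplier * 2);
        }
    }

    synchronized double multiplier() {
        return multiplier;
    }

    /** Scale a configured batch value, never returning less than 1. */
    synchronized int scaled(int configuredValue) {
        return Math.max(1, (int) (multiplier * configuredValue));
    }
}
```

Clamping the scaled value to at least 1 is what lets every parameter settle at 1 under sustained failure, including a parameter that was configured as 1 to begin with.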
