Skip to content
This repository has been archived by the owner on Nov 14, 2024. It is now read-only.

[QoS] rate limiting #2684

Closed
wants to merge 28 commits into from
Closed

Conversation

nziebart
Copy link
Contributor

@nziebart nziebart commented Nov 15, 2017

Goals (and why):
Hook up rate limiting to QosClient

Implementation Description (bullets):

  • Move rate limiting and metrics recording into QosClient
  • Remove scheduled refreshing for now
  • Add some TODOs for follow up PRs

Concerns (what feedback would you like?):

Where should we start reviewing?:

Priority (whenever / two weeks / yesterday):


This change is Reviewable

tboam and others added 27 commits November 8, 2017 10:32
…re and increase with each success (#2630)

* Fix SweepBatchConfig values to properly decrease to 1 with each failure and increase with each success

* add logging when we stop reducing the batch size multiplier

* further improve the tests

* Allow sweep to recover faster after backing off.  Before we would increase by 1% for each successive success, if we had reduced a value to 1 it would be 70 iterations before we got 2 and 700 iterations before we got back to 1000.  Now we always 25 iterations with the lower batch size and then try increasing the rate by doubling each time.  This means that when sweep has to back off it should speed up again quickly.

* Use an AtomicInteger to handle concurrent updates
* SweeperServiceImpl now logs when it starts sweeping make it clear if it is running full sweep or not

* Added sweep parameters to the log lines

* no longer default the service parameter in the interface, this way we can see when the parameter isn't provided and we are defaulting to true.  Behaviour is unchanged but we can log a message when defaulting.
* Wrap next() and hasNext() in traces

* Use span names as safe

* Remove iterator wrappings

* checkstyle

* refactor methods and remove misleading traces

* Fix unit tests

* release notes

* Final nits

* fix java arrays usage
* [no release notes] Drive-by add test for when a batch is full
…#2636)

* change metrics manager to warn plus log the metric name

* more timestamp tracker tests

* release notes
* Create a CassandraClient

* Propagate CassandraClient to all classes but CKVS

* Use CassandraClient on CKVS

* Propagate CassandraClient to remaining Impl classes

* Use CassandraClient in tests

* [no release notes]
* Release notes banners

* fix pr numbers
* Add getNamespace [no release notes]

* Timelock client config cannot be empty

* Make it explicit that unspecified namespace is only possible for InMemoryKVS

* CR comments
* thoughts

* More tests for RIH

* Paranoid logging

* statics

* javadoc part 1

* polling refreshable

* Unit tests

* Remove the old RIH

* lock lock

* Tests that test how we deal with exceptions

* logging

* [no release notes]

* CR comments part 1

* Make interval configurable

* Standard nasty time edge cases

* lastSeenValue does not need to be volatile
…ng (#2622)

* ServiceCreator.applyDynamic()

* Propagate config through TMs

* Json Serialization fixes

* Some refactoring

* lock/lock

* Fixed checkstyle

* CR comments part 1

* Switch to RPIH

* add test

* [no release notes] forthcoming in part 4

* checkstyle
* [no release notes] Document behaviour regarding index rows

* fix compile bug

* ``List``
* Sanitize Client API

* Instrument CassandraClient

* checkstyle

* Address comment

* [no release notes]

* checkstyle

* Fix cas
* 0 nodes part 1

* add support for 0 servers in a ServerListConfig

* extend deserialization tests

* More tests

* code defensively

* [no release notes] defer to 2648

* Fixed CR nits

* singleton server list
* check immutable ts

* checkstyle

* release notes

* Fix TM creation

* checkstyle
* Propagate method names down to multiget_slice

* Add the corresponding KVS method to remaining methods

* Add TODO

* [no release notes]

* nit
* Instrument CqlExecutor

* [no release notes]
* Upgrade to newer Awaitility.

* locks [no release notes]

* unused import
* Bump Atlas on Tritium 0.8.4 to fix dependency conflicts

* Add changes into missing file

* Doc changes

* Exclude Tracing and HdrHistogram from Tritium dependencies

* update locks

* Add excluded dependencies explicitly

* Fix merge conflict in relase notes

* Uncomment dependencies

* Regenerate locks
* Log out Paxos values when recording Paxos events

* Updated release notes

* Checkstyle

* Pull request number

* Address comments

* fix docs
* Trace and instrument the thrift client

* Instrument CqlExecutor

* Fix metric names of IntrumentedCassandraClient

* Fix nit

* Also log internal table references

* Checkstyle

* simplify metric names

* Address comments

* add slow logging to the cassandra thrift client

* add slow logging to cqlExecutor

* fix typos

* Add tracing to the CassandraClient

* trace cqlExecutor queries

* Add slow-logging in the CassandraClient

* Delete InstrumentedCC and InstrumentedCqlExec

* Fix small nits

* Checkstyle

* Add kvs method names to slow logs

* Fix wrapping of exception

* Extract CqlQuery

* Move kvs-slow-log and tracing of CqlExecutor to CCI

* Propagate execute_cql3_query api breaks

* checkstyle

* delete unused string

* checkstyle

* fix number of mutations on batch_mutate

* some refactors

* fix compile
* Extract TracingCassandraClient

Extract ProfilingCassandraClient

Move todos and some cleanup

Cherry-pick QoS metrics to develop (#2679)

* [QoS] Feature/qos meters (#2640)

* Metrics for bytes and counts in each read/write

* Refactors, dont throw if recordMetrics throws

* Use meters instead of histograms

* Multiget bytes

* Batch mutate exact size

* Cqlresult size

* Calculate exact byte sizes for all thrift objects

* tests and bugfixes - partial

* More tests and bugs fixed

* More tests and cr comments

* byte buffer size

* Remove register histogram

* checkstyle

* checkstyle

* locks and license

* Qos metrics CassandraClient

* Exclude unused classes

* fix cherry pick
recordBytesRead(() -> getApproximateReadByteCount(result));
return result;
return qosClient.executeRead(
() -> DEFAULT_ESTIMATED_READ_BYTES,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we feed in Cassandra metrics here to estimate the byte read in advance? Otherwise, should we just adjust the ratelimiter after the cassandra call instead of trying to estimate here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we do adjust the rate limiter after the call is done. however, we won't get rate limited unless we ask the ratelimiter for some bandwidth before the request starts

Copy link
Contributor

@hsaraogi hsaraogi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@codecov-io
Copy link

codecov-io commented Nov 17, 2017

Codecov Report

Merging #2684 into feature/qos-service-api will increase coverage by 0.06%.
The diff coverage is 52.35%.

Impacted file tree graph

@@                      Coverage Diff                      @@
##             feature/qos-service-api    #2684      +/-   ##
=============================================================
+ Coverage                      60.29%   60.36%   +0.06%     
- Complexity                      3413     3425      +12     
=============================================================
  Files                            886      891       +5     
  Lines                          40544    40587      +43     
  Branches                        4035     4035              
=============================================================
+ Hits                           24447    24499      +52     
+ Misses                         14608    14598      -10     
- Partials                        1489     1490       +1
Impacted Files Coverage Δ Complexity Δ
.../com/palantir/atlasdb/services/ServicesConfig.java 100% <ø> (ø) 0 <0> (ø) ⬇️
...va/com/palantir/atlasdb/containers/Containers.java 83.75% <ø> (ø) 19 <0> (ø) ⬇️
...lasdb/factory/ServiceDiscoveringAtlasSupplier.java 91.22% <ø> (+1.39%) 16 <0> (ø) ⬇️
...ava/com/palantir/atlasdb/sweep/SweeperService.java 100% <ø> (ø) 0 <0> (ø) ⬇️
...palantir/atlasdb/keyvalue/cassandra/Heartbeat.java 0% <ø> (ø) 0 <0> (ø) ⬇️
.../java/com/palantir/atlasdb/spi/AtlasDbFactory.java 83.33% <ø> (-2.39%) 3 <0> (ø)
...tlasdb/keyvalue/cassandra/CassandraClientImpl.java 0% <0%> (ø) 0 <0> (ø) ⬇️
...atlasdb/keyvalue/cassandra/SchemaMutationLock.java 0% <0%> (ø) 0 <0> (ø) ⬇️
.../com/palantir/atlasdb/config/ServerListConfig.java 0% <0%> (ø) 0 <0> (ø) ⬇️
...r/atlasdb/keyvalue/cassandra/paging/RowGetter.java 0% <0%> (ø) 0 <0> (ø) ⬇️
... and 103 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b8f8312...2564543. Read the comment docs.

@nziebart nziebart force-pushed the nziebart/qos-rate-limiting branch from 3713723 to e3bd685 Compare November 17, 2017 14:53
Copy link
Contributor

@gsheasby gsheasby left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a few nits and comments here and there. Generally the design is much nicer than it was (especially around AtlasDbQosClient). Approving to unblock other things

Supplier<Integer> estimatedWeigher,
ReadQuery<T, E> query,
Function<T, Integer> weigher) throws E {
int estimatedWeight = getWeight(estimatedWeigher, 1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the always-oneness of this is confusing. DEFAULT_FALLBACK or even inlining the parameter?

If this is a very transient thing then I don't care.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also addressed in #2697

qosClient.checkLimit();

CASResult result = client.cas(tableReference, key, expected, updates, serial_consistency_level,
// TODO(nziebart): should this be considered as a write or do we need to treat is as both read and write?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds likely - I would expect that CAS operations are twice as heavy for C* as reads/writes.

} catch (Exception e) {
log.warn("Encountered an exception when recording write metrics.", e);
}
return qosClient.executeRead(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit, should probably javadoc at some point that we treat this as if it's a read.

@@ -61,15 +62,15 @@ public void setUp() {
public void multigetSliceChecksLimit() throws TException, LimitExceededException {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit, test method names could now be improved

ScheduledExecutorService scheduler = new InstrumentedScheduledExecutorService(
Executors.newSingleThreadScheduledExecutor(),
AtlasDbMetrics.getMetricRegistry(),
"qos-client-executor");
return new AtlasDbQosClient(qosService, scheduler, config().getNamespaceString(), QosRateLimiter.create());
return new AtlasDbQosClient(QosRateLimiter.create(), new QosMetrics());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like qosService and scheduler are now unused (although I suspect there's some remoting magic going on here?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these are removed in a subsequent PR

@@ -16,16 +16,22 @@

package com.palantir.atlasdb.qos;

import java.util.function.Function;
import java.util.function.Supplier;

public class FakeQosClient implements QosClient {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NeverLimitingQosClient?

<T, E extends Exception> T executeRead(
Supplier<Integer> estimatedWeight,
ReadQuery<T, E> query,
Function<T, Integer> weigher) throws E;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

weigher seems sub-optimally named - doesn't capture that this gives us the actual weight, after query execution. actualWeight (or clunkier, actualWeightGivenResult) would dovetail nicely with estimatedWeight.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, this is addressed somewhat in the refactor at #2697

@@ -123,7 +121,7 @@ final long reserveEarliestAvailable(int requiredPermits, long nowMicros) {
storedPermitsToWaitTime(this.storedPermits, storedPermitsToSpend)
+ (long) (freshPermits * stableIntervalMicros);

this.nextFreeTicketMicros = LongMath.saturatedAdd(nextFreeTicketMicros, waitMicros);
this.nextFreeTicketMicros = nextFreeTicketMicros + waitMicros;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: This was edited, but not annotated with CHANGELOG as other edits are in #2703.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i actually re-copied the whole file in that PR, and used a different version that didn't rely on LongMath, so this line is no longer edited

import org.junit.Before;
import org.junit.Test;

import com.palantir.atlasdb.qos.client.AtlasDbQosClient;
import com.palantir.atlasdb.qos.ratelimit.QosRateLimiter;

public class AtlasDbQosClientTest {

private static final int ESTIMATED_BYTES = 10;
private static final int ACTUAL_BYTES = 51;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like there's a test waiting to be written about a second read going through even though ACTUAL_BYTES > BYTE_LIMIT/2, and then another one about a third read not being allowed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that stuff is currently tested in QosRateLimiterTest, and the tests here just assert that we delegate to that class.

We do need ete tests for the whole setup here, which @hsaraogi is working on

* total-time

* [QoS] Client config (#2690)

* qos config

* respect max backoff itme

* [QoS] [Refactor] Query Weights (#2697)

* query weights

* extra tests

* [QoS] Number of rows per query (#2698)

* num rows

* checkstyle

* fix tests

* no int casting

* fix numRows calculation on batch_mutate
@nziebart nziebart changed the base branch from nziebart/merge-develop-into-qos to feature/qos-service-api November 17, 2017 16:58
@nziebart nziebart force-pushed the nziebart/qos-rate-limiting branch from 6d29a4b to 2564543 Compare November 17, 2017 17:59
@fsamuel-bs fsamuel-bs mentioned this pull request Nov 17, 2017
@nziebart
Copy link
Contributor Author

Moved to #2709

@nziebart nziebart closed this Nov 17, 2017
@fsamuel-bs fsamuel-bs deleted the nziebart/qos-rate-limiting branch January 5, 2018 15:30
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants