This repository has been archived by the owner on Nov 14, 2024. It is now read-only.

Feature/qos service api #2629

Merged
merged 47 commits into from
Dec 2, 2017

gsheasby (Contributor) commented Nov 3, 2017

Goals (and why): Work in progress - add a Quality of Service service, and call it when Cassandra.Client makes requests to Cassandra. Note that this is very basic so far; I'm throwing up the PR to solicit feedback on design decisions so far and future direction.

We would like to keep the QoS service implementation internal to OSS AtlasDB, and introduce a magic layer internally as a thin wrapper, similar to our approach for TimeLock. This will make our testing easier and more robust.

Implementation Description (bullets):

  • Create basic QoS service
  • Create basic QoS client that calls that service
  • Wrap Cassandra.Client with CassandraClient, and call qosClient.checkLimit when performing some operations
  • Service hands out MAX_VALUE "credits"; client uses up 1 "credit" per operation
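
The credit scheme in the bullets above can be illustrated with a minimal sketch. This is not the code from this PR: `CreditService`, `CreditClient` and `remaining` are hypothetical names, and the only behaviour assumed is what the description states (the service hands out a number of credits, and each operation spends one).

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical stand-in for the QoS service: reports how many credits a client may use. */
interface CreditService {
    long getLimit(String client);
}

/** Hypothetical client: spends one credit per operation and fails once none remain. */
final class CreditClient {
    private final AtomicLong credits;

    CreditClient(CreditService service, String clientName) {
        // Assumption: the limit is fetched once, at construction time.
        this.credits = new AtomicLong(service.getLimit(clientName));
    }

    /** Consumes one credit, or throws if the budget is already exhausted. */
    void checkLimit() {
        long previous = credits.getAndUpdate(current -> current > 0 ? current - 1 : 0);
        if (previous <= 0) {
            throw new IllegalStateException("Rate limit exceeded");
        }
    }

    long remaining() {
        return credits.get();
    }
}
```

With a service that returns `Long.MAX_VALUE`, as described above, `checkLimit` would effectively never throw; a small limit such as `new CreditClient(name -> 2L, "test")` makes the behaviour visible.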

Next:

  • CassandraClient should also call checkLimit for CQL query execution
  • Client periodically calls service.getLimit; this refreshes the "credits"
  • Centralise service (should be one service per stack, and one client per product node)
  • Client reports usage metrics to server
  • Server can configure different limits per client
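
The planned periodic refresh could look roughly like the following. This is a sketch under assumptions: `RefreshingCredits`, `tryConsume` and `refreshEvery` are hypothetical names, the `LongSupplier` stands in for a call to `service.getLimit`, and the real client may refresh quite differently.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

/** Hypothetical credit pool that is topped up from the service at a fixed interval. */
final class RefreshingCredits {
    private final LongSupplier limitSource; // stands in for service.getLimit(client)
    private final AtomicLong credits = new AtomicLong();

    RefreshingCredits(LongSupplier limitSource) {
        this.limitSource = limitSource;
        refresh();
    }

    /** Replaces the remaining credits with the limit currently reported by the service. */
    void refresh() {
        credits.set(limitSource.getAsLong());
    }

    /** Spends one credit if any remain; returns false when the pool is empty. */
    boolean tryConsume() {
        return credits.getAndUpdate(current -> current > 0 ? current - 1 : 0) > 0;
    }

    /** Schedules refresh() on a daemon thread; the caller owns the returned executor. */
    ScheduledExecutorService refreshEvery(long period, TimeUnit unit) {
        ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor(runnable -> {
            Thread thread = new Thread(runnable, "qos-refresh");
            thread.setDaemon(true);
            return thread;
        });
        executor.scheduleAtFixedRate(this::refresh, period, period, unit);
        return executor;
    }
}
```

Keeping `refresh()` callable on its own makes the behaviour easy to test without waiting on a timer.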

Future (but still before merge):

  • Different amounts of credits are used up depending on the complexity of the request
  • Implement soft limiting
  • Server updates limits based on Cassandra load
  • Figure out exception handling

Concerns (what feedback would you like?): Design decisions; are the next/future steps reasonable?

Where should we start reviewing?: CassandraClient

Priority (whenever / two weeks / yesterday): do not merge - but will be iterating quickly on this over the next week.



codecov-io commented Nov 3, 2017

Codecov Report

Merging #2629 into develop will decrease coverage by 0.07%.
The diff coverage is 68.17%.


@@              Coverage Diff              @@
##             develop    #2629      +/-   ##
=============================================
- Coverage      60.31%   60.23%   -0.08%     
- Complexity      4492     4759     +267     
=============================================
  Files            887      902      +15     
  Lines          40737    41121     +384     
  Branches        4048     4056       +8     
=============================================
+ Hits           24569    24769     +200     
- Misses         14678    14858     +180     
- Partials        1490     1494       +4

| Impacted Files | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| ...alantir/atlasdb/memory/InMemoryAtlasDbFactory.java | 20.58% <ø> (ø) | 5 <0> (ø) | ⬇️ |
| ...antir/atlasdb/keyvalue/dbkvs/DbAtlasDbFactory.java | 15% <ø> (ø) | 3 <0> (ø) | ⬇️ |
| ...tir/atlasdb/cassandra/CassandraAtlasDbFactory.java | 25% <ø> (ø) | 0 <0> (ø) | ⬇️ |
| ...ntir/atlasdb/keyvalue/jdbc/JdbcAtlasDbFactory.java | 21.42% <ø> (ø) | 3 <0> (ø) | ⬇️ |
| ...ue/cassandra/CassandraExpiringKeyValueService.java | 0% <ø> (ø) | 0 <0> (ø) | ⬇️ |
| ...ain/java/com/palantir/atlasdb/qos/QueryWeight.java | 0% <0%> (ø) | 0 <0> (?) | |
| ...yvalue/cassandra/CassandraKeyValueServiceImpl.java | 0.47% <0%> (-0.01%) | 0 <0> (ø) | |
| ...r/atlasdb/keyvalue/cassandra/paging/RowGetter.java | 0% <0%> (ø) | 0 <0> (ø) | ⬇️ |
| ...db/keyvalue/cassandra/sweep/GetCellTimestamps.java | 0% <0%> (ø) | 0 <0> (ø) | ⬇️ |
| .../com/palantir/atlasdb/services/ServicesConfig.java | 100% <100%> (ø) | 0 <0> (ø) | ⬇️ |
| ... and 75 more | | | |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 1e32e9d...92aa59c.

return checkLimitAndCall(() -> super.get_range_slices(column_parent, predicate, range, consistency_level));
}

private <T> T checkLimitAndCall(Callable<T> callable) {

What's the purpose of accepting a callable here? Why not just call checkLimit() before each call?

Got rid of this method.

qosClient.checkLimit();
return callable.call();
} catch (Exception ex) {
throw Throwables.throwUncheckedException(ex);

We probably don't want to wrap these exceptions.

Fixed.
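
For reference, the wrapping concern raised above can be avoided even when a helper like this is kept. The sketch below is an illustration only, not what the PR ended up doing (the helper was removed): `ThrowingCall` and `LimitedCaller` are hypothetical names, and the `Runnable` stands in for `qosClient.checkLimit()`. Making the helper generic in the exception type lets a checked exception thrown by the wrapped call reach the caller with its original type.

```java
/** Hypothetical functional interface: like Callable, but generic in the checked exception type. */
@FunctionalInterface
interface ThrowingCall<T, E extends Exception> {
    T call() throws E;
}

/** Hypothetical helper that checks the limit first and never wraps exceptions. */
final class LimitedCaller {
    private final Runnable limitCheck; // stands in for qosClient.checkLimit()

    LimitedCaller(Runnable limitCheck) {
        this.limitCheck = limitCheck;
    }

    /** Runs the limit check, then the call; any exception propagates unchanged. */
    <T, E extends Exception> T checkLimitAndCall(ThrowingCall<T, E> call) throws E {
        limitCheck.run();
        return call.call();
    }
}
```

A caller whose lambda throws, say, `IOException` can then catch `IOException` directly rather than unwrapping a runtime exception.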

hsaraogi and others added 8 commits November 10, 2017 15:55
* Create one QosClient for each service

QosClientBuilder hooked up to KVS create

Create the QosClient in CassandraClientPoolImpl if the config is specified.

Create FakeQosClient if the config is not specified

Cleanup

get broken tests to pass

* Locks

* Fix failing tests

* Add getNamespace [no release notes]

* Create QosClient at the Top level

* fix test

* test and checkstyle fixes

* locks
* Metrics for bytes and counts in each read/write

* Refactors, dont throw if recordMetrics throws

* Use meters instead of histograms

* Multiget bytes

* Batch mutate exact size

* Cqlresult size

* Calculate exact byte sizes for all thrift objects

* tests and bugfixes - partial

* More tests and bugs fixed

* More tests and cr comments

* byte buffer size

* Remove register histogram

* checkstyle

* checkstyle

* locks and license
* QosClient with ratelimiter

* Checkstyle
* Create a jaxrs-client for the integ tests

* build fix

* clean up
* Fix SweepBatchConfig values to properly decrease to 1 with each failure and increase with each success (#2630)

* Fix SweepBatchConfig values to properly decrease to 1 with each failure and increase with each success

* add logging when we stop reducing the batch size multiplier

* further improve the tests

* Allow sweep to recover faster after backing off. Previously we increased by 1% for each successive success, so if a value had been reduced to 1 it took 70 iterations to reach 2 and 700 iterations to get back to 1000. Now we always run 25 iterations with the lower batch size and then try increasing the rate by doubling each time. This means that when sweep has to back off it should speed up again quickly.

* Use an AtomicInteger to handle concurrent updates

* SweeperService logging improvements (#2618)

* SweeperServiceImpl now logs when it starts sweeping, to make it clear whether it is running a full sweep or not

* Added sweep parameters to the log lines

* no longer default the service parameter in the interface, this way we can see when the parameter isn't provided and we are defaulting to true.  Behaviour is unchanged but we can log a message when defaulting.

* Refactor TracingKVS (#2643)

* Wrap next() and hasNext() in traces

* Use span names as safe

* Remove iterator wrappings

* checkstyle

* refactor methods and remove misleading traces

* Fix unit tests

* release notes

* Final nits

* fix java arrays usage

* Delete docs (#2657)

* [20 minute tasks] Add test for when a batch is full (#2655)

* [no release notes] Drive-by add test for when a batch is full

* MetricRegistry log level downgrade + multiple timestamp tracker tests (#2636)

* change metrics manager to warn plus log the metric name

* more timestamp tracker tests

* release notes

* Extract interface for Cassandra client (#2660)

* Create a CassandraClient

* Propagate CassandraClient to all classes but CKVS

* Use CassandraClient on CKVS

* Propagate CassandraClient to remaining Impl classes

* Use CassandraClient in tests

* [no release notes]

* client -> namespace [no release notes] (#2654)

* 0.65.2 and 0.66.0 release notes (#2663)

* Release notes banners

* fix pr numbers

* [QoS] Add getNamespace to AtlasDBConfig (#2661)

* Add getNamespace [no release notes]

* Timelock client config cannot be empty

* Make it explicit that unspecified namespace is only possible for InMemoryKVS

* CR comments

* Live Reloading the TimeLock Block, Part 1: Pull to Push (#2621)

* thoughts

* More tests for RIH

* Paranoid logging

* statics

* javadoc part 1

* polling refreshable

* Unit tests

* Remove the old RIH

* lock lock

* Tests that test how we deal with exceptions

* logging

* [no release notes]

* CR comments part 1

* Make interval configurable

* Standard nasty time edge cases

* lastSeenValue does not need to be volatile

* Live Reloading the TimeLock Block, Part 2: TransactionManagers Plumbing (#2622)

* ServiceCreator.applyDynamic()

* Propagate config through TMs

* Json Serialization fixes

* Some refactoring

* lock/lock

* Fixed checkstyle

* CR comments part 1

* Switch to RPIH

* add test

* [no release notes] forthcoming in part 4

* checkstyle

* [TTT] [no release notes] Document behaviour regarding index rows (#2658)

* [no release notes] Document behaviour regarding index rows

* fix compile bug

* ``List``

* Refactor and Instrument CassandraClient api (#2665)

* Sanitize Client API

* Instrument CassandraClient

* checkstyle

* Address comment

* [no release notes]

* checkstyle

* Fix cas

* Live Reloading the TimeLock Block, Part 3: Working with 0 Nodes (#2647)

* 0 nodes part 1

* add support for 0 servers in a ServerListConfig

* extend deserialization tests

* More tests

* code defensively

* [no release notes] defer to 2648

* Fixed CR nits

* singleton server list

* check immutable ts (#2406)

* check immutable ts

* checkstyle

* release notes

* Fix TM creation

* checkstyle

* Propagate top-level KVS method names to CassandraClient (#2669)

* Propagate method names down to multiget_slice

* Add the corresponding KVS method to remaining methods

* Add TODO

* [no release notes]

* nit

* Extract cql executor interface (#2670)

* Instrument CqlExecutor

* [no release notes]

* bump awaitility (#2668)

* Upgrade to newer Awaitility.

* locks [no release notes]

* unused import

* Bump Atlas on Tritium 0.8.4 to fix dependency conflicts (#2662)

* Bump Atlas on Tritium 0.8.4 to fix dependency conflicts

* Add changes into missing file

* Doc changes

* Exclude Tracing and HdrHistogram from Tritium dependencies

* update locks

* Add excluded dependencies explicitly

* Fix merge conflict in release notes

* Uncomment dependencies

* Regenerate locks

* Correctly log Paxos events (#2674)

* Log out Paxos values when recording Paxos events

* Updated release notes

* Checkstyle

* Pull request number

* Address comments

* fix docs

* Slow log and tracing (#2673)

* Trace and instrument the thrift client

* Instrument CqlExecutor

* Fix metric names of InstrumentedCassandraClient

* Fix nit

* Also log internal table references

* Checkstyle

* simplify metric names

* Address comments

* add slow logging to the cassandra thrift client

* add slow logging to cqlExecutor

* fix typos

* Add tracing to the CassandraClient

* trace cqlExecutor queries

* Add slow-logging in the CassandraClient

* Delete InstrumentedCC and InstrumentedCqlExec

* Fix small nits

* Checkstyle

* Add kvs method names to slow logs

* Fix wrapping of exception

* Extract CqlQuery

* Move kvs-slow-log and tracing of CqlExecutor to CCI

* Propagate execute_cql3_query api breaks

* checkstyle

* delete unused string

* checkstyle

* fix number of mutations on batch_mutate

* some refactors

* fix compile

* Refactor cassandra client (#2676)

* Extract TracingCassandraClient

Extract ProfilingCassandraClient

Move todos and some cleanup

Cherry-pick QoS metrics to develop (#2679)

* [QoS] Feature/qos meters (#2640)

* Metrics for bytes and counts in each read/write

* Refactors, dont throw if recordMetrics throws

* Use meters instead of histograms

* Multiget bytes

* Batch mutate exact size

* Cqlresult size

* Calculate exact byte sizes for all thrift objects

* tests and bugfixes - partial

* More tests and bugs fixed

* More tests and cr comments

* byte buffer size

* Remove register histogram

* checkstyle

* checkstyle

* locks and license

* Qos metrics CassandraClient

* Exclude unused classes

* fix cherry pick

* use supplier for object size [no release notes]

* fix merge in AtlasDbConfig
});
}

public CqlResult readLocksTable() throws TException {
return clientPool.run(client -> {
String lockRowName = getHexEncodedBytes(CassandraConstants.GLOBAL_DDL_LOCK_ROW_NAME);
String lockColName = getHexEncodedBytes(CassandraConstants.GLOBAL_DDL_LOCK_COLUMN_NAME);
String selectCql = String.format(
CqlQuery selectCql = new CqlQuery(

Is this related to the QoS service change?

@POST
@Consumes(MediaType.APPLICATION_JSON)
@Produces(MediaType.APPLICATION_JSON)
long getLimit(@Safe @PathParam("client") String client);

Our meters are tracking #s of requests and bytes read/written. Which limit is this meant to represent - one of them or some combined limit across all of them?

fsamuel-bs and others added 6 commits November 20, 2017 17:59
* rate limiting

* [QoS] total time spent talking to Cassandra (#2687)

* total-time

* [QoS] Client config (#2690)

* qos config

* respect max backoff time

* [QoS] [Refactor] Query Weights (#2697)

* query weights

* extra tests

* [QoS] Number of rows per query (#2698)

* num rows

* checkstyle

* fix tests

* no int casting

* fix numRows calculation on batch_mutate

* [QoS] CAS metrics (#2705)

* cas metrics

* exceptions (#2706)

* [QoS] Guava license (#2703)

* guava license

* Cleanup: class reference

* [QoS] live reload (#2710)

* live reload and logging

* millis

* checkpoint

* fix tests

* comments

* checkstyle

* [QoS] Don't rate limit CAS (#2711)

* dont limit cas

* Remove tests of deleted method

* Cherrypick/qos exception mapping (#2715)

* very simple ratelimitexceededexception

* Need to be able to throw RLEE directly from Cass, rather than ADDE(RLEE)s

* fix bug with ADDE(RLEE)

* Exception Mapper

* unravel a bad javadoc

* CR comments part 1

* lock lock

* split qos aware throwables

* visibility

* fix compile break

* checkstyle

* handle exceptions properly
* Refactor the name of the functions

* Estimate based on the number of rows

* Fix modifiers on ThriftQueryWeighers

* Add unit tests to estimation logic

* ThriftQueryWeighers.multigetSlice takes a List, not number of rows

* getRangeSlices takes KeyRange, not count
* Address #2683 comments

* Clarify query and add cause

* Add just the cqlQuery.queryFormat

* checkstyle

* Update test

We changed the error message...

* rate limiting

* total-time

* qos config

* respect max backoff time

* query weights

* extra tests

* num rows

* checkstyle

* fix tests

* no int casting

* Qos ete tests

* shouldFailIfWritingTooManyBytes

* fix test

* rm file

* Remove metrics

* Test shouldFailIfReadingTooManyBytes

* canBeWritingLargeNumberOfBytesConcurrently

* checkstyle

* cannotWriteLargeNumberOfBytesConcurrently

* fix tests

* create tm in test

* More read tests (after writing a lot of data at once)

* WIP

* Tests that should pass

* Actually update the rate

* Add another test

* More tests and address comments

* Dont extend etesetup

* Make dumping data faster

* cleanup

* wip

* Add back lost file

* Cleanup

* Write tests

* numReadsPerThread -> numThreads

* More write tests, cleanup, check style fixes

* Refactor to avoid code duplication

* Cleanup

* cr comments

* Small read/write after a rate-limited read/write

* annoying no new line at eof

* Uniform parameters for hard limiting

* Don't consume any estimated bytes for a _transaction or metadata table query

* Add tests

* cr comments
@hsaraogi hsaraogi force-pushed the feature/qos-service-api branch from 0d59168 to be45487 Compare November 24, 2017 16:40
hsaraogi and others added 2 commits November 27, 2017 12:59
* Merge develop

* Re-delete CqlQueryUtils
* handle qos exceptions in cell timestamp loader [no release notes]

* actually just remove checked exception

* Remove the throws in the method signature
}

public static long getCasByteCount(List<Column> updates) {
// TODO(nziebart): CAS actually writes more bytes than this, because the associated Paxos negotations must

@nziebart Given that CAS is not going through the rate limiter, we might be able to skip this TODO?

Yeah, we can leave this alone for now. Ideally we still record metrics for CAS; we just don't actually apply rate limiting to it. But we don't need to block on this.
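
One way to read the suggestion above (record metrics for CAS, but skip rate limiting) is sketched below. `MeteredClient`, `executeLimited` and `executeUnlimited` are hypothetical names, and the fixed byte budget is a simplification of whatever limiter the real client uses.

```java
import java.util.function.Supplier;

/** Hypothetical client that meters every operation but only rate-limits some of them. */
final class MeteredClient {
    private long remainingBudget; // simplified stand-in for a real rate limiter
    private long bytesRecorded;

    MeteredClient(long byteBudget) {
        this.remainingBudget = byteBudget;
    }

    /** Ordinary reads and writes: enforce the budget, then record the bytes. */
    <T> T executeLimited(long bytes, Supplier<T> operation) {
        consume(bytes);
        record(bytes);
        return operation.get();
    }

    /** CAS-style operations: record the bytes, but never reject the call. */
    <T> T executeUnlimited(long bytes, Supplier<T> operation) {
        record(bytes);
        return operation.get();
    }

    synchronized long bytesRecorded() {
        return bytesRecorded;
    }

    private synchronized void consume(long bytes) {
        if (bytes > remainingBudget) {
            throw new IllegalStateException("Rate limit exceeded");
        }
        remainingBudget -= bytes;
    }

    private synchronized void record(long bytes) {
        bytesRecorded += bytes;
    }
}
```

In this sketch a large CAS still shows up in `bytesRecorded()`, so the metrics stay honest, while the budget only shrinks for limited operations.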

// TODO(nziebart): we need to inspect the schema to see how many rows there are - a CQL row is NOT a
// partition. rows here will depend on the type of query executed in CqlExecutor: either (column, ts) pairs,
// or (key, column, ts) triplets
// Currently, transaction or metadata table queries dont use the CQL executor,

I added a zero estimate for other transaction or metadata queries. Since they don't use the CQL executor, we can get away with this for now.
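
A zero estimate of this kind might be sketched as follows. Everything here is assumed for illustration: the class name, the table names in `ZERO_ESTIMATE_TABLES`, and the per-row byte figure passed in by the caller are hypothetical, and the actual estimation logic in the PR may differ.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical up-front estimate of how many bytes a query will consume. */
final class QueryEstimates {
    // Assumption: illustrative table names only; the real check may be different.
    private static final Set<String> ZERO_ESTIMATE_TABLES =
            Collections.unmodifiableSet(new HashSet<>(Arrays.asList("_transactions", "_metadata")));

    private QueryEstimates() {}

    /** Estimates bytes from the row count, except for tables that are never charged up front. */
    static long estimateBytes(String tableName, int numRows, long assumedBytesPerRow) {
        if (ZERO_ESTIMATE_TABLES.contains(tableName)) {
            return 0L;
        }
        return numRows * assumedBytesPerRow;
    }
}
```

The estimate only governs what is charged before the query runs; the actual bytes read or written can still be recorded afterwards.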

: readWeigher(ThriftObjectSizeUtils::getColumnOrSuperColumnSize, ignored -> 1, 1);
}

static final QosClient.QueryWeigher<CqlResult> EXECUTE_CQL3_QUERY =

EXECUTE_CQL3_QUERY is a misleading name for the weigher. Can we estimate the number of rows from the CqlResult? Each CqlRow (partition) has a list of columns (each of which has a ts). Is CqlResult.getRows().stream().mapToInt(row -> row.getColumns().size()).sum() the estimate here?

A CqlRow is not the same as a Cassandra row. For example, if I select (key, column1, column2) I will get one CqlRow per column, but there's a difference between having 1000 columns in a single row and 1000 rows with a single column.
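
The distinction drawn above can be made concrete with a small sketch. `ResultRow` is a hypothetical, flattened stand-in for a row returned by a CQL query (it is not the Thrift `CqlRow` class), and `RowCounting` is an assumed helper name.

```java
import java.util.List;

/** Hypothetical flattened result row: one (key, column) pair per row returned by the query. */
final class ResultRow {
    final String key;
    final String column;

    ResultRow(String key, String column) {
        this.key = key;
        this.column = column;
    }
}

final class RowCounting {
    private RowCounting() {}

    /** Counts returned result rows, i.e. one per (key, column) pair. */
    static long resultRowCount(List<ResultRow> rows) {
        return rows.size();
    }

    /** Counts distinct keys, which is closer to the number of underlying rows touched. */
    static long distinctKeyCount(List<ResultRow> rows) {
        return rows.stream().map(row -> row.key).distinct().count();
    }
}
```

For a single key with three columns the two counts disagree (three result rows, one distinct key); for three keys with one column each they agree. That gap is why counting result rows alone can misstate the number of rows a query touched.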

}

// TODO(nziebart): we really shouldn't be needing to catch exceptions here
private static long safeGetNumBytesOrDefault(Supplier<Long> numBytes) {

Most kinds of thrift objects, with different fields filled in or empty, have been tested, but I wouldn't say the tests are comprehensive or cover all kinds of objects. Do we want to add more tests? Is there some way to just generate a bunch of objects and ensure that the size-calculating method doesn't throw?

I would think that the Cassandra ETE tests would catch these errors. But I'm OK with leaving this for now, and removing it fairly soon after we can verify that this error doesn't occur in the field.
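
For context, a defensive wrapper of the kind being discussed might look like the sketch below. Only the method name and its `Supplier<Long>` parameter come from the diff; the body, the enclosing class name and the default value are assumptions.

```java
import java.util.function.Supplier;

/** Hypothetical holder for the defensive size calculation discussed above. */
final class SafeSizes {
    // Assumption: fall back to zero bytes when the size cannot be computed.
    static final long DEFAULT_NUM_BYTES = 0L;

    private SafeSizes() {}

    /** Returns the computed size, or the default if the calculation throws. */
    static long safeGetNumBytesOrDefault(Supplier<Long> numBytes) {
        try {
            return numBytes.get();
        } catch (RuntimeException e) {
            // A failure while measuring should not fail the query itself.
            // (A null result also lands here, via the unboxing NullPointerException.)
            return DEFAULT_NUM_BYTES;
        }
    }
}
```

The trade-off is the one raised in the thread: swallowing the exception keeps queries working, but it can hide bugs in the size calculation, so it is reasonable to remove once the calculation is trusted.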

6 participants