[coordinator] Configurable writes to leaving shards count towards consistency, add read level unstrict all #2687

robskillington · 2020-10-02T02:54:59Z

What this PR does / why we need it:

Allows for writes to leaving shards to count towards consistency and adds read consistency level unstrict all. Combined these settings allow for clusters under stress to perform a node join without removing the availability of the leaving node. Since this can cause issues where writes made to unavailable nodes count towards consistency the read unstrict all consistency level is made available so you can attempt always to try to read from all replicas. This means you won't optimistically short-circuit once reading from majority of nodes (with unstrict majority) and will always give the cluster a chance to read from all replicas (so if a write during a node join with write to a leaving shard that count towards consistency leaves a write on two nodes but the shard that is given up after leaving drops the data and the joining node didn't persist it, you are most likely to always read it from the single shard that still has the data).

Only advised to use this when the cluster is under stress, and also advised to be using repairs in this situation.

Special notes for your reviewer:

Does this PR introduce a user-facing and/or backwards incompatible change?:

NONE

Does this PR require updating code package or user-facing documentation?:

NONE

…sistency, add read level unstrict all

src/dbnode/client/config.go

arnikola

Lgtm w nits

src/dbnode/client/config.go

src/dbnode/client/types.go

rallen090

LGTM

…consistency-read-unstrict-all

commit 47c219a Author: Linas Medžiūnas <[email protected]> Date: Tue Oct 6 13:31:37 2020 +0300 [dbnode] GetNamespaceFn returns public Namespace interface (#2696) commit ac2ef9b Author: Linas Medžiūnas <[email protected]> Date: Tue Oct 6 11:11:50 2020 +0300 [coordinator] Store metrics type into the annotation (#2628) * Rename existing protobuf TimeSeries.type field to m3_type to avoid collision * Add new Prometheus protobuf fields * Rename the internal MetricsType to M3MetricsType * Implement conversions * Write metrics type as annotation payload * Avoid reusing annotation slices * Fix test * Introduce metric family type * Address review feedback * Revert "Introduce metric family type" This reverts commit d108b4f. * Introduce annotation.Payload.handle_value_resets field * Minor changes according to PR feedback commit 2df33bf Author: Linas Medžiūnas <[email protected]> Date: Tue Oct 6 09:23:48 2020 +0300 [dbnode] Extended namespace options (#2644) * [dbnode] Extended namespace options * Add ExtendedOptions to integration tests * Fix build * Fix build (OptionsToProto) * go.sum * Imports * Allow updating extended namespace options * Lint * Fixed volumeIndex increment * Revert "Fixed volumeIndex increment" This reverts commit 2e0342b. * Fix TestLocalTypeWithAggregatedNamespace * Added protection against golang/protobuf/jsonpb use * More protection against non-gogo protobuf package * Make namespace tests pass * Reduce reptitiveness of protobuf.Any construction * Extract ExtendedOptions test infra to x package * Add copyright * Move test ExtendedOptionsConverter to the packages where it is needed * xtest.NewExtendedOptionsProto returns error Co-authored-by: Gediminas <[email protected]> commit 84f5b98 Author: teddywahle <[email protected]> Date: Mon Oct 5 14:36:45 2020 -0700 [query] Implemented the Graphite `applyByNode` function (#2654) commit 08117d2 Author: teddywahle <[email protected]> Date: Mon Oct 5 13:28:07 2020 -0700 [query] Fix snapping bug affecting "moving" function (movingMedian, movingAverage, etc.) (#2694) commit db4e9a3 Author: teddywahle <[email protected]> Date: Mon Oct 5 12:49:06 2020 -0700 [query] ParseTime function supports 95%+ of `from / until` time formats (#2621) commit 89f6fcc Author: Gediminas Guoba <[email protected]> Date: Mon Oct 5 21:36:32 2020 +0300 [largetiles] Fixed volumeIndex increment (#2681) * Fixed volumeIndex increment * minor refactoring Co-authored-by: Linas Medžiūnas <[email protected]> Co-authored-by: Rob Skillington <[email protected]> commit 1df3eed Author: Linas Medžiūnas <[email protected]> Date: Mon Oct 5 20:45:50 2020 +0300 [dbnode] Fix granularity of LeaseManager.UpdateOpenLeases (#2695) commit 5bcb1ba Author: Linas Medžiūnas <[email protected]> Date: Fri Oct 2 16:59:22 2020 +0300 [dbnode] Support dynamically registered background processes in mediator (#2634) * [dbnode] Support dynamically registered background processes in mediator * [dbnode] Pass background processes via RunOptions * Add a comment for BackgroundProcess * Add integration test TestRunCustomBackgroundProcess * Include backgroundProcess in TestDatabaseMediatorOpenClose * Move the source of BackgroundProcess asynchrony outside for consistency/easier testing * Fix test * Formatting * Update BackgroundProcess doc * go mod tidy * Verify reporting in TestRunCustomBackgroundProcess * Address review feedback * Resurrect BackgroundProcess.Start() * Enforce mediator.RegisterBackgroundProcess before Open * Remove locking from mediator.Report * Fix comment commit ef9ba0a Author: Rob Skillington <[email protected]> Date: Fri Oct 2 02:39:33 2020 -0400 [changelog] Add changelog for 0.15.16 (#2688) commit 81c3e19 Author: Rob Skillington <[email protected]> Date: Fri Oct 2 01:54:34 2020 -0400 [coordinator] Configurable writes to leaving shards count towards consistency, add read level unstrict all (#2687) commit a700d56 Author: Rob Skillington <[email protected]> Date: Thu Oct 1 23:09:09 2020 -0400 [coordinator] Continue write request when context is cancelled (#2682) commit 526da79 Author: Ryan Allen <[email protected]> Date: Thu Oct 1 15:48:37 2020 -0400 [query] - Include FetchQuery in InspectSeries arguments (#2685) commit 3dedba5 Author: arnikola <[email protected]> Date: Thu Oct 1 15:18:27 2020 -0400 [query] Additional testing on Prometheus engine path (#2686) * [query] BlockMetadata no longer required for Prometheus engine path * Revert code change * Exposes ApplyRangeWarnings commit f996e2d Author: Linas Medžiūnas <[email protected]> Date: Thu Oct 1 21:49:33 2020 +0300 [dbnode] Large tile aggregation improvements (#2668) * [dbnode] Large tile aggregation improvements * Fix mocking * Log block start and volume on error * Infra for reverse index sharing * StorageOptions.AfterNamespaceCreatedFn * Adjust test timing * Do not fail on fs.ErrCheckpointFileNotFound * mockgen * Improve handling of shared reverse index * Improve Shard.AggregateTiles UT coverage * Remove unused reverseIndex from test * Address review feedback (partial) * Rearrange imports commit d5fbe4b Author: Bo Du <[email protected]> Date: Thu Oct 1 10:12:38 2020 -0600 [dbnode] No empty TSDB snapshots (#2666) commit 06cca59 Author: Asaf Mesika <[email protected]> Date: Thu Oct 1 12:18:21 2020 +0300 [docs] Add short description of 2019 monitorama talk (#2515) * Add short description of 2019 monitorama talk * Update docs/overview/media.md Co-authored-by: Chris Chinchilla <[email protected]> Co-authored-by: Chris Chinchilla <[email protected]> Co-authored-by: Chris Chinchilla <[email protected]> commit 68ba5bb Author: teddywahle <[email protected]> Date: Wed Sep 30 12:18:42 2020 -0700 [query] Implemented the Graphite `interpolate` function (#2650) commit 77455be Author: teddywahle <[email protected]> Date: Wed Sep 30 07:41:27 2020 -0700 [query] Implemented the Graphite `grep` function (#2655) commit 43ff200 Author: nate <[email protected]> Date: Tue Sep 29 18:00:14 2020 -0400 [query] Update /database/create endpoint to support creating an aggregated namespace (#2670) commit 41ac33c Author: Ryan Hall <[email protected]> Date: Tue Sep 29 12:37:01 2020 -0700 Update debug/dump cmd with Cluster-Environment-Name header (#2672) This is needed since the dump includes the PlacmentSource commit 503d68d Author: nate <[email protected]> Date: Tue Sep 29 15:10:00 2020 -0400 [docs] Add Fileset Migration docs to Operational Guide nav (#2680) commit ac8259a Author: nate <[email protected]> Date: Tue Sep 29 14:44:18 2020 -0400 [dbnode][query] Add ability to set and retrieve AggregationOptions via namespace APis (#2661) commit 946161b Author: Ryan Hall <[email protected]> Date: Tue Sep 29 10:54:43 2020 -0700 Only acquire writeState exclusive lock when resetting connections (#2678) This is a shared lock across all goroutines, so ideally we limit the amount of sync points. There is no need to acquire the exclusive lock when flushing a connection. Instead the shared read lock can be acquired and individual connection exclusive locks can be acquired to ensure only a single writer is using the connection at a time. This is part of a broader fix to prevent deadlocks in the m3msg communication. Before this change and adding write timeouts, the producer and consumer could deadlock and stop all messages from flowing. If a producer was sending more messages than it was acking, it would eventually fill up the TCP buffers and begin to block writes. If the producer was flushing a connection and holding the exclusive lock, it would end up blocking with lock. Since the lock was held, the ACKing process could not clear the buffer and the flush would block forever. commit 192f611 Author: Ryan Hall <[email protected]> Date: Tue Sep 29 08:19:58 2020 -0700 Fix data race in TestSeekerManagerCacheShardIndices (#2679) commit c717d40 Author: Rob Skillington <[email protected]> Date: Mon Sep 28 23:27:09 2020 -0400 [documentation] Add documentation for disk read bytes per second limit (#2677) commit 3c81774 Author: Ryan Hall <[email protected]> Date: Mon Sep 28 11:25:08 2020 -0700 Add a configurable write timeout for consumer ACKs (#2675) Without a timeout this can potentially block forever. The producer and consumer can potentially dead lock trying to send/receive messages: Producer -> msg -> Consumer. Consumer is not attempting to read since it's trying to ACK. Consumer -> ack -> Producer. Producer is not attempting to read since it might be stuck trying to get a lock on the connection. For backwards compatibility this defaults to 0. We should set it to a sane default (5s) in the future. commit 80c9f9e Author: Ryan Hall <[email protected]> Date: Mon Sep 28 09:39:19 2020 -0700 Add queue processing metrics (#2671) * Add queue processing metrics I found these metrics useful when tracking down an issue with the queue processor. Additionally, added an optional consumer label for the message writer metrics so you can easily tell which downstream consumer is causing issues. Unfortunately, setting the consumer label is a little awkward since it only makes since for the replicated shard writer, which only has a single consumer instance at one time. For the shared shard writer, the empty string is used. commit 66b0b24 Author: teddywahle <[email protected]> Date: Mon Sep 28 03:51:22 2020 -0700 [query] Implemented the Graphite `sortByMinima` function commit a7b499d Author: teddywahle <[email protected]> Date: Sun Sep 27 20:12:44 2020 -0700 [query] Implemented the Graphite `cumulative` function commit 13290be Author: Ryan Hall <[email protected]> Date: Fri Sep 25 16:34:55 2020 -0700 Add metrics for bytes added/removed to the m3msg buffer (#2656) Helpful for understanding how to size the buffer to survive a m3aggregator deploy. Due to the graceful shutdown of a m3agregator, it can take almost a minute between connections closing on the old instances and connections start taking traffic on the new instances. If a coordinator takes 10MB/s of traffic, it only takes 10s to fill up a 100MB buffer and start dropping requests. commit 06d23e2 Author: Matt Schallert <[email protected]> Date: Fri Sep 25 15:06:20 2020 -0700 [deps] grpc v1.27 -> v1.29 (#2667) We're stuck in a tough spot with our gRPC version. We are on a relatively old version, and gRPC frequently introduces breaking changes to APIs that are experimental. Our own query code, and the version of etcd we depend on, use experimental APIs which were later deprecated. We can't upgrade too far as we won't be able to import etcd. However v1.29 is far more compatible with other parts of the ecosystem, and makes it easier to integrate M3 in other environments (as well as integrate newer tools into M3). commit 5b95438 Author: Ryan Hall <[email protected]> Date: Fri Sep 25 13:53:08 2020 -0700 [msg] Move writer README to top level msg README (#2669) commit 38bc187 Author: Rob Skillington <[email protected]> Date: Fri Sep 25 14:12:05 2020 -0400 [query] Add metrics for remote storage backends (#2657) commit 21cceab Author: Rob Skillington <[email protected]> Date: Fri Sep 25 10:42:52 2020 -0400 [dbnode] Add example for using the Go DB node client to query/write metrics (#2658) commit 41b4bc4 Author: Rob Skillington <[email protected]> Date: Fri Sep 25 10:42:00 2020 -0400 [query] Add ability to enable optimized Graphite fanout if all agg namespaces have all the data (#2665) commit aefca00 Author: Linas Medžiūnas <[email protected]> Date: Fri Sep 25 09:09:46 2020 +0300 [dbnode] Large tiles aggregation flow v2 (#2643) * Refactor and cleanup * Refactor interfaces to more closely match design * Update frame iterator read from a encoding.ReaderIterator * Removing unnecessary files * az * Add utility to apply tile calculations on data. * test fix * Added concurrency * Concurrency logging * [dbnode] A noop AggregateTiles thrift RPC * Add AggregateTilesRequest.rangeType * sourceNameSpace / targetNameSpace * Drop AggregateTilesRequest.shardId * A partial implementation of AggregateTiles * Open DataFileSetReader and iterate through it * Decompress the data read * Add explicit FileSetType * [dbnode] Add OrderedByIndex option for DataFileSetReader.Open * Remove dbShard.TagsFromSeriesID * Regenerate mocks * Unit tests * Mockgen * Fix test * Resurrect rpc_mock.go * Remove accidentally committed files * Trigger build * Add step parameter * Write aggregated data to other namespace * Fix tests * Introduced AggregateTilesOptions * Minor improvements * Cleanup * PR response * Add headers * Remove unnecessary stuff. * [dbnode] A noop AggregateTiles thrift RPC * Add AggregateTilesRequest.rangeType * sourceNameSpace / targetNameSpace * Drop AggregateTilesRequest.shardId * A partial implementation of AggregateTiles * Open DataFileSetReader and iterate through it * Decompress the data read * Add explicit FileSetType * Remove dbShard.TagsFromSeriesID * Regenerate mocks * Unit tests * Mockgen * Fix test * Resurrect rpc_mock.go * Remove accidentally committed files * Trigger build * Add step parameter * Write aggregated data to other namespace * Fix tests * Introduced AggregateTilesOptions * Minor improvements * Cleanup * [dbnode] Integrate arrow iterators into tile aggregation * Fix close error after EOF * Can already close the SeriesBlockIterator * Update to use concurrent iteration and prefer single metadata * [dbnode] Cross block series reader * Assert on OrderedByIndex * Tests * Mocks * Dont test just the happy path * Compute and validate block time frames * [dbnode] Integration test for large tiles (#2478) * [dbnode] Create a virtual reverse index for a computed namespace * Return processedBlockCount from AggregateTiles * Improve error handling * Validate AggregateTilesOptions * Unnest read locks * Use default instead of constant * Fix test * minor refactoring * Addressed review feedback * Legal stuff * Refactor recorder * Allow using flat buffers rather than arrow * [dbnode] persist manager for large tiles * revert of .ci * minor * Adding better comparisons for arrow vs flat * Some fixes for query_data_files * An option to read from all shards * Fix large_tiles_test * Fix TestDatabaseAggregateTiles * Read data ordered by index * Generate mocks * Fix TestAggregateTiles * Group Read() results by id * Remodel CrossBlockReader as an Iterator * Mockgen * Erase slice contents before draining them * Resolve merge conflicts * Align with master * Integrate CrossBlockReader * Make a defensive copy of dataFileSetReaders * avoid panics * Improve TestNamespaceAggregateTiles * Added TODO on TestAggregateTiles * Align query_data_files * Mockgen * Added cross block iterator to be able to read multiple BlockRecords. Also removed concurrency from tile iterators and cleaned up utility * Add HandleCounterResets to AggregateTilesOptions * Additional tests and cleanup. * [dbnode] Large Tiles fs.writer experimental implementation * Implement DownsampleCounterResets * Improve readablitiy * Use pointer arguments to get the results * Reduce code duplication * Refine comments * Remove dependency on SeriesBlockFrame * [dbnode] Add OrderedByIndex option for DataFileSetReader.Open (#2465) * [dbnode] Cross-block series reader (#2481) * Fix build * Integrate DownsampleCounterResets * Introduce DownsampledValue struct * Update DownsampleCounterResets integration * Preallocate capacity for downsampledValues * Large tiles parrallel indexing. * Checkpoint fixed * Successful write/fetch with some hardcoded values. * Some FIXME solved * [dbnode] AggregateTiles RPC - minimal E2E flow (#2466) * minor fixes * codegen fix * Address feedback from PR 2477 * TestShardAggregateTiles using 2 readers * Fix large_tiles_test.go * integration test fix * Bug fix and test * [large tiles] Fix refcounting in CrossBlockReader * Workaround for negative reference count * Integration test fix * [large-tiles] Try detect double finalize * [dbnode] Large tiles concurrency test * Batch writer is used for waster writes * race fix * Fix compilation error * Comment out some noise * Fix data races on time field (bootstrapManager, flushManager) * Fix misplaced wd.Add * Close context used for aggregation (in test code) * Close encoders during large tile writes * removing some debug code * Close series in test code * Move series.Close() after err check (test code) * Update to series frame API * Additional tests * PR * work on integ test * tags close * [large-tiles] Fix management of pooled objects * Fix query_data_files tool * Use mock checked bytes in cross_block_reader_test.go * query test * Query in place of fetch * test fix * Bug reproduced * Heavier concurrency test * Fix session related races in large_tiles_test.go * Fix string conversion * Use mocks for pooled objects in TestShardAggregateTiles * Add build integration tag * test fix * minor refactoring * increased the amount of series to 5k * Remove a noop * Log the finish of namespace aggregation * some minor refactorings * Streaming aggregation, reusing resources for each series * Do not allocate minHeapEntries * Cleanup * Fix query_data_files * Fix build * go.sum * Align with StreamingWriter API changes * Add FIXME WRT segment.Tail finalizing * Exclude query_data_files * Exclude counter_resets_downsampler.go * Remove arrow related code * Cleanup * Use explicit EncodedTags type * Rename processedBlockCount to processedTileCount * Fix build * Exclude read_index_ids changes * Address review feedback * Increase fetch timeout to stabilize TestAggregationAndQueryingAtHighConcurrency * Abort the writer in case of any error during aggregation Co-authored-by: Artem <[email protected]> Co-authored-by: Gediminas <[email protected]> commit ac530fa Author: Matt Schallert <[email protected]> Date: Thu Sep 24 14:07:46 2020 -0700 [deps] Bump tchannel v1.12 -> v1.14 (#2659) commit 62a31fc Author: Rob Skillington <[email protected]> Date: Thu Sep 24 15:06:15 2020 -0400 [dbnode] Better defaults for block retriever config (#2664) commit 767a517 Author: nate <[email protected]> Date: Thu Sep 24 10:11:56 2020 -0400 [dbnode] Load and cache info files during bootstrap (#2598) commit 4d81e4a Author: teddywahle <[email protected]> Date: Wed Sep 23 12:43:21 2020 -0700 [query] Implemented the Graphite `highest` and `lowest` functions (#2623) commit b4575f6 Author: arnikola <[email protected]> Date: Wed Sep 23 10:07:11 2020 -0400 [query] Add resolution exceeds query range warning (#2429) Signed-off-by: ChrisChinchilla <[email protected]>

[coordinator] Configurable writes to leaving shards count towards con…

c586429

…sistency, add read level unstrict all

rallen090 reviewed Oct 2, 2020

View reviewed changes

src/dbnode/client/config.go Outdated Show resolved Hide resolved

arnikola approved these changes Oct 2, 2020

View reviewed changes

src/dbnode/client/config.go Outdated Show resolved Hide resolved

src/dbnode/client/types.go Show resolved Hide resolved

rallen090 approved these changes Oct 2, 2020

View reviewed changes

robskillington and others added 5 commits October 1, 2020 23:23

Add tests for new consistency levels

4163bcc

Add test for shards leaving count towards consistency

7c2efac

Merge branch 'master' into r/configurable-write-shards-leaving-count-…

c58f14a

…consistency-read-unstrict-all

Fix capacity

7c2d25e

Fix tests

fff0d26

robskillington merged commit 81c3e19 into master Oct 2, 2020

robskillington deleted the r/configurable-write-shards-leaving-count-consistency-read-unstrict-all branch October 2, 2020 05:54

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[coordinator] Configurable writes to leaving shards count towards consistency, add read level unstrict all #2687

[coordinator] Configurable writes to leaving shards count towards consistency, add read level unstrict all #2687

robskillington commented Oct 2, 2020 •

edited

Loading

arnikola left a comment

rallen090 left a comment

[coordinator] Configurable writes to leaving shards count towards consistency, add read level unstrict all #2687

[coordinator] Configurable writes to leaving shards count towards consistency, add read level unstrict all #2687

Conversation

robskillington commented Oct 2, 2020 • edited Loading

arnikola left a comment

Choose a reason for hiding this comment

rallen090 left a comment

Choose a reason for hiding this comment

robskillington commented Oct 2, 2020 •

edited

Loading