
[CCR] Re-evaluate shard follow parameter defaults #31717

Closed
martijnvg opened this issue Jul 2, 2018 · 8 comments
Labels
:Distributed Indexing/CCR Issues around the Cross Cluster State Replication features

Comments

@martijnvg
Member

Re-evaluate the defaults of the shard follow parameters in ShardFollowNodeTask and ShardFollowTask.

@martijnvg martijnvg added the :Distributed Indexing/CCR Issues around the Cross Cluster State Replication features label Jul 2, 2018
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@dliappis
Contributor

dliappis commented Sep 28, 2018

TL;DR

It's hard to come up with general defaults that are optimal for all cases, especially in multi-node environments with different (primary+replica) shard settings.

This comment shows two categories of defaults that work well in different scenarios:

  • 5P/1R shard settings, 3-node clusters, network storage, AWS+GCP
  • 1P/1R shard settings, 3-node clusters, SSD storage, AWS+GCP

The next comment will include optimal defaults for:

  • 3P/1R shard settings, 3-node clusters, SSD storage, AWS+GCP

The final defaults will be based on the results from all the shard setting scenarios; however, the detailed results per category are still useful for documentation and tuning guidance.

All benchmarks were done with the same commit from master (73417bf), which doesn't include the final append-only optimizations in #34099, unmerged as of this writing.

5P/1R, Network SSD Storage, 3-node cluster

"max_concurrent_read_batches": 5,
"max_concurrent_write_batches": 5,
"max_batch_operation_count": 32768,
"max_batch_size_in_bytes": 4718592,
"max_write_buffer_size": 164000

NOTE: While these defaults have been benchmarked against a large number of scenarios, they are still static parameters. Elasticsearch will not auto-adjust them based on performance during CCR activity, and performance also depends on the number of shards. Therefore they will likely still need tuning depending on environmental peculiarities.
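
For illustration only, here is a minimal sketch of how these values could be supplied when creating a follower index with the CCR follow API. The index names and remote cluster alias (follower-index, leader-index, leader) are placeholders, the parameter names are the pre-rename ones used throughout this issue, and the endpoint/field names follow current documentation, so the exact shape may have differed slightly in 6.x:

PUT /follower-index/_ccr/follow
{
  "remote_cluster": "leader",
  "leader_index": "leader-index",
  "max_concurrent_read_batches": 5,
  "max_concurrent_write_batches": 5,
  "max_batch_operation_count": 32768,
  "max_batch_size_in_bytes": 4718592,
  "max_write_buffer_size": 164000
}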

Benchmarks Executed

All benchmarks were executed in the cloud (AWS and GCP).

GCP environment

1 cluster, 3 leader nodes in europe-west, each node in a different AZ (europe-west1-b, europe-west1-c, europe-west1-d).

1 cluster, 3 follower nodes in us-central, each node in a different AZ (us-central1-a, us-central1-b, us-central1-c).

1 load gen on europe-west (europe-west1-b).

Each node root disk size: 500GB Persistent Disk SSD.

Each instance is: n1-highcpu-16.

Minimum allowable processor, Skylake (https://cloud.google.com/compute/docs/machine-types).

14GiB RAM (7GiB heap)

AWS environment

1 cluster, 3 leader nodes in eu-central-1, each node in a different AZ (eu-central-1a, eu-central-1b, eu-central-1c).

1 cluster, 3 follower nodes in us-east-2, each node in a different AZ (us-east-2a, us-east-2b, us-east-2c).

Load gen node: (eu-central-1a).

Each node root disk size: 500GB EBS SSD.

Each instance is: c5.2xlarge.

15GiB RAM (7GiB heap).

Rally Details

(see https://github.com/elastic/rally-tracks, https://github.com/elastic/rally-eventdata-track)

Tracks used:

  • geopoint
  • pmc
  • http_logs
  • eventdata

Type of load: indexing only
Clients used (see schedule): 4

Results with recommended settings:

GCP:

| Track | # Indices | Docs/index | store.size | pri.store.size | Total Duration | Time to catch up after indexing |
|---|---|---|---|---|---|---|
| geopoint | 1 | 60844404 | 6.7GB | 3.3GB | 0:08:45 | 0:00:00.289002 |
| pmc | 1 | 574199 | 48.3GB | 24.7GB | 0:11:15 | 0:00:00.470718 |
| http_logs | 3 | 12406628/30700742/193424966 | 2/5/36.1GB | 1/2.5/18.1GB | 0:43:29 | 0:00:00.948574 |

AWS:

| Track | # Indices | Docs/index | store.size | pri.store.size | Total Duration | Time to catch up after indexing |
|---|---|---|---|---|---|---|
| geopoint | 1 | 60844404 | 6.7GB | 3.3GB | 0:09:53 | 0:00:00.209117 |
| pmc | 1 | 574199 | 41GB | 20.5GB | 0:35:46 | 0:00:10.334900 |
| http_logs | 3 | 12406628/30700742/193424966 | 2/5/33.1GB | 1/2.5/16.6GB | 0:41:17 | 0:01:41.313000 |
| eventdata | 1 | 1000040000 | 444.4GB | 223.2GB | 13:29:16 | 0:00:15.781900 |
Other details

Total benchmarks executed:

GCP: 43 (gradually increasing max_concurrent_batch_read/write values and max_batch_size_in_bytes)
AWS: 7
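
As a side note on how the "time to catch up" numbers above can be observed: a hedged sketch is to poll the CCR stats endpoint and compare the per-shard leader and follower global checkpoints in the response. The endpoint exists in this era, but the exact response field names are taken from current versions and are an assumption for 6.x:

GET /_ccr/stats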

1P/1R, Local SSD Storage, 3-node cluster

"max_concurrent_read_batches": 9,
"max_concurrent_write_batches": 9,
"max_batch_operation_count": 163840,
"max_batch_size_in_bytes": 4718592,
"max_write_buffer_size": 2942700

Notes

This setup exerts high pressure primarily on I/O and secondarily on CPU, as essentially only two out of three nodes deal with the load.
The setup used so far, with EBS/Persistent Disk storage, proved to be a bottleneck; the same settings, or settings with higher concurrency and larger buffers, didn't improve replication performance (see details).

Trying locally attached SSD disks shifted the bottleneck to the CPU (see details).

After finding the right instances in terms of CPU and I/O performance, it was possible to tune the settings for optimal performance.

Benchmarks Executed

Final AWS environment

1 cluster, 3 leader nodes in eu-central-1, each node in a different AZ (eu-central-1a, eu-central-1b, eu-central-1c).

1 cluster, 3 follower nodes in us-east-2, each node in a different AZ (us-east-2a, us-east-2b, us-east-2c).

Load gen node: (eu-central-1a).

Each node root disk size: 100GB (EBS), but the Elasticsearch data dir and Rally's ~/.rally dir use the instance's local /dev/nvme1n1 SSD.

Each instance is: m5d.2xlarge.

30GiB RAM (15GiB heap)

(Also tried i3.xlarge, which didn't have enough CPU capacity to deal with http_logs benchmarks at max indexing performance).

GCP environment

1 cluster, 3 leader nodes in europe-west, each node in a different AZ (europe-west4-a, europe-west4-b, europe-west4-c); we had to switch from europe-west1-* due to lack of resources.

1 cluster, 3 follower nodes in us-central, each node in a different AZ (us-central1-a, us-central1-b, us-central1-c).

1 load gen on europe-west (europe-west4-a).

Each node root disk size: 500GB Persistent Disk SSD, but .rally/ and es/ reside on a locally attached SSD disk.

Each instance is: n1-highcpu-16.

Minimum allowable processor, Skylake (https://cloud.google.com/compute/docs/machine-types).

14GiB RAM (7GiB heap).

Rally Details

(see https://github.com/elastic/rally-tracks, https://github.com/elastic/rally-eventdata-track)

Indexing was throttled to 45,000 docs/s.

Tracks used:

  • geopoint
  • pmc
  • http_logs
  • eventdata

Type of load: indexing only
Clients used (see schedule): 4

Results with recommended settings:

GCP:

| Track | # Indices | Docs/index | store.size | pri.store.size | Total Duration | Time to catch up after indexing |
|---|---|---|---|---|---|---|
| geopoint | 1 | 60844404 | 9.6GB | 4.2GB | 0:37:53 | 0:00:00.257276 |
| pmc | 1 | 574199 | 40.7GB | 20.3GB | 0:54:25 | 0:00:01.204230 |
| http_logs | 3 | 12406628/30700742/193424966 | 2/4.9/31.9GB | 1/2.4/16.1GB | 2:35:01 | 0:00:01.885290 |

AWS:

| Track | # Indices | Docs/index | store.size | pri.store.size | Total Duration | Time to catch up after indexing |
|---|---|---|---|---|---|---|
| geopoint | 1 | 60844404 | 8.1GB | 4GB | 0:23:45 | 0:00:00.213884 |
| pmc | 1 | 574199 | 43.1GB | 22.5GB | 0:23:29 | 0:00:00.951584 |
| http_logs | 3 | 12406628/30700742/193424966 | 2/5/33.1GB | 1/2.5/16.6GB | 1:29:10 | 0:00:18.619700 |

@dliappis
Contributor

dliappis commented Oct 3, 2018

3P/1R, Local SSD Storage, 3-node cluster

"max_concurrent_read_batches": 5,
"max_concurrent_write_batches": 5,
"max_batch_operation_count": 32768,
"max_batch_size_in_bytes": 4718592,
"max_write_buffer_size": 164000

Notes

As expected, the 3P/1R shard configuration puts less pressure on CPU and I/O than the 1P/1R scenario.

All benchmarks were executed using local SSD drives; http_logs in particular was throttled to 85% of max indexing performance, as otherwise normalized CPU consumption on AWS ranged between 95-98% (the AWS instance has fewer CPUs).

AWS environment

1 cluster, 3 leader nodes in eu-central-1, each node in a different AZ (eu-central-1a, eu-central-1b, eu-central-1c).

1 cluster, 3 follower nodes in us-east-2, each node in a different AZ (us-east-2a, us-east-2b, us-east-2c).

Load gen node: (eu-central-1a).

Each node root disk size: 100GB (EBS), but the Elasticsearch data dir and Rally's ~/.rally dir use the instance's local /dev/nvme1n1 SSD.

Each instance is: m5d.2xlarge.

30GiB RAM (15GiB heap)

(Also tried i3.xlarge, which didn't have enough CPU capacity to deal with http_logs benchmarks at max indexing performance).

GCP environment

1 cluster, 3 leader nodes in europe-west, each node in a different AZ (europe-west4-a, europe-west4-b, europe-west4-c).

1 cluster, 3 follower nodes in us-central, each node in a different AZ (us-central1-a, us-central1-b, us-central1-c).

1 load gen on europe-west (europe-west4-a).

Each node root disk size: 500GB Persistent Disk SSD, but .rally/ and es/ reside on a locally attached SSD disk.

Each instance is: n1-highcpu-16.

Minimum allowable processor, Skylake (https://cloud.google.com/compute/docs/machine-types).

14GiB RAM (7GiB heap).

Rally Details

(see https://github.com/elastic/rally-tracks, https://github.com/elastic/rally-eventdata-track)

Tracks used:

  • geopoint
  • pmc
  • http_logs -> max indexing throughput
  • http_logs -> indexing throughput throttled to 85% of max performance (55000docs/s on AWS, 85000docs/s on GCP)
  • eventdata -> max indexing throughput; the instances' CPU, especially on the follower, was pegged at 98%
  • eventdata -> indexing throughput throttled to 85% of max performance

Type of load: indexing only
Clients used (see schedule): 4

Results with recommended settings:

GCP:

| Track | # Indices | Docs/index | store.size | pri.store.size | Total Duration | Time to catch up after indexing |
|---|---|---|---|---|---|---|
| geopoint | 1 | 60844404 | 7.2GB | 3.5GB | 0:13:26 | 0:00:00.265027 |
| pmc | 1 | 574199 | 70.4GB | 34.7GB | 0:13:06 | 0:00:00.477828 |
| http_logs (max indexing throughput) | 3 | 12406628/30700742/193424966 | 2/4.9/31.8GB | 1/2.4/15.9GB | 1:01:33 | 0:00:00.778742 |
| http_logs (throttled 85% of max indexing) | 3 | 12406628/30700742/193424966 | 2/5/32.1GB | 1/2.5/16.1GB | 1:13:01 | 0:00:00.779165 |
| eventdata (max indexing throughput) | 3 | 1000040000 | 427.7GB | 214GB | 21:37:14 | 4:48:11.400000 |
| eventdata (throttled 85% of max indexing) | 3 | 1000040000 | 428.1GB | 214.1GB | 22:14:09 | 0:00:00.532950 |

AWS:

| Track | # Indices | Docs/index | store.size | pri.store.size | Total Duration | Time to catch up after indexing |
|---|---|---|---|---|---|---|
| geopoint | 1 | 60844404 | 7.3GB | 3.6GB | 0:08:39 | 0:00:27.475100 |
| pmc | 1 | 574199 | 61.1GB | 30.3GB | 0:12:55 | 0:00:00.269416 |
| http_logs (max indexing throughput) | 3 | 12406628/30700742/193424966 | 2/4.9/31.1GB | 1/2.4/15.5GB | 0:45:06 | 0:05:00.847000 |
| http_logs (throttled 85% of max indexing) | 3 | 12406628/30700742/193424966 | 2/4.9/31.1GB | 1/2.4/15.5GB | 0:48:30 | 0:01:00.925000 |
| eventdata (max indexing throughput) | 3 | 1000040000 | 424.1GB | 212GB | 18:21:09 | 5:23:26.500000 |
| eventdata (throttled 85% of max indexing) | 3 | 1000040000 | 428.2GB | 214GB | 15:27:13 | 0:00:01.125280 |

@dliappis
Contributor

dliappis commented Oct 3, 2018

Next steps

With the append-only follower optimizations merged, the next step is to rerun the above benchmarks (at least those that demonstrated high resource usage, especially CPU) and evaluate whether the common defaults used for the 5P/1R and 3P/1R scenarios show lower resource usage and whether they are good enough even for 1P/1R.

For reference here are some visualizations for CPU usage:

AWS http_logs 3P/1R unthrottled: CPU usage (chart omitted)

AWS http_logs 3P/1R throttled to 85% of max indexing throughput: CPU usage and progress of follower vs leader by sequence numbers (charts omitted)

GCP http_logs 3P/1R unthrottled: CPU usage (chart omitted)

GCP http_logs 3P/1R throttled to 85% of max indexing throughput: CPU usage (chart omitted)

AWS http_logs 1P/1R throttled to 85% of max indexing throughput: CPU usage (chart omitted)

GCP http_logs 1P/1R throttled to 85% of max indexing throughput: CPU usage (chart omitted)

@bleskes
Contributor

bleskes commented Oct 23, 2018

@dliappis and I did some more research and here is a summary of our discoveries:

  1. The cross-region network becomes more efficient when multiple connections are used concurrently. When we ran with one reader, the network added roughly 700ms on average to each request. With 2 concurrent readers it dropped to 400ms, and with 4 readers it dropped even further. We saw this behavior on both GCP and AWS.
  2. The system worked nicely (i.e., the following index kept up) when indexing into a single-shard leader index on a 16-core instance with 8 concurrent clients. The indexing rate was roughly 100K http log lines per second. The following index was set up on an identical machine, and we used up to 8 concurrent read requests and 8 concurrent write requests. Increasing the budget to 16r/16w didn't help performance, and the read/write budget wasn't fully utilized. All of these experiments used a batch size of 5K ops, a 5MB max batch size, and an unlimited write buffer.

A few observations we made while doing the experiments:

  1. Readers have more work than writers: readers need to unpack Lucene stored-field blocks and extract the source. For small documents this is more work than a writer needs to do, as indexing involves parsing the documents and putting them into an in-memory buffer.
  2. Both the read and write settings are maximums - concurrent requests are allocated only if needed.
  3. By default we use 6 channels to connect to a remote cluster. This value has been inherited from CCS.
  4. On a busy cluster, CCR caused a ~10% reduction in indexing speed on the leader. On a non-saturated machine (same machine as above but with 4 concurrent writes), CCR caused a ~5% slowdown. We are still researching why.

W.r.t. defaults, we have discussed multiple options, including being smart and using the number of processors on the machines to figure out our defaults. That said, I feel we don't have a good enough grip on all the possible use cases/machines, and it would take considerable effort to get it, with an unclear chance of actually arriving at a clear and simple picture. I'd prefer going with simple numbers that are easy to communicate and change. Once we gather more input we can adapt them.

With that in mind, I suggest going with the following defaults:

  1. Keep the currently hard-coded 6 channels per remote cluster. I think we should consider making it configurable (see the remote-cluster configuration sketch below).
  2. Default to a maximum of 12 concurrent read requests (double the channel count). This is a conservative setting that leaves some room for faster networks (local LANs), where the ratio between disk reads and network is different.
  3. Default to a maximum of 8 concurrent write requests. This is less than the number of readers, based on the expected ratio of work.
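
(For context on item 1, a minimal sketch of how the remote cluster that these channels connect to is registered; the alias "leader" and the seed host are placeholders, and the cluster.remote.* settings namespace is an assumption for the version used here:)

PUT /_cluster/settings
{
  "persistent": {
    "cluster.remote.leader.seeds": ["leader-node-1:9300"]
  }
}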

@dliappis did I forget anything?

bleskes added a commit to bleskes/elasticsearch that referenced this issue Oct 24, 2018
Per elastic#31717 this commit changes the defaults to the following:

Batch size of 5120 ops.
Maximum of 12 concurrent read requests.
Maximum of 9 concurrent write requests.
bleskes added a commit that referenced this issue Oct 24, 2018
Per #31717 this commit changes the defaults to the following:

Batch size of 5120 ops.
Maximum of 12 concurrent read requests.
Maximum of 9 concurrent write requests.

This is not necessarily our final values but it's good to have these as defaults for the purposes of initial testing.
bleskes added a commit that referenced this issue Oct 24, 2018
Per #31717 this commit changes the defaults to the following:

Batch size of 5120 ops.
Maximum of 12 concurrent read requests.
Maximum of 9 concurrent write requests.

This is not necessarily our final values but it's good to have these as defaults for the purposes of initial testing.
@dliappis
Contributor

@bleskes Thank you for the concise summation.

Maybe worth mentioning that all experiments were done on instances using locally attached SSDs (i.e. not AWS's EBS / GCP's persistent disks).

What initial defaults should we use for max_batch_size? So far we've been using 5MB but the current default is Long.MAX_VALUE, so we can also choose to leave it unchanged and rely only on the max_batch_operation_count == 5120 that you recently introduced in your PR.

I'd also have added the need to clarify the default for max_write_buffer_count, but this is now being worked on in #34797 (and there's a new parameter, max_write_buffer_bytes).

kcm pushed a commit that referenced this issue Oct 30, 2018
Per #31717 this commit changes the defaults to the following:

Batch size of 5120 ops.
Maximum of 12 concurrent read requests.
Maximum of 9 concurrent write requests.

This is not necessarily our final values but it's good to have these as defaults for the purposes of initial testing.
@dliappis
Contributor

Here's a summary of what has happened in the meantime.

We devised a testing plan comprising 3 stages:

  • Stage 0:

    • Verify the impact of x-pack-security
    • Verify the impact of non-append workloads
  • Stage 1:

    • Validate that the max_ defaults allow the follower to always catch up, using 3 workloads (geopoint -> very small docs, http_logs -> medium doc size with a large corpus, pmc -> very large docs) on both AWS and GCP cloud environments.
    • Compare CCR overhead vs "no CCR and soft_deletes: false" as well as vs "no CCR with soft_deletes: true".
  • Stage 2:

    • Examine the ability to catch up, system overhead, and stability with a larger workload (1bn docs, continuous execution over 3+ days) on both AWS and GCP cloud environments.
    • Compare CCR overhead.

Observations/Actions:

  1. We observed long GC pauses with workloads containing large documents (the pmc track) and changed max_read_request_size to 32MB (from Long.MAX_VALUE, i.e. practically unlimited).

  2. We observed a penalty in indexing throughput when CCR deals with conflicts (25% conflicts of the total size, favoring the most recent 25% of docs), as opposed to the same workload without CCR and soft_deletes. An initial calculation on a smaller 1-node environment on GCP showed this penalty to be ~12.5%; however, more recent experiments show it to be ~6% on a larger 3-node environment on AWS.

    Additionally, we observed a ~19% penalty in merges_total_time and a ~108% penalty in refresh_total_time. A PR has been raised to improve both metrics via a configurable soft_deletes.reclaim.delay parameter, which defaults to 1 minute.

  3. Stage 1 (3-node) benchmarks across all tracks and cloud vendor combinations showed (results linked here and here) that followers were always able to catch up (<1s to catch up once indexing was over).

  4. Stage 1 benchmarks demonstrated a ~2.37-5% penalty in median indexing throughput when CCR is enabled (compared to indexing without CCR and without soft_deletes) on AWS. On GCP we have gotten more inconsistent results, sometimes exhibiting the same 4-5% penalty but sometimes showing no penalty (or even higher performance), and it isn't clear why. The assumption at this point is that it's due to environmental factors, e.g. noisy neighbours, but this requires further investigation with randomized benchmarks at different hours. (See also bullet 6 below for a comparison of CCR overhead using a larger track over a longer period of time.)

  5. The 3-day benchmarks (Stage 2), using a throttled indexing throughput of 3800 docs/s, finished successfully, with followers always able to catch up and stable indexing throughput on both AWS and GCP.

  6. The 1bn-doc workload at maximum indexing throughput (16 clients) also finished successfully, and the following cluster was able to catch up. A comparison with the same workload run without CCR and with soft_deletes disabled showed that CCR has a ~6% penalty on AWS and a ~4.3% penalty on GCP in median indexing throughput.

  7. Finally, as CCR development was progressing throughout the execution of the above benchmarks, we re-executed the 1bn-doc track a few days before the 6.5.0 release to ensure there was no regression in indexing throughput, the follower's ability to catch up, or system resource utilization, and we didn't spot any issues.

@bleskes Did I miss anything?

NOTE: Some of the Kibana links in the gist point to a private Cloud cluster.

@dnhatn
Member

dnhatn commented Mar 8, 2019

I am closing since we have good default parameters now. Thanks @dliappis for the awesome work.

@dnhatn dnhatn closed this as completed Mar 8, 2019