
Optimize kinesis ingestion task assignment after resharding #12235

Conversation

@AmatyaAvadhanula (Contributor) commented Feb 7, 2022

Description

  • When a Kinesis stream is resharded, the original shards are closed. The process may also create a large number of intermediate shards, which are eventually closed as well.
  • If a shard is closed before any records are ever put into it, it is ideal to ignore it during ingestion: skipping such shards avoids wasting task resources on unnecessary allocations.
  • Although shards are fetched from Kinesis frequently, both open and closed shards are returned, and it is expensive to determine whether a closed shard was ever written to, since that requires polling each such shard for its records.
  • Skipping these "bad" shards when counting partitions also helps, since the autoscaler may then allocate fewer task slots.
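The expensive "was this closed shard ever written to?" check mentioned above can be sketched as follows. This is a minimal illustration, not the actual KinesisRecordSupplier code: the `ShardReader` interface and its `readFromStart` method are hypothetical stand-ins for the AWS Kinesis client calls (obtain a TRIM_HORIZON shard iterator, then fetch records from it).

```java
import java.util.Collections;
import java.util.List;

class ClosedShardCheck
{
    /**
     * Hypothetical stand-in for the AWS Kinesis client: returns the records
     * readable from the start (TRIM_HORIZON) of the given shard.
     */
    interface ShardReader
    {
        List<byte[]> readFromStart(String shardId);
    }

    /** A closed shard is "ignorable" iff polling it from the start yields no records. */
    static boolean isClosedShardEmpty(ShardReader reader, String shardId)
    {
        return reader.readFromStart(shardId).isEmpty();
    }

    public static void main(String[] args)
    {
        // Fake reader: shard 1 was written to before closing, shard 2 never was
        ShardReader fake = shardId ->
            shardId.equals("shardId-000000000001")
            ? List.of("row".getBytes())
            : Collections.emptyList();

        System.out.println(isClosedShardEmpty(fake, "shardId-000000000001")); // false
        System.out.println(isClosedShardEmpty(fake, "shardId-000000000002")); // true
    }
}
```

Against the real stream this poll is a network round trip per shard, which is exactly why the PR caches its result.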

KinesisRecordSupplier is used to get the list of all shards during Kinesis ingestion.
Additional methods have been added to it to determine which shards are closed and empty.
Repeated calls to Kinesis for shards' records are avoided by maintaining an in-memory cache in the supervisor.

Proposed Design:

  • An in-memory cache is implemented in KinesisSupervisor to track closed shards (empty and non-empty, separately) and avoid redundant expensive calls.
  • When the flag skipIgnorableShards is set to true, the cache is utilized and updated. "Ignorable" shards then no longer participate in task allocation for ingestion or in autoscaler estimations.
  • When a closed shard is discovered for the first time, it is polled once and added to the cache so that no further expensive calls are made for it.
  • The set of active closed shards is computed on each update to eliminate stale cache entries corresponding to expired shards.
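The cache behavior described in the bullets above can be sketched roughly as follows. This is a simplified illustration with hypothetical names, not the code in KinesisSupervisor; the `isShardEmpty` predicate stands in for the expensive per-shard poll against Kinesis.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.Predicate;

class ClosedShardCache
{
    // Closed shards that have been polled once, split by whether they ever held records
    private final Set<String> emptyClosedShardIds = new TreeSet<>();
    private final Set<String> nonEmptyClosedShardIds = new TreeSet<>();

    /**
     * Refreshes the cache from the latest set of active closed shards.
     * Synchronized because, as discussed in the review below is not implied here,
     * multiple supervisor code paths may reach this concurrently.
     *
     * @param activeClosedShardIds closed shards currently reported for the stream
     * @param isShardEmpty         the (expensive) per-shard poll, invoked at most
     *                             once per newly discovered closed shard
     */
    synchronized void updateClosedShardCache(
        Set<String> activeClosedShardIds,
        Predicate<String> isShardEmpty
    )
    {
        // Eliminate stale entries for shards that have since expired
        emptyClosedShardIds.retainAll(activeClosedShardIds);
        nonEmptyClosedShardIds.retainAll(activeClosedShardIds);

        // Poll each newly discovered closed shard exactly once
        for (String shardId : activeClosedShardIds) {
            if (!emptyClosedShardIds.contains(shardId) && !nonEmptyClosedShardIds.contains(shardId)) {
                if (isShardEmpty.test(shardId)) {
                    emptyClosedShardIds.add(shardId);
                } else {
                    nonEmptyClosedShardIds.add(shardId);
                }
            }
        }
    }

    /** Shards that can be skipped for task allocation and autoscaler counts. */
    synchronized Set<String> getIgnorableShardIds()
    {
        return new HashSet<>(emptyClosedShardIds);
    }

    public static void main(String[] args)
    {
        ClosedShardCache cache = new ClosedShardCache();
        cache.updateClosedShardCache(Set.of("shard-0", "shard-1"), id -> id.equals("shard-0"));
        System.out.println(cache.getIgnorableShardIds()); // [shard-0]
    }
}
```

Note that a shard already present in either set is never polled again, and `retainAll` prunes entries for expired shards on every refresh.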

Limitations:

  • All closed shards (not just the empty ones) must be polled once, which adds overhead.
  • Since the metadata is not updated for these shards, every supervisor restart incurs the above overhead again.

Alternative design:

Update the metadata with the end offsets of closed and empty shards. This may be simpler to implement since it doesn't require a cache, but it would waste resources, since a task would still have to run just to update the metadata.

Key changed/added classes in this PR
  • KinesisSupervisor

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@AmatyaAvadhanula AmatyaAvadhanula marked this pull request as draft February 7, 2022 05:10
@AmatyaAvadhanula AmatyaAvadhanula marked this pull request as ready for review February 8, 2022 09:13
@kfaraz (Contributor) reviewed:

Overall LGTM. Minor comments.

@kfaraz (Contributor) reviewed:

Approach looks good to me. Need some minor changes.

@@ -88,6 +91,11 @@
private final AWSCredentialsConfig awsCredentialsConfig;
private volatile Map<String, Long> currentPartitionTimeLag;

// Maintain sets of currently closed shards to find "bad" (closed and empty) shards
// Poll closed shards once and store the result to avoid redundant costly calls to kinesis
private final Set<String> emptyClosedShardIds = new TreeSet<>();
@zachjsh (Contributor) commented:

Should these be in a thread-safe container, or should access be protected with a lock?

@kfaraz commented Feb 17, 2022:
Thanks for pointing this out, @zachjsh.
This code would be executed by the SeekableStreamSupervisor while executing a RunNotice (scheduled when status of a task changes) as well as a DynamicAllocationTasksNotice (scheduled for auto-scaling). There is a possibility of contention between these two executions.

We can make the part where the caches are updated synchronized.
Just changing these two caches to a Concurrent version might not be enough as a whole new list of active shards is fetched in updateClosedShardCache() and the caches must be updated with this new state before any other action is performed.

cc: @AmatyaAvadhanula

A contributor replied:
Synchronizing the whole method updateClosedShardCache would actually be preferable because the state returned by two subsequent calls to recordSupplier.getShards() can be different.
So this call should happen inside the synchronized block, as should the calls to recordSupplier.isClosedShardEmpty().

I hope this doesn't cause bottlenecks though.

@AmatyaAvadhanula (Author) replied:
The whole method has been synchronized. Thanks!

@@ -2291,9 +2291,30 @@ protected boolean supportsPartitionExpiration()
return false;
}

protected boolean shouldSkipIgnorablePartitions()
A contributor asked:
Should we not do something similar for Kafka? Why is this not an issue with Kafka?

A contributor replied:
We haven't encountered something similar in Kafka yet. But the API has been put in place so that if KafkaSupervisor needs to do something similar, it can override the methods shouldSkipIgnorablePartitions() and getIgnorablePartitionIds().

For Kinesis, the ignorable partitions translate to empty and closed shards, which is a concept specific to Kinesis.
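The extension point described in this reply can be illustrated with a minimal sketch. The class names and the `countUsablePartitions` helper are hypothetical simplifications; in Druid, the hooks live on SeekableStreamSupervisor, and only KinesisSupervisor currently overrides them.

```java
import java.util.Collections;
import java.util.Set;

abstract class StreamSupervisorSketch
{
    // Defaults: no partitions are ever skipped (the Kafka behavior today)
    protected boolean shouldSkipIgnorablePartitions()
    {
        return false;
    }

    protected Set<String> computeIgnorablePartitionIds()
    {
        return Collections.emptySet();
    }

    /** Number of partitions that participate in task allocation and autoscaler counts. */
    final int countUsablePartitions(Set<String> allPartitionIds)
    {
        if (shouldSkipIgnorablePartitions()) {
            Set<String> ignorable = computeIgnorablePartitionIds();
            return (int) allPartitionIds.stream().filter(id -> !ignorable.contains(id)).count();
        }
        return allPartitionIds.size();
    }

    public static void main(String[] args)
    {
        StreamSupervisorSketch kinesis = new KinesisSupervisorSketch(Set.of("closedEmptyShard"));
        System.out.println(kinesis.countUsablePartitions(Set.of("openShard", "closedEmptyShard"))); // 1
    }
}

/** Kinesis-style subclass: skips shards that are both closed and empty. */
class KinesisSupervisorSketch extends StreamSupervisorSketch
{
    private final Set<String> emptyClosedShardIds;

    KinesisSupervisorSketch(Set<String> emptyClosedShardIds)
    {
        this.emptyClosedShardIds = emptyClosedShardIds;
    }

    @Override
    protected boolean shouldSkipIgnorablePartitions()
    {
        return true;
    }

    @Override
    protected Set<String> computeIgnorablePartitionIds()
    {
        return emptyClosedShardIds;
    }
}
```

A Kafka-style supervisor would simply inherit the defaults, which is why no Kafka changes were needed in this PR.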

@kfaraz (Contributor) approved:
Thanks for the changes, @AmatyaAvadhanula . LGTM.

* @return set of shards ignorable by kinesis ingestion
*/
@Override
protected Set<String> getIgnorablePartitionIds()
A contributor commented:
nit - this should be called computeIgnorablePartitionIds() or loadIgnorablePartitionIds()

Follow-up comment:
The current verb indicates that it is just a getter, but behind the scenes it can make network calls etc. to fetch the partition ids.

@AmatyaAvadhanula (Author) replied:
done

@zachjsh (Contributor) approved:
Thanks for making changes. LGTM

@kfaraz kfaraz merged commit 1ec57cb into apache:master Feb 18, 2022
@abhishekagarwal87 abhishekagarwal87 added this to the 0.23.0 milestone May 11, 2022
AmatyaAvadhanula added a commit to AmatyaAvadhanula/druid that referenced this pull request Oct 13, 2022
AmatyaAvadhanula added a commit that referenced this pull request Oct 28, 2022
* Revert "Improve kinesis task assignment after resharding (#12235)"

This reverts commit 1ec57cb.
4 participants