KAFKA-12788: improve KRaft replica placement #10494
Conversation
Implement the existing Kafka replica placement algorithm for KRaft. This also means implementing rack awareness. Previously, we just chose replicas randomly in a non-rack-aware fashion. Also, allow replicas to be placed on fenced brokers if there are no other choices. This was specified in KIP-631 but previously not implemented.
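As a rough orientation for the rest of the review, here is a minimal, hypothetical Java sketch of the placement idea described above: stripe replicas across racks and fall back to fenced brokers only when there is no other choice. All names here are illustrative; the actual StripedReplicaPlacer in this PR additionally randomizes the starting rack and per-list offsets and reshuffles the lists periodically, as discussed in the comments below.

import java.util.ArrayList;
import java.util.List;

public class RackAwarePlacementSketch {

    static class Broker {
        final int id;
        final boolean fenced;
        Broker(int id, boolean fenced) {
            this.id = id;
            this.fenced = fenced;
        }
    }

    // Place one partition's replicas by walking the racks round-robin,
    // preferring unfenced brokers and using fenced brokers only if the
    // requested replication factor cannot otherwise be reached.
    static List<Integer> place(List<List<Broker>> racks, int startRack, int replicationFactor) {
        List<Integer> unfenced = new ArrayList<>();
        List<Integer> fenced = new ArrayList<>();
        int totalBrokers = 0;
        for (List<Broker> rack : racks) {
            totalBrokers += rack.size();
        }
        int rack = startRack;
        int round = 0;
        // Take at most one broker per rack per round, so replicas are striped
        // across racks rather than packed into a single rack.
        while (unfenced.size() + fenced.size() < totalBrokers
                && unfenced.size() < replicationFactor) {
            List<Broker> brokersInRack = racks.get(rack);
            if (round < brokersInRack.size()) {
                Broker broker = brokersInRack.get(round);
                (broker.fenced ? fenced : unfenced).add(broker.id);
            }
            rack = (rack + 1) % racks.size();
            if (rack == startRack) {
                round++;
            }
        }
        // Fenced brokers are appended only when there are no other choices.
        List<Integer> placement = new ArrayList<>(unfenced);
        for (int id : fenced) {
            if (placement.size() >= replicationFactor) {
                break;
            }
            placement.add(id);
        }
        return placement;
    }
}

With three racks of one broker each, where broker 1 is fenced, this sketch returns the fenced broker last, which matches the behavior discussed in the comments below.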
@cmccabe : Thanks for the PR. A few comments below. Also, we probably want to associate a jira with the PR since it's a bit large.
* Randomly shuffle the brokers in this list.
*/
void shuffle(Random random) {
    Collections.shuffle(brokers, random);
Should we reset the offset here too?
Hmm, I don't think resetting the position is necessary... when RackList#place shuffles, it also sets the epoch to 0, so the offset will be reset anyway.
// Iteration state: epoch tracks when the list was last shuffled, index is the
// next position to return, and offset is randomized so lower broker ids are
// not favored (see the discussion of BrokerList#offset below).
private final List<Integer> brokers = new ArrayList<>(0);
private int epoch = 0;
private int index = 0;
private int offset = 0;
Could we add a brief comment explaining epoch, index, and offset?
assertEquals(Collections.singletonList(Optional.empty()), rackList.rackNames());
assertEquals(Arrays.asList(3, 2, 1), rackList.place(3));
assertEquals(Arrays.asList(2, 3, 1), rackList.place(3));
assertEquals(Arrays.asList(3, 2, 1), rackList.place(3));
Hmm, we should have shuffled the broker list here. Why is the assignment pattern repeating?
Because broker 1 is fenced, we don't place a replica there until we have to (when no other brokers remain). So it will always be the last, least preferred replica if we have 3 brokers and need a partition with replication factor 3.
    Optional.of("2"),
    Optional.of("3"),
    Optional.of("4")), rackList.rackNames());
assertEquals(Arrays.asList(41, 11, 21, 30), rackList.place(4));
Why didn't the leader start from rack 1, which is the first in the rack list? Also, why didn't rack 2 start with 20, which sorts first during initialization?
Why didn't the leader start from rack 1, which is the first in the rack list?

The starting rack is randomized. If the first partition always placed its leader on a specific rack, that would create skew, since a lot of topics are created with only one or two partitions.

Also, why didn't rack 2 start with 20, which sorts first during initialization?

BrokerList#offset is also randomized, for the same reason (to avoid favoring brokers with a lower id).
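To make the randomization concrete, here is a small, self-contained illustration (hypothetical names, not the actual RackList/BrokerList code): both the starting rack and the offset into a rack's broker list are chosen randomly, so small topics don't pile their leaders onto one rack, and broker 20 isn't preferred just because it sorts first.

import java.util.Arrays;
import java.util.List;
import java.util.Random;

public class RandomizedStartSketch {
    public static void main(String[] args) {
        Random random = new Random();
        List<String> racks = Arrays.asList("1", "2", "3");
        List<Integer> brokersInRack2 = Arrays.asList(20, 21, 22);

        // Randomized starting rack: the first replica of the first partition
        // can land on any rack with equal probability.
        int startRackIndex = random.nextInt(racks.size());

        // Randomized offset within the rack: broker 20 is not always chosen
        // first just because it has the lowest id.
        int offset = random.nextInt(brokersInRack2.size());
        int firstBrokerFromRack2 = brokersInRack2.get(offset);

        System.out.println("start rack = " + racks.get(startRackIndex)
                + ", first broker from rack 2 = " + firstBrokerFromRack2);
    }
}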
}

@Test
public void testSuccessfulPlacement() {
Could we add a test that verifies not only that the first replicas are distributed evenly, but also that, for partitions with the same first replica, their second replicas are distributed evenly?
I will add a distribution test.
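A sketch of what such a distribution test could look like (the helper and the numbers are hypothetical, and the stand-in placer below is a plain round-robin rather than the placer from this PR): place many partitions, then count how often each broker appears as the first replica, and how the second replica is distributed among partitions that share the same first replica.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DistributionTestSketch {

    // Stand-in placer for the sketch: a plain round-robin over numBrokers,
    // NOT the algorithm in this PR; it only exists so the counting below runs.
    static List<List<Integer>> placeManyPartitions(int numPartitions, int replicationFactor, int numBrokers) {
        List<List<Integer>> result = new ArrayList<>();
        for (int p = 0; p < numPartitions; p++) {
            List<Integer> replicas = new ArrayList<>();
            for (int r = 0; r < replicationFactor; r++) {
                replicas.add((p + r) % numBrokers);
            }
            result.add(replicas);
        }
        return result;
    }

    public static void main(String[] args) {
        List<List<Integer>> replicas = placeManyPartitions(10_000, 3, 5);

        Map<Integer, Integer> firstReplicaCounts = new HashMap<>();
        Map<Integer, Integer> secondReplicaCountsGivenFirstIsZero = new HashMap<>();
        for (List<Integer> partitionReplicas : replicas) {
            int first = partitionReplicas.get(0);
            firstReplicaCounts.merge(first, 1, Integer::sum);
            if (first == 0) {
                // How the second replica is spread among partitions led by broker 0.
                secondReplicaCountsGivenFirstIsZero.merge(partitionReplicas.get(1), 1, Integer::sum);
            }
        }
        // A real test would assert these counts are roughly equal, e.g. within
        // some tolerance of numPartitions / numBrokers, rather than print them.
        System.out.println("first replica counts: " + firstReplicaCounts);
        System.out.println("second replica counts when first is 0: " + secondReplicaCountsGivenFirstIsZero);
    }
}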
* each partition would try to get a replica there. In general racks are supposed to be
* about the same size -- if they aren't, this is a user error.
*
* Thirdly, we would prefer to place replicas on unfenced brokers, rather than on fenced
It would be useful to document another goal: for partitions with the same first replica, we want to distribute their second replicas evenly. This way, if the first broker fails, the new leaders will be evenly distributed among the surviving brokers. The algorithm achieves that by forcing a shuffle when the partition index is a multiple of the number of brokers.
I added some text saying that we want new leaders to be evenly distributed if any one broker is fenced.
// If we have returned as many assignments as there are unfenced brokers in
// the cluster, shuffle the rack list and broker lists to try to avoid
// repeating the same assignments again.
if (epoch == numUnfencedBrokers) {
Should this be (epoch % numUnfencedBrokers) == 0?
The epoch gets reset to 0 once it reaches numUnfencedBrokers. This avoids doing the modulus check, which is expensive, as I understand it.
Fair. I created KAFKA-12788 for this PR.
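A small, self-contained illustration of the reset-instead-of-modulus point made above (hypothetical names, not the actual RackList code): the counter is simply set back to 0 when it reaches the list size, so no modulus needs to be computed on every call.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class EpochResetSketch {
    private final List<Integer> brokers = new ArrayList<>(Arrays.asList(0, 1, 2));
    private final Random random = new Random();
    private int epoch = 0;

    List<Integer> place() {
        // Reset instead of modulus: once epoch reaches the number of brokers,
        // shuffle and set it back to 0, so the next round of assignments does
        // not repeat the previous one.
        if (epoch == brokers.size()) {
            Collections.shuffle(brokers, random);
            epoch = 0;
        }
        epoch++;
        return new ArrayList<>(brokers);
    }

    public static void main(String[] args) {
        EpochResetSketch sketch = new EpochResetSketch();
        for (int i = 0; i < 7; i++) {
            System.out.println(sketch.place());
        }
    }
}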
for (List<Integer> partitionReplicas : replicas) {
    counts.put(partitionReplicas, counts.getOrDefault(partitionReplicas, 0) + 1);
}
assertEquals(14, counts.get(Arrays.asList(0, 1)));
For even distribution, it would be useful to verify 2 things. (1) The leaders are distributed evenly when all brokers are unfenced. (2) If any broker is fenced, the new leaders are still distributed evenly.
Hmm. Currently, we don't place partitions on fenced brokers unless there are no other options. We also never make a fenced broker the leader. So condition #2 does not hold in general.
The thinking is that when a broker is fenced, it may potentially stay offline for a long time. So we don't really want to place anything there unless there is absolutely no other choice.
@cmccabe : Thanks for the updated PR. Just one more comment.
@cmccabe : Thanks for the explanation. The PR LGTM.
#10494 introduced a bug in the KRaft controller where the controller loops forever in StripedReplicaPlacer trying to identify the racks on which to place partition replicas when there is a single unfenced broker in the cluster and the number of requested partitions in a CREATE_TOPICS request is greater than 1. This patch refactors out some argument sanity checks and invokes those checks in both RackList and StripedReplicaPlacer, and it adds tests for this as well as the single-broker placement issue.
Reviewers: Jun Rao <[email protected]>
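The exact checks added by that follow-up patch are not shown here; as a hedged sketch of the shape such shared sanity checks might take, validating the request arguments up front in both RackList and StripedReplicaPlacer lets an unsatisfiable CREATE_TOPICS request fail fast instead of looping. The class name, method name, and messages below are hypothetical; InvalidReplicationFactorException is an existing Kafka exception type that checks like these could plausibly use.

import org.apache.kafka.common.errors.InvalidReplicationFactorException;

public final class PlacementChecksSketch {

    // Shared sanity check invoked before entering the placement loop, so bad
    // or unsatisfiable requests are rejected up front rather than retried.
    static void checkArguments(int replicationFactor, int numUnfencedBrokers) {
        if (replicationFactor <= 0) {
            throw new InvalidReplicationFactorException(
                "The replication factor must be positive, but it was " + replicationFactor + ".");
        }
        if (numUnfencedBrokers == 0) {
            throw new InvalidReplicationFactorException(
                "All brokers are currently fenced; unable to place replicas.");
        }
    }
}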