
Retrieve topic/partition storage information using Admin describeLogDirs #42

Merged
@MikeEdgar merged 9 commits into main from topic-storage on Sep 27, 2023

Conversation

@MikeEdgar (Member) commented Sep 18, 2023

Topic summary storage:
[screenshot]

Partition storage:
[screenshot]

@MikeEdgar added this to the 0.0.7 milestone Sep 18, 2023
ui/utils/format.ts — review thread (outdated, resolved)
@MikeEdgar marked this pull request as ready for review September 21, 2023 18:51
private final String host;
private final int port;
private final String rack;
private Either<ReplicaInfo, Error> log;
@tombentley (Contributor):

A node hosts many replicas, so why is there just a single one here? TBH it's not clear to me why it's better to model the relationship between nodes and replicas on the Node, rather than on the Partition.

@MikeEdgar (Member Author):

This goes back to my earlier comment about the Node model. Here, a Node represents the placement of a partition replica. Thus, the log is the storage used by that partition's replica on the node. As you're pointing out, this is not clear with the way it's modeled currently - or at least the way the models are named.

@tombentley (Contributor):

I did wonder about how many node instances there were floating about for a given node id! If you wanted to stick with this way of doing it then we should change the name Node to ReplicaNode or something.

@MikeEdgar (Member Author):

Suppose we do something like this:

// replaces `Node` in the context of describing a replica
class Replica {
  int nodeId;
  String nodeRack;
  boolean leader;
  boolean inSync;
  // localStorage? Depends on if/how we can get tiered storage info
  Either<ReplicaStorage, Error> storage;
}

// replaces `ReplicaInfo` (too generic?)
record ReplicaStorage(
  long localSize,
  long offsetLag,
  boolean future
  ) {
}

The partition model could then simply have a List of Replicas, and get rid of leader and isr as separate fields.

@tombentley (Contributor):

I would keep int leaderId in the partition model; then you could use List<Replica> replicas, in the same order as returned from the Admin client (thus preserving the semantics of the first replica being the preferred leader). Of course you can encapsulate the lookup of the leader (on the partition model) via its id in a method, so the caller doesn't have to know there isn't a Replica leader field.

I'd also wonder about inlining the fields of ReplicaStorage into Replica with a separate Error storageError. Doing it as you suggest means the structure of the objects reflects how the information was retrieved (i.e. it's only because we use a separate Admin method for getting the storage info that we have a separate class, and an indirection through an Either), which feels pretty invasive. It's one thing to do that in your Java code, but when it leaks out into the JSON structure exposed by the API it starts to feel like we're exposing internal details of the implementation. At that point it might be cleaner if all the fields were nullable and each class (Replica etc.) had a single Error which was non-null whenever there was any problem. But that comes with its own downsides for writing the code in a functional way. WDYT?
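
For illustration, a minimal sketch of keeping leaderId plus an ordered List<Replica>, with the leader lookup encapsulated on the partition model (the class and method names here are hypothetical, not the ones in this PR):

import java.util.List;
import java.util.Optional;

// Illustrative sketch: keep leaderId on the partition, keep replicas in the
// Admin client's order (first entry = preferred leader), and encapsulate the
// leader lookup so callers don't need a separate Replica leader field.
record Replica(int nodeId, String nodeRack, boolean inSync) { }

record Partition(int partition, int leaderId, List<Replica> replicas) {

    // Preferred leader: the first replica in the Admin-client ordering.
    Optional<Replica> preferredLeader() {
        return replicas.stream().findFirst();
    }

    // Current leader, resolved by id rather than stored as a duplicate field.
    Optional<Replica> leader() {
        return replicas.stream().filter(r -> r.nodeId() == leaderId).findFirst();
    }
}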

@MikeEdgar (Member Author):

I do think there's a benefit to placing things in the Either, especially when it comes to a set of fields that may be inaccessible due to permissions or some combination of ACLs preventing the retrieval. Granted it does expose some info about the backend, but ultimately it does come down to the API operations that Kafka exposes, regardless of any server's implementation.

If we later add some additional remoteStorage for Tiered Storage information, it compartmentalizes that storage info or the reason it is not present neatly beneath a single field in the model.

@MikeEdgar (Member Author):

As mentioned in the other thread, a new object PartitionReplica was added to carry the replica nodeId and nodeRack information, plus a boolean derived from the isr array. This embeds an Either<ReplicaLocalStorage, Error> for node-local storage of the log. I'll leave this thread open in case the discussion around how/whether to use Either like this should continue.
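
For reference, a minimal, simplified sketch of the Either shape being discussed; this is illustrative only and not the console's actual implementation:

import java.util.function.Function;

// Illustrative sketch: carries either a successfully fetched value (e.g. the
// replica's local storage) or the Error explaining why it is unavailable.
class Either<T, E> {
    private final T value;
    private final E error;

    private Either(T value, E error) {
        this.value = value;
        this.error = error;
    }

    static <T, E> Either<T, E> of(T value) {
        return new Either<>(value, null);
    }

    static <T, E> Either<T, E> ofError(E error) {
        return new Either<>(null, error);
    }

    boolean isPresent() {
        return error == null;
    }

    // Map the success value, propagating the error untouched.
    <U> Either<U, E> map(Function<T, U> mapper) {
        return isPresent() ? Either.of(mapper.apply(value)) : Either.ofError(error);
    }
}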

.toList();
.map(partition -> {
var key = new TopicPartition(topic.getName(), partition.getPartition());
List<Integer> value = partition.getReplicas().stream().map(Node::getId).toList();
@tombentley (Contributor):

Should partition have a getReplicaNodeIds() to encapsulate this?
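
Presumably something like the following (a hypothetical accessor on the partition model, mirroring the call site quoted above):

// Hypothetical accessor on the partition model, encapsulating the mapping
// from replica Nodes to their ids used at the call site above.
public List<Integer> getReplicaNodeIds() {
    return getReplicas().stream().map(Node::getId).toList();
}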

@tombentley (Contributor):

I have a UX question about the screenshots... the 'Replicas' and 'In-sync Replicas' columns are counts, not node ids, right? I wonder if the 'In-sync Replicas' being a count just puts cognitive load on the user. They're basically having to compare it to the 'Replicas' column to understand what it means. It's not such a problem if RF=3 uniformly, but that's not always the case:

  • There are legitimate uses for topics with RF=1.
  • During a partition reassignment (one which doesn't permanently change the RF) the replicas will increase temporarily. If replicas=5 and insync=3 there's not enough context to know whether this is a problem or not.

I don't know enough about the overall UX design to really understand how problematic this might be. For example how easy it is to see if there's an ongoing reassignment that's affecting a particular partition. Or if we had plans to highlight the count of insync replicas if it was unexpectedly less than replicas.

@MikeEdgar (Member Author):

I've seen other UIs display the node IDs for replicas/ISRs. I'm wondering if we simply display the list of node IDs for the replicas, somehow indicate the leader (first in the list?), and then differentiate in-sync from not-in-sync for the others.

For the scenario you describe with a reassignment, perhaps there are additional Admin operations (listPartitionReassignments ?) that would give enough information to give hints in the UI.
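
For context, a rough sketch of how the Admin client's listPartitionReassignments could be used to check whether a partition is being reassigned; this is illustrative only and not part of this PR:

import java.util.Map;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.PartitionReassignment;
import org.apache.kafka.common.TopicPartition;

class ReassignmentHints {

    // Illustrative only: list all in-progress reassignments and check whether a
    // given partition is affected. Async composition and error handling omitted.
    static boolean isReassigning(Admin admin, TopicPartition partition) throws Exception {
        Map<TopicPartition, PartitionReassignment> reassignments =
                admin.listPartitionReassignments().reassignments().get();
        return reassignments.containsKey(partition);
    }
}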

@tombentley (Contributor):

> I've seen other UIs display the node IDs for replicas/ISRs. I'm wondering if we simply display the list of node IDs for the replicas, somehow indicate the leader (first in the list?), and then differentiate in-sync from not-in-sync for the others.

From Kafka's PoV, replicas is already a list and the first broker in it is the preferred leader. If we tried to use the order of the brokers in a list of replicas to represent the current leader it would confuse people. It's also quite subtle. A specific column for the leader is impossible to misinterpret.

I did wonder whether, rather than "in sync replicas", we showed a list of "out of sync replicas" (i.e. just replicas - isr). That has the advantage that the normal happy state is always the empty list (i.e. blank) and would naturally draw the eye to those partitions where something less usual was going on. It also gives us a natural place to add extra information (e.g. a tooltip or small text explaining that we're moving a replica, or that a broker was recently restarted, or whatever).

> For the scenario you describe with a reassignment, perhaps there are additional Admin operations (listPartitionReassignments?) that would give enough information to give hints in the UI.

Yes, listPartitionReassignments is how you'd determine whether there was a reassignment going on. But we should only write the code once we have a clear idea of what we're aiming for.
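
For context, the "out of sync = replicas - isr" idea can be computed directly from the partition description the Admin client already returns; a rough, illustrative sketch:

import java.util.List;
import org.apache.kafka.common.Node;
import org.apache.kafka.common.TopicPartitionInfo;

class OutOfSyncReplicas {

    // Illustrative only: "out of sync replicas" is just replicas minus isr,
    // taken from the TopicPartitionInfo returned by the Admin client's describeTopics.
    static List<Integer> outOfSyncNodeIds(TopicPartitionInfo partition) {
        List<Integer> isr = partition.isr().stream().map(Node::id).toList();
        return partition.replicas().stream()
                .map(Node::id)
                .filter(id -> !isr.contains(id))
                .toList();
    }
}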

@riccardo-forina (Collaborator):

I loved reading all the comments, there is a lot of knowledge there. But it left me wondering if we need to expose a formally correct view of how Kafka works through our APIs at all. What if we decide beforehand exactly what we want to show on the console, and have the API expose just that? It means aggregating data, simplifying some concepts, putting our own spices into the "Kafka console recipe". WDYT?

@MikeEdgar (Member Author) commented Sep 22, 2023

> But it left me wondering if we need to expose a formally correct view of how Kafka works through our APIs at all. What if we decide beforehand exactly what we want to show on the console, and have the API expose just that? It means aggregating data, simplifying some concepts, putting our own spices into the "Kafka console recipe". WDYT?

I'm looking through the model and thinking about that too. Ideally we can keep the model as DRY as possible and we'll need a way to communicate errors at a granular level + request additional data (which is supported currently with the Either server-side model and union in the Zod model).

@tombentley , I played around a bit with the replica display based on your latest comment and here's one possible view. Node 1 was stopped prior to publishing a batch of records. This shows both the out-of-sync replicas being populated as well as the leader Id being different from the preferred leader (first in list) for several partitions.

[screenshot]

Response JSON:

{ "data": { "id": "xdgBIiIiREyDGmmFrJisSA", "type": "topics", "attributes": { "name": "t2", "internal": false, "partitions": [
{ "partition": 0, "replicas": [ { "nodeId": 0, "inSync": true, "localStorage": { "size": 0, "offsetLag": 0, "future": false } }, { "nodeId": 1, "inSync": false, "localStorage": { "meta": { "type": "error" }, "title": "Unable to fetch replica log metadata", "detail": "Timed out waiting for a node assignment. Call: describeLogDirs" } }, { "nodeId": 2, "inSync": true, "localStorage": { "size": 0, "offsetLag": 0, "future": false } } ], "leaderId": 0, "offsets": { "earliest": { "offset": 0, "leaderEpoch": 1 }, "latest": { "offset": 0, "leaderEpoch": 1 }, "maxTimestamp": { "offset": -1 } }, "recordCount": 0, "leaderLocalStorage": 0 },
{ "partition": 1, "replicas": [ { "nodeId": 2, "inSync": true, "localStorage": { "size": 0, "offsetLag": 0, "future": false } }, { "nodeId": 0, "inSync": true, "localStorage": { "size": 0, "offsetLag": 0, "future": false } }, { "nodeId": 1, "inSync": false, "localStorage": { "meta": { "type": "error" }, "title": "Unable to fetch replica log metadata", "detail": "Timed out waiting for a node assignment. Call: describeLogDirs" } } ], "leaderId": 2, "offsets": { "earliest": { "offset": 0, "leaderEpoch": 1 }, "latest": { "offset": 0, "leaderEpoch": 1 }, "maxTimestamp": { "offset": -1 } }, "recordCount": 0, "leaderLocalStorage": 0 },
{ "partition": 2, "replicas": [ { "nodeId": 1, "inSync": false, "localStorage": { "meta": { "type": "error" }, "title": "Unable to fetch replica log metadata", "detail": "Timed out waiting for a node assignment. Call: describeLogDirs" } }, { "nodeId": 2, "inSync": true, "localStorage": { "size": 0, "offsetLag": 0, "future": false } }, { "nodeId": 0, "inSync": true, "localStorage": { "size": 0, "offsetLag": 0, "future": false } } ], "leaderId": 2, "offsets": { "earliest": { "offset": 0, "leaderEpoch": 1 }, "latest": { "offset": 0, "leaderEpoch": 1 }, "maxTimestamp": { "offset": -1 } }, "recordCount": 0, "leaderLocalStorage": 0 },
{ "partition": 3, "replicas": [ { "nodeId": 0, "inSync": true, "localStorage": { "size": 360, "offsetLag": 0, "future": false } }, { "nodeId": 2, "inSync": true, "localStorage": { "size": 360, "offsetLag": 0, "future": false } }, { "nodeId": 1, "inSync": false, "localStorage": { "meta": { "type": "error" }, "title": "Unable to fetch replica log metadata", "detail": "Timed out waiting for a node assignment. Call: describeLogDirs" } } ], "leaderId": 0, "offsets": { "earliest": { "offset": 0, "leaderEpoch": 1 }, "latest": { "offset": 32, "leaderEpoch": 1 }, "maxTimestamp": { "offset": 0, "timestamp": "2023-09-22T16:31:51.871Z", "leaderEpoch": 1 } }, "recordCount": 32, "leaderLocalStorage": 360 },
{ "partition": 4, "replicas": [ { "nodeId": 2, "inSync": true, "localStorage": { "size": 0, "offsetLag": 0, "future": false } }, { "nodeId": 1, "inSync": false, "localStorage": { "meta": { "type": "error" }, "title": "Unable to fetch replica log metadata", "detail": "Timed out waiting for a node assignment. Call: describeLogDirs" } }, { "nodeId": 0, "inSync": true, "localStorage": { "size": 0, "offsetLag": 0, "future": false } } ], "leaderId": 2, "offsets": { "earliest": { "offset": 0, "leaderEpoch": 1 }, "latest": { "offset": 0, "leaderEpoch": 1 }, "maxTimestamp": { "offset": -1 } }, "recordCount": 0, "leaderLocalStorage": 0 },
{ "partition": 5, "replicas": [ { "nodeId": 1, "inSync": false, "localStorage": { "meta": { "type": "error" }, "title": "Unable to fetch replica log metadata", "detail": "Timed out waiting for a node assignment. Call: describeLogDirs" } }, { "nodeId": 0, "inSync": true, "localStorage": { "size": 0, "offsetLag": 0, "future": false } }, { "nodeId": 2, "inSync": true, "localStorage": { "size": 0, "offsetLag": 0, "future": false } } ], "leaderId": 0, "offsets": { "earliest": { "offset": 0, "leaderEpoch": 1 }, "latest": { "offset": 0, "leaderEpoch": 1 }, "maxTimestamp": { "offset": -1 } }, "recordCount": 0, "leaderLocalStorage": 0 }
], "authorizedOperations": [ "DELETE", "ALTER_CONFIGS", "WRITE", "CREATE", "DESCRIBE", "DESCRIBE_CONFIGS", "ALTER", "READ" ], "recordCount": 32, "leaderLocalStorageTotal": 360 } } }

@MikeEdgar (Member Author):

@tombentley , @riccardo-forina please take another look and let me know if the latest addresses your concerns (or raises new ones 😃)

@riccardo-forina (Collaborator) left a comment:

👍

@tombentley (Contributor):

About the revised UI screenshot: I think that's better, but still not perfect.

  • It's not clear that the 'out of sync replicas' is a set of node ids. When the out of sync replicas is a singleton a user might misinterpret it as a count. (In fact, in the corner case where node 0 is the one that's out of sync, the user would then think that all was well.) To me it would be natural to use { and } affixes to make it clear that this is a set. But I was a mathematician and I'm willing to be in a minority about whether that seems natural to people more generally. We would need to do something similar for 'Replicas' to be consistent (technically 'Replicas' is a list, and 'Out of sync replicas' is a set, but maybe we just treat them both as lists for the sake of the UI).
  • 0 bits seems weird; it should be 0 bytes

Comment on lines 274 to 275
.thenApply(nothing -> promise.complete(result))
.exceptionally(promise::completeExceptionally);
@tombentley (Contributor):

What are we getting out of these two calls? The final thenCompose() already returns a CompletableFuture, so these two calls just seem to be completing the promise with the same outcome as the one returned by thenCompose().

@MikeEdgar (Member Author):

The thenCompose calls return CompletableFuture<Void>, whereas the method here returns CompletableFuture<Map<...>>. I find it more readable for the meaningful calls to "align" and then finish up separately. Each one enriches the result map and returns nothing, followed by the completion calls.

Aside - listOffsets and describeLogDirs could probably be called in parallel at some point (see the sketch after the snippets below).

current:

CompletableFuture.allOf(pendingDescribes)
    .thenCompose(nothing -> listOffsets(adminClient, result, offsetSpec))
    .thenCompose(nothing -> describeLogDirs(adminClient, result))
    .thenApply(nothing -> promise.complete(result))
    .exceptionally(promise::completeExceptionally);

alternate:

CompletableFuture.allOf(pendingDescribes)
    .thenCompose(nothing -> listOffsets(adminClient, result, offsetSpec))
    .thenCompose(nothing -> describeLogDirs(adminClient, result))
    .thenApply(nothing -> result);
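
A rough sketch of the parallel variant mentioned in the aside, assuming listOffsets and describeLogDirs each return a CompletableFuture<Void> that enriches result as a side effect (illustrative only):

// Sketch only: start both enrichment calls at once and wait for both, instead of
// chaining them sequentially.
CompletableFuture.allOf(pendingDescribes)
    .thenCompose(nothing -> CompletableFuture.allOf(
            listOffsets(adminClient, result, offsetSpec),
            describeLogDirs(adminClient, result)))
    .thenApply(nothing -> result);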

@MikeEdgar (Member Author):

> About the https://github.com/eyefloaters/console/pull/42#issuecomment-1731784667 I think that's better but still not perfect.

I agree it needs improvement. Let's use this version as the starting point for discussion and iterate from there. There will no doubt be other things to clean up in the presentation as well. Unless you have any objection, I'd like to merge this one and start doing some more serious exploration around the metrics piece.

> 0 bits seems weird; it should be 0 bytes

Fixed in the latest; it now appears as 0 B.

[screenshot]

@sonarqubecloud:

Kudos, SonarCloud Quality Gate passed!

Bugs: 0 (rated A)
Vulnerabilities: 0 (rated A)
Security Hotspots: 0 (rated A)
Code Smells: 0 (rated A)
Coverage: 95.0%
Duplication: 0.0%

@tombentley (Contributor) left a comment:

Thanks!

@MikeEdgar merged commit 198661d into main Sep 27, 2023
@MikeEdgar deleted the topic-storage branch September 27, 2023 21:24