-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add JMX metric for commandRunner status #4019
feat: add JMX metric for commandRunner status #4019
Conversation
49e5d43
to
52bbdb1
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, @stevenpyzhang! Feedback inline.
ksql-rest-app/src/main/java/io/confluent/ksql/rest/server/computation/CommandRunner.java
Outdated
Show resolved
Hide resolved
ksql-rest-app/src/main/java/io/confluent/ksql/rest/server/computation/CommandRunner.java
Outdated
Show resolved
Hide resolved
final String metricName = "liveness-indicator"; | ||
final String description = | ||
"A metric indicating the status of the commandRunner. " | ||
+ "If value 1, the commandRunner is processing commands normally." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you use a Gauge
then you can use a Gauge<String>
, and have the metric value be a string value that defines the current status, e.g. "RUNNING"
vs "ERROR"
. With this approach, I'd use an enum to define the possible statuses, and then use the name()
method to get the String.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we define the enum you suggested, what's the advantage of emitting the metric as a string rather than an integer? I'm not sure how well datadog (and other similar tools) plays with string-valued metrics.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should be ok I think. Wouldn't using strings be similar to how a query's status is tracked by string values?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The advantage is that it makes more sense if you're just dumping the JMX. The datadog agent lets you configure a mapping from a string to a numerical metric. The disadvantage there is that you have to keep the mapping in the dd config in sync with your code (so if you add a new status you have to update your config). I'm leaning toward emitting a numerical metric for that reason. I think we should still implement it as an enum though in our code.
ksql-rest-app/src/main/java/io/confluent/ksql/rest/server/computation/CommandRunner.java
Show resolved
Hide resolved
ksql-rest-app/src/main/java/io/confluent/ksql/rest/server/computation/CommandRunner.java
Outdated
Show resolved
Hide resolved
ksql-rest-app/src/main/java/io/confluent/ksql/rest/server/computation/CommandRunner.java
Outdated
Show resolved
Hide resolved
ksql-rest-app/src/main/java/io/confluent/ksql/rest/server/computation/CommandRunner.java
Outdated
Show resolved
Hide resolved
52bbdb1
to
be00fc8
Compare
11e943f
to
2c09bc3
Compare
2c09bc3
to
e0d55ab
Compare
e0d55ab
to
119a5b4
Compare
@@ -77,6 +77,13 @@ | |||
"Minimum time between consecutive health check evaluations. Health check queries before " | |||
+ "the interval has elapsed will receive cached responses."; | |||
|
|||
static final String KSQL_COMMAND_RUNNER_HEALTH_CHECK_MS = | |||
KSQL_CONFIG_PREFIX + "server.command.runner.healthcheck.ms"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This implies some sort of healtchecking interval. I'd name this something like: server.command.blocked.threshold.error
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed it to server.command.blocked.threshold.error.ms
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good. Couple more bits of feedback inline.
ksql-rest-app/src/test/java/io/confluent/ksql/rest/server/computation/CommandRunnerTest.java
Outdated
Show resolved
Hide resolved
ksql-rest-app/src/test/java/io/confluent/ksql/rest/server/computation/CommandRunnerTest.java
Outdated
Show resolved
Hide resolved
} | ||
} | ||
|
||
private void checkCommandRunnerStatus( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the spirit of this test is great - if it passes we're confident the code under test is doing what its supposed to do. The problem is that because it relies on timing, it's prone to spurious test failures. I think we can make a couple tweaks to make the test deterministic:
- pass a mock clock to command runner so we control the time changes
- instead of using a sleep to simulate delays, have the command runner wait on a condition or countdown latch.
so you get something like this:
...
givenQueuedCommands(queuedCommand1);
Producer<Long> clock = mock(Producer.class);
CountDownLatch latch = new CountDownLatch(1);
CommandRunner commandRunner = new CommandRunner(..., clock::get, ...);
when(clock.get()).thenReturn(0).thenReturn(500).thenReturn(1000).thenReturn(2000);
when(statementExecutor.handleStatement()).thenAnswer(i -> latch.await());
Thread t = new Thread(() -> {
commandRunner.fetchAndRunCommands();
});
assertThat(commandRunner.checkCommandRunnerStatus(), is(RUNNING));
assertThat(commandRunner.checkCommandRunnerStatus(), is(ERROR));
latch.countDown();
t.join();
assertThat(commandRunner.checkCommandRunnerStatus(), is(RUNNING));
...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated
3835391
to
6a27b99
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
Description
#3962 changes the commandRunner to never skip a command if it's gone through the transaction protocol. We need to expose some metric that can be used to alert if the commandRunner thread is stuck on a particular command.
This PR introduces a JMX metric for the commandRunner thread status.
Testing done
Cherry-picked #3962 to this branch for testing.
Put
CREATE STREAM qwerqweq(age BIGINT) WITH (KAFKA_TOPIC='foo', VALUE_FORMAT='DELIMITED');
into the command topicDeleted topic foo
Started server
Watched the metric value in JConsole, after 15 seconds it went from RUNNING to ERROR since it was stuck processing the above command.
Created topic foo with zookeeper
The server completed start up and the metric value went back to RUNNING
Reviewer checklist