feat: add JMX metric for commandRunner status #4019

Merged

Member

@stevenpyzhang stevenpyzhang commented Dec 2, 2019

Description

#3962 changes the commandRunner to never skip a command if it's gone through the transaction protocol. We need to expose some metric that can be used to alert if the commandRunner thread is stuck on a particular command.

This PR introduces a JMX metric for the commandRunner thread status.
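
As a rough illustration of what exposing a status over JMX involves, the sketch below registers a standard MBean with the JDK's platform MBean server and reads the attribute back through the server, which is roughly what JConsole does when it displays the value. It uses only `javax.management`, not the metrics classes touched by this PR, and the class names and the object name `sketch.example:type=command-runner-status` are made up for the example.

```java
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;

final class CommandRunnerStatusJmxSketch {

  // Standard MBean convention: a public interface named <ImplClass>MBean.
  // Each getter becomes a read-only JMX attribute ("Status" here).
  public interface StatusHolderMBean {
    String getStatus();
  }

  // Hypothetical holder; a real command runner would update it as it works.
  public static final class StatusHolder implements StatusHolderMBean {
    private volatile String status = "RUNNING";

    @Override
    public String getStatus() {
      return status;
    }

    // Not part of the MBean interface, so JMX clients cannot call it.
    public void setStatus(final String newStatus) {
      status = newStatus;
    }
  }

  // Made-up object name, chosen only for this sketch.
  static final String OBJECT_NAME = "sketch.example:type=command-runner-status";

  // Registers a fresh holder, replacing any earlier registration under the same name.
  static StatusHolder register() {
    try {
      final MBeanServer server = ManagementFactory.getPlatformMBeanServer();
      final ObjectName name = new ObjectName(OBJECT_NAME);
      if (server.isRegistered(name)) {
        server.unregisterMBean(name);
      }
      final StatusHolder holder = new StatusHolder();
      server.registerMBean(holder, name);
      return holder;
    } catch (final Exception e) {
      throw new IllegalStateException("JMX registration failed", e);
    }
  }

  // Reads the attribute through the MBean server, as a JMX client would.
  static String readStatus() {
    try {
      final MBeanServer server = ManagementFactory.getPlatformMBeanServer();
      return (String) server.getAttribute(new ObjectName(OBJECT_NAME), "Status");
    } catch (final Exception e) {
      throw new IllegalStateException("JMX read failed", e);
    }
  }

  public static void main(final String[] args) {
    final StatusHolder holder = register();
    System.out.println(readStatus());
    holder.setStatus("ERROR");
    System.out.println(readStatus());
  }
}
```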

Testing done

Cherry-picked #3962 to this branch for testing.
Put CREATE STREAM qwerqweq(age BIGINT) WITH (KAFKA_TOPIC='foo', VALUE_FORMAT='DELIMITED'); into the command topic
Deleted topic foo
Started server
Watched the metric value in JConsole. After 15 seconds it went from RUNNING to ERROR because the server was stuck processing the above command.
Created topic foo with zookeeper
The server completed startup and the metric value went back to RUNNING

Reviewer checklist

  • Ensure docs are updated if necessary. (eg. if a user visible feature is being added or changed).
  • Ensure relevant issues are linked (description should include text like "Fixes #")

@stevenpyzhang stevenpyzhang requested a review from a team as a code owner December 2, 2019 21:42
Contributor

@rodesai rodesai left a comment

Thanks, @stevenpyzhang! Feedback inline.

final String metricName = "liveness-indicator";
final String description =
"A metric indicating the status of the commandRunner. "
+ "If value 1, the commandRunner is processing commands normally."
Contributor

If you use a Gauge then you can use a Gauge<String>, and have the metric value be a string value that defines the current status, e.g. "RUNNING" vs "ERROR". With this approach, I'd use an enum to define the possible statuses, and then use the name() method to get the String.

Contributor

If we define the enum you suggested, what's the advantage of emitting the metric as a string rather than an integer? I'm not sure how well datadog (and other similar tools) plays with string-valued metrics.

Member Author

It should be ok I think. Wouldn't using strings be similar to how a query's status is tracked by string values?

Contributor

The advantage is that it makes more sense if you're just dumping the JMX. The datadog agent lets you configure a mapping from a string to a numerical metric. The disadvantage there is that you have to keep the mapping in the dd config in sync with your code (so if you add a new status you have to update your config). I'm leaning toward emitting a numerical metric for that reason. I think we should still implement it as an enum though in our code.
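
To make the trade-off concrete, here is a minimal sketch of keeping the enum in code while emitting a number. The value 1 for RUNNING comes from the metric description in the diff above; the value 0 for ERROR, the class name, and the use of plain `Supplier`s in place of a metrics-library gauge are assumptions made for illustration.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

final class StatusGaugeSketch {

  enum Status {
    RUNNING(1), // per the metric description: 1 means commands are processed normally
    ERROR(0);   // assumption: the thread does not state the numeric value for ERROR

    private final int code;

    Status(final int code) {
      this.code = code;
    }

    int code() {
      return code;
    }
  }

  // Numeric form: a monitoring agent can alert on the value without a string mapping.
  static Supplier<Integer> numericGauge(final AtomicReference<Status> current) {
    return () -> current.get().code();
  }

  // String form: easier to read when simply dumping JMX attributes.
  static Supplier<String> stringGauge(final AtomicReference<Status> current) {
    return () -> current.get().name();
  }

  public static void main(final String[] args) {
    final AtomicReference<Status> current = new AtomicReference<>(Status.RUNNING);
    final Supplier<Integer> numeric = numericGauge(current);
    final Supplier<String> text = stringGauge(current);
    System.out.println(numeric.get() + " " + text.get());
    current.set(Status.ERROR);
    System.out.println(numeric.get() + " " + text.get());
  }
}
```

Giving each constant an explicit code rather than relying on `ordinal()` means that reordering or adding statuses later does not silently change the values already being alerted on.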

@stevenpyzhang stevenpyzhang changed the title from "feat: add metric for commandRunner status" to "feat: add JMX metric for commandRunner status" on Dec 4, 2019
@@ -77,6 +77,13 @@
"Minimum time between consecutive health check evaluations. Health check queries before "
+ "the interval has elapsed will receive cached responses.";

static final String KSQL_COMMAND_RUNNER_HEALTH_CHECK_MS =
KSQL_CONFIG_PREFIX + "server.command.runner.healthcheck.ms";
Contributor

This implies some sort of health-checking interval. I'd name this something like: server.command.blocked.threshold.error

Member Author

Changed it to server.command.blocked.threshold.error.ms
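
For readers following the rename, a minimal sketch of how such a threshold could drive the status: a command that has been in flight for at least the configured number of milliseconds is reported as ERROR, and anything else as RUNNING. The class, its method names, and the choice of `>=` at the boundary are assumptions for illustration, not the PR's actual code. The 15-second value mirrors the manual test in the description.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

final class BlockedThresholdSketch {

  enum Status { RUNNING, ERROR }

  private static final long IDLE = -1L; // sentinel: no command currently in flight

  private final LongSupplier clockMs;          // injected so tests can control time
  private final long blockedThresholdErrorMs;  // e.g. the value of the *.threshold.error.ms config
  private volatile long commandStartedAtMs = IDLE;

  BlockedThresholdSketch(final LongSupplier clockMs, final long blockedThresholdErrorMs) {
    this.clockMs = clockMs;
    this.blockedThresholdErrorMs = blockedThresholdErrorMs;
  }

  void commandStarted() {
    commandStartedAtMs = clockMs.getAsLong();
  }

  void commandFinished() {
    commandStartedAtMs = IDLE;
  }

  // ERROR only while a command has been in flight for at least the threshold.
  Status status() {
    final long startedAt = commandStartedAtMs;
    if (startedAt == IDLE) {
      return Status.RUNNING;
    }
    final long elapsedMs = clockMs.getAsLong() - startedAt;
    return elapsedMs >= blockedThresholdErrorMs ? Status.ERROR : Status.RUNNING;
  }

  public static void main(final String[] args) {
    final AtomicLong now = new AtomicLong(0L);
    final BlockedThresholdSketch tracker = new BlockedThresholdSketch(now::get, 15_000L);
    tracker.commandStarted();
    now.set(14_999L);
    System.out.println(tracker.status());
    now.set(15_000L);
    System.out.println(tracker.status());
    tracker.commandFinished();
    System.out.println(tracker.status());
  }
}
```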

Contributor

@rodesai rodesai left a comment

Looking good. Couple more bits of feedback inline.

}
}

private void checkCommandRunnerStatus(
Contributor

I think the spirit of this test is great: if it passes, we're confident the code under test is doing what it's supposed to do. The problem is that because it relies on timing, it's prone to spurious test failures. I think we can make a couple of tweaks to make the test deterministic:
- pass a mock clock to command runner so we control the time changes
- instead of using a sleep to simulate delays, have the command runner wait on a condition or countdown latch.

so you get something like this:

...
givenQueuedCommands(queuedCommand1);
// A Supplier is enough here: the runner only needs get() to return the current time.
Supplier<Long> clock = mock(Supplier.class);
CountDownLatch latch = new CountDownLatch(1);
CommandRunner commandRunner = new CommandRunner(..., clock::get, ...);
when(clock.get()).thenReturn(0L).thenReturn(500L).thenReturn(1000L).thenReturn(2000L);
// Block the handler until the test releases the latch; an Answer has to return a value.
when(statementExecutor.handleStatement()).thenAnswer(i -> { latch.await(); return null; });
Thread t = new Thread(() -> {
    commandRunner.fetchAndRunCommands();
});
t.start();
assertThat(commandRunner.checkCommandRunnerStatus(), is(RUNNING));
assertThat(commandRunner.checkCommandRunnerStatus(), is(ERROR));
latch.countDown();
t.join();
assertThat(commandRunner.checkCommandRunnerStatus(), is(RUNNING));
...
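
The same idea can be tried without Mockito or the real CommandRunner. The sketch below is a self-contained approximation: `TinyRunner` is a hypothetical stand-in, an `AtomicLong` plays the mock clock, and a second latch (`started`, not in the snippet above) makes sure the worker is already inside the command before the first status check, so no assertion depends on thread scheduling.

```java
import java.util.Arrays;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

final class DeterministicStatusTestSketch {

  enum Status { RUNNING, ERROR }

  // Hypothetical stand-in for the command runner: tracks when the current command began.
  static final class TinyRunner {
    private final LongSupplier clockMs;
    private final long thresholdMs;
    private volatile long startedAtMs = -1L; // -1 means idle

    TinyRunner(final LongSupplier clockMs, final long thresholdMs) {
      this.clockMs = clockMs;
      this.thresholdMs = thresholdMs;
    }

    void run(final Runnable command) {
      startedAtMs = clockMs.getAsLong();
      try {
        command.run();
      } finally {
        startedAtMs = -1L;
      }
    }

    Status status() {
      final long startedAt = startedAtMs;
      final boolean blocked =
          startedAt >= 0 && clockMs.getAsLong() - startedAt >= thresholdMs;
      return blocked ? Status.ERROR : Status.RUNNING;
    }
  }

  // Returns the status observed before the threshold, past it, and after completion.
  static Status[] scenario() {
    final AtomicLong now = new AtomicLong(0L);            // controllable clock
    final CountDownLatch started = new CountDownLatch(1); // worker is inside the command
    final CountDownLatch release = new CountDownLatch(1); // test lets the command finish
    final TinyRunner runner = new TinyRunner(now::get, 1_000L);

    final Thread worker = new Thread(() -> runner.run(() -> {
      started.countDown();
      awaitQuietly(release);
    }));
    worker.start();

    try {
      started.await();
      final Status beforeThreshold = runner.status();
      now.set(2_000L); // advance time past the threshold while the command is blocked
      final Status pastThreshold = runner.status();
      release.countDown();
      worker.join();
      final Status afterCompletion = runner.status();
      return new Status[] {beforeThreshold, pastThreshold, afterCompletion};
    } catch (final InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
  }

  private static void awaitQuietly(final CountDownLatch latch) {
    try {
      latch.await();
    } catch (final InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }

  public static void main(final String[] args) {
    System.out.println(Arrays.toString(scenario()));
  }
}
```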

Member Author

updated

Contributor

@rodesai rodesai left a comment

LGTM!

3 participants