Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SOLR-10654: Introduce output of Prometheus metrics directly from Solr #2405

Merged
merged 33 commits into from
Jul 9, 2024

Conversation

mlbiscoc
Copy link
Contributor

@mlbiscoc mlbiscoc commented Apr 17, 2024

https://issues.apache.org/jira/browse/SOLR-10654

Description

Introduce a new Prometheus response writer /solr/admin/metrics?wt=prometheus to output Dropwizards Solr Core registry metrics directly from Solr. Output looks similar to the prometheus exporter. This exports solr.core.* solr.node solr.jvm solr.jetty metric registries

Solution

Currently, Solr metrics are collected by Dropwizard and differentiated by metric names. This is an anti-pattern to Prometheus. Manually export Dropwizard Solr core registry metrics to a new SolrPrometheusExporter base class to parse out the labels from the metric name and output in a Prometheus readable format for the Prometheus Response writer. The exporter is just temporarily created to export a snapshot of what the current Dropwizard registry is at. These are not held in memory.

The exporter will categorize these metrics and export to prometheus format with labels depending on the category and metric names.

In standalone, all labels for the solr.core registry includescore. In cloud mode, all metrics include core, collection, shard, replica

NOTE: Not all metrics are possible to convert to Prometheus such as Dropwizard string metrics. This is because Prometheus does not support non-numeric types

Tests

Added MetricsHandlerTest, SolrPrometheusExporterTest, TestPrometheusReponseWriter

Checklist

Please review the following and check all that apply:

  • I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
  • I have created a Jira issue and added the issue ID to my pull request title.
  • I have given Solr maintainers access to contribute to my PR branch. (optional but recommended)
  • I have developed this patch against the main branch.
  • I have run ./gradlew check.
  • I have added tests for my changes.
  • I have added documentation for the Reference Guide

public SolrCoreMetric parseLabels() {
String[] parsedMetric = metricName.split("\\.");
if (dropwizardMetric instanceof Gauge) {
String type = parsedMetric[2];

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible for metricName to be size 1 at this point? If so, we should surround this in a try/throw or a check.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have not seen a metric with delimited with a size of 1 yet but in cases that does, I did surround with try/catch at a higher level here

Copy link
Contributor

@dsmiley dsmiley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are enough additional dependencies that I feel strongly this should be another module.

It's a shame to see so much code to do the mapping but... shrug, ah well.

@github-actions github-actions bot added the tests label May 1, 2024
@mlbiscoc mlbiscoc marked this pull request as ready for review May 2, 2024 20:11
@mlbiscoc
Copy link
Contributor Author

mlbiscoc commented May 2, 2024

Moving this out of draft for review

@dsmiley
Copy link
Contributor

dsmiley commented May 7, 2024

There are enough additional dependencies that I feel strongly this should be another module.

WDYT?

@mlbiscoc
Copy link
Contributor Author

mlbiscoc commented May 7, 2024

There are enough additional dependencies that I feel strongly this should be another module.

WDYT?

Sorry, I somehow completely missed this comment. When you say be another module, do you mean move it to solr/modules that needs to be enabled? If so, I think that prometheus metrics is valuable enough to be part of core and enabled by default but I also only added 2 additional dependencies. Is that enough to be its own module?

@dsmiley
Copy link
Contributor

dsmiley commented May 7, 2024

io.prometheus:prometheus-metrics-config:1.1.0 (1 constraints: 3e164ed8)
io.prometheus:prometheus-metrics-exposition-formats:1.1.0 (1 constraints: 0405f335)
io.prometheus:prometheus-metrics-model:1.1.0 (2 constraints: 411b133b)
io.prometheus:prometheus-metrics-shaded-protobuf:1.1.0 (1 constraints: 

No, there are 4 dependencies. Only the last one concerns me. Why is protobuf needed at all? Can it be excluded? Can any of the others be excluded as well, while you're at it?

@dsmiley
Copy link
Contributor

dsmiley commented May 31, 2024

It doesn't feel right to merge this PR right now. Needs docs. Also the choice of what metrics are done so far is inverted from the value that would make most sense for this whole approach -- OS/JVM/node issues are highest value.

@mlbiscoc mlbiscoc changed the title SOLR-10654: Introduce output of Prometheus metrics for Solr Core registry SOLR-10654: Introduce output of Prometheus metrics directly from Solr Jun 10, 2024
@mlbiscoc
Copy link
Contributor Author

It doesn't feel right to merge this PR right now. Needs docs. Also the choice of what metrics are done so far is inverted from the value that would make most sense for this whole approach -- OS/JVM/node issues are highest value.

@dsmiley Just finished up exporting solr.node solr.jvm and solr.jetty metrics to prometheus and is ready for another round of review whenever you are. I updated the description above and attached files if you want to see what the prometheus output looks like. Last thing I did not finish writing up is docs. I'm assuming the 2 places to update would be the metrics-api section and possibly Monitoring with Prometheus and Grafana?

@github-actions github-actions bot added the documentation Improvements or additions to documentation label Jun 11, 2024
@mlbiscoc
Copy link
Contributor Author

Added ref-guide documentation on Prometheus endpoint

Copy link
Contributor

@dsmiley dsmiley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to suggest an integration test showing all the output (say to a file) stripped of numbers (to remove brittle/changing stuff) to show that we export what we mean to export. I could imagine someone renaming some metric and then forgetting to update the code here.

Comment on lines +28 to +30
public static final String NODE_REQUESTS = "solr_metrics_node_requests";
public static final String NODE_SECONDS_TOTAL = "solr_metrics_node_requests_time";
public static final String NODE_CONNECTIONS = "solr_metrics_node_connections";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay but FWIW IMO in situations like this class (parsing/exporting) constants like this only diffuse readability (add a needless indirection). Take it or leave it; your choice.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I personally prefer constants so I'd leave it unless someone feels strongly about it.

}

SolrMetric solrNodeMetric = categorizeMetric(dropwizardMetric, metricName);
solrNodeMetric.parseLabels().toPrometheus(this);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we forget to call parseLabels, will we recognize a mistake? not a big deal but just a thought

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't say its a mistake. I split these into 2 abstract functions to split out the logic for getting the labels, then converting the correct dropwizard type to the correct prometheus type instead of 1 big function.

I think it's fine if not called. If down the line additional metrics want to be exported from dropwizard, it's possible to just have it exported without any labels and just it's value to which parselabels isn't needed.

@mlbiscoc
Copy link
Contributor Author

I'd like to suggest an integration test showing all the output (say to a file) stripped of numbers (to remove brittle/changing stuff) to show that we export what we mean to export. I could imagine someone renaming some metric and then forgetting to update the code here.

Thanks for reviewing again. Added in an integration test with a file stripped of values. Also the unit tests I wrote originally should catch if any changes that happen to the naming of the prometheus metric or labels.

Also thanks for catching missing javadocs. Made sure to add them in for each class and added sample values for what is being exported to prometheus.

Copy link
Contributor

@dsmiley dsmiley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried this PR locally using the "techproducts" example. I thing I noticed is that the metrics filtering appears to not work, which is a big problem. A test should be added for this, and it should work.

http://localhost:8983/solr/admin/metrics?category=solr.jetty&wt=prometheus

return this;
}

/*
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(this feedback applies to most classes where you did this).
You've put this where javadocs go yet it's it's technically not javadocs since you didn't start with a leading double-asterisk. If you were to do that, you might need to add markup :-/ I suppose leaving this as-is is fine. Even moving it above e parseLabels and add a blank line to further show this is not (trying to be) javadoc. FWIW when I suggested adding input examples for parsing, I thought you would add some trivial "//" in-line at the point of parsing. But this is fine, as you wish.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for this. Went with your other suggestion of putting it above parseLabels and add an empty line below it.

If you still prefer just normal inline comments, I can make the appropriate changes.

*/
@Override
public void toPrometheus(SolrPrometheusExporter exporter) {
if (dropwizardMetric instanceof Meter) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I re-read all this code in the PR, esp. for these toPrometheus methods, the instanceof checks seem more awkward with questionable value. If instead we see that the name matches a pattern, let's just cast. If we get a ClassCastException, a test will catch it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went ahead and removed most of the instanceof checks but kept in some places. In the places I kept, it seemed cleaner to export that way as not all my checks are just straight names checks and we don't export every metric. So we can just export a large amount by just checking its type. I can try refactoring further of removing instanceof checks if you still disagree.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't go in the "solr" folder; that's a "solr home dir" and thus only solr config files go there.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Created it's own prometheus directory to place the file in a place we can place future test files relating to prometheus

Files.readString(
new File(TEST_PATH().toString(), "solr-prometheus-output.txt").toPath(),
StandardCharsets.UTF_8);
assertEquals(expectedOutput, actual.replaceAll("(?<=}).*", ""));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can expect there will be more cleansing/normalization to do. For example the JVM used will result in different "item" labels on solr_metrics_jvm_gc. I got "G1-Old-Generation" vs "PS-MarkSweep". People will use different JVMs for tests.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Parsed out the JVM item tag from the test with additional regex

Files.readString(
new File(TEST_PATH().toString(), "solr-prometheus-output.txt").toPath(),
StandardCharsets.UTF_8);
assertEquals(expectedOutput, actual.replaceAll("(?<=}).*", ""));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you add a comment explaining that regexp please? It's all symbols 😅

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added in a comment with explanation.

@mlbiscoc
Copy link
Contributor Author

I tried this PR locally using the "techproducts" example. I thing I noticed is that the metrics filtering appears to not work, which is a big problem. A test should be added for this, and it should work.

http://localhost:8983/solr/admin/metrics?category=solr.jetty&wt=prometheus

Filtering does work I think the parameter you are looking for is group. Try localhost:8983/solr/admin/metrics?group=jvm&wt=prometheus and you can change it with jetty node core for the corresponding registries

But still you are correct, I should have a test for this. Will add one in.

The only other thing I remember now about this is that regex, key, expr parameters for filtering will not work with this prometheus wt. I did that on purpose because those 3 parameters are for filtering based on the name of the metric while the others are kind of more generic such as registry or type filtering. I didn't think it made sense because prometheus naming is different from the dropwizard naming after exporting. For example: solr\.core\..*:QUERY\..*\.requestTimes:max_ms is not an output of prometheus but to allow expr filtering around that felt odd.

Even then if you need extensive filtering or trying to get some specific metric, maybe the Prometheus Exporter process with a custom config might be the better solution.

@dsmiley
Copy link
Contributor

dsmiley commented Jun 28, 2024

Oops; maybe I was hurried. I'm glad the category works at least! The other filtering options could be deferred to a future JIRA by matching the prometheus name. It's a very useful capability; some metrics implemented as gauges are expensive to compute, and others are just excessive output that aren't needed for O(N) for N cores if N is for some users quite high.

@mlbiscoc
Copy link
Contributor Author

mlbiscoc commented Jul 1, 2024

Oops; maybe I was hurried. I'm glad the category works at least! The other filtering options could be deferred to a future JIRA by matching the prometheus name. It's a very useful capability; some metrics implemented as gauges are expensive to compute, and others are just excessive output that aren't needed for O(N) for N cores if N is for some users quite high.

Added in a test for the filtering. I'd be happy expand on this and add further filtering feature specific to prometheus after this is ready to be merged.

Copy link
Contributor

@dsmiley dsmiley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is super close now.
Can you propose a brief CHANGES.txt under 9.7 "New Features" (I think it's closer to this than an improvement).


actualSnapshot = getMetricSnapshot(actualSnapshots, "solr_metrics_jvm_gc_seconds");
actualGaugeDataPoint =
getGaugeDatapointSnapshot(actualSnapshot, Labels.of("item", "G1-Old-Generation"));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This assertion will fail for varying JVMs; skip this one. Some others too above... it's okay to make the test looser / less specific.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed a few more tests from the JVM registry

@@ -879,7 +879,7 @@ The response will look like this:
]}
----

Like other request handlers, the Metrics API can also take the `wt` parameter to define the output format.
Like other request handlers, the Metrics API can also take the `wt` parameter to define the output format. The Metrics API also provides an additional special `wt` parameter `prometheus` that will output metrics in Prometheus format. This can be used for xref:monitoring-with-prometheus-and-grafana.adoc#metrics-api-with-prometheus-format[Monitoring with Prometheus and Grafana]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After each sentence, put a newline (doesn't change the final render but it to keep editing & diffs more manageable).

Should document what options do not work with prometheus. The regex one doesn't, you said.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added in a small note about the filter query parameters. Let me know if you prefer a different way to write it on the ref-guide

@mlbiscoc
Copy link
Contributor Author

mlbiscoc commented Jul 2, 2024

This is super close now. Can you propose a brief CHANGES.txt under 9.7 "New Features" (I think it's closer to this than an improvement).

Added entry into CHANGES.txt

@dsmiley dsmiley merged commit fd7d447 into apache:main Jul 9, 2024
3 checks passed
dsmiley pushed a commit that referenced this pull request Jul 17, 2024
An alternative to the "Prometheus Exporter" in which each Solr node can return the Prometheus format natively from the MetricsHandler using `wt=prometheus` param.  It's much faster and architecturally simpler, albeit less flexible.

Co-authored-by: mbiscocho <[email protected]>
Co-authored-by: Christine Poerschke <[email protected]>
Co-authored-by: David Smiley <[email protected]>

(cherry picked from commit fd7d447)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cat:metrics cat:search dependencies Dependency upgrades documentation Improvements or additions to documentation tests tool:build
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants