feat: add opentelemetry counters for sent and acked messages #2532
Conversation
private Attributes telemetryAttributes;
private long incomingRequestCountBuffered;
nit, I would group all of them under telemetryMetrics.
Done
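For context, a minimal sketch of what grouping the telemetry state into a single holder might look like; the class and field names here are illustrative, not taken verbatim from the PR:

```java
import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.api.metrics.LongCounter;
import io.opentelemetry.api.metrics.LongHistogram;

// Illustrative grouping of per-writer telemetry state into one holder class,
// instead of scattering individual fields across the connection worker.
final class TelemetryMetrics {
  // Attributes (table id, writer id, ...) attached to every measurement.
  Attributes telemetryAttributes;
  // Counters for acknowledged requests, rows and bytes.
  LongCounter ackedRequestCount;
  LongCounter ackedRequestRows;
  LongCounter ackedRequestBytes;
  // Network response latency histogram, in milliseconds.
  LongHistogram networkResponseLatency;
}
```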
"Reports time taken in milliseconds for a response to arrive once a message has been sent over the network.") | ||
.setExplicitBucketBoundariesAdvice(METRICS_LATENCY_BUCKETS) | ||
.build(); | ||
instrumentConnectionEstablishCount = |
I believe this can be derived from network_response_latency, if you put connection_id as the metrics field.
I have added writer_id as an attribute. I still have this metric, however, as it directly provides information about establishing a connection.
measurement.record(length, getTelemetryAttributes());
});
writeMeter
.gaugeBuilder("inflight_queue_length")
It will need connection_id as a metric attribute; otherwise it doesn't make much sense?
I have added writer_id as an attribute.
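As a rough sketch of that shape (names are illustrative, not the PR's exact code), an asynchronous gauge that samples a queue length and tags each measurement with the writer id could look like this:

```java
import io.opentelemetry.api.OpenTelemetry;
import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.api.metrics.Meter;
import java.util.concurrent.ConcurrentLinkedQueue;

final class InflightQueueGauge {
  private final ConcurrentLinkedQueue<Object> inflightQueue = new ConcurrentLinkedQueue<>();

  InflightQueueGauge(OpenTelemetry openTelemetry, String writerId) {
    Meter writeMeter = openTelemetry.getMeter("bigquery-storage-writer");
    // Attributes distinguish series emitted by different writers.
    Attributes attributes = Attributes.builder().put("writer_id", writerId).build();
    // Asynchronous gauge: the callback samples the queue length at collection time.
    writeMeter
        .gaugeBuilder("inflight_queue_length")
        .ofLongs()
        .setDescription("Length of the in-flight request queue")
        .buildWithCallback(measurement -> measurement.record(inflightQueue.size(), attributes));
  }
}
```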
.build();
instrumentSentRequestRows =
writeMeter
.counterBuilder("append_rows_sent")
I think we should maintain fewer metrics. Can we just add a "result" field to append_requests/rows/bytes?
Done. I have removed the following metrics: append_requests, append_request_bytes, append_rows, waiting_queue_length, connection_retry_count, append_requests_error, append_request_bytes_error, append_rows_error.
I now use the "error_code" attribute on each of the following metrics: append_requests_acked, append_request_bytes_acked, append_rows_acked, connection_end_count.
private LongCounter instrumentErrorRequestCount;
private LongCounter instrumentErrorRequestSize;
private LongCounter instrumentErrorRequestRows;
private static final List<Long> METRICS_LATENCY_BUCKETS =
are these millis/micros/nanos?
Renamed this to METRICS_MILLISECONDS_LATENCY_BUCKETS.
@VisibleForTesting
Attributes getTelemetryAttributes() {
return telemetryAttributes;
}

private void periodicallyReportOpenTelemetryMetrics() {
Duration durationSinceLastRefresh = Duration.between(instantLastSentMetrics, Instant.now());
if (durationSinceLastRefresh.compareTo(METRICS_UPDATE_INTERVAL) > 0) {
Are metrics updates really that costly on the producer side that you don't just update metrics at the time of the event?
In OpenCensus, flushing/updates were mostly an exporter concern.
I am testing using an exporter to Google Cloud Monitoring. I encountered "exceeded max frequency" errors with this exporter. To resolve this issue, I have switched to updating the instruments only once every second, which I believe should be sufficient for our needs.
Upon further inspection, I narrowed down the issue I was seeing to the frequency of the exporter. I have restored all metrics to be instrumented in real time.
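For reference, a minimal sketch of tuning the export frequency on the exporter side instead of throttling the instruments; GoogleCloudMetricExporter comes from the opentelemetry-operations-java exporter, and the 60-second interval is just an example value:

```java
import com.google.cloud.opentelemetry.metric.GoogleCloudMetricExporter;
import io.opentelemetry.sdk.metrics.SdkMeterProvider;
import io.opentelemetry.sdk.metrics.export.MetricExporter;
import io.opentelemetry.sdk.metrics.export.PeriodicMetricReader;
import java.time.Duration;

public final class MetricsSetup {
  public static SdkMeterProvider create() {
    MetricExporter exporter = GoogleCloudMetricExporter.createWithDefaultConfiguration();
    // Instruments can be updated in real time; the reader controls how often data
    // is pushed to Cloud Monitoring, which enforces a minimum write frequency.
    PeriodicMetricReader reader =
        PeriodicMetricReader.builder(exporter).setInterval(Duration.ofSeconds(60)).build();
    return SdkMeterProvider.builder().registerMetricReader(reader).build();
  }
}
```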
Force-pushed from 3caa290 to b6b30cf
Force-pushed from b6b30cf to 796ae3e
Force-pushed from 0b87d97 to ad164fb
private LongCounter instrumentIncomingRequestSize;
private LongCounter instrumentIncomingRequestRows;
private static final List<Long> METRICS_MILLISECONDS_LATENCY_BUCKETS =
ImmutableList.of(0L, 50L, 100L, 500L, 1000L, 5000L, 10000L, 20000L, 30000L, 60000L, 120000L);
@GaoleMeng Do these buckets look good to you? Do we need a bucket at 50L? Maybe add a 2000L?
This is too sparse. In the backend we are using power-of-1.5 buckets, i.e. 1, 1.5, 1.5^2, 1.5^3, ... milliseconds.
We were once using powers of 4, but found that was too sparse, so we reduced it to powers of 1.5.
Could we do similar bucketing here?
The power of 1.5 sequence looks like this:
1, 2, 3, 5, 8, 11, 17, 26, 38, 58, 86, 130, 195, 292, 438, 657, 985, 1478, 2217, 3325, 4988, 7482, 11223, 16834, 25251, 37877, 56815, 85223, 127834, 191751, 287627, 431440, 647160, 970740, 1456110
Would it be useful to provide all of these buckets? Alternatively, we could just provide every other bucket, so the list looks like this:
1, 3, 8, 17, 38, 86, 195, 438, 985, 2217, 4988, 11223, 25251, 56815, 127834, 287627, 647160, 1456110
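If the list should not be hard-coded, a small sketch of generating the power-of-1.5 boundaries programmatically; the cap and the rounding behaviour here are illustrative:

```java
import com.google.common.collect.ImmutableList;
import java.util.List;

final class LatencyBuckets {
  // Generates 0, 1, then successive powers of 1.5 (rounded down to whole milliseconds,
  // deduplicated) up to the given cap, e.g. ~10 minutes = 600000 ms.
  static List<Long> powerOf15Buckets(long maxMillis) {
    ImmutableList.Builder<Long> buckets = ImmutableList.builder();
    buckets.add(0L);
    long previous = 0L;
    double value = 1.0;
    while ((long) value <= maxMillis) {
      long rounded = (long) value;
      if (rounded != previous) {
        buckets.add(rounded);
        previous = rounded;
      }
      value *= 1.5;
    }
    return buckets.build();
  }
}
```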
@@ -333,6 +351,7 @@ private Attributes buildOpenTelemetryAttributes() {
if (!tableName.isEmpty()) {
builder.put(telemetryKeyTableId, tableName);
}
builder.put(telemetryKeyWriterId, writerId);
Add a comment to buildOpenTelemetryAttributes: what kind of attributes is it building, and do they apply to all metrics?
Added
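A rough sketch of what the documented attribute builder might look like; the key names and the surrounding class are illustrative:

```java
import io.opentelemetry.api.common.AttributeKey;
import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.api.common.AttributesBuilder;

final class AttributesExample {
  private static final AttributeKey<String> TABLE_ID_KEY = AttributeKey.stringKey("table_id");
  private static final AttributeKey<String> WRITER_ID_KEY = AttributeKey.stringKey("writer_id");

  private final String tableName;
  private final String writerId;

  AttributesExample(String tableName, String writerId) {
    this.tableName = tableName;
    this.writerId = writerId;
  }

  /**
   * Builds the attribute set (destination table and writer id) that is attached to every
   * metric emitted by this writer, so series from different writers can be told apart.
   */
  Attributes buildOpenTelemetryAttributes() {
    AttributesBuilder builder = Attributes.builder();
    if (!tableName.isEmpty()) {
      builder.put(TABLE_ID_KEY, tableName);
    }
    builder.put(WRITER_ID_KEY, writerId);
    return builder.build();
  }
}
```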
ImmutableList.of(0L, 50L, 100L, 500L, 1000L, 5000L, 10000L, 20000L, 30000L, 60000L, 120000L);

private static final class OpenTelemetryMetrics {
private LongCounter instrumentSentRequestCount;
Discussed with Gaole; we think that recording both Sent and Ack may not make a significant difference. Let's just record Ack for now, for simplicity.
Done
writeMeter
.counterBuilder("append_requests")
.setDescription("Counts number of incoming requests")
.counterBuilder("append_requests_acked")
This can be a TODO: I am wondering if it is possible to add a Retry attribute to the metric.
Done
Force-pushed from ad164fb to 73c9193
LGTM, please address the bucket length issue.
Force-pushed from 73c9193 to 0572a47
// Buckets are based on a list of 1.5 ^ n
private static final List<Long> METRICS_MILLISECONDS_LATENCY_BUCKETS =
ImmutableList.of(
1L, 3L, 8L, 17L, 38L, 86L, 195L, 438L, 985L, 2217L, 4988L, 11223L, 25251L, 56815L,
Do we need to have 1L, 3L and 8L?
I have removed these. I now start at 0 (as the lowest bucket boundary) and end at 647160, which represents about 10 minutes.
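Putting the two threads together, a minimal sketch of wiring such boundaries into the latency histogram via the bucket-boundaries advice API; the exact values and meter name below are illustrative, following the every-other-power-of-1.5 list discussed above:

```java
import com.google.common.collect.ImmutableList;
import io.opentelemetry.api.metrics.LongHistogram;
import io.opentelemetry.api.metrics.Meter;
import java.util.List;

final class LatencyHistogram {
  // Power-of-1.5 boundaries, starting at 0 and capped near 10 minutes (~647160 ms).
  private static final List<Long> METRICS_MILLISECONDS_LATENCY_BUCKETS =
      ImmutableList.of(
          0L, 17L, 38L, 86L, 195L, 438L, 985L, 2217L, 4988L, 11223L, 25251L, 56815L, 127834L,
          287627L, 647160L);

  static LongHistogram create(Meter writeMeter) {
    return writeMeter
        .histogramBuilder("network_response_latency")
        .ofLongs()
        .setUnit("ms")
        .setDescription(
            "Reports time taken in milliseconds for a response to arrive once a message has"
                + " been sent over the network.")
        .setExplicitBucketBoundariesAdvice(METRICS_MILLISECONDS_LATENCY_BUCKETS)
        .build();
  }
}
```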
Also add network latency, queue length and error counts. The metrics (other than error counts) are now reported periodically, every second.
Force-pushed from 0572a47 to 8ac0ed2
…pis#2532) Also add network latency, queue length and error counts. The metrics (other than error counts) are now reported periodically, every second.