feat: add opentelemetry counters for sent and acked messages #2532

Merged
merged 1 commit into googleapis:main on Jul 24, 2024

Conversation

agrawal-siddharth
Contributor

Also add network latency, queue length and error counts.

The metrics (other than error counts) are now reported periodically, every second.

@agrawal-siddharth agrawal-siddharth requested a review from a team as a code owner June 18, 2024 02:23
@product-auto-label product-auto-label bot added size: m Pull request size is medium. api: bigquerystorage Issues related to the googleapis/java-bigquerystorage API. labels Jun 18, 2024
private Attributes telemetryAttributes;
private long incomingRequestCountBuffered;
Contributor

Nit: I would group all of these under telemetryMetrics.

Contributor Author

Done
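For context, a minimal sketch of what grouping the instruments under one holder could look like; the class and field names here are illustrative assumptions, not the PR's exact code (the merged change introduces a nested OpenTelemetryMetrics class, shown further down):

```java
// Illustrative sketch only; names are assumptions, not the PR's exact fields.
import io.opentelemetry.api.metrics.LongCounter;
import io.opentelemetry.api.metrics.LongHistogram;

final class TelemetryMetrics {
  LongCounter ackedRequestCount;        // e.g. append_requests_acked
  LongCounter ackedRequestBytes;        // e.g. append_request_bytes_acked
  LongHistogram networkResponseLatency; // e.g. network_response_latency
}
```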

"Reports time taken in milliseconds for a response to arrive once a message has been sent over the network.")
.setExplicitBucketBoundariesAdvice(METRICS_LATENCY_BUCKETS)
.build();
instrumentConnectionEstablishCount =
Contributor

I believe this can be derived from network_response_latency, if you put connection_id as the metrics field.

Contributor Author

I have added writer_id as an attribute. I still have this metric, however, as it directly provides information about establishing a connection.

measurement.record(length, getTelemetryAttributes());
});
writeMeter
.gaugeBuilder("inflight_queue_length")
Contributor

It will need connection_id as a metrics field; otherwise it doesn't make much sense.

Contributor Author

I have added writer_id as an attribute.
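A minimal sketch of attaching writer_id as an attribute, assuming the key name "writer_id" and the public Attributes API; this mirrors the discussion but is not the PR's exact code:

```java
import io.opentelemetry.api.common.AttributeKey;
import io.opentelemetry.api.common.Attributes;

static Attributes withWriterId(Attributes base, String writerId) {
  // Attach this writer's id so per-connection series can be told apart.
  AttributeKey<String> telemetryKeyWriterId = AttributeKey.stringKey("writer_id");
  return base.toBuilder().put(telemetryKeyWriterId, writerId).build();
}
// Each instrument then records against the returned attributes, e.g.
// latencyHistogram.record(latencyMillis, withWriterId(tableAttributes, writerId));
```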

.build();
instrumentSentRequestRows =
writeMeter
.counterBuilder("append_rows_sent")
Contributor

I think we should maintain fewer metrics. Can we just add a "result" field to append_requests/rows/bytes?

Contributor Author

Done. I have removed the following metrics: append_requests, append_request_bytes, append_rows, waiting_queue_length, connection_retry_count, append_requests_error, append_request_bytes_error, append_rows_error.

I now use the "error_code" attribute on each of the following metrics: append_requests_acked, append_request_bytes_acked, append_rows_acked, connection_end_count.
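As a rough sketch of the "result as an attribute" approach described above (the helper name and the "OK" value are assumptions for illustration, not the PR's code):

```java
import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.api.metrics.LongCounter;

static void recordAckedRequest(
    LongCounter appendRequestsAcked, Attributes baseAttributes, String errorCode) {
  // errorCode might be "OK" on success or the storage/gRPC error code on failure.
  Attributes attributes =
      baseAttributes.toBuilder().put("error_code", errorCode).build();
  appendRequestsAcked.add(1, attributes);
}
```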

@yirutang yirutang requested a review from GaoleMeng June 21, 2024 00:13
private LongCounter instrumentErrorRequestCount;
private LongCounter instrumentErrorRequestSize;
private LongCounter instrumentErrorRequestRows;
private static final List<Long> METRICS_LATENCY_BUCKETS =
Contributor

are these millis/micros/nanos?

Contributor Author

Renamed this to METRICS_MILLISECONDS_LATENCY_BUCKETS.

@VisibleForTesting
Attributes getTelemetryAttributes() {
return telemetryAttributes;
}

private void periodicallyReportOpenTelemetryMetrics() {
Duration durationSinceLastRefresh = Duration.between(instantLastSentMetrics, Instant.now());
if (durationSinceLastRefresh.compareTo(METRICS_UPDATE_INTERVAL) > 0) {
Contributor

Are metric updates really so costly on the producer side that you can't just update them at the time of the event?

In OpenCensus, flushing/updates were mostly an exporter concern.

Contributor Author

I am testing using an exporter to Google Cloud Monitoring. I encountered "exceeded max frequency" errors with this exporter. To resolve this issue, I have switched to updating the instruments only once every second, which I believe should be sufficient for our needs.

Contributor Author

Upon further inspection, I narrowed down the issue I was seeing to the frequency of the exporter. I have restored all metrics to be instrumented in real time.
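For readers hitting the same "exceeded max frequency" symptom: the export cadence is normally configured on the metric reader rather than by throttling instrument updates. A sketch under that assumption (the 60-second interval and method name are illustrative, not the PR's configuration):

```java
import io.opentelemetry.sdk.metrics.SdkMeterProvider;
import io.opentelemetry.sdk.metrics.export.MetricExporter;
import io.opentelemetry.sdk.metrics.export.PeriodicMetricReader;
import java.time.Duration;

static SdkMeterProvider buildMeterProvider(MetricExporter cloudMonitoringExporter) {
  return SdkMeterProvider.builder()
      .registerMetricReader(
          PeriodicMetricReader.builder(cloudMonitoringExporter)
              .setInterval(Duration.ofSeconds(60)) // export once a minute
              .build())
      .build();
}
```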

@agrawal-siddharth agrawal-siddharth force-pushed the openMetrics2 branch 2 times, most recently from 3caa290 to b6b30cf on July 19, 2024 15:36
@product-auto-label product-auto-label bot added size: l Pull request size is large. and removed size: m Pull request size is medium. labels Jul 19, 2024
@product-auto-label product-auto-label bot added size: m Pull request size is medium. and removed size: l Pull request size is large. labels Jul 19, 2024
@agrawal-siddharth agrawal-siddharth force-pushed the openMetrics2 branch 3 times, most recently from 0b87d97 to ad164fb on July 22, 2024 16:58
private LongCounter instrumentIncomingRequestSize;
private LongCounter instrumentIncomingRequestRows;
private static final List<Long> METRICS_MILLISECONDS_LATENCY_BUCKETS =
ImmutableList.of(0L, 50L, 100L, 500L, 1000L, 5000L, 10000L, 20000L, 30000L, 60000L, 120000L);
Contributor

@GaoleMeng Do these buckets look good to you? Do we need a bucket at 50L? Maybe add a 2000L?

Contributor

This is too sparse. In the backend we are using power-of-1.5 buckets, i.e. 1, 1.5, 1.5^2, 1.5^3, ... milliseconds.

We were once using powers of 4, but found that was too sparse, so we reduced it to powers of 1.5. Could we do similar bucketing here?

Contributor Author

The power of 1.5 sequence looks like this:

1, 2, 3, 5, 8, 11, 17, 26, 38, 58, 86, 130, 195, 292, 438, 657, 985, 1478, 2217, 3325, 4988, 7482, 11223, 16834, 25251, 37877, 56815, 85223, 127834, 191751, 287627, 431440, 647160, 970740, 1456110

Would it be useful to provide all of these buckets? Alternatively, we could just provide every other bucket, so the list looks like this:

1, 3, 8, 17, 38, 86, 195, 438, 985, 2217, 4988, 11223, 25251, 56815, 127834, 287627, 647160, 1456110
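For reference, a small generator that reproduces the power-of-1.5 boundaries quoted above (rounded to whole milliseconds, duplicates dropped); this is illustrative and not part of the PR:

```java
import java.util.ArrayList;
import java.util.List;

static List<Long> powerOf15Buckets(long maxMillis) {
  List<Long> buckets = new ArrayList<>();
  double boundary = 1.0;
  while (boundary <= maxMillis) {
    long rounded = Math.round(boundary);
    if (buckets.isEmpty() || rounded > buckets.get(buckets.size() - 1)) {
      buckets.add(rounded); // skip duplicates such as round(1.5^2) == round(1.5^1)
    }
    boundary *= 1.5;
  }
  return buckets;
}
// powerOf15Buckets(1_500_000) yields 1, 2, 3, 5, 8, 11, 17, ..., 1456110 as above.
```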

@@ -333,6 +351,7 @@ private Attributes buildOpenTelemetryAttributes() {
if (!tableName.isEmpty()) {
builder.put(telemetryKeyTableId, tableName);
}
builder.put(telemetryKeyWriterId, writerId);
Contributor

Add a comment to buildOpenTelemetryAttributes describing what kind of attributes it builds and whether they apply to all metrics.

Contributor Author

Added
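A hedged sketch of what the documented helper could look like, based on the diff hunk above; the Javadoc wording and surrounding declarations are assumptions, and the method relies on the writer's existing fields (tableName, writerId, and the attribute keys):

```java
import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.api.common.AttributesBuilder;

/**
 * Builds the base attribute set (table_id when known, plus writer_id) that is
 * attached to every metric recorded by this writer.
 */
private Attributes buildOpenTelemetryAttributes() {
  AttributesBuilder builder = Attributes.builder();
  if (!tableName.isEmpty()) {
    builder.put(telemetryKeyTableId, tableName);
  }
  builder.put(telemetryKeyWriterId, writerId);
  return builder.build();
}
```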

ImmutableList.of(0L, 50L, 100L, 500L, 1000L, 5000L, 10000L, 20000L, 30000L, 60000L, 120000L);

private static final class OpenTelemetryMetrics {
private LongCounter instrumentSentRequestCount;
Contributor

Discussed with Gaole; we think the Sent and Ack counts may not differ significantly. Let's just record Ack for now, for simplicity.

Contributor Author

Done

writeMeter
.counterBuilder("append_requests")
.setDescription("Counts number of incoming requests")
.counterBuilder("append_requests_acked")
Contributor

This can be a TODO: I am wondering if it is possible to add a retry attribute to the metric.

Contributor Author

Done

Contributor

@yirutang yirutang left a comment

LGTM, please address the bucket length issue.

// Buckets are based on a list of 1.5 ^ n
private static final List<Long> METRICS_MILLISECONDS_LATENCY_BUCKETS =
ImmutableList.of(
1L, 3L, 8L, 17L, 38L, 86L, 195L, 438L, 985L, 2217L, 4988L, 11223L, 25251L, 56815L,
Contributor

Do we need to have 1L, 3L and 8L?

Contributor Author

I have removed these. I now start at 0 (as the lowest bucket boundary) and end at 647160, which represents about 10 minutes.
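A sketch of how the finalized boundaries plug into the latency histogram; the constant below follows the description (drop 1/3/8, start at 0, end at 647160), but the exact merged list and metric wiring may differ:

```java
import com.google.common.collect.ImmutableList;
import io.opentelemetry.api.metrics.LongHistogram;
import io.opentelemetry.api.metrics.Meter;
import java.util.List;

// Assumed final boundaries: 0 plus the power-of-1.5 series up to ~10 minutes.
static final List<Long> METRICS_MILLISECONDS_LATENCY_BUCKETS =
    ImmutableList.of(
        0L, 17L, 38L, 86L, 195L, 438L, 985L, 2217L, 4988L, 11223L, 25251L, 56815L,
        127834L, 287627L, 647160L);

static LongHistogram buildNetworkLatencyHistogram(Meter writeMeter) {
  return writeMeter
      .histogramBuilder("network_response_latency")
      .ofLongs()
      .setExplicitBucketBoundariesAdvice(METRICS_MILLISECONDS_LATENCY_BUCKETS)
      .build();
}
```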

Also add network latency, queue length and error counts.

The metrics (other than error counts) are now reported periodically,
every second.
@agrawal-siddharth agrawal-siddharth merged commit 2fc5c55 into googleapis:main Jul 24, 2024
19 checks passed
yifatgortler pushed a commit to yifatgortler/java-bigquerystorage that referenced this pull request Jul 25, 2024
…pis#2532)

Also add network latency, queue length and error counts.

The metrics (other than error counts) are now reported periodically,
every second.
@agrawal-siddharth agrawal-siddharth deleted the openMetrics2 branch August 14, 2024 16:34