
Metrics: Add ReadMe with tenets, metrics list to collect, design doc. #1304

Merged: 7 commits into master from varunkn/metrics-docs on Jan 15, 2020

Conversation

varunnvs92 (Contributor): No description provided.

@varunnvs92 varunnvs92 requested a review from spfink June 24, 2019 17:19
@varunnvs92 varunnvs92 changed the base branch from sdk-metrics-development to master June 24, 2019 18:13
@varunnvs92 varunnvs92 changed the title Add ReadMe with tenets, metrics list to collect, design doc. Metrics: Add ReadMe with tenets, metrics list to collect, design doc. Jun 25, 2019
@varunnvs92 varunnvs92 mentioned this pull request Jun 26, 2019
@varunnvs92 varunnvs92 force-pushed the varunkn/metrics-docs branch 2 times, most recently from 7c0bab1 to 5ecf61b Compare July 1, 2019 04:26
@varunnvs92 varunnvs92 force-pushed the varunkn/metrics-docs branch from 5ecf61b to 776d70c Compare July 1, 2019 04:28
@codecov-io commented Jul 1, 2019

Codecov Report

Merging #1304 into master will decrease coverage by <.01%.
The diff coverage is 100%.


```diff
@@             Coverage Diff              @@
##             master    #1304      +/-   ##
============================================
- Coverage     75.58%   75.58%   -0.01%
  Complexity      638      638
============================================
  Files           898      898
  Lines         28136    28140       +4
  Branches       2221     2221
============================================
+ Hits          21268    21269       +1
- Misses         5848     5850       +2
- Partials       1020     1021       +1
```
| Flag | Coverage Δ | Complexity Δ |
|------|------------|--------------|
| #unittests | 75.58% <100%> (-0.01%) | 638 <0> (ø) |

| Impacted Files | Coverage Δ | Complexity Δ |
|----------------|------------|--------------|
| ...n/awssdk/auth/signer/internal/Aws4SignerUtils.java | 100% <100%> (ø) | 0 <0> (ø) |
| ...igner/internal/BaseEventStreamAsyncAws4Signer.java | 96.55% <100%> (+0.18%) | 0 <0> (ø) |
| ...nio/netty/internal/OldConnectionReaperHandler.java | 81.81% <0%> (-9.1%) | 0% <0%> (ø) |


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0131c08...bb94adc.

## Concepts
### Metric
* A representation of data collected
* Metric can be one of the following types: Counter, Gauge, Timer
Contributor: Note, Counter can be ambiguous as there are at least two pretty common definitions: a monotonically increasing count of events, and an up/down value that can be both incremented and decremented.
Contributor: Reading these over. Will come back with thoughts.

| Metric | Type | Description |
|--------|------|-------------|
| Api | ConstantGauge | The name of the AWS API the request is made to |
| StreamingRequest | ConstantGauge | True if the request has streaming payload |
| StreamingResponse | ConstantGauge | True if the response has streaming payload |
| ApiCallStartTime | Timer | The start time of the request |
Contributor: This seems to be a timestamp rather than a timer.

Contributor: +1 I like timestamp over timer. Will update.

| Metric | Type | Description |
|--------|------|-------------|
| ApiCallEndTime | Timer | The end time of the request |
| ApiCallLatency | Timer | The total time taken to finish a request (inclusive of all retries), ApiCallEndTime - ApiCallStartTime |
| MarshallingLatency | Timer | The time taken to marshall the request |
| ApiCallAttemptCount | Counter | Total number of attempts that were made by the service client to fulfill this request before succeeding or failing. (Value is 1 if there are no retries) |
Contributor: Is there any way to know the maximum number of attempts for a given call?

Contributor: Not sure I follow the ask here. Is this not what this metric is?

Contributor: I mean the configured maximum number of attempts before it will give up. This metric shows the number of attempts that were actually used. Having both allows us to see how close a call is to failing outright because it will exhaust all attempts.

Contributor: If I don't care about the individual attempts but want to log whether the overall call succeeded from the client's perspective, I will need something like the last HTTP status code at the ApiCall level. Same for the exception. Thoughts?

# Project Details

1. Metrics are disabled by default and should be enabled explicitly by customers. Enabling metrics will introduce small overhead.
2. Metrics can be enabled quickly during large scale events with need for code change or deployments.
Contributor: without?

Meters define the way a metric is measured. Here is the list of meters:

**Counter :** Number of times a metric is reported. These kind of metrics can be incremented or decremented.
For example: number of requests made since the start of application
Contributor: This definition seems a bit odd.

> Number of times a metric is reported.

Should this be something like the number of times an event has occurred?

> These kind of metrics can be incremented or decremented.

Once an event has occurred you cannot take it back, so decrementing doesn't make sense. For example, with the number of requests made since the start of the application, allowing a decrement would give an incorrect result.

See other comment about different uses of the term Counter between systems. This definition seems to mix both.
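
For concreteness, here is a minimal Java sketch of the two Counter interpretations being contrasted in this thread; the classes are illustrative stand-ins, not SDK types:

```java
// Sketch contrasting the two common "Counter" semantics discussed above;
// these classes are illustrative, not part of the SDK.
public final class CounterSemantics {

    /** A monotonic counter: counts events, can only go up. */
    static final class MonotonicCounter {
        private long count;
        void increment() { count++; }
        long count() { return count; }
    }

    /** An up/down counter: tracks a level, e.g. in-flight requests. */
    static final class UpDownCounter {
        private long value;
        void increment() { value++; }
        void decrement() { value--; }
        long value() { return value; }
    }

    public static void main(String[] args) {
        MonotonicCounter requestsMade = new MonotonicCounter();
        requestsMade.increment(); // one request since application start; never decremented

        UpDownCounter inFlight = new UpDownCounter();
        inFlight.increment();     // request started
        inFlight.decrement();     // request finished
        System.out.println(requestsMade.count() + " " + inFlight.value());
    }
}
```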

**Timer :** Records the time between start of an event and end of an event. An example is the time taken (latency) to complete a request.
Contributor: In the table of metrics, the timestamps like start and end time also seem to be labeled as timers. Should those be gauges?
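
To illustrate the distinction raised here, a minimal sketch (not the SDK's API): ApiCallStartTime and ApiCallEndTime are single timestamps, while a timer measures the elapsed time between them.

```java
import java.time.Instant;

// Minimal, illustrative sketch: a timestamp is one point in time, while a
// timer records the duration between a start and an end event.
public final class TimerVsTimestamp {
    public static void main(String[] args) throws InterruptedException {
        Instant apiCallStartTime = Instant.now(); // a timestamp, not a duration
        long start = System.nanoTime();

        Thread.sleep(25); // stand-in for the API call

        long apiCallLatencyNanos = System.nanoTime() - start; // a timer: elapsed time
        Instant apiCallEndTime = Instant.now();               // another timestamp

        System.out.printf("start=%s end=%s latency=%dms%n",
                apiCallStartTime, apiCallEndTime, apiCallLatencyNanos / 1_000_000);
    }
}
```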

* By default, SDK creates and uses only CloudWatch publisher with default options (Default credential chain
* and region chain).
* To use CloudWatch publisher with custom options or any other publishers, create a
* #PublisherConfiguration object and set it in the ClientOverrideConfiguration on the client.
Contributor: If I have an application that uses various libraries that in turn use the SDK, is there a way to configure metrics consistently for all uses of the SDK without needing each library to explicitly do the right thing or configure it?

We currently do it using an ExecutionInterceptor, which can be loaded automatically if it is on the classpath. This has been handy for getting consistent instrumentation for all uses in an app without needing to worry about how the client was set up in the various libraries that are pulled in.

Contributor: Yes, you can set environment variables or system properties that globally enable metrics.
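
A minimal sketch of that global-toggle idea, assuming a system property is consulted before an environment variable; the property and variable names here are hypothetical, not the SDK's:

```java
// Sketch of the global-enable idea in the reply above: metrics are off by
// default and flipped on by a system property or environment variable, with
// no code changes in the libraries that build the clients. The property and
// variable names below are hypothetical, not the SDK's final ones.
public final class MetricsToggle {
    private static final String PROPERTY = "aws.javasdk2x.metrics.enabled"; // hypothetical
    private static final String ENV_VAR  = "AWS_JAVA_SDK_METRICS_ENABLED";  // hypothetical

    static boolean metricsEnabled() {
        String fromProperty = System.getProperty(PROPERTY);
        if (fromProperty != null) {
            return Boolean.parseBoolean(fromProperty); // property wins if set
        }
        return Boolean.parseBoolean(System.getenv(ENV_VAR)); // null parses as false
    }

    public static void main(String[] args) {
        System.out.println("metrics enabled: " + metricsEnabled());
    }
}
```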

* A representation of data collected
* Metric is uniquely represented by its name
* Metric can be one of the following types: Constant, Counter, Gauge, Timer
* Metric can have tags. A Tag represent the category it belongs to (like Default, HttpClient, Streaming etc)
Contributor: Tag -> tag
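
As a rough illustration of the Metric concept in this excerpt (name as identity, a type, and category tags), using illustrative stand-in types; the ConnectionPoolSize name is made up for the example:

```java
import java.util.Set;

// Illustrative stand-ins for the Metric concept quoted above, not the SDK's
// final types: a metric is uniquely identified by its name and carries a
// type plus category tags.
public final class MetricConceptSketch {

    enum MetricType { CONSTANT, COUNTER, GAUGE, TIMER }

    /** Name identifies the metric; tags give its category (Default, HttpClient, ...). */
    record Metric(String name, MetricType type, Set<String> tags) { }

    public static void main(String[] args) {
        Metric apiCallLatency =
                new Metric("ApiCallLatency", MetricType.TIMER, Set.of("Default"));
        Metric connectionPoolSize = // hypothetical metric name, for illustration only
                new Metric("ConnectionPoolSize", MetricType.GAUGE, Set.of("HttpClient"));
        System.out.println(apiCallLatency + " " + connectionPoolSize);
    }
}
```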

* SDK provides implementations to publish metrics to services like [Amazon CloudWatch](https://aws.amazon.com/cloudwatch/), [Client Side Monitoring](https://docs.aws.amazon.com/sdk-for-java/v1/developer-guide/sdk-metrics.html) (also known as AWS SDK Metrics for Enterprise Support)
* Customers can implement the interface and register the custom implementation to publish metrics to a platform not supported in the SDK.
* MetricPublishers can have different behaviors in terms of list of metrics to publish, publishing frequency, configuration needed to publish etc.
* Metrics can be explicitly published to the platform by calling publish() method. This can be useful in scenarios when the application fails
Contributor: or the application is short-lived
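
A minimal sketch of the custom-publisher idea from this excerpt: the interface below is a simplified, hypothetical stand-in for the design doc's MetricPublisher (which takes a MetricRegistry), with an explicit publish() for the failure and short-lived-application cases.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of a publisher for a platform the SDK doesn't support out of the
// box. The interface is a simplified, hypothetical stand-in for the design
// doc's MetricPublisher, not the final API.
public final class LoggingMetricPublisherSketch {

    interface MetricPublisher {
        void registerMetric(String name, Object value); // simplified stand-in
        void publish();                                 // explicit, on-demand flush
    }

    /** Buffers metrics and writes them to stdout when publish() is called. */
    static final class LoggingMetricPublisher implements MetricPublisher {
        private final Map<String, Object> buffered = new LinkedHashMap<>();

        @Override public void registerMetric(String name, Object value) {
            buffered.put(name, value);
        }

        @Override public void publish() {
            buffered.forEach((name, value) -> System.out.println(name + " = " + value));
            buffered.clear();
        }
    }

    public static void main(String[] args) {
        MetricPublisher publisher = new LoggingMetricPublisher();
        publisher.registerMetric("ApiCallAttemptCount", 1);
        publisher.registerMetric("ApiCallLatency", "120 ms");
        publisher.publish(); // e.g. just before a short-lived application exits
    }
}
```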




@varunnvs92 varunnvs92 force-pushed the varunkn/metrics-docs branch from ac9a527 to 1db31f2 Compare August 19, 2019 07:06
@millems millems mentioned this pull request Sep 13, 2019
@swaranga left a comment
This design does not account for flexible request level metrics logging. I have added some details as to why that is very much desirable and request it to be taken into consideration.



* Reporting is transferring the collected metrics to Publishers.
* To report metrics to a publisher, call the registerMetrics(MetricRegistry) method on the MetricPublisher.
* There is no requirement for Publisher to publish the reported metrics immediately after calling this method.

Contributor: Will there be a way to publish metric reports separately for each request? This is so that I can tie the metrics logging to my parent unit of work.

This is useful, for instance, when I am serving a request for my callers and, as part of that, need to call an AWS API: I would like to log metrics for that particular call along with the other metrics for the incoming request to my service. For this to work, I would need to be able to attach a publisher at the request level. This publisher would contain other custom metrics and, at the end of my incoming request processing, would flush the collected metrics to my custom destination (a log file, or some destination over HTTP).

Also, if a publisher is configured at the request level and another one is configured at the SDK level, only one should be invoked, with the request-level publisher taking precedence.

I had mentioned more details in #23 (comment)
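
A rough sketch of the request-level flow being asked for here, using hypothetical local stand-ins for the design doc's MetricRegistry and MetricPublisher types: metrics reported via registerMetrics() are buffered and flushed when the caller's unit of work completes, and a request-level publisher, when present, wins over the SDK-level one.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

// Rough sketch of request-scoped reporting. MetricRegistry and
// MetricPublisher are hypothetical stand-ins for the design doc's types;
// the precedence rule mirrors the comment above.
public final class RequestLevelReportingSketch {

    record MetricRegistry(String apiName, long latencyMillis) { }

    interface MetricPublisher {
        void registerMetrics(MetricRegistry registry); // may buffer; need not publish immediately
        void publish();                                // explicit flush
    }

    static final class BufferingPublisher implements MetricPublisher {
        private final String destination;
        private final List<MetricRegistry> buffered = new ArrayList<>();

        BufferingPublisher(String destination) { this.destination = destination; }

        @Override public void registerMetrics(MetricRegistry registry) { buffered.add(registry); }

        @Override public void publish() {
            buffered.forEach(r -> System.out.println(
                    destination + ": " + r.apiName() + " took " + r.latencyMillis() + " ms"));
            buffered.clear();
        }
    }

    /** Request-level publisher, if present, takes precedence over the SDK-level one. */
    static MetricPublisher choosePublisher(Optional<MetricPublisher> requestLevel,
                                           MetricPublisher sdkLevel) {
        return requestLevel.orElse(sdkLevel);
    }

    public static void main(String[] args) {
        MetricPublisher sdkLevel = new BufferingPublisher("sdk-default");
        MetricPublisher requestLevel = new BufferingPublisher("request-log");

        MetricPublisher chosen = choosePublisher(Optional.of(requestLevel), sdkLevel);
        chosen.registerMetrics(new MetricRegistry("DynamoDb.GetItem", 12));
        chosen.publish(); // flushed at the end of the parent unit of work
    }
}
```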

@swaranga commented Oct 4, 2019

Any updates on this?

@dagnir (Contributor) commented Jan 14, 2020

Will be merging this as-is for now and will open PRs for further edits and amendments as we ramp up work on this feature again.

@sonarqubecloud: Kudos, SonarCloud Quality Gate passed!

- Bugs: A (0 bugs)
- Vulnerabilities: A (0 vulnerabilities, 0 security hotspots to review)
- Code smells: A (0 code smells)
- No coverage information
- No duplication information

@dagnir dagnir merged commit 13d63eb into master Jan 15, 2020
@zoewangg zoewangg deleted the varunkn/metrics-docs branch August 19, 2020 22:46