Skip to content
This repository has been archived by the owner on Dec 6, 2024. It is now read-only.

Remote configuration for the SDK proposal #121

Merged
merged 31 commits into from
Jul 22, 2020
Merged
Changes from 16 commits
Commits
Show all changes
31 commits
Select commit Hold shift + click to select a range
ed7877e
Otep for dynamic configuration
churanc Jun 10, 2020
63cbe99
Revisions
churanc Jun 11, 2020
51a6f1d
Further explanation
churanc Jun 11, 2020
27d3b47
Fix link
churanc Jun 11, 2020
9901e1e
Revise proposal
churanchen Jun 11, 2020
dd2a31e
Revisions
churanc Jun 12, 2020
5ec717f
Make experimental
churanc Jun 17, 2020
83c9b96
Slight revisions
churanc Jun 17, 2020
baeb1cd
Revisions
churanc Jun 19, 2020
8b4e615
Phase 1
churanc Jun 19, 2020
4a03244
Revisions
churanc Jun 24, 2020
c2e6ac1
Change filename
churanc Jun 25, 2020
72569b2
Fix lint errors
churanc Jun 25, 2020
f2b8021
Update text/0121-config-service.md
churanchen Jul 6, 2020
440f411
Revisions
churanc Jul 6, 2020
613b463
Fix lint errors
churanc Jul 6, 2020
8b093eb
Add metadata explanation
churanc Jul 7, 2020
911aa42
Update text/0121-config-service.md
churanchen Jul 14, 2020
265debc
Update text/0121-config-service.md
churanchen Jul 14, 2020
f9d24e7
Update text/0121-config-service.md
churanchen Jul 14, 2020
181ec16
Update text/0121-config-service.md
churanchen Jul 14, 2020
2b976c9
Change period from enum to int
churanc Jul 14, 2020
bf284b7
Specified service for metrics, removed tracing
churanc Jul 15, 2020
1ecf1c9
Switch to MetricConfig
churanc Jul 15, 2020
e310cd8
Clarify period description
churanc Jul 15, 2020
45195d8
Merge branch 'master' into remote-configuration
SergeyKanzhelev Jul 15, 2020
31ef3ea
Remove metadata field
churanc Jul 16, 2020
3cb8013
Merge branch 'remote-configuration' of https://github.com/vmingchen/o…
churanc Jul 16, 2020
b4c8684
Merge branch 'master' into remote-configuration
bogdandrutu Jul 22, 2020
142a718
Move to experimental folder
churanc Jul 22, 2020
1a0dbb3
Merge branch 'remote-configuration' of https://github.com/vmingchen/o…
churanc Jul 22, 2020
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
165 changes: 165 additions & 0 deletions text/0121-config-service.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,165 @@
# A Dynamic Configuration Service for the SDK
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider to move this to experimental/... this way we know that it is experimental.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done! @SergeyKanzhelev, would it be a good idea to move the z-pages otep there as well?


This proposal is a request to develop a prototype to configure metric collection periods. Per-metric and tracing configuration is also intended to be added, with details left for a later iteration.

It is related to [this pull request](https://github.com/open-telemetry/opentelemetry-proto/pull/155)

## Motivation

During normal use, users may wish to collect metrics every 10 minutes. Later, while investigating a production issue, the same user could easily increase information available for debugging by reconfiguring some of their processes to collect metrics every 30 seconds. Because this change is centralized and does not require redeploying with new configurations, there is lower friction and risk in updating the configurations.

## Explanation

This OTEP is a request for an experimental feature [open-telemetry/opentelemetry-specification#62](https://github.com/open-telemetry/opentelemetry-specification/pull/632). It is intended to seek approval to develop a proof of concept. This means no development will be done inside either the OpenTelemetry SDK or the collector. Since this will be implemented in [opentelemetry-go-contrib](https://github.com/open-telemetry/opentelemetry-go-contrib) and [opentelemetry-collector-contrib](https://github.com/open-telemetry/opentelemetry-collector-contrib), all of this functionality will be optional.
churanchen marked this conversation as resolved.
Show resolved Hide resolved

The user, when instrumenting their application, can configure the SDK with the endpoint of their remote configuration service, the associated Resource and a default config we revert to if we fail to read from the configuration service.
churanchen marked this conversation as resolved.
Show resolved Hide resolved

The user must then set up the config service. This can be done through the collector, which can be set up to expose an arbitrary configuration service implementation. Depending on implementation, this allows the collector to either act as a stand-alone configuration service, or as a bridge to remote configurations of the user's monitoring and tracing backend by 'translating' the monitoring backend's protocol to comply with the OpenTelemetry configuration protocol.

## Internal details

In the future, it is intended to add per-metric configuration. For example, this would allow the user to collect 5xx server error counts ever minute, and CPU usage statistics every 10 minutes. The remote configuration protocol was designed with this in mind, meaning that it includes more details than simply the metric collection period.
churanchen marked this conversation as resolved.
Show resolved Hide resolved

Our remote configuration protocol will support this call:

```
service DynamicConfig {
rpc GetConfig (ConfigRequest) returns (ConfigResponse);
churanchen marked this conversation as resolved.
Show resolved Hide resolved
}
```

A request to the config service will look like this:

```
message ConfigRequest{

// Required. The resource for which configuration should be returned.
opentelemetry.proto.resource.v1.Resource resource = 1;

// Optional. The value of ConfigResponse.fingerprint for the last configuration
// a resource received that was successfully applied.
churanchen marked this conversation as resolved.
Show resolved Hide resolved
bytes last_known_fingerprint = 2;
}
```

While the response will look like this:

```
message ConfigResponse {

// Optional. The fingerprint associated with this ConfigResponse. Each change
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should response fields be optional?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Our intention for the fingerprint field was to have an identifier for configs that the user can set and use if they wish, similar to the metadata field. The idea being that there might be some fields that are nice and useful to include, but that we shouldn't force the end users to have in their configs.

// in configs yields a different fingerprint.
bytes fingerprint = 1;

// Dynamic configs specific to metrics
message MetricConfig {


// A Schedule is used to apply a particular scheduling configuration to
// a metric. If a metric name matches a schedule's patterns, then the metric
// adopts the configuration specified by the schedule.

message Schedule {

// A light-weight pattern that can match 1 or more
// metrics, for which this schedule will apply. The string is used to
// match against metric names. It should not exceed 100k characters.
message Pattern {
oneof match {
string equals = 1; // matches the metric name exactly
string starts_with = 2; // prefix-matches the metric name
}
}

// Metrics with names that match at least one rule in the inclusion_patterns are
// targeted by this schedule. Metrics that match at least one rule from the
// exclusion_patterns are not targeted for this schedule, even if they match an
// inclusion pattern.

// For this iteration, since we only want one Schedule that applies to all metrics,
// we will not check the inclusion_patterns and exclusion_patterns.
repeated Pattern inclusion_patterns = 1;
repeated Pattern exclusion_patterns = 2;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nits:

  1. because exclusion patterns have higher priority per the comment above, I would move it to the first position.
  2. I am not sure this grammar is expressive enough. Is there any prior art for this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Changed!
  2. The choice to use exact match and starts with was made because they are more efficient to apply and clearer to read than more expressive alternatives, and have in practice been the only required pattern types for specifying metric names in Google's internal metric collection configuration protocol. However we are open to also adding a more expressive grammar such as regex or glob as a third option for a pattern type, so that monitoring backends can optionally choose to support a more expressive grammar without affecting efficiency for the simpler pattern types.


// CollectionPeriod describes the sampling period for each metric. All
// larger units are divisible by all smaller ones.
enum CollectionPeriod {
NONE = 0; // For non-periodic data (client sends points whenever)
SEC_1 = 1;
SEC_5 = 5;
SEC_10 = 10;
SEC_30 = 30;
MIN_1 = 60;
MIN_5 = 300;
MIN_10 = 600;
MIN_30 = 1800;
HR_1 = 3600;
HR_2 = 7200;
HR_4 = 14400;
HR_12 = 43200;
DAY_1 = 86400;
DAY_7 = 604800;
}
CollectionPeriod period = 3;
churanchen marked this conversation as resolved.
Show resolved Hide resolved

// Optional. Additional opaque metadata associated with the schedule.
// Interpreting metadata is implementation specific.
bytes metadata = 4;
churanchen marked this conversation as resolved.
Show resolved Hide resolved
}

// For this iteration, since we only want one Schedule that applies to all metrics,
// we will have a restriction that schedules must have a length of 1, and we will
// not check the patterns when we apply the collection period.
repeated Schedule schedules = 1;

}
MetricConfig metric_config = 2;

// Dynamic configs specific to trace, like the sampling rate of a resource.
message TraceConfig {
// TODO: unimplemented
// This proposal focuses on the metrics portion, but tracing is a natural
// candidate for something that might be useful to configure in the future,
// so this field will be left here for that eventuality.
}
TraceConfig trace_config = 3;

// Optional. The client is suggested to wait this long (in seconds) before
// pinging the configuration service again.
int32 suggested_wait_time_sec = 4;
}
```

The SDK will periodically read a config from the service using GetConfig. This reading interval can depend on If it fails to do so, it will just use either the default config or the most recent successfully read config. If it reads a new config, it will apply it.

Export frequency from the SDK depends on Schedules. There can only be one Schedule for now, which defines the schedule for all metrics. The schedule has a CollectionPeriod, which defines how often metrics are exported.

In the future, we will add per-metric configuration. Each Schedule also has inclusion_patterns and exclusion_patterns. Any metrics that match any of the inclusion_patterns and do not match any of the exclusion_patterns will be exported every CollectionPeriod (e.g. every minute). A component will be added that can export metrics that match a pattern from one Schedule at that Schedule's collection period, while exporting other metrics that match the patterns of another Schedule at that other Schedule's collection period.

The collector will support a new interface for a DynamicConfig service that can be used by an SDK, allowing a custom implementation of the configuration service protocol described above, to act as an optional bridge between an SDK and an arbitrary configuration service. This interface can be implemented as a shim to support accessing remote configurations from arbitrary backends. The collector is configured to expose an endpoint for requests to the DynamicConfig service, and returns results on that endpoint.
churanchen marked this conversation as resolved.
Show resolved Hide resolved

## Trade-offs and mitigations

This feature will be implemented purely as an experiment, to demonstrate its viability and usefulness. More investigation can be done after a rough prototype is demonstrated.

As mentioned [here](https://github.com/open-telemetry/opentelemetry-proto/pull/155#issuecomment-640582048), the configuration service can be a potential attack vector for an application instrumented with OpenTelemetry, depending on what we allow in the protocol. We can highlight in the remote configuration protocol that for future changes, caution is needed in terms of the sorts of configurations we allow.

Having a small polling interval (how often we read configs) would mean that config changes could be applied very quickly. However, this would also increase load on the configuration service. The typical use case probably does not need config changes to be applied immediately and config changes will likely be quite infrequent, so a typical polling interval probably needs to be no more frequent than every several minutes.

## Prior art and alternatives
churanchen marked this conversation as resolved.
Show resolved Hide resolved

Jaegar has the option of a Remote sampler, which allows reading from a central configuration, even dynamically with an Adaptive sampler.

The main comparative for remote configuration is a push vs. polling mechanism. The benefits of having a mechanism where the configuration service pushes new configs is that it's less work for the user, with it being not necessary for them to set up a configuration service. There is also no load associated with polling the configuration service in the instrumented application, which would keep the OpenTelemetry SDK more lightweight.

Using a polling mechanism may be more performant in the context of large distributed applications with many instrumented processes. This is a result of having the instrumented processes polling the configuration service, rather than the config service having to push to processes. A polling mechanism is also compatible with more network protocols, not just gRPC.

## Open questions

- As mentioned [here](https://github.com/open-telemetry/opentelemetry-proto/pull/155#issuecomment-640582048). what happens if a malicious/accidental config change overwhelms the application/monitoring system? Is it the responsibility of the user to be cautious while making config changes? Should we automatically decrease telemetry exporting if we can detect performance problems?

## Future possibilities
churanchen marked this conversation as resolved.
Show resolved Hide resolved

If this OTEP is implemented, there is the option to remotely and dynamically configure other things. As mentioned [here](https://github.com/open-telemetry/opentelemetry-proto/pull/155#issuecomment-639878490), possibilities include labels and aggregations. As mentioned [here](https://github.com/open-telemetry/oteps/pull/121#discussion_r447839301), it is also possible to configure the collector.

It is intended to add per-metric configuration and well as tracing configuration in the future.