Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for directly generating span graphs #745

Merged
merged 22 commits into from
Apr 18, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
28 changes: 20 additions & 8 deletions docs/sources/configure/options.md
Original file line number Diff line number Diff line change
Expand Up @@ -199,7 +199,7 @@ This section allows specifying different selection criteria for different servic
as well as overriding some of their metadata, such as their reported name or
namespace.

For more details about this section, please go to the [discovery services section](#discovery-services-section)
For more details about this section, go to the [discovery services section](#discovery-services-section)
of this document.

| YAML | Environment variable | Type | Default |
Expand Down Expand Up @@ -447,8 +447,8 @@ a 'Traceparent' header value, it will use the provided 'trace id' to create its
This option does not have an effect on Go applications, where the 'Traceparent' field is always
processed, without additional tracking of the request headers.

Enabling this option may increase Beyla's performance overhead in high request volume scenarios.
Please note that this option is only useful when generating Beyla traces, it does not affect
Enabling this option may increase the performance overhead in high request volume scenarios.
This option is only useful when generating Beyla traces, it does not affect
generation of Beyla metrics.

## Configuration of metrics and traces attributes
Expand Down Expand Up @@ -522,7 +522,7 @@ attributes:
```

It is IMPORTANT to consider that enabling this feature requires a previous step of
providing some extra permissions to the Beyla Pod. Please check the
providing some extra permissions to the Beyla Pod. Consult the
["Configuring Kubernetes metadata decoration section" in the "Running Beyla in Kubernetes"]({{< relref "../setup/kubernetes.md" >}}) page.

| YAML | Environment variable | Type | Default |
Expand Down Expand Up @@ -643,7 +643,7 @@ Possible values for the `ignore_mode` property are:

Selectively ignoring only certain type of events might be useful in certain scenarios. For example, you may want to
know the performance metrics of your health check API, but you wouldn't want the overhead of those trace records in
your target traces database. In this this example scenario, you would set the `ignore_mode` property to `traces`, such
your target traces database. In this example scenario, you would set the `ignore_mode` property to `traces`, such
that only traces matching the `ignored_patterns` will be discarded, while metrics will still be recorded.

| YAML | Environment variable | Type | Default |
Expand Down Expand Up @@ -688,7 +688,7 @@ document/d/*/edit
## OTEL metrics exporter

> ℹ️ If you plan to use Beyla to send metrics to Grafana Cloud,
> please check the [Grafana Cloud OTEL exporter for metrics and traces](#using-the-grafana-cloud-otel-endpoint-to-ingest-metrics-and-traces)
> consult the [Grafana Cloud OTEL exporter for metrics and traces](#using-the-grafana-cloud-otel-endpoint-to-ingest-metrics-and-traces)
> section for easier configuration.

YAML section `otel_metrics_export`.
Expand Down Expand Up @@ -785,7 +785,13 @@ of Beyla: application-level metrics or network metrics.
process matching the entries in the `discovery` section.
- If the list contains `application_span`, the Beyla OpenTelemetry exporter exports application-level trace span metrics;
but only if there is defined an OpenTelemetry endpoint, and Beyla was able to discover any
process matching the entries in the `discovery` section.
process matching the entries in the `discovery` section.
- If the list contains `application_service_graph`, the Beyla OpenTelemetry exporter exports application-level service graph metrics;
but only if there is defined an OpenTelemetry endpoint, and Beyla was able to discover any
process matching the entries in the `discovery` section.
For best experience with generating service graph metrics, use a DNS for service discovery and make sure the DNS names match
the OpenTelemetry service names used in Beyla. In Kubernetes environments, the OpenTelemetry service name set by the service name
discovery is the best choice for service graph metrics.
- If the list contains `network`, the Beyla OpenTelemetry exporter exports network-level
metrics; but only if there is defined an OpenTelemetry endpoint and the
[network metrics are enabled]({{< relref "../network" >}}).
Expand Down Expand Up @@ -867,7 +873,7 @@ for more information.
## OTEL traces exporter
> ℹ️ If you plan to use Beyla to send metrics to Grafana Cloud,
> please check the [Grafana Cloud OTEL exporter for metrics and traces](#using-the-grafana-cloud-otel-endpoint-to-ingest-metrics-and-traces)
> consult the [Grafana Cloud OTEL exporter for metrics and traces](#using-the-grafana-cloud-otel-endpoint-to-ingest-metrics-and-traces)
> section for easier configuration.
YAML section `otel_traces_export`.
Expand Down Expand Up @@ -1096,6 +1102,12 @@ of Beyla: application-level metrics or network metrics.
- If the list contains `application_span`, the Beyla Prometheus exporter exports application-level metrics in traces span metrics format;
but only if the Prometheus `port` property is defined, and Beyla was able to discover any
process matching the entries in the `discovery` section.
- If the list contains `application_service_graph`, the Beyla Prometheus exporter exports application-level service graph metrics;
but only if the Prometheus `port` property is defined, and Beyla was able to discover any
process matching the entries in the `discovery` section.
For best experience with generating service graph metrics, use a DNS for service discovery and make sure the DNS names match
the OpenTelemetry service names used in Beyla. In Kubernetes environments, the OpenTelemetry service name set by the service name
discovery is the best choice for service graph metrics.
- If the list contains `network`, the Beyla Prometheus exporter exports network-level
metrics; but only if the Prometheus `port` property is defined and the
[network metrics are enabled]({{< relref "../network" >}}).
Expand Down
15 changes: 10 additions & 5 deletions pkg/beyla/config.go
Original file line number Diff line number Diff line change
Expand Up @@ -49,6 +49,10 @@ var DefaultConfig = Config{
Submit: []string{"traces"},
},
},
NameResolver: &transform.NameResolverConfig{
CacheLen: 1024,
CacheTTL: 5 * time.Minute,
},
Metrics: otel.MetricsConfig{
Protocol: otel.ProtocolUnset,
MetricsProtocol: otel.ProtocolUnset,
Expand Down Expand Up @@ -106,11 +110,12 @@ type Config struct {

Attributes Attributes `yaml:"attributes"`
// Routes is an optional node. If not set, data will be directly forwarded to exporters.
Routes *transform.RoutesConfig `yaml:"routes"`
Metrics otel.MetricsConfig `yaml:"otel_metrics_export"`
Traces otel.TracesConfig `yaml:"otel_traces_export"`
Prometheus prom.PrometheusConfig `yaml:"prometheus_export"`
Printer debug.PrintEnabled `yaml:"print_traces" env:"BEYLA_PRINT_TRACES"`
Routes *transform.RoutesConfig `yaml:"routes"`
NameResolver *transform.NameResolverConfig `yaml:"name_resolver"`
Metrics otel.MetricsConfig `yaml:"otel_metrics_export"`
Traces otel.TracesConfig `yaml:"otel_traces_export"`
Prometheus prom.PrometheusConfig `yaml:"prometheus_export"`
Printer debug.PrintEnabled `yaml:"print_traces" env:"BEYLA_PRINT_TRACES"`

// Exec allows selecting the instrumented executable whose complete path contains the Exec value.
Exec services.RegexpAttr `yaml:"executable_name" env:"BEYLA_EXECUTABLE_NAME"`
Expand Down
4 changes: 4 additions & 0 deletions pkg/beyla/config_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -149,6 +149,10 @@ network:
},
},
Routes: &transform.RoutesConfig{},
NameResolver: &transform.NameResolverConfig{
CacheLen: 1024,
CacheTTL: 5 * time.Minute,
},
}, cfg)
}

Expand Down
2 changes: 1 addition & 1 deletion pkg/beyla/network_cfg.go
Original file line number Diff line number Diff line change
Expand Up @@ -156,7 +156,7 @@ var defaultNetworkConfig = NetworkConfig{

func (nc *NetworkConfig) Validate(isKubeEnabled bool) error {
if len(nc.AllowedAttributes) == 0 {
return errors.New("you must define some attributes in the allowed_attributes section. Please ceck documentation")
return errors.New("you must define some attributes in the allowed_attributes section. Please check documentation")
}
if isKubeEnabled {
return nil
Expand Down
4 changes: 2 additions & 2 deletions pkg/internal/export/debug/debug.go
Original file line number Diff line number Diff line change
Expand Up @@ -38,8 +38,8 @@ func printFunc() (pipe.FinalFunc[[]request.Span], error) {
spans[i].Status,
spans[i].Method,
spans[i].Path,
spans[i].Peer,
spans[i].Host,
spans[i].Peer+" as "+spans[i].PeerName,
spans[i].Host+" as "+spans[i].HostName,
spans[i].HostPort,
spans[i].ContentLength,
&spans[i].ServiceID,
Expand Down
25 changes: 25 additions & 0 deletions pkg/internal/export/otel/common.go
Original file line number Diff line number Diff line change
Expand Up @@ -262,6 +262,11 @@ const (
SourceKey = attribute.Key("source")
ServiceKey = attribute.Key("service")
InstanceKey = attribute.Key("instance")
ClientKey = attribute.Key("client")
ClientNamespaceKey = attribute.Key("client_service_namespace")
ServerKey = attribute.Key("server")
ServerNamespaceKey = attribute.Key("server_service_namespace")
ConnectionTypeKey = attribute.Key("connection_type")
)

func HTTPRequestMethod(val string) attribute.KeyValue {
Expand Down Expand Up @@ -327,3 +332,23 @@ func StatusCodeMetric(val int) attribute.KeyValue {
func ServiceInstanceMetric(val string) attribute.KeyValue {
return InstanceKey.String(val)
}

func ClientMetric(val string) attribute.KeyValue {
return ClientKey.String(val)
}

func ClientNamespaceMetric(val string) attribute.KeyValue {
return ClientNamespaceKey.String(val)
}

func ServerMetric(val string) attribute.KeyValue {
return ServerKey.String(val)
}

func ServerNamespaceMetric(val string) attribute.KeyValue {
return ServerNamespaceKey.String(val)
}

func ConnectionTypeMetric(val string) attribute.KeyValue {
return ConnectionTypeKey.String(val)
}
106 changes: 105 additions & 1 deletion pkg/internal/export/otel/metrics.go
Original file line number Diff line number Diff line change
Expand Up @@ -12,6 +12,7 @@ import (

"github.com/mariomac/pipes/pipe"
"go.opentelemetry.io/otel/attribute"
"go.opentelemetry.io/otel/codes"
"go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetricgrpc"
"go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetrichttp"
instrument "go.opentelemetry.io/otel/metric"
Expand Down Expand Up @@ -41,6 +42,10 @@ const (
SpanMetricsCalls = "traces_spanmetrics_calls_total"
SpanMetricsSizes = "traces_spanmetrics_size_total"
TracesTargetInfo = "traces_target_info"

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From what I understood, Tempo generates traces_target_info when the OTel Collector Service Graph Connector generates target_info and we have a configuration flag to choose the one used (screenshot below).

@domasx2 do we have best practices here?

image

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it mean we may not need it? I simply went by what I saw Tempo generated for my cloud instance when using Alloy.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's needed for sure. I think what @cyrille-leclerc wondering is wether you should align with Tempo metric generator metric names, or OTEL spanmetrics/ service graph metrics connector metric names. I don't have a strong opinion TBH, though would lean towards OTEL

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh absolutely OTel then, just confirming, so if we renamed the collection as target_info App O11y will continue to work, correct? I'll create a PR to fix this.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that span metric names change as well for Otel, instead of traces_span_* it's duration_seconds_bucket, duration_seconds_count and duration_seconds_total

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as @domasx2 said, the question of OTel vs Tempo naming conventions. I would see value in adopting OTel conventions.
That being said, what's our recommendation if a customer is getting started combining Beyla with a bunch of app instrumented with OTel SDKs the simplest possible way, relying on Tempo metrics generation?
Sorry it's more complicated than we would like.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do agree with the Gang that it would be beneficial to run with the new otel default. Given that users can have more instrumentation than just a single beyla deployment, I would opt to make this configurable to be either Otel (default) OR tempo.

If it was flexible, people could rely on tempo generated metrics + the ones from Beyla.

I would also lean towards helping tempo to align with otel long-term.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright, I think we have a consensus. We'll add an OTel format and make it default.

I opened a new issue to track this here #821

ServiceGraphClient = "traces_service_graph_request_client"
ServiceGraphServer = "traces_service_graph_request_server"
ServiceGraphFailed = "traces_service_graph_request_failed_total"
ServiceGraphTotal = "traces_service_graph_request_total"

UsualPortGRPC = "4317"
UsualPortHTTP = "4318"
Expand All @@ -51,6 +56,7 @@ const (
FeatureNetwork = "network"
FeatureApplication = "application"
FeatureSpan = "application_span"
FeatureGraph = "application_service_graph"
)

type MetricsConfig struct {
Expand Down Expand Up @@ -132,12 +138,16 @@ func (m MetricsConfig) SpanMetricsEnabled() bool {
return slices.Contains(m.Features, FeatureSpan)
}

func (m MetricsConfig) ServiceGraphMetricsEnabled() bool {
return slices.Contains(m.Features, FeatureGraph)
}

func (m MetricsConfig) OTelMetricsEnabled() bool {
return slices.Contains(m.Features, FeatureApplication)
}

func (m MetricsConfig) Enabled() bool {
return m.EndpointEnabled() && (m.OTelMetricsEnabled() || m.SpanMetricsEnabled())
return m.EndpointEnabled() && (m.OTelMetricsEnabled() || m.SpanMetricsEnabled() || m.ServiceGraphMetricsEnabled())
}

// MetricsReporter implements the graph node that receives request.Span
Expand Down Expand Up @@ -167,6 +177,10 @@ type Metrics struct {
spanMetricsCallsTotal instrument.Int64Counter
spanMetricsSizeTotal instrument.Float64Counter
tracesTargetInfo instrument.Int64UpDownCounter
serviceGraphClient instrument.Float64Histogram
serviceGraphServer instrument.Float64Histogram
serviceGraphFailed instrument.Int64Counter
serviceGraphTotal instrument.Int64Counter
}

func ReportMetrics(
Expand Down Expand Up @@ -247,6 +261,19 @@ func (mr *MetricsReporter) spanMetricOptions(mlog *slog.Logger) []metric.Option
}
}

func (mr *MetricsReporter) graphMetricOptions(mlog *slog.Logger) []metric.Option {
if !mr.cfg.ServiceGraphMetricsEnabled() {
return []metric.Option{}
}

useExponentialHistograms := isExponentialAggregation(mr.cfg, mlog)

return []metric.Option{
metric.WithView(otelHistogramConfig(ServiceGraphClient, mr.cfg.Buckets.DurationHistogram, useExponentialHistograms)),
metric.WithView(otelHistogramConfig(ServiceGraphServer, mr.cfg.Buckets.DurationHistogram, useExponentialHistograms)),
}
}

func (mr *MetricsReporter) setupOtelMeters(m *Metrics, meter instrument.Meter) error {
if !mr.cfg.OTelMetricsEnabled() {
return nil
Expand Down Expand Up @@ -315,6 +342,36 @@ func (mr *MetricsReporter) setupSpanMeters(m *Metrics, meter instrument.Meter) e
return nil
}

func (mr *MetricsReporter) setupGraphMeters(m *Metrics, meter instrument.Meter) error {
if !mr.cfg.ServiceGraphMetricsEnabled() {
return nil
}

var err error

m.serviceGraphClient, err = meter.Float64Histogram(ServiceGraphClient, instrument.WithUnit("s"))
if err != nil {
return fmt.Errorf("creating service graph client histogram: %w", err)
}

m.serviceGraphServer, err = meter.Float64Histogram(ServiceGraphServer, instrument.WithUnit("s"))
if err != nil {
return fmt.Errorf("creating service graph server histogram: %w", err)
}

m.serviceGraphFailed, err = meter.Int64Counter(ServiceGraphFailed)
if err != nil {
return fmt.Errorf("creating service graph failed total: %w", err)
}

m.serviceGraphTotal, err = meter.Int64Counter(ServiceGraphTotal)
if err != nil {
return fmt.Errorf("creating service graph total: %w", err)
}

return nil
}

func (mr *MetricsReporter) newMetricSet(service svc.ID) (*Metrics, error) {
mlog := mlog().With("service", service)
mlog.Debug("creating new Metrics reporter")
Expand All @@ -328,6 +385,7 @@ func (mr *MetricsReporter) newMetricSet(service svc.ID) (*Metrics, error) {

opts = append(opts, mr.otelMetricOptions(mlog)...)
opts = append(opts, mr.spanMetricOptions(mlog)...)
opts = append(opts, mr.graphMetricOptions(mlog)...)

m := Metrics{
ctx: mr.ctx,
Expand Down Expand Up @@ -357,6 +415,15 @@ func (mr *MetricsReporter) newMetricSet(service svc.ID) (*Metrics, error) {
m.tracesTargetInfo.Add(mr.ctx, 1, attrOpt)
}

if mr.cfg.ServiceGraphMetricsEnabled() {
err = mr.setupGraphMeters(&m, meter)
if err != nil {
return nil, err
}
attrOpt := instrument.WithAttributeSet(mr.metricResourceAttributes(service))
m.tracesTargetInfo.Add(mr.ctx, 1, attrOpt)
}

return &m, nil
}

Expand Down Expand Up @@ -573,6 +640,30 @@ func (mr *MetricsReporter) spanMetricAttributes(span *request.Span) attribute.Se
return attribute.NewSet(attrs...)
}

func (mr *MetricsReporter) serviceGraphAttributes(span *request.Span) attribute.Set {
var attrs []attribute.KeyValue
if span.IsClientSpan() {
attrs = []attribute.KeyValue{
ClientMetric(span.PeerName),
ClientNamespaceMetric(span.ServiceID.Namespace),
ServerMetric(span.HostName),
ServerNamespaceMetric(span.OtherNamespace),
ConnectionTypeMetric("virtual_node"),
SourceMetric("beyla"),
}
} else {
attrs = []attribute.KeyValue{
ClientMetric(span.PeerName),
ClientNamespaceMetric(span.OtherNamespace),
ServerMetric(span.HostName),
ServerNamespaceMetric(span.ServiceID.Namespace),
ConnectionTypeMetric("virtual_node"),
SourceMetric("beyla"),
}
}
return attribute.NewSet(attrs...)
}

func (r *Metrics) record(span *request.Span, mr *MetricsReporter) {
t := span.Timings()
duration := t.End.Sub(t.RequestStart).Seconds()
Expand Down Expand Up @@ -602,6 +693,19 @@ func (r *Metrics) record(span *request.Span, mr *MetricsReporter) {
r.spanMetricsCallsTotal.Add(r.ctx, 1, attrOpt)
r.spanMetricsSizeTotal.Add(r.ctx, float64(span.ContentLength), attrOpt)
}

if mr.cfg.ServiceGraphMetricsEnabled() {
attrOpt := instrument.WithAttributeSet(mr.serviceGraphAttributes(span))
if span.IsClientSpan() {
r.serviceGraphClient.Record(r.ctx, duration, attrOpt)
} else {
r.serviceGraphServer.Record(r.ctx, duration, attrOpt)
}
r.serviceGraphTotal.Add(r.ctx, 1, attrOpt)
if SpanStatusCode(span) == codes.Error {
r.serviceGraphFailed.Add(r.ctx, 1, attrOpt)
}
}
}

func (mr *MetricsReporter) reportMetrics(input <-chan []request.Span) {
Expand Down
Loading
Loading