Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for directly generating span graphs #745

Merged
merged 22 commits into from
Apr 18, 2024
Merged
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Prev Previous commit
Next Next commit
Add service graph series
grcevski committed Apr 15, 2024
commit b778a848109a22c4dc64fdb73d3de2219318ce38
25 changes: 25 additions & 0 deletions pkg/internal/export/otel/common.go
Original file line number Diff line number Diff line change
@@ -262,6 +262,11 @@ const (
SourceKey = attribute.Key("source")
ServiceKey = attribute.Key("service")
InstanceKey = attribute.Key("instance")
ClientKey = attribute.Key("client")
ClientNamespaceKey = attribute.Key("client_service_namespace")
ServerKey = attribute.Key("server")
ServerNamespaceKey = attribute.Key("server_service_namespace")
ConnectionTypeKey = attribute.Key("connection_type")
)

func HTTPRequestMethod(val string) attribute.KeyValue {
@@ -327,3 +332,23 @@ func StatusCodeMetric(val int) attribute.KeyValue {
func ServiceInstanceMetric(val string) attribute.KeyValue {
return InstanceKey.String(val)
}

func ClientMetric(val string) attribute.KeyValue {
return ClientKey.String(val)
}

func ClientNamespaceMetric(val string) attribute.KeyValue {
return ClientNamespaceKey.String(val)
}

func ServerMetric(val string) attribute.KeyValue {
return ServerKey.String(val)
}

func ServerNamespaceMetric(val string) attribute.KeyValue {
return ServerNamespaceKey.String(val)
}

func ConnectionTypeMetric(val string) attribute.KeyValue {
return ConnectionTypeKey.String(val)
}
55 changes: 55 additions & 0 deletions pkg/internal/export/otel/metrics.go
Original file line number Diff line number Diff line change
@@ -12,6 +12,7 @@ import (

"github.com/mariomac/pipes/pkg/node"
"go.opentelemetry.io/otel/attribute"
"go.opentelemetry.io/otel/codes"
"go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetricgrpc"
"go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetrichttp"
instrument "go.opentelemetry.io/otel/metric"
@@ -41,6 +42,10 @@ const (
SpanMetricsCalls = "traces_spanmetrics_calls_total"
SpanMetricsSizes = "traces_spanmetrics_size_total"
TracesTargetInfo = "traces_target_info"

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From what I understood, Tempo generates traces_target_info when the OTel Collector Service Graph Connector generates target_info and we have a configuration flag to choose the one used (screenshot below).

@domasx2 do we have best practices here?

image

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it mean we may not need it? I simply went by what I saw Tempo generated for my cloud instance when using Alloy.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's needed for sure. I think what @cyrille-leclerc wondering is wether you should align with Tempo metric generator metric names, or OTEL spanmetrics/ service graph metrics connector metric names. I don't have a strong opinion TBH, though would lean towards OTEL

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh absolutely OTel then, just confirming, so if we renamed the collection as target_info App O11y will continue to work, correct? I'll create a PR to fix this.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that span metric names change as well for Otel, instead of traces_span_* it's duration_seconds_bucket, duration_seconds_count and duration_seconds_total

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as @domasx2 said, the question of OTel vs Tempo naming conventions. I would see value in adopting OTel conventions.
That being said, what's our recommendation if a customer is getting started combining Beyla with a bunch of app instrumented with OTel SDKs the simplest possible way, relying on Tempo metrics generation?
Sorry it's more complicated than we would like.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do agree with the Gang that it would be beneficial to run with the new otel default. Given that users can have more instrumentation than just a single beyla deployment, I would opt to make this configurable to be either Otel (default) OR tempo.

If it was flexible, people could rely on tempo generated metrics + the ones from Beyla.

I would also lean towards helping tempo to align with otel long-term.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright, I think we have a consensus. We'll add an OTel format and make it default.

I opened a new issue to track this here #821

ServiceGraphClient = "traces_service_graph_request_client"
ServiceGraphServer = "traces_service_graph_request_server"
ServiceGraphFailed = "traces_service_graph_request_failed_total"
ServiceGraphTotal = "traces_service_graph_request_total"

UsualPortGRPC = "4317"
UsualPortHTTP = "4318"
@@ -162,6 +167,10 @@ type Metrics struct {
spanMetricsCallsTotal instrument.Int64Counter
spanMetricsSizeTotal instrument.Float64Counter
tracesTargetInfo instrument.Int64UpDownCounter
serviceGraphClient instrument.Float64Histogram
serviceGraphServer instrument.Float64Histogram
serviceGraphFailed instrument.Int64Counter
serviceGraphTotal instrument.Int64Counter
}

func ReportMetrics(
@@ -235,6 +244,8 @@ func (mr *MetricsReporter) spanMetricOptions(mlog *slog.Logger) []metric.Option

return []metric.Option{
metric.WithView(otelHistogramConfig(SpanMetricsLatency, mr.cfg.Buckets.DurationHistogram, useExponentialHistograms)),
metric.WithView(otelHistogramConfig(ServiceGraphClient, mr.cfg.Buckets.DurationHistogram, useExponentialHistograms)),
metric.WithView(otelHistogramConfig(ServiceGraphServer, mr.cfg.Buckets.DurationHistogram, useExponentialHistograms)),
}
}

@@ -303,6 +314,26 @@ func (mr *MetricsReporter) setupSpanMeters(m *Metrics, meter instrument.Meter) e
return fmt.Errorf("creating span metric traces target info: %w", err)
}

m.serviceGraphClient, err = meter.Float64Histogram(ServiceGraphClient, instrument.WithUnit("s"))
if err != nil {
return fmt.Errorf("creating service graph client histogram: %w", err)
}

m.serviceGraphServer, err = meter.Float64Histogram(ServiceGraphServer, instrument.WithUnit("s"))
if err != nil {
return fmt.Errorf("creating service graph server histogram: %w", err)
}

m.serviceGraphFailed, err = meter.Int64Counter(ServiceGraphFailed)
if err != nil {
return fmt.Errorf("creating service graph failed total: %w", err)
}

m.serviceGraphTotal, err = meter.Int64Counter(ServiceGraphTotal)
if err != nil {
return fmt.Errorf("creating service graph total: %w", err)
}

return nil
}

@@ -564,6 +595,19 @@ func (mr *MetricsReporter) spanMetricAttributes(span *request.Span) attribute.Se
return attribute.NewSet(attrs...)
}

func (mr *MetricsReporter) serviceGraphAttributes(span *request.Span) attribute.Set {
attrs := []attribute.KeyValue{
ClientMetric(span.PeerName),
ClientNamespaceMetric(span.ServiceID.Namespace), // TODO: what do we do here?
ServerMetric(span.HostName),
ServerNamespaceMetric(span.ServiceID.Namespace),
ConnectionTypeMetric("virtual_node"),
SourceMetric("beyla"),
}

return attribute.NewSet(attrs...)
}

func (r *Metrics) record(span *request.Span, mr *MetricsReporter) {
t := span.Timings()
duration := t.End.Sub(t.RequestStart).Seconds()
@@ -592,6 +636,17 @@ func (r *Metrics) record(span *request.Span, mr *MetricsReporter) {
r.spanMetricsLatency.Record(r.ctx, duration, attrOpt)
r.spanMetricsCallsTotal.Add(r.ctx, 1, attrOpt)
r.spanMetricsSizeTotal.Add(r.ctx, float64(span.ContentLength), attrOpt)

attrOpt = instrument.WithAttributeSet(mr.serviceGraphAttributes(span))
if span.IsClientSpan() {
r.serviceGraphClient.Record(r.ctx, duration, attrOpt)
} else {
r.serviceGraphServer.Record(r.ctx, duration, attrOpt)
}
r.serviceGraphTotal.Add(r.ctx, 1, attrOpt)
if SpanStatusCode(span) == codes.Error {
r.serviceGraphFailed.Add(r.ctx, 1, attrOpt)
}
}
}

28 changes: 22 additions & 6 deletions pkg/internal/export/otel/traces.go
Original file line number Diff line number Diff line change
@@ -292,6 +292,22 @@ func SpanKindString(span *request.Span) string {
return "SPAN_KIND_INTERNAL"
}

func spanHost(span *request.Span) string {
if span.HostName != "" {
return span.HostName
}

return span.Host
}

func spanPeer(span *request.Span) string {
if span.PeerName != "" {
return span.PeerName
}

return span.Peer
}

func TraceAttributes(span *request.Span) []attribute.KeyValue {
var attrs []attribute.KeyValue

@@ -301,8 +317,8 @@ func TraceAttributes(span *request.Span) []attribute.KeyValue {
HTTPRequestMethod(span.Method),
HTTPResponseStatusCode(span.Status),
HTTPUrlPath(span.Path),
ClientAddr(span.Peer),
ServerAddr(span.Host),
ClientAddr(spanPeer(span)),
ServerAddr(spanHost(span)),
ServerPort(span.HostPort),
HTTPRequestBodySize(int(span.ContentLength)),
}
@@ -314,16 +330,16 @@ func TraceAttributes(span *request.Span) []attribute.KeyValue {
semconv.RPCMethod(span.Path),
semconv.RPCSystemGRPC,
semconv.RPCGRPCStatusCodeKey.Int(span.Status),
ClientAddr(span.Peer),
ServerAddr(span.Host),
ClientAddr(spanPeer(span)),
ServerAddr(spanHost(span)),
ServerPort(span.HostPort),
}
case request.EventTypeHTTPClient:
attrs = []attribute.KeyValue{
HTTPRequestMethod(span.Method),
HTTPResponseStatusCode(span.Status),
HTTPUrlFull(span.Path),
ServerAddr(span.Host),
ServerAddr(spanHost(span)),
ServerPort(span.HostPort),
HTTPRequestBodySize(int(span.ContentLength)),
}
@@ -332,7 +348,7 @@ func TraceAttributes(span *request.Span) []attribute.KeyValue {
semconv.RPCMethod(span.Path),
semconv.RPCSystemGRPC,
semconv.RPCGRPCStatusCodeKey.Int(span.Status),
ServerAddr(span.Host),
ServerAddr(spanHost(span)),
ServerPort(span.HostPort),
}
case request.EventTypeSQLClient:
10 changes: 9 additions & 1 deletion pkg/transform/name_resolver.go
Original file line number Diff line number Diff line change
@@ -6,6 +6,7 @@ import (
"strings"
"time"

"github.com/grafana/beyla/pkg/internal/kube"
"github.com/grafana/beyla/pkg/internal/request"
"github.com/grafana/beyla/pkg/internal/svc"
"github.com/hashicorp/golang-lru/v2/expirable"
@@ -84,7 +85,8 @@ func (nr *NameResolver) resolveNames(span *request.Span) {
} else {
host = nr.resolve(&span.ServiceID, span.Host)
if len(host) > 0 {
if strings.EqualFold(host, peer) && span.ServiceID.AutoName {
_, ok := span.ServiceID.Metadata[kube.PodName]
if ok && strings.EqualFold(host, peer) && span.ServiceID.AutoName {
span.HostName = span.ServiceID.Name
} else {
span.HostName = host
@@ -109,6 +111,12 @@ func (nr *NameResolver) resolve(svc *svc.ID, ip string) string {
n = strings.TrimSuffix(n, ".")
n = trimSuffixIgnoreCase(n, ".svc.cluster.local")
n = trimSuffixIgnoreCase(n, "."+svc.Namespace)

kubeNamespace, ok := svc.Metadata[kube.NamespaceName]
if ok && kubeNamespace != "" && kubeNamespace != svc.Namespace {
n = trimSuffixIgnoreCase(n, "."+kubeNamespace)
}

dashIP := strings.ReplaceAll(ip, ".", "-") + "."
n = trimPrefixIgnoreCase(n, dashIP)