Latency metrics contain spurious entries #3150

Closed
adrcunha opened this issue Feb 9, 2019 · 25 comments · Fixed by #3298
Labels: area/test-and-release (flags unit/e2e/conformance/perf test issues for product features), kind/bug (categorizes issue or PR as related to a bug)

Comments

adrcunha commented Feb 9, 2019

In what area(s)?

/area API
/area autoscale
/area build
/area monitoring
/area networking
/area test-and-release

What version of Knative?

HEAD

Expected Behavior

No spurious metrics.

Actual Behavior

Metrics "/", "/envvars" and "/filepath" are displayed in Testgrid:

[screenshot: Testgrid latency tab showing the spurious "/", "/envvars" and "/filepath" entries]

Steps to Reproduce the Problem

Check the Testgrid tab.

adrcunha added the kind/bug label Feb 9, 2019
knative-prow-robot added the area/test-and-release label Feb 9, 2019
adrcunha commented Feb 9, 2019

The spurious entries come from:

I0209 02:31:52.621] 2019-02-09T02:31:52.621Z	info	TestDestroyPodInflight	trace/trace.go:266	metric / 1549679469576330274 1549679512621145241 43.044814967s
I0209 02:20:10.469] 2019-02-09T02:20:10.468Z	info	TestMustFileSystemPermissions	trace/trace.go:266	metric /filepath 1549678810380861283 1549678810468268299 87.407016ms
I0209 02:17:40.448] 2019-02-09T02:17:40.448Z	info	TestSecretsFromEnv	trace/trace.go:266	metric /envvars 1549678660402202372 1549678660448050963 45.848591ms

adrcunha commented Feb 9, 2019

Looks like the metric names are the paths fetched by tests like envvars_test.go or filesystem_perm_test.go. For example:

I0209 02:17:32.032] === RUN   TestSecretsFromEnv
I0209 02:17:32.092] 2019-02-09T02:17:32.091Z	info	TestSecretsFromEnv	conformance/envpropagation_test.go:58	Successfully created test secret: %v&Secret{ObjectMeta:k8s_io_apimachinery_pkg_apis_meta_v1.ObjectMeta{Name:test-secret,GenerateName:,Namespace:serving-tests,SelfLink:/api/v1/namespaces/serving-tests/secrets/test-secret,UID:db3830a2-2c10-11e9-bf6e-42010a8a00e3,ResourceVersion:3715,Generation:0,CreationTimestamp:2019-02-09 02:17:32 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil,Labels:map[string]string{},Annotations:map[string]string{},OwnerReferences:[],Finalizers:[],ClusterName:,Initializers:nil,},Data:map[string][]byte{testKey: [116 101 115 116 86 97 108 117 101],},Type:Opaque,StringData:map[string]string{},}
I0209 02:17:32.092] 2019-02-09T02:17:32.091Z	info	TestSecretsFromEnv	conformance/conformancetest_helper.go:82	Creating a new Service
I0209 02:17:32.141] 2019-02-09T02:17:32.141Z	info	TestSecretsFromEnv	conformance/conformancetest_helper.go:97	The Service will be updated with the name of the Revision once it is created
I0209 02:17:32.184] 2019-02-09T02:17:32.184Z	info	TestSecretsFromEnv	trace/trace.go:266	metric WaitForServiceState/yashikifwmnzwxl/ServiceUpdatedWithRevision 1549678652141204179 1549678652184284988 43.080809ms
I0209 02:17:32.185] 2019-02-09T02:17:32.184Z	info	TestSecretsFromEnv	conformance/conformancetest_helper.go:110	When the Service reports as Ready, everything should be ready.
I0209 02:17:40.272] 2019-02-09T02:17:40.271Z	info	TestSecretsFromEnv	trace/trace.go:266	metric WaitForServiceState/yashikifwmnzwxl/ServiceIsReady 1549678652184535388 1549678660271911268 8.08737588s
I0209 02:17:40.273] 2019-02-09T02:17:40.272Z	info	TestSecretsFromEnv	conformance/conformancetest_helper.go:115	When the Revision can have traffic routed to it, the Route is marked as Ready.
I0209 02:17:40.315] 2019-02-09T02:17:40.315Z	info	TestSecretsFromEnv	trace/trace.go:266	metric WaitForRouteState/yashikifwmnzwxl/RouteIsReady 1549678660272162986 1549678660315026705 42.863719ms
I0209 02:17:40.448] 2019-02-09T02:17:40.448Z	info	TestSecretsFromEnv	trace/trace.go:266	metric /envvars 1549678660402202372 1549678660448050963 45.848591ms
I0209 02:17:40.448] 2019-02-09T02:17:40.448Z	info	TestSecretsFromEnv	trace/trace.go:266	metric SpoofingClient-Trace 1549678660402166149 1549678660448273199 46.10705ms
I0209 02:17:40.449] 2019-02-09T02:17:40.448Z	info	TestSecretsFromEnv	trace/trace.go:266	metric WaitForEndpointState/EnvVarsServesText 1549678660358301123 1549678660448431718 90.130595ms
I0209 02:17:40.639] --- PASS: TestSecretsFromEnv (8.61s)

adrcunha commented Feb 9, 2019

Looks like opencensus is the culprit.

adrcunha commented Feb 9, 2019

Some possible solutions:

@dushyanthsc @srinivashegde86 please provide feedback.

@srinivashegde86

I don't think metric name length should decide this. We can explore either ignoring names starting with /, and/or adding a prefix.

Adding a prefix seems unnecessary for most metrics and may depend on the test?

@adrcunha

> I don't think metric name length should decide this. We can explore either ignoring names starting with /, and/or adding a prefix.
> Adding a prefix seems unnecessary for most metrics and may depend on the test?

Your suggestions are conflicting.

A prefix tagging metrics is the easiest/quickest solution to this problem.

But actually I was hoping for a cleaner solution, like having a separate logger for the metrics, so there's no mix-up. I don't know enough right now to say whether that's feasible.

mattmoor added this to the Ice Box milestone Feb 11, 2019
@srinivashegde86

Sorry, I don't think I understand the problem completely here.

Is the problem that the metric name starts with / and OpenCensus parsing fails in these cases?

@dushyanthsc If we use a new logger, will Zipkin tracing be broken?

@adrcunha

> Sorry, I don't think I understand the problem completely here.
> Is the problem that the metric name starts with / and OpenCensus parsing fails in these cases?

No. The problem is that these "metrics" are logged by third-party code, and thus appear in our Testgrid tab. But they're unrelated, and thus shouldn't be parsed as (our) metrics.

@dushyanthsc

Zipkin tracing only depends on the trace ID being attached to the HTTP header; it has no dependency on the logger.

I am trying to understand the advantages of the new logger. Even with a new logger we would still need some mechanism to filter out spans whose names are just /, unless we override Handler.FormatSpanName, right?

@adrcunha

We don't care about names starting with /. These are not our metrics.

@adrcunha

For example:

metric WaitForRouteState/yashikifwmnzwxl/RouteIsReady 1549678660272162986 1549678660315026705 42.863719ms
metric /envvars 1549678660402202372 1549678660448050963 45.848591ms
metric SpoofingClient-Trace 1549678660402166149 1549678660448273199 46.10705ms

The only real metric for our dashboard is the first line. Lines 2 and 3 are garbage for us. But since we use a single, global trace exporter, the spans on lines 2 and 3 are (erroneously) formatted as our metrics and displayed in the dashboard.

@adrcunha

More details about emitting metrics in https://github.com/knative/pkg/tree/master/test#emit-metrics

@srinivashegde86

Is this only related to the ochttp transport plugin? We added support for it to look at HTTP headers; it was added to enable Zipkin tracing and maybe isn't needed by others?

We can probably add a FormatSpanName method that does this filtering or adds a filterable prefix (sketched below)? https://github.com/census-instrumentation/opencensus-go/blob/master/plugin/ochttp/server.go#L72
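A minimal sketch of that idea, assuming the test's spoofing client wraps its transport in ochttp.Transport (FormatSpanName exists on both ochttp.Handler and ochttp.Transport); the "ignorable-" prefix and the newSpoofingTransport helper are illustrative assumptions, not the actual knative code:

```go
// Sketch only: give the spans started by the ochttp plugin a filterable name.
// newSpoofingTransport and the "ignorable-" prefix are illustrative assumptions.
package spoof

import (
	"net/http"

	"go.opencensus.io/plugin/ochttp"
)

func newSpoofingTransport(base http.RoundTripper) *ochttp.Transport {
	return &ochttp.Transport{
		Base: base,
		// By default ochttp names the client span after the request path
		// ("/", "/envvars", "/filepath"), which is exactly what leaks into
		// Testgrid. A recognizable prefix lets the exporter skip these spans.
		FormatSpanName: func(r *http.Request) string {
			return "ignorable-" + r.URL.Path
		},
	}
}
```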

@dushyanthsc

Got it. The third trace has to be started for Zipkin tracing to work, so we would still need it. But you are right that that trace need not be logged.

I would lean towards adding a prefix: that way we control what gets logged, and ExportSpan only logs spans with the specific prefix. Whether we should use a new logger, @srinivashegde86 can comment on.

@dushyanthsc

@srinivashegde86 that can be used to give the span an ignorable name, but the logic to not log that trace still has to live somewhere.

@adrcunha

Congrats, you got a bug for yourself. :D

/assign @dushyanthsc

@srinivashegde86

I don't think adding a new logger will help here. OpenCensus logs this information using its own logging mechanism, and we are looking at STDOUT in the logs, so we will capture its logs as well.

Some kind of filtering/prefixing might be a better option.

@dushyanthsc

Just to make sure we are all on the same page, here is the plan of change (a rough sketch of steps 1 and 3 follows below):

  1. Define a constant in knative/pkg/logging.go - "ExportMetric-".
  2. Update all locations where we start spans that need to be exported by prepending the metric name with the constant defined in step 1.
    The places where we make this change include:
    • All wait* methods in knative/serving/test/crd_checks.go
    • The WaitForEndpointState method in knative/pkg/test/request.go
  3. Update the ExportSpan method in knative/pkg/test/logging.go to only log if the metric name starts with the constant defined in step 1.

@adrcunha @srinivashegde86
Once the plan is signed off I will make the changes and send a PR.
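A rough sketch of steps 1 and 3 to make the plan concrete; the constant and type names (ExportMetricPrefix, spanLoggingExporter) are assumptions for illustration, not the merged knative/pkg code:

```go
// Sketch only: steps 1 and 3 of the plan above. ExportMetricPrefix and
// spanLoggingExporter are assumed names, not the merged knative/pkg code.
package logging

import (
	"strings"

	"go.opencensus.io/trace"
	"go.uber.org/zap"
)

// Step 1: constant that tags spans the tests want exported as latency metrics.
const ExportMetricPrefix = "ExportMetric-"

// spanLoggingExporter stands in for the custom trace exporter the tests
// register globally; today it logs every finished span, which is how
// "/envvars" and friends end up on the dashboard.
type spanLoggingExporter struct {
	logger *zap.SugaredLogger
}

// Step 3: only log spans whose name carries the prefix, and strip it again so
// the dashboard keeps seeing the plain metric name.
func (e spanLoggingExporter) ExportSpan(sd *trace.SpanData) {
	if !strings.HasPrefix(sd.Name, ExportMetricPrefix) {
		return
	}
	name := strings.TrimPrefix(sd.Name, ExportMetricPrefix)
	e.logger.Infof("metric %s %d %d %s",
		name, sd.StartTime.UnixNano(), sd.EndTime.UnixNano(), sd.EndTime.Sub(sd.StartTime))
}
```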

@srinivashegde86

This is the custom exporter that logs the metric for us: https://github.com/knative/pkg/blob/master/test/logging/logging.go#L71

We could update this log line to add a prefix, something like the ExportMetric constant, and then update our latency parser to check for it?

@adrcunha

> Update all locations where we start spans that need to be exported by prepending the metric name with the constant defined in step 1.

This is error-prone and implies more boilerplate code. Create a helper (see the sketch below), update the documentation.

> We could update this log line to add a prefix, something like the ExportMetric constant, and then update our latency parser to check for it?

Won't work; the end result will be the same, just reached in a more complicated way: you're still logging everything as a valid metric.
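A sketch of the kind of helper being asked for, plus one call site; EmitableSpan and ExportMetricPrefix are assumed names, not necessarily what the eventual PR uses:

```go
// Sketch only: a helper that hides the prefixing so call sites stay free of
// boilerplate. EmitableSpan/ExportMetricPrefix are assumed names.
package test

import (
	"context"
	"fmt"
	"time"

	"go.opencensus.io/trace"
)

const ExportMetricPrefix = "ExportMetric-"

// EmitableSpan starts a span that the logging exporter will emit as a metric.
func EmitableSpan(ctx context.Context, metricName string) *trace.Span {
	_, span := trace.StartSpan(ctx, ExportMetricPrefix+metricName)
	return span
}

// Example call site, roughly what a wait* helper in crd_checks.go would do.
func waitForServiceState(ctx context.Context, service, desc string, done func() bool) {
	span := EmitableSpan(ctx, fmt.Sprintf("WaitForServiceState/%s/%s", service, desc))
	defer span.End()
	for !done() {
		time.Sleep(time.Second) // poll until the Service reaches the desired state
	}
}
```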

@srinivashegde86

> 2. The places where we make this change include:

Also https://github.com/knative/pkg/blob/master/test/kube_checks.go

@dushyanthsc

OK. With the helper approach and the additional place in kube_checks incorporated, should I go ahead and make the changes?

dushyanthsc added a commit to dushyanthsc/pkg that referenced this issue Feb 14, 2019
knative/serving#3150 describes the issue
that currently exists in our test logging framework. This change
fixes the problem by prefixing metrics that need to be emitted with
a constant, which the logging.ExportSpan method then uses to identify
the spans that need to be emitted as logs.

Note this only fixes part of the issue:
knative/serving#3150
This change needs to be ported to knative/serving before the issue
can be closed.
knative-prow-robot pushed a commit to knative/pkg that referenced this issue Feb 15, 2019
* Metrics logging fix in pkg/test: Issue-3150

knative/serving#3150 describes the issue
that currently exists in our test logging framework. This change
fixes the problem by prefixing metrics that need to be emitted with
a constant, which the logging.ExportSpan method then uses to identify
the spans that need to be emitted as logs.

Note this only fixes part of the issue:
knative/serving#3150
This change needs to be ported to knative/serving before the issue
can be closed.

* Update test/logging/logging.go

Adding required lines.

Co-Authored-By: dushyanthsc <[email protected]>
dushyanthsc added a commit to dushyanthsc/serving that referenced this issue Feb 21, 2019
@adrcunha

Reopening since this also requires changes in pkg.

@dushyanthsc

@adrcunha Did I miss something in this PR: knative/pkg#279

@adrcunha

> Did I miss something in this PR: knative/pkg#279

Sorry, no. It was I who missed your PR in pkg.

dprotaso removed this from the Ice Box milestone Oct 6, 2021