Latency metrics contain spurious entries #3150

Closed
adrcunha opened this issue Feb 9, 2019 · 25 comments · Fixed by #3298
Labels: area/test-and-release (flags unit/e2e/conformance/perf test issues for product features), kind/bug (categorizes issue or PR as related to a bug)

Comments

adrcunha commented Feb 9, 2019

In what area(s)?

/area API
/area autoscale
/area build
/area monitoring
/area networking
/area test-and-release

What version of Knative?

HEAD

Expected Behavior

No spurious metrics.

Actual Behavior

Metrics "/", "/envvars" and "/filepath" are displayed in Testgrid:

[screenshot: Testgrid latency tab showing the spurious "/", "/envvars" and "/filepath" entries]

Steps to Reproduce the Problem

Check the Testgrid tab.

adrcunha added the kind/bug label Feb 9, 2019
knative-prow-robot added the area/test-and-release label Feb 9, 2019
adrcunha commented Feb 9, 2019

The spurious entries come from:

I0209 02:31:52.621] 2019-02-09T02:31:52.621Z	info	TestDestroyPodInflight	trace/trace.go:266	metric / 1549679469576330274 1549679512621145241 43.044814967s
I0209 02:20:10.469] 2019-02-09T02:20:10.468Z	info	TestMustFileSystemPermissions	trace/trace.go:266	metric /filepath 1549678810380861283 1549678810468268299 87.407016ms
I0209 02:17:40.448] 2019-02-09T02:17:40.448Z	info	TestSecretsFromEnv	trace/trace.go:266	metric /envvars 1549678660402202372 1549678660448050963 45.848591ms

adrcunha commented Feb 9, 2019

Looks like the metric names are the paths fetched by tests like envvars_test.go or filesystem_perm_test.go. For example:

I0209 02:17:32.032] === RUN   TestSecretsFromEnv
I0209 02:17:32.092] 2019-02-09T02:17:32.091Z	info	TestSecretsFromEnv	conformance/envpropagation_test.go:58	Successfully created test secret: %v&Secret{ObjectMeta:k8s_io_apimachinery_pkg_apis_meta_v1.ObjectMeta{Name:test-secret,GenerateName:,Namespace:serving-tests,SelfLink:/api/v1/namespaces/serving-tests/secrets/test-secret,UID:db3830a2-2c10-11e9-bf6e-42010a8a00e3,ResourceVersion:3715,Generation:0,CreationTimestamp:2019-02-09 02:17:32 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil,Labels:map[string]string{},Annotations:map[string]string{},OwnerReferences:[],Finalizers:[],ClusterName:,Initializers:nil,},Data:map[string][]byte{testKey: [116 101 115 116 86 97 108 117 101],},Type:Opaque,StringData:map[string]string{},}
I0209 02:17:32.092] 2019-02-09T02:17:32.091Z	info	TestSecretsFromEnv	conformance/conformancetest_helper.go:82	Creating a new Service
I0209 02:17:32.141] 2019-02-09T02:17:32.141Z	info	TestSecretsFromEnv	conformance/conformancetest_helper.go:97	The Service will be updated with the name of the Revision once it is created
I0209 02:17:32.184] 2019-02-09T02:17:32.184Z	info	TestSecretsFromEnv	trace/trace.go:266	metric WaitForServiceState/yashikifwmnzwxl/ServiceUpdatedWithRevision 1549678652141204179 1549678652184284988 43.080809ms
I0209 02:17:32.185] 2019-02-09T02:17:32.184Z	info	TestSecretsFromEnv	conformance/conformancetest_helper.go:110	When the Service reports as Ready, everything should be ready.
I0209 02:17:40.272] 2019-02-09T02:17:40.271Z	info	TestSecretsFromEnv	trace/trace.go:266	metric WaitForServiceState/yashikifwmnzwxl/ServiceIsReady 1549678652184535388 1549678660271911268 8.08737588s
I0209 02:17:40.273] 2019-02-09T02:17:40.272Z	info	TestSecretsFromEnv	conformance/conformancetest_helper.go:115	When the Revision can have traffic routed to it, the Route is marked as Ready.
I0209 02:17:40.315] 2019-02-09T02:17:40.315Z	info	TestSecretsFromEnv	trace/trace.go:266	metric WaitForRouteState/yashikifwmnzwxl/RouteIsReady 1549678660272162986 1549678660315026705 42.863719ms
I0209 02:17:40.448] 2019-02-09T02:17:40.448Z	info	TestSecretsFromEnv	trace/trace.go:266	metric /envvars 1549678660402202372 1549678660448050963 45.848591ms
I0209 02:17:40.448] 2019-02-09T02:17:40.448Z	info	TestSecretsFromEnv	trace/trace.go:266	metric SpoofingClient-Trace 1549678660402166149 1549678660448273199 46.10705ms
I0209 02:17:40.449] 2019-02-09T02:17:40.448Z	info	TestSecretsFromEnv	trace/trace.go:266	metric WaitForEndpointState/EnvVarsServesText 1549678660358301123 1549678660448431718 90.130595ms
I0209 02:17:40.639] --- PASS: TestSecretsFromEnv (8.61s)

adrcunha commented Feb 9, 2019

Looks like opencensus is the culprit.

adrcunha commented Feb 9, 2019

Some possible solutions:

@dushyanthsc @srinivashegde86 please provide feedback.

@srinivashegde86

I don't think metric name length should decide this. We can explore either ignoring names starting with /, and/or adding a prefix.

Adding a prefix seems unnecessary for most metrics and may depend on the test?

@adrcunha

> I don't think metric name length should decide this. We can explore either ignoring names starting with /, and/or adding a prefix.
> Adding a prefix seems unnecessary for most metrics and may depend on the test?

Your suggestions are conflicting.

A prefix tagging metrics is the easiest/quickest solution to this problem.

But actually I was hoping for a cleaner solution, like having a separate logger for the metrics, so there's no mix-up. I don't know enough right now to say whether that's feasible.

mattmoor added this to the Ice Box milestone Feb 11, 2019
@srinivashegde86

Sorry, I don't think I understand the problem completely here.

Is the problem that the metric name starts with / and OpenCensus parsing fails in these cases?

@dushyanthsc If we use a new logger, will Zipkin tracing be broken?

@adrcunha

> Sorry, I don't think I understand the problem completely here.
> Is the problem that the metric name starts with / and OpenCensus parsing fails in these cases?

No. The problem is that these "metrics" are logged by third-party code, and thus appear in our Testgrid tab. But they're unrelated, and thus shouldn't be parsed as (our) metrics.

@dushyanthsc

Zipkin tracing only depends on the trace ID being attached to the HTTP header; it has no dependency on the logger.

I am trying to understand the advantages of the new logger. Even with a new logger we would still need some mechanism to filter out spans whose names are just /, unless we override Handler.FormatSpanName, right?

@adrcunha

We don't care about names starting with /. These are not our metrics.

@adrcunha

For example:

metric WaitForRouteState/yashikifwmnzwxl/RouteIsReady 1549678660272162986 1549678660315026705 42.863719ms
metric /envvars 1549678660402202372 1549678660448050963 45.848591ms
metric SpoofingClient-Trace 1549678660402166149 1549678660448273199 46.10705ms

The only real metric for our dashboard is the first line. Lines 2 and 3 are garbage for us. But since we use a single, global trace exporter, the spans on lines 2 and 3 are (erroneously) formatted as our metrics and displayed in the dashboard.

@adrcunha

More details about emitting metrics in https://github.com/knative/pkg/tree/master/test#emit-metrics

@srinivashegde86

Is this only related to the ochttp transport plugin? We added support for it to look at HTTP headers; it was added to enable Zipkin tracing and maybe isn't needed by others?

We can probably add a FormatSpanName method that does this filtering or adds a filterable prefix (sketched below)? https://github.com/census-instrumentation/opencensus-go/blob/master/plugin/ochttp/server.go#L72
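A minimal sketch of that idea, assuming the test's spoofing client wraps its transport in ochttp.Transport (FormatSpanName exists on both ochttp.Handler and ochttp.Transport); the "ignorable-" prefix and the newSpoofingTransport helper are illustrative assumptions, not the actual knative code:

```go
// Sketch only: give the spans started by the ochttp plugin a filterable name.
// newSpoofingTransport and the "ignorable-" prefix are illustrative assumptions.
package spoof

import (
	"net/http"

	"go.opencensus.io/plugin/ochttp"
)

func newSpoofingTransport(base http.RoundTripper) *ochttp.Transport {
	return &ochttp.Transport{
		Base: base,
		// By default ochttp names the client span after the request path
		// ("/", "/envvars", "/filepath"), which is exactly what leaks into
		// Testgrid. A recognizable prefix lets the exporter skip these spans.
		FormatSpanName: func(r *http.Request) string {
			return "ignorable-" + r.URL.Path
		},
	}
}
```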

@dushyanthsc

Got it. The third trace has to be started for Zipkin tracing to work, so we would still need it. But you are right that that trace need not be logged.

I would lean towards adding a prefix: that way we control what gets logged, and ExportSpan only logs spans with the specific prefix. Whether we should use a new logger, @srinivashegde86 can comment on.

@dushyanthsc

@srinivashegde86 that can be used to give the span an ignorable name, but the logic to not log that trace still has to live somewhere.

@adrcunha

Congrats, you got a bug for yourself. :D

/assign @dushyanthsc

@srinivashegde86

I don't think adding a new logger will help here. OpenCensus logs this information using its own logging mechanism, and we are looking at STDOUT in the logs, so we will capture its logs as well.

Some kind of filtering/prefixing might be a better option.

@dushyanthsc

Just to make sure we are all on the same page, here is the plan of change (a rough sketch of steps 1 and 3 follows below):

  1. Define a constant in knative/pkg/logging.go - "ExportMetric-".
  2. Update all locations where we start spans that need to be exported by prepending the metric name with the constant defined in step 1.
    The places where we make this change include:
    • All wait* methods in knative/serving/test/crd_checks.go
    • The WaitForEndpointState method in knative/pkg/test/request.go
  3. Update the ExportSpan method in knative/pkg/test/logging.go to only log if the metric name starts with the constant defined in step 1.

@adrcunha @srinivashegde86
Once the plan is signed off I will make the changes and send a PR.
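A rough sketch of steps 1 and 3 to make the plan concrete; the constant and type names (ExportMetricPrefix, spanLoggingExporter) are assumptions for illustration, not the merged knative/pkg code:

```go
// Sketch only: steps 1 and 3 of the plan above. ExportMetricPrefix and
// spanLoggingExporter are assumed names, not the merged knative/pkg code.
package logging

import (
	"strings"

	"go.opencensus.io/trace"
	"go.uber.org/zap"
)

// Step 1: constant that tags spans the tests want exported as latency metrics.
const ExportMetricPrefix = "ExportMetric-"

// spanLoggingExporter stands in for the custom trace exporter the tests
// register globally; today it logs every finished span, which is how
// "/envvars" and friends end up on the dashboard.
type spanLoggingExporter struct {
	logger *zap.SugaredLogger
}

// Step 3: only log spans whose name carries the prefix, and strip it again so
// the dashboard keeps seeing the plain metric name.
func (e spanLoggingExporter) ExportSpan(sd *trace.SpanData) {
	if !strings.HasPrefix(sd.Name, ExportMetricPrefix) {
		return
	}
	name := strings.TrimPrefix(sd.Name, ExportMetricPrefix)
	e.logger.Infof("metric %s %d %d %s",
		name, sd.StartTime.UnixNano(), sd.EndTime.UnixNano(), sd.EndTime.Sub(sd.StartTime))
}
```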

@srinivashegde86

This is the custom exporter that logs the metric for us: https://github.com/knative/pkg/blob/master/test/logging/logging.go#L71

We could update this log line to add a prefix, something like the ExportMetric constant, and then update our latency parser to check for it?

@adrcunha

> Update all locations where we start spans that need to be exported by prepending the metric name with the constant defined in step 1.

This is error-prone and implies more boilerplate code. Create a helper (see the sketch below), update the documentation.

> We could update this log line to add a prefix, something like the ExportMetric constant, and then update our latency parser to check for it?

Won't work; the end result will be the same, just reached in a more complicated way: you're still logging everything as a valid metric.
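A sketch of the kind of helper being asked for, plus one call site; EmitableSpan and ExportMetricPrefix are assumed names, not necessarily what the eventual PR uses:

```go
// Sketch only: a helper that hides the prefixing so call sites stay free of
// boilerplate. EmitableSpan/ExportMetricPrefix are assumed names.
package test

import (
	"context"
	"fmt"
	"time"

	"go.opencensus.io/trace"
)

const ExportMetricPrefix = "ExportMetric-"

// EmitableSpan starts a span that the logging exporter will emit as a metric.
func EmitableSpan(ctx context.Context, metricName string) *trace.Span {
	_, span := trace.StartSpan(ctx, ExportMetricPrefix+metricName)
	return span
}

// Example call site, roughly what a wait* helper in crd_checks.go would do.
func waitForServiceState(ctx context.Context, service, desc string, done func() bool) {
	span := EmitableSpan(ctx, fmt.Sprintf("WaitForServiceState/%s/%s", service, desc))
	defer span.End()
	for !done() {
		time.Sleep(time.Second) // poll until the Service reaches the desired state
	}
}
```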

@srinivashegde86

> 2. The places where we make this change include:

Also https://github.com/knative/pkg/blob/master/test/kube_checks.go

@dushyanthsc

OK. With the helper approach and the additional place in kube_checks incorporated, should I go ahead and make the changes?

dushyanthsc added a commit to dushyanthsc/pkg that referenced this issue Feb 14, 2019
knative/serving#3150 describes the issue
that currently exists in our test logging framework. This change
fixes the problem by prefixing metrics that need to be emitted with
a constant, which the logging.ExportSpan method then uses to identify
the spans that need to be emitted as logs.

Note this only fixes part of the issue:
knative/serving#3150
This change needs to be ported to knative/serving before the issue
can be closed.
knative-prow-robot pushed a commit to knative/pkg that referenced this issue Feb 15, 2019
* Metrics logging fix in pkg/test: Issue-3150

knative/serving#3150 describes the issue
that currently exists in our test logging framework. This change
fixes the problem by prefixing metrics that need to be emitted with
a constant, which the logging.ExportSpan method then uses to identify
the spans that need to be emitted as logs.

Note this only fixes part of the issue:
knative/serving#3150
This change needs to be ported to knative/serving before the issue
can be closed.

* Update test/logging/logging.go

Adding required lines.

Co-Authored-By: dushyanthsc <[email protected]>
dushyanthsc added a commit to dushyanthsc/serving that referenced this issue Feb 21, 2019
@adrcunha

Reopening since this also requires changes in pkg.

@dushyanthsc

@adrcunha Did I miss something in this PR: knative/pkg#279

@adrcunha

> Did I miss something in this PR: knative/pkg#279

Sorry, no. It was I who missed your PR in pkg.

dprotaso removed this from the Ice Box milestone Oct 6, 2021