Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added null guards to 'Process' when processing an incoming span #1723

Merged
merged 1 commit into from
Aug 10, 2019
Merged

Added null guards to 'Process' when processing an incoming span #1723

merged 1 commit into from
Aug 10, 2019

Conversation

jpkrohling
Copy link
Contributor

Which problem is this PR solving?

Before this PR, the collector would crash at different points (see #1722) depending on the storage being used. After this PR, the following message is shown in the logs:

{"level":"error","ts":1565269296.058833,"caller":"app/span_processor.go:114","msg":"Failed to save span","error":"process is empty for the span","stacktrace":"github.com/jaegertracing/jaeger/cmd/collector/app.(*spanProcessor).saveSpan\n\t/mnt/storage/jpkroehling/Projects/src/github.com/jaegertracing/jaeger/cmd/collector/app/span_processor.go:114\ngithub.com/jaegertracing/jaeger/cmd/collector/app.ChainedProcessSpan.func1\n\t/mnt/storage/jpkroehling/Projects/src/github.com/jaegertracing/jaeger/cmd/collector/app/model_consumer.go:34\ngithub.com/jaegertracing/jaeger/cmd/collector/app.(*spanProcessor).processItemFromQueue\n\t/mnt/storage/jpkroehling/Projects/src/github.com/jaegertracing/jaeger/cmd/collector/app/span_processor.go:139\ngithub.com/jaegertracing/jaeger/cmd/collector/app.NewSpanProcessor.func1\n\t/mnt/storage/jpkroehling/Projects/src/github.com/jaegertracing/jaeger/cmd/collector/app/span_processor.go:68\ngithub.com/jaegertracing/jaeger/pkg/queue.(*BoundedQueue).StartConsumers.func1\n\t/mnt/storage/jpkroehling/Projects/src/github.com/jaegertracing/jaeger/pkg/queue/bounded_queue.go:65"}

And this counter is incremented:

jaeger_collector_spans_saved_by_svc_total{debug="false",result="err",svc="__unknown"} 1

@jpkrohling
Copy link
Contributor Author

Another solution to this is what was suggested by @pavolloffay in the issue, by making sure the Process is never nil. We'd then have to come up with a default serviceName. For the metric reporting, I'm setting __unknown, which could also serve as the default service name in case it's empty.

Given that all clients currently fail in case the service name is empty, I'd argue that the right thing to do when the Process is empty is to indeed return an error from WriteSpan

@codecov
Copy link

codecov bot commented Aug 8, 2019

Codecov Report

Merging #1723 into master will decrease coverage by 0.04%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1723      +/-   ##
==========================================
- Coverage   98.36%   98.32%   -0.05%     
==========================================
  Files         193      193              
  Lines        9358     9364       +6     
==========================================
+ Hits         9205     9207       +2     
- Misses        119      122       +3     
- Partials       34       35       +1
Impacted Files Coverage Δ
cmd/collector/app/metrics.go 100% <100%> (ø) ⬆️
cmd/collector/app/span_processor.go 100% <100%> (ø) ⬆️
cmd/query/app/static_handler.go 83.33% <0%> (-3.51%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 98fd69a...c94efb5. Read the comment docs.

@pavolloffay
Copy link
Member

Does the storage work fine when the service name is nill? If no, we could have a sanitizer which sets the service name to some value.

@jpkrohling
Copy link
Contributor Author

Some storages fail (ES), some work (Memory). It's not really consistent across storage mechanisms.

@pavolloffay
Copy link
Member

The client libraries also fail if the service name is null.

In the codebase we do not do nil checks on the arguments. As the storage might also fails when the service name is empty I think maybe a better would be to set it to some value or fail early in the ingestion at one place.

@jpkrohling
Copy link
Contributor Author

In theory, each method should guard itself against bad data ("be conservative in what you do, be liberal in what you accept from others"). We could add a check at the spanProcessor#saveSpan, which calls all other WriteSpan methods, but that method itself isn't using the Process object.

@yurishkuro
Copy link
Member

I don't think it's necessary to spread this check so much around. Storage implementations have a contract with the caller, Process is a required element in the data model.

We should only check this on the receiving side, as close as possible to the servers, and not infect the rest of the business logic with input validation.

@jpkrohling
Copy link
Contributor Author

I just updated the PR to do the check before the call to WriteSpan. This doesn't make the WriteSpan methods more robust, but as long as we can guarantee that no such calls are made with nil as the Process we should be good.

Here's the error message for the current state of this PR when a batch with a nil Process is sent:

{"level":"error","ts":1565339952.5434508,"caller":"app/span_processor.go:113","msg":"process is empty for the span","stacktrace":"github.com/jaegertracing/jaeger/cmd/collector/app.(*spanProcessor).saveSpan\n\t/mnt/storage/jpkroehling/Projects/src/github.com/jaegertracing/jaeger/cmd/collector/app/span_processor.go:113\ngithub.com/jaegertracing/jaeger/cmd/collector/app.ChainedProcessSpan.func1\n\t/mnt/storage/jpkroehling/Projects/src/github.com/jaegertracing/jaeger/cmd/collector/app/model_consumer.go:34\ngithub.com/jaegertracing/jaeger/cmd/collector/app.(*spanProcessor).processItemFromQueue\n\t/mnt/storage/jpkroehling/Projects/src/github.com/jaegertracing/jaeger/cmd/collector/app/span_processor.go:145\ngithub.com/jaegertracing/jaeger/cmd/collector/app.NewSpanProcessor.func1\n\t/mnt/storage/jpkroehling/Projects/src/github.com/jaegertracing/jaeger/cmd/collector/app/span_processor.go:68\ngithub.com/jaegertracing/jaeger/pkg/queue.(*BoundedQueue).StartConsumers.func1\n\t/mnt/storage/jpkroehling/Projects/src/github.com/jaegertracing/jaeger/pkg/queue/bounded_queue.go:65"}

And here are the metrics that are recorded when this happens:

$ curl localhost:14269/metrics 2>/dev/null | grep __unknown 
jaeger_collector_spans_received_total{debug="false",format="proto",svc="__unknown",transport="grpc"} 1
jaeger_collector_spans_saved_by_svc_total{debug="false",result="err",svc="__unknown"} 1
jaeger_collector_traces_received_total{debug="false",format="proto",sampler_type="unknown",svc="__unknown",transport="grpc"} 1
jaeger_collector_traces_saved_by_svc_total{debug="false",result="err",sampler_type="unknown",svc="__unknown"} 1

@jpkrohling
Copy link
Contributor Author

@pavolloffay , @yurishkuro , I believe this is now in line with that you suggested. Could you please take a look?

@yurishkuro yurishkuro merged commit 3d3483f into jaegertracing:master Aug 10, 2019
@jpkrohling jpkrohling deleted the 1722-NullGuardProcessOnSave branch July 28, 2021 19:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

gRPC batch without service name crashes the collector
3 participants