Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tailsampling: only send to next consumer once #1735

Merged
merged 6 commits into from
Sep 24, 2020

Conversation

chris-smith-zocdoc
Copy link
Contributor

Description:

Previously if multiple sampling policies chose to sample, the trace would be send to the next consumer multiple times. This fixes it to only send it once.

Link to tracking Issue: #1514

Testing:
Unit test added as well as a simple manual test using the reproduction case in the original issue.

Documentation:
None. Do I need to add any?

@codecov
Copy link

codecov bot commented Sep 3, 2020

Codecov Report

Merging #1735 into master will increase coverage by 0.10%.
The diff coverage is 97.67%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1735      +/-   ##
==========================================
+ Coverage   91.15%   91.26%   +0.10%     
==========================================
  Files         272      272              
  Lines       16256    16263       +7     
==========================================
+ Hits        14818    14842      +24     
+ Misses       1010      996      -14     
+ Partials      428      425       -3     
Impacted Files Coverage Δ
...mplingprocessor/tailsamplingprocessor/processor.go 74.03% <97.67%> (+7.37%) ⬆️
translator/internaldata/resource_to_oc.go 91.48% <0.00%> (+4.25%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8690937...fd7e40c. Read the comment docs.

@chris-smith-zocdoc
Copy link
Contributor Author

contrib-test is failing on an unrelated test.

--- FAIL: TestMetrics (2.00s)
    observability_test.go:93: timedout waiting for metrics to arrive

@tigrannajaryan tigrannajaryan self-assigned this Sep 11, 2020
statCountTracesSampled.M(int64(1)),
)
decisionNotSampled++
finalDecision := sampling.NotSampled
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please add some comments to explain what the code is doing? Even better would be to break this function into smaller ones, it is very large and difficult to understand.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree this code is difficult to understand. Instead of trying to fix/refactor the existing implementation do you think its worthwhile to rewrite this processor to rely on the new groupbytrace? The implementation would be much simpler.

I asked @jpkrohling about this and he was open to the idea. I can create an issue to discuss further if you'd like.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@chris-smith-zocdoc are you suggesting to discard this PR and instead submit a bigger refactoring? I am open to the idea, but it depends on what exactly is the refactoring.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to open a separate issue for it. With the deadlock bug I found its not something we could do yet anyway.

)
}

func (tsp *tailSamplingSpanProcessor) makeDecision(id idbatcher.ID, trace *sampling.TraceData) (sampling.Decision, *Policy, int64, int64, int64) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tigrannajaryan I pulled this method out, hopefully that makes the ontick a little more clear. A large amount of this code is purely for metrics

// any single policy that decides to sample will cause the decision to be sampled
// the nextConsumer will get the context from the first matching policy
finalDecision = sampling.Sampled
if matchingPolicy == nil {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only reason this is necessary is to get the matching context from the policy which contains tags for the policy name

Comment on lines +162 to +167
trace.Lock()
traceBatches := trace.ReceivedBatches
trace.ReceivedBatches = nil
trace.Unlock()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previously the lock was taken twice when a policy matched, I just combined them into a single acquisition

@chris-smith-zocdoc
Copy link
Contributor Author

chris-smith-zocdoc commented Sep 22, 2020

unrelated test failure

=== RUN   TestHTTPInvalidTLSCredentials
    otlp_test.go:615: 
        	Error Trace:	otlp_test.go:615
        	Error:      	Error message not equal:
        	            	expected: "failed to load TLS config: for auth via TLS, either both certificate and key must be supplied, or neither"
        	            	actual  : "listen tcp 127.0.0.1:50000: bind: address already in use"
        	Test:       	TestHTTPInvalidTLSCredentials
--- FAIL: TestHTTPInvalidTLSCredentials (0.00s)

@chris-smith-zocdoc
Copy link
Contributor Author

another unrelated test failure

TestProcessManagerArgs - fluentbitextension

)
}

func (tsp *tailSamplingSpanProcessor) makeDecision(id idbatcher.ID, trace *sampling.TraceData) (sampling.Decision, *Policy, int64, int64, int64) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is very easy to have a bug when the function returns 3 int64 parameters if you accidentally swap the order here in the return statement or at the call site.
One possible way to avoid this is to have a struct with fields and return it instead.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will fix

if err != nil {
trace.Decisions[i] = sampling.NotSampled
evaluateErrorCount++
tsp.logger.Error("Sampling policy error", zap.Error(err))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tigrannajaryan
Copy link
Member

Please fix the conflicting files.

@tigrannajaryan
Copy link
Member

related test failure

=== RUN   TestHTTPInvalidTLSCredentials
    otlp_test.go:615: 
        	Error Trace:	otlp_test.go:615
        	Error:      	Error message not equal:
        	            	expected: "failed to load TLS config: for auth via TLS, either both certificate and key must be supplied, or neither"
        	            	actual  : "listen tcp 127.0.0.1:50000: bind: address already in use"
        	Test:       	TestHTTPInvalidTLSCredentials
--- FAIL: TestHTTPInvalidTLSCredentials (0.00s)

Did you see this locally? I cannot find this failure on CircleCI.

@chris-smith-zocdoc
Copy link
Contributor Author

@tigrannajaryan the failure was in a previous commit, here is the run https://app.circleci.com/pipelines/github/open-telemetry/opentelemetry-collector/3605/workflows/78743f26-5ef1-4b37-a886-5cfadbd4877f/jobs/38229/steps

@chris-smith-zocdoc
Copy link
Contributor Author

Made requested changes.

Contrib tests are also failing on master

@tigrannajaryan tigrannajaryan merged commit 3cfaa77 into open-telemetry:master Sep 24, 2020
@chris-smith-zocdoc chris-smith-zocdoc deleted the issue_1514 branch October 1, 2020 18:45
Troels51 pushed a commit to Troels51/opentelemetry-collector that referenced this pull request Jul 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants