Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[agent] Process data loss stats from clients #2010

Merged
merged 23 commits into from
Jan 15, 2020

Conversation

yurishkuro
Copy link
Member

@yurishkuro yurishkuro commented Jan 9, 2020

Which problem is this PR solving?

Short description of the changes

  • Keeps track of stats reported by clients in jaegerThrift.Batch, per client-uuid
  • Emits corresponding metrics
  • Cleans-up the per-client-uuid cache periodically

This is how new metrics look:

$ curl -s http://localhost:14269/metrics | grep client_stats
# HELP jaeger_agent_client_stats_batches_sent_total Total count of batches sent by clients
# TYPE jaeger_agent_client_stats_batches_sent_total counter
jaeger_agent_client_stats_batches_sent_total 4
# HELP jaeger_agent_client_stats_connected_clients Total count of unique clients sending data to the agent
# TYPE jaeger_agent_client_stats_connected_clients gauge
jaeger_agent_client_stats_connected_clients 1
# HELP jaeger_agent_client_stats_spans_dropped_total Total count of spans dropped by clients
# TYPE jaeger_agent_client_stats_spans_dropped_total counter
jaeger_agent_client_stats_spans_dropped_total{cause="full-queue"} 0
jaeger_agent_client_stats_spans_dropped_total{cause="send-failure"} 0
jaeger_agent_client_stats_spans_dropped_total{cause="too-large"} 0

TODO

  • Pin correct jaeger-client-go and jaeger-idl once their respective PRs are merged
  • Add unit tests for new functionality

@yurishkuro yurishkuro changed the title [agent] Process data loss stats from clients WIP - [agent] Process data loss stats from clients Jan 9, 2020
Copy link
Contributor

@jpkrohling jpkrohling left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could use a bit more logging to my taste, but looks like a promising first draft.

Makefile Outdated Show resolved Hide resolved
cmd/agent/app/reporter/client_metrics.go Show resolved Hide resolved
cmd/agent/app/reporter/client_metrics.go Outdated Show resolved Hide resolved
cmd/agent/app/reporter/client_metrics.go Outdated Show resolved Hide resolved
@yurishkuro yurishkuro force-pushed the data-loss-reporting branch from 2cbd9e3 to 756bc71 Compare January 9, 2020 18:33
@codecov
Copy link

codecov bot commented Jan 10, 2020

Codecov Report

Merging #2010 into master will increase coverage by 0.02%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #2010      +/-   ##
==========================================
+ Coverage   97.38%   97.41%   +0.02%     
==========================================
  Files         206      207       +1     
  Lines       10136    10243     +107     
==========================================
+ Hits         9871     9978     +107     
  Misses        222      222              
  Partials       43       43
Impacted Files Coverage Δ
cmd/agent/app/processors/thrift_processor.go 100% <ø> (ø) ⬆️
cmd/agent/app/reporter/client_metrics.go 100% <100%> (ø)
cmd/agent/app/reporter/tchannel/collector_proxy.go 100% <100%> (ø) ⬆️
cmd/agent/app/reporter/metrics.go 100% <100%> (ø) ⬆️
cmd/agent/app/reporter/grpc/collector_proxy.go 100% <100%> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 641dd92...4d2f581. Read the comment docs.

Yuri Shkuro added 17 commits January 13, 2020 14:55
Signed-off-by: Yuri Shkuro <[email protected]>
Signed-off-by: Yuri Shkuro <[email protected]>
Signed-off-by: Yuri Shkuro <[email protected]>
Signed-off-by: Yuri Shkuro <[email protected]>
Signed-off-by: Yuri Shkuro <[email protected]>
Signed-off-by: Yuri Shkuro <[email protected]>
Signed-off-by: Yuri Shkuro <[email protected]>
Signed-off-by: Yuri Shkuro <[email protected]>
Signed-off-by: Yuri Shkuro <[email protected]>
Signed-off-by: Yuri Shkuro <[email protected]>
Signed-off-by: Yuri Shkuro <[email protected]>
Signed-off-by: Yuri Shkuro <[email protected]>
Signed-off-by: Yuri Shkuro <[email protected]>
Signed-off-by: Yuri Shkuro <[email protected]>
Signed-off-by: Yuri Shkuro <[email protected]>
Signed-off-by: Yuri Shkuro <[email protected]>
Signed-off-by: Yuri Shkuro <[email protected]>
@yurishkuro yurishkuro changed the title WIP - [agent] Process data loss stats from clients [agent] Process data loss stats from clients Jan 14, 2020
@yurishkuro
Copy link
Member Author

this is ready for review

Copy link
Contributor

@jpkrohling jpkrohling left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good so far. This PR assumes that clients will only talk to a single agent for the whole lifecycle of the client, which is a reasonable assumption. Once there's a PR ready to be tested on the client side, I'll run an end-to-end manual test.

cmd/agent/app/reporter/client_metrics_test.go Outdated Show resolved Hide resolved
cmd/agent/app/reporter/client_metrics_test.go Outdated Show resolved Hide resolved
cmd/agent/app/reporter/client_metrics.go Show resolved Hide resolved
cmd/agent/app/reporter/client_metrics.go Show resolved Hide resolved
cmd/agent/app/reporter/client_metrics_test.go Show resolved Hide resolved
cmd/agent/app/reporter/metrics_test.go Outdated Show resolved Hide resolved
@yurishkuro
Copy link
Member Author

Once there's a PR ready to be tested on the client side, I'll run an end-to-end manual test.

@jpkrohling the working client version is already included here as dependency (jaegertracing/jaeger-client-go#482), and the functionality can be tested with all-in-one (since query service is self-tracing).

Signed-off-by: Yuri Shkuro <[email protected]>
Copy link
Contributor

@jpkrohling jpkrohling left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's just figure out the debug-logging discussion and add the expected gauges/counters to a test, and it's ready to be merged.

Yuri Shkuro added 4 commits January 15, 2020 11:32
Signed-off-by: Yuri Shkuro <[email protected]>
Signed-off-by: Yuri Shkuro <[email protected]>
Signed-off-by: Yuri Shkuro <[email protected]>
Signed-off-by: Yuri Shkuro <[email protected]>
@@ -25,22 +25,28 @@ import (

// ProxyBuilder holds objects communicating with collector
type ProxyBuilder struct {
reporter reporter.Reporter
reporter *reporter.ClientMetricsReporter
manager configmanager.ClientConfigManager
tchanRep *Reporter
}

// NewCollectorProxy creates ProxyBuilder
func NewCollectorProxy(builder *Builder, mFactory metrics.Factory, logger *zap.Logger) (*ProxyBuilder, error) {
Copy link
Contributor

@guo0693 guo0693 Jan 15, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a little bit irrelevant to data loss, we don't need to change it in this PR. But usually Builder.NewXXX means build a new XXX, seems more reasonable for NewCollectorProxy to create a CollectorProxy instead of ProxyBuilder.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, the naming here has become weird overtime.

@@ -43,10 +43,6 @@ func (c *tchanCollectorClient) SubmitBatches(ctx thrift.Context, batches []*Batc
}
success, err := c.client.Call(ctx, c.thriftService, "submitBatches", &args, &resp)
if err == nil && !success {
switch {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems generated deletion looks weird. Why no logging?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't help it, it's how the recent generator works. But the new code looks correct - propagating the error up, rather than introducing that odd override.

Also, low-level libraries are not supposed to log.

tchanRep: r,
reporter: reporter.WrapWithMetrics(r, tchannelMetrics),
manager: configmanager.WrapWithMetrics(tchannel.NewConfigManager(r.CollectorServiceName(), r.Channel()), tchannelMetrics),
tchanRep: tchanRep,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need to export tchanRep? Is tchanRep used any where? If not, we could remove tchanRep from below struct:
type ProxyBuilder struct { reporter reporter.Reporter manager configmanager.ClientConfigManager tchanRep *Reporter }

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it's unfortunate side effect of the way constructors are defined. I think the Sampling server depends on the underlying tchannel that is being created here.

@@ -39,9 +39,16 @@ func NewCollectorProxy(builder *ConnBuilder, agentTags map[string]string, mFacto
return nil, err
}
grpcMetrics := mFactory.Namespace(metrics.NSOptions{Name: "", Tags: map[string]string{"protocol": "grpc"}})
r1 := NewReporter(conn, agentTags, logger)
r2 := reporter.WrapWithMetrics(r1, grpcMetrics)
r3 := reporter.WrapWithClientMetrics(reporter.ClientMetricsReporterParams{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to have a generic WrappedReporter? Essentially r2 & r3 are wrapped reporter with callbacks around EmitBatch and EmitZipkinBatch. It will be great if we can have chained wrapper, something like:
NewReporter(...).wrapWith().wrapWith().

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • in what sense "generic"?

Copy link
Contributor

@guo0693 guo0693 Jan 15, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't thought it through, not sure if it's totally possible or not. Some rough idea as below:

type WrappedReporter interface {
	EmitZipkinBatch(spans []*zipkincore.Span) (err error)
	EmitBatch(batch *jaeger.Batch) (err error)
        WrapWith(func onEmitZipkinBatch, func onEmitBatch) (*WrappedReporter, error)
}

We may need to solve the problem whether the callback is called before or after EmitBatch, seems like metrics.go & client_metrics.go do those operations in different order, one before while the other one after.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what benefits would this provide? The two existing decorators have specific things they are doing before or after calling the underlying, there's little room to generalize there. And WrapWith as part of the interface smells like an anti-pattern due to unnecessary coupling, i.e. each of the two decorators does not need to know if it itself is being decorated further.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I read the code, I felt confused when I first saw metrics.go and client_metrics.go, after spent some time reading, I understood they are reporters wrapped with different call backs.

Regarding detailed implementation of the callbacks, yes they will be very different and no need to be generalized. The generalization is for better readability and extendability I guess, it will be easy to wrap another callback around emit calls in the future instead of implement a new reader if necessary .

I agree, the decorators don't need to know if they will be decorated further or not. I just got the idea from chained promise from javascript, it will be good if we can support similar feature here. This way, you don't need:

r1 := NewReporter(conn, agentTags, logger)
r2 := reporter.WrapWithMetrics(r1, grpcMetrics)
r3 := reporter.WrapWithClientMetrics(...)

You just need:

NewReporter(...).wrapWith(//r2 functionalities).wrapWith(//r3 functinoalities) 

Copy link
Contributor

@guo0693 guo0693 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM overall, it just take me sometime to understand the difference between metrics.go (reporter wrapped with metrics) and client_metrics.go (reporter wrapped with client metrics). It will be great if we can generalize the wrapped reporter, if it's not feasible, we can merge the PR as of now.

@yurishkuro
Copy link
Member Author

I think I addressed @jpkrohling comments, so merging.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants