Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enabled DNS as the service discovery mechanism for agent => collector communication #333

Conversation

jpkrohling
Copy link
Contributor

@jpkrohling jpkrohling commented Mar 19, 2019

Closes #332

Signed-off-by: Juraci Paixão Kröhling [email protected]

@jpkrohling jpkrohling requested a review from pavolloffay March 19, 2019 15:24
@jpkrohling
Copy link
Contributor Author

This change is Reviewable

@codecov
Copy link

codecov bot commented Mar 19, 2019

Codecov Report

Merging #333 into master will increase coverage by 0.02%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #333      +/-   ##
==========================================
+ Coverage   89.73%   89.75%   +0.02%     
==========================================
  Files          64       64              
  Lines        2991     2997       +6     
==========================================
+ Hits         2684     2690       +6     
  Misses        207      207              
  Partials      100      100
Impacted Files Coverage Δ
pkg/util/util.go 100% <100%> (ø) ⬆️
pkg/inject/sidecar.go 100% <100%> (ø) ⬆️
pkg/storage/elasticsearch.go 81.53% <100%> (-0.69%) ⬇️
pkg/deployment/agent.go 100% <100%> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ed25b64...0d9c2a7. Read the comment docs.

@jpkrohling jpkrohling changed the title Enabled DNS as the service discovery mechanism for agent => collector communication WIP - Enabled DNS as the service discovery mechanism for agent => collector communication Mar 19, 2019
@jpkrohling jpkrohling force-pushed the 332-Turn-on-service-discovery-for-gRPC branch from dc65359 to ff3856e Compare March 19, 2019 15:51
@jpkrohling jpkrohling changed the title WIP - Enabled DNS as the service discovery mechanism for agent => collector communication Enabled DNS as the service discovery mechanism for agent => collector communication Mar 19, 2019
@jpkrohling
Copy link
Contributor Author

Tested this manually by using the simple-prod.yaml as base but setting Replicas to 2 for the collector:

# setup an elasticsearch with `make es`
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: simple-prod
spec:
  strategy: production
  collector:
    replicas: 2
  storage:
    type: elasticsearch
    options:
      es:
        server-urls: http://elasticsearch:9200

Then, hit the Jaeger UI a few times and queried each collector individually:

$ kubectl get pods
NAME                                    READY     STATUS    RESTARTS   AGE
elasticsearch-0                         1/1       Running   0          29m
simple-prod-collector-bf94c5d9f-6qhr8   1/1       Running   0          28m
simple-prod-collector-bf94c5d9f-pzkkp   1/1       Running   0          27m
simple-prod-query-695f65d7f-77st7       2/2       Running   0          13m

$ kubectl port-forward simple-prod-collector-bf94c5d9f-pzkkp 14268:14268

$ curl localhost:14268/metrics | grep jaeger_collector_spans_received_total | grep jaeger-query
jaeger_collector_spans_received_total{debug="false",format="jaeger",svc="jaeger-query"} 30

$ kubectl port-forward simple-prod-collector-bf94c5d9f-6qhr8 14268:14268

$ curl localhost:14268/metrics | grep jaeger_collector_spans_received_total | grep jaeger-query
jaeger_collector_spans_received_total{debug="false",format="jaeger",svc="jaeger-query"} 41

@jpkrohling
Copy link
Contributor Author

E2E tests are passing:

$ make test
Running unit tests...
?   	github.com/jaegertracing/jaeger-operator/cmd	[no test files]
?   	github.com/jaegertracing/jaeger-operator/cmd/manager	[no test files]
ok  	github.com/jaegertracing/jaeger-operator/pkg/account	0.023s	coverage: 100.0% of statements
?   	github.com/jaegertracing/jaeger-operator/pkg/apis	[no test files]
ok  	github.com/jaegertracing/jaeger-operator/pkg/apis/io/v1alpha1	0.020s	coverage: 15.7% of statements
ok  	github.com/jaegertracing/jaeger-operator/pkg/apis/jaegertracing/v1	0.025s	coverage: 16.6% of statements
?   	github.com/jaegertracing/jaeger-operator/pkg/cmd/start	[no test files]
?   	github.com/jaegertracing/jaeger-operator/pkg/cmd/version	[no test files]
ok  	github.com/jaegertracing/jaeger-operator/pkg/config/sampling	0.025s	coverage: 94.1% of statements
ok  	github.com/jaegertracing/jaeger-operator/pkg/config/ui	0.014s	coverage: 94.1% of statements
?   	github.com/jaegertracing/jaeger-operator/pkg/controller	[no test files]
?   	github.com/jaegertracing/jaeger-operator/pkg/controller/deployment	[no test files]
ok  	github.com/jaegertracing/jaeger-operator/pkg/controller/jaeger	2.237s	coverage: 70.5% of statements
?   	github.com/jaegertracing/jaeger-operator/pkg/controller/legacy	[no test files]
ok  	github.com/jaegertracing/jaeger-operator/pkg/cronjob	0.022s	coverage: 97.0% of statements
ok  	github.com/jaegertracing/jaeger-operator/pkg/deployment	0.027s	coverage: 100.0% of statements
ok  	github.com/jaegertracing/jaeger-operator/pkg/ingress	0.020s	coverage: 100.0% of statements
ok  	github.com/jaegertracing/jaeger-operator/pkg/inject	0.015s	coverage: 100.0% of statements
ok  	github.com/jaegertracing/jaeger-operator/pkg/inventory	0.017s	coverage: 91.8% of statements
ok  	github.com/jaegertracing/jaeger-operator/pkg/route	0.052s	coverage: 100.0% of statements
ok  	github.com/jaegertracing/jaeger-operator/pkg/service	0.013s	coverage: 100.0% of statements
ok  	github.com/jaegertracing/jaeger-operator/pkg/storage	0.620s	coverage: 94.0% of statements
?   	github.com/jaegertracing/jaeger-operator/pkg/storage/elasticsearch/v1alpha1	[no test files]
ok  	github.com/jaegertracing/jaeger-operator/pkg/strategy	0.083s	coverage: 89.5% of statements
ok  	github.com/jaegertracing/jaeger-operator/pkg/util	0.005s	coverage: 100.0% of statements
?   	github.com/jaegertracing/jaeger-operator/pkg/version	[no test files]
Formatting code...
Building...
Sending build context to Docker daemon    159MB
Step 1/6 : FROM centos
 ---> 1e1148e4cc2c
Step 2/6 : RUN INSTALL_PKGS="       openssl       " &&     yum install -y $INSTALL_PKGS &&     rpm -V $INSTALL_PKGS &&     yum clean all &&     mkdir /tmp/_working_dir &&     chmod og+w /tmp/_working_dir
 ---> Using cache
 ---> 82728e7bfc7b
Step 3/6 : COPY scripts/* /scripts/
 ---> Using cache
 ---> 80a00c2ba02b
Step 4/6 : USER nobody
 ---> Using cache
 ---> f576eac46724
Step 5/6 : ADD build/_output/bin/jaeger-operator /usr/local/bin/jaeger-operator
 ---> 3e0d485fc23c
Step 6/6 : ENTRYPOINT ["/usr/local/bin/jaeger-operator"]
 ---> Running in be69a78e5860
Removing intermediate container be69a78e5860
 ---> 91ae6d3d07a6
Successfully built 91ae6d3d07a6
Successfully tagged jpkroehling/jaeger-operator:latest
Pushing image jpkroehling/jaeger-operator:latest...
Running Smoke end-to-end tests...
ok  	github.com/jaegertracing/jaeger-operator/test/e2e	196.789s
Creating namespace default
service/cassandra created
statefulset.apps/cassandra created
Running Cassandra end-to-end tests...
ok  	github.com/jaegertracing/jaeger-operator/test/e2e	136.221s
Running Elasticsearch end-to-end tests...
ok  	github.com/jaegertracing/jaeger-operator/test/e2e	206.634s

@objectiser
Copy link
Contributor

@jpkrohling Before merging, do we want to make this change conditional on a property in the CR? So if headless services, then dns: would be used, but if non-headless services are used, then it would remain http:?

@jkandasa
Copy link
Member

@jpkrohling does this change work, If I change the reporter type to tchannel? (by passing the option reporter.tchannel.host-port)

@pavolloffay
Copy link
Member

@objectiser there is never http for gprc. Having a property in CR to load balance seems silly - we should be doing it always. If the dns does not work with non-headless we should try to do load balancing differently.

It would be worth to test the behavior with non-headless collector service.

@jpkrohling
Copy link
Contributor Author

does this change work, If I change the reporter type to tchannel? (by passing the option reporter.tchannel.host-port)

No.

@jpkrohling
Copy link
Contributor Author

Before merging, do we want to make this change conditional on a property in the CR?

I think we can merge this and keep discussing about the possible load balancing alternatives + headless services. Instead of changing the headless service, we could perhaps provide an extra service.

@objectiser
Copy link
Contributor

@jpkrohling sure

@jkandasa
Copy link
Member

does this change work, If I change the reporter type to tchannel? (by passing the option reporter.tchannel.host-port)

No.

I mean, if user supplies reporter.tchannel.host-port as agent option in CR, what will happen?
Does the agent work ok with tchannel protocol?

@objectiser
Copy link
Contributor

objectiser commented Mar 19, 2019

@jpkrohling saw your message on IRC - just wondering if, before considering multiple collector services (headless and non-headless), whether it would first be worth testing just having non-headless collector and revert the dns:/// change? So use kube-proxy (ClusterIP) to see if that load balances the grpc connections.
From the article Pavol found, it seems to imply non-headless services won't work - routing all requests to a single pod, but might be worth testing it out to be sure, as it would simplify the backend config if there was only a single service for collector.

@jpkrohling
Copy link
Contributor Author

Just tried it out and it does not load balance: I got 138 spans reported to one collector and 0 to the second. The code is basically master + this diff:

diff --git a/pkg/inventory/service.go b/pkg/inventory/service.go
index 6b45e62..d257d5d 100644
--- a/pkg/inventory/service.go
+++ b/pkg/inventory/service.go
@@ -21,6 +21,11 @@ func ForServices(existing []v1.Service, desired []v1.Service) Service {
                if t, ok := mdelete[k]; ok {
                        tp := t.DeepCopy()
 
+                       // we keep the ClusterIP that got assigned by the cluster, if it's empty in the "desired" and not empty on the "current"
+                       if v.Spec.ClusterIP == "" && len(tp.Spec.ClusterIP) > 0 {
+                               v.Spec.ClusterIP = tp.Spec.ClusterIP
+                       }
+
                        // we can't blindly DeepCopyInto, so, we select what we bring from the new to the old object
                        tp.Spec = v.Spec
                        tp.ObjectMeta.OwnerReferences = v.ObjectMeta.OwnerReferences
diff --git a/pkg/service/collector.go b/pkg/service/collector.go
index c8ccbd3..c386b47 100644
--- a/pkg/service/collector.go
+++ b/pkg/service/collector.go
@@ -41,7 +41,7 @@ func NewCollectorService(jaeger *v1.Jaeger, selector map[string]string) *corev1.
                },
                Spec: corev1.ServiceSpec{
                        Selector:  selector,
-                       ClusterIP: "None",
+                       ClusterIP: "",
                        Ports: []corev1.ServicePort{
                                {
                                        Name: "zipkin",

@jpkrohling
Copy link
Contributor Author

In summary: it looks like the only combination where we can get load balancing by default, without efforts by the user, is the current code in this PR.

@jkandasa had a great question about what happens when the user attempts to override the reporter type. I'm currently testing that and this will likely result in changes to this PR.

@jpkrohling jpkrohling changed the title Enabled DNS as the service discovery mechanism for agent => collector communication WIP - Enabled DNS as the service discovery mechanism for agent => collector communication Mar 20, 2019
@objectiser
Copy link
Contributor

@jpkrohling pity - so will you be adding the second collector service (so have headless and non-headless) as part of this PR? We also need to make the query service non-headless - separate PR or this?

@jpkrohling
Copy link
Contributor Author

so will you be adding the second collector service (so have headless and non-headless) as part of this PR

As per #264, I think we might not actually need a non-headless service for the collector, only for query, which might not need to be a headless service. In any case, I'll handle the query in another PR.

@objectiser
Copy link
Contributor

@jpkrohling We may still need a non-headless service for collector for the tracers when using HTTPSender? We would want the tracers to be load balanced across collectors.

@jpkrohling jpkrohling force-pushed the 332-Turn-on-service-discovery-for-gRPC branch from ff3856e to bd1518f Compare March 20, 2019 13:30
@jpkrohling
Copy link
Contributor Author

I just tried changing the PR to include a new service for the collector, so that we'd have two services. It has no effect, possibly because OkHttp will keep the TCP connections open, making it always land (preferably) at the same server.

Here's a gist with more details: https://gist.github.com/jpkrohling/406617419fa8de47910c832c44e5e2be

The application used for the test is the deploy/examples/business-application-injected-sidecar.yaml, with the JAEGER_ENDPOINT set. Logs from the application confirm it's using the HttpSender (see gist).

Here's the branch I used for the test:
https://github.com/jpkrohling/jaeger-operator/tree/332-Dual-services-for-collector

I understand the scope of this PR here is about the agent => collector communication. I just fixed the issue that @jkandasa reported, so, I believe this is now ready for a final review + merge.

@jpkrohling jpkrohling changed the title WIP - Enabled DNS as the service discovery mechanism for agent => collector communication Enabled DNS as the service discovery mechanism for agent => collector communication Mar 20, 2019
pkg/util/util.go Outdated Show resolved Hide resolved
@objectiser
Copy link
Contributor

@jpkrohling Yes a persistent connection would be used, so we need to test that solution with multiple app instances. But you are right, it is not necessary for this PR.

pkg/util/util.go Outdated Show resolved Hide resolved
@jpkrohling
Copy link
Contributor Author

Yes a persistent connection would be used, so we need to test that solution with multiple app instances. But you are right, it is not necessary for this PR.

The example application has actually two tracers, and it does show that spans from one tracer instance (order) lands at one collector and the other instance (inventory) in the second collector. The third collector received zero spans as a result.

@objectiser
Copy link
Contributor

Ok, so the two service approach look like it would work - so we just need to decide on service names (but possibly still better in separate PR).

Signed-off-by: Juraci Paixão Kröhling <[email protected]>
@jpkrohling
Copy link
Contributor Author

E2E tests are passing:

Running Smoke end-to-end tests...
ok  	github.com/jaegertracing/jaeger-operator/test/e2e	106.717s
Creating namespace default
Running Cassandra end-to-end tests...
ok  	github.com/jaegertracing/jaeger-operator/test/e2e	136.134s
Running Elasticsearch end-to-end tests...
ok  	github.com/jaegertracing/jaeger-operator/test/e2e	91.733s

@jpkrohling jpkrohling merged commit 0772935 into jaegertracing:master Mar 20, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants