[oc] Auto create TLS cert in collector deployment #914

Merged: 21 commits merged into jaegertracing:master on Mar 5, 2020

@annanay25 annanay25 commented Feb 18, 2020

For the OpenShift platform:

  • Annotate the collector service with service.beta.openshift.io/serving-cert-secret-name
  • Mount the auto-created secret (containing the certificate and key) on the pod and pass it on the CLI to the collector's gRPC server
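
The two steps above can be sketched roughly as follows. This is only an illustration and not the operator's actual code: the helper names `serviceAnnotations` and `collectorTLSArgs` and the mount path `/etc/tls-config` are assumptions, while the annotation key, the `--collector.grpc.tls.*` flags, and the `tls.crt`/`tls.key` file names are the ones that appear in this conversation.

```go
package main

import (
	"fmt"
	"path"
)

// servingCertAnnotation is the annotation named in the PR description. On
// OpenShift it asks the platform to generate a serving certificate and store
// it in the secret given as the annotation value.
const servingCertAnnotation = "service.beta.openshift.io/serving-cert-secret-name"

// serviceAnnotations (hypothetical helper) returns the annotations to put on
// the collector service so that the secret secretName gets created.
func serviceAnnotations(secretName string) map[string]string {
	return map[string]string{servingCertAnnotation: secretName}
}

// collectorTLSArgs (hypothetical helper) returns the collector flags that
// point the gRPC server at the certificate and key inside the mounted secret.
// mountPath is wherever the secret volume is mounted in the pod.
func collectorTLSArgs(mountPath string) []string {
	return []string{
		"--collector.grpc.tls.enabled=true",
		"--collector.grpc.tls.cert=" + path.Join(mountPath, "tls.crt"),
		"--collector.grpc.tls.key=" + path.Join(mountPath, "tls.key"),
	}
}

func main() {
	// Secret name taken from the kubectl commands later in this thread;
	// the mount path is an assumption made for this sketch.
	fmt.Println(serviceAnnotations("simple-prod-collector-headless-tls"))
	for _, arg := range collectorTLSArgs("/etc/tls-config") {
		fmt.Println(arg)
	}
}
```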

Resolves: #599

Signed-off-by: Annanay [email protected]

@jpkrohling jpkrohling left a comment

Looks really good, thanks for this PR! I have only a couple of comments that would make our lives easier in the future.

The lint step also failed because of tls.TLSConfig. Rename to tls.Config and it's all good :-)

pkg/config/tls/tls.go (outdated, resolved)
pkg/config/tls/tls.go (outdated, resolved)
pkg/deployment/collector.go (outdated, resolved)

jpkrohling commented Feb 18, 2020

cc @kevinearls, @jkandasa could one of you please give this a try? It should be sufficient to just run make run instead of building an assembly via openshift-courier.

@annanay25

Thanks for the review @jpkrohling. I've addressed comments :)

Signed-off-by: Annanay <[email protected]>
Signed-off-by: Annanay <[email protected]>
@jpkrohling

Looks good! The mentioned PR has been merged, so, this can be rebased against master again.

@kevinearls, @jkandasa, could one of you please test this?

pkg/config/tls/tls.go (outdated, resolved)
pkg/config/tls/tls.go (outdated, resolved)
@jpkrohling

@annanay25 do you need a cluster to test this?

@annanay25

do you need a cluster to test this?

Yes, that would be great, @jpkrohling :) I need to confirm whether service.beta.openshift.io/inject-cabundle=true injects both the cert and the key (I think it should, since it says ca-bundle).

@jpkrohling

Just sent you the details to your OpenShift cluster in private, via Gitter.

annanay25 commented Feb 25, 2020

There seems to be one warning:

Annanays-Mac:jaeger-operator annanay$ kubectl logs -f pod/simple-prod-collector-649fddfc7c-xxd65
2020/02/25 14:21:03 maxprocs: Leaving GOMAXPROCS=2: CPU quota undefined
{"level":"info","ts":1582640463.727751,"caller":"flags/service.go:115","msg":"Mounting metrics handler on admin server","route":"/metrics"}
{"level":"info","ts":1582640463.728069,"caller":"flags/admin.go:108","msg":"Mounting health check on admin server","route":"/"}
{"level":"info","ts":1582640463.7281487,"caller":"flags/admin.go:114","msg":"Starting admin HTTP server","http-port":14269}
{"level":"info","ts":1582640463.7281687,"caller":"flags/admin.go:100","msg":"Admin server started","http-port":14269,"health-status":"unavailable"}
{"level":"info","ts":1582640463.783358,"caller":"config/config.go:172","msg":"Elasticsearch detected","version":5}
{"level":"info","ts":1582640463.8457026,"caller":"collector/main.go:128","msg":"Starting jaeger-collector TChannel server","port":14267}
{"level":"warn","ts":1582640463.8457642,"caller":"collector/main.go:129","msg":"TChannel has been deprecated and will be removed in a future release"}
{"level":"info","ts":1582640463.9107623,"caller":"grpcserver/grpc_server.go:64","msg":"Starting jaeger-collector gRPC server","grpc-port":"14250"}
{"level":"info","ts":1582640463.9108763,"caller":"collector/main.go:148","msg":"Starting jaeger-collector HTTP server","http-port":14268}
{"level":"info","ts":1582640463.9109273,"caller":"healthcheck/handler.go:128","msg":"Health Check state change","status":"ready"}
{"level":"info","ts":1582640463.936088,"caller":"collector/main.go:240","msg":"Listening for Zipkin HTTP traffic","zipkin.http-port":9411}
WARNING: 2020/02/25 14:21:32 grpc: Server.Serve failed to complete security handshake from "10.129.2.19:60554": tls: first record does not look like a TLS handshake
WARNING: 2020/02/25 14:21:33 grpc: Server.Serve failed to complete security handshake from "10.129.2.19:60560": tls: first record does not look like a TLS handshake

I think it's a client trying to connect to the collector without using TLS, but @jpkrohling, could you PTAL?
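
That reading of the warning can be reproduced in isolation with Go's standard library. The sketch below is an illustration and not Jaeger code: the function name `plaintextHandshakeError` is made up for the example, and it uses an in-memory connection instead of the collector's real gRPC port. One side acts as a TLS server, the other sends the cleartext HTTP/2 connection preface, which is what a gRPC client without TLS sends first.

```go
package main

import (
	"crypto/tls"
	"fmt"
	"net"
)

// plaintextHandshakeError (hypothetical helper) runs a TLS server handshake
// on one end of an in-memory connection while the other end speaks cleartext
// HTTP/2. It returns the error reported by the server side.
func plaintextHandshakeError() error {
	clientSide, serverSide := net.Pipe()
	defer serverSide.Close()

	go func() {
		defer clientSide.Close()
		// The HTTP/2 client connection preface, sent unencrypted.
		_, _ = clientSide.Write([]byte("PRI * HTTP/2.0\r\n\r\nSM\r\n\r\n"))
	}()

	// No certificate is configured here: the handshake is expected to fail
	// while reading the first record, before a certificate would be needed.
	return tls.Server(serverSide, &tls.Config{}).Handshake()
}

func main() {
	fmt.Println(plaintextHandshakeError())
}
```

The server rejects the connection because the first bytes are not a TLS record, which matches the collector warning when the agent still connects without TLS.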

@jpkrohling jpkrohling left a comment

This goes in the right direction and gets things deployed, but might need a couple of adjustments in order to work.

pkg/config/tls/tls.go (outdated, resolved)
pkg/strategy/production.go (outdated, resolved)
pkg/strategy/production.go (outdated, resolved)
pkg/deployment/agent.go (outdated, resolved)
pkg/deployment/agent.go (outdated, resolved)
pkg/deployment/agent.go (outdated, resolved)
Signed-off-by: Annanay <[email protected]>
@jpkrohling

Looks like it's not quite working yet. This is what I see in the Agent logs for the Query:

{"level":"info","ts":1582707258.2944024,"caller":"base/balancer.go:83","msg":"base.baseBalancer: got new ClientConn state: {{[{10.129.2.18:14250 0  <nil>}] <nil>} <nil>}","system":"grpc","grpc_log":true}
{"level":"info","ts":1582707260.305737,"caller":"base/balancer.go:140","msg":"base.baseBalancer: handle SubConn state change: 0xc0002256b0, CONNECTING","system":"grpc","grpc_log":true}
{"level":"info","ts":1582707260.3057969,"caller":"roundrobin/roundrobin.go:50","msg":"roundrobinPicker: newPicker called with readySCs: map[]","system":"grpc","grpc_log":true}
{"level":"info","ts":1582707260.3066912,"caller":"base/balancer.go:140","msg":"base.baseBalancer: handle SubConn state change: 0xc0002256b0, TRANSIENT_FAILURE","system":"grpc","grpc_log":true}
{"level":"info","ts":1582707260.3067415,"caller":"transport/log.go:30","msg":"transport: loopyWriter.run returning. connection error: desc = \"transport is closing\"","system":"grpc","grpc_log":true}

And this is from the collector logs:

$ kubectl logs simple-prod-collector-5d556d5db-dxmpr
2020/02/26 08:50:59 maxprocs: Leaving GOMAXPROCS=2: CPU quota undefined
{"level":"info","ts":1582707059.8038766,"caller":"flags/service.go:115","msg":"Mounting metrics handler on admin server","route":"/metrics"}
{"level":"info","ts":1582707059.804306,"caller":"flags/admin.go:108","msg":"Mounting health check on admin server","route":"/"}
{"level":"info","ts":1582707059.8043773,"caller":"flags/admin.go:114","msg":"Starting admin HTTP server","http-port":14269}
{"level":"info","ts":1582707059.8044014,"caller":"flags/admin.go:100","msg":"Admin server started","http-port":14269,"health-status":"unavailable"}
{"level":"info","ts":1582707059.8288152,"caller":"config/config.go:172","msg":"Elasticsearch detected","version":6}
{"level":"info","ts":1582707059.917191,"caller":"collector/main.go:128","msg":"Starting jaeger-collector TChannel server","port":14267}
{"level":"warn","ts":1582707059.917238,"caller":"collector/main.go:129","msg":"TChannel has been deprecated and will be removed in a future release"}
{"level":"info","ts":1582707059.9429655,"caller":"grpcserver/grpc_server.go:64","msg":"Starting jaeger-collector gRPC server","grpc-port":"14250"}
{"level":"info","ts":1582707059.9430513,"caller":"collector/main.go:148","msg":"Starting jaeger-collector HTTP server","http-port":14268}
{"level":"info","ts":1582707059.9430842,"caller":"healthcheck/handler.go:128","msg":"Health Check state change","status":"ready"}
{"level":"info","ts":1582707059.9643774,"caller":"collector/main.go:240","msg":"Listening for Zipkin HTTP traffic","zipkin.http-port":9411}
WARNING: 2020/02/26 08:51:18 grpc: Server.Serve failed to complete security handshake from "10.129.2.19:38274": tls: first record does not look like a TLS handshake
WARNING: 2020/02/26 08:51:19 grpc: Server.Serve failed to complete security handshake from "10.129.2.19:38278": tls: first record does not look like a TLS handshake
WARNING: 2020/02/26 08:51:20 grpc: Server.Serve failed to complete security handshake from "10.129.2.19:38294": tls: first record does not look like a TLS handshake

I just got all the certs locally, and I was able to get a working setup. Looks like you are just missing the Agent parameter specifying the service-ca bundle, which makes the agent trust the certs generated by OpenShift's CA. Here's how to test it locally:

$ kubectl get secrets simple-prod-collector-headless-tls -o=go-template='{{index .data "tls.crt"}}' | base64 -d > /tmp/cert.crt

$ kubectl get secrets simple-prod-collector-headless-tls -o=go-template='{{index .data "tls.key"}}' | base64 -d > /tmp/cert.key

$ kubectl apply -f deploy/examples/business-application-injected-sidecar.yaml 

$ kubectl exec -c myapp myapp-7cb8b69d-nvvkv -- cat /var/run/secrets/kubernetes.io/serviceaccount/service-ca.crt > /tmp/service-ca.crt

$ SPAN_STORAGE_TYPE=memory go run -tags ui ./cmd/collector --collector.grpc.tls.enabled=true --collector.grpc.tls.cert=/tmp/cert.crt --collector.grpc.tls.key=/tmp/cert.key 

$ sudo vi /etc/hosts # add a line with the cert's hostname: 127.0.0.1 simple-prod-collector-headless.default.svc

$ SPAN_STORAGE_TYPE=memory go run -tags ui ./cmd/agent --reporter.grpc.host-port=simple-prod-collector-headless.default.svc:14250 --reporter.grpc.tls.enabled=true --reporter.grpc.tls.ca=/tmp/service-ca.crt

In the last command, you should see this in the logs:

{"level":"info","ts":1582708364.6002507,"caller":"base/balancer.go:181","msg":"base.baseBalancer: handle SubConn state change: 0xc0002882b0, READY","system":"grpc","grpc_log":true}
{"level":"info","ts":1582708364.6003134,"caller":"roundrobin/roundrobin.go:48","msg":"roundrobinPicker: newPicker called with info: {map[0xc0002882b0:{{simple-prod-collector-headless.default.svc:14250  <nil> 0 <nil>}}]}","system":"grpc","grpc_log":true}

TL;DR: it should be sufficient to add the following arg to the agent: --reporter.grpc.tls.ca=/var/run/secrets/kubernetes.io/serviceaccount/service-ca.crt
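
Why the CA bundle and the /etc/hosts entry both matter can be illustrated with Go's `crypto/x509` package. This is a self-contained sketch under stated assumptions, not the agent's code: it generates a throwaway CA that merely stands in for OpenShift's service CA, issues a certificate for the collector's service hostname, and verifies it the way a TLS client would. The helper names `newCert`, `verifyAgainst`, and `verifyScenarios` are made up for the example.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"math/big"
	"time"
)

const collectorHost = "simple-prod-collector-headless.default.svc"

// newCert creates a certificate from template, signed by parent/parentKey.
// When parent is nil the certificate is self-signed.
func newCert(template, parent *x509.Certificate, parentKey *ecdsa.PrivateKey) (*x509.Certificate, *ecdsa.PrivateKey, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, nil, err
	}
	if parent == nil {
		parent, parentKey = template, key
	}
	der, err := x509.CreateCertificate(rand.Reader, template, parent, &key.PublicKey, parentKey)
	if err != nil {
		return nil, nil, err
	}
	cert, err := x509.ParseCertificate(der)
	return cert, key, err
}

// verifyAgainst checks leaf for serverName, trusting only the CAs in roots.
func verifyAgainst(leaf *x509.Certificate, roots *x509.CertPool, serverName string) error {
	_, err := leaf.Verify(x509.VerifyOptions{Roots: roots, DNSName: serverName})
	return err
}

// verifyScenarios returns the verification result with the CA trusted,
// without it, and with the CA trusted but a different hostname.
func verifyScenarios() (withCA, withoutCA, wrongName error) {
	now := time.Now()
	ca, caKey, err := newCert(&x509.Certificate{
		SerialNumber:          big.NewInt(1),
		Subject:               pkix.Name{CommonName: "stand-in-service-ca"},
		NotBefore:             now.Add(-time.Hour),
		NotAfter:              now.Add(time.Hour),
		IsCA:                  true,
		BasicConstraintsValid: true,
		KeyUsage:              x509.KeyUsageCertSign,
	}, nil, nil)
	if err != nil {
		panic(err)
	}
	leaf, _, err := newCert(&x509.Certificate{
		SerialNumber: big.NewInt(2),
		Subject:      pkix.Name{CommonName: collectorHost},
		DNSNames:     []string{collectorHost},
		NotBefore:    now.Add(-time.Hour),
		NotAfter:     now.Add(time.Hour),
		KeyUsage:     x509.KeyUsageDigitalSignature,
		ExtKeyUsage:  []x509.ExtKeyUsage{x509.ExtKeyUsageServerAuth},
	}, ca, caKey)
	if err != nil {
		panic(err)
	}

	trusted := x509.NewCertPool()
	trusted.AddCert(ca)
	empty := x509.NewCertPool() // no CA configured, as in the failing setup

	return verifyAgainst(leaf, trusted, collectorHost),
		verifyAgainst(leaf, empty, collectorHost),
		verifyAgainst(leaf, trusted, "localhost")
}

func main() {
	withCA, withoutCA, wrongName := verifyScenarios()
	fmt.Println("CA trusted, matching name:", withCA)
	fmt.Println("CA not trusted:           ", withoutCA)
	fmt.Println("CA trusted, other name:   ", wrongName)
}
```

Only the first case verifies: the client has to trust the issuing CA and has to connect using a name the certificate was issued for.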

annanay25 commented Feb 26, 2020

Thanks @jpkrohling for the analysis. I've updated the PR.

Also, thanks for provisioning the cluster, which can be taken down once this PR is merged :)

annanay25 commented Mar 3, 2020

@jpkrohling jpkrohling left a comment

It's very close!

pkg/deployment/agent.go (outdated, resolved)
pkg/inject/sidecar.go (outdated, resolved)
@annanay25

Done :)

@jpkrohling jpkrohling left a comment

Looks great now, and I confirm that it works.

Could you add a couple of simple unit tests to cover the new lines, especially for the issues we faced during these last tests?

When the platform is set to openshift:

  • the agent should have the TLS options, and the "server-name" should contain ${namespace}.svc.cluster.local
  • same for the sidecar
  • the collector should have the new TLS options
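
The kind of check requested above could look roughly like this. It is a sketch only and not the tests that were added in this PR: `agentTLSArgs` and `findArg` are hypothetical helpers, the `--reporter.grpc.tls.server-name` flag spelling is assumed from the "server-name" option mentioned in the list, and composing the value as service name plus namespace is likewise an assumption. The CA path is the one given earlier in the thread.

```go
package main

import (
	"fmt"
	"strings"
)

// serviceCAPath is the service-ca bundle location quoted earlier in the thread.
const serviceCAPath = "/var/run/secrets/kubernetes.io/serviceaccount/service-ca.crt"

// agentTLSArgs (hypothetical helper) builds the agent's TLS flags for talking
// to the given collector service in the given namespace.
func agentTLSArgs(collectorService, namespace string) []string {
	return []string{
		"--reporter.grpc.tls.enabled=true",
		"--reporter.grpc.tls.ca=" + serviceCAPath,
		// Assumed flag name and value layout; see the note above the block.
		fmt.Sprintf("--reporter.grpc.tls.server-name=%s.%s.svc.cluster.local", collectorService, namespace),
	}
}

// findArg returns the value of the first "name=value" entry in args.
func findArg(args []string, name string) (string, bool) {
	prefix := name + "="
	for _, arg := range args {
		if strings.HasPrefix(arg, prefix) {
			return strings.TrimPrefix(arg, prefix), true
		}
	}
	return "", false
}

func main() {
	args := agentTLSArgs("simple-prod-collector-headless", "default")
	serverName, ok := findArg(args, "--reporter.grpc.tls.server-name")
	// The assertion a unit test would make for the agent and the sidecar.
	fmt.Println(ok && strings.Contains(serverName, "default.svc.cluster.local"))
}
```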

pkg/strategy/production.go (resolved)
pkg/deployment/agent_test.go (resolved)
pkg/deployment/collector_test.go (resolved)
pkg/deployment/collector_test.go (outdated, resolved)
pkg/deployment/collector_test.go (outdated, resolved)
pkg/inject/sidecar_test.go (resolved)
pkg/deployment/agent_test.go (outdated, resolved)
Signed-off-by: Annanay <[email protected]>

@jpkrohling jpkrohling left a comment

LGTM, just update the deprecated flag and it's ready to be merged (provided that the CI tests pass).

pkg/config/tls/tls.go (outdated, resolved)
@annanay25

@jpkrohling - Could you please re-run CI on this? I'm not sure whether the operator deployment is failing because of this PR.

@jpkrohling

Merging, as the tests are passing locally:

Running Smoke end-to-end tests...
ok  	github.com/jaegertracing/jaeger-operator/test/e2e	317.085s
Running Cassandra end-to-end tests...
ok  	github.com/jaegertracing/jaeger-operator/test/e2e	245.803s
Running Elasticsearch end-to-end tests...
ok  	github.com/jaegertracing/jaeger-operator/test/e2e	407.445s
Running Self provisioned Elasticsearch end-to-end tests...
ok  	github.com/jaegertracing/jaeger-operator/test/e2e	0.798s
Running Streaming end-to-end tests...
ok  	github.com/jaegertracing/jaeger-operator/test/e2e	286.869s
Running Example end-to-end tests part 1...
ok  	github.com/jaegertracing/jaeger-operator/test/e2e	123.368s
Running Example end-to-end tests part 2...
ok  	github.com/jaegertracing/jaeger-operator/test/e2e	164.915s

Note that the last one had to be executed twice, because of #945.

@jpkrohling jpkrohling merged commit deec90d into jaegertracing:master Mar 5, 2020
@annanay25

Thanks! 🎉

Development

Successfully merging this pull request may close these issues.

Configure secure connection between agent and collector (TLS)
2 participants