server: TLS handshake log spam from health checks #32102

Closed
markharding opened this issue Nov 1, 2018 · 13 comments · Fixed by #55279
Labels
A-kv-server Relating to the KV-level RPC server C-question A question rather than an issue. No code/spec/doc change needed. O-support Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs S-3-ux-surprise Issue leaves users wondering whether CRDB is behaving properly. Likely to hurt reputation/adoption.

Comments

@markharding

markharding commented Nov 1, 2018

Summary

We are running CockroachDB on k8s on AWS and see the following errors in the logs, even though CockroachDB itself is working without issues.


Timestamp | Level | Message | Source
-- | -- | -- | --
2018-11-01 10:38:34 | ERROR | http: TLS handshake error from 10.0.9.140:36238: EOF | server.go:2921
2018-11-01 10:38:36 | WARNING | grpc: Server.Serve failed to complete security handshake from "10.0.9.140:62179": EOF | vendor/google.golang.org/grpc/server.go:603
2018-11-01 10:38:37 | ERROR | http: TLS handshake error from 10.0.9.66:4587: EOF | server.go:2921
2018-11-01 10:38:38 | WARNING | grpc: Server.Serve failed to complete security handshake from "10.0.9.66:1101": EOF

k8s configs -> https://github.com/cockroachdb/cockroach/pull/27921/files

Load Balancer configs:

apiVersion: v1
kind: Service
metadata:
  # This service is meant to be used by clients of the database. It exposes a ClusterIP that will
  # automatically load balance connections to the different database pods.
  name: cockroachdb-external
  labels:
    app: cockroachdb
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-internal: 0.0.0.0/0
    service.beta.kubernetes.io/aws-load-balancer-connection-idle-timeout: "4000"
    service.beta.kubernetes.io/aws-load-balancer-type: "nlb"
spec:
  type: LoadBalancer
  ports:
  # The main port, served by gRPC, serves Postgres-flavor SQL, internode
  # traffic and the cli.
  - port: 26257
    targetPort: 26257
    name: grpc
  # The secondary port serves the UI as well as health and debug endpoints.
  - port: 8080
    targetPort: 8080
    name: http
  selector:
    app: cockroachdb

Steps to reproduce

  1. Set up k8s with the above configurations
  2. View error logs

Expected Result

No error log entry

Actual Result

Error log entry

Log files/version

Node 1


Timestamp | Level | Message | Source
-- | -- | -- | --
2018-11-01 10:22:43 | INFO | [config] clusterID: 0e8897c8-ce0f-498b-95c1-e56892697e47 | util/log/clog.go:1067
2018-11-01 10:22:43 | INFO | [config] arguments: [/cockroach/cockroach start --logtostderr --certs-dir /cockroach/cockroach-certs --advertise-host cockroachdb-0.cockroachdb.default.svc.cluster.local --http-host 0.0.0.0 --join cockroachdb-0.cockroachdb,cockroachdb-1.cockroachdb,cockroachdb-2.cockroachdb --cache 25% --max-sql-memory 25%] | util/log/clog.go:1067
2018-11-01 10:22:43 | INFO | [config] binary: CockroachDB CCL v2.1.0 (x86_64-unknown-linux-gnu, built 2018/10/30 12:32:34, go1.10.3) | util/log/clog.go:1067
2018-11-01 10:22:43 | INFO | [config] running on machine: cockroachdb-0 | util/log/clog.go:1067
2018-11-01 10:22:43 | INFO | [config] file created at: 2018/11/01 10:22:43


---


Timestamp | Level | Message | Source
-- | -- | -- | --
2018-11-01 10:38:34 | WARNING | grpc: Server.Serve failed to complete security handshake from "10.0.9.66:50273": EOF | vendor/google.golang.org/grpc/server.go:603
2018-11-01 10:38:34 | WARNING | grpc: Server.Serve failed to complete security handshake from "10.0.9.66:56280": EOF | vendor/google.golang.org/grpc/server.go:603
2018-11-01 10:38:34 | ERROR | http: TLS handshake error from 10.0.9.140:36238: EOF | server.go:2921
2018-11-01 10:38:36 | WARNING | grpc: Server.Serve failed to complete security handshake from "10.0.9.140:62179": EOF | vendor/google.golang.org/grpc/server.go:603
2018-11-01 10:38:37 | ERROR | http: TLS handshake error from 10.0.9.66:4587: EOF | server.go:2921
2018-11-01 10:38:38 | WARNING | grpc: Server.Serve failed to complete security handshake from "10.0.9.66:1101": EOF

Node 2


Timestamp | Level | Message | Source
-- | -- | -- | --
2018-11-01 10:07:28 | INFO | [config] clusterID: 0e8897c8-ce0f-498b-95c1-e56892697e47 | util/log/clog.go:1067
2018-11-01 10:07:28 | INFO | [config] arguments: [/cockroach/cockroach start --logtostderr --certs-dir /cockroach/cockroach-certs --advertise-host cockroachdb-2.cockroachdb.default.svc.cluster.local --http-host 0.0.0.0 --join cockroachdb-0.cockroachdb,cockroachdb-1.cockroachdb,cockroachdb-2.cockroachdb --cache 25% --max-sql-memory 25%] | util/log/clog.go:1067
2018-11-01 10:07:28 | INFO | [config] binary: CockroachDB CCL v2.1.0 (x86_64-unknown-linux-gnu, built 2018/10/30 12:32:34, go1.10.3) | util/log/clog.go:1067
2018-11-01 10:07:28 | INFO | [config] running on machine: cockroachdb-2 | util/log/clog.go:1067
2018-11-01 10:07:28 | INFO | [config] file created at: 2018/11/01 10:07:28



---


Timestamp | Level | Message | Source
-- | -- | -- | --
2018-11-01 10:22:37 | ERROR | http: TLS handshake error from 10.0.9.140:7586: EOF | server.go:2921
2018-11-01 10:22:41 | WARNING | grpc: Server.Serve failed to complete security handshake from "10.0.9.140:18280": EOF | vendor/google.golang.org/grpc/server.go:603
2018-11-01 10:22:42 | WARNING | grpc: Server.Serve failed to complete security handshake from "10.0.9.66:59662": EOF | vendor/google.golang.org/grpc/server.go:603
2018-11-01 10:22:44 | WARNING | grpc: Server.Serve failed to complete security handshake from "10.0.9.140:10379": EOF

Epic: CRDB-549

@bdarnell bdarnell changed the title from "http: TLS handshake error AND grpc: Server.Serve failed to complete security handshake errors with k8s on AWS" to "server: TLS handshake log spam from health checks" Nov 1, 2018
@bdarnell
Contributor

bdarnell commented Nov 1, 2018

These messages are harmless, but annoying. They're happening because the load balancer is just doing a TCP-level health check and closing the connection immediately. Unfortunately we can't easily suppress these log messages because they're coming from deep within our dependencies.

The k8s configs are supposed to be doing HTTPS health checks, which wouldn't generate these messages. There might be something else going on in the AWS environment, or maybe there's something wrong with those templates.

@knz knz added O-community Originated from the community S-3-ux-surprise Issue leaves users wondering whether CRDB is behaving properly. Likely to hurt reputation/adoption. labels Nov 12, 2018
@knz knz added A-kv-server Relating to the KV-level RPC server C-question A question rather than an issue. No code/spec/doc change needed. labels Nov 12, 2018
@a-robinson
Contributor

> The k8s configs are supposed to be doing HTTPS health checks, which wouldn't generate these messages. There might be something else going on in the AWS environment, or maybe there's something wrong with those templates.

That "something else" is most likely the AWS network load balancer that was created because the "Service" that @markharding referred to is of type: LoadBalancer. It may be worth checking that load balancer's health checks to confirm they're the cause, and optionally tweaking them.

@tim-o tim-o added O-support Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs and removed O-community Originated from the community labels Jul 29, 2019
@RoachietheSupportRoach
Collaborator

Zendesk ticket #3527 has been linked to this issue.

@robert-s-lee
Contributor

@bdarnell @rytaft The load balancer template shows service.beta.kubernetes.io/aws-load-balancer-type: "nlb". Should this be ALB? Where is this value being picked up from?

@bdarnell
Contributor

No, it needs to be NLB to support the non-HTTP postgres protocol. But even NLBs support HTTPS health checks, they just need to be configured appropriately.

@robert-s-lee
Contributor

@bdarnell @jseldess Is the appropriate NLB config step known in-house for documentation? Who has the knowledge to help?

@bdarnell
Contributor

bdarnell commented Sep 4, 2019

Maybe the MSO team? If not them, I'm not sure if anyone internal has tried this.

@robert-s-lee
Contributor

@kannanlakshmi How is the MSO LB set up (with and without K8s) on AWS and Google?

@RoachietheSupportRoach
Collaborator

RoachietheSupportRoach commented Sep 4, 2019 via email

@robert-s-lee
Contributor

Is there a working K8s annotation config available for HTTP and HTTPS? The original post had HTTP but resulted in error messages in the log file. Is the required change as simple as specifying HTTPS, as Ben indicated?

@mberhault
Contributor

Looking further, there doesn't seem to be much in the way of configuration for NLB on EKS: https://kubernetes.io/docs/concepts/services-networking/service/#aws-nlb-support

There are also a number of related issues filed against Kubernetes.

@knz
Contributor

knz commented Oct 6, 2020

Investigated this today with @HonoreDB - the log spam here is slightly more complicated to eliminate than others because it is produced by our grpc upstream dependency, not the crdb code directly.

We do use some mechanism to prevent log spam for gRPC logging (the connectivitySpamRe in package grpcutil), but it's based on a threshold: we refuse to print the same error more than once per 30 seconds. Maybe that would be good enough for the TLS handshake error as well?

@bdarnell
Copy link
Contributor

bdarnell commented Oct 6, 2020

The once-per-30-seconds rule makes sense for outgoing connection attempts - there should be some indication that we're attempting to initiate a connection that is failing. But for this issue of incoming connections that do nothing wrong except close without going through the TLS handshake, it'd be better not to print them at all.

8 participants