Vault fails to supply x509 certificate when connecting to Consul when mutual TLS is enabled #3248

Closed
simon-wenmouth opened this issue Aug 28, 2017 · 10 comments

@simon-wenmouth

Summary

  • Vault is using Consul (over HTTPS) as its storage backend
  • Consul is configured to use mutual TLS, i.e. verify_incoming: true
    and verify_outgoing: true
  • When Vault makes the service registration call (PUT /v1/agent/service/register),
    Consul sends a client certificate request during the TLS handshake.
  • Vault does not reply with its certificate.
  • The service registration call fails.
  • Vault reports the following error on startup.
physical/consul: reconcile unable to talk with Consul backend: error=service registration failed: Put https://${hostname}:8500/v1/agent/service/register: remote error: tls: bad certificate 

Expected Behavior:

Vault should respond with its certificate during the mutual TLS handshake with Consul.
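
For a Go client, replying to a client certificate request comes down to having a key pair in tls.Config.Certificates. The following is a minimal sketch of that idea, not Vault's implementation: clientTLSConfig and selfSignedPEM are hypothetical helpers, and the throwaway self-signed certificate only stands in for real key material.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/tls"
	"crypto/x509"
	"crypto/x509/pkix"
	"encoding/pem"
	"errors"
	"fmt"
	"math/big"
	"time"
)

// selfSignedPEM (hypothetical helper) returns a throwaway certificate and
// private key, both PEM encoded, so the example needs no files on disk.
func selfSignedPEM(cn string) (certPEM, keyPEM []byte, err error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(1),
		Subject:      pkix.Name{CommonName: cn},
		NotBefore:    time.Now().Add(-time.Minute),
		NotAfter:     time.Now().Add(time.Hour),
		KeyUsage:     x509.KeyUsageDigitalSignature,
		ExtKeyUsage:  []x509.ExtKeyUsage{x509.ExtKeyUsageClientAuth},
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return nil, nil, err
	}
	keyDER, err := x509.MarshalECPrivateKey(key)
	if err != nil {
		return nil, nil, err
	}
	certPEM = pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE", Bytes: der})
	keyPEM = pem.EncodeToMemory(&pem.Block{Type: "EC PRIVATE KEY", Bytes: keyDER})
	return certPEM, keyPEM, nil
}

// clientTLSConfig (hypothetical helper) builds a client-side TLS config.
// The client certificate is only attached when both cert and key are given;
// otherwise Certificates stays empty and the client has nothing to send.
func clientTLSConfig(caPEM, certPEM, keyPEM []byte) (*tls.Config, error) {
	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM(caPEM) {
		return nil, errors.New("no CA certificates parsed")
	}
	cfg := &tls.Config{RootCAs: pool, MinVersion: tls.VersionTLS12}
	if len(certPEM) > 0 && len(keyPEM) > 0 {
		pair, err := tls.X509KeyPair(certPEM, keyPEM)
		if err != nil {
			return nil, err
		}
		cfg.Certificates = []tls.Certificate{pair}
	}
	return cfg, nil
}

func main() {
	certPEM, keyPEM, err := selfSignedPEM("vault.example")
	if err != nil {
		panic(err)
	}
	with, err := clientTLSConfig(certPEM, certPEM, keyPEM)
	if err != nil {
		panic(err)
	}
	without, err := clientTLSConfig(certPEM, certPEM, nil)
	if err != nil {
		panic(err)
	}
	fmt.Println("client certificates with key:", len(with.Certificates))
	fmt.Println("client certificates without key:", len(without.Certificates))
}
```

Printing len(cfg.Certificates) right after the config is built is a cheap way to tell "no certificate configured" apart from "certificate rejected by the peer".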

Actual Behavior:

Vault did not reply with its certificate when responding to the client certificate request.

Steps to Reproduce:

See: https://github.com/simon-wenmouth/vault-consul-tls/

./docker-compose.sh build
./docker-compose.sh up
docker logs tls_consul-client_1

Environment:

  • Vault
    • Version: v0.8.1
    • Version Sha: 8d76a41
  • Consul
    • Version: v0.9.2
  • Operating System/Architecture:
    • System Version: OS X 10.11.6 (15G1611)
    • Kernel Version: Darwin 15.6.0
    • Model Name: MacBook Pro
    • Processor Name: Intel Core i7
    • Processor Speed: 2.5 GHz
    • Number of Processors: 1
    • Total Number of Cores: 4
  • Docker
    • Docker version 17.06.1-ce, build 874a737
    • docker-compose version 1.14.0, build c7bdf9e

Vault Config File:

See: https://github.com/simon-wenmouth/vault-consul-tls/tree/master/etc/vault

{
    "cluster_name": "intermediate-ca",
    "cache_size": "32000",
    "disable_cache": false,
    "disable_mlock": true,
    "default_lease_ttl": "768h",
    "max_lease_ttl": "768h"
}
{
  "listener": {
    "tcp": {
      "address":         "0.0.0.0:8200",
      "cluster_address": "172.19.0.12:8201",
      "tls_disable":     "false",
      "tls_cert_file":   "/opt/vault/config/keys/server.cert.pem",
      "tls_key_file":    "/opt/vault/config/keys/server.key.pem"
    }
  }
}
{
  "storage": {
    "consul": {
      "address": "consul-client.dc.consul:8500",
      "check_timeout": "5s",
      "consistency_mode": "default",
      "disable_registration": "false",
      "max_parallel": "128",
      "path": "vault/",
      "scheme": "https",
      "service": "vault",
      "service_tags": "",
      "token": "",
      "cluster_addr": "https://172.19.0.12:8201",
      "redirect_addr": "https://172.19.0.12:8200",
      "tls_ca_file": "/opt/vault/config/keys/server.ca.pem",
      "tls_cert_file": "/opt/vault/config/keys/server.cert.pem",
      "tls_key_file ": "/opt/vault/config/keys/server.key.pem"
    }
  }
}

Startup Log Output:

==> Vault server configuration:

                     Cgo: disabled
         Cluster Address: https://172.19.0.12:8201
              Listener 1: tcp (addr: "0.0.0.0:8200", cluster address: "172.19.0.12:8201", tls: "enabled")
               Log Level: trace
                   Mlock: supported: true, enabled: false
        Redirect Address: https://172.19.0.12:8200
                 Storage: consul (HA available)
                 Version: Vault v0.8.1
             Version Sha: 8d76a41854608c547a233f2e6292ae5355154695

==> Vault server started! Log data will stream in below:

2017/08/27 05:19:20.319318 [DEBUG] physical/consul: config path set: path=vault/
2017/08/27 05:19:20.319491 [DEBUG] physical/consul: config disable_registration set: disable_registration=false
2017/08/27 05:19:20.319503 [DEBUG] physical/consul: config service set: service=vault
2017/08/27 05:19:20.319507 [DEBUG] physical/consul: config service_tags set: service_tags=
2017/08/27 05:19:20.319511 [DEBUG] physical/consul: config check_timeout set: check_timeout=5s
2017/08/27 05:19:20.319530 [DEBUG] physical/consul: config address set: address=consul-client.dc.consul:8500
2017/08/27 05:19:20.319537 [DEBUG] physical/consul: config scheme set: scheme=https
2017/08/27 05:19:20.319539 [DEBUG] physical/consul: config token set
2017/08/27 05:19:20.321205 [DEBUG] physical/consul: configured TLS
2017/08/27 05:19:20.321271 [DEBUG] physical/consul: max_parallel set: max_parallel=128
2017/08/27 05:19:20.321323 [TRACE] physical/cache: creating LRU cache: size=32000
2017/08/27 05:19:20.324530 [TRACE] cluster listener addresses synthesized: cluster_addresses=[172.19.0.12:8201]
2017/08/27 05:19:20.333866 [WARN ] physical/consul: reconcile unable to talk with Consul backend: error=service registration failed: Put https://consul-client.dc.consul:8500/v1/agent/service/register: remote error: tls: bad certificate
@jefferai
Member

You're claiming that Vault is not sending certificates to Consul. How have you verified this? The error message says it's a bad certificate, not that there is no certificate.

@simon-wenmouth
Author

Hi @jefferai,

Thanks for the reply! Sorry for the delay in getting back to you.

I approached the error message by first verifying that the x509 certificates used by Vault and Consul would accept a connection from each other (using the openssl command line tool).

I used the Consul certificate for a "server" process requiring mutual TLS:

openssl s_server -CAfile agent.ca.pem -cert agent.cert.pem -key agent.key.pem -Verify 1 -www

I used the Vault certificate as the "client".

openssl s_client -CAfile server.ca.pem -cert server.cert.pem -key server.key.pem -verify 1 -connect localhost:4433

The openssl server/client interaction demonstrated that the certificates were capable of mutual TLS.

The server transcript is as follows.

$> openssl s_server -CAfile agent.ca.pem -cert agent.cert.pem -key agent.key.pem -Verify 1 -www
verify depth is 1, must return a certificate
Using default temp DH parameters
Using default temp ECDH parameters
ACCEPT
depth=1 /CN=ca.dc.consul
verify return:1
depth=0 /CN=consul-client.dc.consul
verify return:1
ACCEPT
^C

The client transcript is as follows.

$ openssl s_client -CAfile server.ca.pem -cert server.cert.pem -key server.key.pem -verify 1 -connect localhost:4433
verify depth is 1
CONNECTED(00000003)
depth=1 /CN=ca.dc.consul
verify return:1
depth=0 /CN=consul-client.dc.consul
verify return:1
---
Certificate chain
 0 s:/CN=consul-client.dc.consul
   i:/CN=ca.dc.consul
 1 s:/CN=ca.dc.consul
   i:/CN=ca.dc.consul
---
Server certificate
-----BEGIN CERTIFICATE-----
 ... snip ...
-----END CERTIFICATE-----
subject=/CN=consul-client.dc.consul
issuer=/CN=ca.dc.consul
---
Acceptable client certificate CA names
/CN=ca.dc.consul
---
SSL handshake has read 2314 bytes and written 2250 bytes
---
New, TLSv1/SSLv3, Cipher is DHE-RSA-AES256-SHA
Server public key is 2048 bit
Secure Renegotiation IS supported
Compression: NONE
Expansion: NONE
SSL-Session:
    Protocol  : TLSv1
    Cipher    : DHE-RSA-AES256-SHA
    Session-ID: 85C068CD9B00E6478E507126AC5B036D0F195FD727F68ED82FF9A32B27208558
    Session-ID-ctx: 
    Master-Key: 1AA57ADBACF509226672AFD78241AAA9C1468AD2B67642C307142181EA2E54B217C8C07441C43CE4C0D576564C78852E
    Key-Arg   : None
    Start Time: 1503886642
    Timeout   : 300 (sec)
    Verify return code: 0 (ok)
---
^C

Seeing the certificates work with one another, I then updated my docker-compose file to run an openssl s_server in place of the Consul agent to get a transcript of the communication from Vault to Consul.

This is the substituted command.

openssl s_server -CAfile /opt/consul/config/keys/agent.ca.pem -cert /opt/consul/config/keys/agent.cert.pem -key /opt/consul/config/keys/agent.key.pem -Verify 1 -msg -state -www -accept 8500

This is the information recorded by openssl s_server.

+ exec openssl s_server -CAfile /opt/consul/config/keys/agent.ca.pem -cert /opt/consul/config/keys/agent.cert.pem -key /opt/consul/config/keys/agent.key.pem -Verify 1 -msg -state -www -accept 8500
verify depth is 1, must return a certificate
Using default temp DH parameters
Using default temp ECDH parameters
ACCEPT
SSL_accept:before/accept initialization
SSL_accept:SSLv3 read client hello A
<<< TLS 1.2 Handshake [length 00bb], ClientHello
    01 00 00 b7 03 03 a4 39 78 52 a6 17 00 1d cb 06
    ... snip ...
    74 74 70 2f 31 2e 31 00 12 00 00
>>> TLS 1.2 Handshake [length 0059], ServerHello
    02 00 00 55 03 03 59 a3 84 a4 d5 d2 47 a9 10 99
    ... snip ...
    00 00 0b 00 04 03 00 01 02
SSL_accept:SSLv3 write server hello A
>>> TLS 1.2 Handshake [length 06b2], Certificate
    0b 00 06 ae 00 06 ab 00 03 79 30 82 03 75 30 82
    ... snip ...
    a9 7e 34 5c 97 a9 d5 6d da 64 78 4b 7d 3e d9 8c
SSL_accept:SSLv3 write certificate A
    bd 14 21 71 09 09 88 4f 5b 5d 81 50 cf 72 f7 e4
    ... snip ...
    1c 5e ca 18 c8 b4 67 e4 64 05 28 25 15 e0 f0 ff
    49 1a
>>> TLS 1.2 Handshake [length 014d], ServerKeyExchange
    0c 00 01 49 03 00 17 41 04 51 e5 98 12 cf 17 09
    ... snip ...
    d2 d5 cf cc be fa 12 92 77 f0 cc 45 17
>>> TLS 1.2 Handshake [length 0049], CertificateRequest
    0d 00 00 41 03 01 02 40 00 1e 06 01 06 02 06 03
    05 01 05 02 05 03 04 01 04 02 04 03 03 01 03 02
    03 03 02 01 02 02 02 03 00 1b 00 19 30 17 31 15
    30 13 06 03 55 04 03 13 0c 63 61 2e 64 63 2e 63
    6f 6e 73 75 6c 0e 00 00 00
SSL_accept:SSLv3 write key exchange A
SSL_accept:SSLv3 write certificate request A
SSL_accept:SSLv3 flush data
<<< TLS 1.2 Handshake [length 0007], Certificate
    0b 00 00 03 00 00 00
>>> TLS 1.2 Alert [length 0002], fatal handshake_failure
    02 28
SSL3 alert write:fatal:handshake failure
SSL_accept:error in SSLv3 read client certificate B
SSL_accept:error in SSLv3 read client certificate B
140541868337056:error:140890C7:SSL routines:SSL3_GET_CLIENT_CERTIFICATE:peer did not return a certificate:s3_srvr.c:3321:
ACCEPT

As such, I reasoned that Vault was not responding to the client certificate request.
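
The client's reply in the transcript above (0b 00 00 03 00 00 00) can be decoded by hand: under the TLS 1.2 layout in RFC 5246, 0b is the Certificate handshake type, 00 00 03 is the body length, and 00 00 00 is the length of the certificate list. A minimal Go sketch of that decoding follows; countCertificates and uint24 are hypothetical helpers written for this illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// uint24 reads a 3-byte big-endian length field.
func uint24(b []byte) int {
	return int(b[0])<<16 | int(b[1])<<8 | int(b[2])
}

// countCertificates parses a TLS 1.2 Certificate handshake message
// (type byte, 3-byte body length, 3-byte list length, then entries that
// are each prefixed by a 3-byte length) and returns the number of entries.
func countCertificates(msg []byte) (int, error) {
	if len(msg) < 7 {
		return 0, errors.New("message too short")
	}
	if msg[0] != 0x0b { // handshake type 11 = certificate
		return 0, errors.New("not a Certificate handshake message")
	}
	if uint24(msg[1:4]) != len(msg)-4 {
		return 0, errors.New("handshake length mismatch")
	}
	list := msg[7:]
	if uint24(msg[4:7]) != len(list) {
		return 0, errors.New("certificate list length mismatch")
	}
	count := 0
	for len(list) > 0 {
		if len(list) < 3 {
			return 0, errors.New("truncated certificate entry")
		}
		n := uint24(list[:3])
		if n > len(list)-3 {
			return 0, errors.New("truncated certificate body")
		}
		list = list[3+n:]
		count++
	}
	return count, nil
}

func main() {
	// Bytes copied from the client's Certificate message in the transcript.
	msg := []byte{0x0b, 0x00, 0x00, 0x03, 0x00, 0x00, 0x00}
	n, err := countCertificates(msg)
	if err != nil {
		panic(err)
	}
	fmt.Printf("certificates in message: %d\n", n)
}
```

For the bytes in the transcript this reports zero certificates, which matches the "peer did not return a certificate" error from openssl.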

Not knowing how to proceed, I updated the local Consul agent configuration (which is dedicated to the Vault process) to verify_incoming_rpc: true and verify_incoming_https: false (leaving verify_incoming: true for all other Consul agents). The interaction between Vault and Consul in this configuration works as expected (over TLS without mutual authentication).

The repository to which I linked was a small testbed that I hoped would be of use in quickly reproducing the problem I'm encountering. If there is anything else I can do or provide to aid in the diagnosis of the problem I've encountered please don't hesitate to ask.

Thanks!

  • Simon

@jefferai
Member

jefferai commented Sep 2, 2017

Hi Simon,

Looking at your testing methodology I still think that the problem isn't that Vault's not supplying a certificate, it's that the remote end is failing to verify it. (Note that the error message says "bad certificate", not "no certificate".)

When you are using curl you are explicitly specifying a CA certificate for the client. My guess is that if you left that off you'd see the same behavior.

The solution is likely to be concatenating your CA certificate into your client certificate file. It should look like this:

-----BEGIN CERTIFICATE-----
< client certificate >
-----END CERTIFICATE-----
-----BEGIN CERTIFICATE-----
< ca certificate >
-----END CERTIFICATE-----
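
A rough way to check that a file has this shape is to count its CERTIFICATE blocks. The Go sketch below does that with encoding/pem; countCertBlocks is a hypothetical helper, and the two block bodies are placeholder base64, not real certificates. In practice the bytes would come from reading the certificate file.

```go
package main

import (
	"encoding/pem"
	"fmt"
)

// countCertBlocks walks all PEM blocks in data and counts those whose
// type is CERTIFICATE. It does not validate the certificates themselves.
func countCertBlocks(data []byte) int {
	count := 0
	for {
		var block *pem.Block
		block, data = pem.Decode(data)
		if block == nil {
			return count
		}
		if block.Type == "CERTIFICATE" {
			count++
		}
	}
}

func main() {
	// Placeholder bodies: valid base64, but not real certificates.
	bundle := []byte("-----BEGIN CERTIFICATE-----\nAAAA\n-----END CERTIFICATE-----\n" +
		"-----BEGIN CERTIFICATE-----\nBBBB\n-----END CERTIFICATE-----\n")
	fmt.Println("certificate blocks:", countCertBlocks(bundle))
}
```

A concatenated client-plus-CA file as described above should report two blocks.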

Let me know how it goes!

@simon-wenmouth
Author

Hi @jefferai,

The client certificates being used by Vault and Consul are the concatenated
kind you described (the client certificate followed by the ca certificate).

In my Consul entrypoint.

cat <<-EOF > "${pki_request}"
{
  "common_name": "$(hostname -f)",
  "alt_names": "localhost",
  "ttl": "8760h",
  "ip_sans": "$(hostname -i),127.0.0.1",
  "format": "pem"
}
EOF

pki=$(curl --insecure --fail --request POST --header "X-Vault-Token: ${vault_token}" --data @"${pki_request}" "${VAULT_ADDR}/v1/pki-ca/issue/tls-cert")

echo "${pki}" | jq -r .data.certificate >  "${CONSUL_CERT_FILE}"
echo "${pki}" | jq -r .data.issuing_ca  >> "${CONSUL_CERT_FILE}"
echo "${pki}" | jq -r .data.private_key >  "${CONSUL_KEY_FILE}"
echo "${pki}" | jq -r .data.issuing_ca  >  "${CONSUL_CA_FILE}"

In my Vault entrypoint.

pki=$(vault write -tls-skip-verify -address="${CA_VAULT_ADDR}" -format=json "pki-ca/issue/tls-cert" "common_name=$(hostname -f)" "alt_names=localhost" "ttl=8760h" "ip_sans=$(hostname -i),127.0.0.1" "format=pem")

echo "${pki}" | jq -r .data.certificate >  "${VAULT_CERT_FILE}"
echo "${pki}" | jq -r .data.issuing_ca  >> "${VAULT_CERT_FILE}"
echo "${pki}" | jq -r .data.private_key >  "${VAULT_KEY_FILE}"
echo "${pki}" | jq -r .data.issuing_ca  >  "${VAULT_CA_FILE}"

I wrote the following short program to try and isolate/reproduce
the problem with the Consul communication outside of the Vault codebase.

package main

import (
	"fmt"

	"github.com/hashicorp/consul/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		panic(err)
	}
	checks, _, err := client.Health().State("any", &api.QueryOptions{})
	if err != nil {
		panic(err)
	}
	for _, element := range checks {
		fmt.Printf("{'Node': '%s', 'Status': '%s', 'Name': '%s'}\n", element.Node, element.Status, element.Name)
	}
}

I ran the program as follows on the Vault host.

export CONSUL_HTTP_ADDR="127.0.0.1:8500"
export CONSUL_HTTP_SSL="true"
export CONSUL_HTTP_SSL_VERIFY="true"
export CONSUL_CACERT="/opt/vault/config/keys/server.ca.pem"
export CONSUL_CLIENT_CERT="/opt/vault/config/keys/server.cert.pem"
export CONSUL_CLIENT_KEY="/opt/vault/config/keys/server.key.pem"
export CONSUL_TLS_SERVER_NAME="consul-client.dc.consul"
./main

It printed the following text to the terminal.

{'Node': 'consul-client.dc.consul', 'Status': 'passing', 'Name': 'Serf Health Status'}
{'Node': 'consul-server.dc.consul', 'Status': 'passing', 'Name': 'Serf Health Status'}

Since I received a response (and not a panic), I'm assuming this means
that Vault and Consul mutually validate each other's certificates, yes?

My next experiment was to rely only on the environment variables for the
Consul storage backend (as per the test above). Here is the JSON used
as configuration for using Consul as the Vault storage backend:

{
  "storage": {
    "consul": {
      "cluster_addr": "https://172.19.0.12:8201",
      "redirect_addr": "https://172.19.0.12:8200"
    }
  }
}

The Vault startup logs now show another error (x509: certificate signed by unknown authority).

==> Vault server configuration:

                     Cgo: disabled
         Cluster Address: https://172.19.0.12:8201
              Listener 1: tcp (addr: "0.0.0.0:8200", cluster address: "172.19.0.12:8201", tls: "enabled")
               Log Level: trace
                   Mlock: supported: true, enabled: false
        Redirect Address: https://172.19.0.12:8200
                 Storage: consul (HA available)
                 Version: Vault v0.8.1
             Version Sha: 8d76a41854608c547a233f2e6292ae5355154695

==> Vault server started! Log data will stream in below:

2017/09/03 19:31:30.289940 [DEBUG] physical/consul: config path set: path=vault/
2017/09/03 19:31:30.290157 [DEBUG] physical/consul: config disable_registration set: disable_registration=false
2017/09/03 19:31:30.290166 [DEBUG] physical/consul: config service set: service=vault
2017/09/03 19:31:30.290174 [DEBUG] physical/consul: config service_tags set: service_tags=
2017/09/03 19:31:30.290203 [DEBUG] physical/consul: configured TLS
2017/09/03 19:31:30.290260 [TRACE] physical/cache: creating LRU cache: size=32000
2017/09/03 19:31:30.294133 [TRACE] cluster listener addresses synthesized: cluster_addresses=[172.19.0.12:8201]
2017/09/03 19:31:30.314176 [WARN ] physical/consul: reconcile unable to talk with Consul backend: error=service registration failed: Put https://127.0.0.1:8500/v1/agent/service/register: x509: certificate signed by unknown authority

This was interesting as I was expecting it to work (like the test program).

I then updated the configuration to specify the CA file (as the environment variable
seemed not to be used).

{
  "storage": {
    "consul": {
      "cluster_addr": "https://172.19.0.12:8201",
      "redirect_addr": "https://172.19.0.12:8200",
      "tls_ca_file": "/opt/vault/config/keys/server.ca.pem"
    }
  }
}

The error reverted to remote error: tls: bad certificate. ☹

My next experiment was to add the CA pem to the system bundle (and to remove, once
more, the tls_ca_file setting), e.g.

cp "${VAULT_CA_FILE}" /etc/pki/ca-trust/source/anchors/ && update-ca-trust extract

and

cp "${CONSUL_CA_FILE}" /etc/pki/ca-trust/source/anchors/ && update-ca-trust extract

for the Vault and Consul services (respectively). This also produced the error
message remote error: tls: bad certificate.

I then reviewed the difference between the test code and Vault:

  • src/github.com/hashicorp/vault/physical/consul/consul.go:243
  • src/github.com/hashicorp/vault/vendor/github.com/hashicorp/consul/api/api.go:354

where the test code uses the latter and Vault the former. I cannot detect any
meaningful difference between the two aside from the fact that Vault also
makes a call to the http2 package.

As such, I updated the test program to configure its connection with Consul in
the same manner as Vault, yet it (still) does not reproduce the problem.

package main

import (
	"crypto/tls"
	"fmt"
	"net/http"

	"github.com/hashicorp/consul/api"
	"golang.org/x/net/http2"
)

func main() {
	config := api.DefaultConfig()
	config.Transport.MaxIdleConnsPerHost = 64
	tlsClientConfig, err := api.SetupTLSConfig(&config.TLSConfig)
	if err != nil {
		panic(err)
	}
	tlsClientConfig.MinVersion = tls.VersionTLS12
	config.Transport.TLSClientConfig = tlsClientConfig
	if err := http2.ConfigureTransport(config.Transport); err != nil {
		panic(err)
	}
	config.HttpClient = &http.Client{Transport: config.Transport}
	fmt.Printf("%#v\n", config)
	client, err := api.NewClient(config)
	if err != nil {
		panic(err)
	}
	checks, _, err := client.Health().State("any", &api.QueryOptions{})
	if err != nil {
		panic(err)
	}
	for _, element := range checks {
		fmt.Printf("{'Node': '%s', 'Status': '%s', 'Name': '%s'}\n", element.Node, element.Status, element.Name)
	}
	_, err = client.KV().Put(&api.KVPair{Key: "hello", Value: []byte("world")}, &api.WriteOptions{})
	if err != nil {
		panic(err)
	}
	pair, _, err := client.KV().Get("hello", &api.QueryOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Printf("%s=%s\n", pair.Key, string(pair.Value))
}

The program produces the following output.

&api.Config{Address:"127.0.0.1:8500", Scheme:"https", Datacenter:"", Transport:(*http.Transport)(0xc4200101e0), HttpClient:(*http.Client)(0xc420071560), HttpAuth:(*api.HttpBasicAuth)(nil), WaitTime:0, Token:"", TLSConfig:api.TLSConfig{Address:"consul-client.dc.consul", CAFile:"/opt/vault/config/keys/server.ca.pem", CAPath:"", CertFile:"/opt/vault/config/keys/server.cert.pem", KeyFile:"/opt/vault/config/keys/server.key.pem", InsecureSkipVerify:false}}
{'Node': 'consul-client.dc.consul', 'Status': 'passing', 'Name': 'Serf Health Status'}
{'Node': 'consul-server.dc.consul', 'Status': 'passing', 'Name': 'Serf Health Status'}
hello=world

I have, for the time being, run out of ideas that might lead to a fix for the
issue I'm experiencing.

Do you have any recommendations?

Thanks!

  • Simon

@simon-wenmouth
Author

Hi Jeff,

I added logging to a development build of Vault and found that tlsClientConfig.Certificates was nil when exiting setupTLSConfig. I updated the method to be implemented in terms of consul/api following the pattern in NewConsulBackend of applying changes from the conf map to the default consul/api/Config. After this change the certificates were no longer nil and I found mutual TLS between Vault and Consul to work.

Best,

Simon

@jefferai
Member

jefferai commented Sep 4, 2017

Did you ever try using tls_ca_file to specify the CA cert in conjunction with the client certificate?

@jefferai
Member

jefferai commented Sep 4, 2017

I'm a bit confused by what's going on here...you opened a PR but now closed it...is it working for you?

@simon-wenmouth
Author

Hi Jeff,

I found my mistake -- really quite embarrassing. After adding logging to setupTLSConfig, I found that the property key for tls_key_file in the consul storage file ended with a space (which was being supplied by the environment in my previous PR).

I've submitted my logging changes as a PR on the off chance I'm not the only one who will make this mistake.

#3284
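
The effect of the stray space is easy to reproduce outside Vault. The Go sketch below assumes the storage settings end up in a map[string]string (the "conf map" mentioned earlier in this thread); paddedKeys is a hypothetical helper, not something in the Vault codebase. A lookup for "tls_key_file" misses the padded key, and the helper lists any key with leading or trailing whitespace.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// paddedKeys returns, in sorted order, every key in conf that has
// leading or trailing whitespace.
func paddedKeys(conf map[string]string) []string {
	var bad []string
	for k := range conf {
		if k != strings.TrimSpace(k) {
			bad = append(bad, k)
		}
	}
	sort.Strings(bad)
	return bad
}

func main() {
	conf := map[string]string{
		"tls_cert_file": "/opt/vault/config/keys/server.cert.pem",
		"tls_key_file ": "/opt/vault/config/keys/server.key.pem", // trailing space, as in the issue
	}

	// The exact-match lookup does not find the padded key.
	_, ok := conf["tls_key_file"]
	fmt.Printf("tls_key_file present: %v\n", ok)

	for _, k := range paddedKeys(conf) {
		fmt.Printf("suspicious key: %q\n", k)
	}
}
```

Printing keys with %q makes the trailing space visible in the log, which a plain %s would hide.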

Thanks for your help!

  • Simon

@jefferai
Member

jefferai commented Sep 4, 2017

@simon-wenmouth hah, okay -- I will look at the PR. I actually looked earlier at the way that Consul was setting up TLS in their function and ours and was like...I don't really see why there should be a difference :-)

@jefferai
Member

jefferai commented Sep 4, 2017

Closing, will look at the PR.

@jefferai jefferai closed this as completed Sep 4, 2017