Connect Vault CA token updates are not retained on restart #11363

Closed
rrijkse opened this issue Oct 20, 2021 · 10 comments · Fixed by #17846
Labels
needs-investigation The issue described is detailed and complex. theme/certificates Related to creating, distributing, and rotating certificates in Consul theme/connect Anything related to Consul Connect, Service Mesh, Side Car Proxies theme/consul-vault Relating to Consul & Vault interactions type/bug Feature does not function as expected

Comments

@rrijkse

rrijkse commented Oct 20, 2021

Overview of the Issue

When you use the consul connect ca set-config command to update the Vault token, the configuration is updated and Consul connects successfully. However, when leadership in the cluster changes or Consul is restarted on a node, the Vault token in the configuration reverts to the previous version (as seen with the consul connect ca get-config command).

Reproduction Steps

Steps to reproduce this issue, eg:

  1. Create a cluster with Vault CA integration
  2. Revoke the Vault Token and issue a new one
  3. Update the configuration on Consul with the consul connect ca set-config command
  4. Force an election or restart the leader node
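
To make the reproduction easier to check, here is a minimal Python sketch that compares the token in a saved `consul connect ca get-config` output against the token that was set. It assumes the output is JSON with the token under `Config.Token` (as the later comments in this thread suggest); the helper names and sample values are hypothetical.

```python
import json


def vault_token(ca_config_json):
    """Return the token stored in a CA config JSON document, or None if absent."""
    doc = json.loads(ca_config_json)
    # Assumption: the token lives under Config.Token in the get-config output.
    return (doc.get("Config") or {}).get("Token")


def token_reverted(expected_token, ca_config_json):
    """True if the config no longer carries the token that was set."""
    return vault_token(ca_config_json) != expected_token


# Hypothetical snapshots taken right after set-config and after a leader restart.
right_after_set = '{"Provider": "vault", "Config": {"Token": "new-token"}}'
after_restart = '{"Provider": "vault", "Config": {"Token": "old-token"}}'

print(token_reverted("new-token", right_after_set))
print(token_reverted("new-token", after_restart))
```

Running this against a snapshot taken after each election or restart would show whether the stored token changed without a new set-config call.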

It doesn't always happen, but I can reproduce this on all of our clusters (DM me if you want remote access to our SBX environment).

Consul info for both Client and Server

Server info
agent:
	check_monitors = 0
	check_ttls = 0
	checks = 3
	services = 3
build:
	prerelease = 
	revision = c976ffd2
	version = 1.10.3
consul:
	acl = enabled
	bootstrap = false
	known_datacenters = 3
	leader = true
	leader_addr = 172.16.18.231:8300
	server = true
raft:
	applied_index = 3459907
	commit_index = 3459907
	fsm_pending = 0
	last_contact = 0
	last_log_index = 3459907
	last_log_term = 31964
	last_snapshot_index = 3459154
	last_snapshot_term = 31964
	latest_configuration = [{Suffrage:Voter ID:0db759ad-0007-2ea4-4d41-62aff64515d5 Address:172.16.18.231:8300} {Suffrage:Voter ID:4de69dad-f717-cb90-5c23-011fe0301004 Address:172.16.19.63:8300} {Suffrage:Voter ID:b846fa4a-7c4b-f797-33bb-31cd99211e35 Address:172.16.18.7:8300}]
	latest_configuration_index = 0
	num_peers = 2
	protocol_version = 3
	protocol_version_max = 3
	protocol_version_min = 0
	snapshot_version_max = 1
	snapshot_version_min = 0
	state = Leader
	term = 31964
runtime:
	arch = amd64
	cpu_count = 2
	goroutines = 400
	max_procs = 2
	os = linux
	version = go1.16.7
serf_lan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 874
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 3
	member_time = 103391
	members = 16
	query_queue = 0
	query_time = 13
serf_wan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 1
	failed = 0
	health_score = 6
	intent_queue = 0
	left = 0
	member_time = 15622
	members = 9
	query_queue = 0
	query_time = 1

Operating system and Environment details

Amazon Linux 2 running on EC2 with Consul version 1.10.3 (before upgrade was 1.9.6)

Log Fragments

2021-10-07T00:53:27.206Z [ERROR] agent.server.connect: CA root replication failed, will retry: routine="secondary CA roots watch" error="Failed to initialize secondary CA provider: error configuring provider: Error making API request.

URL: GET https://vault.sandbox.homexlabs.com/v1/auth/token/lookup-self
Code: 403. Errors:

* permission denied"
@dnephin dnephin added theme/certificates Related to creating, distributing, and rotating certificates in Consul theme/connect Anything related to Consul Connect, Service Mesh, Side Car Proxies type/bug Feature does not function as expected labels Oct 20, 2021
@dnephin
Contributor

dnephin commented Dec 20, 2021

Thank you for the bug report! Sorry it has taken some time to respond.

I see the error is from a secondary datacenter. The context on the error indicates that something about the ConnectCA.Roots RPC response changed, and the secondary datacenter is being re-initialized as a result. I'd like to better understand the operations you performed, so I have a few questions.

I assume that all 3 datacenters are configured with the Vault provider, and point at the same instance of Vault. They should all use the same path for root_pki_path, and a different path for intermediate_pki_path (ref #11159).
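
As an illustration of that assumption, the following Python sketch checks a set of per-datacenter CA configs for a shared root path and distinct intermediate paths. The `RootPKIPath` and `IntermediatePKIPath` key names, the datacenter names, and the path values are assumptions made for the example, not taken from the reporter's setup.

```python
def check_pki_paths(dc_configs):
    """Return a list of problems found in a mapping of datacenter -> CA config dict."""
    problems = []

    # All datacenters are expected to share one root PKI path.
    roots = {cfg["RootPKIPath"] for cfg in dc_configs.values()}
    if len(roots) > 1:
        problems.append(f"root_pki_path differs across datacenters: {sorted(roots)}")

    # Each datacenter is expected to have its own intermediate PKI path.
    seen = {}
    for dc, cfg in sorted(dc_configs.items()):
        path = cfg["IntermediatePKIPath"]
        if path in seen:
            problems.append(
                f"intermediate_pki_path {path!r} is shared by {seen[path]} and {dc}"
            )
        else:
            seen[path] = dc
    return problems


# Hypothetical three-datacenter layout.
configs = {
    "dc1": {"RootPKIPath": "connect-root/", "IntermediatePKIPath": "connect-dc1/"},
    "dc2": {"RootPKIPath": "connect-root/", "IntermediatePKIPath": "connect-dc2/"},
    "dc3": {"RootPKIPath": "connect-root/", "IntermediatePKIPath": "connect-dc2/"},
}
for problem in check_pki_paths(configs):
    print(problem)
```

With the sample data above, only the shared intermediate path between dc2 and dc3 is reported.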

After revoking the Vault token, did you update the token in all 3 datacenters? The primary DC only shares its public CA certificate with secondaries, it does not share the CA config, so you would need to update all 3 DCs.

Because of the issue documented in #11811, every leader rotation in the primary will cause this secondary CA roots watch routine to run. If the leader in the primary changes before updating the ca config in the secondary DC, I'd expect to see this error.

Did you also see this error in the primary DC, or only secondary DCs? If there was another error in the primary DC can you please share it as well. I expect the error in the primary to contain more information about why the update might have failed.

@Amier3
Contributor

Amier3 commented Jan 7, 2022

Hey @rrijkse

Are you still experiencing this issue?

@Amier3 Amier3 added the waiting-reply Waiting on response from Original Poster or another individual in the thread label Jan 7, 2022
@github-actions github-actions bot removed the waiting-reply Waiting on response from Original Poster or another individual in the thread label Jan 7, 2022
@rrijkse
Author

rrijkse commented Jan 10, 2022

@Amier3 I did experience this yesterday when upgrading from 1.10.3 to 1.11.1: one of the datacenters "forgot" that we had switched away from the Vault CA to the Consul CA (to mitigate this issue).

@dnephin All datacenters did use the Vault CA and they all used the same root_pki_path and separate intermediate_pki_path. The root_pki_path was a CA that we generated and the Consul instances had limited access to it so it didn't get overwritten by the multiple DC's (I tried to look for an issue where that was discussed but can't find it since I don't have a copy of the message anymore). Each of the DC's then had their own intermediate_pki_path that only their token had access to (each DC had their own periodic token).

I don't have any way of testing this with the Vault CA provider since we have updated to use the Consul built-in CA, however with that setup the settings are still reverting from the Consul CA to the Vault CA and I couldn't find anything specific in the logs about why that is happening.

@Amier3
Contributor

Amier3 commented Jan 18, 2022

@rrijkse

Apologies for the delayed response. How are you currently managing Consul (e.g., systemd, config management)? If two different CA certs were generated and used in the same cluster, it would lead to the same issues you're experiencing.

@rrijkse
Author

rrijkse commented Jan 25, 2022

It's managed using config management, and we did have a couple of tries at getting the Vault config set up properly before it worked. Could that have caused issues? If so, how do I clear out any of the old configs/CAs? I have run consul connect ca set-config ... on all the servers, and every once in a while when they are restarted we get issues issuing certificates in that DC and I can see that the config has reverted to an older one.

@Amier3 Amier3 added the needs-investigation The issue described is detailed and complex. label Jan 31, 2022
@Penacillin

Penacillin commented May 22, 2022

I'm having a somewhat similar looking issue. To reproduce:

  • Update connect ca vault token in consul config
  • Restart consul
  • consul connect ca get-config
    The config Token is not updated to be the new token that's in the consul config.
    So I try to:
> consul connect ca get-config > connect_ca.json
# Edit connect_ca.json with new token
> consul connect ca set-config -config-file connect_ca.json
Error setting CA configuration: Unexpected response code: 500 (backend not initialized)

Actually after spamming that a bit, it eventually works... Maybe this will help someone
Edit: tried again, didn't work out this time.
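
The manual "edit connect_ca.json" step above could be scripted. The Python sketch below replaces only the token in a saved get-config document and leaves every other field untouched; it assumes the token sits under `Config.Token`, and the function name and sample values are hypothetical. It does not address the "backend not initialized" error itself.

```python
import json


def with_new_token(ca_config_json, new_token):
    """Return the CA config JSON with Config.Token replaced by new_token."""
    doc = json.loads(ca_config_json)
    # Assumption: the provider settings, including the token, live under "Config".
    config = doc.get("Config") or {}
    config["Token"] = new_token
    doc["Config"] = config
    return json.dumps(doc, indent=2)


# Hypothetical saved output of `consul connect ca get-config`.
saved = '{"Provider": "vault", "Config": {"Address": "https://vaulthost:8200", "Token": "old-token"}}'
print(with_new_token(saved, "new-token"))
```

The printed document can then be written to connect_ca.json and passed to consul connect ca set-config -config-file as in the commands above.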

@jkirschner-hashicorp jkirschner-hashicorp added the theme/consul-vault Relating to Consul & Vault interactions label Nov 8, 2022
@jkirschner-hashicorp
Contributor

The log fragment from the original report suggests that Consul is trying to look up the details of the Vault token provided in the CA config, but that token lacks the Vault permissions to look itself up:

2021-10-07T00:53:27.206Z [ERROR] agent.server.connect: CA root replication failed, will retry: routine="secondary CA roots watch" error="Failed to initialize secondary CA provider: error configuring provider: Error making API request.

URL: GET https://vault.sandbox.homexlabs.com/v1/auth/token/lookup-self
Code: 403. Errors:

* permission denied"

What happens if the Vault policy associated with that Vault token is updated to include the following rules?

path "auth/token/renew-self" {
    capabilities = [ "update" ]
}

path "auth/token/lookup-self" {
    capabilities = [ "read" ]
}

@jkirschner-hashicorp
Contributor

jkirschner-hashicorp commented Dec 1, 2022

For others following this issue who aren't the original reporter:

Did you also observe the same behavior with an error in the log about a 403 on a Vault auth/token/lookup-self endpoint? If not, what did you observe instead?

@Penacillin

On the latest version (consul 1.14.1), I am able to consistently set a new connect ca token with consul connect ca set-config (which fixes the 403s).

@jkirschner-hashicorp jkirschner-hashicorp added the waiting-reply Waiting on response from Original Poster or another individual in the thread label Jan 3, 2023
@n6g7

n6g7 commented Feb 15, 2023

@jkirschner-hashicorp I am seeing the 403 log about Vault's auth/token/lookup-self endpoint as well, similar to what @Penacillin described.

Context:

  • recently generated a new Vault token for the connect CA
  • updated the vault token in Consul config file
  • restarted Consul

I get the following error in Consul's logs:

Feb 15 00:33:21 consulhost consul[21645]: 2023-02-15T00:33:21.569-0800 [ERROR] connect.ca: Failed to initialize Connect CA: routine="CA initialization"
Feb 15 00:33:21 consulhost consul[21645]:   error=
Feb 15 00:33:21 consulhost consul[21645]:   | error configuring provider: Error making API request.
Feb 15 00:33:21 consulhost consul[21645]:   |
Feb 15 00:33:21 consulhost consul[21645]:   | URL: GET https://vaulthost:8200/v1/auth/token/lookup-self
Feb 15 00:33:21 consulhost consul[21645]:   | Code: 403. Errors:
Feb 15 00:33:21 consulhost consul[21645]:   |
Feb 15 00:33:21 consulhost consul[21645]:   | * permission denied

And when adding a Vault audit device I see the following audit logs:

{
  "time": "2023-02-15T08:37:37.736108508Z",
  "type": "request",
  "auth": { "token_type": "default" },
  "request": {
    "id": "9f340a00-b4e0-5fad-0465-fa9f489f1140",
    "operation": "read",
    "mount_type": "token",
    "client_token": "hmac-sha256:71e39cd0ec955d1c2762012087c78349c6028397a6967989bbd9c0afc585a00b",
    "namespace": { "id": "root" },
    "path": "auth/token/lookup-self",
    "remote_address": "_redacted_",
    "remote_port": 33382
  },
  "error": "permission denied"
}

Which I guess means that the token is invalid (client_token present in the request but auth mentions a default token type, but I may be misinterpreting), especially since the hash value shown for the client_token ("71e39...") doesn't match the hash of the new token.
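
One caveat when comparing hashes: as far as I understand, Vault audit devices record sensitive values as an HMAC-SHA256 keyed with a per-device salt rather than as a plain SHA-256, so a match can only be checked with that same salt (Vault exposes a sys/audit-hash endpoint for this purpose). The Python sketch below only illustrates the idea; the exact construction is an assumption, and the salt and token values are hypothetical since the real salt is internal to Vault.

```python
import hashlib
import hmac


def audit_hash(salt, value):
    """Salted HMAC-SHA256 in the 'hmac-sha256:<hex>' form seen in the audit log."""
    digest = hmac.new(salt.encode(), value.encode(), hashlib.sha256).hexdigest()
    return "hmac-sha256:" + digest


def same_token(salt, candidate_token, logged_hash):
    """True if candidate_token hashes to the value recorded in the audit log."""
    return hmac.compare_digest(audit_hash(salt, candidate_token), logged_hash)


# Hypothetical salt and tokens, for illustration only.
salt = "example-salt"
logged = audit_hash(salt, "ghost-token")

print(same_token(salt, "ghost-token", logged))
print(same_token(salt, "new-token", logged))
# A plain SHA-256 of the token is a different value from the salted HMAC.
print(hashlib.sha256(b"ghost-token").hexdigest() == logged.split(":", 1)[1])
```

In practice the comparison would go through Vault itself, because the salt is not something an operator normally has access to.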

I also checked that the new token has the correct permissions: VAULT_TOKEN=$newtoken vault token lookup does produce the expected result.

So it seems like Consul is still using a ghost token for Vault, even after a full restart?

Using consul connect ca set-config does not seem to be an option for me, as I get a "403 (ACL not found)" error. I assume Consul's init routine fails to get past the Connect CA and never loads ACLs?

Edit: found out about the Vault audit device log_raw option and was able to find the actual token used by Consul, then ran vault token lookup $ghosttoken and got:

Error looking up token: Error making API request.

URL: POST https://vaulthost:8200/v1/auth/token/lookup
Code: 403. Errors:

* bad token

6 participants