
Vault fills up the /core/leader key with uuid in bogus storage data #4326

Closed
uepoch opened this issue Apr 10, 2018 · 1 comment

Comments

@uepoch
Contributor

uepoch commented Apr 10, 2018

Environment:

  • Vault Version: 0.10.0-rc1
  • Operating System/Architecture: Linux

Vault Config File:

{
    "ha_storage": {
        "consul": {
            "address": "http://localhost:{{env "foo"}}",
            "path": "ha-vault"
        }
    },
    "ui": true,
    "storage": {
        "consul": {
            "address": "http://localhost:{{env "foo"}}",
        }
    },
    "listener": {
        "tcp": {
            "address": "0.0.0.0:{{env "bar"}}",
            "tls_disable": 1
        }
    },
    "disable_mlock": true,
    "telemetry": {
        {{ range service "something" }}
        "dogstatsd_addr": "{{ .Address }}:{{ .Port }}",
        {{ end }}
    }
}
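
For context, the {{env ...}} and {{range service ...}} placeholders above are consul-template syntax rendered at deploy time. A rendered config would look roughly like the following; the port numbers and the dogstatsd address are hypothetical values, not taken from our deployment:

{
    "ha_storage": {
        "consul": {
            "address": "http://localhost:8500",
            "path": "ha-vault"
        }
    },
    "ui": true,
    "storage": {
        "consul": {
            "address": "http://localhost:8500"
        }
    },
    "listener": {
        "tcp": {
            "address": "0.0.0.0:8200",
            "tls_disable": 1
        }
    },
    "disable_mlock": true,
    "telemetry": {
        "dogstatsd_addr": "127.0.0.1:8125"
    }
}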

Startup Log Output:

2018-04-10T09:07:27.643Z [INFO ] core: acquired lock, enabling active operation
2018-04-10T09:07:27.705Z [DEBUG] core: generating local cluster certificate
2018-04-10T09:07:27.727Z [INFO ] core: post-unseal setup starting
2018-04-10T09:07:27.727Z [DEBUG] core: clearing forwarding clients
2018-04-10T09:07:27.727Z [DEBUG] core: done clearing forwarding clients
2018-04-10T09:07:27.732Z [INFO ] core: loaded wrapping token key
2018-04-10T09:07:27.732Z [INFO ] core: successfully setup plugin catalog: plugin-directory=
2018-04-10T09:07:27.735Z [INFO ] core: successfully mounted backend: type=kv path=secret/
2018-04-10T09:07:27.735Z [INFO ] core: successfully mounted backend: type=system path=sys/
2018-04-10T09:07:27.735Z [INFO ] core: successfully mounted backend: type=identity path=identity/
2018-04-10T09:07:27.737Z [INFO ] core: successfully mounted backend: type=kv path=teams/
2018-04-10T09:07:27.737Z [INFO ] core: successfully mounted backend: type=kv path=services/
2018-04-10T09:07:27.737Z [INFO ] core: successfully mounted backend: type=cubbyhole path=cubbyhole/
2018-04-10T09:07:27.737Z [INFO ] core: successfully mounted backend: type=kv path=test/
2018-04-10T09:07:27.746Z [INFO ] core: restoring leases
2018-04-10T09:07:27.746Z [INFO ] rollback: starting rollback manager
2018-04-10T09:07:27.746Z [DEBUG] expiration: collecting leases
2018-04-10T09:07:27.749Z [DEBUG] audit: adding reload function: path=file/
2018-04-10T09:07:27.749Z [DEBUG] audit: file backend options: path=file/ file_path=tmp/audit.log
2018-04-10T09:07:27.749Z [DEBUG] audit: syslog backend options: path=syslog/ facility=local5 tag=
2018-04-10T09:07:27.749Z [DEBUG] identity: loading entities
2018-04-10T09:07:27.750Z [DEBUG] identity: entities collected: num_existing=11
2018-04-10T09:07:27.750Z [DEBUG] identity: entities loading: progress=0
2018-04-10T09:07:27.806Z [INFO ] core: pre-seal teardown starting
2018-04-10T09:07:27.806Z [INFO ] core: cluster listeners not running
2018-04-10T09:07:27.806Z [DEBUG] audit: removing reload function: path=file/
2018-04-10T09:07:27.806Z [DEBUG] expiration: stop triggered
2018-04-10T09:07:27.818Z [DEBUG] expiration: leases collected: num_existing=37
2018-04-10T09:07:27.826Z [DEBUG] expiration: finished stopping
2018-04-10T09:07:27.826Z [INFO ] rollback: stopping rollback manager
2018-04-10T09:07:27.826Z [INFO ] core: pre-seal teardown complete
2018-04-10T09:07:27.826Z [ERROR] core: post-unseal setup failed: error="failed to update entity in MemDB: alias "e65e63eb-b610-b07d-95e6-24db1e36f4a1" in already tied to a different entity "134ff424-41d8-51a7-c3b6-8e367a83fe3c""
2018-04-10T09:07:27.829Z [DEBUG] core: parsing information for new active node: active_cluster_addr=https://mesos-slave005-par.central.criteo.preprod:31580 active_redirect_addr=http://mesos-slave005-par.central.criteo.preprod:31579
2018-04-10T09:07:27.829Z [DEBUG] core: refreshing forwarding connection
2018-04-10T09:07:27.829Z [DEBUG] core: clearing forwarding clients
2018-04-10T09:07:27.829Z [DEBUG] core: done clearing forwarding clients
2018-04-10T09:07:27.829Z [DEBUG] core: done refreshing forwarding connection
2018-04-10T09:07:27.830Z [DEBUG] core: creating rpc dialer: host=fw-b049322c-ae03-90f6-37e4-18fd54328eee
2018-04-10T09:07:27.830Z [DEBUG] core: forwarding: error sending echo request to active node: error="rpc error: code = Unavailable desc = all SubConns are in TransientFailure"
2018-04-10T09:07:27.835Z [INFO ] core: acquired lock, enabling active operation
2018-04-10T09:07:27.853Z [DEBUG] core: generating local cluster certificate
2018-04-10T09:07:27.923Z [INFO ] core: post-unseal setup starting
2018-04-10T09:07:27.923Z [DEBUG] core: clearing forwarding clients
2018-04-10T09:07:27.923Z [DEBUG] core: done clearing forwarding clients
2018-04-10T09:07:27.923Z [DEBUG] core: forwarding: stopping heartbeating
2018-04-10T09:07:27.925Z [INFO ] core: loaded wrapping token key
2018-04-10T09:07:27.925Z [INFO ] core: successfully setup plugin catalog: plugin-directory=
2018-04-10T09:07:27.928Z [INFO ] core: successfully mounted backend: type=kv path=secret/
2018-04-10T09:07:27.928Z [INFO ] core: successfully mounted backend: type=system path=sys/
2018-04-10T09:07:27.929Z [INFO ] core: successfully mounted backend: type=identity path=identity/
2018-04-10T09:07:27.930Z [INFO ] core: successfully mounted backend: type=kv path=teams/
2018-04-10T09:07:27.930Z [INFO ] core: successfully mounted backend: type=kv path=services/
2018-04-10T09:07:27.930Z [INFO ] core: successfully mounted backend: type=cubbyhole path=cubbyhole/
2018-04-10T09:07:27.930Z [INFO ] core: successfully mounted backend: type=kv path=test/
2018-04-10T09:07:27.938Z [INFO ] core: restoring leases
2018-04-10T09:07:27.939Z [INFO ] rollback: starting rollback manager
2018-04-10T09:07:27.939Z [DEBUG] expiration: collecting leases
2018-04-10T09:07:27.941Z [DEBUG] audit: adding reload function: path=file/
2018-04-10T09:07:27.941Z [DEBUG] audit: file backend options: path=file/ file_path=tmp/audit.log
2018-04-10T09:07:27.941Z [DEBUG] audit: syslog backend options: path=syslog/ facility=local5 tag=
2018-04-10T09:07:27.941Z [DEBUG] identity: loading entities
2018-04-10T09:07:27.943Z [DEBUG] identity: entities collected: num_existing=11
2018-04-10T09:07:27.943Z [DEBUG] identity: entities loading: progress=0
2018-04-10T09:07:27.946Z [INFO ] core: pre-seal teardown starting
2018-04-10T09:07:27.946Z [INFO ] core: cluster listeners not running
2018-04-10T09:07:27.946Z [DEBUG] audit: removing reload function: path=file/
2018-04-10T09:07:27.946Z [DEBUG] expiration: stop triggered
2018-04-10T09:07:27.957Z [DEBUG] expiration: leases collected: num_existing=37
2018-04-10T09:07:27.966Z [DEBUG] expiration: finished stopping
2018-04-10T09:07:27.967Z [INFO ] rollback: stopping rollback manager
2018-04-10T09:07:27.967Z [INFO ] core: pre-seal teardown complete
2018-04-10T09:07:27.967Z [ERROR] core: post-unseal setup failed: error="failed to update entity in MemDB: alias "212725fb-4fd5-3bc7-7c0b-7de77deea70b" in already tied to a different entity "b4668e4a-4f8f-ea3c-08e3-e822c9f4fdf1""
2018-04-10T09:07:27.970Z [DEBUG] core: parsing information for new active node: active_cluster_addr=https://mesos-slave005-par.central.criteo.preprod:31580 active_redirect_addr=http://mesos-slave005-par.central.criteo.preprod:31579
2018-04-10T09:07:27.970Z [DEBUG] core: refreshing forwarding connection
2018-04-10T09:07:27.970Z [DEBUG] core: clearing forwarding clients
2018-04-10T09:07:27.970Z [DEBUG] core: done clearing forwarding clients
2018-04-10T09:07:27.970Z [DEBUG] core: done refreshing forwarding connection
2018-04-10T09:07:27.970Z [DEBUG] core: creating rpc dialer: host=fw-daa50d57-5c81-d7d7-b2c2-8566277eff4f
2018-04-10T09:07:27.970Z [DEBUG] core: forwarding: error sending echo request to active node: error="rpc error: code = Unavailable desc = all SubConns are in TransientFailure"


This repeats every 2 seconds or so.

Expected Behavior:

I expect Vault to crash and not try to acquire the lock again.

Actual Behavior:

Vault keeps looping through the behavior shown in the logs: no leader is ever elected, and it fills the leader key in Consul on each attempt without cleaning up.

Steps to Reproduce:

OK, this one is kinda tricky. We had two sets of Consul data imported into the same KV.

  1. Create a Vault cluster on any storage backend.
  2. Create an entity with an auth backend.
  3. Back up the storage and restore it for a second Vault cluster on separate storage.
  4. Modify any entity on cluster 2.
  5. Back up cluster 2's data and import it into cluster 1.
  6. Try to start cluster 1.
  7. Look at the /core/leader path in your storage (see the sketch after this list).
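
To watch the leader entries pile up, something like the following sketch against the Consul KV API can be used. It assumes the ha-vault path from the config above and a local Consul agent; both are deployment-specific and should be adjusted:

package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	// Connect to the local Consul agent (honours CONSUL_HTTP_ADDR).
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// "ha-vault" is the ha_storage path from the config above; change it
	// to match your own deployment.
	pairs, _, err := client.KV().List("ha-vault/core/leader/", nil)
	if err != nil {
		log.Fatal(err)
	}

	// Each failed leadership attempt leaves one more UUID-named entry here.
	fmt.Printf("%d entries under core/leader/\n", len(pairs))
	for _, p := range pairs {
		fmt.Println(p.Key)
	}
}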

Important Factoids:

I totally understand this setup is wrong in many ways.
I just wanted to raise the issue of Vault endlessly creating leader entries when the storage data is bogus.

I'm not sure whether it should simply crash, or keep trying to acquire the lock after detecting the entity conflict
(all nodes of the cluster will be impacted the same way).

References:

I'm willing to look at implementing a patch if you identify this as a real bug.

@jefferai
Member

There isn't really much that we can do here, because when a new Vault takes leadership it needs to write its information for other HA nodes to find. The only time that info can be safely cleaned out is when grabbing leadership is successful. So it's a bit chicken and egg, but as you noted in normal usage this should never really occur. Thanks for calling attention to it though!
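
(For illustration, the chicken-and-egg ordering described above could be sketched roughly as follows. Every function here is a hypothetical stand-in for this sketch, not Vault's actual code.)

package main

import (
	"errors"
	"fmt"
	"time"
)

// Hypothetical stand-ins for illustration only; none of these are Vault APIs.
func acquireHALock() bool              { return true }                                        // blocks until this node wins the HA lock
func writeLeaderEntry(id string) error { fmt.Println("write core/leader/" + id); return nil } // advertise the active node for standbys
func postUnsealSetup() error           { return errors.New("entity conflict in MemDB") }      // fails on the bogus storage data
func cleanupStaleLeaderEntries() error { fmt.Println("cleanup core/leader/*"); return nil }   // only safe once leadership is fully held
func releaseHALock()                   {}

func main() {
	for attempt := 1; attempt <= 3; attempt++ { // the real loop never terminates on its own
		if !acquireHALock() {
			continue
		}
		// The leader entry must be written before we know whether post-unseal
		// setup will succeed, because standbys need it to find the active node.
		_ = writeLeaderEntry(fmt.Sprintf("uuid-%d", attempt))

		if err := postUnsealSetup(); err != nil {
			// Setup failed: step down. Deleting the entry here is not safe,
			// since this node never successfully became active, so the entry
			// is left behind and another one appears on the next attempt.
			releaseHALock()
			time.Sleep(2 * time.Second)
			continue
		}

		// Only a node that successfully became active reaches this point and
		// can clean out stale entries; with the conflicting entity data, no
		// node ever does.
		_ = cleanupStaleLeaderEntries()
		break
	}
}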
