
Vault fills up the /core/leader key with uuid in bogus storage data #4326

Closed
uepoch opened this issue Apr 10, 2018 · 1 comment

Comments

@uepoch
Contributor

uepoch commented Apr 10, 2018

Environment:

  • Vault Version: 0.10.0-rc1
  • Operating System/Architecture: Linux

Vault Config File:

{
    "ha_storage": {
        "consul": {
            "address": "http://localhost:{{env "foo"}}",
            "path": "ha-vault"
        }
    },
    "ui": true,
    "storage": {
        "consul": {
            "address": "http://localhost:{{env "foo"}}",
        }
    },
    "listener": {
        "tcp": {
            "address": "0.0.0.0:{{env "bar"}}",
            "tls_disable": 1
        }
    },
    "disable_mlock": true,
    "telemetry": {
        {{ range service "something" }}
        "dogstatsd_addr": "{{ .Address }}:{{ .Port }}",
        {{ end }}
    }
}
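
For context, the {{env ...}} and {{range service ...}} placeholders above are consul-template syntax rendered at deploy time. A rendered config would look roughly like the following; the port numbers and the dogstatsd address are hypothetical values, not taken from our deployment:

{
    "ha_storage": {
        "consul": {
            "address": "http://localhost:8500",
            "path": "ha-vault"
        }
    },
    "ui": true,
    "storage": {
        "consul": {
            "address": "http://localhost:8500"
        }
    },
    "listener": {
        "tcp": {
            "address": "0.0.0.0:8200",
            "tls_disable": 1
        }
    },
    "disable_mlock": true,
    "telemetry": {
        "dogstatsd_addr": "127.0.0.1:8125"
    }
}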

Startup Log Output:

2018-04-10T09:07:27.643Z [INFO ] core: acquired lock, enabling active operation
2018-04-10T09:07:27.705Z [DEBUG] core: generating local cluster certificate
2018-04-10T09:07:27.727Z [INFO ] core: post-unseal setup starting
2018-04-10T09:07:27.727Z [DEBUG] core: clearing forwarding clients
2018-04-10T09:07:27.727Z [DEBUG] core: done clearing forwarding clients
2018-04-10T09:07:27.732Z [INFO ] core: loaded wrapping token key
2018-04-10T09:07:27.732Z [INFO ] core: successfully setup plugin catalog: plugin-directory=
2018-04-10T09:07:27.735Z [INFO ] core: successfully mounted backend: type=kv path=secret/
2018-04-10T09:07:27.735Z [INFO ] core: successfully mounted backend: type=system path=sys/
2018-04-10T09:07:27.735Z [INFO ] core: successfully mounted backend: type=identity path=identity/
2018-04-10T09:07:27.737Z [INFO ] core: successfully mounted backend: type=kv path=teams/
2018-04-10T09:07:27.737Z [INFO ] core: successfully mounted backend: type=kv path=services/
2018-04-10T09:07:27.737Z [INFO ] core: successfully mounted backend: type=cubbyhole path=cubbyhole/
2018-04-10T09:07:27.737Z [INFO ] core: successfully mounted backend: type=kv path=test/
2018-04-10T09:07:27.746Z [INFO ] core: restoring leases
2018-04-10T09:07:27.746Z [INFO ] rollback: starting rollback manager
2018-04-10T09:07:27.746Z [DEBUG] expiration: collecting leases
2018-04-10T09:07:27.749Z [DEBUG] audit: adding reload function: path=file/
2018-04-10T09:07:27.749Z [DEBUG] audit: file backend options: path=file/ file_path=tmp/audit.log
2018-04-10T09:07:27.749Z [DEBUG] audit: syslog backend options: path=syslog/ facility=local5 tag=
2018-04-10T09:07:27.749Z [DEBUG] identity: loading entities
2018-04-10T09:07:27.750Z [DEBUG] identity: entities collected: num_existing=11
2018-04-10T09:07:27.750Z [DEBUG] identity: entities loading: progress=0
2018-04-10T09:07:27.806Z [INFO ] core: pre-seal teardown starting
2018-04-10T09:07:27.806Z [INFO ] core: cluster listeners not running
2018-04-10T09:07:27.806Z [DEBUG] audit: removing reload function: path=file/
2018-04-10T09:07:27.806Z [DEBUG] expiration: stop triggered
2018-04-10T09:07:27.818Z [DEBUG] expiration: leases collected: num_existing=37
2018-04-10T09:07:27.826Z [DEBUG] expiration: finished stopping
2018-04-10T09:07:27.826Z [INFO ] rollback: stopping rollback manager
2018-04-10T09:07:27.826Z [INFO ] core: pre-seal teardown complete
2018-04-10T09:07:27.826Z [ERROR] core: post-unseal setup failed: error="failed to update entity in MemDB: alias "e65e63eb-b610-b07d-95e6-24db1e36f4a1" in already tied to a different entity "134ff424-41d8-51a7-c3b6-8e367a83fe3c""
2018-04-10T09:07:27.829Z [DEBUG] core: parsing information for new active node: active_cluster_addr=https://mesos-slave005-par.central.criteo.preprod:31580 active_redirect_addr=http://mesos-slave005-par.central.criteo.preprod:31579
2018-04-10T09:07:27.829Z [DEBUG] core: refreshing forwarding connection
2018-04-10T09:07:27.829Z [DEBUG] core: clearing forwarding clients
2018-04-10T09:07:27.829Z [DEBUG] core: done clearing forwarding clients
2018-04-10T09:07:27.829Z [DEBUG] core: done refreshing forwarding connection
2018-04-10T09:07:27.830Z [DEBUG] core: creating rpc dialer: host=fw-b049322c-ae03-90f6-37e4-18fd54328eee
2018-04-10T09:07:27.830Z [DEBUG] core: forwarding: error sending echo request to active node: error="rpc error: code = Unavailable desc = all SubConns are in TransientFailure"
2018-04-10T09:07:27.835Z [INFO ] core: acquired lock, enabling active operation
2018-04-10T09:07:27.853Z [DEBUG] core: generating local cluster certificate
2018-04-10T09:07:27.923Z [INFO ] core: post-unseal setup starting
2018-04-10T09:07:27.923Z [DEBUG] core: clearing forwarding clients
2018-04-10T09:07:27.923Z [DEBUG] core: done clearing forwarding clients
2018-04-10T09:07:27.923Z [DEBUG] core: forwarding: stopping heartbeating
2018-04-10T09:07:27.925Z [INFO ] core: loaded wrapping token key
2018-04-10T09:07:27.925Z [INFO ] core: successfully setup plugin catalog: plugin-directory=
2018-04-10T09:07:27.928Z [INFO ] core: successfully mounted backend: type=kv path=secret/
2018-04-10T09:07:27.928Z [INFO ] core: successfully mounted backend: type=system path=sys/
2018-04-10T09:07:27.929Z [INFO ] core: successfully mounted backend: type=identity path=identity/
2018-04-10T09:07:27.930Z [INFO ] core: successfully mounted backend: type=kv path=teams/
2018-04-10T09:07:27.930Z [INFO ] core: successfully mounted backend: type=kv path=services/
2018-04-10T09:07:27.930Z [INFO ] core: successfully mounted backend: type=cubbyhole path=cubbyhole/
2018-04-10T09:07:27.930Z [INFO ] core: successfully mounted backend: type=kv path=test/
2018-04-10T09:07:27.938Z [INFO ] core: restoring leases
2018-04-10T09:07:27.939Z [INFO ] rollback: starting rollback manager
2018-04-10T09:07:27.939Z [DEBUG] expiration: collecting leases
2018-04-10T09:07:27.941Z [DEBUG] audit: adding reload function: path=file/
2018-04-10T09:07:27.941Z [DEBUG] audit: file backend options: path=file/ file_path=tmp/audit.log
2018-04-10T09:07:27.941Z [DEBUG] audit: syslog backend options: path=syslog/ facility=local5 tag=
2018-04-10T09:07:27.941Z [DEBUG] identity: loading entities
2018-04-10T09:07:27.943Z [DEBUG] identity: entities collected: num_existing=11
2018-04-10T09:07:27.943Z [DEBUG] identity: entities loading: progress=0
2018-04-10T09:07:27.946Z [INFO ] core: pre-seal teardown starting
2018-04-10T09:07:27.946Z [INFO ] core: cluster listeners not running
2018-04-10T09:07:27.946Z [DEBUG] audit: removing reload function: path=file/
2018-04-10T09:07:27.946Z [DEBUG] expiration: stop triggered
2018-04-10T09:07:27.957Z [DEBUG] expiration: leases collected: num_existing=37
2018-04-10T09:07:27.966Z [DEBUG] expiration: finished stopping
2018-04-10T09:07:27.967Z [INFO ] rollback: stopping rollback manager
2018-04-10T09:07:27.967Z [INFO ] core: pre-seal teardown complete
2018-04-10T09:07:27.967Z [ERROR] core: post-unseal setup failed: error="failed to update entity in MemDB: alias "212725fb-4fd5-3bc7-7c0b-7de77deea70b" in already tied to a different entity "b4668e4a-4f8f-ea3c-08e3-e822c9f4fdf1""
2018-04-10T09:07:27.970Z [DEBUG] core: parsing information for new active node: active_cluster_addr=https://mesos-slave005-par.central.criteo.preprod:31580 active_redirect_addr=http://mesos-slave005-par.central.criteo.preprod:31579
2018-04-10T09:07:27.970Z [DEBUG] core: refreshing forwarding connection
2018-04-10T09:07:27.970Z [DEBUG] core: clearing forwarding clients
2018-04-10T09:07:27.970Z [DEBUG] core: done clearing forwarding clients
2018-04-10T09:07:27.970Z [DEBUG] core: done refreshing forwarding connection
2018-04-10T09:07:27.970Z [DEBUG] core: creating rpc dialer: host=fw-daa50d57-5c81-d7d7-b2c2-8566277eff4f
2018-04-10T09:07:27.970Z [DEBUG] core: forwarding: error sending echo request to active node: error="rpc error: code = Unavailable desc = all SubConns are in TransientFailure"


This repeats every 2 seconds or so.

Expected Behavior:

I expect Vault to crash and not try to acquire the lock again.

Actual Behavior:

Vault keeps looping through the behavior shown in the logs: no leader is ever elected, and it fills the leader key in Consul on each attempt without cleaning up.

Steps to Reproduce:

OK, this one is kinda tricky. We had two sets of Consul data imported into the same KV.

  1. Create a Vault cluster on any storage backend.
  2. Create an entity with an auth backend.
  3. Back up the storage and restore it for a second Vault cluster on separate storage.
  4. Modify any entity on cluster 2.
  5. Back up cluster 2's data and import it into cluster 1.
  6. Try to start cluster 1.
  7. Look at the /core/leader path in your storage (see the sketch after this list).
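
To watch the leader entries pile up, something like the following sketch against the Consul KV API can be used. It assumes the ha-vault path from the config above and a local Consul agent; both are deployment-specific and should be adjusted:

package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	// Connect to the local Consul agent (honours CONSUL_HTTP_ADDR).
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// "ha-vault" is the ha_storage path from the config above; change it
	// to match your own deployment.
	pairs, _, err := client.KV().List("ha-vault/core/leader/", nil)
	if err != nil {
		log.Fatal(err)
	}

	// Each failed leadership attempt leaves one more UUID-named entry here.
	fmt.Printf("%d entries under core/leader/\n", len(pairs))
	for _, p := range pairs {
		fmt.Println(p.Key)
	}
}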

Important Factoids:

I totally understand this setup is wrong in many ways.
I just wanted to raise the issue of Vault endlessly creating leader entries when the storage data is bogus.

I'm not sure whether it should simply crash, or keep trying to acquire the lock after detecting the entity conflict
(all nodes of the cluster will be impacted the same way).

References:

I'm willing to look at implementing a patch if you identify this as a real bug.

@jefferai
Member

There isn't really much that we can do here, because when a new Vault takes leadership it needs to write its information for other HA nodes to find. The only time that info can be safely cleaned out is when grabbing leadership is successful. So it's a bit chicken and egg, but as you noted in normal usage this should never really occur. Thanks for calling attention to it though!
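
(For illustration, the chicken-and-egg ordering described above could be sketched roughly as follows. Every function here is a hypothetical stand-in for this sketch, not Vault's actual code.)

package main

import (
	"errors"
	"fmt"
	"time"
)

// Hypothetical stand-ins for illustration only; none of these are Vault APIs.
func acquireHALock() bool              { return true }                                        // blocks until this node wins the HA lock
func writeLeaderEntry(id string) error { fmt.Println("write core/leader/" + id); return nil } // advertise the active node for standbys
func postUnsealSetup() error           { return errors.New("entity conflict in MemDB") }      // fails on the bogus storage data
func cleanupStaleLeaderEntries() error { fmt.Println("cleanup core/leader/*"); return nil }   // only safe once leadership is fully held
func releaseHALock()                   {}

func main() {
	for attempt := 1; attempt <= 3; attempt++ { // the real loop never terminates on its own
		if !acquireHALock() {
			continue
		}
		// The leader entry must be written before we know whether post-unseal
		// setup will succeed, because standbys need it to find the active node.
		_ = writeLeaderEntry(fmt.Sprintf("uuid-%d", attempt))

		if err := postUnsealSetup(); err != nil {
			// Setup failed: step down. Deleting the entry here is not safe,
			// since this node never successfully became active, so the entry
			// is left behind and another one appears on the next attempt.
			releaseHALock()
			time.Sleep(2 * time.Second)
			continue
		}

		// Only a node that successfully became active reaches this point and
		// can clean out stale entries; with the conflicting entity data, no
		// node ever does.
		_ = cleanupStaleLeaderEntries()
		break
	}
}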
