
Fix k8s service registration case where Vault fails to unlabel itself as a leader #21642

Merged: 2 commits, Jul 17, 2023

Conversation

tomhjp (Contributor) commented Jul 6, 2023

This bug only ever occurs on Enterprise, because we only call sealInternalWithOptions with keepHALock set to true in Enterprise, so keepHALockOnStepDown is always 0 in OSS. When we step down as leader but keep the HA lock, we should still unlabel ourselves as leader in k8s, but that unlabelling happens in clearLeader, so before this fix if we keep the HA lock we'll never unlabel ourselves. Essentially, this change ensures the service registration more closely tracks the core's standby state variable in that case.
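For illustration only, here's a minimal Go sketch of the idea (a cut-down ServiceRegistration interface and a hypothetical stepDown helper, not the actual Vault code or this PR's diff): notify the service registration that the node is no longer active whenever the core transitions to standby, regardless of whether the HA lock is kept.

// Sketch only: simplified to mirror the description above; not the actual Vault diff.
package main

import "fmt"

// ServiceRegistration mirrors the relevant slice of Vault's service
// registration interface: backends such as Kubernetes are notified when the
// node's active status changes so they can update the vault-active label.
type ServiceRegistration interface {
	NotifyActiveStateChange(isActive bool) error
}

type core struct {
	standby bool
	sr      ServiceRegistration
}

// stepDown is a hypothetical helper illustrating the ordering at issue: the
// core always becomes a standby on step-down, so the service registration
// should always be told, even when the HA lock is kept across the reseal.
func (c *core) stepDown(keepHALock bool) {
	c.standby = true

	// Before the fix, the unlabelling only happened in the clearLeader path,
	// which is skipped when keepHALock is true, so the pod stayed labelled
	// vault-active=true.
	if err := c.sr.NotifyActiveStateChange(false); err != nil {
		fmt.Println("failed to notify service registration:", err)
	}

	if !keepHALock {
		// ... release the HA lock (clearLeader path) ...
	}
}

type noopSR struct{}

func (noopSR) NotifyActiveStateChange(bool) error { return nil }

func main() {
	c := &core{sr: noopSR{}}
	c.stepDown(true) // keep the HA lock, but still drop the active label
	fmt.Println("standby:", c.standby)
}

Running the sketch with keepHALock=true still drops the active label before printing standby: true, which mirrors what the change ensures for the k8s registration.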

I'm not sure about automated testing yet. I've been using the following script to reproduce the issue locally.

Repro script

#!/usr/bin/env bash
# 27 Jul 2022 - Sean Ellefson
# https://hashicorp.zendesk.com/agent/tickets/79606
#
# This script attempts to reproduce an issue with Kubernetes service
# registration where the 'vault-active' label sometimes doesn't get updated,
# resulting in more than one pod with the label 'vault-active=true'.  It doesn't
# always reproduce, but seems to occur more reliably than other methods when
# hitting the 'update-primary' endpoint.
#
# This assumes you have Helm installed and configured with the HashiCorp
# repository as well as a local Kubernetes environment (built with minikube),
# and `jq`.  You'll also need to ensure you have your Vault Enterprise license
# created as a k8s secret with the key-name "vault.hclic".
#
# The script deploys a Vault dev server, configures the Transit secrets engine,
# and enables DR primary replication.  It then deploys a Vault Raft cluster
# that uses the dev server for Transit auto-unseal, enables DR secondary
# replication, and hits the 'update-primary' endpoint.  A loop at the end of
# the script lets you repeatedly submit secondary tokens to the
# 'update-primary' endpoint until the issue occurs.

# Arbitrary number of iterations to generate activity on the primary cluster;
# seems to help the issue recur
WAL_ITERATIONS=50

# Colors because the world is a colorful place 🌎
TXTBLU="$(tput setaf 4)"
TXTCYA="$(tput setaf 6)"
TXTGRN="$(tput setaf 2)"
TXTMGT="$(tput setaf 5)"
TXTRED="$(tput setaf 1)"
TXTYLW="$(tput setaf 3)"
TXTWHT="$(tput setaf 7)"
TXTRST="$(tput sgr0)"

msg() {
    MSGSRC="[repro-79606]"
    MSGTYPE="$1"
    MSGTXT="$2"
    case "${MSGTYPE}" in
        greeting)
            printf "%s%s [=] %s %s\\n" "$TXTBLU" "$MSGSRC" "$MSGTXT" "$TXTRST"
            ;;
        info)
            printf "%s%s [i] %s %s\\n" "$TXTCYA" "$MSGSRC" "$MSGTXT" "$TXTRST"
            ;;
        success)
            printf "%s%s [+] %s %s\\n" "$TXTGRN" "$MSGSRC" "$MSGTXT" "$TXTRST"
            ;;
        complete)
            printf "%s%s [^] %s %s\\n" "$TXTGRN" "$MSGSRC" "$MSGTXT" "$TXTRST"
            ;;
        boom)
            printf "%s%s [*] %s %s\\n" "$TXTMGT" "$MSGSRC" "$MSGTXT" "$TXTRST"
            ;;
        notice)
            printf "%s%s [?] %s %s\\n" "$TXTYLW" "$MSGSRC" "$MSGTXT" "$TXTRST"
            ;;
        alert)
            >&2 printf "%s%s [!] %s %s\\n" "$TXTRED" "$MSGSRC" "$MSGTXT" "$TXTRST"
            ;;
        *)
            >&2 printf "%s%s [@] %s %s\\n" "$TXTCYA" "$MSGSRC" "$MSGTXT" "$TXTRST"
            ;;
    esac
}

trap cleanup SIGINT

cleanup() {
  msg alert "Caught interrupt!  Cleaning up..."
  helm uninstall vault-primary vault-secondary
  kubectl delete pvc data-vault-secondary-{0..2}
  if ps -p $SECONDARY_0_LOG_PID > /dev/null 2>&1 ; then kill -9 $SECONDARY_0_LOG_PID ; fi
  if ps -p $SECONDARY_1_LOG_PID > /dev/null 2>&1 ; then kill -9 $SECONDARY_1_LOG_PID ; fi
  if ps -p $SECONDARY_2_LOG_PID > /dev/null 2>&1 ; then kill -9 $SECONDARY_2_LOG_PID ; fi
  msg notice "Exiting..."
  exit
}


# Capture logs from secondary pods
msg info "Creating directory './repro-79606-logs'"
mkdir -p ./repro-79606-logs

# Dev server to be used as Transit auto-unseal target and replication primary
msg info "Deploy primary server "
helm install vault-primary hashicorp/vault \
  --set=server.dev.enabled=true \
  --set=server.dev.devRootToken=root \
  --set=server.standalone.enabled=true \
  --set=server.image.repository=hashicorp/vault-enterprise \
  --set=server.image.tag=1.14.0-ent \
  --set=server.enterpriseLicense.secretName=vault-license \
  --set=server.enterpriseLicense.secretKey=vault.hclic \
  --set=server.extraArgs="-dev-ha -dev-transactional" \
  --set=injector.enabled=false \
  --set=global.tlsDisable=true > /dev/null 

msg info "Wait until pod is ready"
until [ $(sleep 1 ; kubectl get pod vault-primary-0 -o json | jq .status.containerStatuses[].ready) == "true" ] 2> /dev/null ; do 
  sleep 2
done

msg info "Enable DR primary replication, prepare transit auto-unseal"
kubectl exec -it vault-primary-0 -- vault login root 

kubectl exec -it vault-primary-0 -- vault secrets enable transit
kubectl exec -it vault-primary-0 -- vault write -f transit/keys/autounseal
kubectl exec -it vault-primary-0 -- sh -c 'vault policy write autounseal - << EOF
path "transit/encrypt/autounseal" {
   capabilities = [ "update" ]
 }

 path "transit/decrypt/autounseal" {
    capabilities = [ "update" ]
  }
EOF'
TRANSIT_TOKEN=$(kubectl exec -it vault-primary-0 -- vault token create -format=json -policy="autounseal" | jq -r .auth.client_token)

kubectl exec -it vault-primary-0 -- sh -c 'vault policy write dr-secondary-promotion - <<EOF
path "sys/replication/dr/secondary/promote" {
  capabilities = [ "update" ]
}

path "sys/replication/dr/secondary/update-primary" {
    capabilities = [ "update" ]
  }

path "sys/storage/raft/autopilot/state" {
    capabilities = [ "update" , "read" ]
  }

path "sys/storage/raft/configuration" {
    capabilities = [ "read" ]
  }
EOF'
kubectl exec -it vault-primary-0 -- vault write auth/token/roles/failover-handler \
    allowed_policies=dr-secondary-promotion \
    orphan=true \
    renewable=false \
    token_type=batch
DR_TOKEN=$(kubectl exec -it vault-primary-0 -- vault token create --format=json -role=failover-handler -ttl=8h | jq -r .auth.client_token)

kubectl exec -it vault-primary-0 -- vault write -f sys/replication/dr/primary/enable 

# Raft secondary cluster required for reproducing issue
msg info "Deploy secondary cluster "
helm install vault-secondary hashicorp/vault \
  --set=server.affinity='' \
  --set=server.ha.enabled=true \
  --set=server.ha.raft.enabled=true \
  --set=server.ha.raft.replicas=3 \
  --set=server.image.repository=hashicorp/vault-enterprise \
  --set=server.image.tag=1.14.0-ent \
  --set=server.enterpriseLicense.secretName=vault-license \
  --set=server.enterpriseLicense.secretKey=vault.hclic \
  --set=server.logLevel=trace \
  --set=injector.enabled=false \
  --set=global.tlsDisable=true \
  --set=server.extraEnvironmentVars.VAULT_TOKEN=$TRANSIT_TOKEN \
  --set-string='server.ha.raft.config=
ui = true

service_registration "kubernetes" {}

listener "tcp" {
  address = ":8200"
  cluster_address = ":8201"
  tls_disable = 1
  telemetry {
    unauthenticated_metrics_access = true
  }
}

telemetry {
  prometheus_retention_time = "24h"
  disable_hostname = true
}

seal "transit" {
  address = "http://vault-primary-0.vault-primary-internal:8200"
  key_name = "autounseal"
  mount_path = "transit"
}

storage "raft" {
  path = "/vault/data"
  retry_join {
    leader_api_addr = "http://vault-secondary-0.vault-secondary-internal:8200"
  }
  retry_join {
    leader_api_addr = "http://vault-secondary-1.vault-secondary-internal:8200"
  }
  retry_join {
    leader_api_addr = "http://vault-secondary-2.vault-secondary-internal:8200"
  }
}
' > /dev/null

msg info "Wait until cluster has started"
sleep 3
until [ $(sleep 2 ; kubectl get pod vault-secondary-0 -o json | jq .status.containerStatuses[].started) == "true" ] 2> /dev/null ; do 
  sleep 1
done
kubectl logs vault-secondary-0 -f > ./repro-79606-logs/vault-secondary-0.log & SECONDARY_0_LOG_PID=$!
until [ $(sleep 2 ; kubectl get pod vault-secondary-1 -o json | jq .status.containerStatuses[].started) == "true" ] 2> /dev/null ; do 
  sleep 1
done
kubectl logs vault-secondary-1 -f > ./repro-79606-logs/vault-secondary-1.log & SECONDARY_1_LOG_PID=$!
until [ $(sleep 2 ; kubectl get pod vault-secondary-2 -o json | jq .status.containerStatuses[].started) == "true" ] 2> /dev/null ; do 
  sleep 1
done
kubectl logs vault-secondary-2 -f > ./repro-79606-logs/vault-secondary-2.log & SECONDARY_2_LOG_PID=$!

msg info "Initialize secondary cluster and start replication"
until [ -n "$ROOT" ] ; do
  read -r UNSEAL ROOT < <(kubectl exec -it vault-secondary-0 -- vault operator init --format=json -recovery-shares=1 -recovery-threshold=1 | jq -r '.recovery_keys_b64[], .root_token' | xargs echo -n)
done

until [ $(sleep 2 ; kubectl exec -it vault-secondary-0 -- curl http://localhost:8200/v1/sys/health | jq -r .standby) == "false" ] 2> /dev/null ; do 
  sleep 1
done

# Setting Transit auto-unseal token as env var requires unsetting VAULT_TOKEN
# before being able to make authenticated requests from within the pod
kubectl exec -it vault-secondary-0 -- vault login $ROOT 
SECONDARY_TOKEN=$(kubectl exec -it vault-primary-0 -- vault write -f --format=json sys/replication/dr/primary/secondary-token id=dr | jq -r .wrap_info.token)
kubectl exec -it vault-secondary-0 -- sh -c "unset VAULT_TOKEN ; vault write -f sys/replication/dr/secondary/enable token=$SECONDARY_TOKEN"

msg info "Wait until cluster is ready"
until [ $(sleep 2 ; kubectl get pod vault-secondary-0 -o json | jq .status.containerStatuses[].ready) == "true" ] 2> /dev/null ; do 
  sleep 1
done
until [ $(sleep 2 ; kubectl get pod vault-secondary-1 -o json | jq .status.containerStatuses[].ready) == "true" ] 2> /dev/null ; do 
  sleep 1
done
until [ $(sleep 2 ; kubectl get pod vault-secondary-2 -o json | jq .status.containerStatuses[].ready) == "true" ] 2> /dev/null ; do 
  sleep 1
done

# Checkpoint, shows correctly labelled active node
msg info "Show leader"
date ; msg success "kubectl get pods -l vault-active=true"
kubectl get pods -l vault-active=true

msg info "Generate some WALs..."
for i in $(seq 1 $WAL_ITERATIONS) ; do 
  kubectl exec -it vault-primary-0 -- vault token create -policy=default > /dev/null
  kubectl exec -it vault-primary-0 -- vault kv put secret/$i foo=bar > /dev/null
  echo -n "." 
done
echo

# May require submitting more than one secondary token to reproduce issue
reproduce_issue() {
  msg info "Generate new secondary token, hit update-primary and reproduce issue"
  kubectl exec -it vault-primary-0 -- vault write -f --format=json sys/replication/dr/primary/revoke-secondary id=dr
  SECONDARY_TOKEN=$(kubectl exec -it vault-primary-0 -- vault write -f --format=json sys/replication/dr/primary/secondary-token id=dr | jq -r .wrap_info.token)
  kubectl exec -it vault-secondary-0 -- vault write -f sys/replication/dr/secondary/update-primary token=$SECONDARY_TOKEN dr_operation_token=$DR_TOKEN

  msg info "Show leader"
  sleep 20
  msg alert "kubectl get pods -l vault-active=true"
  date ; kubectl get pods -l vault-active=true 
}

while : ; do 
  reproduce_issue 
  read -p "Press enter key to attempt reproduction again, Ctrl+C to cleanup and exit: "
done

cleanup

@tomhjp requested review from swenson, scellef and ncabatoff on July 6, 2023
@VioletHynes added the hashicorp-contributed-pr label on Jul 7, 2023
tomhjp (Contributor, Author) commented Jul 13, 2023

I re-applied this patch on top of the 1.14.0+ent tag on the vault-enterprise repo, and anyone reviewing (with access to the enterprise repo) can save the repro script as an executable file repro.sh and run the following to test:

VAULT_LICENSE=...

kind create cluster
kubectl create secret generic vault-license --from-literal="vault.hclic=${VAULT_LICENSE}"
gh run download --repo hashicorp/vault-enterprise --name vault-enterprise_default_linux_arm64_1.14.0+ent_0e81b9fed2383bfdfd3a9b926893fb1f1c5470ca.docker.tar 5544311282
docker image load --input vault-enterprise_default_linux_arm64_1.14.0+ent_0e81b9fed2383bfdfd3a9b926893fb1f1c5470ca.docker.tar
kind load docker-image hashicorp/vault-enterprise:1.14.0-ent
./repro.sh

Without my fix, it consistently reproduces within 2 tries. With my fix, I've retried 20 times with no reproduction.

ncabatoff (Collaborator) commented:

before this fix if we keep the HA lock we'll never unlabel ourselves

I'm confused as to why this change is necessary. From what I can tell, we only keep the HA lock in two instances: when we restore a raft snapshot, and possibly when enabling a replication secondary. Ah, but update-primary is almost the same thing as enabling a replication secondary, so that makes sense then, given the repro.

My misgiving is that this seems inconsistent. If we're keeping the HA lock, why should we change the service registration state? For raft at least, holding the HA lock is synonymous with being the leader.

tomhjp (Contributor, Author) commented Jul 13, 2023

My misgiving is that this seems inconsistent. If we're keeping the HA lock, why should we change the service registration state? For raft at least, holding the HA lock is synonymous with being the leader.

This gets to the core of where I get a bit lost. On line 692 just above, we set c.standby = true regardless of whether we keep the HA lock on step down. And then shortly after exercising this code path, the node does indeed get replaced as leader by another node. So with my current understanding it seems inconsistent to me that we internally consider ourselves a standby (c.standby) while also holding the HA lock - the two seem like somewhat canonical but conflicting sources of information to me.

I guess the obvious other candidate fix would be to unlabel ourselves when we relinquish the HA lock instead of when we set c.standby = true. Would that be more correct if it's possible to do cleanly?

tomhjp (Contributor, Author) commented Jul 13, 2023

Based on cleanLeaderPrefix, it looks like the next leader is the one that deletes the HA lock from storage, and relatedly advertiseLeader (which calls cleanLeaderPrefix) is the function that labels the new leader as active.
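To illustrate, here's a rough Go sketch of the new leader's side of the handoff (becomeLeader and logSR are made-up names, not Vault's actual advertiseLeader code): the node that wins the HA lock cleans up the old leader entry and labels itself active, but nothing in that path removes the previous pod's label, which is why the outgoing node has to unlabel itself on step-down.

// Rough sketch of the new leader's side of the handoff; illustrative only.
package main

import "fmt"

// registration is a cut-down stand-in for the service registration backend.
type registration interface {
	NotifyActiveStateChange(isActive bool) error
}

type node struct {
	sr registration
}

// becomeLeader mimics an advertiseLeader-style flow: after winning the HA
// lock, the node writes its leader entry, cleans up stale leader entries left
// by the previous active node, and labels itself active. The previous pod's
// vault-active label is never touched here.
func (n *node) becomeLeader() error {
	// ... acquire HA lock, write leader UUID, clean stale leader prefix ...
	return n.sr.NotifyActiveStateChange(true)
}

// logSR just prints the label change it would apply to its pod.
type logSR struct{ pod string }

func (l logSR) NotifyActiveStateChange(active bool) error {
	fmt.Printf("pod %s: vault-active=%t\n", l.pod, active)
	return nil
}

func main() {
	newLeader := &node{sr: logSR{pod: "vault-secondary-1"}}
	_ = newLeader.becomeLeader() // labels the new leader active; the old leader must unlabel itself
}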

ncabatoff (Collaborator) commented:

My misgiving is that this seems inconsistent. If we're keeping the HA lock, why should we change the service registration state? For raft at least, holding the HA lock is synonymous with being the leader.

This gets to the core of where I get a bit lost. On line 692 just above, we set c.standby = true regardless of whether we keep the HA lock on step down. And then shortly after exercising this code path, the node does indeed get replaced as leader by another node. So with my current understanding it seems inconsistent to me that we internally consider ourselves a standby (c.standby) while also holding the HA lock - the two seem like somewhat canonical but conflicting sources of information to me.

I guess the obvious other candidate fix would be to unlabel ourselves when we relinquish the HA lock instead of when we set c.standby = true. Would that be more correct if it's possible to do cleanly?

There are two levels in Vault that track active status: the physical HA layer and the Core layer. When we call sealInternalWithOptions with keepHALock=true, the idea is that we need to rebuild a Core because something fundamental has changed (a snapshot has been restored, secondary mode has been enabled) and there's no other mechanism to flush our state. While it's true that we won't be an active node - indeed, we'll be sealed for a bit - we're expecting that once we get unsealed we'll remain the active node, as before. And we know that no one else will become the active node until we resume, assuming nothing goes wrong.

Ok, I read the Jira ticket and now I have a better idea of the motivation. I'm not opposed to this change, and I think it's safe. A part of me wants to push instead for getting rid of keepHALock, since I'm not sure how necessary it is, and it adds complexity. But that's a riskier change, so maybe let's go with what you have for now.

@tomhjp added this to the 1.15 milestone on Jul 14, 2023
tomhjp (Contributor, Author) commented Jul 17, 2023

Thanks!

Labels: core/service-discovery, hashicorp-contributed-pr