Improvements to Template + Vault during Nomad Client restarts #13313

Open
chuckyz opened this issue Jun 9, 2022 · 4 comments
Labels
stage/accepted Confirmed, and intend to work on. No timeline commitment though. theme/template type/enhancement

Comments

@chuckyz
Contributor

chuckyz commented Jun 9, 2022

First off, thank you so much for the template improvements in 1.2.4!!

We’ve implemented these in our testing environment and I’d like to make a further improvement proposal. Today, when our config management runs (Chef) we just hard restart Nomad after each run. This has served us pretty well to this point but unfortunately it’s pointed out a flaw within Nomad’s template system; especially when combined with these improvements.

I recently simulated a Vault failure (overriding DNS by pointing /etc/resolv.conf at 127.0.0.1), and everything behaved exactly as expected until the client daemon restarted.

Upon restarting the client, the following messages started appearing:

    Jun 02 11:59:59 foo-host nomad[3704847]:     2022-06-02T11:59:59.070-0500 [ERROR] client.vault: error during renewal of lease or token failed due to a non-fatal error; retrying: error="failed to renew the vault token: Put \"https://foo-vault:8200/v1/auth/token/renew-self\": dial tcp 127.0.0.1:8200: connect: connection refused" period="2022-06-02 12:00:14.070605338 -0500 CDT m=+142.893296741"
    Jun 02 11:59:59 foo-host nomad[3704847]: client.vault: error during renewal of lease or token failed due to a non-fatal error; retrying: error="failed to renew the vault token: Put \"https://foo-vault:8200/v1/auth/token/renew-self\": dial tcp 127.0.0.1:8200: connect: connection refused" period="2022-06-02 12:00:14.070605338 -0500 CDT m=+142.893296741"
    Jun 02 12:00:02 foo-host nomad[3704847]:     2022-06-02T12:00:02.482-0500 [ERROR] client.vault: error during renewal of lease or token failed due to a non-fatal error; retrying: error="failed to renew the vault token: Put \"https://foo-vault:8200/v1/auth/token/renew-self\": dial tcp 127.0.0.1:8200: connect: connection refused" period="2022-06-02 12:00:17.482024359 -0500 CDT m=+146.304715757"
    Jun 02 12:00:02 foo-host nomad[3704847]: client.vault: error during renewal of lease or token failed due to a non-fatal error; retrying: error="failed to renew the vault token: Put \"https://foo-vault:8200/v1/auth/token/renew-self\": dial tcp 127.0.0.1:8200: connect: connection refused" period="2022-06-02 12:00:17.482024359 -0500 CDT m=+146.304715757"
    Jun 02 12:00:17 foo-host nomad[3704847]:     2022-06-02T12:00:17.529-0500 [ERROR] client.vault: error during renewal of lease or token failed due to a non-fatal error; retrying: error="failed to renew the vault token: Put \"https://foo-vault:8200/v1/auth/token/renew-self\": dial tcp 127.0.0.1:8200: connect: connection refused" period="2022-06-02 12:00:32.529657848 -0500 CDT m=+161.352349245"
    Jun 02 12:00:17 foo-host nomad[3704847]: client.vault: error during renewal of lease or token failed due to a non-fatal error; retrying: error="failed to renew the vault token: Put \"https://foo-vault:8200/v1/auth/token/renew-self\": dial tcp 127.0.0.1:8200: connect: connection refused" period="2022-06-02 12:00:32.529657848 -0500 CDT m=+161.352349245"
    Jun 02 12:00:20 foo-host nomad[3704847]:     2022-06-02T12:00:20.937-0500 [ERROR] client.vault: error during renewal of lease or token failed due to a non-fatal error; retrying: error="failed to renew the vault token: Put \"https://foo-vault:8200/v1/auth/token/renew-self\": dial tcp 127.0.0.1:8200: connect: connection refused" period="2022-06-02 12:00:35.937767731 -0500 CDT m=+164.760459129"
    Jun 02 12:00:20 foo-host nomad[3704847]: client.vault: error during renewal of lease or token failed due to a non-fatal error; retrying: error="failed to renew the vault token: Put \"https://foo-vault:8200/v1/auth/token/renew-self\": dial tcp 127.0.0.1:8200: connect: connection refused" period="2022-06-02 12:00:35.937767731 -0500 CDT m=+164.760459129"

You can see from the times here that vault_retry seems to be ignored. I believe this is acceptable/desirable, as this is a renewal of a lease that happens outside of the template section and purely within the Vault integration.

One thing this did was not cause the allocation to fail, but rather put it into a state I can't really explain. It was running: the container was there, happily working and serving traffic. However, from the control plane it was completely broken. The CPU stats were unreported, and it was as if the allocation existed but was 'detached', for lack of a better term.

When restarting Nomad a second time with the allocation in this state, Nomad marked it as failed and removed it from the node. I don't think this is wrong behavior, but it is undesirable for our use-cases.

Proposal

This is leading to the following asks:

  • Can we track Vault token state across daemon restarts?
  • Can we track Template state across daemon restarts, including current retry times?

Note: one extremely explicit call-out here is that I do not expect things to live through host restarts or things like Docker restarts. If a host restarts or all containers stop, then all bets are off.

Use-cases

The purpose of those asks is to allow allocations to ‘survive’ through upstream problems and Nomad daemon restarts.

Attempted Solutions

Modifying all the *_retry settings.
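
For context, this is roughly the shape of the client configuration we've been adjusting; a minimal sketch assuming the 1.2.4-era client.template retry blocks, with illustrative values rather than our production ones:

    client {
      template {
        # Retries for Vault lookups made by the template runner.
        vault_retry {
          attempts    = 12      # illustrative; 0 means retry indefinitely
          backoff     = "250ms"
          max_backoff = "1m"
        }

        # Retries for Consul lookups made by the template runner.
        consul_retry {
          attempts    = 12
          backoff     = "250ms"
          max_backoff = "1m"
        }
      }
    }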

@tgross
Member

tgross commented Jun 10, 2022

Hi @chuckyz!

One thing this did was not cause the allocation to fail, but rather put it into a state I can't really explain. It was running: the container was there, happily working and serving traffic. However, from the control plane it was completely broken. The CPU stats were unreported, and it was as if the allocation existed but was 'detached', for lack of a better term.

I suspect that what we're seeing here is the task failing to restore: the client's task runner hasn't successfully reattached to the task. Did this state continue after Vault connectivity was restored?

As for the template runner persisting state, this is all great and aligned with some ideas we've been discussing.

The tricky thing with templating is that the template runner runs in-process with the Nomad client, but we're currently using consul-template as though it were a library. This was expedient to implement because we get all the CT features "for free", but it's architecturally challenging to avoid security issues (ex. #9129) and problems around restarting clients like you've described here (ex. #9636). So we're planning on moving the template rendering (and artifact fetching) out into its own containerized process. See #12301. This would let us entirely avoid worrying about templates when the client restarts.

@tgross tgross added hcc/cst Admin - internal stage/accepted Confirmed, and intend to work on. No timeline commitment though. labels Jun 10, 2022
@chuckyz
Contributor Author

chuckyz commented Jun 14, 2022

Did this state continue after Vault connectivity was restored?

Let me re-test. I think this might be the core of that particular angle.

Looking at #12301, this would run consul-template and go-getter in the same way as Envoy, where it's run inside of an allocation in a bridge-style mode, yes?

Thinking about that, I think we'd still have the issue of a valid Vault token but the Vault fingerprint failing, and thus I'd really like some kind of knob exposed that says 'I don't care that Vault is down and the fingerprint is failing, just keep retrying forever but leave the alloc in the state it's in now.' Ideally, if vault_retry has attempts=0 this could be short-circuited to that.

@tgross
Member

tgross commented Jun 15, 2022

Looking at #12301, this would run consul-template and go-getter in the same way as Envoy, where it's run inside of an allocation in a bridge-style mode, yes?

Yes, although probably not in the same network namespace as the rest of the allocation. The nitty-gritty details still need to be worked out.

Thinking about that, I think we'd still have the issue of a valid Vault token but the Vault fingerprint failing, and thus I'd really like some kind of knob exposed that says 'I don't care that Vault is down and the fingerprint is failing, just keep retrying forever but leave the alloc in the state it's in now.' Ideally, if vault_retry has attempts=0 this could be short-circuited to that.

vault_retry with attempts=0 already tries an unlimited number of times. Where the containerization would help is that, in order to run in its own process, the template/artifact container would need to have its own Vault/Consul API client. That API client can continue to run, retrying unlimited times, and be unaffected when the client agent restarts.
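
For reference, a minimal sketch of that unlimited-retry configuration, assuming the same client.template block discussed above (attempts is the only value that matters here):

    client {
      template {
        vault_retry {
          # 0 means the template runner retries Vault failures indefinitely.
          attempts = 0
        }
      }
    }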

@chuckyz
Contributor Author

chuckyz commented Jun 22, 2022

That API client can continue to run, retrying unlimited times, and be unaffected when the client agent restarts.

perfect!

Projects
Status: Needs Roadmapping

4 participants