`nomad logs` no longer works without `NOMAD_REGION` set #24635

HerbCSO · 2024-12-09T16:51:58Z

Nomad version

Nomad v1.9.3
BuildDate 2024-11-11T16:35:41Z
Revision d92bf10

Operating system and Environment details

MacOS CLI, Linux server

Issue

Up to and including Nomad CLI 1.7.7 I was able to run nomad alloc logs deadbeef without NOMAD_REGION being set. Since 1.8.0 NOMAD_REGION is required, else I just get a Unexpected response code: 500 (No path to region). So I have to either set NOMAD_REGION or run nomad alloc logs -region myregion deadbeef.

From the docs it reads to me like this is unintentional behavior. It says Defaults to the Agent's local region. so I would assume I wouldn't need to specify this parameter. I didn't find anything in the release notes that specifies a necessary config change for this, so please let me know if I missed something there.

Also, this sounds a bit like the issue described in hashicorp/nomad-pack#374, but I'm not sure if this is really the same thing.

FWIW our Nomad servers and clients are a bit older, running 1.6.3 - maybe that's part of the issue?

Reproduction steps

Run nomad alloc logs using a version prior to 1.8.0 - works
Run nomad alloc logs using a version prior >= 1.8.0 - produces 500 error

Expected Result

No 500 error

Actual Result

500 error

Job file (if appropriate)

N/A

Nomad Server logs (if appropriate)

N/A

Nomad Client logs (if appropriate)

N/A

The text was updated successfully, but these errors were encountered:

tgross · 2024-12-09T18:45:26Z

Hi @HerbCSO! It looks like you're running into the same issue described here: #24609 (comment). The alloc logs RPC uses the exact same paths as the alloc exec command described there. I'd ask if you can answer some of the same questions I've asked there.

For what it's worth, I haven't been able to reproduce this behavior yet.

FWIW our Nomad servers and clients are a bit older, running 1.6.3 - maybe that's part of the issue?

This is an interesting clue! The CLI is 1.8.0+ and the agents are older?

HerbCSO · 2024-12-09T21:31:25Z

I'd ask if you can answer some of the same questions I've asked there.

I don't have full access to the cluster (specifically no API access), however nomad server members reports all 5 servers as alive, running version 1.6.3, one is the leader, all running raft version 3. Also nomad node status shows all nodes as eligible and ready status.

FWIW our Nomad servers and clients are a bit older, running 1.6.3 - maybe that's part of the issue?

This is an interesting clue! The CLI is 1.8.0+ and the agents are older?

Correct. And using CLI versions 1.6.2 or 1.7.7 (basically any version < 1.8.0) from the same machine I was able to pull logs without specifying the region.

tgross · 2024-12-10T19:29:22Z

Thanks @HerbCSO. I was able to reproduce and the reproducer was that the requests were going to a non-leader. The HTTP API request arrives at the server with the region set to global, so the logic we have to use the agent's region is not kicking in. I've used git bisect from 1.7.6 to main and determined that this b7419bc is the culprit. But I haven't figured out how that's breaking things yet.

(Edit: weirdly I think this may have revealed an existing bug, rather than being the bug itself. https://github.com/hashicorp/nomad/blob/v1.9.3/api/api.go#L612 doesn't match the behavior we have anywhere else.)

(Edit2: yeah this bug dates back to 4d3b75d. When making direct-to-client requests as in nomad alloc logs, we need to set the region for the mTLS cert but that's not always going to be global. We should be figuring out the real region in the command and then setting it in the API client we configure, which I suspect was happening accidentally before b7419bc so long as another request happened first.)

In #16872 we added support for unix domain sockets, but this required mutating the `Config` when parsing the address. In #23785 we fixed a bug where if the configuration was used across multiple clients that mutation would happen multiple times and the address would be incorrectly parsed. When making `alloc log` or `alloc exec` calls to a region where the region is not "global", we create a new client from the same configuration and then set the address. But in this case we copy the private `url` field and that causes the URL parsing to be skipped for the new client. This results in the region always being set to the string literal `global` (because of mTLS handling code introduced all the way back in 4d3b75d), which fails with an error "no path to region" when the cluster isn't non-global and requests are sent to a non-leader. The "right" way of fixing this would be for `ClientConfig` not to change the region to global in the first place, but as this is a public API and extremely longstanding behavior, it could potentially be a breaking change for some downstream consumers. Instead, we'll avoid copying the private `url` field so that the new address is re-parsed. Fixes: #24635 Fixes: #24609 Ref: #16872 Ref: #23785 Ref: 4d3b75d

In #16872 we added support for unix domain sockets, but this required mutating the `Config` when parsing the address so as to remove the port number. In #23785 we fixed a bug where if the configuration was used across multiple clients that mutation would happen multiple times and the address would be incorrectly parsed. When making `alloc log`, `alloc fs`, or `alloc exec` calls where we have line-of-sight to the client, we attempt to make a HTTP API call directly to the client node. So we create a new API client from the same configuration and then set the address. But in this case we copy the private `url` field and that causes the URL parsing to be skipped for the new client. This results in the region always being set to the string literal `"global"` (because of mTLS handling code introduced all the way back in 4d3b75d), unless the user has set the region specifically. This fails with an error "no path to region" when the cluster isn't non-global and requests are sent to a non-leader. Arguably the "right" way of fixing this would be for `ClientConfig` not to change the API client's region to `"global"` in the first place, but as this is a public API and extremely longstanding behavior, it could potentially be a breaking change for some downstream consumers. Instead, we'll avoid copying the private `url` field so that the new address is re-parsed. Fixes: #24635 Fixes: #24609 Ref: #16872 Ref: #23785 Ref: 4d3b75d

tgross · 2024-12-10T21:41:26Z

Fix up here for review: #24644

In #16872 we added support for unix domain sockets, but this required mutating the `Config` when parsing the address so as to remove the port number. In #23785 we fixed a bug where if the configuration was used across multiple clients that mutation would happen multiple times and the address would be incorrectly parsed. When making `alloc log`, `alloc fs`, or `alloc exec` calls where we have line-of-sight to the client, we attempt to make a HTTP API call directly to the client node. So we create a new API client from the same configuration and then set the address. But in this case we copy the private `url` field and that causes the URL parsing to be skipped for the new client. This results in the region always being set to the string literal `"global"` (because of mTLS handling code introduced all the way back in 4d3b75d), unless the user has set the region specifically. This fails with an error "no path to region" when the cluster isn't non-global and requests are sent to a non-leader. Arguably the "right" way of fixing this would be for `ClientConfig` not to change the API client's region to `"global"` in the first place, but as this is a public API and extremely longstanding behavior, it could potentially be a breaking change for some downstream consumers. Instead, we'll avoid copying the private `url` field so that the new address is re-parsed. Fixes: #24635 Fixes: #24609 Ref: #16872 Ref: #23785 Ref: 4d3b75d

…ddress (#24644) (#24682) In #16872 we added support for unix domain sockets, but this required mutating the `Config` when parsing the address so as to remove the port number. In #23785 we fixed a bug where if the configuration was used across multiple clients that mutation would happen multiple times and the address would be incorrectly parsed. When making `alloc log`, `alloc fs`, or `alloc exec` calls where we have line-of-sight to the client, we attempt to make a HTTP API call directly to the client node. So we create a new API client from the same configuration and then set the address. But in this case we copy the private `url` field and that causes the URL parsing to be skipped for the new client. This results in the region always being set to the string literal `"global"` (because of mTLS handling code introduced all the way back in 4d3b75d), unless the user has set the region specifically. This fails with an error "no path to region" when the cluster isn't non-global and requests are sent to a non-leader. Arguably the "right" way of fixing this would be for `ClientConfig` not to change the API client's region to `"global"` in the first place, but as this is a public API and extremely longstanding behavior, it could potentially be a breaking change for some downstream consumers. Instead, we'll avoid copying the private `url` field so that the new address is re-parsed. Fixes: #24635 Fixes: #24609 Ref: #16872 Ref: #23785 Ref: 4d3b75d Co-authored-by: Tim Gross <[email protected]>

HerbCSO added the type/bug label Dec 9, 2024

tgross added this to Nomad - Community Issues Triage Dec 9, 2024

github-project-automation bot moved this to Needs Triage in Nomad - Community Issues Triage Dec 9, 2024

tgross added stage/waiting-reply theme/allocation API labels Dec 9, 2024

tgross moved this from Needs Triage to Triaging in Nomad - Community Issues Triage Dec 9, 2024

tgross self-assigned this Dec 9, 2024

tgross removed the stage/waiting-reply label Dec 10, 2024

tgross mentioned this issue Dec 10, 2024

Nomad 1.8.4 failed to exec into task: No path to region #24609

Closed

tgross mentioned this issue Dec 10, 2024

Allocation API: fix "no path to region" errors for non-global regions #24644

Merged

tgross moved this from Triaging to In Progress in Nomad - Community Issues Triage Dec 10, 2024

tgross added this to the 1.9.x milestone Dec 10, 2024

tgross added the hcc/jira label Dec 11, 2024

tgross closed this as completed in #24644 Dec 16, 2024

github-project-automation bot moved this from In Progress to Done in Nomad - Community Issues Triage Dec 16, 2024

hc-github-team-nomad-core mentioned this issue Dec 16, 2024

Backport of Allocation API: fix "no path to region" errors for non-global regions into release/1.9.x #24682

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`nomad logs` no longer works without `NOMAD_REGION` set #24635

`nomad logs` no longer works without `NOMAD_REGION` set #24635

HerbCSO commented Dec 9, 2024

tgross commented Dec 9, 2024 •

edited

Loading

HerbCSO commented Dec 9, 2024 •

edited

Loading

tgross commented Dec 10, 2024 •

edited

Loading

tgross commented Dec 10, 2024

nomad logs no longer works without NOMAD_REGION set #24635

nomad logs no longer works without NOMAD_REGION set #24635

Comments

HerbCSO commented Dec 9, 2024

Nomad version

Operating system and Environment details

Issue

Reproduction steps

Expected Result

Actual Result

Job file (if appropriate)

Nomad Server logs (if appropriate)

Nomad Client logs (if appropriate)

tgross commented Dec 9, 2024 • edited Loading

HerbCSO commented Dec 9, 2024 • edited Loading

tgross commented Dec 10, 2024 • edited Loading

tgross commented Dec 10, 2024

`nomad logs` no longer works without `NOMAD_REGION` set #24635

`nomad logs` no longer works without `NOMAD_REGION` set #24635

tgross commented Dec 9, 2024 •

edited

Loading

HerbCSO commented Dec 9, 2024 •

edited

Loading

tgross commented Dec 10, 2024 •

edited

Loading