Nomad 1.8.4 `failed to exec into task: No path to region` #24609

mcphailtom · 2024-12-05T11:20:17Z

Nomad version

Nomad v1.8.4
BuildDate 2024-09-17T20:18:34Z
Revision 22ab32e

Operating system and Environment details

Linux 20.04.1-Ubuntu

Issue

As part of regular releases we are upgrading our shipped versions of Nomad. In our latest release candidates we have attempted a two-step move from 1.5.17 -> 1.7.7 -> 1.8.4.

During testing we have performed this upgrade on a sample set of clusters, and in all cases the upgrade completes successfully. We have also attempted the direct 1.5.17 -> 1.8.4 upgrade without issue. Jobs are running as expected.

Several tests in our test suite use the `nomad alloc exec` command for test execution, and we are seeing intermittent test failures with the error message: `failed to exec into task: No path to region`

After some initial investigation it appears the command fails on all nodes of the cluster except the node where the allocation is actually running.

This behavior is not consistent with previous versions of Nomad.

Reproduction steps

Using a sample job:

admin@app1:~$ nomad job status mysql
ID            = mysql
Name          = mysql
Submit Date   = 2024-12-05T01:13:35Z
Type          = system
Priority      = 50
Datacenters   = primary
Namespace     = default
Node Pool     = default
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost  Unknown
mysql       0       0         3        0       0         0     0

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created    Modified
37e41f01  520855e6  mysql       0        run      running  10h7m ago  10h7m ago
69195c2c  8c55b7e6  mysql       0        run      running  10h7m ago  23m38s ago
8cf697a1  9da121d6  mysql       0        run      running  10h7m ago  10h7m ago
admin@app1:~$ nomad node status
ID        Node Pool  DC       Name   Class   Drain  Eligibility  Status
8c55b7e6  default    primary  data1  <none>  false  eligible     ready
0abc5c85  default    primary  app1   <none>  false  eligible     ready
520855e6  default    primary  data2  <none>  false  eligible     ready
9da121d6  default    primary  data3  <none>  false  eligible     ready
8f82e4c6  default    primary  app2   <none>  false  eligible     ready
admin@app1:~$

Running an alloc exec on the app1 node where no allocation is running produces the following result:

admin@app1:~$ nomad alloc exec -job mysql /bin/bash
failed to exec into task: No path to region
admin@app1:~$

Running it again on the data3 node where an allocation is running will function as expected:

admin@data3:~$ nomad alloc exec -job mysql /bin/bash
root@data3:/#

Running from another node with an explicit -region set to the default region functions as expected.

admin@app1:~$ nomad alloc exec -region primary -job mysql /bin/bash
root@data3:/#

Expected Result

The command should exec from any node/agent in the cluster in the default region (since agents default to the region they are connected to when no `-region` flag is given).

Actual Result

An error is produced unless running on the node on which the allocation exists.



mcphailtom · 2024-12-05T11:28:45Z

cc @half-ogre @jamesooo @andrba @another-mattr

tgross · 2024-12-06T19:54:45Z

Hi @mcphailtom! The error you're seeing (ErrNoRegionPath) comes from only a few places in the code. Each Nomad client agent is connected to exactly one Nomad server at a time (rotating between servers as needed to rebalance and handle failures). When you call `nomad alloc exec`, the HTTP API call gets transformed into an RPC call. The RPC call is forwarded from that agent to the server the agent is connected to. That server forwards the RPC to the leader (if it's not the leader itself). The leader looks up which server the destination allocation's client is connected to, and forwards the RPC to that server. Finally, the RPC is forwarded from that server to the client where the allocation is running. This is why you can make a request to any node anywhere and it makes its way to the correct place.

The ErrNoRegionPath error appears in only 3 places, and in all 3 locations we're looking up a server from the list of Raft peers.
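As a rough illustration of the kind of lookup described here, the following is a minimal sketch and not Nomad's actual code. The `pickServerForRegion` helper, the shape of the peer table, and the addresses are all hypothetical; only the error text is taken from this issue. It shows how a request for a region with no known servers would end in "No path to region".

```go
package main

import (
	"errors"
	"fmt"
)

// ErrNoRegionPath mirrors the error text reported in this issue.
var ErrNoRegionPath = errors.New("No path to region")

// pickServerForRegion is a hypothetical helper: it returns the first known
// server for the requested region, or ErrNoRegionPath when the peer table
// has no servers for that region.
func pickServerForRegion(peers map[string][]string, region string) (string, error) {
	servers := peers[region]
	if len(servers) == 0 {
		return "", ErrNoRegionPath
	}
	return servers[0], nil
}

func main() {
	// Hypothetical peer table for a cluster whose only region is "primary".
	peers := map[string][]string{
		"primary": {"10.0.0.1:4647", "10.0.0.2:4647"},
	}

	for _, region := range []string{"primary", "emea"} {
		addr, err := pickServerForRegion(peers, region)
		if err != nil {
			fmt.Printf("region %q: %v\n", region, err)
			continue
		}
		fmt.Printf("region %q: forward to %s\n", region, addr)
	}
}
```

In this sketch the error depends only on whether the requested region name has any servers in the table, so either a missing peer or an unexpected region name would produce the same message.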

That you're getting an error here suggests that there's a problem in your cluster topology. One or more servers are not showing up in the list of Raft peers. I suspect you've either got a split brain in the cluster during your upgrade or you've got one peer that's not joining (maybe it's got a configuration or networking issue).

To help diagnose:

- The error should be showing up in the logs of the host that's hitting it. Is that error happening on a client or a server?
- Use `nomad server members` to verify the health of servers.
- The Autopilot Read Health API will tell you which servers the cluster thinks it knows about.
- The Read Raft Configuration API should match that.
- The List Members API will give you the list of servers in the gossip pool (this is a lot of the same information as in `nomad server members`, but in more detail).
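The three HTTP API checks above could be scripted along these lines. This is a sketch under assumptions: it expects the agent address in `NOMAD_ADDR` (falling back to `http://127.0.0.1:4646`), sends `NOMAD_TOKEN` in the `X-Nomad-Token` header if set, and does not configure mTLS. The `diagnosticURLs` helper is a name made up for this example.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
	"strings"
	"time"
)

// diagnosticPaths are the HTTP API paths for the Autopilot health, Raft
// configuration, and gossip member endpoints, in that order.
var diagnosticPaths = []string{
	"/v1/operator/autopilot/health",
	"/v1/operator/raft/configuration",
	"/v1/agent/members",
}

// diagnosticURLs (hypothetical helper) joins the agent address with each path.
func diagnosticURLs(addr string) []string {
	base := strings.TrimRight(addr, "/")
	urls := make([]string, 0, len(diagnosticPaths))
	for _, p := range diagnosticPaths {
		urls = append(urls, base+p)
	}
	return urls
}

func main() {
	addr := os.Getenv("NOMAD_ADDR")
	if addr == "" {
		addr = "http://127.0.0.1:4646" // assumed local agent address
	}
	token := os.Getenv("NOMAD_TOKEN")
	client := &http.Client{Timeout: 5 * time.Second}

	for _, u := range diagnosticURLs(addr) {
		req, err := http.NewRequest(http.MethodGet, u, nil)
		if err != nil {
			fmt.Printf("%s: bad request: %v\n", u, err)
			continue
		}
		if token != "" {
			req.Header.Set("X-Nomad-Token", token)
		}
		resp, err := client.Do(req)
		if err != nil {
			fmt.Printf("%s: request failed: %v\n", u, err)
			continue
		}
		body, _ := io.ReadAll(resp.Body)
		resp.Body.Close()
		// Print the raw JSON so the server lists can be compared by eye.
		fmt.Printf("%s -> %s\n%s\n\n", u, resp.Status, body)
	}
}
```

Running it against each server in turn and comparing the server lists in the three responses should show whether any peer is missing from one view of the cluster.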

Also, small detail:

> Running from another node with an explicit -region set to the default region functions as expected.

Your example shows you're passing the same name as the DC. Datacenter and region aren't the same thing. Maybe you have the same name for them both, but just wanted to make sure 😁

tgross · 2024-12-10T19:29:50Z

I suspect #24635 is related and I have a reproduction for that and a culprit commit. Working on figuring out the underlying problem now.

In #16872 we added support for Unix domain sockets, but this required mutating the `Config` when parsing the address so as to remove the port number. In #23785 we fixed a bug where, if the configuration was used across multiple clients, that mutation would happen multiple times and the address would be incorrectly parsed.

When making `alloc log`, `alloc fs`, or `alloc exec` calls where we have line-of-sight to the client, we attempt to make an HTTP API call directly to the client node. So we create a new API client from the same configuration and then set the address. But in this case we copy the private `url` field, and that causes the URL parsing to be skipped for the new client.

This results in the region always being set to the string literal `"global"` (because of mTLS handling code introduced all the way back in 4d3b75d), unless the user has set the region specifically. This fails with the error "no path to region" when the cluster's region is not `global` and requests are sent to a non-leader.

Arguably the "right" way of fixing this would be for `ClientConfig` not to change the API client's region to `"global"` in the first place, but as this is a public API and extremely longstanding behavior, it could potentially be a breaking change for some downstream consumers. Instead, we'll avoid copying the private `url` field so that the new address is re-parsed.

Fixes: #24635
Fixes: #24609
Ref: #16872
Ref: #23785
Ref: 4d3b75d
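The copy-then-skip-parsing pattern described in this commit message can be sketched in isolation. The types and functions below (`Config`, `Client`, `NewClient`, `nodeClientBuggy`, `nodeClientFixed`) are simplified, hypothetical stand-ins rather than the real Nomad `api` package, and the sketch models only the skipped re-parse of the address, not the region or mTLS handling.

```go
package main

import (
	"fmt"
	"net/url"
)

// Config is a hypothetical, stripped-down API client configuration.
type Config struct {
	Address string
	url     *url.URL // private cache of the parsed Address
}

// Client is a hypothetical API client holding its own copy of the config.
type Client struct {
	config Config
}

// NewClient parses Address only when no cached url is present, mimicking a
// "parse once" guard.
func NewClient(cfg Config) (*Client, error) {
	if cfg.url == nil {
		u, err := url.Parse(cfg.Address)
		if err != nil {
			return nil, err
		}
		cfg.url = u
	}
	return &Client{config: cfg}, nil
}

// Host reports the host the client would actually talk to.
func (c *Client) Host() string { return c.config.url.Host }

// nodeClientBuggy copies the whole config, including the cached url, so the
// new Address is never parsed.
func nodeClientBuggy(c *Client, nodeAddr string) (*Client, error) {
	cfg := c.config
	cfg.Address = nodeAddr
	return NewClient(cfg)
}

// nodeClientFixed clears the cached url so the new Address is re-parsed.
func nodeClientFixed(c *Client, nodeAddr string) (*Client, error) {
	cfg := c.config
	cfg.Address = nodeAddr
	cfg.url = nil
	return NewClient(cfg)
}

func main() {
	server, err := NewClient(Config{Address: "http://10.0.0.1:4646"})
	if err != nil {
		panic(err)
	}
	buggy, _ := nodeClientBuggy(server, "http://10.0.0.9:4646")
	fixed, _ := nodeClientFixed(server, "http://10.0.0.9:4646")

	fmt.Println("buggy client host:", buggy.Host())
	fmt.Println("fixed client host:", fixed.Host())
}
```

In the sketch, the buggy copy keeps pointing at the original host because the guard sees a cached `url`, while the fixed copy picks up the new address. Any setup that depends on the parse step running would be skipped in the same way.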

tgross · 2024-12-10T21:41:49Z

Fix up here for review: #24644


mcphailtom added the type/bug label Dec 5, 2024

mcphailtom changed the title ~~Nomad 1.8.4 alloc exec no path to regain~~ Nomad 1.8.4 alloc exec no path to region Dec 5, 2024

mcphailtom changed the title ~~Nomad 1.8.4 alloc exec no path to region~~ Nomad 1.8.4 alloc exec Error: no path to region Dec 5, 2024

mcphailtom changed the title ~~Nomad 1.8.4 alloc exec Error: no path to region~~ Nomad 1.8.4 failed to exec into task: No path to region Dec 5, 2024

tgross added this to Nomad - Community Issues Triage Dec 5, 2024

github-project-automation bot moved this to Needs Triage in Nomad - Community Issues Triage Dec 5, 2024

tgross moved this from Needs Triage to Triaging in Nomad - Community Issues Triage Dec 6, 2024

tgross self-assigned this Dec 6, 2024

tgross added stage/waiting-reply theme/raft labels Dec 6, 2024

tgross mentioned this issue Dec 9, 2024

nomad logs no longer works without NOMAD_REGION set #24635

Closed

tgross added the theme/allocation API label Dec 9, 2024

tgross mentioned this issue Dec 10, 2024

Allocation API: fix "no path to region" errors for non-global regions #24644

Merged

tgross moved this from Triaging to In Progress in Nomad - Community Issues Triage Dec 10, 2024

tgross added this to the 1.9.x milestone Dec 10, 2024

tgross removed the stage/waiting-reply label Dec 11, 2024

tgross closed this as completed in #24644 Dec 16, 2024

tgross closed this as completed in 75b0202 Dec 16, 2024

github-project-automation bot moved this from In Progress to Done in Nomad - Community Issues Triage Dec 16, 2024

hc-github-team-nomad-core mentioned this issue Dec 16, 2024

Backport of Allocation API: fix "no path to region" errors for non-global regions into release/1.9.x #24682

Merged
