Nomad 1.8.4 failed to exec into task: No path to region #24609

Closed
mcphailtom opened this issue Dec 5, 2024 · 4 comments · Fixed by #24644

Comments

@mcphailtom

mcphailtom commented Dec 5, 2024

Nomad version

Nomad v1.8.4
BuildDate 2024-09-17T20:18:34Z
Revision 22ab32e

Operating system and Environment details

Linux 20.04.1-Ubuntu

Issue

As part of regular releases we are upgrading our shipped versions of Nomad. In our latest release candidates we have attempted a two-step move from 1.5.17 -> 1.7.7 -> 1.8.4.

During testing we have performed this upgrade on a sample set of clusters and in all cases the upgrade completes successfully. We have also attempted the 1.5.17 -> 1.8.4 upgrade without issue. Jobs are running as expected.

Several tests in our test suite use the nomad alloc exec command for test execution and we are seeing intermittent failures of the tests with the error message: failed to exec into task: No path to region

After some initial investigation it appears the command fails on all nodes of the cluster except the node where the allocation is actually running.

This behavior is not consistent with previous versions of Nomad.

Reproduction steps

Using a sample job:

admin@app1:~$ nomad job status mysql
ID            = mysql
Name          = mysql
Submit Date   = 2024-12-05T01:13:35Z
Type          = system
Priority      = 50
Datacenters   = primary
Namespace     = default
Node Pool     = default
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost  Unknown
mysql       0       0         3        0       0         0     0

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created    Modified
37e41f01  520855e6  mysql       0        run      running  10h7m ago  10h7m ago
69195c2c  8c55b7e6  mysql       0        run      running  10h7m ago  23m38s ago
8cf697a1  9da121d6  mysql       0        run      running  10h7m ago  10h7m ago
admin@app1:~$ nomad node status
ID        Node Pool  DC       Name   Class   Drain  Eligibility  Status
8c55b7e6  default    primary  data1  <none>  false  eligible     ready
0abc5c85  default    primary  app1   <none>  false  eligible     ready
520855e6  default    primary  data2  <none>  false  eligible     ready
9da121d6  default    primary  data3  <none>  false  eligible     ready
8f82e4c6  default    primary  app2   <none>  false  eligible     ready
admin@app1:~$

Running an alloc exec on the app1 node where no allocation is running produces the following result:

admin@app1:~$ nomad alloc exec -job mysql /bin/bash
failed to exec into task: No path to region
admin@app1:~$

Running it again on the data3 node where an allocation is running will function as expected:

admin@data3:~$ nomad alloc exec -job mysql /bin/bash
root@data3:/#

Running from another node with an explicit -region set to the default region functions as expected.

admin@app1:~$ nomad alloc exec -region primary -job mysql /bin/bash
root@data3:/#

Expected Result

The command should succeed from any node/agent in the cluster when no -region flag is given, since agents default to the region they are connected to.

Actual Result

An error is produced unless running on the node on which the allocation exists.

Job file (if appropriate)

Nomad Server logs (if appropriate)

Nomad Client logs (if appropriate)

@mcphailtom
Author

mcphailtom commented Dec 5, 2024

cc @half-ogre @jamesooo @andrba @another-mattr

@mcphailtom mcphailtom changed the title Nomad 1.8.4 alloc exec no path to regain Nomad 1.8.4 alloc exec no path to region Dec 5, 2024
@mcphailtom mcphailtom changed the title Nomad 1.8.4 alloc exec no path to region Nomad 1.8.4 alloc exec Error: no path to region Dec 5, 2024
@mcphailtom mcphailtom changed the title Nomad 1.8.4 alloc exec Error: no path to region Nomad 1.8.4 failed to exec into task: No path to region Dec 5, 2024
@tgross tgross moved this from Needs Triage to Triaging in Nomad - Community Issues Triage Dec 6, 2024
@tgross tgross self-assigned this Dec 6, 2024
@tgross
Member

tgross commented Dec 6, 2024

Hi @mcphailtom! The error you're seeing (ErrNoRegionPath) comes from only a few places in the code. Each Nomad client agent is connected to exactly one Nomad server at a time (rotating between servers as needed to rebalance and handle failures). When you call nomad alloc exec, the HTTP API call gets transformed into an RPC call. The RPC call is forwarded from that agent to the server the agent is connected to. That server forwards the RPC to the leader (if it's not the leader). The leader looks up which server the destination allocation is on, and forwards the RPC to that server. And finally, the RPC is forwarded from that server to the client where the allocation is running. This is why you can make a request to any node anywhere and it makes its way to the correct place.

The ErrNoRegionPath error appears in only 3 places:

In all 3 locations, we're looking up a server from the list of Raft peers.

That you're getting an error here suggests that there's a problem in your cluster topology. One or more servers are not showing up in the list of Raft peers. I suspect you've either got a split brain in the cluster during your upgrade or you've got one peer that's not joining (maybe it's got a configuration or networking issue).

To help diagnose:

  • The error should be showing up in the logs of the host that's hitting it. Is that error happening on a client or a server?
  • Use nomad server members to verify the health of servers.
  • The Autopilot Read Health API will tell you which servers the cluster thinks it knows about.
  • The Read Raft Configuration API should match that.
  • The List Members API will give you the list of servers in the gossip pool (this is a lot of the same information in nomad server members but more detail).
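
The three API checks above could be scripted roughly as follows. This is a sketch under assumptions: the agent address comes from `NOMAD_ADDR` (falling back to `http://127.0.0.1:4646`), an ACL token, if needed, comes from `NOMAD_TOKEN`, and the helper names `endpointURL` and `fetch` are made up for this example. It prints the raw JSON from each endpoint so the server lists can be compared by eye.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
	"strings"
	"time"
)

// diagnosticPaths are the HTTP paths for the Autopilot Read Health,
// Read Raft Configuration, and List Members APIs.
var diagnosticPaths = []string{
	"/v1/operator/autopilot/health",
	"/v1/operator/raft/configuration",
	"/v1/agent/members",
}

// endpointURL joins the agent address and an API path.
func endpointURL(addr, path string) string {
	return strings.TrimRight(addr, "/") + path
}

// fetch performs a GET and returns the response body as a string.
func fetch(client *http.Client, url, token string) (string, error) {
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return "", err
	}
	if token != "" {
		req.Header.Set("X-Nomad-Token", token)
	}
	resp, err := client.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("%s: %s: %s", url, resp.Status, strings.TrimSpace(string(body)))
	}
	return string(body), nil
}

func main() {
	addr := os.Getenv("NOMAD_ADDR")
	if addr == "" {
		addr = "http://127.0.0.1:4646" // assumed local agent address
	}
	client := &http.Client{Timeout: 5 * time.Second}
	for _, p := range diagnosticPaths {
		out, err := fetch(client, endpointURL(addr, p), os.Getenv("NOMAD_TOKEN"))
		if err != nil {
			fmt.Fprintln(os.Stderr, "error:", err)
			continue
		}
		fmt.Printf("== %s ==\n%s\n", p, out)
	}
}
```

Running it against both a server and the agent that reports the error, then comparing the output, should show whether any server is missing from one of the lists.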

Also, small detail:

> Running from another node with an explicit -region set to the default region functions as expected.

Your example shows you're passing the same name as the DC. Datacenter and region aren't the same thing. Maybe you have the same name for them both, but just wanted to make sure 😁

@tgross
Member

tgross commented Dec 10, 2024

I suspect #24635 is related and I have a reproduction for that and a culprit commit. Working on figuring out the underlying problem now.

tgross added a commit that referenced this issue Dec 10, 2024
In #16872 we added support for unix domain sockets, but this required mutating
the `Config` when parsing the address. In #23785 we fixed a bug where if the
configuration was used across multiple clients that mutation would happen
multiple times and the address would be incorrectly parsed.

When making `alloc log` or `alloc exec` calls to a region where the region is
not "global", we create a new client from the same configuration and then set
the address. But in this case we copy the private `url` field and that causes
the URL parsing to be skipped for the new client. This results in the region
always being set to the string literal `global` (because of mTLS handling code
introduced all the way back in 4d3b75d), which fails with an error "no path
to region" when the cluster's region isn't "global" and requests are sent to a
non-leader.

The "right" way of fixing this would be for `ClientConfig` not to change the
region to global in the first place, but as this is a public API and extremely
longstanding behavior, it could potentially be a breaking change for some
downstream consumers. Instead, we'll avoid copying the private `url` field so
that the new address is re-parsed.

Fixes: #24635
Fixes: #24609
Ref: #16872
Ref: #23785
Ref: 4d3b75d
tgross added a commit that referenced this issue Dec 10, 2024
In #16872 we added support for unix domain sockets, but this required mutating
the `Config` when parsing the address so as to remove the port number. In #23785
we fixed a bug where if the configuration was used across multiple clients that
mutation would happen multiple times and the address would be incorrectly
parsed.

When making `alloc log`, `alloc fs`, or `alloc exec` calls where we have
line-of-sight to the client, we attempt to make an HTTP API call directly to the
client node. So we create a new API client from the same configuration and then
set the address. But in this case we copy the private `url` field and that
causes the URL parsing to be skipped for the new client.

This results in the region always being set to the string literal
`"global"` (because of mTLS handling code introduced all the way back in
4d3b75d), unless the user has set the region specifically. This fails with
an error "no path to region" when the cluster's region isn't "global" and
requests are sent to a non-leader.

Arguably the "right" way of fixing this would be for `ClientConfig` not to
change the API client's region to `"global"` in the first place, but as this is
a public API and extremely longstanding behavior, it could potentially be a
breaking change for some downstream consumers. Instead, we'll avoid copying the
private `url` field so that the new address is re-parsed.

Fixes: #24635
Fixes: #24609
Ref: #16872
Ref: #23785
Ref: 4d3b75d
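
The stale-cache part of that commit message can be illustrated with a small model. This is a sketch, not the Nomad `api` package: the `config` type, its `parsed` method, and `derive` are hypothetical, and the model covers only the skipped re-parse, not the region defaulting that follows from it in the real client.

```go
package main

import (
	"fmt"
	"net/url"
)

// config is a hypothetical, simplified stand-in for an API client config.
type config struct {
	Address string
	url     *url.URL // private cache of the parsed Address
}

// parsed returns the cached URL if one exists; otherwise it parses
// Address and caches the result.
func (c *config) parsed() (*url.URL, error) {
	if c.url != nil {
		return c.url, nil
	}
	u, err := url.Parse(c.Address)
	if err != nil {
		return nil, err
	}
	c.url = u
	return u, nil
}

// derive builds a config for a new address. copyCache=true models the bug
// (the private url field is copied); copyCache=false models the fix.
func derive(base *config, address string, copyCache bool) *config {
	next := &config{Address: address}
	if copyCache {
		next.url = base.url
	}
	return next
}

func main() {
	base := &config{Address: "http://server.example:4646"}
	if _, err := base.parsed(); err != nil {
		panic(err)
	}
	for _, copyCache := range []bool{true, false} {
		c := derive(base, "http://client.example:4646", copyCache)
		u, err := c.parsed()
		if err != nil {
			panic(err)
		}
		fmt.Printf("copyCache=%v -> host %s\n", copyCache, u.Host)
	}
}
```

With the cache copied, the derived config still resolves to the original host because parsing is skipped. Leaving the private field empty forces the new address to be parsed, which is the approach the commit message describes.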
tgross added a commit that referenced this issue Dec 10, 2024
@tgross
Member

tgross commented Dec 10, 2024

Fix up here for review: #24644

@tgross tgross moved this from Triaging to In Progress in Nomad - Community Issues Triage Dec 10, 2024
@tgross tgross added this to the 1.9.x milestone Dec 10, 2024
tgross added a commit that referenced this issue Dec 10, 2024
tgross added a commit that referenced this issue Dec 16, 2024
@tgross tgross closed this as completed in 75b0202 Dec 16, 2024
@github-project-automation github-project-automation bot moved this from In Progress to Done in Nomad - Community Issues Triage Dec 16, 2024
tgross added a commit that referenced this issue Dec 16, 2024
tgross added a commit that referenced this issue Dec 16, 2024
…ddress (#24644) (#24682)


Co-authored-by: Tim Gross <[email protected]>