network hook fails after client restart w/ non-Docker driver #9750
Hi @shishir-a412ed! So I think that error is bubbling up either from the network hook or from the task driver.
@tgross Thank you for the quick response. Your analysis is spot on! I added some logging.

The error is indeed coming from there. It looks like the nomad client has no problems creating the network namespace on the first run. When nomad + containerd-driver restarts, it tries to create the namespace again and fails.
I checked the process, and it's running as root. Also, where do you see in the containerd-driver that it's being deferred to the Linux default?
Ok, so good news and bad news. The good news is that I was able to reproduce the behavior with the following jobspec:

```hcl
job "execjob" {
  datacenters = ["dc1"]

  group "execgroup" {
    network {
      mode = "bridge"
      port "www" {
        to = "8000"
      }
    }

    task "exectask" {
      driver = "exec"
      config {
        command = "python"
        args    = ["-m", "SimpleHTTPServer"]
      }
    }
  }
}
```

Run the job, which works fine. Take a look at the permissions for that netns:
Restart the Nomad client, as root:
And this ends up causing a restart of the task.
I might be missing it, but there's no implementation of that method that I can find. I'm going to rename this bug, and we'll dig in further to figure out what's going on here.

Edit: interesting, it looks like way back in 0.10.0 I'd tried to solve for not recreating network namespaces: e17901d. I suspect either there's a bug there we missed or a regression since then.
Ok, I went through #6315 and it looks like I introduced a fix for Docker (see e17901d#diff-13af1c2034f8a861c687bbeea321da745d2490f0110857c3a805fb385bcf0804R50-R59) but the fix was missing what we needed for the default path. I think when we create the file, if we get an error, we should then check for the existence of the file (which means it was previously created), and return the existing namespace instead of failing.
I've opened #9757 with a patch for this.

That PR is merged and the fix will ship in Nomad 1.0.2.
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
Nomad version
Nomad v0.11.4+ent
Operating system and Environment details
Ubuntu 18.04.5 LTS (Bionic Beaver)
Issue
We are seeing this error when the containerd-driver restarts and tries to reattach to the existing allocation. Nomad is unable to attach to the existing allocation, throws this error, and starts a new allocation.
Reproduction steps

1. Run a job using the containerd-driver.
2. Restart nomad + containerd-driver.
3. `nomad job status <job>` should show a new allocation and a previously failed allocation.
4. `nomad alloc status <failed_alloc_id>` should show the above error message.

Logs