Can't schedule a job because "resources exhausted" #146

Closed
adriaandejonge opened this issue Sep 29, 2015 · 6 comments

@adriaandejonge

I tried starting Nomad on both CoreOS (local and GCE) and Debian (GCE) and running a cluster (both with -dev and with client/server). For installation, I followed the steps described in the Vagrantfile.

As soon as I try to run:

nomad init
nomad run example.nomad

I get a message that my resources are exhausted:

username@instance-2:~$ nomad run example.nomad
==> Monitoring evaluation "bac37878-9a39-bd3e-5ef2-acc51cde7981"
    Evaluation triggered by job "example"
    Scheduling error for group "cache" (failed to find a node for placement)
    Allocation "2263a54c-ea0e-ab68-adee-30f8d2e0827a" status "failed" (0/1 nodes filtered)
      * Resources exhausted on 1 nodes
      * Dimension "network: bandwidth exceeded" exhausted on 1 nodes
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "bac37878-9a39-bd3e-5ef2-acc51cde7981" finished with status "complete"

Or this variant:

core@core-01 ~ $ ./nomad run example.nomad 
==> Monitoring evaluation "ee52b439-49e1-f3e4-a3d2-0ad6a1fc48d2"
    Evaluation triggered by job "example"
    Scheduling error for group "cache" (failed to find a node for placement)
    Allocation "c23ca37e-2de5-b10d-d90a-d36bc468092c" status "failed" (0/1 nodes filtered)
      * Resources exhausted on 1 nodes
      * Dimension "network: no networks available" exhausted on 1 nodes
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "ee52b439-49e1-f3e4-a3d2-0ad6a1fc48d2" finished with status "complete"

This is the first job I am scheduling, so it is unlikely that the network resources are actually exhausted. Am I missing some configuration that tells Nomad about the network?

@kelseyhightower

I think the problem here is that nomad assumes eth0 for all systems, which is not true for systemd. See the code here: https://github.com/hashicorp/nomad/blob/master/client/fingerprint/network_unix.go#L38

@sethvargo
Contributor

Cross-linking with #158

@adriaandejonge
Author

Thanks for your reply @kelseyhightower & @sethvargo

Trying to test this hypothesis:

I chose this because:

username@instance-group-1-w000 /tmp $ ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default 
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: ens4v1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1460 qdisc fq_codel state UP group default qlen 1000
    link/ether 42:01:0a:f0:00:00 brd ff:ff:ff:ff:ff:ff
    inet 10.240.0.0/32 brd 10.240.0.0 scope global ens4v1
       valid_lft forever preferred_lft forever
    inet6 fe80::4001:aff:fef0:0/64 scope link 
       valid_lft forever preferred_lft forever
3: docker0@NONE: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state LOWERLAYERDOWN group default 
    link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff
    inet 172.17.42.1/16 scope global docker0
       valid_lft forever preferred_lft forever
    inet6 fe80::50d3:89ff:fe54:12e8/64 scope link 
       valid_lft forever preferred_lft forever

After compiling the modified version of nomad and running, I still get this result:

username@instance-group-1-w000 /tmp $ ./nomad run example.nomad 
==> Monitoring evaluation "f59bcb83-a2fb-7100-0ca5-426918f97e11"
    Evaluation triggered by job "example"
    Scheduling error for group "cache" (failed to find a node for placement)
    Allocation "af0cd1a2-199e-04cc-a8f7-f8dfe4323760" status "failed" (0/1 nodes filtered)
      * Resources exhausted on 1 nodes
      * Dimension "network: no networks available" exhausted on 1 nodes
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "f59bcb83-a2fb-7100-0ca5-426918f97e11" finished with status "complete"

Of course this is just a quick hack to test the hypothesis and not an actual fix. Are my steps above correct? If so, the problem might be different.

@ghost

ghost commented Oct 1, 2015

+1. Nomad 0.1.0 Docker images currently do not work on CentOS 7 or CoreOS stable (766.4.0); I get a similar error using example.nomad:

==> Monitoring evaluation "129a2fab-8d5a-4ff6-71ab-e6463c12e854"
    Evaluation triggered by job "example"
    Scheduling error for group "cache" (failed to find a node for placement)
    Allocation "9c141829-63bb-81ae-8267-802a7154c4e4" status "failed" (0/3 nodes filtered)
      * Resources exhausted on 3 nodes
      * Dimension "network: no networks available" exhausted on 3 nodes
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "129a2fab-8d5a-4ff6-71ab-e6463c12e854" finished with status "complete"

Installing net-tools (which provides ifconfig) and setting net.ifnames=0 to rename the interface back to eth0 also makes no difference on CentOS 7. Nomad server/agent log:

Sep 30 23:19:45 nomad.example.org nomad[2114]: ==> WARNING: Bootstrap mode enabled! Potentially unsafe operation.
Sep 30 23:19:45 nomad.example.org nomad[2114]: ==> Starting Nomad agent...
Sep 30 23:19:47 nomad.example.org nomad[2114]: 2015/09/30 23:19:47 [ERR] fingerprint.env_aws: Error querying AWS Metadata URL, skipping
Sep 30 23:19:47 nomad.example.org nomad[2114]: ==> Nomad agent configuration:
Sep 30 23:19:47 nomad.example.org nomad[2114]: Atlas: <disabled>
Sep 30 23:19:47 nomad.example.org nomad[2114]: Client: true
Sep 30 23:19:47 nomad.example.org nomad[2114]: Log Level: INFO
Sep 30 23:19:47 nomad.example.org nomad[2114]: Region: global (DC: dc1)
Sep 30 23:19:47 nomad.example.org nomad[2114]: Server: true
Sep 30 23:19:47 nomad.example.org nomad[2114]: ==> Nomad agent started! Log data will stream in below:
Sep 30 23:19:47 nomad.example.org nomad[2114]: 2015/09/30 23:19:45 [INFO] serf: EventMemberJoin: nomad.example.org.global 10.42.0.60
Sep 30 23:19:47 nomad.example.org nomad[2114]: 2015/09/30 23:19:45 [INFO] nomad: starting 1 scheduling worker(s) for [batch service _core]
Sep 30 23:19:47 nomad.example.org nomad[2114]: 2015/09/30 23:19:45 [INFO] client: using state directory /tmp/nomad/client
Sep 30 23:19:47 nomad.example.org nomad[2114]: 2015/09/30 23:19:45 [INFO] client: using alloc directory /tmp/nomad/alloc
Sep 30 23:19:47 nomad.example.org nomad[2114]: 2015/09/30 23:19:45 [INFO] raft: Node at 10.42.0.60:4647 [Follower] entering Follower state
Sep 30 23:19:47 nomad.example.org nomad[2114]: 2015/09/30 23:19:45 [WARN] serf: Failed to re-join any previously known node
Sep 30 23:19:47 nomad.example.org nomad[2114]: 2015/09/30 23:19:45 [INFO] nomad: adding server nomad.example.org.global (Addr: 10.42.0.60:4647) (DC: dc1)
Sep 30 23:19:47 nomad.example.org nomad[2114]: 2015/09/30 23:19:45 [ERR] fingerprint.network: Error calling ifconfig (/usr/sbin/ifconfig): %!s(<nil>)
Sep 30 23:19:47 nomad.example.org nomad[2114]: 2015/09/30 23:19:46 [WARN] raft: Heartbeat timeout reached, starting election
Sep 30 23:19:47 nomad.example.org nomad[2114]: 2015/09/30 23:19:46 [INFO] raft: Node at 10.42.0.60:4647 [Candidate] entering Candidate state
Sep 30 23:19:47 nomad.example.org nomad[2114]: 2015/09/30 23:19:46 [INFO] raft: Election won. Tally: 1
Sep 30 23:19:47 nomad.example.org nomad[2114]: 2015/09/30 23:19:46 [INFO] raft: Node at 10.42.0.60:4647 [Leader] entering Leader state
Sep 30 23:19:47 nomad.example.org nomad[2114]: 2015/09/30 23:19:46 [INFO] nomad: cluster leadership acquired
Sep 30 23:19:47 nomad.example.org nomad[2114]: 2015/09/30 23:19:46 [INFO] raft: Disabling EnableSingleNode (bootstrap)

The interesting line:

Sep 30 23:19:47 nomad.example.org nomad[2114]: 2015/09/30 23:19:45 [ERR] fingerprint.network: Error calling ifconfig (/usr/sbin/ifconfig): %!s(<nil>)

@cdrage
Contributor

cdrage commented Oct 2, 2015

It seems I am getting the same issue on Debian:

▶ cat /etc/debian_version 
8.1
▶ sudo ./nomad agent -dev
==> Starting Nomad agent...
2015/10/02 05:52:23 [ERR] fingerprint.env_aws: Error querying AWS Metadata URL, skipping
==> Nomad agent configuration:

                 Atlas: <disabled>
                Client: true
             Log Level: DEBUG
                Region: global (DC: dc1)
                Server: true

==> Nomad agent started! Log data will stream in below:

    2015/10/02 05:52:20 [INFO] serf: EventMemberJoin: wikus.global 127.0.0.1
    2015/10/02 05:52:20 [INFO] nomad: starting 4 scheduling worker(s) for [batch service _core]
    2015/10/02 05:52:20 [INFO] client: using alloc directory /tmp/NomadClient180079229
    2015/10/02 05:52:20 [INFO] raft: Node at 127.0.0.1:4647 [Follower] entering Follower state
    2015/10/02 05:52:20 [INFO] nomad: adding server wikus.global (Addr: 127.0.0.1:4647) (DC: dc1)
    2015/10/02 05:52:21 [ERR] fingerprint.network: Error calling ifconfig (/sbin/ifconfig): %!s(<nil>)
    2015/10/02 05:52:21 [WARN] fingerprint.network: Ethtool output did not match regex
    2015/10/02 05:52:21 [WARN] fingerprint.network: Ethtool not found, checking /sys/net speed file
    2015/10/02 05:52:22 [WARN] raft: Heartbeat timeout reached, starting election
    2015/10/02 05:52:22 [INFO] raft: Node at 127.0.0.1:4647 [Candidate] entering Candidate state
    2015/10/02 05:52:22 [DEBUG] raft: Votes needed: 1
    2015/10/02 05:52:22 [DEBUG] raft: Vote granted. Tally: 1
    2015/10/02 05:52:22 [INFO] raft: Election won. Tally: 1
    2015/10/02 05:52:22 [INFO] raft: Node at 127.0.0.1:4647 [Leader] entering Leader state
    2015/10/02 05:52:22 [INFO] raft: Disabling EnableSingleNode (bootstrap)
    2015/10/02 05:52:22 [DEBUG] raft: Node 127.0.0.1:4647 updated peer set (2): [127.0.0.1:4647]
    2015/10/02 05:52:22 [INFO] nomad: cluster leadership acquired
    2015/10/02 05:52:23 [DEBUG] client: applied fingerprints [arch cpu host memory storage network]
    2015/10/02 05:52:23 [DEBUG] client: available drivers [exec java qemu docker]
    2015/10/02 05:52:23 [DEBUG] client: node registration complete
    2015/10/02 05:52:23 [DEBUG] client: updated allocations at index 1 (0 allocs)
    2015/10/02 05:52:23 [DEBUG] client: allocs: (added 0) (removed 0) (updated 0) (ignore 0)
    2015/10/02 05:52:23 [DEBUG] client: state updated to ready
    2015/10/02 05:52:26 [DEBUG] http: Request /v1/jobs (584.169µs)
    2015/10/02 05:52:26 [DEBUG] worker: dequeued evaluation f6991d44-1e21-ec4e-aedf-81a010e725ff
    2015/10/02 05:52:26 [DEBUG] sched: <Eval 'f6991d44-1e21-ec4e-aedf-81a010e725ff' JobID: 'example'>: allocs: (place 1) (update 0) (migrate 0) (stop 0) (ignore 0)
    2015/10/02 05:52:26 [DEBUG] worker: submitted plan for evaluation f6991d44-1e21-ec4e-aedf-81a010e725ff
    2015/10/02 05:52:26 [DEBUG] sched: <Eval 'f6991d44-1e21-ec4e-aedf-81a010e725ff' JobID: 'example'>: setting status to complete
    2015/10/02 05:52:26 [DEBUG] worker: updated evaluation <Eval 'f6991d44-1e21-ec4e-aedf-81a010e725ff' JobID: 'example'>
    2015/10/02 05:52:26 [DEBUG] worker: ack for evaluation f6991d44-1e21-ec4e-aedf-81a010e725ff
    2015/10/02 05:52:26 [DEBUG] http: Request /v1/evaluation/f6991d44-1e21-ec4e-aedf-81a010e725ff (56.99µs)
    2015/10/02 05:52:26 [DEBUG] http: Request /v1/evaluation/f6991d44-1e21-ec4e-aedf-81a010e725ff/allocations (80.542µs)
    2015/10/02 05:52:26 [DEBUG] http: Request /v1/allocation/25924f64-41c9-9376-a1fe-f88ef6af81af (166.524µs)
▶ sudo ./nomad run example.nomad
==> Monitoring evaluation "f6991d44-1e21-ec4e-aedf-81a010e725ff"
    Evaluation triggered by job "example"
    Scheduling error for group "cache" (failed to find a node for placement)
    Allocation "25924f64-41c9-9376-a1fe-f88ef6af81af" status "failed" (0/1 nodes filtered)
      * Resources exhausted on 1 nodes
      * Dimension "network: no networks available" exhausted on 1 nodes
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "f6991d44-1e21-ec4e-aedf-81a010e725ff" finished with status "complete"

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 30, 2022