csi: aws-ebs-csi plugin v1.28.0 fails to place allocations #20094

Closed
lgfa29 opened this issue Mar 7, 2024 · 3 comments · Fixed by #24522
Comments

@lgfa29
Contributor

lgfa29 commented Mar 7, 2024

Starting in v1.28.0, the AWS EBS CSI plugin introduced a new segment key in the NodeGetInfo response.

When this version of the plugin is used, the Nomad client receives the following accessible topology segments.

"AccessibleTopology": {
    "Segments": {
        "topology.ebs.csi.aws.com/zone": "ca-central-1a",
        "topology.kubernetes.io/zone": "ca-central-1a"
    }
}

But dynamic volume creation does not add the topology.kubernetes.io/zone key, even if requested.

"Topologies": [
    null,
    {
        "Segments": {
            "topology.ebs.csi.aws.com/zone": "ca-central-1a"
        }
    }
],

This causes the volume to fail scheduling because the topology segments are compared with strict equality.

func (t *CSITopology) Equal(o *CSITopology) bool {
	if t == nil || o == nil {
		return t == o
	}
	return maps.Equal(t.Segments, o.Segments)
}

Any job that tries to use it will fail placement with a constraint violation.

==> 2024-03-07T13:17:59-05:00: Monitoring evaluation "ea69ca5b"
    2024-03-07T13:17:59-05:00: Evaluation triggered by job "mysql"
    2024-03-07T13:18:00-05:00: Evaluation within deployment: "2e1cd256"
    2024-03-07T13:18:00-05:00: Evaluation status changed: "pending" -> "complete"
==> 2024-03-07T13:18:00-05:00: Evaluation "ea69ca5b" finished with status "complete" but failed to place all allocations:
    2024-03-07T13:18:00-05:00: Task Group "mysql" (failed to place 1 allocation):
      * Constraint "did not meet topology requirement": 1 nodes excluded by filter
    2024-03-07T13:18:00-05:00: Evaluation "c59054f1" waiting for additional capacity to place remainder

I've skimmed the CSI spec a few times, but it's not clear to me how this scenario is supposed to be handled.

As pointed out in kubernetes-sigs/aws-ebs-csi-driver#729 (comment), Kubernetes ignores additional node segments, so the additional segments in the node may be safe to ignore.

Reproduction steps

  1. Create an SSH key on AWS (if you don't have one already).
  2. Clone https://gist.github.com/lgfa29/b707d56ace871602cb4955df2a1afad0
    git clone https://gist.github.com/b707d56ace871602cb4955df2a1afad0.git
  3. Enter the directory and provision the infrastructure with Terraform.
    cd b707d56ace871602cb4955df2a1afad0
    terraform init
    terraform apply 
  4. Enter your SSH key name and approve the plan.
  5. Run the mysql.nomad.hcl job.
    NOMAD_ADDR=$(terraform output -raw nomad_addr) nomad run mysql.nomad.hcl

Expected Result

Job starts successfully.

Actual Result

Job fails placement with constraint error:

Constraint "did not meet topology requirement": 1 nodes excluded by filter
@magec

magec commented Jun 4, 2024

Had the same issue, downgrading to 1.27.0 fixed it for me.

@Vigenere36

Wondering about the timeline on this: is the fix as simple as taking the intersection of the two maps? Not crucial, but it's blocking our team from upgrading our CSI driver.

@thefallentree
Contributor

Sent a patch to fix this: #24522
