Created EKS Addon does not get saved to state if it does not become active #4759

Closed
flostadler opened this issue Nov 12, 2024 · 5 comments · Fixed by #4898 or #4907
Labels
awaiting/bridge The issue cannot be resolved without action in pulumi-terraform-bridge. blocked The issue cannot be resolved without 3rd party action. customer/feedback Feedback from customers impact/regression Something that used to work, but is now broken kind/bug Some behavior is incorrect or out of spec resolution/fixed This issue was fixed service/eks EKS issues

@flostadler
Contributor

Describe what happened

When creating an EKS Addon, the provider will first send the CreateAddon API call to AWS and then wait for the addon to become active.
Some addons, like coredns, take longer to become active and might hit wait timeouts.
If the resource creation fails while waiting for the Addon to become active, the resource isn't saved to state.

Re-running pulumi up is now guaranteed to fail because Pulumi will try to create the addon again, even though it already exists in the cluster.

As a workaround, users either need to delete the addon from the cluster manually or import the Addon into Pulumi state.
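
For the import route, Pulumi's `import` resource option needs the provider's import ID for the existing addon. A minimal sketch, assuming the ID follows the upstream `aws_eks_addon` format of cluster name and addon name joined by a colon; the helper name `addonImportId` is hypothetical and not part of `@pulumi/aws`:

```typescript
// Hypothetical helper (not part of @pulumi/aws): builds the import ID for an
// existing EKS addon, assuming the upstream "<clusterName>:<addonName>" format.
function addonImportId(clusterName: string, addonName: string): string {
    if (clusterName === "" || addonName === "") {
        throw new Error("clusterName and addonName must be non-empty");
    }
    return `${clusterName}:${addonName}`;
}
```

The result would be passed as the `import` resource option on the `aws.eks.Addon` resource for one `pulumi up` and removed afterwards. Note that the cluster in the sample program is auto-named, so the real cluster name has to be looked up first.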

Sample program

import * as pulumi from "@pulumi/pulumi";
import * as aws from "@pulumi/aws";
import * as awsx from "@pulumi/awsx";
import * as eks from "@pulumi/eks";

// Grab some values from the Pulumi configuration (or use default values)
const config = new pulumi.Config();
const vpcNetworkCidr = config.get("vpcNetworkCidr") || "10.0.0.0/16";

const env = "aws-addon-bug"

// Create a new VPC
const eksVpc = new awsx.ec2.Vpc("eks-vpc", {
    enableDnsHostnames: true,
    cidrBlock: vpcNetworkCidr,
});

const instanceRole = new aws.iam.Role('testrole', {
    assumeRolePolicy: JSON.stringify({
        Version: '2012-10-17',
        Statement: [
            {
                Effect: 'Allow',
                Principal: {
                    Service: 'ec2.amazonaws.com',
                },
                Action: 'sts:AssumeRole',
            },
        ],
    }),
    managedPolicyArns: [
        'arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy',
        'arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryPullOnly',
        'arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy',
    ],
})

const eksCluster = new eks.Cluster(`${env}-cluster`, {
    vpcId: eksVpc.vpcId,
    authenticationMode: eks.AuthenticationMode.Api,
    corednsAddonOptions: {
        enabled: false,
    },
    createOidcProvider: true,
    enabledClusterLogTypes: ['api', 'audit', 'authenticator'],
    fargate: false,
    instanceRole: instanceRole,
    kubeProxyAddonOptions: {
        enabled: false,
    },
    nodeAssociatePublicIpAddress: false,
    privateSubnetIds: eksVpc.privateSubnetIds,
    publicSubnetIds: eksVpc.publicSubnetIds,

    skipDefaultSecurityGroups: false,
    skipDefaultNodeGroup: true,
    useDefaultVpcCni: false,
    version: '1.25',
});

const mng = new eks.ManagedNodeGroup(`${env}-managed-ng`, {
    cluster: eksCluster,
    instanceTypes: ['t3.medium'],
    scalingConfig: {
        desiredSize: 1,
        maxSize: 1,
        minSize: 1,
    },
    nodeRole: instanceRole,
});

const addonVersion = aws.eks.getAddonVersionOutput({
    addonName: 'coredns',
    kubernetesVersion: eksCluster.eksCluster.version,
    mostRecent: true,
}).version;

// takes ~ 15 minutes to create
new aws.eks.Addon(`${env}-cluster-coredns`, {
    clusterName: eksCluster.eksCluster.name,
    addonName: 'coredns',
    addonVersion: addonVersion,
    resolveConflictsOnCreate: 'OVERWRITE',
    resolveConflictsOnUpdate: 'OVERWRITE',
}, { customTimeouts: { create: '2m', update: '2m' }, dependsOn: [mng] });

// Export some values for use elsewhere
export const kubeconfig = eksCluster.kubeconfig;
export const vpcId = eksVpc.vpcId;

Log output

n/a

Affected Resource(s)

  • aws.eks.Addon

Output of pulumi about

n/a

Additional context

No response

Contributing

Vote on this issue by adding a 👍 reaction.
To contribute a fix for this issue, leave a comment (and link to your pull request, if you've opened one already).

@flostadler flostadler added kind/bug Some behavior is incorrect or out of spec needs-triage Needs attention from the triage team service/eks EKS issues customer/feedback Feedback from customers and removed needs-triage Needs attention from the triage team labels Nov 12, 2024
@t0yv0 t0yv0 self-assigned this Dec 2, 2024
@mjeffryes mjeffryes assigned flostadler and unassigned t0yv0 Dec 3, 2024
@mjeffryes mjeffryes added this to the 0.114 milestone Dec 3, 2024
@flostadler
Contributor Author

While working on pulumi/pulumi-eks#1509 I ran into this, but it occurred after updating pulumi-aws from v6.47.0 to v6.63.0: pulumi/pulumi-eks#1519 (comment)

This makes me believe that this is actually a regression. I'll bisect the versions to confirm my suspicion.

@flostadler
Contributor Author

I bisected the versions and it's indeed a regression that was introduced in v6.51.0.

That one includes multiple upstream upgrades and a bridge upgrade. Looking into these to find the root cause
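
A bisect like the one described above can be sketched as a binary search over an ordered version list. This is only an illustration: the function name `firstBadVersion` and the `isBad` callback (which would deploy the sample program against a given provider version and report whether the addon was lost from state) are assumptions, not tooling used in this issue.

```typescript
// Returns the first version for which isBad is true. Assumes versions are in
// release order, the first entry is known good, the last entry is known bad,
// and every version after the first bad one is also bad.
function firstBadVersion(versions: string[], isBad: (v: string) => boolean): string {
    if (versions.length < 2) {
        throw new Error("need at least one known-good and one known-bad version");
    }
    let good = 0;                   // index of a known-good version
    let bad = versions.length - 1;  // index of a known-bad version
    while (bad - good > 1) {
        const mid = Math.floor((good + bad) / 2);
        if (isBad(versions[mid])) {
            bad = mid;   // regression is at mid or earlier
        } else {
            good = mid;  // regression is after mid
        }
    }
    return versions[bad];
}
```

Each probe halves the remaining range, so the number of deployments grows with the logarithm of the number of releases between the known-good and known-bad versions.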

@flostadler flostadler added the impact/regression Something that used to work, but is now broken label Dec 9, 2024
@flostadler
Contributor Author

flostadler commented Dec 9, 2024

The upstream upgrades v5.64.0 and v5.65.0 do not include any suspect changes (no changes to any EKS resources). It doesn't repro in Terraform either.
So this seems to be a regression on the Pulumi side.

@flostadler
Contributor Author

flostadler commented Dec 9, 2024

It's this: pulumi/pulumi-terraform-bridge#2696

Since enabling PlanResourceChange (PRC), we no longer return the partial state to the engine for init errors. That's why Pulumi doesn't save the Addon to state when it fails to initialize.
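
The difference between the regressed and the intended behavior can be sketched with a small model. Everything here is hypothetical and simplified: the types, function names, and error string are illustrative and do not mirror the actual Go code in pulumi-terraform-bridge.

```typescript
// Hypothetical, simplified model of the create path; not the bridge's real API.
interface ApplyResult {
    state: Record<string, string> | null; // state reported by the TF apply, if any
    error: string | null;                 // error reported by the TF apply, if any
}

interface CreateResponse {
    savedState: Record<string, string> | null; // what the engine persists
    error: string | null;                      // error surfaced to the user
}

// Regressed behavior: any error discards the state, so the engine forgets
// a resource that already exists in the cloud and tries to create it again.
function createDroppingState(result: ApplyResult): CreateResponse {
    if (result.error !== null) {
        return { savedState: null, error: result.error };
    }
    return { savedState: result.state, error: null };
}

// Intended behavior: an error accompanied by non-null state keeps that state,
// so the next `up` can update the partially created resource.
function createKeepingPartialState(result: ApplyResult): CreateResponse {
    return { savedState: result.state, error: result.error };
}

// An addon that was created but timed out while waiting to become active.
// The ID and error text are made up for illustration.
const timedOut: ApplyResult = {
    state: { id: "my-cluster:coredns" },
    error: "timeout while waiting for addon to become active",
};
```

With `timedOut` as input, the first function leaves nothing in state, which matches the behavior reported in this issue; the second keeps the addon's ID alongside the error.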

@flostadler flostadler added blocked The issue cannot be resolved without 3rd party action. awaiting/bridge The issue cannot be resolved without action in pulumi-terraform-bridge. labels Dec 9, 2024
VenelinMartinov added a commit to pulumi/pulumi-terraform-bridge that referenced this issue Dec 9, 2024
In the SDKv2 bridge under PlanResourceChange we are not passing any
state we receive during TF Apply back to the engine if we also received
an error. This causes us to incorrectly miss any resources which were
created but encountered errors during the creation process. The engine
should see these as `ResourceInitError`, which allows the engine to
attempt to update the partially created resource on the next `up`.

This PR fixes the issue by passing the state down to the engine in the
case when we receive an error and a non-nil state from TF during Apply.

related to pulumi/pulumi-gcp#2700
related to pulumi/pulumi-aws#4759

fixes #2696
@pulumi-bot pulumi-bot added the resolution/fixed This issue was fixed label Dec 10, 2024
VenelinMartinov added a commit to pulumi/pulumi-terraform-bridge that referenced this issue Dec 11, 2024
In the SDKv2 bridge under PlanResourceChange we are not passing any
state we receive during TF Apply back to the engine if we also received
an error. This causes us to incorrectly miss any resources which were
created but encountered errors during the creation process. The engine
should see these as ResourceInitError, which allows the engine to
attempt to update the partially created resource on the next up.

This PR fixes the issue by passing the state down to the engine in the
case when we receive an error and a non-nil state from TF during Apply.

This is the second attempt at this. The first was
#2695 but was
reverted because it caused a different panic:
#2706. We added
a regression test for that in
#2710

The reason for that panic was that we were now creating a non-nil
`InstanceState` with a nil `stateValue` which causes the `ID` function
to panic. This PR fixes both issues by not allowing non-nil states with
nil `stateValue`s and by preventing the panic in `ID`.

There was also a bit of fun with go nil interfaces along the way, which
is the reason why `ApplyResourceChange` now returns a
`shim.InstanceState` interface instead of a `*v2InstanceState2`.
Otherwise we end up creating a non-nil interface with a nil value.

related to pulumi/pulumi-gcp#2700
related to pulumi/pulumi-aws#4759

fixes #2696
corymhall added a commit that referenced this issue Dec 11, 2024
This PR was generated via $ upgrade-provider pulumi/pulumi-aws
--kind=bridge.

---

Upgrading pulumi-terraform-bridge from v3.96.0 to v3.97.1.

**Manual Changes:**
copied from #4898

Fixes #4759
Fixes #4894
@pulumi-bot
Contributor

This issue has been addressed in PR #4898 and shipped in release v6.65.0.

4 participants