job modify index has changed since last refresh #398

Open
shantanugadgil opened this issue Dec 4, 2023 · 10 comments

@shantanugadgil
Contributor

Terraform Version

Terraform v1.6.5
on linux_amd64
+ provider registry.terraform.io/hashicorp/aws v5.29.0
+ provider registry.terraform.io/hashicorp/local v2.4.0
+ provider registry.terraform.io/hashicorp/nomad v2.0.0

Nomad Version

3-node server cluster at version 1.6.3

Provider Configuration

Which values are you setting in the provider configuration?

provider "nomad" {
  address = "http://nomad.somedomain:80"
  region  = "...."
}

Environment Variables

Do you have any Nomad specific environment variable set in the machine running Terraform?
Not Nomad specific, but we have TF_IN_AUTOMATION = "1" set as we run this under Atlantis.

Affected Resource(s)

nomad_job

We have a "common" job that runs via the Nomad provider like so ...

Terraform Configuration Files

resource "nomad_job" "common" {
  hcl2 {
    allow_fs = true
  }
  purge_on_destroy = true

  jobspec = templatefile("${path.module}/common.nomad.tpl",
    {
      common_bash         = data.local_file.common_bash.content
    }
  )
}

We have Atlantis set up for automation.

The problem is that the atlantis plan works fine, but the atlantis apply fails.

Debug Output

N/A

Panic Output

N/A

Expected Behavior

The apply should have completed successfully.

Actual Behavior

╷
│ Error: job modify index has changed since last refresh
│ 
│   with nomad_job.common,
│   on common.tf line 9, in resource "nomad_job" "common":
│    9: resource "nomad_job" "common" {
│ 
╵

Steps to Reproduce

Please list the steps required to reproduce the issue, for example:

  1. terraform apply

Important Factoids

This behavior is only seen when running under Atlantis.
If I merge the code and run it on the command line, this doesn't occur.

References

N/A

Q: Is there something we could do (like a force refresh before apply) to make this error go away?
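
For example, something along these lines before the apply (just a sketch of the idea):

terraform apply -refresh-only -auto-approve
terraform apply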

@lgfa29
Contributor

lgfa29 commented Dec 16, 2023

Hi @shantanugadgil 👋

This error happens when the job is modified between the Terraform state refresh and the moment the provider plans the job submission.

if resp != nil && resp.JobModifyIndex != wantModifyIndex {
    // Should rarely happen, but might happen if there was a concurrent
    // other process writing to Nomad since our Read call.
    return fmt.Errorf("job modify index has changed since last refresh")
}

Unfortunately I have not been able to reproduce it since it should only happen in a race condition. Do you, by any chance, have multiple Terraform runs modifying the same job at the same time?

Another unfortunate aspect of this is that the Terraform SDK doesn't provide a way for us to get around this. State refresh is handled completely outside of the provider, and by the time the provider is invoked we only receive the state data. Retrying inside the provider would have no effect since the state data will always be the same.

I don't have a lot of experience with Atlantis, but is there a way to automatically retry failed runs? Or maybe reduce how many plan/applies are executed in parallel? The only way forward that I see to fix this would require changes outside of the provider.

Maybe we could make this check optional? But then applying a stale Terraform plan would be almost guaranteed to result in data loss, since the job was changed while the diff was being computed.

@shantanugadgil
Contributor Author

Unfortunately I have not been able to reproduce it since it should only happen in a race condition. Do you, by any chance, have multiple Terraform runs modifying the same job at the same time?

There is nothing changing the job (and hence its modify index) outside of Terraform.

I could elaborate on the contents of the state... it is an AWS ASG with some Nomad jobs launching on it.

The job type is system and it uses node.class to restrict the job to the ASG (this particular "common" job deploys some logrotate configs, etc., and then sits in a sleep infinity).

Another unfortunate aspect of this is that the Terraform SDK doesn't provide a way for us to get around this. State refresh is handled completely outside of the provider, and by the time the provider is invoked we only receive the state data. Retrying inside the provider would have no effect since the state data will always be the same.

Because of this, we use Atlantis only to check that we have no "compile errors" (basic syntax issues), and the atlantis apply is best effort. When the atlantis apply fails (which is random), we proceed to apply from a "terraform machine", as mentioned in the "factoids" above.

Since the bug report above, I have gone ahead and added refresh=true, but that hasn't helped :( it still fails randomly.

    apply:
      steps:
        - apply:
            extra_args: ["-refresh=true"]

As I am typing this, I realized another thing ... this job (and others like it) that randomly fail are in a namespace other than default.

(hunch) could that be related somehow?

@lgfa29
Contributor

lgfa29 commented Dec 18, 2023

As I am typing this, I realized another thing ... this job (and others like it) which randomly fail are in a different namespace than default.

I've recently merged this change, which forces a namespace on the job plan request.

The wrong namespace could be an issue if:

  1. You're running Terraform in a context where the NOMAD_NAMESPACE env var is set (like TFC or if your Atlantis runner is a Nomad allocation)
  2. You have jobs with the same name on different namespaces.

But I suspect you would have a lot more problems than the modify index being different 🤔
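
If you want to rule the namespace out, one option is to pin it explicitly in the jobspec itself, e.g. (a minimal sketch, with an assumed namespace name):

job "common" {
  # Assumed namespace name; pinning it here means the job no longer
  # depends on whatever NOMAD_NAMESPACE happens to be set to.
  namespace = "my-namespace"

  # ... rest of the jobspec unchanged ...
}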

Next time it happens, could you collect the value of modify_index in state for the resource being modified and also the ModifyIndex from the Nomad API?
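
For example, something along these lines (assuming the Nomad job is also named "common"; substitute your namespace):

terraform state show nomad_job.common | grep modify_index
nomad job inspect -namespace <your-namespace> common | grep JobModifyIndex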

@shantanugadgil
Contributor Author

Next time it happens, could you collect the value of modify_index in state for the resource being modified and also the ModifyIndex from the Nomad API?

OK, will try to capture these the next time.

@VenelinMartinov

Hey @lgfa29, I am one of the developers of the pulumi-nomad provider, which uses the TF nomad provider under the hood.

Pulumi, unlike Terraform, does not run a refresh by default, so this issue affects Pulumi users more than users of the TF provider. It should be very reproducible in TF with terraform apply -refresh=false.

Is it possible to supply a flag here to disable the "index has changed since last refresh" check? For some workflows that check is not necessary, and clobbering the existing state is the desired behaviour.
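
Something along these lines is what I have in mind (the argument name is purely hypothetical, it does not exist in the provider today):

resource "nomad_job" "example" {
  jobspec = file("example.nomad")

  # Hypothetical argument, not part of the provider today: skip the
  # modify-index comparison and overwrite whatever is registered in Nomad.
  ignore_modify_index_check = true
}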

@shantanugadgil
Contributor Author

Closing due to age. A lot of Nomad versions have been released since the original bug report, and there have been a couple of version updates to the provider as well.

@shantanugadgil closed this as not planned Nov 6, 2024
@VenelinMartinov

VenelinMartinov commented Nov 6, 2024

@shantanugadgil this is still an issue, can we please reopen it? Is it possible to also consider the suggestion in #398 (comment) of adding a flag to disable the "index has changed since last refresh" check?

@shantanugadgil
Contributor Author

@shantanugadgil this is still an issue, can we please reopen it? Is it possible to also consider the suggestion in #398 (comment) of adding a flag to disable the "index has changed since last refresh" check?

I am just doing some cleaning up of issues I have reported, but which have been open for a long time! 🙂

I didn't think someone else was tracking this as well! 👍

I'll re-open this issue.

@shantanugadgil reopened this Nov 6, 2024
@VenelinMartinov

Should I open a new issue for this? What's the best way to bubble this problem up?

@mashedmonday

This issue is a real pain for us on our development cluster, mostly because we use auto_revert=true in Nomad quite a bit, and whenever a deployment is rolled back it essentially breaks Pulumi for us.
