
google_dataflow_job - when updating, wait for new job to start #3591

Merged

Conversation

@jcanseco (Member) commented Jun 2, 2020:

Release Note Template for Downstream PRs (will be copied)

dataflow: changed the update logic for `google_dataflow_job` to wait for the replacement job to start successfully before modifying the resource ID to point to the replacement job

@modular-magician (Collaborator):

Hello! I am a robot who works on Magic Modules PRs.

I have detected that you are a community contributor, so your PR will be assigned to someone with a commit-bit on this repo for initial review.

Thanks for your contribution! A human will be with you soon.

@emilymye, please review this PR or find an appropriate assignee.

@modular-magician requested a review from @emilymye on June 2, 2020 02:27
@jcanseco (Member Author) commented Jun 2, 2020:

I'm not really sure how to add reviewers, but @c2thorn would have context regarding this change.

@modular-magician (Collaborator):

Hi! I'm the modular magician. Your PR generated some diffs in downstreams - here they are.

Diff report:

Terraform GA: Diff ( 1 file changed, 41 insertions(+))
Terraform Beta: Diff ( 1 file changed, 41 insertions(+))

@emilymye (Contributor) commented Jun 2, 2020:

Hi @jcanseco! Thank you so much for contributing to MM! We really appreciate all the contributions you've been making. Just a couple comments but otherwise LGTM.

I'd also note (for future PRs; this one can stay as is) that we set up some polling utils in common_polling.go, where you pass PollingWaitTime(...) a read function and a status-checking function specific to the resource.

No action needed to use those utils right now, though. I see some hardcoded timeouts in the existing code, so I'll probably follow up with a PR that adds the polling utils.

third_party/terraform/resources/resource_dataflow_job.go (outdated)
case "JOB_STATE_FAILED":
return resource.NonRetryableError(fmt.Errorf("the replacement job with ID %q failed with state %q.", replacementJobID, state))
default:
log.Printf("the replacement job with ID %q has state %q.", replacementJobID, state)
Contributor:

Suggested change
log.Printf("the replacement job with ID %q has state %q.", replacementJobID, state)
log.Printf("[DEBUG] replacement job with ID %q has successful terminal state %q.", replacementJobID, state)

(needs [DEBUG] or else TF won't print it)

Member Author:

Done.

return resource.RetryableError(fmt.Errorf("the replacement job with ID %q has not yet started and has state %q.", replacementJobID, state))
case "JOB_STATE_FAILED":
return resource.NonRetryableError(fmt.Errorf("the replacement job with ID %q failed with state %q.", replacementJobID, state))
default:
Member:

Suggested change
default:
case "":
return resource.RetryableError(fmt.Errorf("the replacement job with ID %q does not have a defined state. Retrying.", replacementJobID))
default:

Found a case where the state comes back empty before eventually reaching JOB_STATE_FAILED. Adding a retry here lets polling reach the failed state.

Contributor:

we can also combine this with "JOB_STATE_PENDING" and change the message to "has pending state %q"

Member Author:

Great catch! Fixed.
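Putting the suggestions in this thread together, the resulting state handling can be modeled as a small standalone function. This is illustrative only; the real code wraps these outcomes in resource.RetryableError / resource.NonRetryableError from the Terraform SDK:

```go
package main

import "fmt"

// Decision models the three outcomes of polling the replacement job.
type Decision int

const (
	Retry Decision = iota // keep polling (empty or pending state)
	Fail                  // terminal failure, stop with an error
	Done                  // the replacement job has started
)

// classifyReplacementJobState applies the logic discussed above:
// an empty or pending state is retried, JOB_STATE_FAILED is a
// non-retryable failure, and any other state means the job started.
func classifyReplacementJobState(state string) (Decision, error) {
	switch state {
	case "", "JOB_STATE_PENDING":
		return Retry, fmt.Errorf("replacement job has pending state %q", state)
	case "JOB_STATE_FAILED":
		return Fail, fmt.Errorf("replacement job failed with state %q", state)
	default:
		return Done, nil
	}
}
```

Folding the empty string into the pending case, as suggested, means an API response without a state yet is simply treated as "not started" and retried.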

@jcanseco force-pushed the dataflow_wait_for_update branch from 3b31ee5 to 8b8a5d2 on June 2, 2020 19:33
@jcanseco (Member Author) commented Jun 2, 2020:

Hi @jcanseco! Thank you so much for contributing to MM! We really appreciate all the contributions you've been making.

My pleasure!

@modular-magician (Collaborator):

@rambleraptor, please review this PR or find an appropriate assignee.

@modular-magician (Collaborator):

Diff report:

Terraform GA: Diff ( 1 file changed, 41 insertions(+))
Terraform Beta: Diff ( 1 file changed, 41 insertions(+))

@emilymye emilymye removed the request for review from rambleraptor June 2, 2020 19:43
This patch modifies the update-by-replacement logic to wait for the new
job to start before updating the google_dataflow_job's resource ID to
point to the new job's ID. This ensures that the google_dataflow_job
resource continues to point to the original job if the update operation fails.
@jcanseco force-pushed the dataflow_wait_for_update branch from 8b8a5d2 to 540663a on June 2, 2020 20:16
@modular-magician (Collaborator):

@rambleraptor, please review this PR or find an appropriate assignee.

@c2thorn removed the request for review from rambleraptor on June 2, 2020 20:18
@modular-magician (Collaborator):

Diff report:

Terraform GA: Diff ( 1 file changed, 38 insertions(+))
Terraform Beta: Diff ( 1 file changed, 38 insertions(+))

@emilymye self-requested a review on June 3, 2020 21:07
@emilymye (Contributor) left a comment:

LGTM - running a test just as a sanity check; if it passes, I'll merge.


region, err := getRegion(d, config)
if err != nil {
return resource.NonRetryableError(err)
Member Author:

@emilymye @c2thorn,

@spew brought up the following about this line:

This seems like a place where we would want to retry and thus not return NonRetryableError?
For example, transient errors such as service 500s, TLS handshake failures, etc., will, I believe, be returned by resourceDataflowJobGetJob(...), since that function uses the GCP APIs directly.

Thoughts?

Member:

Sounds plausible, but I haven't seen any such errors in practice. It doesn't hurt to retry if we know specifically which errors we want to retry on.

Contributor:

This has occurred for many of our resources in the past. Most of it was fixed by running calls through the retry functions in retry_utils.go using the defaultRetryPredicates in error_retry_predicates.

Contributor:

I suggest using the default retry predicates.

Member Author:

Ok, I'll put out a patch for Magic Modules, then if it looks good to the Terraform team, I can bring that patch to KCC's copy of Terraform to ensure we're in sync.

Member:

Sounds good @jcanseco!

Member Author:

Also, apologies: I just realized I commented on the "NonRetryableError" for getRegion() when I meant to do so for the one for resourceDataflowJobGetJob(). Might've been obvious, but I thought I should clarify.
