Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Initial experimental support for deferred actions when count or for_each are unknown #34651

Merged
merged 13 commits into from
Feb 12, 2024

Conversation

apparentlymart
Copy link
Contributor

This PR represents the first real functional milestone toward solving #30937, in a very early experimental capacity.

There are still various details not quite right here yet, but this pile of commits has already got big enough so I think it's time to land it and then we can continue working on the details in subsequent smaller PRs.

The end result of all of this is that:

  • In any module that's opted in to the unknown_instances language experiment, it's valid for count or for_each in a module or resource block to be unknown. (That was actually mostly true before, but there was a bug making it not work for module calls, and turning it on before just caused the plan to be broken rather than doing anything useful.)
  • For any resource block that has an unknown count or for_each, or that is declared beneath a module call with unknown count or for_each, Terraform asks the provider to confirm that what we know of the configuration is plannable, but then just makes a note that all of the instances of this resource must have their actions deferred to a future round.
  • For any resource that refers to a resource that's already been deferred for another reason, all of the planning actions run as normal but then the result isn't actually added to the plan, and instead Terraform again just makes a note that it's been deferred to a future round.
  • When a module used as part of a component in a stack produces any deferred actions, any downstream components have all of their actions forced deferred, to ensure that we preserve the dependency ordering between components. However, the planning actions do otherwise still run as normal, so that operators can get feedback if we can prove that anything about the configuration is invalid with the information available.

My main priority in this first round was to keep the new codepaths as isolated as possible from the existing code, to minimize the risk of impacting the behavior for anyone not participating in the experiment. Therefore there are some rough edges and compromises called out in comments throughout. Since none of this new code should be reachable in stable releases, I'd ask that reviewers pardon my dust and trust that we'll continue improving this in future PRs.

With that said though: if you do see something that seems like it could break non-experimental usage, please call that out! (The fact that I needed to make very few changes to existing tests gives me confidence, but it can't hurt to double-check.)

The new test coverage here is also relatively light, and I apologize for that. My next step here is going to be to try this with a more realistic stack configuration from our set of examples from stacks private preview, and so that will be my primary testing vehicle immediately, and then I'll add more automated tests gradually in future PRs.

I tried to make up for the fact that this is a huge change by breaking it into digestible commits, so I suggest that reviewers take it on a commit-by-commit basis rather than trying to review the entire diff at once.

Copy link
Contributor

@alisdair alisdair left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems good! None of my comments are of significance and shouldn't block merge. My focus in review was on a possible leak of the experimental behaviour, and I found none.

@@ -419,7 +419,7 @@ func (e *Expander) GetResourceInstanceRepetitionData(addr addrs.AbsResourceInsta
}

// GetModuleCallInstanceKeys determines the child instance keys for one specific
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment's method name doesn't match the method below.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I'll fix this one before merge just because it's not likely to cascade down onto other commits.

(For your other feedback, I'm going to keep it in mind for later work but leave it unchanged for now just to avoid churning this big pile of commits too much, since it's all experimental code we're explicitly intending to revisit anyway.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(For consistency I should really have named this method GetResourceInstanceKeys, but since changing that would impact the caller from another commit I'm just noting that in a comment here so I can refer back to it while planning later work.)

// This must be called zero or one times before any other use of the receiver.
// Changing this setting after a [Deferred] has already been used, or
// concurrently with any other method call, will cause inconsistent and
// undefined behavior.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given these constraints, did you consider preventing mutation of this value, and instead requiring it to be set at construction time? Perhaps through a second constructor?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did specifically consider adding an "opts" argument to NewDeferred, but it seemed weird to complicate that signature for what is currently just a single boolean flag.

I suppose a separate constructor would be a viable approach too, but honestly I decided not to sweat it too much for now. It is true though that this design does make it possible for someone to accidentally misuse this at runtime after initial construction, and so might be worth revisiting this in a later iteration if it seems likely that we're going to continue following this approach once we've got feedback from the experiment.

// It's invalid to call this method for an address that was already reported
// as deferred using [Deferred.ReportResourceInstanceDeferred], and so this
// method will panic in that case. Callers should always test whether a resource
// instance action should be deferred _before_ reporting that it has been.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was a little confused by this paragraph of commentary (and the corresponding one below before the panic). I think the contract we're enforcing here is that callers are required to call these two methods in a specific order for each resource instance, at most once each:

  • First call ShouldDeferResourceInstanceChanges
  • If it returns true, then you must call ReportResourceInstanceDeferred

The contract is to crash upon calling these in any other order, or any more often. That's a severe restriction, which may be entirely reasonable, but it was a bit hard for me to follow and I wonder if it's a possible maintenance hazard.

Assuming that I've understood what's happening, which is quite possibly not the case, I have two ideas:

  • Bundle these two methods into one, so that it's impossible to call them in the wrong order. Something like:

    func (d *Deferred) ShouldDeferResourceInstanceChanges(addr addrs.AbsResourceInstance) bool {
        ret := d.shouldDeferResourceInstanceChanges(addr)
        if ret {
            d.reportResourceInstanceDeferred(addr)
        }
        return ret
    }
  • Or, update the comments throughout to clarify the expectation that these methods are always to be called in this specific order, as a pair, and only once per addr, because we require that the graph walk only evaluates deferrals once per instance.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that this is a little confusing, and I think a big part of what makes it confusing is that many different resource instances are potentially in play, and this constraint only applies for one specific resource instance address. In practice then, the sequence of calls is:

  1. Some upstream graph node earlier called ReportResourceInstanceDeferred to report that it has deferred a change, regardless of why that happened.
  2. The current graph node then calls ShouldDeferResourceInstanceChanges with its own address, which causes this function to check whether that any of address's upstream dependencies were deferred. (In the case of this example, yes.)
  3. The current graph node then responds to that result by calling ReportResourceInstanceDeferred with its own resource instance address. (This would then be the step 1 of a downstream incarnation of this sequence, if any)

Bundling the check in with the report is an interesting idea. I think we will still need a separate exported ReportResourceInstanceDeferred because in the end this will not be the only reason for a resource instance graph node to need to report that it made a deferral decision: it would do so also if a provider responded to PlanResourceChange by saying "I cannot plan this yet", once we get to that functionality in a later PR. But I admittedly cannot think of any reason why one would call SourceDeferResourceInstanceChanges, get a true response, and not immediately call ReportResourceInstanceDeferred, so it does seem plausible to build it so that asking the question and performing the side-effect of reporting are an atomic action from the caller's perspective.

(I do typically try to avoid writing functions that are both accessors and mutators at the same time as far as that's possible to do so, but I do agree that in this case the separation seems more likely to encourage bugs than combining them would.)

// with "b:", even though the rest of the name is unknown.
"name": cty.UnknownVal(cty.String).Refine().
StringPrefixFull("b:").
NotNull().
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a neat idea for a test case! I had forgotten about the helpful overlap between refinements and deferred actions so seeing it called out here is great.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Honestly I didn't actually explicitly intend to capture this here... it just emerged naturally from how I constructed the test fixture! But I did quite like to see it once I saw it, so I kept it. 😀

This method must hold a read lock while it does its work, or else it will
race with calls to register the expansion mode for other module calls.
We allow experiments only in alpha builds, and so we propagate the flag
for whether that's allowed in from "package main". We previously had that
plumbed in only as far as the rpcapi startup.

This plumbs the flag all the way into package stackeval so that we can
in turn propagate it to Terraform's module config loader, which is
ultimately the one responsible for ensuring that language experiments can
be enabled only when the flag is set.

Therefore it will now be possible to opt in to language experiments in
modules that are used in stack components.
This is kinda awkward because this address type represents both resources
whose own instances are not expanded yet and resources belonging to whole
modules whose instance keys haven't been expanded yet, and those two cases
have different address types.

However, in return for this awkward API only for the rare case where we
need to isolate the module instance address, the rest of the system gets
to not worry very much about this distinction: it can share most code
between the two cases, since they both need similar evaluation treatment
anyway.
The internal API here was a little unclear, so this is just some
retroactive additional documentation resulting from me having "relearned"
how this API works after some time away from it, in the hope that future
maintainers will not need to repeat that exercise.
When we defer planning actions that have direct side-effects -- that is,
resource instance actions -- we need to remember that we did so because
we then need to force deferring planning for any other resources
downstream to preserve the required order of operations. That means we
need to keep track of what we've already deferred so we can check whether
a newly-visited resource instance must also be deferred.

We also need to track deferrals for the purpose of reporting them as part
of the plan presented to the end-user. This change _starts_ to implement
that, but for now the public API is limited only to generating a boolean
signal for whether there are any deferrals at all, because that's the
minimum possible reporting that allows for correct behavior (so the caller
can treat the resulting plan slight differently in the workflow).

In future we'll also want to be able to return placeholder values for
not-yet-expanded resource instances, but that's not fully built out here
yet since for an initial implementation we can cheat and use
cty.DynamicVal as a universal "we know nothing at all" placeholder.

This API is likely to grow and change in forthcoming commits as we
continue wiring these new behaviors into the Terraform modules runtime.
As with the various other context-tracking containers we use during graph
walks, we'll need a deferring.Deferred object shared across all graph
nodes' EvalContexts and with the expression evaluator so that we can
report deferrals and then make decisions based on deferrals previously
registered upstream.

As of this commit we're not yet doing anything with this new object, but
in future commits we'll use it to report what we've deferred and to
"artifically" defer anything that has a side-effect and depends on
something else that's already been deferred.
Usually language experiments have single-module scope, but this particular
one can potentially force modules other than the one that opt in to do
some behavior differently for correct results.

To minimize the risk of the new experiment code impacting those who are not
participating in the experiment, we'll want to branch into a new
DynamicExpand codepath only when at least one module is participating in
the experiment, but graph nodes alone have a more tightly-scoped view that
prevents them from answering that question directly, and so we'll
temporarily ask the ConfigTransformer to help in detecting that up top and
propagating the result into all of the resource nodes.

This commit doesn't yet make use of the new temporary field; that'll
follow in a subsequent commit that introduces the new
partial-expansion-aware alternative DynamicExpand codepath.
Previously we were incorrectly checking for the experiment being enabled
on the global EvalContext, but that doesn't make sense because experiments
are explicitly module-specific.

Now we'll ask the question separately for each module so we can use the
module-scoped EvalContext.
An earlier commit changed the EvalContext model to support evaluation in
both fully-expanded and partial-expanded modules, but the
provider-contributed functions feature landed semi-concurrently with it
and so inadvertently introduced a codepath that only worked in the
fully-expanded case.

This will now handle both situations, since all we really need is the
addrs.Module, which we can obtain in both modes.
For now this is just an up-top weird exception to minimize the impact on
the code that's runs when not participating in the unknown_instances
experiment.

In future it would be nice to rework this GetResource method to rely on
the ResourceInstanceKeys result even in the known case, since that will
simplify this function and remove a some duplicate code, but our priority
for now is minimizing the possibility that the code for this experiment
can impact users who are not participating in the experiment.
When at least one module in the configuration is opted in to the
unknown_instances language experiment, we'll now divert into an
experimental adaptation of nodeExpandPlannableResource.DynamicExpand which
knows how to deal with partial-expanded modules and resources.

Whenever a partial-expanded resource prefix is detected, we'll make a
special graph node for it which goes through some of the motions of
planning but does so only really for the side-effect of checking the
configuration for dynamic errors.

We do also save the result as a placeholder value for the object for
completeness, although it's not really used for much yet because the
partial-expansion tracker saves this primarily for reporting information
about the deferred changes as extra context in the plan, and we're not
implementing that yet.

This isn't yet quite right because we're not preserving enough information
from plan phase to apply phase to correctly handle references to the
resources that had deferred actions. However, since this is currently
just an experiment anyway, this is a useful interim milestone that shows
at least that things are functional enough to converge on the correct
result in the end.

We'll correct the evaluation misses (as documented in the
TestContext2Apply_deferredActionsResourceForEach test) in subsequent
commits, which will require tracking a little more information in the
saved plan so that we can re-populate all of the deferral metadata when
we start the apply phase.
When the modules runtime is being used inside the stacks runtime, it's
possible that a component could refer to another component that has
deferred changes in its plan. In that case, we do still want to plan the
downstream component (to give earlier feedback if there are obvious
problems with it) but we need to force all planned actions to be treated
as deferred so that we preserve the correct dependency ordering across
all objects described in a stack.

This commit only deals with the modules runtime handling that case. We'll
make the stacks runtime use it in a subsequent commit.
This is the bare minimum functionality to ensure that we defer all actions
in any component that depends on a component that already had deferred
actions.

We will also eventually need to propagate out a signal to the caller for
whether the stack plan as a whole is complete or incomplete, but we'll
save that for later commits, since the stack orchestration in Terraform
Cloud will do the right thing regardless, aside from the cosmetic concern
that it won't yet know to show a message to the user saying that there
are deferred changes.
@apparentlymart apparentlymart merged commit e6665c4 into main Feb 12, 2024
6 checks passed
@apparentlymart apparentlymart deleted the f-deferred-actions-resources branch February 12, 2024 20:13
Copy link

Reminder for the merging maintainer: if this is a user-visible change, please update the changelog on the appropriate release branch.

Copy link

I'm going to lock this pull request because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Mar 14, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
core enhancement stacks Relating to the Stacks features
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants