Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prototype: Opt-in additional hints during validation #29661

Closed
wants to merge 8 commits into from

Conversation

apparentlymart
Copy link
Contributor

@apparentlymart apparentlymart commented Sep 28, 2021

This is an initial prototype of a new "hint mode", which extends the Validate operation with some additional checks which can generate warnings that might not always be applicable, and thus would not necessarily be appropriate to use as a blocker for merging a PR, or similar.

The idea here would be to raise situations that aren't technically invalid but that are either non-idiomatic or a common cause of unexpected/undesirable behavior. The main use-case I have in mind for this is that if Terraform is doing something surprising and you don't understand why then you might ask it for hints on your configuration to see if it generates any additional notes that might either give something new to investigate or allow eliminating certain possible causes from consideration (by removing redundant declarations in order to focus on the significant parts).

As a starting point, this includes a set of checks relating to the depends_on meta-argument within resource and data blocks, detecting situations where the declarations in there are either redundant with other references or over-specified. I started with these warnings because redundant declarations seem to have been a common source of distraction for people attempting to debug, where they waste time trying to re-state dependencies that Terraform can already see and can thus find themselves in an XY problem, thinking that dependencies are the problem when there's actually some other cause.

Some other possible ideas for lint checks we could potentially include are:

  • Conditional expressions where the predicate provably always returns true or always false, such as comparing two values of different types with either == or !=. (Could resolve compare map doesn't work when we use condition #27643, by automatically explaining the situation rather than leaving it ambiguous.)
  • Using literals of a particular type in situations that would always lead to an implicit conversion to some other type. For example, "1" + 2 could warn that the "1" ought to be written as 1. This is not problematic in itself, but this is a thing in the category of "eliminate odd stuff that is likely to distract from finding a real cause". (I would hesitate to extend this to non-literal operands because relying on implicit conversions for resource attributes and similar is pretty common and so would likely be very noisy without really explaining anything useful.)
  • Using the impure functions (currently timestamp, uuid, and bcrypt) in resource arguments, which will typically cause a configuration to never converge because there will always be a planned update/replace for that resource. (Could we detect that indrectly too, so that e.g. referring to a local value which is itself non-converging would also generate the warning?)
  • A heuristic to check if it looks like a heredoc template is trying to generate JSON? We might not be able to detect this reliably enough, but if we can it would be nice to suggest using jsonencode as a more robust answer, since we commonly see folks trying to generate JSON using string templates and struggling to get all of the punctuation right. The goal of catching what appears to be slightly-invalid JSON makes this heuristic a bit trickier than it would otherwise be.
  • Recommend explicit type constraints for all input variables (even if they are just any), because that tends to help understand why something weird is going on when a caller writes an expression of the wrong type.
  • Using lookup in its two-argument form, which has long been superseded by the explicit index syntax.
  • Using lookup in its three-argument form, which (subjectively) tends to be more readable as a try call instead, because the argument then appears just as normal traversal syntax and is thus easier to read.
  • Using element with a list of a known length and a constant index that we can prove is in bounds for the list. Better to use index syntax instead, because such a call can't possibly be relying on the wrap-around effect of element.
  • Perhaps we can detect legacy usage with splat expressions combined with index expressions, as was common in Terraform v0.11 and earlier, like aws_instance.example.*.id[0], and propose writing aws_instance.example[0].id instead to make the expression easier to understand when debugging.
  • We could report when a local value is declared but not referred to anywhere in the same module. A local value doesn't contribute to the externally-visible behavior of a module, so there's no point in declaring one that isn't used somewhere else in the module.
  • Warn if the file path argument to functions like file, templatefile, filebase64, etc refers to a file that doesn't exist and seems similar to a filename in a location that seems like it's intending to generate a file during the apply operation, such as in a local-exec provisioner for a resource that this object depends on? This one might be too fuzzy to detect reliably, but if we could catch it then we could give more explicit direction that these functions aren't suitable for that purpose.
  • If a quoted literal string contains something which appears to be a valid reference to elsewhere in the module (it's both syntactically correct and resolves to something in the current module), hint that the author probably wanted to write it not in quotes. For example, "var.example" would, if there's also a variable "example" block in the module, suggest writing plain var.example instead.
  • Applying the splat operator [*] to a resource which has for_each set. Although it's technically okay to do that, because the splat operator can apply to values of any type, for for_each resources in particular it's much more likely to be an incorrect attempt to mimic selection of multiple attributes on a count resource. This lint rule could suggest wrapping the resource expression itself in values(...) in order to get what was probably the intended effect, like values(aws_instance.example)[*].private_ip.
  • ...

I'm focusing on generic language-level lints here, rather than provider-specific lints. Provider-specific lints could be interesting too, but we'd need to figure out how to represent that in the provider wire protocol and so wouldn't want to block on that.


I've not wired this new mode up to the UI in this PR, because there are a few different ways we could expose these with some different tradeoffs, while the main point of this PR was to prototype how lint checking might integrate into Terraform Core, not how we might expose it in the UI. I'm considering two main possibilities for exposing this in the UI, though we might consider others too:

  • terraform validate -lint
  • terraform lint

In both of these cases it seems like we might want to add a new diagnostic severity to represent lint warnings, so that readers of the output can differentiate between the warnings that they probably should pay close attention to vs. the lint warnings that might not necessarily be relevant (but hopefully are more often than not). For the moment I've just classified these as normal warnings, because I wanted to keep these changes mostly confined to the main Terraform Core package.

It might also be nice to include "fix it hints" which can be machine-understood and automatically applied to the source code, e.g. in a text editor. That won't apply to all lints, but can apply to some. I wouldn't expect to do that in a first pass -- it would require extending our diagnostics model in some way -- but would be nice to do in the long run.


The three resource-dependency-related example lint warnings that this PR introduces to start with:

╷
│ Warning: Redundant explicit dependency
│ 
│   on lint.tf line 19, in resource "null_resource" "depender":
│   19:     null_resource.a,
│ 
│ There is already an implied dependency for null_resource.a at
│ lint.tf:15,12, so this declaration is redundant.
╵
╷
│ Warning: Redundant explicit dependency
│ 
│   on lint.tf line 21, in resource "null_resource" "depender":
│   21:     null_resource.b,
│ 
│ There is already an explicit dependency for null_resource.b at
│ lint.tf:20,5, so this declaration is redundant.
╵
╷
│ Warning: Over-specified explicit dependency
│ 
│   on lint.tf line 22, in resource "null_resource" "depender":
│   22:     null_resource.c[0],
│ 
│ Terraform references are between resources as a whole, and don't
│ consider individual resource instances. The [0] instance key in
│ this address has no effect.
╵

@radeksimko
Copy link
Member

One aspect of the existing validate command is that it requires providers to be installed (and as a result I think also state access, but I might be wrong about that one), which complicates editor integration for two reasons:

I understand that this is necessary if we are to keep some validation logic on the provider side, but it is problematic enough that we don't run validate on save in the official VS Code extension, unlike fmt.

If you don't anticipate the linting to require providers I would vote for a separate lint command as that will make integration much easier.

@apparentlymart
Copy link
Contributor Author

Unfortunately a lot of things we'd want to do with linting will require access to the provider schemas anyway -- we basically can't do anything with the bodies of resource, data, and provider blocks without schema -- so I don't think it'll be easy to get away from that.

I think if we do try to solve for linting without that overhead then it'll probably also lead to us doing the work to enable a partial validate that ignores the stuff that we call into providers to get, though I'm unsure whether such a result would actually be that useful since most of the interesting stuff in a Terraform module is content Terraform Core passes largely verbatim onto providers anyway.

It sounds like we ought to have a separate effort somewhere to optimize the integration point between the language server and Terraform CLI for validation purposes. I expect we can optimize that path to a fair extent if we prioritize profiling it and are then willing to change some fundamentals of how that all works, such as letting Terraform CLI stay running so it can cache schemas in memory, and perhaps changing the provider protocol to reduce the number of gRPC round-trips it takes to validate, assuming that profiling shows those two as the main overhead. (I've not measured it, so that's just a guess based on what you reported here.) Doing that should hopefully benefit both validation and linting, since they are both built on the same basic structures internally and vary only in some extra CPU-bound logic for the lint rules.

@radeksimko
Copy link
Member

That's fair - I just wanted to raise this early as I imagine many people will want to run this from an editor and will likely run into the same set of problems I did with the existing validate command.

I've not measured it, so that's just a guess based on what you reported here.

I admit I have mostly anecdotal evidence + my own experience at the moment, but I do plan to gather some more data in this area, so I should have something more concrete for you soon.


Back to the original proposal:

I'm sure there's an issue tracking it somewhere, but I remember we had some conversations about the possibility of providers doing some "heavier" API-based validation, such as "is this name available for an S3 bucket", or "is this EC2 instance type available for me in this AWS account"? I imagine if we were to implement it somehow, such validation would also need to be separate-able from the other - more static and faster - validation.

I'm in no way suggesting you to tackle this problem here in this PR, but just sharing some thoughts on possible direction of future expansion of validate.

@apparentlymart
Copy link
Contributor Author

Currently our main distinction between validate and plan is that validate doesn't "configure" the providers (and thus doesn't need endpoints/credentials configured) while plan does. I therefore tend to think of validate's scope in those terms: it checks the subset of things that we can check based on configuration alone.

To me, the plan phase is already the correct place for checks like the ones you mentioned -- the remaining checks that get excluded from terraform validate's scope. There's no strong architectural reason why aws_s3_bucket couldn't check during planning if the indicated bucket name already exists and return an error if so: it has all of the information it needs to do that. It would of course run into the usual limitation of checks like that: if something changes outside of Terraform between plan and apply then it could still fail, and if the bucket name is derived from an unknown value then we wouldn't be able to check it until apply, but those are fundamental consequences of Terraform's architecture which I don't think will be going anywhere, and we'd still get the benefit in the happy path where the bucket name is static and there's no drift outside of Terraform. (Something like #22094 could potentially help make the "declaring a conflicting object" situation a bit more explicit in the model, but I don't think that'd change the limitations I described here.)

I think so far this has mainly come down to a combination of opportunity cost (we get more impact from working on the constant stream of new functionality the cloud vendors are releasing than adding and maintaining dynamic validation checks that most folks will not hit) and SDK design quirks (the way the current SDK models planning is pretty far removed from how Terraform itself thinks about planning). The second of these at least we have an opportunity to address in the new framework. The opportunity cost is harder to overcome, but hopefully at least that cost will be lower once we have an SDK designed with that functionality in mind, rather than retrofitted late in its life.

With all of that said, I do appreciate you raising these ideas! I do think they are worth exploring, I'm just seeing them as something separate from the idea of linting, which I think of as a partial solution to the common problem of folks writing things that are technically valid but don't mean what they expected, often because of pragmatic compromises in our language design inherited from earlier Terraform versions.

This follows a similar principle as Context.Plan and Context.Eval.
Although we don't have any special extra options to add right now, adding
a new argument to the Validate method causes disruptive changes to callers
in various other files, whereas adding a new optional member to
ValidateOpts will typically touch fewer files and thus allow more of our
code to stay unmodified (and thus more backport-friendly) across future
updates.
This is another small step in our ongoing effort to refine the exported
API of the "terraform" package to be only what external callers need.

The validate graph is an implementation detail of Context.Validate and
need not be exposed outside of this package.
The graph builder doesn't actually use this information yet, but this
will allow us to make the graph builder react to future options without
causing any further changes in the validateGraphBuilder callers.
Here we establish a new mode for Context.Validate which enables some extra
checks which can produce warnings that don't necessarily need to be fixed
in order for a Terraform module to be accepted, but that which are
indicative of common mistakes that might lead to unintended behavior or
of non-idiomatic style.

The idea of this mode is that users could potentially enable it if they
are seeing some unexpected behavior, to potentially get some additional
hints as to what might be causing it. Since linting might sometimes
generate false negatives, this mode probably won't be suitable for use
as a pre-merge automated check.

For this first round, the lint checks are focused on the depends_on
arguments within "resource" and "data" blocks, catching situations where
entries in depends_on are redundant with other declarations or refer to
individual resource instances when Terraform only considers whole
resources.

This new mode is currently not exposed in the UI anywhere, so this is all
just dead code for the moment. Perhaps in future it will be exposed either
as a new mode for "terraform validate" or as a new command
"terraform lint", but we'll save that decision for later work.
This is here to support a forthcoming "hint mode" for validation, which
opts in to generating additional hints that will hopefully help with
debugging and improving module understandability but do not imply that a
module is necessarily "incorrect" or need to be fixed in all cases.

Not all of our codepaths will be able to handle this severity well,
because it's arriving on the scene late and so some existing codepaths
won't be aware of it, but that's okay because it's constrained only to
the validation codepath in particular, and only when hint mode is enabled,
and so we only really need to deal with it in that particular situation.
The noun "hint" seems to better describe what this mode actually does,
which is to make suggestions that hopefully improve readability overall
but may not be appropriate in all situations.

This retains the previous suggestions about depends_on and doesn't
introduce any others, but they are now presented as hint diagnostics
instead of warning diagnostics.
With a particular compile-time flag set (not by default) this adds an
extra -hints option to "terraform validate" which makes it run Terraform
Core's validate operation in hints mode.

For the moment this is just here to let us easily experiment with hints
in development environments. It's not yet ready to graduate to being
included in real releases because we still need to get a sense of what the
"rules" are for what makes sense as a hint, and we might also decide to
use a different CLI argument style or other interface to access the hints
mode.
@apparentlymart
Copy link
Contributor Author

I've done a bit more fiddling with this today, since I had some fragmented spare time between meetings/discussions. I ended up renaming "lint" to "hints" because that seemed more consistent with the goals I outlined here, of just offering some suggestions for the author to evaluate rather than intending this as a checklist of things that must be fixed in order for a module to be considered "good".

That then led to a new presentation of the results where they get annotated explicitly as being "Hint:" rather than "Warning:", and thus they are more clearly distinct from any error and warning messages that we might produce alongside.

╷
│ Hint: Redundant explicit dependency
│ 
│   on lint.tf line 19, in resource "null_resource" "depender":
│   19:     null_resource.a,
│ 
│ There is already an implied dependency for null_resource.a at lint.tf:15,12, so this
│ declaration is redundant.
╵
╷
│ Hint: Redundant explicit dependency
│ 
│   on lint.tf line 21, in resource "null_resource" "depender":
│   21:     null_resource.b,
│ 
│ There is already an explicit dependency for null_resource.b at lint.tf:20,5, so this
│ declaration is redundant.
╵
╷
│ Hint: Over-specified explicit dependency
│ 
│   on lint.tf line 22, in resource "null_resource" "depender":
│   22:     null_resource.c[0],
│ 
│ Terraform references are between resources as a whole, and don't consider individual
│ resource instances. The [0] instance key in this address has no effect.
╵

@apparentlymart apparentlymart changed the title Prototype: Configuration linting mode Prototype: Opt-in additional hints during validation Oct 29, 2021
@hashicorp-cla
Copy link

hashicorp-cla commented Mar 12, 2022

CLA assistant check
All committers have signed the CLA.

@apparentlymart
Copy link
Contributor Author

This PR has now become so conflicted as to be useless. 😖

Copy link
Contributor

I'm going to lock this pull request because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 23, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

compare map doesn't work when we use condition
3 participants