Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stablize the yamlencode function output #29436

Closed
apparentlymart opened this issue Aug 20, 2021 · 6 comments · Fixed by #31907
Closed

Stablize the yamlencode function output #29436

apparentlymart opened this issue Aug 20, 2021 · 6 comments · Fixed by #31907

Comments

@apparentlymart
Copy link
Contributor

apparentlymart commented Aug 20, 2021

Back when we originally introduced yamlencode, circa the v0.12 release, we included a note in its documentation that the exact layout of its output was subject to change in future releases and so not to use it in situations that would react poorly to such changes:

Warning: This function is currently experimental and its exact result format may change in future versions of Terraform, based on feedback. Do not use yamldecode to construct a value for any resource argument where changes to the result would be disruptive. To get a consistent string representation of a value use jsonencode instead; its results are also valid YAML because YAML is a JSON superset.

The motivation for this leeway arose from the fact that YAML is a relatively complex language (in comparison to, say, JSON) and so we expected that we'd need to make some pragmatic adjustments to allow for quirks in other YAML parser implementations, to try to maximize interop with other systems that consume YAML. In the intervening time we have made some small adjustments to both yamlencode and yamldecode, but nothing particularly substantial.

In practice, we're now treating yamlencode as largely frozen, because although the letter of our documented exception would allow it we don't actually have any wish to churn existing uses of yamlencode for any purely-stylistic reasons, and so the bar for changing it at this point is high. Specifically, I expect we'd only change it if we found that it either had a significant bug where it's producing a result that isn't valid per the YAML spec, or if there were a YAML parser elsewhere that is commonly used within software relevant to Terraform's use-cases (infrastructure tools) which can't reliably parse our yamldecode output even if it's valid per the YAML spec.

This issue represents further raising that bar to say that the current output of yamlencode for a given input is effectively frozen under the Terraform v1.0 Compatibility Promises unless the change is to address behavior that's inconsistent with the YAML specification and the benefit of fixing it outweighs the costs when considering the behavior of existing third-party YAML parsers. To make that more specific:

  • If we learn that yamlencode produces something that isn't valid per the YAML spec then our first order of business would be to understand why we hadn't previously been aware of that. If the outcome is that common YAML parsers in the wild are more tolerant than the YAML spec requires then we would likely conclude that the benefit of fixing the bug is outweighed by the cost of breaking compatibility. If the outcome is that the bug only occurs in a rarely-encountered edge case and that common YAML parser in the wild can't parse the buggy output then we would likely move to fix the bug, even at the risk of impacting modules that are already generating invalid YAML in that edge case.
  • If we learn that yamlencode produces something that is valid per the YAML spec but yet still can't be parsed correctly by some particular third-party YAML parser, we would now typically not change the yamlencode behavior and instead propose that the third party update their parser implementation to be more compliant with the specification.

Although the experiment warning in the documentation isn't specific about what situations it's talking about, I want to be clear here that one major motivating use-case for this stabilization is to de-risk the use of yamlencode in userdata/metadata features in common cloud providers when producing cloud-init's "Cloud config" language, which typically requires including an additional YAML comment and thus isn't a suitable situation for using JSON instead:

  user_data = <<-EOT
    #cloud-config
    ${yamlencode({
      packages = [
        "mysql-client",
      ]
    })}
  EOT

user_data in EC2, and its equivalent in most/all other cloud VM systems, is an argument which can be changed only by replacing the corresponding virtual machine with a fresh one, because cloud-init processes it only on system boot. However, from the cloud provider level of abstraction it's just an opaque string or sequence of bytes, and thus there's no scope for any classification of two inputs as functionally-equivalent. It's therefore a particularly sensitive situation when it comes to churn, and so a configuration like the above would be against the current recommendation in the docs.


However, before we do this we ought to first try to conclude some due diligence related to the existing issue #23322.

That was originally opened to discuss making the yamlencode output configurable, but as I already noted over there we don't intend to do that because YAML encoding is a relatively-minor programmatic integration point from Terraform's perspective, rather than a big part of it's scope, and so the cost of implementing and maintaining a highly-configurable YAML encoder (particularly under 1.0 compatibility promises as discussed here) is too high.

However, there was some other commentary in that issue about how some yamlencode output is causing non-specific problems with specific external software. I asked the participants in that issue to share some more concrete examples, but at the time of writing nobody has done so.

Given that, I think our best path forward would be to spend some time with the specific software mentioned (Ansible, and Kubernetes config maps) to see if we can reproduce any such problems ourselves. If we can then we can discuss here any tactical changes we might want to make to yamlencode's current behavior (while preserving our goal of minimizing unjustified churn).

If we can't reproduce any such problems, and if nobody in the other issue comes forward with a specific example in the meantime, then I suggest we move forward with stabilization of the current implementation and then respond to any late-arriving feedback per the principles I described above, and within the principles of our 1.0 Compatibility Promises.

@mhulscher
Copy link

Kubernetes config maps

We are seeing issues where applying a k8s configmap once, results in terraform wanting to make a change to apply the exact same configmap in a second run of terraform apply. The keys of the original and the newly planned configmap are in a different order. Effectively this makes our terraform plans non-idempotent. In our case, this is coming from here: https://github.com/terraform-aws-modules/terraform-aws-eks/blob/master/aws_auth.tf#L81

@apparentlymart
Copy link
Contributor Author

Hi @mhulscher,

From the behavior you described it sounds like something outside of Terraform is rewriting your data values into a different shape, which the Kubernetes provider is therefore detecting as drift.

I must admit to not being super familiar with config maps, but my understanding is that config maps are mappings from string keys to arbitrary blobs/strings, and so I would expect the server to just store what you submitted verbatim, rather than parsing it as YAML and then reserializing it as YAML (with different formatting details) on the next read. Therefore I'm not sure I understand what would be causing the inconsistency you're describing.

The concern of classifying detected changes as normalization rather than drift is the responsibility of providers rather than Terraform itself, so if these arguments are defined as being YAML specifically, rather than just raw strings to be stored and returned verbatim, then that sounds like a missing normalization-detection rule in the Kubernetes provider. I think something else might be going on in your particular case though, so I'd like to learn more if you have more to share. I'm not sure exactly what to ask due to my only-surface-level Kubernetes knowledge, but if you know of anything else in your system that might be rewriting these YAML blobs, or if you know that my assumption above about them being arbitrary blobs is incorrect, please let me know! And of course, if you know of anything else that might be relevant, even though I don't know to ask about it, please let me know that too!

@mhulscher
Copy link

Hi @apparentlymart ,

Thank you for taking the time to write out such a detailed response. I am still investigating the issue that we are facing, but I don't believe the problem to be related to the k8s provider or the way the config_map resource is implemented.

Our configmap has 1 key-value pair. The value is the string result of an array that is yamlencode()'d. The order of the items in the array seems to change. This could be because of yamlencode() itself. Or, something I consider to be more likely, some outside actor is changing the value of the config_map and terraform rightfully detects a diff.

Unless you consider it possible that yamlencode() (or terraform itself) changes the order of items in an array, I suggest to ignore my issue until and unless I have found more compelling evidence.

The diff can be found below:

   ~ resource "kubernetes_config_map" "aws_auth" {
       ~data        = {
           ~"mapRoles"    = <<-EOT
              - - groups:
              -   - system:bootstrappers
              -   - system:nodes
              -   rolearn: arn:aws:iam::000000000000:role/ci-eks-qsply82021101914485380780000000d
              -   username: system:node:{{EC2PrivateDNSName}}
              - - groups:
              -   - system:masters
              -   rolearn: arn:aws:iam::000000000000:role/administrator
              -   username: admininstrator
              - - groups:
              -   - system:masters
              -   rolearn: arn:aws:iam::000000000000:role/developer
              -   username: developer
              - - groups:
              -   - system:bootstrappers
              -   - system:nodes
              -   - system:node-proxier
              -   rolearn: arn:aws:iam::000000000000:role/ci-eks-qsply8-fargate2021101914485380780000000c
              -   username: system:node:{{SessionName}}

              + - "groups":
              +   - "system:bootstrappers"
              +   - "system:nodes"
              +   "rolearn": "arn:aws:iam::000000000000:role/ci-eks-qsply82021101914485380780000000d"
              +   "username": "system:node:{{EC2PrivateDNSName}}"
              + - "groups":
              +   - "system:bootstrappers"
              +   - "system:nodes"
              +   - "system:node-proxier"
              +   "rolearn": "arn:aws:iam::000000000000:role/ci-eks-qsply8-fargate2021101914485380780000000c"
              +   "username": "system:node:{{SessionName}}"
              + - "groups":
              +   - "system:masters"
              +   "rolearn": "arn:aws:iam::000000000000:role/administrator"
              +   "username": "admininstrator"
              + - "groups":
              +   - "system:masters"
              +   "rolearn": "arn:aws:iam::000000000000:role/developer"
              +   "username": "developer"
             EOT
             # (2 unchanged elements hidden)
         }
         id          = "kube-system/aws-auth"
         # (1 unchanged attribute hidden)

         # (1 unchanged block hidden)
     }

@apparentlymart
Copy link
Contributor Author

Hi @mhulscher,

If I'm reading that diff correctly, it seems like the value you are encoding is a list of objects (or, equivalently for yamlencode's work, a tuple with various object type elements). If that's true then yamlencode would generate exactly the order given, because order is meaningful for both of Terraform's sequence types. Unless the argument to yamlencode itself changed, it should produce the same ordering each time.

Another possibility is that the input here is a set of objects. Terraform sets don't have any inherent ordering and so Terraform converts them to YAML sequences using lexical ordering for sets of strings or an unspecified-but-consistent ordering for other types. Again that should mean that the same input to yamlencode would produce the same result, but unlike sequences it does mean that changing any part of one of the elements might cause Terraform to select a different order, because the order is derived (in an unspecified way) from the element contents.

In both cases then I don't really see how the ordering of a yamldecode result here could change between runs while you're still providing it the same data structure to encode.

If you are using Terraform v1.0 or later then we might be able to get a little extra information to debug this: if the problem is something else modifying this data outside of Terraform then Terraform v1.0 should detect that during planning and announce "Note: Objects changed outside of Terraform" and then describe essentially the opposite of the diff you're seeing, and then Terraform will propose to undo that change to restore what you wrote in your configuration just as you're seeing today. Giving more information to help understand what's causing non-converging configurations like this is a key use-case of that new feature.

With that said though, with the information we have right now I'm inclined to believe that something outside of the Terraform language is modifying your YAML documents here, and so I'm going to move forward with that assumption for the moment but if you are able to demonstrate otherwise then I'd definitely like to hear about it! Thanks again for sharing this strange behavior with me.

@nfi-opsguru
Copy link

As I mentioned in #23322 (comment), I am hard-pressed to think of an example resource API that accepts YAML that actually extends beyond JSON-compatible, and would not therefore be better-served by using actual JSON, besides the aforementioned cloud-init.

The aws_auth map example above could be JSON.

So I'd say yamlencode should just have a big warning that encourages users to use the (presumably stable) jsonencode instead, where possible.

@github-actions
Copy link
Contributor

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 31, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants