Global IDs for representing relationships between resource objects (object containment, name collision detection, etc) #22094

Open
apparentlymart opened this issue Jul 16, 2019 · 5 comments
Labels
core enhancement providers/protocol Potentially affecting the Providers Protocol and SDKs thinking

Comments

@apparentlymart
Contributor

apparentlymart commented Jul 16, 2019

This is a description of a problem space and some initial sketches for how it might be solved. It's not yet a fully actionable proposal, since we need to gather more examples and do prototyping with them to figure out exactly what the problem cases are and thus how best to solve them.

For now, this issue is here mainly so it can be mentioned in other issues (e.g. in provider repositories) that describe use-cases where this mechanism might be beneficial.

As a consequence, anything here is subject to change in subsequent discussion.


Terraform currently considers each remote object to be entirely distinct from others. That includes (but is not limited to) the following incorrect assumptions:

  • Deleting one object does not implicitly delete or modify any other objects, and can be done independently of the existence of other objects.
  • Updating one object in-place never affects the state of another object.
  • Creating a new object can never conflict with an existing object.

The above assumptions are clearly false for many real-world vendor APIs, though in practice we've been able to work around most of them in one way or another. In some cases that requires special care on the part of the user though, which can be problematic if violating the assumption has a negative effect such as system downtime or Terraform becoming "stuck" and unable to make progress.

Based on real-world experience with existing APIs, it seems like Terraform could benefit from explicit modelling of relationships between resource objects that are richer than what can be inferred only from the user-provided dependency graph. In principle providers could use their knowledge about the remote system to give Terraform more information about these relationships, and then Terraform could use that information to prevent certain obviously-incorrect actions and to generate warnings about situations that are less certain.

The remainder of this issue is some notes about a possible way to achieve that, and some initial ideas about how it might be used. This initial sketch is mainly serving as a request for example use-cases to inform a next iteration of it, and not something that is currently ready to implement.

Global Object IDs

Prior to Terraform 0.12, Terraform required all resource instance objects to have an associated id attribute, but imposed no requirement on how providers would use it other than that it must not be an empty string. In practice, that requirement didn't really serve any purpose from Terraform Core's standpoint, and so from Terraform 0.12 onwards there is no such requirement at the Core level, though as I write this the SDK does still impose that requirement for 0.11-compatibility reasons.

However, having a more strongly-defined sense of an ID for an object -- one that is global in scope and allows Terraform Core to make certain assumptions about it -- could be a useful building block for modelling relationships between objects.

Some of the remote systems we interact with already have a sense of ids that are global to their entire system. For example, AWS has the idea of an "ARN" which can uniquely identify a particular object across the whole of AWS, including not only the service-local unique identifier but also the overall AWS account the object belongs to and (where appropriate) the service region it was created in.

We can potentially generalize this idea by allowing each Terraform provider to define its own unique id scheme. The provider itself would control that scheme but Terraform Core would make certain assumptions about it that the provider must ensure are valid:

  • Each remote object has an id completely distinct from all others.
  • The unique id includes enough information to be unique across any possible provider configuration. (For example, for services that use regional namespaces selectable in the provider configuration, the id must include the region that was active when each object was created.)
  • The ID for a particular object is stable over time. That is, upgrading to a new version of the provider won't cause references to an ID to become dangling, unless the target object legitimately no longer exists.

Because the requirements for each remote system are different, Terraform Core would impose only a simple syntax requirement on these ids: they must be strings and they must start with the provider type name followed by a colon. After the colon can be any valid sequence of Unicode printable characters. If the remote system already has a suitable global ID syntax, it may be best to just use that directly in case these ids are seen by users (though ideally they should not be).

For example, any id generated by the azurerm provider must begin with azurerm: but can then be followed by any printable Unicode characters needed to fully describe the identity of an object.
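As a rough illustration of how little Core would need to check, here is a minimal Python sketch of that syntax rule. The function name parse_global_id is hypothetical, str.isprintable() is only a stand-in for "printable Unicode characters", and rejecting an empty remainder after the colon is an assumption that the text above does not state.

```python
def parse_global_id(global_id: str, provider_type: str) -> str:
    """Return the provider-defined part of a global ID, or raise ValueError.

    Hypothetical helper: only the "<provider type>:" prefix rule and the
    printable-characters rule come from the proposal text.
    """
    prefix = provider_type + ":"
    if not global_id.startswith(prefix):
        raise ValueError(f"global ID must start with {prefix!r}")
    rest = global_id[len(prefix):]
    if not rest:
        # Assumption: an ID with nothing after the colon identifies nothing.
        raise ValueError("global ID has nothing after the provider prefix")
    if not rest.isprintable():
        # str.isprintable() approximates "Unicode printable characters".
        raise ValueError("global ID contains non-printable characters")
    return rest


# The remainder is opaque to Core; only the provider interprets it.
opaque = parse_global_id("azurerm:/subscriptions/example", "azurerm")
```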

In practice I suspect we might elect to allow each remote object to have potentially multiple global object IDs, as a way to handle changes in the format over time (can report both the old and new forms at once) and to deal with any other unavoidable ambiguity that might arise. In that case, though, each distinct ID string should still only be associated with one object.

Not all objects need to have global IDs. If we were to introduce a feature like this then necessarily it would start with most existing providers not supporting it universally. Moreover, even after it's been around for a while the global ID mechanism would serve no purpose for certain object types. In particular, there's no reasonable global persistent ID for many of the transient in-state-only object types that are offered by providers like null, tls, etc.

Potential Uses for Global IDs

The following sections describe some situations we've already encountered that Global IDs might be useful for. There are likely other ways these problems could be addressed too, so this section is mainly here just as a set of examples to help us identify other problems that we might be able to address through the introduction of Global IDs.

Detecting Object Collisions

A straightforward use of Global IDs is to automatically detect and flag when two objects in the same state have the same Global ID. That suggests a user error (defining the same object twice) and ought to be resolved somehow before proceeding, or Terraform's behavior would otherwise be unpredictable.

Another variant of this is situations where the provider has enough information available at plan time to predict one or more specific Global IDs for an object that hasn't been created yet. That would then potentially allow Terraform to detect collisions during planning and prevent them from occurring in the first place.

Terraform will not always have sufficient information to detect this at plan time (if the Global ID is derived from values that won't be known until apply time), but in that case it would degenerate to the first case above of detecting the conflict during a subsequent plan and requiring some sort of resolution. (Exactly what resolution would be possible/appropriate is an open question; perhaps Terraform would require removing all but one of the conflicting resource blocks but then skip creating Delete actions for those in the plan, assuming the user is intending the still-remaining resource block to be the "owner" of that previously-shared object.)
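To make the two cases concrete, the following Python sketch groups resource addresses by the Global IDs they claim and reports any ID claimed more than once. Everything here is hypothetical (the function name, the input shape, and the use of None to mean "not known until apply time"); it only illustrates the check, not how Terraform Core would store or compare these values.

```python
from collections import defaultdict


def find_collisions(objects):
    """Map each Global ID claimed by more than one address to those addresses.

    objects: resource address -> iterable of Global IDs, where None stands
    for an ID that cannot be predicted until apply time (an assumption of
    this sketch).
    """
    claimed = defaultdict(set)
    for address, global_ids in objects.items():
        for global_id in global_ids:
            if global_id is None:
                continue  # unknown for now; re-checked on a later plan
            claimed[global_id].add(address)
    return {
        global_id: sorted(addresses)
        for global_id, addresses in claimed.items()
        if len(addresses) > 1
    }


collisions = find_collisions({
    "example_bucket.a": ["example:bucket/logs"],
    "example_bucket.b": ["example:bucket/logs"],  # same remote object declared twice
    "example_bucket.c": [None],                   # ID known only after apply
})
```

Here collisions reports example:bucket/logs as shared by example_bucket.a and example_bucket.b, while example_bucket.c is skipped until its ID becomes known.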

"Containment" relationship

Many remote systems have a sense of one domain object being "contained within" another, which for the sake of this section we'll define as where the container object must outlive all of the contained objects. There are two main variants of this we've seen across many systems:

  • As long as contained objects exist, the container cannot be deleted.
  • Deleting the container implicitly deletes all of the contained objects.

Both of these situations violate Terraform's current assumptions. In the first case this can result in apply-time failures or timeouts, while the second case is more problematic in that it will tend to cause Terraform state to go out of sync with reality because Terraform cannot see that the contained objects have been deleted.

To address this, we could potentially augment the resource instance object state model so that each object can record:

  • A set of Global IDs that the object is contained within.
  • A set of Global IDs that the object contains.

While storing both directions of this relationship is redundant in the case where all objects are in the same configuration, it is possible (and, perhaps, common) for the objects to be split across two separate configurations by making use of data sources, and so the bidirectional tracking gives Terraform a fuller picture of the relationships in such cases.

The intent of these two sets is that they would be set by the provider during any changes, but also would be refreshed by the provider during a refresh operation, probably by calling an API to query the relationship.

As a specific example, consider that aws_subnet resources are always contained within aws_vpc resources: it's not possible to delete a VPC as long as at least one subnet exists. In this case it is a many-to-one relationship represented in the API as a foreign key on the subnet side, so the aws_subnet implementation can trivially determine the Global ID of the single VPC the subnet belongs to without any further queries (it's a transform of the vpc_id attribute), but the aws_vpc implementation would need to additionally call DescribeSubnets during refresh to properly populate the set of subnets that are contained within it, even if they were created in a different configuration.
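A possible shape for that state, sketched in Python. The ObjectState class, the helper names, and the "aws:" plus ARN ID format are all assumptions made for illustration; the list of dictionaries stands in for whatever a DescribeSubnets call would return, and no real API is called.

```python
from dataclasses import dataclass, field


@dataclass
class ObjectState:
    """Hypothetical per-object state carrying both directions of containment."""
    address: str
    global_ids: set = field(default_factory=set)
    contained_in: set = field(default_factory=set)
    contains: set = field(default_factory=set)


def ec2_global_id(region, account, kind, local_id):
    # Assumed format: provider type name, a colon, then an EC2-style ARN.
    return f"aws:arn:aws:ec2:{region}:{account}:{kind}/{local_id}"


def refresh_subnet(address, region, account, subnet):
    # The container ID is a pure transform of the subnet's own VpcId.
    return ObjectState(
        address=address,
        global_ids={ec2_global_id(region, account, "subnet", subnet["SubnetId"])},
        contained_in={ec2_global_id(region, account, "vpc", subnet["VpcId"])},
    )


def refresh_vpc(address, region, account, vpc_id, described_subnets):
    # described_subnets stands in for the result of an extra lookup, so that
    # subnets created in other configurations are included too.
    return ObjectState(
        address=address,
        global_ids={ec2_global_id(region, account, "vpc", vpc_id)},
        contains={
            ec2_global_id(region, account, "subnet", s["SubnetId"])
            for s in described_subnets
            if s["VpcId"] == vpc_id
        },
    )
```

The asymmetry is the point of the sketch: the subnet side needs no extra query, while the VPC side depends on a lookup at refresh time.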

Terraform Core can use this information to produce a more accurate plan whenever a container is planned for destruction. Terraform Core might see that a Delete action is planned for an aws_vpc and thus also automatically plan Delete actions for the associated subnets in the same configuration. If there are any contained subnets that are not known in the current workspace state, Terraform could return an error saying that these contained objects must be destroyed first, and thus leave the human operator to decide which other Terraform configuration must be changed to achieve that.
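That planning rule might look something like the following Python sketch. The state shape, the exception type, and the function name are hypothetical, and the sketch handles only one level of containment; it is meant to show the two outcomes described above, not a real planner.

```python
class UnmanagedContainedObjects(Exception):
    """Raised when contained objects are not tracked in this workspace state."""

    def __init__(self, missing):
        super().__init__(
            "contained objects must be destroyed first: " + ", ".join(missing)
        )
        self.missing = missing


def plan_container_delete(state, container_address):
    """Return addresses to delete, contained objects first, container last.

    state: address -> {"global_ids": set, "contains": set} (assumed shape).
    """
    address_by_id = {
        global_id: address
        for address, obj in state.items()
        for global_id in obj["global_ids"]
    }
    contained = sorted(state[container_address]["contains"])
    missing = [gid for gid in contained if gid not in address_by_id]
    if missing:
        # Some contained objects live in another configuration (or none).
        raise UnmanagedContainedObjects(missing)
    deletes = []
    for global_id in contained:
        address = address_by_id[global_id]
        if address not in deletes:  # an object may report several IDs
            deletes.append(address)
    deletes.append(container_address)
    return deletes
```

With a VPC whose two subnets are both in state, this returns the subnets followed by the VPC; if one subnet is absent from state, it raises instead of planning a delete that the remote API would reject.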

The containment relationship also allows for improving Terraform's behavior in the more complex case of DeleteThenCreate or CreateThenDelete actions: this additional information might allow Terraform to understand both that it needs to replace all of the subnets when a containing VPC is replaced and that these objects are related in a way that requires a specific ordering of the destroy and create actions to produce a correct result.
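One plausible reading of "a specific ordering" is sketched below in Python: contained objects are destroyed before their container and created after it, for either replacement mode. The function and its tuple-based step format are hypothetical, and a real implementation would need to handle deeper nesting and partial failures.

```python
def replacement_steps(container, contained, create_before_destroy=False):
    """Order the actions for replacing a container and everything inside it.

    Assumes a single level of containment. Destroy steps remove contained
    objects before the container; create steps do the reverse.
    """
    destroy = [("delete", address) for address in contained]
    destroy.append(("delete", container))
    create = [("create", container)]
    create += [("create", address) for address in contained]
    return create + destroy if create_before_destroy else destroy + create


steps = replacement_steps("aws_vpc.main", ["aws_subnet.a", "aws_subnet.b"])
```

In the default DeleteThenCreate mode, steps begins with the two subnet deletions and ends with the two subnet creations, with the VPC deletion and creation in between.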

Referring to Objects in the UI

The above use-cases include situations where Terraform Core must report a problem to the user that will include references to involved objects. Since the global IDs are not necessarily user-friendly, we might elect to have a mechanism to ask a provider to generate a human-friendly (but potentially slightly ambiguous) name for a given global ID.

For example, while AWS VPC objects are a per-region namespace in principle, in practice collisions between regions are very unlikely within a particular user's infrastructure and so it is common to talk about VPCs and subnets using just their region-local ids, without qualifying them with a region. The AWS provider might elect to transform a full VPC ARN into just a vpc-abc123-like string for display to the user, assuming that the user will have enough context to understand which region is relevant, and intentionally excluding the AWS account id because VPC IDs never overlap between two AWS accounts.
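A minimal sketch of such a transform, in Python. It assumes a global ID made of the provider type name, a colon, and then a standard six-field ARN; both that format and the display_name function are illustrative assumptions rather than anything the AWS provider does today.

```python
def display_name(global_id: str) -> str:
    """Reduce an assumed "aws:<ARN>" global ID to its region-local ID."""
    # Drop the provider type prefix ("aws:") to recover the ARN.
    arn = global_id.split(":", 1)[1]
    # An ARN has six colon-separated fields; the last one is the resource,
    # which may itself contain colons, so split at most five times.
    resource = arn.split(":", 5)[5]
    # For "vpc/vpc-abc123" keep only the part after the final slash.
    return resource.rsplit("/", 1)[-1]


name = display_name("aws:arn:aws:ec2:us-west-2:123456789012:vpc/vpc-abc123")
# name == "vpc-abc123"
```

The region and account ID are deliberately lost here, which is the "potentially slightly ambiguous" trade-off described above.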

Relationships Between Providers

A key feature of Terraform is its ability to easily pass data between objects in entirely different systems. For example, the IP address of a created compute instance might be sent to a separate DNS vendor to create a DNS record.

It's not clear yet whether there are use-cases for representing Global ID-based relationships between objects in different providers. If there are then the global nature of these identifiers would make that possible, but that then imposes an additional compatibility constraint on each provider as the details of its global ID formats would be embedded in the logic of other providers.

Until we identify a specific use-case for representing a cross-provider relationship, I suggest we forbid it to start. Then if a use-case is found later we can use that real example to figure out what constraints ought to apply in that cross-provider case, rather than risking being constrained by a naive design not informed by use-cases.

Sidebar: Global Object IDs for multi-instance systems

The idea of allocating global object ids maps nicely onto hosted (SaaS, etc) systems where the namespace of objects is physically fixed to a particular vendor and no other instances are available. It's trickier for self-hosted software and other situations where the physical location of the remote system is part of its unique identifier.

For example, the mysql provider is configured with a hostname or IP address for the specific MySQL server to talk to. If the server has a stable, meaningful hostname then using that hostname as part of the identifier is reasonable, but in modern ephemeral environments such services often don't have stable locations and are instead located via a service discovery system, which may not be implemented via DNS lookups.

How to robustly allocate global object IDs for this class of remote system is an open question still to be resolved. A key requirement is that it be possible to move the system to another physical address without implicitly renaming all of its existing global IDs, which seems likely to involve introducing some sort of user-controlled "logical location" that is distinct from the physical location and can persist as the service moves between physical locations, but without imposing operational constraints on the service such as being at a stable hostname.
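Purely to illustrate the requirement, the Python sketch below derives IDs from a user-chosen logical location rather than from the address the provider connects to. The logical_location field, the ID layout, and the example names are all hypothetical; nothing like this exists in the mysql provider today.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProviderConfig:
    endpoint: str          # physical address; may change at any time
    logical_location: str  # hypothetical user-chosen, stable name


def mysql_global_id(config: ProviderConfig, kind: str, name: str) -> str:
    # The endpoint is intentionally not part of the ID.
    return f"mysql:{config.logical_location}/{kind}/{name}"


before = ProviderConfig(endpoint="10.0.0.5:3306", logical_location="orders-db")
after = ProviderConfig(endpoint="10.0.9.17:3306", logical_location="orders-db")

# Moving the server changes the endpoint but leaves every global ID intact.
same_id = mysql_global_id(before, "database", "app") == mysql_global_id(after, "database", "app")
```

The obvious weakness is that uniqueness now depends on users choosing distinct logical locations, which is part of what remains unresolved.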

@apparentlymart
Contributor Author

A further use-case:

Detecting when create_before_destroy replacement will fail

Today Terraform allows setting create_before_destroy in a resource's lifecycle block to ask Terraform to perform a "replace" action by creating a new object first and only destroying the old one after the new one is complete.

However, for any resource type that maps to an object type with a name uniqueness constraint in the remote system, this will tend to fail unless the provider or the user has made a special effort to include a random or otherwise unique string as part of the name.

In situations where a provider can predict the global ID for an object planned for creation, Terraform could potentially detect that the existing object and the new object have the same global ID and thus know ahead of time that create_before_destroy will fail.

This is ultimately just a specialization of "Detecting Object Collisions" in the original proposal, but if Terraform could recognize that the conflict is across a CreateThenDelete action then it could perhaps provide a more specific and actionable error message.

On the other hand, Terraform Core doesn't know what aspects of the provider configuration contribute to the global ID, and so it would not be able to give specific direction about how to fix the conflict without some additional information from the provider.
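The check itself could be as small as the following Python sketch, which compares the global IDs of the existing object with those predicted for its replacement. The function name and the convention of None for an ID that is unknown until apply are assumptions of the sketch.

```python
def create_before_destroy_conflicts(prior_ids, planned_ids):
    """Return the global IDs shared by the old object and its replacement.

    planned_ids may contain None for IDs that cannot be predicted at plan
    time; those are ignored, so an empty result does not guarantee success.
    """
    predicted = {gid for gid in planned_ids if gid is not None}
    return sorted(predicted & set(prior_ids))


conflicts = create_before_destroy_conflicts(
    prior_ids=["example:lb-target-group/web"],
    planned_ids=["example:lb-target-group/web"],  # name unchanged, so same ID
)
```

A non-empty result means the new object would have to exist alongside an old object with the same identity, so Terraform could reject the plan before attempting the create.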

@bflad
Contributor

bflad commented Nov 6, 2019

Related Terraform Plugin SDK issue: hashicorp/terraform-plugin-sdk#224

@gdavison
Contributor

gdavison commented Jun 4, 2020

Another related use-case:

Indicating connected resources that will prevent replacement

This is related to the containment relationship, but the "contained" resource is connected to multiple resources in a graph structure.

This example references resources from the AWS Provider and was inspired by hashicorp/terraform-provider-aws#636. Note that there is a workaround by adding randomness to the resource name.

When configuring an AWS load balancer, some of the resources involved are: aws_lb_target_group, aws_lb_listener, and aws_lb_listener_rule. The aws_lb_listener_rule can be configured to reference an aws_lb_target_group.

The port argument of aws_lb_target_group forces a new resource. If no aws_lb_listener_rules are connected, the aws_lb_target_group can be replaced, but the AWS API returns errors if there are any connected.

A relationship similar to depends_on between the aws_lb_target_group and aws_lb_listener_rule that would schedule deletion of the aws_lb_listener_rule whenever the aws_lb_target_group is deleted would be useful here.

@ankon

ankon commented Jun 22, 2022

I'm not sure this belongs exactly here, but it feels worth mentioning: Apart from internally knowing all possible relationships between resource types, or being explicit about them via contains {} hints, another option could be to take the existing hints from users. As a writer of a terraform configuration file I indicate relationships between resources by referring to them, for example:

resource "aws_ecs_service" "service" {
  name    = var.service_name
  cluster = aws_ecs_cluster.cluster.arn # << service needs cluster

  wait_for_steady_state = true

  task_definition = aws_ecs_task_definition.task_definition.arn # << service needs the task definition

  load_balancer {
    container_name   = var.service_name
    container_port   = var.container_port2
    target_group_arn = aws_lb_target_group.service_target_group.arn # << service also needs the target group
  }
}

Obviously this is very trimmed down, and especially with modules coming into play things become less obvious (should modules be boundaries, or could one even follow dependencies through variables and locals?), but still the point stays: I'm already telling terraform the relationship that my resources have in my case, so terraform could use that ... [1]

[1] ... for example by not deleting the target group before waiting for the service to have settled again after deleting the load_balancer {} :)

@jamiekt

jamiekt commented Jan 8, 2024
