r/virtual_machine: Datastore cluster support #447
This is in prep for adding datastore cluster support. We may need extra validation beyond what is listed here, but this gives us at least a baseline to work from, illustrating where the conflicting attributes lie so that config validation versus diff validation can block the more basic cases.
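For reference, a minimal sketch of what the config-level half of that validation could look like using the Terraform helper/schema `ConflictsWith` mechanism. The attribute names match the ones discussed in this PR, but the rest of the resource schema is omitted and the actual implementation may differ:

```go
// Sketch only: schema-level conflict validation between a fixed datastore
// and a datastore cluster. The real resource declares many more attributes
// and additional diff-time checks.
package vsphere

import "github.com/hashicorp/terraform/helper/schema"

func virtualMachineDatastoreSchema() map[string]*schema.Schema {
	return map[string]*schema.Schema{
		"datastore_id": {
			Type:          schema.TypeString,
			Optional:      true,
			Computed:      true,
			Description:   "The ID of a specific datastore to place the VM on.",
			ConflictsWith: []string{"datastore_cluster_id"},
		},
		"datastore_cluster_id": {
			Type:          schema.TypeString,
			Optional:      true,
			Description:   "The ID of a datastore cluster (storage pod) to place the VM in.",
			ConflictsWith: []string{"datastore_id"},
		},
	}
}
```

Config validation via `ConflictsWith` catches the literal case of both attributes being set; diff validation is still needed for the cases that only show up once interpolated values are known.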
This is a new attempt at VM creation through Storage DRS. This time, unlike the last attempt, we use the Storage DRS API entirely as a replacement for CreateVM_Task, which is how the API is intended to be used. The PoC in govc does not do this; rather, it manually assigns the datastores from the returned recommendations to the VirtualMachineConfigSpec. In our provider, that approach resulted in a lot of convoluted code and workflows that just were not worth it in the long run, especially for cloning.

The workhorses for this are helpers for both the datastore cluster and standard datastore creation paths, in addition to a new CreateVM helper in the storagepod helper package that is nearly API compatible with our Create function in the virtualmachine helper package, with the only addition being that we require the datastore cluster to deploy to. We will use this PoC for VM creation to create helpers for cloning, reconfiguration, and migration as well. All of these should be similarly simple to creation.

The only drawback to this approach is that immediately reconfiguring the VM after cloning to do things like grow the disk may result in SDRS recommendations that either cause the operation to fail or trigger an immediate storage vMotion, should the reconfiguration push the disk into SDRS violation territory. However, when comparing how complicated it is to get recommendations for a new VM with the final specs and transpose those recommendations onto the clone spec, versus just making the two SDRS apply calls, the latter is much cleaner and less complicated from a code perspective - the former would require massive refactoring of the current resource to be accommodated with any level of sanity.

TODO: Add datastore cluster self-hosting into acceptance tests.
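As a rough illustration of the creation path described above, here is a sketch of asking Storage DRS for placement recommendations instead of calling CreateVM_Task directly. It assumes govmomi; the function name and parameters are illustrative and not the provider's actual helper:

```go
// Sketch of the SDRS creation path: build a placement spec for a "create"
// operation and ask the StorageResourceManager for recommendations.
package storagepod

import (
	"context"

	"github.com/vmware/govmomi/object"
	"github.com/vmware/govmomi/vim25"
	"github.com/vmware/govmomi/vim25/types"
)

// recommendCreate (illustrative name) asks Storage DRS for placement
// recommendations for a new VM in the given datastore cluster (storage pod).
func recommendCreate(
	ctx context.Context,
	client *vim25.Client,
	spec types.VirtualMachineConfigSpec,
	pool *object.ResourcePool,
	folder *object.Folder,
	pod *object.StoragePod,
) (*types.StoragePlacementResult, error) {
	podRef := pod.Reference()
	poolRef := pool.Reference()
	folderRef := folder.Reference()

	sps := types.StoragePlacementSpec{
		Type:         string(types.StoragePlacementSpecPlacementTypeCreate),
		ConfigSpec:   &spec,
		ResourcePool: &poolRef,
		Folder:       &folderRef,
		PodSelectionSpec: types.StorageDrsPodSelectionSpec{
			StoragePod: &podRef,
		},
	}

	srm := object.NewStorageResourceManager(client)
	return srm.RecommendDatastores(ctx, sps)
}
```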
Fixed the test so that it's self-hosting now. Also did a bit of refactoring to the CreateVM function: moved the functionality for fetching and applying SDRS recommendations to an internal helper, as we will need this functionality for other helpers and it helps make CreateVM less complex.
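The extracted recommend-and-apply step could look roughly like the following sketch (again assuming govmomi; `applyRecommendations` is a hypothetical name, not the provider's internal helper):

```go
// Sketch of applying SDRS recommendations: take the highest-ranked
// recommendation, apply it, and wait for the resulting task.
package storagepod

import (
	"context"
	"fmt"

	"github.com/vmware/govmomi/object"
	"github.com/vmware/govmomi/vim25"
	"github.com/vmware/govmomi/vim25/types"
)

// applyRecommendations applies the first SDRS recommendation and waits for
// the resulting task, returning the placed VM.
func applyRecommendations(
	ctx context.Context,
	client *vim25.Client,
	result *types.StoragePlacementResult,
) (*object.VirtualMachine, error) {
	if len(result.Recommendations) < 1 {
		return nil, fmt.Errorf("no storage DRS recommendations were returned")
	}

	srm := object.NewStorageResourceManager(client)
	task, err := srm.ApplyStorageDrsRecommendation(ctx, []string{result.Recommendations[0].Key})
	if err != nil {
		return nil, err
	}

	info, err := task.WaitForResult(ctx, nil)
	if err != nil {
		return nil, err
	}

	// For create and clone operations, the task result should carry the VM
	// that SDRS placed.
	res, ok := info.Result.(types.ApplyStorageRecommendationResult)
	if !ok || res.Vm == nil {
		return nil, fmt.Errorf("unexpected result from ApplyStorageDrsRecommendation")
	}
	return object.NewVirtualMachine(client, *res.Vm), nil
}
```

Keeping the recommend/apply pair in one place means the clone, reconfigure, and migrate helpers can all reuse it rather than each juggling recommendation keys themselves.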
Part 1. We still need to update the datastore fetching logic in the cloneSpec and the appropriate relocate sections to ensure that it doesn't fail when looking up empty datastore IDs.
This makes the few minor changes necessary to the cloning workflow to ensure that a datastore ID does not need to be specified. It also adds a step to the reconfigure operation that simply defers to the standard reconfigure if no disk creation operations are actually happening in a reconfigure, as SDRS issues an error if no disk placements are requested.
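A hedged sketch of the kind of check that could gate that fallback - whether the reconfigure spec actually creates any disks - is below; the helper name is made up, and the provider's real logic may consider more than this:

```go
// Sketch only: SDRS errors on a reconfigure placement request that creates
// no disks, so such reconfigures should go through the plain reconfigure
// task instead of the SDRS path.
package vsphere

import "github.com/vmware/govmomi/vim25/types"

// hasDiskCreationOps reports whether the given reconfigure spec contains at
// least one virtual disk being added with a "create" file operation.
func hasDiskCreationOps(spec types.VirtualMachineConfigSpec) bool {
	for _, change := range spec.DeviceChange {
		dcs := change.GetVirtualDeviceConfigSpec()
		if dcs.Operation != types.VirtualDeviceConfigSpecOperationAdd {
			continue
		}
		if dcs.FileOperation != types.VirtualDeviceConfigSpecFileOperationCreate {
			continue
		}
		if _, ok := dcs.Device.(*types.VirtualDisk); ok {
			return true
		}
	}
	return false
}
```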
We added this validation in the last commit, but it's deeper down the stack than it needs to be - this moves it to a very early point to save some API calls.
This gets us `NewValueKnown`, which we use to properly determine whether or not `datastore_cluster_id` is defined in configuration, even if interpolation is still pending at diff time.
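For context, a minimal CustomizeDiff sketch of how `NewValueKnown` can be used for this; the surrounding diff logic is illustrative only and not the resource's actual CustomizeDiff function:

```go
// Sketch only: use ResourceDiff.NewValueKnown to detect a
// datastore_cluster_id that is set in config but still unknown at diff time.
package vsphere

import "github.com/hashicorp/terraform/helper/schema"

func customizeDiffSketch(d *schema.ResourceDiff, meta interface{}) error {
	// NewValueKnown returns false when the attribute's final value is not
	// available at diff time (for example, it is interpolated from a
	// resource that has not been created yet).
	if !d.NewValueKnown("datastore_cluster_id") {
		// The cluster is defined but unknown, so the default datastore
		// cannot be resolved yet; mark it as computed so it is resolved
		// again at apply time.
		return d.SetNewComputed("datastore_id")
	}
	return nil
}
```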
This commit includes pretty much complete support for vMotion through the Storage DRS API, which we use if a datastore cluster is defined. The workflow is triggered not necessarily by checking for a change in datastore_cluster_id, but rather by checking which datastores the VM configuration and individual disks are in, and ensuring that those datastores are members of the defined datastore cluster (a sketch of this membership check follows the list below). We also handle the case in which a datastore cluster ID is defined but not known at diff time (ie: pending interpolation), which basically forces the default datastore back to computed, which will in turn request a vMotion. Some extra tests are/may be pending:

* (Only) host migration when a datastore cluster is defined. This will determine if we need to handle the case where it's possible that no SDRS recommendations will be given.
* Migration of an individual disk to a datastore that is not a member of the defined datastore cluster. This is to determine if the resource will properly trigger a migration to move the disk back.
* Possible handling of a scenario where TF has triggered a change in datastore cluster (ie: set it back to computed) but no real change is happening (ie: possibly the current cluster is being destroyed and its datastores moved to a newly created cluster). Admittedly this is an edge case, and depending on the difficulty of writing the test (which should be an indication of the likelihood of the edge case happening), we may just skip this. Technically, this poses the possibility of a failed operation due to no recommendations, but it would more than likely resolve itself in the next TF plan, which would then be empty once the new datastore cluster ID is available, with the datastore the VM resides in being a member.
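The membership check mentioned above could be sketched like this, assuming govmomi's property collector; `isMember` is an illustrative name:

```go
// Sketch only: check whether a datastore is a child of the configured
// datastore cluster (storage pod).
package storagepod

import (
	"context"

	"github.com/vmware/govmomi/object"
	"github.com/vmware/govmomi/property"
	"github.com/vmware/govmomi/vim25"
	"github.com/vmware/govmomi/vim25/mo"
)

// isMember reports whether ds is one of the pod's child datastores.
func isMember(ctx context.Context, client *vim25.Client, pod *object.StoragePod, ds *object.Datastore) (bool, error) {
	var props mo.StoragePod
	pc := property.DefaultCollector(client)
	// Only the childEntity property is needed to test membership.
	if err := pc.RetrieveOne(ctx, pod.Reference(), []string{"childEntity"}, &props); err != nil {
		return false, err
	}
	for _, ref := range props.ChildEntity {
		if ref == ds.Reference() {
			return true, nil
		}
	}
	return false, nil
}
```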
As mentioned in the previous commit, this update adds two of the three mentioned tests: the host vMotion test and the individual disk migration test. We found and fixed a minor bug in the latter path. I have opted not to add the last test. The main reason is that it's too hard to guarantee that we would not have a race condition under this scenario in normal use without even more use of workarounds like depends_on, as more than likely any changes happening within datastore clusters would be happening in the graph at the same time as the virtual machine changes, which could possibly lead to SDRS errors for more standard reasons. The best solution would be to manage these things in separate stacks anyway, and the likelihood that people are going to manage datastore clusters and VMs in the same configuration is probably small.
This completes the SDRS support for the VM resource. Documentation has been added along with examples. A small refactor was necessary as well, since we missed one initial config validation scenario - this has been added, and we have split off the membership validation part of the SDRS diff customization step to address gocyclo.
It was incorrectly assumed that this was not necessary, but it for sure is - we need to account for a computed datastore ID for the same reasons we need to account for a computed datastore cluster ID: it could be specified, but not available in its final value in the diff.
This updates the vMotion check so that a change in datastore cluster ID is handled properly. The old logic included a change check on datastore_cluster_id, really only to ensure that we caught any possible changes to disk sub-resources when a datastore cluster was defined (ie: in the event of relocating a disk that had been migrated out of the cluster). This produced a side effect where relocations were being attempted even when there were no actual changes in datastores. The new logic adds a return flag to the DiskRelocateOperation function for disk sub-resources, which gets set on the first genuine relocation operation that is processed - including default datastore, pinned datastore, and datastore cluster-triggered changes. This allows us to check for these things without triggering a migration due to a change in datastore_cluster_id that does not result in any actual change. Removing datastore_cluster_id from the logic does not have any implications for other migration workflows, as all of these operations are handled naturally through diff customization. The source of truth for whether or not a migration is necessary is always the datastore_id attributes.
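For illustration, the flag pattern described above might look like the following sketch; the real DiskRelocateOperation has a different signature and operates on the provider's disk sub-resource types:

```go
// Illustrative sketch of the "did anything actually relocate?" flag.
package vsphere

import "github.com/vmware/govmomi/vim25/types"

// diskRelocator is a stand-in for a per-disk relocation builder.
type diskRelocator interface {
	// Relocate returns a locator when this disk genuinely needs to move,
	// or nil when it is already where it should be.
	Relocate() *types.VirtualMachineRelocateSpecDiskLocator
}

// diskRelocateOperation builds the disk locator list and reports, via the
// second return value, whether at least one genuine relocation was found.
// Only that flag, not a change to datastore_cluster_id, should trigger a
// storage vMotion.
func diskRelocateOperation(disks []diskRelocator) ([]types.VirtualMachineRelocateSpecDiskLocator, bool) {
	var locators []types.VirtualMachineRelocateSpecDiskLocator
	relocate := false
	for _, d := range disks {
		l := d.Relocate()
		if l == nil {
			continue
		}
		relocate = true
		locators = append(locators, *l)
	}
	return locators, relocate
}
```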
The recommendAndApplySDRS function handles multiple operations, some of which are not migrations.
Just the one typo in the docs. Other than that, LGTM
> ignored unless disks drift outside of the datastore cluster. The example below
> makes use of the [`vsphere_datastore_cluster` data
> source][tf-vsphere-datastore-cluster-data-source], and the
> [`datastore_clsuter_id`](#datastore_cluster_id) configuration setting. Note
Should be `datastore_cluster_id`
Thanks @bill-rich! Correcting this now
Ref: #2 (not necessarily a fix just yet - we want to hold off on that until we add the SDRS VM exception resources)
This adds datastore cluster/SDRS support to the `vsphere_virtual_machine` resource, via the new attribute `datastore_cluster_id`. When this option is in use, all changes to underlying datastores in the resource are ignored unless disks drift outside of the datastore cluster.

In addition to this, when the setting is in use, most VM creation, cloning, update, and vMotion workflows are sent to the Storage DRS API to ensure proper placement of disks within datastores. The main exception is reconfiguration of virtual machines when no disks are being created, which produces an error in the SDRS API, hence these are still just sent to the standard reconfiguration workflow.
When `datastore_cluster_id` is in use, datastore assignment to individual disks is not possible, nor is it possible to use this with a VM that is attaching an external disk. This decision was made to simplify the workflows that Terraform would need to handle - especially in the latter scenario, it would be all too easy for an "attached" disk (which is already purely a Terraform construct) to drift out of sync with state if the appropriate SDRS exemptions were not created and managed correctly.