r/virtual_machine: Datastore cluster support #447

Merged: 14 commits into master on Apr 4, 2018

Conversation

vancluever (Contributor) commented:

Ref: #2 (not necessarily a fix just yet - we want to hold off on closing that until we add the SDRS VM exception resources)

This adds datastore cluster/SDRS support to the vsphere_virtual_machine resource, via the new attribute datastore_cluster_id. When this option is in use, all changes to underlying datastores in the resource are ignored unless disks drift outside of the datastore cluster.

In addition, when the setting is in use, most VM creation, cloning, update, and vMotion workflows are sent to the Storage DRS API to ensure proper placement of disks within datastores. The main exception is reconfiguration of virtual machines when no disks are being created, which produces an error in the SDRS API; these are still sent through the standard reconfiguration workflow.

When datastore_cluster_id is in use, datastore assignment to individual disks is not possible, nor is it possible to use this with a VM that is attaching an external disk. This decision was made to simplify the workflows that Terraform needs to handle - especially in the latter scenario, it would be all too easy for an "attached" disk (which is already purely a Terraform construct) to drift out of sync with state if the appropriate SDRS exemptions were not created and managed correctly.

This is in preparation for adding datastore cluster support.

We may need extra validation aside from what is listed here, but this gives us at least a bit of a baseline to work from - illustrating where the conflicting attributes lie, so that config validation versus diff validation can block the more basic cases.
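
As a rough illustration of the config-validation side, a hypothetical schema fragment along these lines could declare the conflicts directly (the attribute names mirror those discussed in this PR, but this is a sketch, not the provider's exact schema):

```go
package vsphere

import "github.com/hashicorp/terraform/helper/schema"

// datastoreClusterSchema is a hypothetical fragment showing how
// ConflictsWith lets plain config validation reject the basic conflicting
// cases, leaving only the interpolation-dependent ones to diff validation.
func datastoreClusterSchema() map[string]*schema.Schema {
	return map[string]*schema.Schema{
		"datastore_cluster_id": {
			Type:          schema.TypeString,
			Optional:      true,
			Description:   "The ID of the datastore cluster to place the VM in.",
			ConflictsWith: []string{"datastore_id"},
		},
		"datastore_id": {
			Type:          schema.TypeString,
			Optional:      true,
			Computed:      true,
			Description:   "The ID of an individual datastore to place the VM in.",
			ConflictsWith: []string{"datastore_cluster_id"},
		},
	}
}
```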
This is a new attempt at VM creation through Storage DRS. This time, unlike the last attempt, we use the Storage DRS API entirely as a replacement for CreateVM_Task, which is how the API is intended to be used. The PoC in govc does not do this; rather, it manually assigns the datastores from the returned recommendations to the VirtualMachineConfigSpec. In our provider, this resulted in a lot of convoluted code and workflows that just were not worth it in the long run, especially for cloning.

The workhorses for this are helpers for both the datastore cluster and standard datastore creation paths, in addition to a new CreateVM helper in the storagepod helper package that is nearly API compatible with our Create function in the virtualmachine helper package, the only addition being that it requires the datastore cluster to deploy to.

We will use this PoC for VM creation to create helpers for cloning, reconfiguration, and migration as well. All of these should be similarly simple to creation; the core flow is sketched below.
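
A minimal sketch of the recommend-and-apply flow using govmomi (the function shape, names, and plumbing here are illustrative assumptions, not the provider's actual storagepod helper code):

```go
package storagepod

import (
	"context"
	"errors"

	"github.com/vmware/govmomi/object"
	"github.com/vmware/govmomi/vim25"
	"github.com/vmware/govmomi/vim25/types"
)

// createViaSDRS asks Storage DRS to place a new VM in a datastore cluster
// and applies the first recommendation, replacing a direct CreateVM_Task.
func createViaSDRS(ctx context.Context, c *vim25.Client, pod *object.StoragePod, spec types.VirtualMachineConfigSpec, pool *object.ResourcePool, folder *object.Folder) (*object.Task, error) {
	podRef := pod.Reference()
	poolRef := pool.Reference()
	folderRef := folder.Reference()
	sps := types.StoragePlacementSpec{
		Type:       string(types.StoragePlacementSpecPlacementTypeCreate),
		ConfigSpec: &spec,
		PodSelectionSpec: types.StorageDrsPodSelectionSpec{
			StoragePod: &podRef,
		},
		ResourcePool: &poolRef,
		Folder:       &folderRef,
	}
	srm := object.NewStorageResourceManager(c)
	result, err := srm.RecommendDatastores(ctx, sps)
	if err != nil {
		return nil, err
	}
	if len(result.Recommendations) < 1 {
		return nil, errors.New("no Storage DRS recommendations were returned")
	}
	// Applying the recommendation performs the actual VM creation, with
	// disks placed according to SDRS.
	return srm.ApplyStorageDrsRecommendation(ctx, []string{result.Recommendations[0].Key})
}
```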

The only drawback to this approach is that immediate reconfiguration of the VM after cloning - to grow the disk, for example - may result in SDRS recommendations that either cause the operation to fail or trigger an immediate storage vMotion, should the reconfiguration push the disk into SDRS violation territory. However, compared with the complicated path of getting recommendations for a new VM with the final specs and transposing those recommendations onto the clone spec, just making the two SDRS apply calls is much cleaner and less complicated from a code perspective - accommodating the former would require massive refactoring of the current resource to be done with any level of sanity.

TODO: Add datastore cluster self-hosting into acceptance tests.

Fixed the test so that it's self-hosting now.

Also did a bit of refactoring to the CreateVM function: moved the functionality for fetching and applying SDRS recommendations into an internal helper, as we will need it for other helpers, and it makes CreateVM less complex.
Part 1. We still need to update the datastore fetching logic in the cloneSpec and the appropriate relocate sections to ensure that it doesn't fail when looking up empty datastore IDs.
This makes the few minor changes necessary to the cloning workflow to ensure that a datastore ID does not need to be specified. It also adds a step to the reconfigure operation that defers to the standard reconfigure workflow when no disk creation operations are actually happening in a reconfigure, as the SDRS API issues an error if no disk placements are requested; the check is sketched below.
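
A minimal sketch of that check (the helper name is an assumption, not the provider's exact code): scan the reconfigure spec for newly created virtual disks, and defer to a plain reconfigure when there are none.

```go
package storagepod

import "github.com/vmware/govmomi/vim25/types"

// hasDiskCreation reports whether a reconfigure spec actually creates any
// virtual disks. If it doesn't, SDRS placement would error out, so the
// caller should fall back to the standard reconfigure workflow instead.
func hasDiskCreation(spec types.VirtualMachineConfigSpec) bool {
	for _, dc := range spec.DeviceChange {
		dspec := dc.GetVirtualDeviceConfigSpec()
		if _, ok := dspec.Device.(*types.VirtualDisk); !ok {
			continue // not a disk device change
		}
		if dspec.Operation == types.VirtualDeviceConfigSpecOperationAdd &&
			dspec.FileOperation == types.VirtualDeviceConfigSpecFileOperationCreate {
			return true // a new disk backing file is being created
		}
	}
	return false
}
```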
We added this validation in the last commit, but it's deeper down the stack than it needs to be - this moves it to a very early point to save some API calls.
This gets us NewValueKnown, which we use to properly compute whether or
not datastore_cluster_id is defined in configuration, even if
interpolation is still pending at diff time.
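
A minimal sketch of how that check might look inside diff customization (the function name and surrounding logic are assumptions; NewValueKnown and SetNewComputed are the actual SDK calls):

```go
package vsphere

import "github.com/hashicorp/terraform/helper/schema"

// datastoreClusterDiffCheck handles the case where datastore_cluster_id is
// defined in configuration but its value is still pending interpolation.
func datastoreClusterDiffCheck(d *schema.ResourceDiff) error {
	if !d.NewValueKnown("datastore_cluster_id") {
		// The cluster ID is defined but unknown at diff time, so we
		// can't validate membership yet; force the datastore back to
		// computed so placement is re-evaluated once the value resolves.
		return d.SetNewComputed("datastore_id")
	}
	return nil
}
```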
This commit includes pretty much complete support for vMotion through
the Storage DRS API, which we use if we have a datastore cluster
defined.

The workflow is properly triggered not by checking for a change in datastore_cluster_id, but rather by checking the datastores that the VM configuration and individual disks are in, and ensuring that those datastores are members of the defined datastore cluster. We also handle the event in which a datastore cluster ID is defined but not known at diff time (ie: pending interpolation), which forces the default datastore back to computed, which will in turn request a vMotion.
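
A hedged sketch of the membership side of that check (the helper name is illustrative), using govmomi to fetch the pod's children and verify the datastore is among them:

```go
package storagepod

import (
	"context"

	"github.com/vmware/govmomi/object"
	"github.com/vmware/govmomi/vim25/mo"
	"github.com/vmware/govmomi/vim25/types"
)

// isMember reports whether a datastore is a child of the given datastore
// cluster (StoragePod). A datastore that has drifted outside the cluster
// is what actually triggers the storage vMotion, not a change to
// datastore_cluster_id itself.
func isMember(ctx context.Context, pod *object.StoragePod, ds types.ManagedObjectReference) (bool, error) {
	var props mo.StoragePod
	if err := pod.Properties(ctx, pod.Reference(), []string{"childEntity"}, &props); err != nil {
		return false, err
	}
	for _, child := range props.ChildEntity {
		if child == ds {
			return true, nil
		}
	}
	return false, nil
}
```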

Some extra tests are/may be pending:

* (Only) host migration when a datastore cluster is defined. This will
determine if we need to handle the case where it's possible that no SDRS
recommendations will be given.
* Migration of an individual disk to a datastore that is not a member of
the defined datastore cluster. This is to determine if the resource will
properly trigger a migration to move the disk back.
* Possible handling of a scenario where TF has triggered a change in datastore cluster (ie: set it back to computed) but no real change is happening (ie: possibly the current cluster is being destroyed and its datastores moved to a newly created cluster). Admittedly this is an edge case, and depending on the difficulty of writing the test (which should be an indication of the likelihood of the edge case happening), we may just skip it. Technically, this poses the possibility of a failed operation due to no recommendations, but it would more than likely resolve itself in the next TF plan, which would then be empty on account of the new datastore cluster ID being available, with the datastore the VM resides in being a member.
As mentioned in the previous commit, this update adds two out of the
three mentioned tests: the host vMotion test, and the individual disk
migration test. We found a minor bug in the latter path, which has been fixed.

The last path I have opted not to add. The main reason is that it's too hard to guarantee that we would not have a race condition under this scenario in normal use without even more use of workarounds like depends_on: more than likely, any changes happening within datastore clusters would be happening in the graph at the same time as the virtual machine changes, which could lead to SDRS errors for more standard reasons. The best solution would be to manage these things in separate stacks anyway, and the likelihood that people will be managing datastore clusters and VMs in the same configuration is probably small.
This completes the SDRS support for the VM resource. Documentation has
been added along with examples.

A small refactor was necessary as well, as we missed one initial config validation scenario - this has been added, and we have split off the membership validation part of the SDRS diff customization step to address gocyclo.
@vancluever vancluever added the enhancement Type: Enhancement label Mar 31, 2018
@vancluever vancluever requested a review from a team March 31, 2018 17:50
It was incorrectly assumed that this was not necessary, but it for sure is - we need to account for a computed datastore ID for the same reasons that we need to account for the computed datastore cluster ID: it could be specified, but not available in its final value in the diff.
This updates the vMotion check so that a change in datastore cluster ID
is handled properly.

The old logic included a change check on datastore_cluster_id, really only to ensure that we caught any possible changes to disk sub-resources when a datastore cluster was defined (ie: in the event of relocating a disk that had been migrated out of the cluster). This was producing a side effect where relocations were being attempted even when there were no actual changes in datastores.

The new logic adds a return flag to the DiskRelocateOperation function
for disk sub-resources, which gets flagged on the first genuine
relocation operation that gets processed - including default datastore,
pinned datastore, and datastore cluster-triggered changes. This allows
us to check for these things without triggering a migration due to a
change in datastore_cluster_id that does not result in any actual
change.

Removing datastore_cluster_id from the logic does not have any implications for other migration workflows, as all of these operations are handled naturally through diff customization. The datastore_id attributes are always the source of truth for whether or not a migration is necessary.
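
A hypothetical, stripped-down sketch of the return-flag pattern described above (the real DiskRelocateOperation works on the provider's disk sub-resources; the types here are reduced to the essentials):

```go
package vsphere

// diskChange is a stand-in for a disk sub-resource diff.
type diskChange struct {
	datastoreChanged bool   // a genuine default, pinned, or cluster-triggered change
	relocateSpec     string // placeholder for the real relocate spec
}

// diskRelocateOperation returns the relocate specs plus a flag that is only
// set when at least one genuine relocation was processed, so the caller can
// skip the migration when a datastore_cluster_id change is a no-op.
func diskRelocateOperation(changes []diskChange) ([]string, bool) {
	var specs []string
	relocateOK := false
	for _, c := range changes {
		if !c.datastoreChanged {
			continue // no actual datastore change for this disk
		}
		relocateOK = true
		specs = append(specs, c.relocateSpec)
	}
	return specs, relocateOK
}
```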
The recommendAndApplySDRS function handles multiple operations, some of
which are not migrations.
@bill-rich (Contributor) left a comment:

Just the one typo in the docs. Other than that, LGTM

ignored unless disks drift outside of the datastore cluster. The example below
makes use of the [`vsphere_datastore_cluster` data
source][tf-vsphere-datastore-cluster-data-source], and the
[`datastore_clsuter_id`](#datastore_cluster_id) configuration setting. Note
@bill-rich (Contributor), on the excerpt above:

Should be datastore_cluster_id

@vancluever (Contributor, Author) replied on Apr 4, 2018:

Thanks @bill-rich! Correcting this now

@vancluever vancluever merged commit 897e062 into master Apr 4, 2018
@vancluever vancluever deleted the f-vm-datastore-cluster-support-new branch April 4, 2018 21:55
@ghost ghost locked and limited conversation to collaborators Apr 19, 2020