
r/virtual_machine: Current limitations and roadmap for disk device management and vMotion #295

Closed · 2 tasks done
vancluever opened this issue Dec 8, 2017 · 6 comments
Labels: enhancement (Type: Enhancement)

vancluever (Contributor) commented Dec 8, 2017

NOTE: Read the update below for more details on the current plan.

This is something we need to block across the board: running storage vMotion on a linked clone does not necessarily work anyway, although there are situations where it can function (so long as vSphere has determined that it can properly link the disks).

However, upon migration of these disks, the filename of the backings changes to a delta disk filename, removing the original name from the snapshot chain. Worse, the new filename gets saved to state. The next plan will then look something like:

  ~ vsphere_virtual_machine.example_virtual_machines[0]
      disk.#:                  "1" => "2"
      disk.0.name:             "vancluever-test0/vancluever-test0_1-000001.vmdk" => "<deleted>"
      disk.1.attach:           "" => "false"
      disk.1.disk_mode:        "" => "persistent"
      disk.1.disk_sharing:     "" => "sharingNone"
      disk.1.eagerly_scrub:    "" => "false"
      disk.1.io_limit:         "" => "-1"
      disk.1.io_reservation:   "" => "0"
      disk.1.io_share_count:   "" => "0"
      disk.1.io_share_level:   "" => "normal"
      disk.1.keep_on_remove:   "" => "false"
      disk.1.key:              "" => "0"
      disk.1.name:             "" => "vancluever-test0.vmdk"
      disk.1.size:             "" => "32"
      disk.1.thin_provisioned: "" => "true"
      disk.1.unit_number:      "" => "0"
      disk.1.write_through:    "" => "false"

This means that the disk will be deleted on the next apply, completely destroying its data.

At this point I can't think of a good way to support this 100%, so in the spirit of keeping the provider as safe as possible, we need to be able to block this completely.

1.1.x mitigation steps

  • Block in plan/CustomizeDiff (see the sketch after this list)
  • Block in apply (if possible). NOTE: This is not necessary with CustomizeDiff
  • Document the limitations of vMotion
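
For illustration, a minimal sketch of what a CustomizeDiff guard along these lines could look like, assuming the 2017-era helper/schema signature. The function name is hypothetical; the attribute paths (datastore_id, clone.0.linked_clone) follow the resource's documented schema, but this is not the provider's actual implementation.

```go
package vsphere

import (
	"errors"

	"github.com/hashicorp/terraform/helper/schema"
)

// blockLinkedCloneStorageVMotion is a hypothetical CustomizeDiff helper that
// rejects a plan combining a datastore change (storage vMotion) with a
// linked clone, rather than letting the apply damage the snapshot chain.
func blockLinkedCloneStorageVMotion(d *schema.ResourceDiff, meta interface{}) error {
	// A new resource has no ID yet and cannot be migrating; only an update
	// that changes datastore_id implies a storage vMotion.
	if d.Id() == "" || !d.HasChange("datastore_id") {
		return nil
	}
	if linked, ok := d.GetOk("clone.0.linked_clone"); ok && linked.(bool) {
		return errors.New("storage vMotion of linked clones is not supported at this time")
	}
	return nil
}
```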
vancluever (Contributor, Author) commented Dec 13, 2017

Further details on this: in light of the vSphere behavior quoted below, I want to get away from using the file name to track virtual disks and switch to another scheme, discussed below.

Migration with Storage vMotion changes virtual machine files on the destination datastore to match the inventory name of the virtual machine. The migration renames all virtual disk, configuration, snapshot, and .nvram files. If the new names exceed the maximum filename length, the migration does not succeed.

Source

This has been confirmed by manual testing.

As such, the stop-gap for now will be to block all storage vMotions where the disk name does not match the standard vSphere naming convention of disk.vmdk and disk_INDEX.vmdk, computed from the VM name post-modification (to catch situations where the VM is renamed in the same apply operation, as this currently happens pre-migration).
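
To make that convention concrete, here is a rough sketch of the kind of name check this implies. The function and regular expression are illustrative only, assuming the usual "[datastore] folder/file.vmdk" path format; the provider's actual validation may differ.

```go
package main

import (
	"fmt"
	"path"
	"regexp"
	"strings"
)

// diskNameIsStandard reports whether a disk's datastore path follows the
// standard "<vmname>.vmdk" / "<vmname>_N.vmdk" convention for the given
// (post-rename) virtual machine name.
func diskNameIsStandard(vmName, diskPath string) bool {
	// Strip the "[datastore] " prefix and any folder, keeping the file name.
	file := path.Base(strings.TrimSpace(diskPath[strings.Index(diskPath, "]")+1:]))
	re := regexp.MustCompile("^" + regexp.QuoteMeta(vmName) + `(_\d+)?\.vmdk$`)
	return re.MatchString(file)
}

func main() {
	fmt.Println(diskNameIsStandard("vancluever-test0", "[ds1] vancluever-test0/vancluever-test0_1.vmdk"))        // true: standard name
	fmt.Println(diskNameIsStandard("vancluever-test0", "[ds1] vancluever-test0/vancluever-test0_1-000001.vmdk")) // false: delta disk name
}
```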

We will target 1.2 for removing this limitation, where name will only dictate the initial file name of a virtual disk. For tracking virtual disks, we will also be switching to using disk UUIDs rather than device address. Actual disk path will be tracked in path, which will only be settable when an external disk is being attached. name will be removed in 2.0 to indicate that we will no longer support the ability to choose the name of non-externally attached virtual disks.
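
As a rough illustration of what "UUID first, device address as a fallback" means in practice, here is a small, hypothetical sketch; the types and field names are stand-ins, not the provider's internal structures.

```go
package vsphere

// stateDisk is what Terraform has recorded for a disk; liveDisk is a disk as
// read back from the virtual machine. Both are illustrative stand-ins.
type stateDisk struct {
	Label      string
	UUID       string // empty for state written before the UUID scheme
	UnitNumber int
}

type liveDisk struct {
	UUID       string
	UnitNumber int
	FileName   string // may have been renamed by a storage vMotion
}

// matchDisk locates the live device for a disk in state: by UUID when we
// have one, falling back to the device address (unit number) for old state.
func matchDisk(s stateDisk, live []liveDisk) (liveDisk, bool) {
	for _, d := range live {
		if s.UUID != "" && d.UUID == s.UUID {
			return d, true
		}
	}
	if s.UUID == "" {
		for _, d := range live {
			if d.UnitNumber == s.UnitNumber {
				return d, true
			}
		}
	}
	return liveDisk{}, false
}
```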

Plan, to recap:

1.1.x

  • During the rest of 1.1, storage vMotions will be blocked if the name on the disk does not match the standard vSphere naming convention of disk.vmdk and disk_INDEX.vmdk, computed from the VM name post-modification (to catch situations where the VM is renamed in the same apply operation, as this currently happens pre-migration).
  • vMotions will also be blocked on clone.linked_clone == true.
  • vMotions for external disks (when attach is set) will also be blocked and will remain blocked after the other limitations are removed.

1.2.0/1.3.0

  • Introduction of three new disk attributes: label, path, and uuid. label, along with uuid, will form a composite key that should be able to track the entire disk lifecycle from creation (or attachment), regardless of whether the disk changes position within vSphere (see the schema sketch after this list). path will only be configurable by the user if attach is set, with changes blocked during the diff validation stage.
  • name will still be a valid option, serving both the role of label and path and determining the initial name of the disk. It will be set to conflict with both attributes so that one scheme or the other has to be chosen. A transition from name to label will be supported in diff, but not the other way around.
  • With the introduction of this functionality we will lift all of the storage vMotion restrictions except for the attach restriction, as we will now be able to track a virtual disk across file name changes without risk to the data.
  • Users will get a deprecation warning for the name attribute.
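
For a sense of how this might surface in the disk sub-resource schema, a hedged sketch using the helper/schema package follows; the attribute names mirror the list above, but the actual defaults, validation, and conflict handling in the provider will differ.

```go
package vsphere

import "github.com/hashicorp/terraform/helper/schema"

// diskLabelSchema is a hypothetical slice of the disk sub-resource schema
// showing the new attributes alongside the deprecated name attribute.
func diskLabelSchema() map[string]*schema.Schema {
	return map[string]*schema.Schema{
		"name": {
			Type:       schema.TypeString,
			Optional:   true,
			Deprecated: "Please use label instead (and path for externally attached disks).",
		},
		"label": {
			Type:     schema.TypeString,
			Optional: true,
		},
		"path": {
			Type:     schema.TypeString,
			Optional: true,
			Computed: true, // user-settable only when attach is set; enforced in diff validation
		},
		"uuid": {
			Type:     schema.TypeString,
			Computed: true,
		},
	}
}
```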

2.0.0

  • The name option will be removed, with a forced state migration to label and path if the change has not been made already. This will effectively remove support for choosing the name of all disks that are not attached externally.

Some notes:

  • It appears disk UUIDs are shared across an entire snapshot chain, so linked clones will share the disk UUID of the template. However, historically, as per here, you cannot add disks with the same UUIDs to a virtual machine. I'm checking to see how up to date this information is. EDIT: This no longer seems to be the case, as I was able to offline-copy a VMDK file and add it as a second disk with the same UUID as the source. I'm trying to think of whether we can work around this. Note that I had to do this on the back end, modifying the VM configuration in a way that would normally not happen in vSphere. We could possibly just refuse to read the state of a virtual machine where this is the case, but that's kind of what r/virtual_machine: Failsafe against disk name mismatch #304 was going for, and that didn't go over so well internally. I'm not sure whether the fact that a duplicate UUID situation would never happen during normal vSphere operation changes this, though.

vancluever changed the title from "Storage vMotion: Block linked clones" to "r/virtual_machine: Current limitations and roadmap for disk device management and vMotion" Dec 13, 2017
vancluever added the enhancement (Type: Enhancement) label Dec 13, 2017
vancluever added a commit that referenced this issue Dec 14, 2017
This commit adds restrictions to storage vMotion:

* Externally-attached disks cannot be migrated.
* Disk names must match the standard vSphere naming convention based on
the current name of the virtual machine.
* Linked clones are blocked from being migrated at this time.

The latter two restrictions will be lifted once #295 is fully realized
and we are tracking disks via a label/UUID combination.
vancluever removed the bug (Type: Bug) label Dec 15, 2017
vancluever (Contributor, Author) commented

Now that 1.1.1 is out and we've mitigated this, I've removed the "bug" label so that this is classified as an enhancement; it will be taken care of in the next set of changes, possibly in 1.2 or 1.3.

vancluever (Contributor, Author) commented Jan 17, 2018

Next checklist!

  • Add in fallback code for label and path naming
  • Adjust diff normalization
  • Ensure that you cannot roll back to using name after switching to label
  • Review import
  • Review migration
  • Ensure disk UUIDs are the source of truth for device location, with device addresses as a fallback
  • Remove (nearly) all vMotion restrictions
  • Smoke test transition from name to label
  • Smoke test VM creation while using old name attribute
  • Smoke test VM creation while using new label attribute
  • Audit the V1 -> V3 migration path
  • Audit/smoke test importing
  • Update most acceptance tests for new config
  • Add acceptance test for transition of name to label
  • ^ Test for blocking rollback after transition
  • ^ Test for transitioning external disks to label/path
  • Acceptance test for state V3 migration
  • Update acceptance test for V2 migration
  • Test vMotion of linked clones (acceptance test)
  • Update docs
  • Swap VM with two disks in state around in the console (smoke test)
  • Test vMotion of linked clones (smoke test)

vancluever (Contributor, Author) commented Jan 18, 2018

Just an update that this is moving along and is looking pretty positive. General behaviour for transitioning is looking stable during smoke testing. Most of our acceptance tests will need to be updated, and we will probably add a couple specifically for the transition, moving some cases that were previously error cases to expect success instead. But otherwise I'm pretty excited about how this is looking 👍

vancluever (Contributor, Author) commented Jan 25, 2018

#363 was merged, so this is pretty much done. Everything works as intended and nothing out of the ordinary came out of the smoke tests, except an unrelated issue for which a fix has been submitted as #371.

One last thing I'm testing before we commit to a release is what happens when someone does something like add multiple disks with the same UUID. Again, this should never happen during regular operation, but in the event it does happen, we need to be prepared with some sort of fail-safe.

  • Add disk with duplicate UUIDs manually through the console, with UUID in state
  • Add disk with duplicate UUIDs manually through the console, without UUID in state
  • Try adding a disk with attach in Terraform

vancluever (Contributor, Author) commented

Alright, all three of these operations are safe in terms of not destroying data, ultimately producing errors when Terraform detects a UUID conflict in the read cycle. The difference between the above three scenarios lies in where Terraform fails:

  • Disk added in console with UUID in state: Terraform fails on the next plan.
  • Disk added in console with UUID not in state: Terraform will read in the disks as orphaned for the next plan, but fails the apply as the read happens again and detects the disks as duplicates.
  • Disks added using attach: The apply passes, but the next read gives an error.

The errors resemble something like:

vsphere_virtual_machine.foo: disk.1: cannot find disk device: multiple virtual disks with UUID **** found

This seems to be safe, and as we established internally in #304, Terraform should not be working any harder than normal to handle situations that should never happen during normal operation (such as a user making major modifications to a virtual machine on the back end, outside of Terraform). The only issue I could see coming up is that the state may be left un-correctable after the latter two scenarios, but these are edge cases. The only one we could possibly track is the last one, and that would take heavy effort on Terraform's part: checking not just state and the current devices on the virtual machine, but also comparing the UUIDs of all disks to be attached, which is not something we handle right now anyway.
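
For illustration only, a fail-safe of the kind described above could look roughly like the following; the helper and types are hypothetical, and only the shape of the error mirrors the one quoted earlier.

```go
package vsphere

import "fmt"

// foundDisk is an illustrative stand-in for a virtual disk read back from
// the virtual machine.
type foundDisk struct {
	UUID     string
	FileName string
}

// findDiskByUUID refuses to guess when a UUID is ambiguous: zero or multiple
// matches are both errors, so a back-end duplication can never be silently
// mapped onto the wrong disk in state.
func findDiskByUUID(uuid string, live []foundDisk) (foundDisk, error) {
	var matches []foundDisk
	for _, d := range live {
		if d.UUID == uuid {
			matches = append(matches, d)
		}
	}
	switch len(matches) {
	case 1:
		return matches[0], nil
	case 0:
		return foundDisk{}, fmt.Errorf("cannot find disk device: virtual disk with UUID %s not found", uuid)
	default:
		return foundDisk{}, fmt.Errorf("cannot find disk device: multiple virtual disks with UUID %s found", uuid)
	}
}
```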

As everything in this issue is essentially complete, save the complete removal of the name attribute (which will be done closer to 2.0 and will have a separate issue associated with it), I'm closing this issue now!

ghost locked and limited conversation to collaborators Apr 19, 2020