Address the delete of the donor instance #69

Merged 4 commits on Apr 2, 2021


Michael-Sinz
Collaborator

The donor instance is no longer deleted when creating the prototype
image. The plan was always to keep the donor, but two problems stood
in the way.

The first, and most important, is #26, which this change closes.
The problem was that the cloud-init data "knew" the node had a different
name in a prior instance and thus registered a DNS name delete of its
"old" name. This, unfortunately, would remove the DNS entry for the
donor node. The entry would come back when the donor node restarted or
otherwise reran its DDNS code, but since every scaled in node with that
donor node's instance would do the same, it would continually cause
problems for the donor. This has been addressed with this change.

The second issue was that when certain strong VMSS restarts are done,
cloud-init may rerun, and certain versions of aks-engine had set
up cloud-init data to zap the azure.json file. This could only impact
the donor node, and only if the donor node was not yet a node built by
vmss-prototype (so, basically, the first one for each pool).

Newer aks-engine versions don't have this problem, and since it is a
"one-time" type of problem, we can just ignore it, as most people do not
run the affected aks-engine versions.

I got some of the documentation updated with respect to the new
behavior. We will need to do a run through all of it to see
if there are any other changes.

One new feature is that we now track the ancestry of the image.
Each time we create an image, we append a line to
/var/log/ancestry.log with the timestamp and the name of the node that
we are building the image from. The file thus records the genetic
heritage of the node image.
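
To illustrate the ancestry tracking described above, here is a minimal sketch of appending one entry per image build. The function name `append_ancestry` and the exact line layout (ISO timestamp, a space, then the node name) are assumptions for illustration; the actual format written by vmss-prototype may differ.

```python
from datetime import datetime, timezone


def append_ancestry(node_name, log_path="/var/log/ancestry.log", now=None):
    """Append one ancestry entry and return the line that was written.

    The "<timestamp> <node name>" layout is an assumed format, not
    necessarily the one vmss-prototype uses.
    """
    stamp = (now or datetime.now(timezone.utc)).isoformat()
    line = f"{stamp} {node_name}\n"
    # Append mode keeps every earlier generation, so the file grows by
    # one line each time an image is built from a node carrying it.
    with open(log_path, "a", encoding="utf-8") as log_file:
        log_file.write(line)
    return line


def read_ancestry(log_path="/var/log/ancestry.log"):
    """Return the recorded entries, oldest ancestor first."""
    with open(log_path, "r", encoding="utf-8") as log_file:
        return [entry.rstrip("\n") for entry in log_file if entry.strip()]
```

Because the log lives on the OS disk, it is captured in the snapshot, so a node built from the image inherits every earlier line. Calling `append_ancestry("k8s-agentpool1-33778956-vmss000001")` on a donor would add the newest generation at the end.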

This is a complete pull from a new log (redacted to hide the actual
cluster and subscription):
2021-04-01T21:32:47.143864731Z k8s-agentpool1-33778956-vmss INFO: ===> Executing command: ['kubectl' 'annotate' 'node' 'k8s-agentpool1-33778956-vmss000001' 'cluster-autoscaler.kubernetes.io/scale-down-disabled-']
2021-04-01T21:32:47.233459689Z k8s-agentpool1-33778956-vmss INFO: Creating sig image version - this can take quite a long time...
2021-04-01T21:32:47.233523790Z k8s-agentpool1-33778956-vmss INFO: ===> Executing command: ['az' 'sig' 'image-version' 'create' '--subscription' '00000000-0000-0000-0000-000000000001' '--resource-group' 'testCluster2' '--gallery-name' 'SIG_testCluster2' '--gallery-image-definition' 'kamino-k8s-agentpool1-33778956-vmss-prototype' '--gallery-image-version' '2021.04.01' '--replica-count' '3' '--os-snapshot' 'snapshot_k8s-agentpool1-33778956-vmss' '--tags' 'BuiltFrom=k8s-agentpool1-33778956-vmss000001' 'BuiltAt=2021-04-01 21:32:35.083641' '--storage-account-type' 'Standard_ZRS']
2021-04-01T21:43:34.575631298Z k8s-agentpool2-33778956-vmss INFO: ===> Completed in 664.61s: ['az' 'sig' 'image-version' 'create' '--subscription' '00000000-0000-0000-0000-000000000001' '--resource-group' 'testCluster2' '--gallery-name' 'SIG_testCluster2' '--gallery-image-definition' 'kamino-k8s-agentpool2-33778956-vmss-prototype' '--gallery-image-version' '2021.04.01' '--replica-count' '3' '--os-snapshot' 'snapshot_k8s-agentpool2-33778956-vmss' '--tags' 'BuiltFrom=k8s-agentpool2-33778956-vmss000000' 'BuiltAt=2021-04-01 21:31:23.482360' '--storage-account-type' 'Standard_ZRS'] # RC=0
Collaborator Author


Note that this run is getting the new SIG image copy performance, which has not yet GA'ed.
That is why it is so much faster than before: just over 11 minutes to copy/create the SIG image. The prior runs took 7,000 and 12,000 seconds (versus 664 seconds for these two). That is an order of magnitude improvement, as shown in this test run.
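
As a rough check of that figure, the sketch below pulls the duration out of a "Completed in" log line and compares it with the earlier run times quoted above. The helper names are hypothetical, the sample line abbreviates the command list from the log, and the 7,000 and 12,000 second values are the approximate numbers from this comment.

```python
import re

# Matches the "Completed in 664.61s" portion of a vmss-prototype log line.
COMPLETED_RE = re.compile(r"Completed in (\d+(?:\.\d+)?)s")


def completed_seconds(log_line):
    """Return the duration in seconds from a log line, or None if absent."""
    match = COMPLETED_RE.search(log_line)
    return float(match.group(1)) if match else None


def speedup(old_seconds, new_seconds):
    """Return how many times faster the new run is than the old one."""
    return old_seconds / new_seconds


# Sample line with the command list shortened for readability.
sample = (
    "2021-04-01T21:43:34.575631298Z k8s-agentpool2-33778956-vmss INFO: "
    "===> Completed in 664.61s: ['az' 'sig' 'image-version' 'create'] # RC=0"
)

new_run = completed_seconds(sample)
for old_run in (7000.0, 12000.0):
    print(f"{old_run:.0f}s -> {new_run:.2f}s: {speedup(old_run, new_run):.1f}x faster")
```

With these inputs the ratios come out at roughly 10x and 18x, which is consistent with the order of magnitude claim.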

Jack, we need to go through the usage docs and point people at the
right thing: the low-level usage or the more automated usage.

My guess is that the basic usage should be the automatic form,
with the more detailed, lower-level use reserved for "advanced" users.
@jackfrancis
Owner

For the record, v0.57.0 was the version of aks-engine that removed the zero-byte cloud-init-paved azure.json.

https://github.com/Azure/aks-engine/releases/tag/v0.57.0

This change:

Azure/aks-engine#3876

@Michael-Sinz
Collaborator Author

Michael-Sinz commented Apr 2, 2021 via email

Owner

@jackfrancis jackfrancis left a comment


/lgtm

Owner

@jackfrancis jackfrancis left a comment


/lgtm

@jackfrancis jackfrancis merged commit 40688a5 into jackfrancis:main Apr 2, 2021
Successfully merging this pull request may close these issues.

In Azure, prototype based VMs sometimes trigger a IDNS delete of their donor name