Address the delete of the donor instance #69
Conversation
The donor instance is no longer deleted when creating the prototype image. This was always the plan, to not delete the donor, but there were two problems with it.

The first, and most important, was issue jackfrancis#26, which is now addressed. The problem was that the cloud-init data "knew" the node had a different name in a prior instance and thus registered a DNS delete of its "old" name. This, unfortunately, would remove the DNS entry for the donor node. The entry would come back when the donor node restarted or otherwise reran its DDNS code, but since every node scaled in with that donor node's image would do the same, it would continually cause problems for the donor. This has been addressed with this change.

Another issue was that when certain strong VMSS restarts are done, cloud-init may rerun, and certain versions of aks-engine had set up cloud-init data to zap the azure.json file. This could only impact the donor node, and only if the donor node was not yet a node built by vmss-prototype (so, basically, the first one for each pool). Newer aks-engine versions don't have this problem, and since it is a "one-time" type of problem, we can just ignore it, as most clusters do not run the problematic aks-engine versions.

I updated some of the documentation with respect to the new behavior. We will need to do a pass through all of it to see if there are any other changes needed.

One new feature is that we now track the ancestry of the image. Each time we create an image, we append to /var/log/ancestry.log a line with the timestamp and the name of the node we are building the image from. This file thus records the genetic heritage of the node image.
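For illustration, here is a minimal Python sketch of what appending an ancestry entry could look like; the helper name and exact line format are assumptions, not the actual vmss-prototype code:

```python
from datetime import datetime, timezone

# Path mentioned in the PR description.
ANCESTRY_LOG = "/var/log/ancestry.log"

def append_ancestry(node_name: str, log_path: str = ANCESTRY_LOG) -> None:
    """Append a "<timestamp> <node name>" line to the ancestry log.

    Illustrative only: the real vmss-prototype code may format the
    line differently.
    """
    timestamp = datetime.now(timezone.utc).isoformat()
    with open(log_path, "a") as ancestry:
        ancestry.write(f"{timestamp} {node_name}\n")

# Example: record the donor node used for this image build.
# append_ancestry("k8s-agentpool1-33778956-vmss000001")
```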
This is a complete pull from a new log (redacted to hide actual cluster and subscription)
2021-04-01T21:32:47.143864731Z k8s-agentpool1-33778956-vmss INFO: ===> Executing command: ['kubectl' 'annotate' 'node' 'k8s-agentpool1-33778956-vmss000001' 'cluster-autoscaler.kubernetes.io/scale-down-disabled-']
2021-04-01T21:32:47.233459689Z k8s-agentpool1-33778956-vmss INFO: Creating sig image version - this can take quite a long time...
2021-04-01T21:32:47.233523790Z k8s-agentpool1-33778956-vmss INFO: ===> Executing command: ['az' 'sig' 'image-version' 'create' '--subscription' '00000000-0000-0000-0000-000000000001' '--resource-group' 'testCluster2' '--gallery-name' 'SIG_testCluster2' '--gallery-image-definition' 'kamino-k8s-agentpool1-33778956-vmss-prototype' '--gallery-image-version' '2021.04.01' '--replica-count' '3' '--os-snapshot' 'snapshot_k8s-agentpool1-33778956-vmss' '--tags' 'BuiltFrom=k8s-agentpool1-33778956-vmss000001' 'BuiltAt=2021-04-01 21:32:35.083641' '--storage-account-type' 'Standard_ZRS']
2021-04-01T21:43:34.575631298Z k8s-agentpool2-33778956-vmss INFO: ===> Completed in 664.61s: ['az' 'sig' 'image-version' 'create' '--subscription' '00000000-0000-0000-0000-000000000001' '--resource-group' 'testCluster2' '--gallery-name' 'SIG_testCluster2' '--gallery-image-definition' 'kamino-k8s-agentpool2-33778956-vmss-prototype' '--gallery-image-version' '2021.04.01' '--replica-count' '3' '--os-snapshot' 'snapshot_k8s-agentpool2-33778956-vmss' '--tags' 'BuiltFrom=k8s-agentpool2-33778956-vmss000000' 'BuiltAt=2021-04-01 21:31:23.482360' '--storage-account-type' 'Standard_ZRS'] # RC=0
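In the first log line, the trailing `-` on `cluster-autoscaler.kubernetes.io/scale-down-disabled-` is kubectl's syntax for removing that annotation from the donor node. As a rough sketch of the long-running step that follows, here is how the `az sig image-version create` call in the log could be assembled and executed; the wrapper function is an assumption for illustration, and only the CLI arguments come from the log excerpt:

```python
import subprocess
from datetime import datetime, timezone

def create_sig_image_version(subscription, resource_group, gallery_name,
                             image_definition, image_version, os_snapshot,
                             built_from, replica_count=3):
    """Build and run an `az sig image-version create` call.

    Illustrative wrapper only; the argument names mirror the log excerpt
    above, not the actual vmss-prototype implementation.
    """
    cmd = [
        "az", "sig", "image-version", "create",
        "--subscription", subscription,
        "--resource-group", resource_group,
        "--gallery-name", gallery_name,
        "--gallery-image-definition", image_definition,
        "--gallery-image-version", image_version,
        "--replica-count", str(replica_count),
        "--os-snapshot", os_snapshot,
        "--tags",
        f"BuiltFrom={built_from}",
        f"BuiltAt={datetime.now(timezone.utc)}",
        "--storage-account-type", "Standard_ZRS",
    ]
    print(f"===> Executing command: {cmd}")
    # check=True surfaces a non-zero return code as an exception.
    subprocess.run(cmd, check=True)
```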
Note that this is getting the new SIG image copy performance that has not yet GA'ed.
This is why it is so much faster than before: just over 11 minutes to copy/create the SIG image. The prior run took 7,000 and 12,000 seconds (vs. these two at 664 seconds). That is an order-of-magnitude improvement, as shown in this test run.
Jack, we need to go through the usage docs and aim people at the right thing: the low-level usage and the more automated usage. My guess is that we should have the basic usage be the automatic form, with the more detailed, lower-level usage for the "advanced" users.
For the record, v0.57.0 was the version of aks-engine that removed the zero-byte cloud-init-paved azure.json: https://github.com/Azure/aks-engine/releases/tag/v0.57.0 This change:
/lgtm
/lgtm