Address the delete of the donor instance #69
Conversation
The donor instance is no longer deleted when creating the prototype image. This was always the plan, to not delete the donor, but there were two problems with it.

The first, and most important, was issue jackfrancis#26, which is now addressed. The problem was that the cloud-init data "knew" the node had a different name in a prior instance and thus registered a DNS delete of its "old" name. This, unfortunately, would remove the DNS entry for the donor node. The entry would come back when the donor node restarted or otherwise reran its DDNS code, but since every node scaled in with that donor node's image would do the same, it would continually cause problems for the donor. This has been addressed with this change.

Another issue was that when certain strong VMSS restarts are done, cloud-init may rerun, and certain versions of aks-engine had set up cloud-init data to zap the azure.json file. This could only impact the donor node, and only if the donor node was not yet a node built by vmss-prototype (so, basically, the first one for each pool). Newer aks-engine versions don't have this problem, and since it is a "one-time" type of problem, we can just ignore it, as most clusters do not run the problematic aks-engine versions.

I updated some of the documentation with respect to the new behavior. We will need to do a pass through all of it to see if there are any other changes needed.

One new feature is that we now track the ancestry of the image. Each time we create an image, we append to /var/log/ancestry.log a line with the timestamp and the name of the node we are building the image from. This file thus records the genetic heritage of the node image.
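For illustration, here is a minimal Python sketch of what appending an ancestry entry could look like; the helper name and exact line format are assumptions, not the actual vmss-prototype code:

```python
from datetime import datetime, timezone

# Path mentioned in the PR description.
ANCESTRY_LOG = "/var/log/ancestry.log"

def append_ancestry(node_name: str, log_path: str = ANCESTRY_LOG) -> None:
    """Append a "<timestamp> <node name>" line to the ancestry log.

    Illustrative only: the real vmss-prototype code may format the
    line differently.
    """
    timestamp = datetime.now(timezone.utc).isoformat()
    with open(log_path, "a") as ancestry:
        ancestry.write(f"{timestamp} {node_name}\n")

# Example: record the donor node used for this image build.
# append_ancestry("k8s-agentpool1-33778956-vmss000001")
```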
This is a complete pull from a new log (redacted to hide actual cluster and subscription)
2021-04-01T21:32:47.143864731Z k8s-agentpool1-33778956-vmss INFO: ===> Executing command: ['kubectl' 'annotate' 'node' 'k8s-agentpool1-33778956-vmss000001' 'cluster-autoscaler.kubernetes.io/scale-down-disabled-']
2021-04-01T21:32:47.233459689Z k8s-agentpool1-33778956-vmss INFO: Creating sig image version - this can take quite a long time...
2021-04-01T21:32:47.233523790Z k8s-agentpool1-33778956-vmss INFO: ===> Executing command: ['az' 'sig' 'image-version' 'create' '--subscription' '00000000-0000-0000-0000-000000000001' '--resource-group' 'testCluster2' '--gallery-name' 'SIG_testCluster2' '--gallery-image-definition' 'kamino-k8s-agentpool1-33778956-vmss-prototype' '--gallery-image-version' '2021.04.01' '--replica-count' '3' '--os-snapshot' 'snapshot_k8s-agentpool1-33778956-vmss' '--tags' 'BuiltFrom=k8s-agentpool1-33778956-vmss000001' 'BuiltAt=2021-04-01 21:32:35.083641' '--storage-account-type' 'Standard_ZRS']
2021-04-01T21:43:34.575631298Z k8s-agentpool2-33778956-vmss INFO: ===> Completed in 664.61s: ['az' 'sig' 'image-version' 'create' '--subscription' '00000000-0000-0000-0000-000000000001' '--resource-group' 'testCluster2' '--gallery-name' 'SIG_testCluster2' '--gallery-image-definition' 'kamino-k8s-agentpool2-33778956-vmss-prototype' '--gallery-image-version' '2021.04.01' '--replica-count' '3' '--os-snapshot' 'snapshot_k8s-agentpool2-33778956-vmss' '--tags' 'BuiltFrom=k8s-agentpool2-33778956-vmss000000' 'BuiltAt=2021-04-01 21:31:23.482360' '--storage-account-type' 'Standard_ZRS'] # RC=0
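In the first log line, the trailing `-` on `cluster-autoscaler.kubernetes.io/scale-down-disabled-` is kubectl's syntax for removing that annotation from the donor node. As a rough sketch of the long-running step that follows, here is how the `az sig image-version create` call in the log could be assembled and executed; the wrapper function is an assumption for illustration, and only the CLI arguments come from the log excerpt:

```python
import subprocess
from datetime import datetime, timezone

def create_sig_image_version(subscription, resource_group, gallery_name,
                             image_definition, image_version, os_snapshot,
                             built_from, replica_count=3):
    """Build and run an `az sig image-version create` call.

    Illustrative wrapper only; the argument names mirror the log excerpt
    above, not the actual vmss-prototype implementation.
    """
    cmd = [
        "az", "sig", "image-version", "create",
        "--subscription", subscription,
        "--resource-group", resource_group,
        "--gallery-name", gallery_name,
        "--gallery-image-definition", image_definition,
        "--gallery-image-version", image_version,
        "--replica-count", str(replica_count),
        "--os-snapshot", os_snapshot,
        "--tags",
        f"BuiltFrom={built_from}",
        f"BuiltAt={datetime.now(timezone.utc)}",
        "--storage-account-type", "Standard_ZRS",
    ]
    print(f"===> Executing command: {cmd}")
    # check=True surfaces a non-zero return code as an exception.
    subprocess.run(cmd, check=True)
```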
Note that this is getting the new SIG image copy performance that has not yet GA'ed.
This is why it is so much faster than before: just over 11 minutes to copy/create the SIG image. The prior run took 7,000 and 12,000 seconds (vs. these two at 664 seconds). That is an order-of-magnitude improvement, as shown in this test run.
Jack, we need to go through the usage docs and aim people at the right thing: the low-level usage and the more automated usage. My guess is that we should have the basic usage be the automatic form, with the more detailed, lower-level usage for the "advanced" users.
For the record, v0.57.0 was the version of aks-engine that removed the zero-byte cloud-init-paved azure.json: https://github.com/Azure/aks-engine/releases/tag/v0.57.0 This change:
/lgtm
/lgtm