This repository has been archived by the owner on Oct 12, 2023. It is now read-only.

Performance Improvement - reduce identity assignment time #219

Merged on Jul 1, 2019 (7 commits)


aramase
Member

@aramase aramase commented May 17, 2019

resolves #145, #181, #217

  • Batch process id assignments to node on each sync cycle (pooling delete and add ids, issuing one CreateOrUpdate call for each sync cycle)
  • Reduces identity assignment for 44 pods from 29m to ~3m
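
The batching described in the first bullet can be sketched roughly as follows. This is a minimal illustration and not the code from this PR: the names `change`, `nodeBatch`, and `batchByNode` are hypothetical. The only point is that all adds and removes for a node are pooled during a sync cycle, so that one update call per node is enough.

```go
package main

import (
	"fmt"
	"sort"
)

// change is a hypothetical record of one pending identity operation on a node.
type change struct {
	Node       string
	ResourceID string
	Remove     bool // false means "add"
}

// nodeBatch pools everything that has to happen on one node in a sync cycle.
type nodeBatch struct {
	Add    []string
	Remove []string
}

// batchByNode groups individual changes so that each node needs only one update call.
func batchByNode(changes []change) map[string]*nodeBatch {
	batches := make(map[string]*nodeBatch)
	for _, c := range changes {
		b, ok := batches[c.Node]
		if !ok {
			b = &nodeBatch{}
			batches[c.Node] = b
		}
		if c.Remove {
			b.Remove = append(b.Remove, c.ResourceID)
		} else {
			b.Add = append(b.Add, c.ResourceID)
		}
	}
	return batches
}

func main() {
	batches := batchByNode([]change{
		{Node: "node-0", ResourceID: "id-a"},
		{Node: "node-0", ResourceID: "id-b"},
		{Node: "node-1", ResourceID: "id-a"},
		{Node: "node-0", ResourceID: "id-c", Remove: true},
	})
	nodes := make([]string, 0, len(batches))
	for n := range batches {
		nodes = append(nodes, n)
	}
	sort.Strings(nodes)
	for _, n := range nodes {
		// A real controller would issue one cloud update per node here
		// instead of one call per identity.
		fmt.Printf("%s: add=%v remove=%v\n", n, batches[n].Add, batches[n].Remove)
	}
}
```

With this shape, the number of cloud calls grows with the number of nodes touched in a cycle rather than with the number of pods.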

Add states to the assigned identities

  1. Created - the assigned identity is in the Created state when the CRD is created.
  2. Assigned - once the identity has been successfully assigned to the underlying node, the state of the assigned identity changes to Assigned.
  3. Unassigned - the assigned identity is in this state once the identity has been successfully removed from the underlying node.
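
As a rough sketch, the three states could be modeled as a small string type. The type name, constant names, and the strictly linear Created -> Assigned -> Unassigned ordering are assumptions made for illustration; the PR text only lists the states.

```go
package main

import "fmt"

// State is a hypothetical type for the lifecycle states listed above.
type State string

const (
	StateCreated    State = "Created"
	StateAssigned   State = "Assigned"
	StateUnassigned State = "Unassigned"
)

// canTransition reports whether a move between states follows the assumed
// linear lifecycle: Created -> Assigned -> Unassigned.
func canTransition(from, to State) bool {
	switch from {
	case StateCreated:
		return to == StateAssigned
	case StateAssigned:
		return to == StateUnassigned
	default:
		return false
	}
}

func main() {
	fmt.Println(canTransition(StateCreated, StateAssigned))
	fmt.Println(canTransition(StateUnassigned, StateAssigned))
}
```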

Note

  1. The assigned identity is created first, and then assignment of the identity to the node is attempted. If the assignment fails, the assigned identity is deleted. This behavior is unchanged from the previous version.
  2. Assigned identities are deleted only after the identity has been successfully removed from the node. This is a change in behavior from the previous version, made to ensure we successfully remove the identities from the underlying node.
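
The ordering in the two notes above can be sketched as follows. All names (`ops`, `add`, `remove`, `fakeOps`) are hypothetical and the operations are stubbed with recording fakes; this only illustrates the order of steps, not the controller code in this PR.

```go
package main

import (
	"errors"
	"fmt"
)

// ops bundles the four operations this sketch needs. All names are hypothetical.
type ops struct {
	createCRD, deleteCRD, assignToNode, removeFromNode func(name string) error
}

// add creates the assigned identity, then attempts the node assignment,
// and deletes the assigned identity again if the assignment fails (note 1).
func (o ops) add(name string) error {
	if err := o.createCRD(name); err != nil {
		return err
	}
	if err := o.assignToNode(name); err != nil {
		if delErr := o.deleteCRD(name); delErr != nil {
			return fmt.Errorf("assign failed: %v; cleanup failed: %v", err, delErr)
		}
		return err
	}
	return nil
}

// remove deletes the assigned identity only after the node removal succeeded (note 2).
func (o ops) remove(name string) error {
	if err := o.removeFromNode(name); err != nil {
		return err // the assigned identity is kept
	}
	return o.deleteCRD(name)
}

// fakeOps records every call in calls; node operations fail when failNode is true.
func fakeOps(calls *[]string, failNode bool) ops {
	record := func(step string, fail bool) func(string) error {
		return func(name string) error {
			*calls = append(*calls, step)
			if fail {
				return errors.New(step + " failed for " + name)
			}
			return nil
		}
	}
	return ops{
		createCRD:      record("createCRD", false),
		deleteCRD:      record("deleteCRD", false),
		assignToNode:   record("assignToNode", failNode),
		removeFromNode: record("removeFromNode", failNode),
	}
}

func main() {
	var calls []string
	err := fakeOps(&calls, true).add("pod-a-identity")
	fmt.Println(calls, err)

	calls = nil
	err = fakeOps(&calls, true).remove("pod-a-identity")
	fmt.Println(calls, err)
}
```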

@aramase aramase changed the title [WIP] Taking care of some todos [WIP] Performance Improvement - reduce identity assignment time May 20, 2019
@kkmsft
Contributor

kkmsft commented May 21, 2019

Thank you @aramase. A good e2e test would validate that #145 is resolved and that the identities are assigned in a reasonable amount of time.

pkg/cloudprovider/identity.go Outdated (resolved)
pkg/cloudprovider/vm.go Outdated (resolved)
pkg/mic/mic.go Outdated
fmt.Sprintf("Lookup of node %s for pod %s resulted in error %v", createID.Spec.NodeName, createID.Name, err))
continue
}
err = c.CloudClient.CheckUserMSI(id.Spec.ResourceID, node)
Member

Do we need the new interface? Can remove just do the right thing?

@aramase
Member Author

aramase commented May 22, 2019

@kkmsft I was able to validate #145 with my changes. I created close to 44 pods with 2 assigned identities. With version 1.3, it took about 29m to create all assigned identities and assign the MSI to the node. With my changes, it takes only about a minute, since all the updates are batched into a single call. I still have a lot of refactoring to do on my PR, though.

@kkmsft
Contributor

kkmsft commented May 22, 2019

Thank you @aramase. Good to know that it's moving in the right direction. I have not looked deeply into the changes since our initial discussion on the approach. Please let me know when you want me to start looking at them.

Also, any chance (I know I am being greedy here ;-)) you can convert that test to an e2e test, so that we can track any regression in the future?

@aramase
Member Author

aramase commented May 22, 2019

@kkmsft Sure, I'll add that as an e2e. Still working on testing a couple of things. I'll need to refactor the PR. I'll let you know when it's ready for review.

@@ -112,6 +115,59 @@ func withInspection() autorest.PrepareDecorator {
}
}

// CheckUserMSI checks if the identity has been assigned to the node
func (c *Client) CheckUserMSI(userAssignedMSIID string, node *corev1.Node) bool {
idH, _, err := c.getIdentityResource(node)
Contributor

Aren't these calls costly, in the sense that they have to call ARM every time?

Member Author

@kkmsft You're right! We can avoid the multiple calls to ARM by getting the list once and then validating the existence or non-existence of all assigned ids against that list.
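
A minimal sketch of that idea: fetch the node's identity list once, build a set from it, and check every wanted id against the set. The helper names `toSet` and `missingIDs` are hypothetical, and the case-insensitive comparison is an assumption of this sketch rather than something stated in the PR.

```go
package main

import (
	"fmt"
	"strings"
)

// toSet builds a lookup set from the ids fetched once for a node.
// Keys are lower-cased so lookups are case-insensitive (an assumption here).
func toSet(ids []string) map[string]struct{} {
	set := make(map[string]struct{}, len(ids))
	for _, id := range ids {
		set[strings.ToLower(id)] = struct{}{}
	}
	return set
}

// missingIDs returns the wanted ids that are not present on the node.
// A nil onNode list means every wanted id is reported as missing.
func missingIDs(onNode, wanted []string) []string {
	set := toSet(onNode)
	var missing []string
	for _, id := range wanted {
		if _, ok := set[strings.ToLower(id)]; !ok {
			missing = append(missing, id)
		}
	}
	return missing
}

func main() {
	onNode := []string{"id-a", "id-b"} // result of a single (hypothetical) fetch
	fmt.Println(missingIDs(onNode, []string{"ID-A", "id-c"}))
}
```

One fetch per node then serves any number of membership checks in the same sync cycle.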

pkg/mic/mic.go Outdated
@@ -261,24 +275,48 @@ func (c *Client) Sync(exit <-chan struct{}) {

glog.V(5).Infof("Initiating assigned id creation for pod - %s, binding - %s", createID.Spec.Pod, binding.Name)

err = c.createAssignedIdentityDeps(&createID, id, node)
err = c.createAssignedIdentity(&createID)
Contributor

Is there a way we can perform this create when we start looking at the node map? Since we don't track the state of the assigned identities, we could have a large time gap between the time the assignment is created and the time the node assignment is made. Is that a possibility? If so, can we mitigate it somehow without additional state?

Member Author

@kkmsft This mimics the current logic. Currently, we create the assigned identity and then make the request for node assignment. The time gap remains the same with this change, because we batch all the assignments and make a single call, so the time for assignment also remains the same. There shouldn't be a long time gap here.

@@ -159,9 +163,11 @@ func (c *TestCloudClient) CompareMSI(nodeName string, userIDs []string) bool {

func (c *TestCloudClient) PrintMSI() {
for key, val := range c.ListMSI() {
fmt.Printf("\nNode name: %s\n", key)
Contributor

Are these debugging logs?

Member Author

Yeah, I'll clean that up.

@aramase
Member Author

aramase commented Jun 17, 2019

Tests passed -

Ran 8 of 9 Specs in 1339.441 seconds
SUCCESS! -- 8 Passed | 0 Failed | 0 Pending | 1 Skipped
--- PASS: TestAADPodIdentity (1339.44s)
PASS
ok  	github.com/Azure/aad-pod-identity/test/e2e	1339.459s

@kkmsft

@aramase aramase changed the title [WIP] Performance Improvement - reduce identity assignment time Performance Improvement - reduce identity assignment time Jun 17, 2019
pkg/mic/mic.go Outdated
glog.V(5).Infof("Initiating assigned id creation for pod - %s, binding - %s", createID.Spec.Pod, binding.Name)

if isUserAssignedMSI {
err := c.createAssignedIdentity(&createID)
Member Author

The assigned identity will only be created once the identity assignment on the node is successfully complete.

pkg/mic/mic.go Outdated
fmt.Sprintf("Binding %s removed from node %s for pod %s", removedBinding.Name, delID.Spec.NodeName, delID.Spec.Pod))

// remove assigned identity crd from cluster
if err := c.removeAssignedIdentity(&delID); err != nil {
Member Author

Similarly, removing the assigned identity only when the identity has been successfully removed from the node

@@ -231,7 +231,10 @@ var _ = Describe("Kubernetes cluster using aad-pod-identity", func() {
identityValidatorName := fmt.Sprintf("identity-validator-%d", i)

setUpIdentityAndDeployment(keyvaultIdentity, fmt.Sprintf("%d", i))
time.Sleep(5 * time.Second)

Member Author

Removed the sleep, so the test can actually proceed once the assigned identity exists.

}

// AppendOrRemoveUserMSI will batch process the removal and addition of ids
func (c *Client) AppendOrRemoveUserMSI(addUserAssignedMSIIDs []string, removeUserAssignedMSIIDs []string, node *corev1.Node) error {
Contributor

Is it possible to split this into two separate functions, one that adds and another that removes, so that we can add UTs for them individually?

Contributor

Or maybe the name could be UpdateUserMSI. AppendOrRemove feels like CreateOrUpdate, which does a different thing.

Member Author

Renamed the method
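
For illustration, the list arithmetic behind a combined add/remove update could look like the sketch below. `applyUpdate` is a hypothetical helper, not the `UpdateUserMSI` method from this PR. It assumes removals are applied before additions (so an id that appears in both lists ends up present) and that the result contains no duplicates.

```go
package main

import "fmt"

// applyUpdate returns the identity list that results from removing removeIDs
// from current and then appending addIDs, keeping each id at most once.
func applyUpdate(current, addIDs, removeIDs []string) []string {
	remove := make(map[string]bool, len(removeIDs))
	for _, id := range removeIDs {
		remove[id] = true
	}
	seen := make(map[string]bool)
	var result []string
	for _, id := range current {
		if remove[id] || seen[id] {
			continue
		}
		seen[id] = true
		result = append(result, id)
	}
	for _, id := range addIDs {
		if seen[id] {
			continue
		}
		seen[id] = true
		result = append(result, id)
	}
	return result
}

func main() {
	current := []string{"id-a", "id-b"}
	fmt.Println(applyUpdate(current, []string{"id-c", "id-a"}, []string{"id-b"}))
}
```

Keeping this computation in a pure function makes it straightforward to unit test the add and remove paths separately, even if the cloud call itself stays a single method.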

pkg/mic/mic.go Outdated (resolved)
pkg/mic/mic.go Outdated

func (c *Client) appendToRemoveListForNode(nodeMap map[string]trackUserAssignedMSIIds, resourceID string, node *corev1.Node) {
if trackList, ok := nodeMap[node.Name]; ok {
if trackList.removeUserAssignedMSIIDs != nil {
Contributor

Is this check necessary?
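
For context on the question above: in Go, `append` works on a nil slice, and reading a missing map key yields the zero-value struct, so a helper of this kind can be written without a nil check. The sketch below uses a stand-in type `trackList` with assumed field names; it does not show what the final PR code does.

```go
package main

import "fmt"

// trackList is a stand-in for the per-node tracking struct in the snippet above.
// The field names are assumptions for illustration.
type trackList struct {
	addIDs    []string
	removeIDs []string
}

// appendToRemoveList records resourceID for removal on the named node.
// No nil check is needed: append allocates when the slice is nil, and a
// missing map key yields the zero-value struct.
func appendToRemoveList(nodeMap map[string]trackList, nodeName, resourceID string) {
	t := nodeMap[nodeName] // zero value if the node is not tracked yet
	t.removeIDs = append(t.removeIDs, resourceID)
	nodeMap[nodeName] = t // struct values stored in a map must be written back
}

func main() {
	m := make(map[string]trackList)
	appendToRemoveList(m, "node-0", "id-a")
	appendToRemoveList(m, "node-0", "id-b")
	fmt.Println(m["node-0"].removeIDs)
}
```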

pkg/mic/mic.go Outdated
if err != nil {
// check which all identity assignment failed
// remove those assigned identities
// TODO check with identity team if CreateOrUpdate results in error, is it all or some failed
Contributor

Can we do this before the PR gets merged?

Member Author

Sent out an email.

pkg/mic/mic.go Outdated (resolved)
pkg/mic/mic.go Outdated
continue
}
idList, err := c.getUserMSIListForNode(node)
if err != nil {
Contributor

What is the plan here? If we just log and continue, couldn't idList end up holding arbitrary values?

Member Author

getUserMSIListForNode returns nil on error. So we'll fail all of them.

pkg/mic/mic.go Outdated

err = c.CloudClient.AssignUserMSI(id.Spec.ResourceID, node)
nodeIdentityList := make(map[string][]string)
for _, n := range nodesWithError {
Contributor

What happens if we crash or get restarted at this point?

@@ -626,6 +626,8 @@ func setUpIdentityAndDeployment(azureIdentityName, suffix, replicas string) {
ok, err = daemonset.WaitOnReady(nmiDaemonSet)
Expect(err).NotTo(HaveOccurred())
Expect(ok).To(Equal(true))

time.Sleep(30 * time.Second)
Member Author

We can remove this sleep in the states PR, because there we can wait for n assigned identities to reach the desired state.
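
A fixed sleep can generally be replaced by polling for a condition with a timeout. The sketch below shows the general pattern only; `waitFor` is a hypothetical helper, and the check function stands in for "n assigned identities are in the desired state". It is not taken from the e2e test code.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// waitFor polls check every interval until it returns true, returns an error,
// or the timeout elapses.
func waitFor(check func() (bool, error), interval, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for {
		ok, err := check()
		if err != nil {
			return err
		}
		if ok {
			return nil
		}
		if time.Now().After(deadline) {
			return errors.New("timed out waiting for condition")
		}
		time.Sleep(interval)
	}
}

func main() {
	calls := 0
	err := waitFor(func() (bool, error) {
		calls++
		// Stand-in for a real check against the cluster.
		return calls >= 3, nil
	}, 10*time.Millisecond, time.Second)
	fmt.Println(err, calls)
}
```

The test then proceeds as soon as the condition holds, and fails with a clear error instead of continuing on stale state when it never does.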

Member Author

@aramase aramase left a comment

Ran 9 of 10 Specs in 2221.232 seconds
SUCCESS! -- 9 Passed | 0 Failed | 0 Pending | 1 Skipped
--- PASS: TestAADPodIdentity (2221.23s)
PASS
ok github.com/Azure/aad-pod-identity/test/e2e 2221.249s

@aramase aramase added this to the v1.5 milestone Jun 21, 2019
@aramase aramase force-pushed the refactor branch 2 times, most recently from 5ecf29d to 20e4e14 Compare June 25, 2019 05:54
wip

batch process createorupdate

update logic

update events

fix events

generate set to use unique values

update interface

reorder check flow

update to using one get per node

update tests

add e2e test for testing scale perf

create assigned identity before assignment

add unit tests for UpdateUserMSI interface
@aramase
Member Author

aramase commented Jun 25, 2019

[Screenshot: Screen Shot 2019-06-25 at 12 40 56 PM]

}

// getListOfIdsToDelete will go over the delete list to determine if the id is required to be deleted
// only user assigned identity not in use are added to the remove list for the node
Contributor

Nit: two sentences.

@aramase
Member Author

aramase commented Jul 1, 2019

Ran 10 of 10 Specs in 2353.337 seconds
SUCCESS! -- 10 Passed | 0 Failed | 0 Pending | 0 Skipped
--- PASS: TestAADPodIdentity (2353.34s)
PASS
ok github.com/Azure/aad-pod-identity/test/e2e 2353.353s

3 participants