Skip to content
This repository has been archived by the owner on Oct 12, 2023. It is now read-only.

NMI retries and ticker for periodic sync reconcile #272

Merged
merged 8 commits into from
Jul 11, 2019

Conversation

aramase
Copy link
Member

@aramase aramase commented Jul 2, 2019

Reason for Change:

  • Wrap list pod ids call with retries
  • Filter assigned ids based on state in NMI
  • Change the status code to 404 instead of 403 when no valid assigned ids found after retries.
  • Add a ticker to periodically run sync cycle to reconcile any crds that are in error state. This is required because if no pod events are generated, then the sync loop will never get triggered to resync.
  • Make the sync retry interval configurable

Issue Fixed:

fixes #257

Notes for Reviewers:

pkg/crd/crd.go Outdated
@@ -267,7 +267,7 @@ func (c *Client) ListPodIds(podns, podname string) (*[]aadpodid.AzureIdentity, e

var matchedIds []aadpodid.AzureIdentity
for _, v := range azAssignedIDList.(*aadpodid.AzureAssignedIdentityList).Items {
if v.Spec.Pod == podname && v.Spec.PodNamespace == podns {
if v.Spec.Pod == podname && v.Spec.PodNamespace == podns && v.Status.Status == "Assigned" {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we make these states a constant ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was my initial plan, but the only confusion is IdentityCreated is an int (EventType) in the v1 pkg.

pkg/mic/mic.go Outdated
glog.Info("Sync thread started.")
var event aadpodid.EventType
for {
select {
case <-exit:
return
case event = <-c.EventChannel:
glog.V(6).Infof("Received event: %v", event)
case <-ticker.C:
glog.V(6).Infof("Running sync retry loop")
}

stats.Init()
// This is the only place where the AzureAssignedIdentity creation is initiated.
begin := time.Now()
workDone := false
glog.V(6).Infof("Received event: %v", event)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What will event show here ?

msg := fmt.Sprintf("no AzureAssignedIdentity found for pod:%s/%s", podns, podname)
logger.Errorf("%s, %+v", msg, err)
http.Error(w, msg, http.StatusForbidden)
http.Error(w, msg, http.StatusNotFound)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hosthandler and mishandler looks identical through most parts.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please unify the new code which you are adding into one single function.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll add this as a TODO and refactor in the next PR. Have a few questions on why we check filter identities using force namespaced only in 1 handler and not another. We can go over that during the review.

logger.Warningf("failed to get assigned ids for pod:%s/%s, retrying attempt: %d", podns, podname, attempt)
time.Sleep(listPodIDsRetryIntervalInSeconds * time.Second)
}
return nil, fmt.Errorf("getting assigned identities for pod %s/%s failed after %d attempts. Error: %v", podns, podname, maxAttempts, err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will this error propagate to the caller in the application ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this will be propagated if it fails after retries.

}
attempt++
logger.Warningf("failed to get assigned ids for pod:%s/%s, retrying attempt: %d", podns, podname, attempt)
time.Sleep(listPodIDsRetryIntervalInSeconds * time.Second)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where do we break if we go >= attempt

for {
podIDs, err = kubeClient.ListPodIds(podns, podname)
if err == nil && len(*podIDs) != 0 {
if len(rqClientID) == 0 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there are two kinds of error, one is that after a few seconds, we still did not find the AzureAssignedIdentity and another is that it never moved to Assigned. Not able to gather that different from retry attempts and varying time durations for them..?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The retry attempts are not different based on the states. The final desired state is Assigned. So I think this encompasses all possible scenarios.

@aramase
Copy link
Member Author

aramase commented Jul 5, 2019

Ran 10 of 10 Specs in 4998.515 seconds
SUCCESS! -- 10 Passed | 0 Failed | 0 Pending | 0 Skipped
--- PASS: TestAADPodIdentity (4998.52s)
PASS
ok github.com/Azure/aad-pod-identity/test/e2e 4998.532s

@@ -111,7 +111,7 @@ var _ = Describe("Kubernetes cluster using aad-pod-identity", func() {
fmt.Println("\nTearing down the test environment...")

// Ensure a clean cluster after the end of each test
cmd := exec.Command("kubectl", "delete", "AzureIdentity,AzureIdentityBinding,AzureAssignedIdentity", "--all")
cmd := exec.Command("kubectl", "delete", "AzureIdentity,AzureIdentityBinding", "--all")
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removing azureassignedidentity from the delete command. It should be deleted by MIC once the identity and binding are removed and it's a good test to ensure we reliably delete the assigned identities.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if that is the exception we should add the check at the end to see if no AzureAssignedIdentities exists and fail the test if it does.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already ensure the length of azureassignedidentities is equal to 0 at the end of AfterEach.

@aramase aramase changed the title [WIP] NMI retries and ticker for periodic sync reconcile NMI retries and ticker for periodic sync reconcile Jul 8, 2019
@@ -141,7 +141,7 @@ e2e:

.PHONY: unit-test
unit-test:
go test $(shell go list ./... | grep -v /test/e2e) -v
go test -count=1 $(shell go list ./... | grep -v /test/e2e) -v
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the usage of count here ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's to ensure the test runs at least once instead of showing the cached result.

pkg/crd/crd.go Outdated
@@ -268,7 +268,11 @@ func (c *Client) ListPodIds(podns, podname string) (*[]aadpodid.AzureIdentity, e
var matchedIds []aadpodid.AzureIdentity
for _, v := range azAssignedIDList.(*aadpodid.AzureAssignedIdentityList).Items {
if v.Spec.Pod == podname && v.Spec.PodNamespace == podns {
matchedIds = append(matchedIds, *v.Spec.AzureIdentityRef)
for _, desiredState := range assignedIDStates {
if v.Status.Status == desiredState {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

strings.EqualFold ? or compareState function which we can change as per version ?

func listPodIDsInCreatedStateWithRetries(ctx context.Context, kubeClient k8s.Client, logger *log.Entry, podns, podname, rqClientID string, maxAttempts int) (*[]aadpodid.AzureIdentity, error) {
// created and assigned states are added to statesToCheck. This is to ensure if the call is made after the assigned identity is in assigned state
// for the pod, we shouldn't fail on the hard check for created state. Both states are accepted values for passing first check.
return listPodIDsWithRetry(ctx, kubeClient, logger, podns, podname, rqClientID, []string{aadpodid.AssignedIDCreated, aadpodid.AssignedIDAssigned}, maxAttempts)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are both state assigned ids added when the function name indicates we are going to gather Created ones.

logger.Errorf("%s, %+v", msg, err)
http.Error(w, msg, http.StatusForbidden)
http.Error(w, msg, http.StatusNotFound)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we have to do it two times ? Shouldn't we check and do accordingly ?

cmd/mic/main.go Outdated
@@ -24,6 +26,8 @@ func main() {
flag.StringVar(&cloudconfig, "cloudconfig", "", "Path to cloud config e.g. Azure.json file")
flag.BoolVar(&forceNamespaced, "forceNamespaced", false, "Forces namespaced identities, binding, and assignment")
flag.BoolVar(&versionInfo, "version", false, "Prints the version information")
flag.StringVar(&syncRetryInterval, "syncRetryInterval", "3600s", "The interval in seconds at which sync loop should periodically check for errors and reconcile.")
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change to duration var

@aramase
Copy link
Member Author

aramase commented Jul 10, 2019

Ran 11 of 11 Specs in 3839.814 seconds
SUCCESS! -- 11 Passed | 0 Failed | 0 Pending | 0 Skipped
--- PASS: TestAADPodIdentity (3839.81s)
PASS
ok  	github.com/Azure/aad-pod-identity/test/e2e	3839.837s

for _, v := range azAssignedIDList.(*aadpodid.AzureAssignedIdentityList).Items {
if v.Spec.Pod == podname && v.Spec.PodNamespace == podns {
matchedIds = append(matchedIds, *v.Spec.AzureIdentityRef)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we please add a comment in the next PR indicating the backward compatibility aspects.

@@ -85,6 +85,8 @@ metadata:
name: nmi
namespace: {{.Namespace}}
spec:
updateStrategy:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the significance of this ?

}
logger.Warningf("failed to get assigned ids for pod:%s/%s in ASSIGNED state, retrying attempt: %d", podns, podname, attempt)
}
return nil, fmt.Errorf("getting assigned identities for pod %s/%s in ASSIGNED state failed after %d attempts. Error: %v", podns, podname, 2*maxAttemptsPerCheck, err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be good to have the duration in the error logs on how much it tried to do retry.

@kkmsft kkmsft merged commit 5ff008a into Azure:master Jul 11, 2019
@aramase aramase deleted the nmi-retry branch July 11, 2019 17:24
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Retries in NMI and return 404 when assigned identity not found
2 participants