🌱 Fix self-hosted flakes in E2E tests #3639

fabriziopandini · 2020-09-15T10:03:57Z

What this PR does / why we need it:
This PR aims to fix self-hosted flakes by ensuring that MHC does not run during this test and by adding two delays so that move is not performed too aggressively.

To make this happen, the entire management of cluster templates was re-architected so that:

cluster templates for tests are now generated from a set of composable bases
MHC-remediation has a new dedicated template
all the other templates are without MHC
also, in order to simplify things, this PR drops the distinction between CI and DEV test config
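
As a rough illustration of generating templates from composable bases, the sketch below prints one `kustomize build` invocation per template flavor. It only prints the commands and does not run them. The base directory comes from the file list in this PR; the per-flavor subdirectories other than `cluster-template` are assumptions, and the PR's actual Makefile targets may differ.

```shell
#!/bin/sh
# Base directory taken from the file list in this PR. The per-flavor
# subdirectory names other than "cluster-template" are assumptions.
TEMPLATES_DIR="test/e2e/data/infrastructure-docker"
FLAVORS="cluster-template cluster-template-mhc cluster-template-kcp-adoption"

# Print (do not run) the kustomize invocation that would render one flavor.
render_cmd() {
  printf 'kustomize build %s/%s > %s/%s.yaml\n' \
    "$TEMPLATES_DIR" "$1" "$TEMPLATES_DIR" "$1"
}

for flavor in $FLAVORS; do
  render_cmd "$flavor"
done
```

Keeping the MHC object in its own flavor means only the MHC-remediation test picks it up, while every other template is rendered without it.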

On top of that, this PR includes a set of nits for improving test logs (mostly uppercase)

Which issue(s) this PR fixes:
Fixes #3589

fabriziopandini · 2020-09-15T10:04:16Z

/milestone v0.3.10

fabriziopandini · 2020-09-15T10:26:42Z

/retest

test/e2e/data/infrastructure-docker/cluster-template-kcp-adoption.yaml

test/e2e/data/infrastructure-docker/cluster-template-mhc.yaml

test/e2e/data/infrastructure-docker/cluster-template.yaml

test/e2e/config/docker-dev.yaml

test/e2e/data/infrastructure-docker/metadata.yaml

test/e2e/kcp_adoption.go

fabriziopandini · 2020-09-15T13:31:34Z

/hold
Investigating an error in ci-e2e.sh that shows up in CI only, while locally everything works just fine 🤔

scripts/ci-e2e.sh

test/e2e/Makefile

vincepri · 2020-09-15T15:16:19Z

From the PR's description, it seems we're now splitting manifests in order to separate MHC components? Is this something we would expect users to do as well?

During move, we should be able to pause controllers and check whether the cluster is in a state where it is safe to move. Is there anything stopping us from doing so?

Can we split this PR into multiple ones? In particular:

also, in order to simplify things, this PR drops the distinction between CI and DEV test config
On top of that, this PR includes a set of nits for improving test logs (mostly uppercase)

These changes seem outside of the PR's scope

sedefsavas

Thanks for the fix!

This PR also addresses this issue: #3461
I am fine with either adding it to the title and to the list of issues fixed by this PR, or splitting it out into a separate PR.

test/e2e/config/docker.yaml

test/e2e/data/infrastructure-docker/cluster-template/kustomization.yaml

test/e2e/Makefile

test/framework/machine_helpers.go

sedefsavas · 2020-09-16T05:19:54Z

@fabriziopandini Locally, it works for me as well.

/test pull-cluster-api-e2e

sedefsavas · 2020-09-16T05:36:03Z

In test/e2e/Makefile

cd $(TOOLS_DIR); go build -tags=tools -o $(BIN_DIR)/kustomize sigs.k8s.io/kustomize/kustomize/v3

should be

cd $(TOOLS_DIR) && go build -tags=tools -o $(BIN_DIR)/kustomize sigs.k8s.io/kustomize/kustomize/v3
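
The difference only shows up when `cd` fails. With `;` the next command still runs (in whatever directory the shell was already in), and the recipe line's exit status is that of the last command; with `&&` the next command is skipped and the failure propagates. A minimal sketch of that behaviour follows, where the directory name is a hypothetical stand-in for `$(TOOLS_DIR)` and `echo ran` stands in for the `go build` step:

```shell
#!/bin/sh
# Hypothetical directory that does not exist, standing in for $(TOOLS_DIR).
missing_dir="/nonexistent-tools-dir"

# With ';' the second command runs even though cd failed.
with_semicolon=$(sh -c "cd $missing_dir 2>/dev/null; echo ran")

# With '&&' the second command is skipped when cd fails.
# '|| true' only keeps this demo script going after the expected failure.
with_and=$(sh -c "cd $missing_dir 2>/dev/null && echo ran" || true)

echo "semicolon: '${with_semicolon}'"
echo "and: '${with_and}'"
```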

fabriziopandini · 2020-09-16T13:11:03Z

@vincepri

From the PR's description, it seems we're now splitting manifests in order to separate MHC components? Is this something we would expect users to do as well?

No. The current MHC object in E2E tests is specifically configured to always trigger remediation 30s after the node is started, and I removed it from the other E2E tests in order to avoid flakes caused by remediation kicking in when execution is slow.

During move, we should be able to pause controllers and check whether the cluster is in a state where it is safe to move. Is there anything stopping us from doing so?

This is not how the move logic works. Currently we pause reconciliation on the clusters included in the move scope, but the controllers continue to run.
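
To make the distinction concrete, here is a minimal sketch of a per-object pause, assuming it is expressed by setting `spec.paused` on the Cluster object. The namespace and cluster name are hypothetical, and the script only prints the `kubectl` command rather than running it. Note that nothing here stops the controller process itself, which is the point made above.

```shell
#!/bin/sh
# Build the JSON merge patch that toggles spec.paused on a Cluster object.
# $1: "true" or "false"
pause_patch() {
  printf '{"spec":{"paused":%s}}' "$1"
}

# Print (do not run) the kubectl command that would apply the patch.
# $1: namespace, $2: cluster name, $3: "true" or "false"
pause_cmd() {
  printf "kubectl -n %s patch cluster %s --type merge -p '%s'\n" \
    "$1" "$2" "$(pause_patch "$3")"
}

# Hypothetical namespace and cluster name, for illustration only.
pause_cmd default my-cluster true
```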

Can we split this PR into multiple ones?

Done! See:
#3649
#3650
#3651

These changes seem outside of the PR's scope

Some of them are nits, but refactoring the templates for MHC is potentially related to the self-hosted flakes; see the note above about MHC and #3589 (comment).
However, it is no problem to move them to another PR.

fabriziopandini · 2020-09-16T13:12:15Z

/hold cancel
/test pull-cluster-api-e2e-full

vincepri · 2020-09-16T13:59:45Z

No. The current MHC object in E2E tests is specifically configured to always trigger remediation 30s after the node is started, and I removed it from the other E2E tests in order to avoid flakes caused by remediation kicking in when execution is slow.

I assume this change has moved to the other PR, so I'll take a look there before commenting.

This is not how the move logic works. Currently we pause reconciliation on the clusters included in the move scope, but the controllers continue to run.

We should probably consider stopping all controllers before move, for safety and to prevent any slow-running worker from performing actions while move is in progress.

vincepri · 2020-09-16T15:49:58Z

/test pull-cluster-api-e2e-full

randomvariable · 2020-09-16T16:14:31Z

/lgtm

pending e2e pass

vincepri

/approve
/lgtm

k8s-ci-robot · 2020-09-16T16:45:33Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: vincepri

Needs approval from an approver in each of these files:

~~OWNERS~~ [vincepri]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

fabriziopandini requested a review from sedefsavas September 15, 2020 10:03

k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Sep 15, 2020

k8s-ci-robot requested review from benmoss and justinsb September 15, 2020 10:04

k8s-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Sep 15, 2020

k8s-ci-robot added this to the v0.3.10 milestone Sep 15, 2020

fabriziopandini force-pushed the fix-self-hosted-flakes branch 3 times, most recently from 343f90f to 9121d59 Compare September 15, 2020 12:10