
gate: high failure ratio because of Workload cluster (without lb) creation #927

Closed
jichenjc opened this issue Jul 1, 2021 · 13 comments · Fixed by #930, #938 or #939
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@jichenjc
Contributor

jichenjc commented Jul 1, 2021

/kind bug

What steps did you take and what happened:
[A clear and concise description of what the bug is.]

Several PRs are hitting the same issue, and retrying 1-2 times still produces the same failure.

[1] 
[1] • Failure [1891.637 seconds]
[1] e2e tests
[1] /home/prow/go/src/sigs.k8s.io/cluster-api-provider-openstack/test/e2e/suites/e2e/e2e_test.go:49
[1]   Workload cluster (without lb)
[1]   /home/prow/go/src/sigs.k8s.io/cluster-api-provider-openstack/test/e2e/suites/e2e/e2e_test.go:133
[1]     It should be creatable and deletable [It]
[1]     /home/prow/go/src/sigs.k8s.io/cluster-api-provider-openstack/test/e2e/suites/e2e/e2e_test.go:175
[1] 
[1]     Timed out after 1800.001s.
[1]     Expected
[1]         <bool>: false
[1]     to be true
[1] 

https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_cluster-api-provider-openstack/924/pull-cluster-api-provider-openstack-e2e-test/1410422201777131520/build-log.txt

We need to check what is happening to cause such a high failure ratio.

What did you expect to happen:

Anything else you would like to add:
[Miscellaneous information that will assist in solving the issue.]

Environment:

  • Cluster API Provider OpenStack version (Or git rev-parse HEAD if manually built):
  • Cluster-API version:
  • OpenStack version:
  • Minikube/KIND version:
  • Kubernetes version (use kubectl version):
  • OS (e.g. from /etc/os-release):
@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Jul 1, 2021
@jichenjc
Contributor Author

jichenjc commented Jul 9, 2021

Per
https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_cluster-api-provider-openstack/924/pull-cluster-api-provider-openstack-e2e-test/1412692551504236544/artifacts/clusters/bootstrap/controllers/capo-controller-manager/capo-controller-manager-6f4cdd5947-42qv4/manager.log

it's because we didn't associate the floating IP correctly:

I0707 09:26:57.481439 1 openstackmachine_controller.go:258] controller-runtime/manager/controller/openstackmachine "msg"="Error state detected, skipping reconciliation"

My assumption is that this is a timing issue (after creating a floating IP, do we need to wait some time before it can be associated?). We create the floating IP and then try to associate it immediately after creation, which might lead to a race, so it would be better to retry the association after waiting a short time following creation.
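
For illustration only, here is a minimal sketch of what retrying the association with a short wait could look like, using wait.PollImmediate from apimachinery. The associate callback and the poll/timeout values are assumptions made for this sketch, not the actual CAPO code.

package main

import (
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// associateFloatingIPWithRetry retries a (hypothetical) association callback
// instead of failing on the first attempt right after the floating IP is created.
func associateFloatingIPWithRetry(associate func() error) error {
	// Poll every 5s for up to 2 minutes; both values are placeholders.
	return wait.PollImmediate(5*time.Second, 2*time.Minute, func() (bool, error) {
		if err := associate(); err != nil {
			// Treat the error as transient and retry; a real implementation would
			// distinguish retryable errors (e.g. the floating IP not yet visible
			// to the compute service) from permanent ones.
			return false, nil
		}
		return true, nil
	})
}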

@tobiasgiese
Member

@jichenjc what PR do you mean? If you mean #925, then I don't think that it's related to this issue. It was already present prior to the merge of the PR.

@jichenjc
Contributor Author

jichenjc commented Jul 9, 2021

@jichenjc what PR do you mean? If you mean #925, then I don't think that it's related to this issue. It was already present prior to the merge of the PR.

Not that one, I only gave that log as an example...

@jichenjc
Contributor Author

jichenjc commented Jul 13, 2021

Looks like I made a wrong assumption based on the previous info.

https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_cluster-api-provider-openstack/924/pull-cluster-api-provider-openstack-e2e-test/1414766955063152640/artifacts/clusters/bootstrap/controllers/capo-controller-manager/capo-controller-manager-6f4cdd5947-fjlxx/manager.log

First, the floating IP is created:

I0713 02:49:33.906406 1 recorder.go:104] controller-runtime/manager/events "msg"="Normal" "message"="Created floating IP 172.24.4.66 with id 6c0753d4-b5a8-43e0-9ea5-c7cb1373ecb6" "object"={"kind":"OpenStackCluster","namespace":"e2e-ae45wj","name":"cluster-e2e-ae45wj","uid":"cf4ef561-17c1-4818-9592-9e442fbed6e3","apiVersion":"infrastructure.cluster.x-k8s.io/v1alpha4","resourceVersion":"5528"} "reason"="SuccessfulCreateFloatingIP"

Then I am still seeing these errors (repeated multiple times):

I0713 02:50:07.053860 1 openstackmachine_controller.go:371] controller-runtime/manager/controller/openstackmachine "msg"="Floating IP association failed, will retry." "cluster"="cluster-e2e-ae45wj" "machine"="cluster-e2e-ae45wj-control-plane-b9qn6" "name"="cluster-e2e-ae45wj-control-plane-lhsdm" "namespace"="e2e-ae45wj" "openStackCluster"="cluster-e2e-ae45wj" "openStackMachine"="cluster-e2e-ae45wj-control-plane-lhsdm" "reconciler group"="infrastructure.cluster.x-k8s.io" "reconciler kind"="OpenStackMachine" "floating-ip"="172.24.4.66" "instance-id"="be74324c-e2cf-4d93-ad80-6c8df076ce9a"

So apparently it's not an issue of the floating IP not being created in time; it might be some other concurrency issue.
I need to read the logs further.

@jichenjc
Contributor Author

I'm not able to find the OpenStack logs, so I'm not sure what happened (why we keep getting the NotFound error).
I will add more debug logs.

@tobiasgiese
Member

This is what I have seen in many of my PR e2e tests:

[2] • Failure [1507.447 seconds]
[2] e2e tests
[2] /home/prow/go/src/sigs.k8s.io/cluster-api-provider-openstack/test/e2e/suites/e2e/e2e_test.go:49
[2]   Workload cluster (without lb)
[2]   /home/prow/go/src/sigs.k8s.io/cluster-api-provider-openstack/test/e2e/suites/e2e/e2e_test.go:133
[2]     Should create port(s) with custom options [It]
[2]     /home/prow/go/src/sigs.k8s.io/cluster-api-provider-openstack/test/e2e/suites/e2e/e2e_test.go:134
[2] 
[2]     Expected
[2]         <int>: 0
[2]     to equal
[2]         <int>: 1
[2] 
[2]     /home/prow/go/src/sigs.k8s.io/cluster-api-provider-openstack/test/e2e/suites/e2e/e2e_test.go:171

xref: https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_cluster-api-provider-openstack/932/pull-cluster-api-provider-openstack-e2e-test/1414482588461961216

@macaptain
Contributor

The Ginkgo tests run in parallel:

time $(GINKGO) -trace -progress -v -tags=e2e --nodes=2 $(E2E_GINKGO_ARGS) ./test/e2e/suites/e2e/... -- -config-path="$(E2E_CONF_PATH)" -artifacts-folder="$(ARTIFACTS)" --data-folder="$(E2E_DATA_DIR)" $(E2E_ARGS)

Setting -nodes to anything greater than 1 implies a parallelized test run.

Is it possible that the two specs in the "Workload cluster (without lb)" description are running at the same time, and one of them is deleting the cluster while the other is trying to create ports on it?
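
As a minimal illustration of that concern (a sketch, not the real suite layout): with --nodes=2 the two specs below may be scheduled onto different Ginkgo processes and run concurrently, so one spec tearing down the cluster could race with the other listing its ports.

package e2e_test

import (
	. "github.com/onsi/ginkgo"
)

// Sketch only: two specs that share the same Describe container.
var _ = Describe("Workload cluster (without lb)", func() {
	It("should be creatable and deletable", func() {
		// ...creates a workload cluster and deletes it at the end of the spec...
	})

	It("Should create port(s) with custom options", func() {
		// ...assumes the cluster/machine still exists while listing its ports...
	})
})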

@iamemilio
Contributor

@jichenjc
Contributor Author

The Ginkgo tests run in parallel:

time $(GINKGO) -trace -progress -v -tags=e2e --nodes=2 $(E2E_GINKGO_ARGS) ./test/e2e/suites/e2e/... -- -config-path="$(E2E_CONF_PATH)" -artifacts-folder="$(ARTIFACTS)" --data-folder="$(E2E_DATA_DIR)" $(E2E_ARGS)

Setting -nodes to anything greater than 1 implies a parallelized test run.

Is it possible that the two specs in the "Workload cluster (without lb)" description are running at the same time, and one of them is deleting the cluster while the other is trying to create ports on it?

Thanks for the info. Since the failure ratio rose suddenly and we don't delete/create ports that frequently, I'd guess this is less likely, but I will take it into consideration. Thanks.

@jichenjc
Contributor Author

I think that Should create port(s) with custom options [It] is also flaking.

See: https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_cluster-api-provider-openstack/921/pull-cluster-api-provider-openstack-e2e-test/1414589415367380992

Yes, and it's less flaky than the other one; it still needs to be checked...

@macaptain
Contributor

I noticed the ports e2e test creates a machine and then waits for an error condition not to occur before continuing. But it's possible that the error condition didn't occur because the machine wasn't created yet, and then the tests continue anyway. This may be contributing to the flakiness. I raised #938 to hopefully wait for the machine to be created before trying to list the ports.
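
A minimal sketch of that kind of wait, meant to sit inside the spec before the port assertions; getMachine, ctx, namespace, and machineName are hypothetical placeholders, not the code from #938.

// Sketch only: block until the (hypothetical) machine lookup stops returning an error.
Eventually(func() error {
	_, err := getMachine(ctx, namespace, machineName)
	return err
}, 5*time.Minute, 10*time.Second).Should(Succeed(),
	"machine %s was not created in time", machineName)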

@tobiasgiese
Member

tobiasgiese commented Jul 15, 2021

Was it intended to close this issue @hidekazuna?
I think it's because of your description "Because #930 did not fix #927." :)
(i.e., "fix #927" will close the issue)

@jichenjc jichenjc reopened this Jul 15, 2021
@hidekazuna
Contributor

Was it intended to close this issue @hidekazuna?
I think it's because of your description "Because #930 did not fix #927." :)
(i.e., "fix #927" will close the issue)

@tobiasgiese Ah, I definitely did not want to close this issue :)
