gate: high failure ratio because of Workload cluster (without lb) creation #927
Comments
It's because we didn't associate the floating IP correctly:

    I0707 09:26:57.481439 1 openstackmachine_controller.go:258] controller-runtime/manager/controller/openstackmachine "msg"="Error state detected, skipping reconciliation"

I assume it's a timing issue (after creating the floating IP, do we need to wait some time before associating it?).
It looks like I made some wrong assumptions based on the previous info. First the floating IP is created:

    I0713 02:49:33.906406 1 recorder.go:104] controller-runtime/manager/events "msg"="Normal" "message"="Created floating IP 172.24.4.66 with id 6c0753d4-b5a8-43e0-9ea5-c7cb1373ecb6" "object"={"kind":"OpenStackCluster","namespace":"e2e-ae45wj","name":"cluster-e2e-ae45wj","uid":"cf4ef561-17c1-4818-9592-9e442fbed6e3","apiVersion":"infrastructure.cluster.x-k8s.io/v1alpha4","resourceVersion":"5528"} "reason"="SuccessfulCreateFloatingIP"

Then I am still seeing these errors (repeated multiple times):

    I0713 02:50:07.053860 1 openstackmachine_controller.go:371] controller-runtime/manager/controller/openstackmachine "msg"="Floating IP association failed, will retry." "cluster"="cluster-e2e-ae45wj" "machine"="cluster-e2e-ae45wj-control-plane-b9qn6" "name"="cluster-e2e-ae45wj-control-plane-lhsdm" "namespace"="e2e-ae45wj" "openStackCluster"="cluster-e2e-ae45wj" "openStackMachine"="cluster-e2e-ae45wj-control-plane-lhsdm" "reconciler group"="infrastructure.cluster.x-k8s.io" "reconciler kind"="OpenStackMachine" "floating-ip"="172.24.4.66" "instance-id"="be74324c-e2cf-4d93-ad80-6c8df076ce9a"

So apparently it's not a "floating IP not created in time" issue; it might be some other concurrency issue.
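To make the timing hypothesis concrete, here is a minimal sketch of what a "wait until the floating IP is visible, then associate" guard could look like, assuming gophercloud's Neutron layer3 `floatingips` API and the `k8s.io/apimachinery` `wait` helpers. `associateWhenVisible` is a hypothetical helper for illustration, not the controller's actual reconcile code:

```go
package main

import (
	"fmt"
	"time"

	"github.com/gophercloud/gophercloud"
	"github.com/gophercloud/gophercloud/openstack/networking/v2/extensions/layer3/floatingips"
	"k8s.io/apimachinery/pkg/util/wait"
)

// associateWhenVisible polls until the freshly created floating IP is
// readable from Neutron, then attaches it to portID. fipID and portID
// are placeholder inputs for illustration.
func associateWhenVisible(client *gophercloud.ServiceClient, fipID, portID string) error {
	// Poll every 2s, give up after 1 minute. A persistent 404 here would
	// confirm the "floating IP not yet visible" hypothesis; the logs above
	// suggest the race is elsewhere, since the create event fires first.
	err := wait.PollImmediate(2*time.Second, time.Minute, func() (bool, error) {
		_, err := floatingips.Get(client, fipID).Extract()
		if err != nil {
			if _, ok := err.(gophercloud.ErrDefault404); ok {
				return false, nil // not visible yet, keep polling
			}
			return false, err // real error, stop
		}
		return true, nil
	})
	if err != nil {
		return fmt.Errorf("floating IP %s never became visible: %w", fipID, err)
	}
	// Associate by setting the floating IP's port.
	_, err = floatingips.Update(client, fipID, floatingips.UpdateOpts{PortID: &portID}).Extract()
	return err
}
```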
I'm not able to find the OpenStack logs, so I'm not sure what happened (why we keep getting the NotFound error).
This is what I have seen in many of my PR e2e tests.
The Ginkgo tests run in parallel (see cluster-api-provider-openstack/Makefile, line 117 at a1ccb52). Is it possible that the two specs in the "Workload cluster (without lb)" description are running at the same time, and one of them is deleting the cluster while the other is trying to create ports on it?
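To illustrate the suspected race (this is purely hypothetical; the spec names and bodies are made up), Ginkgo distributes leaf `It` specs across parallel processes, so two specs in the same `Describe` can run concurrently against the same OpenStack project:

```go
package e2e_test

import (
	. "github.com/onsi/ginkgo"
)

var _ = Describe("Workload cluster (without lb)", func() {
	// Even if each parallel Ginkgo node gets its own copy of suite state,
	// both specs below still target the same real cloud resources, so they
	// can collide.
	It("creates ports on the cluster network", func() {
		// ... list/create ports on the shared workload cluster ...
	})

	It("deletes the workload cluster", func() {
		// ... tear the cluster down; racing the spec above would explain
		// the NotFound errors seen in the controller logs ...
	})
})
```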
I think that
Thanks for the info. Since the ratio went up suddenly and we should not be deleting/creating ports frequently, I guess it's less likely, but I will take it into consideration, thanks.
Yes, and it's less flaky than the other one, but we still need to check it...
I noticed the ports e2e test creates a machine and then waits for an error condition not to occur before continuing. However, it's possible that the error condition didn't occur simply because the machine hadn't been created yet, and then the tests continue anyway. This may be contributing to the flakiness. I raised #938 to hopefully wait for the machine to be created before trying to list the ports.
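A minimal sketch of the kind of guard described above, assuming a controller-runtime client and the v1alpha4 `OpenStackMachine` type; this is illustrative only and not the actual diff in #938 (the helper name and timeouts are made up):

```go
package e2e_test

import (
	"context"
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/apimachinery/pkg/util/wait"
	"sigs.k8s.io/controller-runtime/pkg/client"

	infrav1 "sigs.k8s.io/cluster-api-provider-openstack/api/v1alpha4"
)

// waitForOpenStackMachine blocks until the named machine is readable, so a
// test can't mistake "machine doesn't exist yet" for "no error condition".
func waitForOpenStackMachine(ctx context.Context, c client.Client, namespace, name string) (*infrav1.OpenStackMachine, error) {
	machine := &infrav1.OpenStackMachine{}
	err := wait.PollImmediate(5*time.Second, 2*time.Minute, func() (bool, error) {
		err := c.Get(ctx, types.NamespacedName{Namespace: namespace, Name: name}, machine)
		if apierrors.IsNotFound(err) {
			return false, nil // not created yet; keep waiting
		}
		return err == nil, err
	})
	return machine, err
}
```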
Was it intended to close this issue @hidekazuna? |
@tobiasgiese Ah, I definitely do not want to close this issue :)
/kind bug
What steps did you take and what happened:
Several PRs are suffering from the same issue; retrying 1-2 times still produces the same failure.
https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_cluster-api-provider-openstack/924/pull-cluster-api-provider-openstack-e2e-test/1410422201777131520/build-log.txt
We need to check what happened to cause such a high failure ratio.
What did you expect to happen:
Anything else you would like to add:
Environment:
- Cluster API Provider OpenStack version (`git rev-parse HEAD` if manually built):
- Kubernetes version (use `kubectl version`):
- OS (e.g. from `/etc/os-release`):