e2e test ' [Feature: AdvancedStatefulSet][Feature: Webhook] Scaling tidb cluster with advanced statefulset' is flaky #1846

Closed
cofyc opened this issue Mar 2, 2020 · 1 comment · Fixed by #1875

cofyc commented Mar 2, 2020

Bug Report

https://internal.pingcap.net/idc-jenkins/blue/organizations/jenkins/operator_ghpr_e2e_test_kind/detail/operator_ghpr_e2e_test_kind/3272/tests

Stacktrace
/home/jenkins/agent/workspace/operator_ghpr_e2e_test_kind/go/src/github.com/pingcap/tidb-operator/tests/e2e/tidbcluster/serial.go:152
Mar 2 12:00:35.493: Unexpected error:
<*errors.errorString | 0xc0000d6580>: {
s: "timed out waiting for the condition",
}
timed out waiting for the condition
occurred
/home/jenkins/agent/workspace/operator_ghpr_e2e_test_kind/go/src/github.com/pingcap/tidb-operator/tests/e2e/tidbcluster/serial.go:282
Standard Output
[BeforeEach] [tidb-operator][Serial]
/go/pkg/mod/k8s.io/[email protected]/test/e2e/framework/framework.go:151
STEP: Creating a kubernetes client
Mar 2 11:37:25.892: INFO: >>> kubeConfig: /etc/kubernetes/admin.conf
STEP: Building a namespace api object, basename serial
Mar 2 11:37:25.961: INFO: No PodSecurityPolicies found; assuming PodSecurityPolicy is disabled.
STEP: Waiting for a default service account to be provisioned in namespace
[BeforeEach] [tidb-operator][Serial]
/home/jenkins/agent/workspace/operator_ghpr_e2e_test_kind/go/src/github.com/pingcap/tidb-operator/tests/e2e/tidbcluster/serial.go:79
Mar 2 11:37:25.963: INFO: >>> kubeConfig: /etc/kubernetes/admin.conf
[BeforeEach] [Feature: AdvancedStatefulSet][Feature: Webhook]
/home/jenkins/agent/workspace/operator_ghpr_e2e_test_kind/go/src/github.com/pingcap/tidb-operator/tests/e2e/tidbcluster/serial.go:115
Mar 2 11:37:25.966: INFO: >>> kubeConfig: /etc/kubernetes/admin.conf
STEP: Installing CRDs
Mar 2 11:37:27.051: INFO: CRD "backups.pingcap.com" is established
Mar 2 11:37:27.051: INFO: CRD "backupschedules.pingcap.com" is established
Mar 2 11:37:27.051: INFO: CRD "dataresources.pingcap.com" is established
Mar 2 11:37:27.051: INFO: CRD "restores.pingcap.com" is established
Mar 2 11:37:27.051: INFO: CRD "statefulsets.apps.pingcap.com" is established
Mar 2 11:37:27.051: INFO: CRD "tidbclusterautoscalers.pingcap.com" is established
Mar 2 11:37:27.051: INFO: CRD "tidbclusters.pingcap.com" is established
Mar 2 11:37:27.051: INFO: CRD "tidbinitializers.pingcap.com" is established
Mar 2 11:37:27.051: INFO: CRD "tidbmonitors.pingcap.com" is established
STEP: Installing tidb-operator
Mar 2 11:37:27.891: INFO: APIService "v1alpha1.admission.tidb.pingcap.com" is not available yet
Mar 2 11:37:32.897: INFO: APIService "v1alpha1.admission.tidb.pingcap.com" is not available yet
Mar 2 11:37:37.896: INFO: APIService "v1alpha1.admission.tidb.pingcap.com" is not available yet
Mar 2 11:37:42.896: INFO: APIService "v1alpha1.admission.tidb.pingcap.com" is not available yet
Mar 2 11:37:47.897: INFO: APIService "v1alpha1.admission.tidb.pingcap.com" is not available yet
Mar 2 11:37:56.411: INFO: APIService "v1alpha1.admission.tidb.pingcap.com" is not available yet
Mar 2 11:38:00.353: INFO: APIService "v1alpha1.admission.tidb.pingcap.com" is not available yet
Mar 2 11:38:03.367: INFO: APIService "v1." is available
Mar 2 11:38:03.367: INFO: APIService "v1.apps" is available
Mar 2 11:38:03.367: INFO: APIService "v1.apps.pingcap.com" is available
Mar 2 11:38:03.367: INFO: APIService "v1.authentication.k8s.io" is available
Mar 2 11:38:03.367: INFO: APIService "v1.authorization.k8s.io" is available
Mar 2 11:38:03.367: INFO: APIService "v1.autoscaling" is available
Mar 2 11:38:03.367: INFO: APIService "v1.batch" is available
Mar 2 11:38:03.367: INFO: APIService "v1.networking.k8s.io" is available
Mar 2 11:38:03.367: INFO: APIService "v1.rbac.authorization.k8s.io" is available
Mar 2 11:38:03.367: INFO: APIService "v1.storage.k8s.io" is available
Mar 2 11:38:03.367: INFO: APIService "v1alpha1.admission.tidb.pingcap.com" is available
Mar 2 11:38:03.367: INFO: APIService "v1alpha1.apps.pingcap.com" is available
Mar 2 11:38:03.367: INFO: APIService "v1alpha1.pingcap.com" is available
Mar 2 11:38:03.367: INFO: APIService "v1beta1.admissionregistration.k8s.io" is available
Mar 2 11:38:03.367: INFO: APIService "v1beta1.apiextensions.k8s.io" is available
Mar 2 11:38:03.367: INFO: APIService "v1beta1.apps" is available
Mar 2 11:38:03.367: INFO: APIService "v1beta1.authentication.k8s.io" is available
Mar 2 11:38:03.367: INFO: APIService "v1beta1.authorization.k8s.io" is available
Mar 2 11:38:03.367: INFO: APIService "v1beta1.batch" is available
Mar 2 11:38:03.367: INFO: APIService "v1beta1.certificates.k8s.io" is available
Mar 2 11:38:03.367: INFO: APIService "v1beta1.coordination.k8s.io" is available
Mar 2 11:38:03.367: INFO: APIService "v1beta1.events.k8s.io" is available
Mar 2 11:38:03.367: INFO: APIService "v1beta1.extensions" is available
Mar 2 11:38:03.367: INFO: APIService "v1beta1.policy" is available
Mar 2 11:38:03.367: INFO: APIService "v1beta1.rbac.authorization.k8s.io" is available
Mar 2 11:38:03.367: INFO: APIService "v1beta1.scheduling.k8s.io" is available
Mar 2 11:38:03.367: INFO: APIService "v1beta1.storage.k8s.io" is available
Mar 2 11:38:03.367: INFO: APIService "v1beta2.apps" is available
Mar 2 11:38:03.367: INFO: APIService "v2beta1.autoscaling" is available
Mar 2 11:38:03.367: INFO: APIService "v2beta2.autoscaling" is available
[It] Scaling tidb cluster with advanced statefulset
/home/jenkins/agent/workspace/operator_ghpr_e2e_test_kind/go/src/github.com/pingcap/tidb-operator/tests/e2e/tidbcluster/serial.go:152
STEP: Scaling in tikv from 5 to 3 by deleting pods 1 and 3
STEP: Scaling sts serial-1878/scaling-with-asts-tikv to replicas 3 and setting deleting pods to [1 3] (old replicas: 5, old delete slots: [])
STEP: Waiting for all pods of tidb cluster component tikv (sts: serial-1878/scaling-with-asts-tikv) are in desired state (replicas: 3, delete slots: [1 3])
STEP: Verify other pods of sts serial-1878/scaling-with-asts-tikv should not be affected
STEP: Scaling out tikv from 3 to 4 by adding pod 3
STEP: Scaling sts serial-1878/scaling-with-asts-tikv to replicas 4 and setting deleting pods to [1] (old replicas: 3, old delete slots: [1 3])
STEP: Waiting for all pods of tidb cluster component tikv (sts: serial-1878/scaling-with-asts-tikv) are in desired state (replicas: 4, delete slots: [1])
Mar 2 12:00:35.493: FAIL: Unexpected error:
<*errors.errorString | 0xc0000d6580>: {
s: "timed out waiting for the condition",
}
timed out waiting for the condition
occurred

cofyc commented Mar 6, 2020

The failure count for a StatefulSet object in the StatefulSet controller's queue keeps accumulating until the object is synced successfully, e.g. until the update of replicas from 5 to 3 completes. However, scaling TiKV pods may take a long time, and because the retry interval grows exponentially with each failure, it soon becomes very large.

This is unacceptable. One solution is for our operator to update the StatefulSet object every time it syncs (e.g. by updating annotations or labels), so that the object is put back onto the controller's process queue periodically.

Rate limiter used by the StatefulSet controller:

func DefaultControllerRateLimiter() RateLimiter {
	return NewMaxOfRateLimiter(
		NewItemExponentialFailureRateLimiter(5*time.Millisecond, 1000*time.Second),
		// 10 qps, 100 bucket size.  This is only for retry speed and its only the overall factor (not per item)
		&BucketRateLimiter{Limiter: rate.NewLimiter(rate.Limit(10), 100)},
	)
}

https://github.com/kubernetes/kubernetes/blob/v1.17.0/staging/src/k8s.io/client-go/util/workqueue/default_rate_limiters.go#L39-L45
