SQLInstance: reference SQLInstance X/Y is not ready #294
Comments
/cc @jcanseco |
Hey @rnaveiras , could you share what the events look like for the |
Events in the sqlinstance:
Events in the namespace:
I hope this helps |
@rnaveiras , thank you for sharing the events with us. Would you be able to also share the configuration you're using for your |
Hey @caieo - I work on the same team as @rnaveiras Apologies it took us a while to get back to you on this! Here's a dump of an instance that's (still) exhibiting this issue:
---
apiVersion: sql.cnrm.cloud.google.com/v1beta1
kind: SQLInstance
metadata:
  annotations:
    cnrm.cloud.google.com/management-conflict-prevention-policy: resource
    cnrm.cloud.google.com/observed-secret-versions: '{}'
    cnrm.cloud.google.com/project-id: project-redacted
    cnrm.cloud.google.com/supports-ssa: "true"
  creationTimestamp: "2020-09-30T08:48:02Z"
  finalizers:
  - cnrm.cloud.google.com/finalizer
  - cnrm.cloud.google.com/deletion-defender
  generation: 17641
  labels:
    app: abacus
    app.kubernetes.io/instance: prd-abacus-sandbox-staging-abacus
    environment: sandbox-staging
    part-of: abacus
    release: abacus
    service: abacus
  managedFields:
  - apiVersion: sql.cnrm.cloud.google.com/v1beta1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          f:cnrm.cloud.google.com/supports-ssa: {}
    manager: supports-ssa
    operation: Apply
    time: "2020-10-08T14:41:36Z"
  - apiVersion: sql.cnrm.cloud.google.com/v1beta1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          f:cnrm.cloud.google.com/management-conflict-prevention-policy: {}
          f:cnrm.cloud.google.com/observed-secret-versions: {}
          f:cnrm.cloud.google.com/project-id: {}
          f:kubectl.kubernetes.io/last-applied-configuration: {}
        f:finalizers:
          v:"cnrm.cloud.google.com/deletion-defender": {}
          v:"cnrm.cloud.google.com/finalizer": {}
        f:labels:
          f:app: {}
          f:app.kubernetes.io/instance: {}
          f:environment: {}
          f:part-of: {}
          f:release: {}
          f:service: {}
      f:spec:
        f:databaseVersion: {}
        f:region: {}
        f:settings:
          f:activationPolicy: {}
          f:availabilityType: {}
          f:backupConfiguration:
            f:enabled: {}
            f:startTime: {}
          f:diskAutoresize: {}
          f:diskSize: {}
          f:diskType: {}
          f:ipConfiguration:
            f:authorizedNetworks: {}
            f:ipv4Enabled: {}
            f:requireSsl: {}
          f:locationPreference:
            f:zone: {}
          f:pricingPlan: {}
          f:replicationType: {}
          f:tier: {}
      f:status:
        f:connectionName: {}
        f:firstIpAddress: {}
        f:ipAddress: {}
        f:publicIpAddress: {}
        f:selfLink: {}
        f:serverCaCert:
          f:cert: {}
          f:commonName: {}
          f:createTime: {}
          f:expirationTime: {}
          f:sha1Fingerprint: {}
        f:serviceAccountEmailAddress: {}
    manager: before-first-apply
    operation: Update
  - apiVersion: sql.cnrm.cloud.google.com/v1beta1
    fieldsType: FieldsV1
    fieldsV1:
      f:status:
        f:conditions: {}
    manager: cnrm-controller-manager
    operation: Update
    time: "2021-01-12T00:17:57Z"
  name: abacus
  namespace: abacus-sandbox-staging
  resourceVersion: "765985076"
  selfLink: /apis/sql.cnrm.cloud.google.com/v1beta1/namespaces/abacus-sandbox-staging/sqlinstances/abacus
  uid: 108919d2-af60-4f72-a478-b7dd20fa2222
spec:
  databaseVersion: POSTGRES_12
  region: europe-west4
  settings:
    activationPolicy: ALWAYS
    availabilityType: REGIONAL
    backupConfiguration:
      enabled: true
      startTime: "07:00"
    diskAutoresize: true
    diskSize: 10
    diskType: PD_SSD
    ipConfiguration:
      authorizedNetworks:
      - name: all
        value: 0.0.0.0/0
      ipv4Enabled: true
      requireSsl: true
    locationPreference:
      zone: europe-west4-a
    pricingPlan: PER_USE
    replicationType: SYNCHRONOUS
    tier: db-custom-1-3840
status:
  conditions:
  - lastTransitionTime: "2021-01-12T00:17:57Z"
    message: The resource is up to date
    reason: UpToDate
    status: "True"
    type: Ready
  connectionName: project-redacted:europe-west4:abacus
  firstIpAddress: 1.2.3.4
  ipAddress:
  - ipAddress: 1.2.3.4
    type: PRIMARY
  publicIpAddress: 1.2.3.4
  selfLink: https://sqladmin.googleapis.com/sql/v1beta4/projects/project-redacted/instances/abacus
  serverCaCert:
    cert: |-
      redacted
    commonName: C=US,O=Google\, Inc,CN=Google Cloud SQL Server CA,dnQualifier=5548eefb-f843-458c-a67a-ea2f396e55c1
    createTime: "2020-09-30T08:49:11.127Z"
    expirationTime: "2030-09-28T08:50:11.127Z"
    sha1Fingerprint: 2b4fc8716cb4fdf29b4269ae79cdbf6a33c11083
  serviceAccountEmailAddress: [email protected]
|
We have this issue for most of our SQL instances. We're currently on 1.34.0 but have seen this on multiple versions. Same symptoms as above. Eventually all events balance out (e.g. 498 UpToDate and 498 Updating.) Based on the fact the |
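The "events balance out" symptom can be checked mechanically. The following is a minimal sketch, not part of Config Connector: it assumes the events have been exported as the `items` list of `kubectl get events -o json` and tallies the `count` of each `reason` per involved object. The function names, the sample data, and the threshold are hypothetical.

```python
from collections import Counter, defaultdict


def tally_event_reasons(events):
    """Sum event counts per (kind, name) and per reason."""
    totals = defaultdict(Counter)
    for ev in events:
        obj = ev.get("involvedObject", {})
        key = (obj.get("kind", ""), obj.get("name", ""))
        # "count" may be absent on some events; treat that as one occurrence.
        totals[key][ev.get("reason", "")] += ev.get("count") or 1
    return totals


def looks_like_flapping(reasons, threshold=10):
    """Heuristic (an assumption): many Updating events paired with many UpToDate events."""
    return min(reasons.get("Updating", 0), reasons.get("UpToDate", 0)) >= threshold


# Hypothetical sample mirroring the 498/498 figures mentioned in the comment above.
sample = [
    {"involvedObject": {"kind": "SQLInstance", "name": "abacus"},
     "reason": "Updating", "count": 498},
    {"involvedObject": {"kind": "SQLInstance", "name": "abacus"},
     "reason": "UpToDate", "count": 498},
]
totals = tally_event_reasons(sample)
for key, reasons in totals.items():
    print(key, dict(reasons), "flapping?", looks_like_flapping(reasons))
```

A resource that is reconciled once and then left alone should show only a handful of `Updating` events, so a high, matching pair of counts suggests a reconcile loop rather than real changes.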
Here is the controller's log for it happening and the diff between the two versions:
diff between resourceVersion/generation yamls (generation, resourceVersion, status, and managed fields for status (I'm guessing that last one is what is broken))
|
Hi @snuggie12, do your sql instance resources have "cnrm.cloud.google.com/management-conflict-prevention-policy: resource" annotation? If so, that means the label lease is enabled for conflict prevention. ConfigConnector will need to update your instance's labels to renew the lease. You can disable it per https://cloud.google.com/config-connector/docs/concepts/managing-conflicts#modifying_conflict_prevention. |
@xiaobaitusi We only add
Based on that link it says the default is determined by the resource type and whether it supports labels. I believe the SQL Instance does support labels so that explains the default. Seeing as how we only have one controller am I understanding you correctly that setting it to |
@xiaobaitusi that did indeed fix the problem for us. |
@xiaobaitusi Just to clarify though, disabling conflict prevention shouldn't really be required right? This still sounds like a bug in the controller if it's not able to renew the lease on the resource without changing its |
Hi @benwh, I apologize that we missed your question. Yes, you are correct: the controller should not be marking the resource |
I'm seeing new behavior with this. This seems specific to only one of our SQL instances. It also seems specific to I have it set to Here is an example diff as well as the md5sums of the yaml taken approximately every second:
|
Hi @snuggie12 , sorry to hear that you ran into a similar issue again. Just to clarify, are you observing this SQLInstance getting updated regularly?
Did you observe any value changes of any fields? If so, could you share more details?
|
@maqiuyujoyce yes it is updating regularly. You can see the update patterns based on the file names (they are epoch times.) The changes are also pasted above. Aside from the expected 2 fields, it is the time field for 2 of the managed field entries. Nothing else manages these resources. Config connector is rapidly changing those fields. If I delete the kubernetes resource with abandon set for my delete policy I believe managed fields (or at least any changes it is documenting,) will be wiped out and config connector should stop making updates. |
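The comparison described here (snapshots taken about once a second, then diffed) can be sketched in Python. This is an illustrative helper only: it assumes each snapshot has already been loaded into a dict (for example from `kubectl get ... -o json`), and the set of ignored fields (`generation`, `resourceVersion`, `status`, and the `time` of each `managedFields` entry) is taken from the fields named in this thread. The function names are hypothetical.

```python
import copy


def strip_volatile(resource):
    """Return a copy without the fields reported in this thread as changing on every update."""
    res = copy.deepcopy(resource)
    meta = res.get("metadata", {})
    for key in ("generation", "resourceVersion"):
        meta.pop(key, None)
    for entry in meta.get("managedFields", []):
        entry.pop("time", None)
    res.pop("status", None)
    return res


def changed_paths(old, new, prefix=""):
    """List dotted paths whose values differ between two nested structures."""
    if isinstance(old, dict) and isinstance(new, dict):
        paths = []
        for key in sorted(set(old) | set(new)):
            child = f"{prefix}.{key}" if prefix else key
            paths += changed_paths(old.get(key), new.get(key), child)
        return paths
    if isinstance(old, list) and isinstance(new, list) and len(old) == len(new):
        paths = []
        for i, (a, b) in enumerate(zip(old, new)):
            paths += changed_paths(a, b, f"{prefix}[{i}]")
        return paths
    return [] if old == new else [prefix]


# Hypothetical snapshots that differ only in the volatile fields.
before = {"metadata": {"generation": 1, "resourceVersion": "10",
                       "managedFields": [{"manager": "m", "time": "t1"}]},
          "spec": {"settings": {"tier": "db-n1-standard-1"}}}
after = copy.deepcopy(before)
after["metadata"].update(generation=2, resourceVersion="11")
after["metadata"]["managedFields"][0]["time"] = "t2"

print(changed_paths(before, after))
print(changed_paths(strip_volatile(before), strip_volatile(after)))
```

If the second list is empty while the first is not, the object is being rewritten without any change to its spec, which matches the behaviour reported above.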
I'm seeing the same behaviour, in my case setting |
@snuggie12 and @eyalzek Thank you for your confirmation & new data point! Could you provide the following information so that we can try to reproduce?
Yes @snuggie12, you can mark the deletion policy as abandon and delete the K8s resource without impacting the underlying SQL instance. After the SQLInstance resource is deleted from K8s, the underlying instance should stop making changes. |
v1.16.15-gke.7800 Went heavy on the redaction and commented the two fields in the reconciliation update loop.
|
---
apiVersion: compute.cnrm.cloud.google.com/v1beta1
kind: ComputeAddress
metadata:
  annotations:
    cnrm.cloud.google.com/deletion-policy: abandon
    cnrm.cloud.google.com/project-id: ${GCP_PROJECT}
  name: google-managed-services-default
spec:
  addressType: INTERNAL
  description: IP Range for peer networks.
  location: global
  purpose: VPC_PEERING
  prefixLength: 20
  networkRef:
    external: default
---
apiVersion: servicenetworking.cnrm.cloud.google.com/v1beta1
kind: ServiceNetworkingConnection
metadata:
  annotations:
    cnrm.cloud.google.com/deletion-policy: abandon
    cnrm.cloud.google.com/project-id: ${GCP_PROJECT}
  name: peer-network
spec:
  networkRef:
    external: default
  reservedPeeringRanges:
  - name: google-managed-services-default
  service: servicenetworking.googleapis.com
the problematic resource is the instance itself:
---
apiVersion: sql.cnrm.cloud.google.com/v1beta1
kind: SQLInstance
metadata:
  annotations:
    cnrm.cloud.google.com/project-id: ${GCP_PROJECT}
    cnrm.cloud.google.com/deletion-policy: abandon
    #### the resource is still stuck in updating loop even with this annotation set....
    cnrm.cloud.google.com/management-conflict-prevention-policy: none
  name: development-mysql-master
spec:
  databaseVersion: MYSQL_5_7
  region: europe-west4
  settings:
    tier: db-n1-standard-1
    availabilityType: ZONAL
    ipConfiguration:
      ipv4Enabled: true
      privateNetworkRef:
        external: default
    backupConfiguration:
      binaryLogEnabled: false
      enabled: true
      location: eu
      startTime: 00:00
    maintenanceWindow:
      day: 1
      hour: 2
      updateTrack: canary
One thing to note here is that the instance was already created with Terraform beforehand. After applying the manifest, the instance was "Updating" in the GCP console but became ready within the minute. I used
$ diff /tmp/dev-mysql.yaml /tmp/dev-mysql-2.yaml
4c4
< etag: cbc7ad26f61e42f4baddad8bd3b87a1602391d358255e6785e15d0cfe2b343b7
---
> etag: 69ee84b689d851abc30f769586a793996a0563db8694f120b0fdf5768e7c1d1c
76c76
< settingsVersion: '146'
---
> settingsVersion: '150'
79a80,83
> userLabels:
>   cnrm-lease-expiration: '1617967055'
>   cnrm-lease-holder-id: bvgug84inp3o783qoq40
>   managed-by-cnrm: 'true'
first time I applied it the
Here are some logs from the
|
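The three `userLabels` in the diff above are the lease labels that conflict prevention writes to the instance. A minimal sketch for reading them, assuming (this is not confirmed in the thread) that `cnrm-lease-expiration` is a Unix timestamp in seconds; the `parse_lease` helper is hypothetical:

```python
from datetime import datetime, timezone


def parse_lease(user_labels):
    """Extract lease holder and expiry from an instance's userLabels, if present."""
    expiry_raw = user_labels.get("cnrm-lease-expiration")
    if expiry_raw is None:
        return None  # no lease labels: nothing is holding a lease
    return {
        "holder": user_labels.get("cnrm-lease-holder-id"),
        # Assumption: the label value is seconds since the Unix epoch.
        "expires_at": datetime.fromtimestamp(int(expiry_raw), tz=timezone.utc),
    }


# Label values copied from the diff above.
labels = {
    "cnrm-lease-expiration": "1617967055",
    "cnrm-lease-holder-id": "bvgug84inp3o783qoq40",
    "managed-by-cnrm": "true",
}
lease = parse_lease(labels)
print(lease["holder"], lease["expires_at"].isoformat())
```

Comparing the holder ID and expiry across successive snapshots is one way to tell whether a single controller is renewing its own lease or whether a second manager is writing to the same labels.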
Thanks for the output! I'm having trouble reproducing the issue. Here are the highlights:
So I'm wondering if there are any other instances of KCC managing the resource, which would be applying those labels. Here's the manifest I've used (the private network I believe is irrelevant, as it's a hard-coded reference so doesn't depend on the status of another resource in the cluster).
With this configuration, I arrived at an up-to-date cluster, which stayed that way for at least 40 minutes, at which point I stopped checking:
38 minutes later...
Note the number of Updating, UpdateFailed events did not increase. This was a new instance that didn't exist previously. |
@eyalzek Looks like you supplied the yaml that you apply, but not the actual resource on the cluster. Could you provide
@toumorokoshi I don't think a fresh installation is going to show any issues. I deleted one of my instances (with
I think specifically what is wrong is this entry:
I'm presuming that just like normal infinite reconciliation loops,
To fully re-create, could you:
|
After reading up on I spotted this via: This eventually led me to the fact that the non-kcc controller was trying to set |
I just tried creating it again and monitoring the metadata, but the resource does not have a
$ k get sqlinstances.sql.cnrm.cloud.google.com
NAME                       AGE     READY   STATUS     STATUS AGE
development-mysql-master   2m13s   True    UpToDate   101s
we had a cluster upgrade to |
@snuggie12 great! Thanks for spelunking. Would it be fair to say your issue is fixed then?
@eyalzek sounds good! Keep us posted. If it does happen, please continue to report the GKE master version and Config Connector version so we can try to repro. |
Yes, I'm good. Thanks for your help |
@eyalzek I'm going to close the issue for now since things seem like they're working ok. Ping me on this thread if it's still not fixed and I'll re-open the issue. |
FYI it seems that another way a This seems to be due to a bug in Terraform: hashicorp/terraform-provider-google#10492 Until the issue is fixed, we recommend using |
We observed that all resources related to sql.cnrm.cloud.google.com/v1beta1 have multiple events where their state transitions from Ready to DependencyNotReady. It seems that the state is flapping.
Sometime later:
Checking the events in the resource:
It also happens for the rest of the resources related to Cloud SQL, such as sqlsslcert.sql.cnrm.cloud.google.com, sqldatabase.sql.cnrm.cloud.google.com, and sqlinstance.sql.cnrm.cloud.google.com.
Example from sqlsslcert.sql.cnrm.cloud.google.com:
Could you advise about this issue, please?