[BUG] Backup-restore falsely detects a single member restoration as bootstrap case when snapstore is not configured in etcd cluster. #760

Closed
ishan16696 opened this issue Aug 13, 2024 · 1 comment · Fixed by #761
Assignees
Labels
kind/bug Bug status/closed Issue is closed (either delivered or triaged)


ishan16696 commented Aug 13, 2024

Describe the bug:
It has been observed in one of our production clusters that when etcd's data-dir was somehow removed, backup-restore failed to detect this as a single-member restoration scenario for an etcd pod when snapstore is not configured, and instead falsely detected it as a bootstrap case. As a result, the etcd-events-0 pod does not start up, because it fails to join the cluster due to a memberID mismatch.

❯ k get pods etcd-events-0
etcd-events-0                                          1/2     Running   0             2m14s  
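
The misclassification described above can be sketched in Go as follows. This is only an illustration under stated assumptions, not the actual etcd-backup-restore code and not necessarily the fix made in #761: the names `Scenario`, `MemberState`, `classifyNaive` and `classifyWithClusterCheck` are hypothetical, and the sketch assumes the decision can be reduced to three facts (whether the data-dir exists, whether a snapstore is configured, and whether the other running members still list this member).

```go
package main

import "fmt"

// Scenario is the startup case chosen for a member (hypothetical type).
type Scenario string

const (
	NormalStart         Scenario = "normal-start"
	Bootstrap           Scenario = "bootstrap"
	SingleMemberRestore Scenario = "single-member-restoration"
)

// MemberState holds the assumed inputs to the decision (hypothetical type).
type MemberState struct {
	DataDirExists       bool // data-dir is present on disk
	SnapstoreConfigured bool // a backup store is configured
	KnownToCluster      bool // running members still list this member
}

// classifyNaive models the reported behaviour: with no snapstore and no
// data-dir, the member is treated as a fresh bootstrap without asking
// the cluster whether the member already exists.
func classifyNaive(s MemberState) Scenario {
	switch {
	case s.DataDirExists:
		return NormalStart
	case !s.SnapstoreConfigured:
		return Bootstrap
	case s.KnownToCluster:
		return SingleMemberRestore
	default:
		return Bootstrap
	}
}

// classifyWithClusterCheck is one possible alternative: cluster
// membership is consulted regardless of the snapstore configuration.
func classifyWithClusterCheck(s MemberState) Scenario {
	switch {
	case s.DataDirExists:
		return NormalStart
	case s.KnownToCluster:
		return SingleMemberRestore
	default:
		return Bootstrap
	}
}

func main() {
	// State matching this report: data-dir removed, no snapstore,
	// member still part of a running 3-member cluster.
	s := MemberState{DataDirExists: false, SnapstoreConfigured: false, KnownToCluster: true}
	fmt.Println("naive:", classifyNaive(s))
	fmt.Println("with cluster check:", classifyWithClusterCheck(s))
}
```

For the state in this report, the first function picks the bootstrap case while the second picks single-member restoration, which is the distinction this issue is about.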

How To Reproduce (as minimally and precisely as possible):

  1. Start a 3-member etcd cluster with no snapstore configured.
  2. Attach a debug container to the etcd-0 pod, then remove the data-dir completely.
  3. Kill the etcd container so that it restarts and triggers the restoration.

Logs:
backup-restore logs of etcd-events-0 pod:

2024-08-07 23:59:36 | {"log":"Served config for ETCD instance.","severity":"INFO"}
2024-08-07 23:59:36 | {"log":"checking the presence of a learner in a cluster...","severity":"INFO"}
2024-08-07 23:59:35 | {"log":{"attempt":0,"caller":"clientv3/retry_interceptor.go:62","error":"rpc error: code = DeadlineExceeded desc = latest balancer error: connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\"","level":"warn","msg":"retrying of unary invoker failed","target":"passthrough:///https://etcd-events-local:2379","ts":"2024-08-07T23:59:35.845Z"}}
2024-08-07 23:59:35 | {"log":"failed to get status of etcd endPoint: https://etcd-events-local:2379 with error: context deadline exceeded","severity":"ERR"}
2024-08-07 23:59:35 | {"log":"Updating status from Successful to New","severity":"INFO"}
2024-08-07 23:59:35 | {"log":"Responding to status request with: Successful","severity":"INFO"}
2024-08-07 23:59:34 | {"log":"Successfully initialized data directory for etcd.","severity":"INFO"}
2024-08-07 23:59:34 | {"log":"Removing directory(/var/etcd/data/new.etcd) since snapstore is empty.","severity":"INFO"}
2024-08-07 23:59:34 | {"log":"storage provider name not specified","severity":"INFO"}
2024-08-07 23:59:34 | {"log":"Checking whether the backup bucket is empty or not...","severity":"INFO"}
2024-08-07 23:59:34 | {"log":"Validation mode: full","severity":"INFO"}
2024-08-07 23:59:34 | {"log":"Validation failBelowRevision: ","severity":"INFO"}
2024-08-07 23:59:34 | {"log":"Setting status to : 503","severity":"INFO"}
2024-08-07 23:59:34 | {"log":"Updating status from New to Progress","severity":"INFO"}
2024-08-07 23:59:34 | {"log":"Received start initialization request.","severity":"INFO"}
2024-08-07 23:59:34 | {"log":"Responding to status request with: New","severity":"INFO"}
2024-08-07 23:59:34 | {"log":"No snapstore storage provider configured.","severity":"WARN"}
2024-08-07 23:59:33 | {"log":"TLS enabled. Starting HTTPS server.","severity":"INFO"}
2024-08-07 23:59:33 | {"log":"Starting HTTP server at addr: :8080","severity":"INFO"}
2024-08-07 23:59:33 | {"log":"Checking if etcd is running","severity":"INFO"}
2024-08-07 23:59:33 | {"log":"Starting the http server...","severity":"INFO"}
2024-08-07 23:59:33 | {"log":"Registering the http request handlers...","severity":"INFO"}
2024-08-07 23:59:33 | {"log":"Setting status to : 503","severity":"INFO"}
2024-08-07 23:59:33 | {"log":"compressionConfig:\\n  enabled: true\\n  policy: gzip\\ndefragmentationSchedule: 17 1 */3 * *\\netcdConnectionConfig:\\n  caFile: /var/etcd/ssl/client/ca/bundle.crt\\n  certFile: /var/etcd/ssl/client/client/tls.crt\\n  connectionTimeout: 5m0s\\n  defragTimeout: 15m0s\\n  endpoints:\\n  - https://etcd-events-local:2379\\n  keyFile: /var/etcd/ssl/client/client/tls.key\\n  serviceEndpoints:\\n  - https://etcd-events-client:2379\\n  snapshotTimeout: 15m0s\\nexponentialBackoffConfig:\\n  attemptLimit: 6\\n  multiplier: 2\\n  thresholdTime: 2m8s\\nhealthConfig:\\n  deltaSnapshotLeaseName: delta-snapshot-revisions\\n  fullSnapshotLeaseName: full-snapshot-revisions\\n  heartbeatDuration: 10s\\n  memberGCDuration: 1m0s\\n  memberLeaseRenewalEnabled: true\\nleaderElectionConfig:\\n  etcdConnectionTimeout: 5s\\n  reelectionPeriod: 5s\\nrestorationConfig:\\n  MaxRequestBytes: 10485760\\n  MaxTxnOps: 10240\\n  autoCompactionMode: periodic\\n  autoCompactionRetention: 30m\\n  dataDir: /var/etcd/data/new.etcd\\n  embeddedEtcdQuotaBytes: 8589934592\\n  initialAdvertisePeerURLs:\\n  - http://localhost:2380\\n  initialCluster: default=http://localhost:2380\\n  initialClusterToken: etcd-cluster\\n  maxCallSendMsgSize: 10485760\\n  maxFetchers: 6\\n  name: default\\n  tempDir: /var/etcd/data/restoration.temp\\nserverConfig:\\n  port: 8080\\n  server-cert: /var/etcd/ssl/client/server/tls.crt\\n  server-key: /var/etcd/ssl/client/server/tls.key\\nsnapshotterConfig:\\n  deltaSnapshotMemoryLimit: 104857600\\n  deltaSnapshotPeriod: 20s\\n  deltaSnapshotRetentionPeriod: 0s\\n  garbageCollectionPeriod: 12h0m0s\\n  garbageCollectionPolicy: Exponential\\n  maxBackups: 7\\n  schedule: 0 */1 * * *\\nsnapstoreConfig:\\n  container: \\\"\\\"\\n  maxParallelChunkUploads: 5\\n  minChunkSize: 5242880\\n  prefix: v2\\n  tempDir: /var/etcd/data/temp\\n","severity":"INFO"}
2024-08-07 23:59:33 | {"log":"Go OS/Arch: linux/amd64","severity":"INFO"}
2024-08-07 23:59:33 | {"log":"Go Version: go1.20.3","severity":"INFO"}
2024-08-07 23:59:33 | {"log":"Git SHA: 6a8f2198","severity":"INFO"}
2024-08-07 23:59:33 | {"log":"etcd-backup-restore Version: v0.28.2","severity":"INFO"}
2024-08-07 23:59:33 | {"log":"No snapstore storage provider configured. Will not start backup schedule.","severity":"WARN"}
2024-08-07 23:17:38 | {"log":"HTTPS server closed gracefully.","severity":"INFO"}
2024-08-07 23:17:38 | {"log":"Shutting down LeaderElection...","severity":"INFO"}

Screenshots (if applicable):

Environment (please complete the following information):

  • Etcd version/commit ID :
  • Etcd-backup-restore version/commit ID:
  • Cloud Provider [All/AWS/GCS/ABS/Swift/OSS]:

Anything else we need to know?:
This issue can only occur for the 0th pod.

@ishan16696 ishan16696 added the kind/bug Bug label Aug 13, 2024
@ishan16696

/assign
