Failed to start up the cluster when upgrading to 1.13.0 #14107
Comments
@tuxillo, I couldn't reproduce the error; probably because I have a very small snapshot from v1.12.3.
|
Seeing the same (repeatably). Downgrading to 1.12.3 works fine.
With a perhaps slightly more useful log message; full startup logs are below.
Just re-checked: no additional logs are available with the level set to debug. |
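(For reference, a minimal sketch of raising the agent's log level, assuming a config directory at /etc/consul.d; the path is an assumption and will vary per install:)
# Hypothetical invocation: raise agent verbosity via the -log-level flag
consul agent -config-dir=/etc/consul.d -log-level=debug
# ...or set log_level = "debug" in the agent's HCL config and restart the service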
I can confirm that we had to downgrade to 1.12.3 to bring the cluster back. In another cluster, we left two nodes on 1.12.3 and one on 1.13.0, and the 1.13.0 node doesn't boot. Here's the inspect of the snapshot that fails to restore:
|
Same problem on a 3-node test cluster running on AlmaLinux 8. Downgraded to 1.12.3 to get the agent to start. |
FYI - Same here on a single-node dev cluster. Snapshot info:
|
@MikeN123, thanks so much for sharing; it's really useful. |
Here's mine.
Started by this systemd unit:
Nothing special to reproduce: just run a Consul cluster on 1.12.3 (installed and bootstrapped with 1.12.3), swap the binary for the 1.13.0 version, and restart the service (a shell sketch follows below). |
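A minimal shell sketch of that reproduction path; the install path and unit name are assumptions and will differ per setup:
# Assumes an existing cluster bootstrapped on 1.12.3 and a systemd unit named "consul"
sudo systemctl stop consul
# Swap in the 1.13.0 binary (the install path is an assumption)
sudo cp consul_1.13.0/consul /usr/local/bin/consul
sudo systemctl start consul
# On affected setups the agent then fails with "failed to load any existing snapshots"
journalctl -u consul -e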
Config here (1.12.3 works fine, 1.13.0 does not start) - almost completely default except for enabling Connect.
Steps to reproduce - no idea; I just upgraded a node that has been running for several versions and never had issues with upgrades. Maybe it's related to Connect, as all the installs above seem to have that in common? |
We have found the root cause and are working on a fix. Thanks to everyone for helping troubleshoot the issue! |
Thank you everyone for the reports. We narrowed down the issue to a breaking change that has been addressed in 1.13.1. This bug in 1.13.0 affects those upgrading from Consul 1.11+ AND using Connect service mesh (i.e. Connect proxies registered). Everyone is advised to upgrade to 1.13.1 instead. |
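For operators unsure whether their cluster is in the affected group, one rough way to check for registered Connect proxies is to look for sidecar registrations; this heuristic assumes the default "<service>-sidecar-proxy" naming and won't catch every proxy kind:
# List registered services; Connect sidecars usually use the default naming
consul catalog services | grep sidecar-proxy
# Or count connect-proxy registrations via the local agent's HTTP API
curl -s localhost:8500/v1/agent/services | grep -c connect-proxy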
@kisunji I am trying to upgrade from 1.12.4 to 1.13.2 and get this:
2022-09-29T14:03:33.851Z [WARN] agent: The 'ca_file' field is deprecated. Use the 'tls.defaults.ca_file' field instead.
2022-09-29T14:03:33.852Z [WARN] agent: The 'cert_file' field is deprecated. Use the 'tls.defaults.cert_file' field instead.
2022-09-29T14:03:33.852Z [WARN] agent: The 'key_file' field is deprecated. Use the 'tls.defaults.key_file' field instead.
2022-09-29T14:03:33.852Z [WARN] agent: The 'tls_min_version' field is deprecated. Use the 'tls.defaults.tls_min_version' field instead.
2022-09-29T14:03:33.852Z [WARN] agent: The 'verify_incoming' field is deprecated. Use the 'tls.defaults.verify_incoming' field instead.
2022-09-29T14:03:33.852Z [WARN] agent: The 'verify_incoming_https' field is deprecated. Use the 'tls.https.verify_incoming' field instead.
2022-09-29T14:03:33.852Z [WARN] agent: The 'verify_incoming_rpc' field is deprecated. Use the 'tls.internal_rpc.verify_incoming' field instead.
2022-09-29T14:03:33.852Z [WARN] agent: The 'verify_outgoing' field is deprecated. Use the 'tls.defaults.verify_outgoing' field instead.
2022-09-29T14:03:33.852Z [WARN] agent: The 'verify_server_hostname' field is deprecated. Use the 'tls.internal_rpc.verify_server_hostname' field instead.
2022-09-29T14:03:33.852Z [WARN] agent: bootstrap_expect > 0: expecting 3 servers
2022-09-29T14:03:33.852Z [WARN] agent: if auto_encrypt.allow_tls is turned on, tls.internal_rpc.verify_incoming should be enabled (either explicitly or via tls.defaults.verify_incoming). It is necessary to turn it off during a migration to TLS, but it should definitely be turned on afterwards.
2022-09-29T14:03:33.859Z [WARN] agent.auto_config: The 'ca_file' field is deprecated. Use the 'tls.defaults.ca_file' field instead.
2022-09-29T14:03:33.859Z [WARN] agent.auto_config: The 'cert_file' field is deprecated. Use the 'tls.defaults.cert_file' field instead.
2022-09-29T14:03:33.859Z [WARN] agent.auto_config: The 'key_file' field is deprecated. Use the 'tls.defaults.key_file' field instead.
2022-09-29T14:03:33.859Z [WARN] agent.auto_config: The 'tls_min_version' field is deprecated. Use the 'tls.defaults.tls_min_version' field instead.
2022-09-29T14:03:33.859Z [WARN] agent.auto_config: The 'verify_incoming' field is deprecated. Use the 'tls.defaults.verify_incoming' field instead.
2022-09-29T14:03:33.859Z [WARN] agent.auto_config: The 'verify_incoming_https' field is deprecated. Use the 'tls.https.verify_incoming' field instead.
2022-09-29T14:03:33.859Z [WARN] agent.auto_config: The 'verify_incoming_rpc' field is deprecated. Use the 'tls.internal_rpc.verify_incoming' field instead.
2022-09-29T14:03:33.859Z [WARN] agent.auto_config: The 'verify_outgoing' field is deprecated. Use the 'tls.defaults.verify_outgoing' field instead.
2022-09-29T14:03:33.859Z [WARN] agent.auto_config: The 'verify_server_hostname' field is deprecated. Use the 'tls.internal_rpc.verify_server_hostname' field instead.
2022-09-29T14:03:33.859Z [WARN] agent.auto_config: bootstrap_expect > 0: expecting 3 servers
2022-09-29T14:03:33.859Z [WARN] agent.auto_config: if auto_encrypt.allow_tls is turned on, tls.internal_rpc.verify_incoming should be enabled (either explicitly or via tls.defaults.verify_incoming). It is necessary to turn it off during a migration to TLS, but it should definitely be turned on afterwards.
2022-09-29T14:03:33.862Z [INFO] agent.server.raft: starting restore from snapshot: id=79518-2867789-1664381847462 last-index=2867789 last-term=79518 size-in-bytes=101474
2022-09-29T14:03:33.863Z [INFO] agent.server.raft: snapshot restore progress: id=79518-2867789-1664381847462 last-index=2867789 last-term=79518 size-in-bytes=101474 read-bytes=53 percent-complete=0.05%
2022-09-29T14:03:33.863Z [ERROR] agent.server.raft: failed to restore snapshot: id=79518-2867789-1664381847462 last-index=2867789 last-term=79518 size-in-bytes=101474 error="object missing primary index"
2022-09-29T14:03:33.863Z [INFO] agent.server.raft: starting restore from snapshot: id=79518-2851404-1664285318345 last-index=2851404 last-term=79518 size-in-bytes=101474
2022-09-29T14:03:33.863Z [INFO] agent.server.raft: snapshot restore progress: id=79518-2851404-1664285318345 last-index=2851404 last-term=79518 size-in-bytes=101474 read-bytes=53 percent-complete=0.05%
2022-09-29T14:03:33.863Z [ERROR] agent.server.raft: failed to restore snapshot: id=79518-2851404-1664285318345 last-index=2851404 last-term=79518 size-in-bytes=101474 error="object missing primary index"
2022-09-29T14:03:33.863Z [INFO] agent.server: shutting down server
2022-09-29T14:03:33.863Z [ERROR] agent: Error starting agent: error="Failed to start Consul server: Failed to start Raft: failed to load any existing snapshots"
2022-09-29T14:03:33.863Z [INFO] agent: Exit code: code=1
I use Consul Connect with Nomad 1.3.5. |
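(Those deprecation warnings are separate from the restore failure. For reference, the 1.12+ tls stanza the warnings point at would look roughly like the sketch below; the file paths and values are placeholders, only the field names come from the warnings above:)
tls {
  defaults {
    ca_file         = "/etc/consul.d/ca.pem"    # placeholder paths
    cert_file       = "/etc/consul.d/cert.pem"
    key_file        = "/etc/consul.d/key.pem"
    tls_min_version = "TLSv1_2"
    verify_incoming = true
    verify_outgoing = true
  }
  https {
    verify_incoming = false
  }
  internal_rpc {
    verify_incoming        = true
    verify_server_hostname = true
  }
}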
I tried to upgrade from 1.12.4 to 1.14.0 and still get the same error as mentioned above. A snapshot of the Raft state from 1.12.4, taken before attempting the upgrade, looks like this:
[core@f03 ~]$ consul snapshot inspect backup.snap
ID 79661-3562292-1668597879567
Size 77849
Index 3562292
Term 79661
Version 1
Type Count Size
---- ---- ----
Register 52 45.8KB
KVS 17 8.5KB
ACLToken 15 5.7KB
ConfigEntry 14 4.9KB
Index 69 2.7KB
ACLPolicy 9 2.2KB
ACLRole 4 1.6KB
ConnectCA 1 1.2KB
ConnectCAProviderState 1 1.1KB
CoordinateBatchUpdate 5 835B
Session 3 528B
Autopilot 1 199B
ConnectCAConfig 1 195B
SystemMetadata 3 191B
Tombstone 2 183B
ServiceVirtualIP 3 151B
FederationState 1 139B
FreeVirtualIP 1 33B
ChunkingState 1 12B
---- ---- ----
Total 76KB |
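As a general precaution before attempting these upgrades, the snapshot workflow used above can be run end to end; a minimal sketch, assuming a local agent and the filename backup.snap:
# Take a snapshot from the running 1.12.x cluster before upgrading
consul snapshot save backup.snap
# Inspect what the snapshot contains (as shown above)
consul snapshot inspect backup.snap
# If an upgrade goes wrong, restore into a downgraded cluster
consul snapshot restore backup.snap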
@ahjohannessen Could you please raise a new issue with your Consul versions (OSS or Enterprise?) and log fragments (ideally containing the agent version)? From a quick glance I didn't find any relevant diffs between the versions. If there is an issue with specific versions, it would be ideal to track it separately, since this one is considered closed. |
Overview of the Issue
Upgrading from 1.12.3 to 1.13.0 leads to a failure to start the cluster.
Reproduction Steps
Steps to reproduce this issue:
1. Run a Consul cluster installed and bootstrapped with 1.12.3.
2. Replace the binary with the 1.13.0 version.
3. Restart the service.
Consul info for both Client and Server
Client info
Server info
Operating system and Environment details
Centos 7 LXC running on Proxmox.
Linux consul-01.mysite 5.11.22-5-pve #1 SMP PVE 5.11.22-10 (Tue, 28 Sep 2021 08:15:41 +0200) x86_64 x86_64 x86_64 GNU/Linux
Log Fragments