KVM incremental snapshot feature #9270

Open
wants to merge 36 commits into
base: main

JoaoJandre
Contributor

@JoaoJandre JoaoJandre commented Jun 18, 2024

Description

This PR solves issue #8907.

Currently, when taking a volume snapshot/backup with KVM as the hypervisor, it is always a full snapshot/backup. However, always taking full snapshots of volumes is costly for both the storage network and storage systems. To solve the aforementioned issues, this PR extends the volume snapshot feature in KVM, allowing users to create incremental volume snapshots using KVM as a hypervisor.

To give operators control over which type of snapshot is being created, a new global setting kvm.incremental.snapshot has been added, which can be changed at zone and cluster scopes; this setting is false by default. Also, the snapshot.delta.max configuration, used to control the maximum deltas when using XenServer, was extended to also limit the size of the backing chain of snapshots on primary/secondary storage.

This functionality is only available in environments with Libvirt 7.6.0+ and qemu 6.1+. If the kvm.incremental.snapshot setting is true and the hosts do not have the required Libvirt and qemu versions, an error is thrown when trying to take a snapshot. Additionally, this functionality is only available when using file-based storage, such as shared mount-point (iSCSI and FC, which require a shared mount-point file system for KVM such as OCFS2 or GlusterFS), NFS, and local storage. Other storage types for KVM, such as CLVM and RBD, need different approaches to enable incremental backups; therefore, they are not currently supported.
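
As a rough illustration of the requirements above, the pre-flight check could be sketched as follows. This is not the code from the PR: the function names, the pool-type labels, and the error messages are assumptions made for the example; only the version numbers and the list of supported storage kinds come from the description.

```python
from typing import Tuple

# Minimum versions quoted in the PR description.
MIN_LIBVIRT = (7, 6, 0)
MIN_QEMU = (6, 1, 0)

# File-based pool kinds named in the description. The string labels are
# hypothetical and chosen only for this sketch.
FILE_BASED_POOLS = {"NFS", "SharedMountPoint", "Filesystem"}


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '7.6.0' into (7, 6, 0), padding short versions such as '6.1'."""
    parts = [int(p) for p in text.split(".")]
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def check_incremental_support(libvirt: str, qemu: str, pool_type: str) -> None:
    """Raise ValueError if incremental snapshots cannot be taken."""
    if parse_version(libvirt) < MIN_LIBVIRT:
        raise ValueError(f"Libvirt {libvirt} is older than 7.6.0")
    if parse_version(qemu) < MIN_QEMU:
        raise ValueError(f"qemu {qemu} is older than 6.1")
    if pool_type not in FILE_BASED_POOLS:
        raise ValueError(f"{pool_type} is not file-based storage")


# Passes silently: both versions and the pool type meet the requirements.
check_incremental_support("7.6.0", "6.1", "NFS")
```

Failing early with an explicit message mirrors the behaviour described above, where a snapshot attempt on an unsupported host results in an error rather than a silent fallback to a full snapshot.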

Issue #8907 has more details and flowcharts of all the mapped workflows.

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)
  • build/CI

Feature/Enhancement Scale or Bug Severity

Feature/Enhancement Scale

  • Major
  • Minor

Bug Severity

  • BLOCKER
  • Critical
  • Major
  • Minor
  • Trivial

Screenshots (if appropriate):

How Has This Been Tested?

Description of tests

During testing, the kvm.incremental.snapshot setting was changed to true and the snapshot.delta.max setting was changed to 3.

Tests with snapshot.backup.to.secondary = false

For the tests in this section, a test VM was created and reused for all tests.

Snapshot creation tests

| Test | Result |
|---|---|
| Access the VM, create any file in it, and create volume snapshot 1 while the VM is running | Full snapshot created |
| Access the VM, create a second file in it, and create volume snapshot 2 while the VM is running | Incremental snapshot created with correct size and backing chain (snapshot 1) |
| Stop the VM and create volume snapshot 3 | Incremental snapshot created correctly |
| Start the VM again and create volume snapshot 4 | Full snapshot created |
| Migrate the VM and create volume snapshot 5 | Incremental snapshot created from snapshot 4 |
| Migrate VM + ROOT volume | Exception |
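
The full/incremental pattern in these results is consistent with a simple reading of snapshot.delta.max = 3: once a chain holds three snapshots, the next one starts a new chain. The sketch below models only that reading. It is an assumption, not the PR's implementation; the class and function names are hypothetical, and VM state and migration are ignored entirely.

```python
from typing import List, Optional


class Snapshot:
    def __init__(self, number: int, parent: Optional["Snapshot"]):
        self.number = number
        self.parent = parent  # None means this is a full snapshot

    @property
    def kind(self) -> str:
        return "full" if self.parent is None else "incremental"

    def chain_length(self) -> int:
        """Count this snapshot plus all of its ancestors."""
        length, node = 1, self.parent
        while node is not None:
            length += 1
            node = node.parent
        return length


def take_snapshot(history: List[Snapshot], delta_max: int) -> Snapshot:
    """Append a snapshot, starting a new chain when the current one is full."""
    last = history[-1] if history else None
    if last is None or last.chain_length() >= delta_max:
        parent = None  # assumed rule: chain limit reached, take a full snapshot
    else:
        parent = last
    snap = Snapshot(len(history) + 1, parent)
    history.append(snap)
    return snap


history: List[Snapshot] = []
kinds = [take_snapshot(history, delta_max=3).kind for _ in range(5)]
print(kinds)
```

Under this assumed rule, snapshots 1 and 4 are full and snapshots 2, 3, and 5 are incremental, which matches the table.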

Snapshot restore tests

| Test | Result |
|---|---|
| Access the VM, delete all previously created files, stop the VM, restore snapshot 1, and start the VM again | Restoration performed correctly; the file created in snapshot creation test 1 was present on the volume |
| Access the VM, delete the file restored in snapshot restore test 1, stop the VM, restore snapshot 2, and start the VM again | Restoration performed correctly; the files created in snapshot creation tests 1 and 2 were present on the volume |

Snapshot removal tests

| Test | Result |
|---|---|
| Delete snapshot 5 | Snapshot deleted and removed from storage |
| Delete snapshot 1 | Snapshot deleted and not removed from storage |
| Delete snapshots 2 and 3 | Snapshots deleted and removed from storage; furthermore, snapshot 1 was also removed from storage |
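
The removal results above suggest that a deleted snapshot stays on storage while a later snapshot still depends on it, and is cleaned up once its last dependent is gone. A minimal sketch of that behaviour, assuming the chains 1 ← 2 ← 3 and 4 ← 5 from the creation tests, is shown below. The `SnapshotStore` class is hypothetical and is not taken from the PR.

```python
from typing import Dict, Optional, Set


class SnapshotStore:
    def __init__(self) -> None:
        self.parent: Dict[int, Optional[int]] = {}
        self.on_storage: Set[int] = set()
        self.deleted: Set[int] = set()  # snapshots the user asked to delete

    def add(self, snap_id: int, parent: Optional[int] = None) -> None:
        self.parent[snap_id] = parent
        self.on_storage.add(snap_id)

    def _has_children_on_storage(self, snap_id: int) -> bool:
        return any(
            p == snap_id and child in self.on_storage
            for child, p in self.parent.items()
        )

    def delete(self, snap_id: int) -> None:
        """Mark as deleted, then remove every deleted snapshot nothing depends on."""
        self.deleted.add(snap_id)
        node: Optional[int] = snap_id
        while (
            node is not None
            and node in self.deleted
            and not self._has_children_on_storage(node)
        ):
            self.on_storage.discard(node)
            node = self.parent[node]  # walk up to the parent and re-check it


store = SnapshotStore()
store.add(1)
store.add(2, parent=1)
store.add(3, parent=2)
store.add(4)
store.add(5, parent=4)

store.delete(5)  # leaf: removed at once, snapshot 4 is untouched
store.delete(1)  # still backs snapshot 2: kept on storage
store.delete(2)  # still backs snapshot 3: kept on storage
store.delete(3)  # leaf: removed, which also frees snapshots 2 and 1
print(sorted(store.on_storage))
```

Walking up the chain after each removal is what lets snapshot 1 disappear as a side effect of deleting snapshots 2 and 3.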

Template creation test

| # | Test | Result |
|---|---|---|
| 1 | Create template from snapshot 4 and create a VM using the template | Template created correctly; the VM had the files created in the original VM |

Tests with snapshot.backup.to.secondary = true

All tests performed in the previous sections were repeated with snapshot.backup.to.secondary = true; in addition, two more tests were performed. For the tests in this section, a test VM was created and reused for all tests.

Snapshot creation tests

| N | Test | Result |
|---|---|---|
| 1 | Migrate the VM + ROOT volume and take snapshot 6 | Migration carried out and full snapshot created |
| 2 | Stop the VM, migrate the volume, and take snapshot 7 | Volume migration performed and incremental snapshot created from snapshot 6 |

I have also tested that the bitmaps are removed once the snapshots are deleted.

codecov bot commented Jun 18, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 4.04%. Comparing base (a82a242) to head (e63fb7a).

❗ There is a different number of reports uploaded between BASE (a82a242) and HEAD (e63fb7a). Click for more details.

HEAD has 1 upload less than BASE
Flag BASE (a82a242) HEAD (e63fb7a)
unittests 1 0
Additional details and impacted files
@@             Coverage Diff              @@
##               main   #9270       +/-   ##
============================================
- Coverage     15.80%   4.04%   -11.77%     
============================================
  Files          5627     392     -5235     
  Lines        492341   32175   -460166     
  Branches      59693    5678    -54015     
============================================
- Hits          77827    1301    -76526     
+ Misses       405990   30726   -375264     
+ Partials       8524     148     -8376     
Flag Coverage Δ
uitests 4.04% <ø> (ø)
unittests ?

Flags with carried forward coverage won't be shown. Click here to find out more.


@weizhouapache
Member

Good job @JoaoJandre

@DaanHoogland DaanHoogland added this to the 4.20.0.0 milestone Jun 19, 2024
@DaanHoogland
Contributor

> Good job @JoaoJandre

second that, tnx

@DaanHoogland
Contributor

not gotten through all of it yet but looks good so far.

@DaanHoogland
Contributor

@blueorangutan package

@blueorangutan

@DaanHoogland a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 10011

@DaanHoogland
Contributor

@blueorangutan test

@blueorangutan

@DaanHoogland a [SL] Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@blueorangutan

[SF] Trillian test result (tid-10507)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 53282 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr9270-t10507-kvm-centos7.zip
Smoke tests completed. 111 look OK, 23 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File
test_role_account_acls_multiple_mgmt_servers Error 2.28 test_dynamicroles.py
test_query_async_job_result Error 102.03 test_async_job.py
test_revoke_certificate Error 0.01 test_certauthority_root.py
test_configure_ha_provider_invalid Error 0.02 test_hostha_simulator.py
test_configure_ha_provider_valid Error 0.01 test_hostha_simulator.py
test_ha_configure_enabledisable_across_clusterzones Error 0.01 test_hostha_simulator.py
test_ha_disable_feature_invalid Error 0.01 test_hostha_simulator.py
test_ha_enable_feature_invalid Error 0.01 test_hostha_simulator.py
test_ha_list_providers Error 0.01 test_hostha_simulator.py
test_ha_multiple_mgmt_server_ownership Error 0.01 test_hostha_simulator.py
test_ha_verify_fsm_available Error 0.01 test_hostha_simulator.py
test_ha_verify_fsm_degraded Error 0.01 test_hostha_simulator.py
test_ha_verify_fsm_fenced Error 0.01 test_hostha_simulator.py
test_ha_verify_fsm_recovering Error 0.01 test_hostha_simulator.py
test_hostha_configure_default_driver Error 0.01 test_hostha_simulator.py
test_hostha_configure_invalid_provider Error 0.01 test_hostha_simulator.py
test_hostha_disable_feature_valid Error 0.01 test_hostha_simulator.py
test_hostha_enable_feature_valid Error 0.01 test_hostha_simulator.py
test_hostha_enable_feature_without_setting_provider Error 0.01 test_hostha_simulator.py
test_list_ha_for_host Error 0.01 test_hostha_simulator.py
test_list_ha_for_host_invalid Error 0.01 test_hostha_simulator.py
test_list_ha_for_host_valid Error 0.01 test_hostha_simulator.py
test_01_host_ping_on_alert Error 0.08 test_host_ping.py
test_01_host_ping_on_alert Error 0.08 test_host_ping.py
test_01_browser_migrate_template Error 15.32 test_image_store_object_migration.py
test_01_invalid_upgrade_kubernetes_cluster Failure 251.12 test_kubernetes_clusters.py
test_02_upgrade_kubernetes_cluster Failure 241.89 test_kubernetes_clusters.py
test_03_deploy_and_scale_kubernetes_cluster Failure 241.81 test_kubernetes_clusters.py
test_04_autoscale_kubernetes_cluster Failure 231.70 test_kubernetes_clusters.py
test_05_basic_lifecycle_kubernetes_cluster Failure 222.51 test_kubernetes_clusters.py
test_06_delete_kubernetes_cluster Failure 243.68 test_kubernetes_clusters.py
test_08_upgrade_kubernetes_ha_cluster Failure 347.95 test_kubernetes_clusters.py
test_10_vpc_tier_kubernetes_cluster Failure 232.24 test_kubernetes_clusters.py
test_11_test_unmanaged_cluster_lifecycle Error 91.64 test_kubernetes_clusters.py
test_01_add_delete_kubernetes_supported_version Error 0.14 test_kubernetes_supported_versions.py
ContextSuite context=TestListIdsParams>:teardown Error 1.12 test_list_ids_parameter.py
login_test_saml_user Error 3.20 test_login.py
test_01_deployVMInSharedNetwork Error 77.64 test_network.py
test_03_destroySharedNetwork Failure 1.07 test_network.py
ContextSuite context=TestSharedNetwork>:teardown Error 2.17 test_network.py
test_oobm_issue_power_cycle Error 2.31 test_outofbandmanagement_nestedplugin.py
test_oobm_issue_power_off Error 3.31 test_outofbandmanagement_nestedplugin.py
test_oobm_issue_power_on Error 3.33 test_outofbandmanagement_nestedplugin.py
test_oobm_issue_power_reset Error 3.34 test_outofbandmanagement_nestedplugin.py
test_oobm_issue_power_soft Error 3.30 test_outofbandmanagement_nestedplugin.py
test_oobm_issue_power_status Error 1.22 test_outofbandmanagement_nestedplugin.py
test_oobm_background_powerstate_sync Failure 21.68 test_outofbandmanagement.py
test_oobm_background_powerstate_sync Error 21.69 test_outofbandmanagement.py
test_oobm_configure_default_driver Error 0.06 test_outofbandmanagement.py
test_oobm_configure_invalid_driver Error 0.05 test_outofbandmanagement.py
test_oobm_disable_feature_invalid Error 0.05 test_outofbandmanagement.py
test_oobm_disable_feature_valid Error 1.15 test_outofbandmanagement.py
test_oobm_enable_feature_invalid Error 0.04 test_outofbandmanagement.py
test_oobm_enable_feature_valid Error 1.11 test_outofbandmanagement.py
test_oobm_enabledisable_across_clusterzones Error 11.88 test_outofbandmanagement.py
test_oobm_enabledisable_across_clusterzones Error 11.88 test_outofbandmanagement.py
test_oobm_issue_power_cycle Error 4.34 test_outofbandmanagement.py
test_oobm_issue_power_cycle Error 4.34 test_outofbandmanagement.py
test_oobm_issue_power_off Error 4.35 test_outofbandmanagement.py
test_oobm_issue_power_off Error 4.35 test_outofbandmanagement.py
test_oobm_issue_power_on Error 2.32 test_outofbandmanagement.py
test_oobm_issue_power_on Error 2.33 test_outofbandmanagement.py
test_oobm_issue_power_reset Error 4.33 test_outofbandmanagement.py
test_oobm_issue_power_reset Error 4.33 test_outofbandmanagement.py
test_oobm_issue_power_soft Error 4.34 test_outofbandmanagement.py
test_oobm_issue_power_soft Error 4.34 test_outofbandmanagement.py
test_oobm_issue_power_status Error 4.36 test_outofbandmanagement.py
test_oobm_issue_power_status Error 4.36 test_outofbandmanagement.py
test_oobm_multiple_mgmt_server_ownership Error 1.16 test_outofbandmanagement.py
test_oobm_multiple_mgmt_server_ownership Error 1.16 test_outofbandmanagement.py
test_oobm_zchange_password Error 2.27 test_outofbandmanagement.py
test_oobm_zchange_password Error 2.27 test_outofbandmanagement.py
test_02_edit_primary_storage_tags Error 0.01 test_primary_storage.py
test_01_vpc_privategw_acl Error 0.03 test_privategw_acl_ovs_gre.py
test_03_vpc_privategw_restart_vpc_cleanup Error 0.02 test_privategw_acl_ovs_gre.py
test_05_vpc_privategw_check_interface Error 0.02 test_privategw_acl_ovs_gre.py
test_01_vpc_privategw_acl Error 53.57 test_privategw_acl.py
test_02_vpc_privategw_static_routes Error 213.98 test_privategw_acl.py
test_03_vpc_privategw_restart_vpc_cleanup Error 209.22 test_privategw_acl.py
test_04_rvpc_privategw_static_routes Error 337.92 test_privategw_acl.py
test_01_snapshot_root_disk Error 1.14 test_snapshots.py
test_02_list_snapshots_with_removed_data_store Error 50.05 test_snapshots.py
test_02_list_snapshots_with_removed_data_store Error 50.05 test_snapshots.py
ContextSuite context=TestSnapshotStandaloneBackup>:setup Error 170.10 test_snapshots.py
test_CreateTemplateWithDuplicateName Error 21.75 test_templates.py
test_01_register_template_direct_download_flag Error 0.16 test_templates.py
test_01_positive_tests_usage Error 10.51 test_usage_events.py
test_01_ISO_usage Error 1.08 test_usage.py
test_01_lb_usage Error 4.25 test_usage.py
test_01_nat_usage Error 8.33 test_usage.py
test_01_public_ip_usage Error 1.07 test_usage.py
test_01_snapshot_usage Error 3.18 test_usage.py
test_01_template_usage Error 13.47 test_usage.py
test_01_vm_usage Error 134.27 test_usage.py
test_01_volume_usage Error 125.61 test_usage.py
test_01_vpn_usage Error 9.58 test_usage.py
test_12_start_vm_multiple_volumes_allocated Error 10.54 test_vm_life_cycle.py
test_01_vmschedule_create Error 0.09 test_vm_schedule.py
test_disable_oobm_ha_state_ineligible Error 0.05 test_hostha_kvm.py
test_hostha_configure_default_driver Error 0.04 test_hostha_kvm.py
test_hostha_enable_ha_when_host_disabled Error 0.04 test_hostha_kvm.py
test_hostha_enable_ha_when_host_disconected Error 0.04 test_hostha_kvm.py
test_hostha_enable_ha_when_host_in_maintenance Error 0.06 test_hostha_kvm.py
test_hostha_kvm_host_degraded Error 0.04 test_hostha_kvm.py
test_hostha_kvm_host_fencing Error 0.04 test_hostha_kvm.py
test_hostha_kvm_host_recovering Error 0.04 test_hostha_kvm.py
test_remove_ha_provider_not_possible Error 0.04 test_hostha_kvm.py

@weizhouapache
Member

@blueorangutan test rocky8 kvm-rocky8

@blueorangutan

@weizhouapache a [SL] Trillian-Jenkins test job (rocky8 mgmt + kvm-rocky8) has been kicked to run smoke tests

@alexandremattioli
Contributor

@JoaoJandre nice one. Just one remark, the following sentence sounds contradictory to me "Additionally, this functionality is only available when using file based storage, such as shared mount-point (iSCSI and FC)", if it supports iSCSI and FC (through a shared mountpoint) it does support block storage, I think the phrasing could cause some confusion as to which types of storage are supported.

@JoaoJandre
Contributor Author

> @JoaoJandre nice one. Just one remark, the following sentence sounds contradictory to me "Additionally, this functionality is only available when using file based storage, such as shared mount-point (iSCSI and FC)", if it supports iSCSI and FC (through a shared mountpoint) it does support block storage, I think the phrasing could cause some confusion as to which types of storage are supported.

Hey @alexandremattioli, I understand your confusion. However, when using shared mount-point, the storage is file-based as far as ACS is concerned: we will not be working with blocks directly, only with files (as ACS already does for shared mount-point). The items in parentheses are examples of the underlying storage that might sit behind the shared mount-point.

I have updated the description to add a little more context.

@blueorangutan

[SF] Trillian test result (tid-10523)
Environment: kvm-rocky8 (x2), Advanced Networking with Mgmt server r8
Total time taken: 47815 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr9270-t10523-kvm-rocky8.zip
Smoke tests completed. 131 look OK, 3 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File
ContextSuite context=TestListIdsParams>:teardown Error 1.15 test_list_ids_parameter.py
test_01_snapshot_root_disk Error 6.17 test_snapshots.py
test_02_list_snapshots_with_removed_data_store Error 49.21 test_snapshots.py
test_02_list_snapshots_with_removed_data_store Error 49.21 test_snapshots.py
ContextSuite context=TestSnapshotStandaloneBackup>:teardown Error 60.81 test_snapshots.py
test_01_snapshot_usage Error 26.05 test_usage.py

This pull request has merge conflicts. Dear author, please fix the conflicts and sync your branch with the base branch.

@JoaoJandre
Contributor Author

@slavkap I have addressed all of your comments. Could you take another look at the PR?

github-actions bot commented Nov 4, 2024

This pull request has merge conflicts. Dear author, please fix the conflicts and sync your branch with the base branch.

@JoaoJandre
Contributor Author

@blueorangutan package

@blueorangutan

@JoaoJandre a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan

Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 11532

@JoaoJandre
Contributor Author

@DaanHoogland could we run the CI here?

@shwstppr
Contributor

shwstppr commented Nov 8, 2024

@blueorangutan test

@blueorangutan

@shwstppr a [SL] Trillian-Jenkins test job (ol8 mgmt + kvm-ol8) has been kicked to run smoke tests

@blueorangutan

[SF] Trillian test result (tid-11768)
Environment: kvm-ol8 (x2), Advanced Networking with Mgmt server ol8
Total time taken: 53713 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr9270-t11768-kvm-ol8.zip
Smoke tests completed. 140 look OK, 1 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File
test_01_deployVMInSharedNetwork Failure 64.58 test_network.py
ContextSuite context=TestSharedNetworkWithConfigDrive>:teardown Error 64.65 test_network.py

This pull request has merge conflicts. Dear author, please fix the conflicts and sync your branch with the base branch.

@JoaoJandre
Contributor Author

@slavkap @winterhazel there have been multiple changes since you last reviewed. Could you take another look to see if everything looks good to you?

Development

Successfully merging this pull request may close these issues.

KVM Incremental Snapshots/Backups