Implement playbook to recover vmhost server automatically #2800

bingwang-ms · 2021-01-13T10:40:29Z

Signed-off-by: bingwang [email protected]

Description of PR

Summary:
Fixes # (issue)
This commit introduces several playbooks to recover a vmhost server automatically.
Redeploying all testbeds on a host server is extremely time-consuming when the server goes down or is rebooted. This PR adds a new option to testbed-cli.sh that cleans up the host server, and adds a respin of VMs that failed to start.

Type of change

Bug fix
Testbed and Framework(new/improvement)
Test case(new/improvement)

Approach

What is the motivation for this PR?

This PR adds new playbooks to support automatic testbed recovery.

How did you do it?

Add a respin playbook to respin failed VMs.
Add a cleanup playbook to clean up host servers.

How did you verify/test it?

Verified in starlab.

$ ./recover_server.py --testbed-servers server_16 --testbed testbed.csv --vm-file veos --inventory str2 --passfile password.txt
INFO - LOG PATH: /tmp/recover_server_2021-01-19_07-19-44
INFO - Start running task server_16_cleanup_vmhost
INFO - Finish running task server_16_cleanup_vmhost
INFO - Start running task vms16-dual-t0-7050-2_start_topo_vms
INFO - Finish running task vms16-dual-t0-7050-2_start_topo_vms
INFO - Start running task vms16-dual-t0-7050-2_add_topo
INFO - Finish running task vms16-dual-t0-7050-2_add_topo
INFO - Start running task vms16-dual-t0-7050-2_deloy_mg
INFO - Finish running task vms16-dual-t0-7050-2_deloy_mg

============= server_16 recovery summary =============

Server server_16 recovery result:
server_16             start-topo-vms    add-topo    deploy-mg
--------------------  ----------------  ----------  -----------
vms16-dual-t0-7050-2  passed            passed      passed

======================================================

Any platform specific information?

No.

Supported testbed topology if it's a new test case?

No.

Documentation

This commit introduced several playbook to recover vmhost server automatically. Signed-off-by: bingwang <[email protected]>

Signed-off-by: bingwang <[email protected]>

lolyu · 2021-01-19T04:38:36Z

ansible/roles/vm_set/tasks/kickstart_vm.yml

+      src_disk_image: "{{ home_path }}/{{ root_path }}/images/{{ hdd_image_filename }}"
+      disk_image: "{{ home_path }}/{{ root_path }}/disks/{{ vm_name }}_hdd.vmdk"
+      cdrom_image: "{{ home_path }}/{{ root_path }}/images/{{ cd_image_filename }}"
+    when: '"kickstart_code" in kickstart_output and kickstart_output.kickstart_code != 0'


Since almost all tasks in respin_vm.yml need escalated privileges, it would be nicer to apply become like this:

    - name: Respin failed vm
      include_tasks: respin_vm.yml
      args:
        apply:
          become: True
      vars:
        src_disk_image: "{{ home_path }}/{{ root_path }}/images/{{ hdd_image_filename }}"
        disk_image: "{{ home_path }}/{{ root_path }}/disks/{{ vm_name }}_hdd.vmdk"
        cdrom_image: "{{ home_path }}/{{ root_path }}/images/{{ cd_image_filename }}"
      when: '"kickstart_code" in kickstart_output and kickstart_output.kickstart_code != 0'

https://docs.ansible.com/ansible/latest/collections/ansible/builtin/include_tasks_module.html

lolyu · 2021-01-19T04:41:02Z

ansible/testbed_cleanup.yml

+# This playbook will cleanup a vm_host, including removing all veos, containers and net bridges.
+
+- hosts: servers:&vm_host
+  gather_facts: no


Same as above: apply become here.

wangxin · 2021-01-29T10:03:56Z

ansible/roles/vm_set/tasks/kickstart_vm.yml

    set_fact:
      kickstart_failed_vms: "{{ kickstart_failed_vms + [vm_name] }}"
-    when: '"kickstart_code" in kickstart_output_final and kickstart_output_final.kickstart_code != 0'
+    when: '"kickstart_code" in kickstart_output and kickstart_output.kickstart_code != 0'


Is it possible to retry the respin if one round of respin still fails? To avoid endless retries, we could add a maximum retry limit, such as 3 attempts.

bingwang-ms · 2021-02-25

Good suggestion. I'll make this change in the next PR.
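
For illustration, the bounded retry suggested above could look roughly like the following. This is only a sketch in Python, not code from this PR: the names respin_with_retry, respin_once and max_retries are hypothetical, and the follow-up change may well be implemented in the Ansible tasks instead.

```python
from typing import Callable


def respin_with_retry(
    respin_once: Callable[[str], bool],
    vm_name: str,
    max_retries: int = 3,
) -> bool:
    """Respin vm_name until it succeeds or max_retries attempts are used.

    respin_once is a hypothetical callable that performs one respin
    attempt and returns True on success, False on failure.
    """
    for _attempt in range(max_retries):
        if respin_once(vm_name):
            # Stop at the first successful respin.
            return True
    # Every attempt failed; the caller can record the VM as failed.
    return False
```

A caller would pass in whatever performs a single respin and treat a False result the same way a kickstart failure is treated today, for example by adding the VM name to the list of failed VMs.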

bingwang-ms added 2 commits January 13, 2021 02:37

Implenemt playbook for testbed auto recovery.

bbe13c5

This commit introduced several playbook to recover vmhost server automatically. Signed-off-by: bingwang <[email protected]>

Fix typo

c4b7e5a

Signed-off-by: bingwang <[email protected]>

bingwang-ms force-pushed the recover_server branch from 00d562b to c4b7e5a on January 13, 2021 16:02

Add missing testbed_cleanup.yml

9f6186b

Signed-off-by: bingwang <[email protected]>

bingwang-ms marked this pull request as ready for review January 19, 2021 03:48

bingwang-ms requested a review from a team January 19, 2021 03:48

bingwang-ms changed the title ~~[draft]Implenemt playbook to recover vmhost server automatically~~ Implenemt playbook to recover vmhost server automatically Jan 19, 2021

lolyu reviewed Jan 19, 2021

bingwang-ms force-pushed the recover_server branch from 11bad96 to 9f6186b on January 19, 2021 08:10

wangxin reviewed Jan 29, 2021

wangxin approved these changes Feb 25, 2021

bingwang-ms merged commit 0756f4c into sonic-net:master Feb 25, 2021
