Merge branch 'sonic-net:master' into pr_lossless_response_to_external_pause_storms
dks0692 authored Mar 26, 2024
2 parents d74f37c + 1cf09ed commit ef1fcb8
Showing 191 changed files with 12,023 additions and 1,181 deletions.
34 changes: 17 additions & 17 deletions .azure-pipelines/baseline_test/baseline.test.template.yml
@@ -115,20 +115,20 @@ jobs:
STOP_ON_FAILURE: "False"
TEST_PLAN_NUM: $(BASELINE_MGMT_PUBLIC_MASTER_TEST_NUM)

- job: dpu_elastictest
displayName: "kvmtest-dpu by Elastictest"
timeoutInMinutes: 240
continueOnError: false
pool: ubuntu-20.04
steps:
- template: ../run-test-elastictest-template.yml
parameters:
TOPOLOGY: dpu
MIN_WORKER: $(T0_SONIC_INSTANCE_NUM)
MAX_WORKER: $(T0_SONIC_INSTANCE_NUM)
KVM_IMAGE_BRANCH: "master"
MGMT_BRANCH: "master"
BUILD_REASON: "BaselineTest"
RETRY_TIMES: "0"
STOP_ON_FAILURE: "False"
TEST_PLAN_NUM: $(BASELINE_MGMT_PUBLIC_MASTER_TEST_NUM)
# - job: dpu_elastictest
# displayName: "kvmtest-dpu by Elastictest"
# timeoutInMinutes: 240
# continueOnError: false
# pool: ubuntu-20.04
# steps:
# - template: ../run-test-elastictest-template.yml
# parameters:
# TOPOLOGY: dpu
# MIN_WORKER: $(T0_SONIC_INSTANCE_NUM)
# MAX_WORKER: $(T0_SONIC_INSTANCE_NUM)
# KVM_IMAGE_BRANCH: "master"
# MGMT_BRANCH: "master"
# BUILD_REASON: "BaselineTest"
# RETRY_TIMES: "0"
# STOP_ON_FAILURE: "False"
# TEST_PLAN_NUM: $(BASELINE_MGMT_PUBLIC_MASTER_TEST_NUM)
103 changes: 103 additions & 0 deletions .azure-pipelines/recover_testbed/README.md
@@ -0,0 +1,103 @@
# Automatically recover unhealthy testbed via console

## Background
The success rate of the nightly tests depends on the health of the testbeds.
In the past, we used pipelines to re-deploy testbeds when they had problems. This could fix some issues like configuration loss, but it was not enough.
Sometimes, the pipeline failed to restore the testbeds, and we had to do it manually. This was time-consuming and inefficient.
Therefore, we need a better way to automatically recover unhealthy testbeds, which can handle more situations and respond faster.

## Design
Our script is designed to recover devices that have lost their management IP and cannot be accessed via SSH.
The script uses the console as an alternative way to connect to the device and reinstalls the image from the boot loader.

The script first checks the SSH connectivity of the device.
If SSH is working, it then checks the availability of `sonic-installer` on the device.
If `sonic-installer` is working, the device is considered healthy and no further action is needed.
Otherwise, the script proceeds to the recovery process via console.
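
The decision described above can be sketched as a small pure function. This is only an illustration of the flow, not the code in `recover_testbed.py`; the names `Action` and `decide_action` are hypothetical.

```python
from enum import Enum


class Action(Enum):
    """Possible outcomes of the health check (hypothetical names)."""
    HEALTHY = "healthy"
    RECOVER_VIA_CONSOLE = "recover_via_console"


def decide_action(ssh_ok: bool, installer_ok: bool) -> Action:
    """Return the next step for a DUT given the two health probes."""
    if ssh_ok and installer_ok:
        # Both probes passed: the device is considered healthy.
        return Action.HEALTHY
    # SSH is unreachable, or sonic-installer does not work.
    return Action.RECOVER_VIA_CONSOLE
```

Keeping the decision separate from the probes makes it easy to unit test without a device.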

The recovery process depends on the console access of the device.
If console access is not possible, the script cannot proceed.
The script obtains a console session and power cycles the device. It then waits for the right moment to enter the boot loader.
The script supports four types of boot loaders:
+ ONIE: used by Mellanox, Cisco, ACS, Celestica hwskus
+ Marvell: used by Nokia hwskus
+ Loader: used by Nexus hwskus
+ Aboot: used by Arista hwskus
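
As a rough sketch, the list above can be expressed as a lookup table. How the real script derives the device type from the hwsku is not shown here, so the substring match below is an assumption, and `BOOT_LOADER_BY_VENDOR` and `boot_loader_for` are hypothetical names.

```python
# Vendor keyword (lower case) -> boot loader, following the list above.
BOOT_LOADER_BY_VENDOR = {
    "mellanox": "ONIE",
    "cisco": "ONIE",
    "acs": "ONIE",
    "celestica": "ONIE",
    "nokia": "Marvell",
    "nexus": "Loader",
    "arista": "Aboot",
}


def boot_loader_for(hwsku: str) -> str:
    """Pick a boot loader by looking for a vendor keyword in the hwsku.

    Assumption: the vendor name appears somewhere in the hwsku string.
    """
    lowered = hwsku.lower()
    for vendor, loader in BOOT_LOADER_BY_VENDOR.items():
        if vendor in lowered:
            return loader
    # Fail loudly instead of silently skipping unsupported devices.
    raise ValueError("unsupported hwsku: {}".format(hwsku))
```

Raising on an unknown hwsku surfaces unsupported devices as a failure rather than a silent no-op.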

In the boot loader, the script sets a temporary management IP and default route, and then reinstalls the image.
After the image is reinstalled, the script logs in to the device via console again and sets the permanent management IP and default route in SONiC.
It also writes these settings to the `/etc/network/interfaces` file so that they survive a reboot.
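
The values needed for the management interface can all be derived from the management IP in CIDR form. The sketch below uses Python's `ipaddress` module; the function name `mgmt_network_vars` is hypothetical, and treating the first usable host of the subnet as the gateway is an assumption that only holds on networks laid out that way.

```python
import ipaddress


def mgmt_network_vars(mgmt_ip: str) -> dict:
    """Derive interface settings from a management IP such as '10.250.0.101/24'."""
    iface = ipaddress.ip_interface(mgmt_ip)
    return {
        "addr": str(iface.ip),
        "mask": str(iface.netmask),
        # Assumption: the gateway is the first usable host in the subnet.
        "gwaddr": str(next(iter(iface.network.hosts()))),
        "brd_ip": str(iface.network.broadcast_address),
        "network": str(iface.network.network_address),
        "mgmt_ip": mgmt_ip,
    }
```

For example, `mgmt_network_vars("10.250.0.101/24")` yields the address `10.250.0.101`, netmask `255.255.255.0`, gateway `10.250.0.1`, and broadcast `10.250.0.255`.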

Finally, the script verifies that SSH and `sonic-installer` are working on the device. If both are OK, the recovery process is complete.

## Structure
Our scripts are in the folder `.azure-pipelines/recover_testbed`:
```buildoutcfg
.azure-pipelines
|
|-- recover_testbed
    |
    |-- common.py
    |-- constants.py
    |-- dut_connection.py
    |-- interfaces.j2
    |-- recover_testbed.py
    |-- testbed_status.py
```

+ `common.py` - This module contains the common functions that are used for recovering testbeds, such as how to enter the boot loader mode.
These functions are imported by other modules that implement the specific recovery steps for different devices.


+ `constants.py` - This module defines the constants that are used under the recover_testbed folder, such as the SONiC login prompt and the keywords that mark when to send input on the console.
These constants are used to avoid hard-coding and to make the code more readable and maintainable.


+ `dut_connection.py` - This module defines the connection of the DUT, including ssh and console connections.
It provides functions to create these connections, as well as to handle exceptions and errors.
These functions are used to communicate with the DUT and execute commands on it.


+ `interfaces.j2` - This is a Jinja2 template file that is used to generate the file `/etc/network/interfaces` on the DUT.
It defines the network interfaces and their configurations, such as IP address, netmask, gateway, etc.
The template file takes some variables as input, such as the interface name, the IP address range, etc. These variables are passed by the recover_testbed.py module.


+ `recover_testbed.py` - This is the main module that implements the recovery process for the testbed.
It takes some arguments as input, such as the inventory, the device name, the hwsku, etc.
It then calls the appropriate functions from the common.py and dut_connection.py modules to establish a connection with the DUT and enter the recovery mode.
It also uses the interfaces.j2 template file to generate and apply the network configuration on the DUT.
Finally, it verifies that the DUT is successfully recovered and reports the result.


+ `testbed_status.py` - This module defines the DUT statuses, such as a lost management IP address.
It provides functions to check and update these statuses, as well as to log them.
These functions are used by the recover_testbed.py module to monitor and troubleshoot the recovery process.
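
To illustrate how the keyword constants drive the console interaction, here is a simplified sketch that maps one chunk of console output to the keystrokes to send. It condenses the ONIE handling into a single function and is not the code in `common.py`; `keys_for` is a hypothetical name, and the value of `ONIE_ENTRY_IN_GRUB` is assumed because it is not shown here.

```python
ONIE_ENTRY_IN_GRUB = "ONIE"         # assumed value, for illustration only
ONIE_INSTALL_MODEL = "Install"
ONIE_RESCUE_MODEL = "Rescue"
ONIE_START_TO_DISCOVERY = "covery"  # matches "Discovery" and truncated variants
SONIC_PROMPT = "sonic login:"

ENTER = "\n"
ARROW_UP = "\x1b[A"                 # ANSI escape sequence for the up arrow key


def keys_for(chunk: str) -> list:
    """Return the keystrokes to send after reading one chunk of console output."""
    if SONIC_PROMPT in chunk:
        return []  # installation finished, nothing more to send
    if ONIE_START_TO_DISCOVERY in chunk:
        # ONIE is up: press enter to get a shell, then stop discovery.
        return [ENTER, "onie-discovery-stop\n"]
    if ONIE_RESCUE_MODEL in chunk:
        # The cursor is on the rescue entry: move up to install, then confirm.
        return [ARROW_UP, ENTER]
    if ONIE_ENTRY_IN_GRUB in chunk and ONIE_INSTALL_MODEL not in chunk:
        # The GRUB menu shows the ONIE entry: select it.
        return [ENTER]
    return []
```

Keeping the strings in one module means a console wording change on a new platform only needs a constant update.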



## Description of parameters
+ `inventory` - The name of the inventory file that contains the information about the devices in the testbed, such as hostname, IP address, hwsku, etc.


+ `testbed-name` - The name of the testbed. The testbed name should match the name of the testbed file that defines the topology and connections of the devices in the testbed.


+ `tbfile` - The name of the testbed file that defines the topology and connections of the devices in the testbed. The default value is `testbed.yaml`.


+ `verbosity` - The level of verbosity that is used for logging the automation steps and results. Verbosity level can be 0 (silent), 1 (brief), 2 (detailed), or 3 (verbose). The default value is 2.


+ `log-level` - The level of severity that is used for logging the automation messages. Log level can be Error, Warning, Info, or Debug. The default value is Debug.


+ `image` - The URL of the golden image that is installed on the DUT. The golden image should be a valid SONiC image file that can be downloaded from an image server.


+ `hwsku` - The hardware SKU that identifies the model and configuration of the DUT in the testbed.
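
The parameters above could be declared with `argparse` roughly as follows. This is a sketch only: the short flags `-i` and `-t` come from the run command in this README, while the `-v` flag, which arguments are required, and the exact spelling of the log level choices are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the parameters described above (sketch)."""
    parser = argparse.ArgumentParser(
        description="Recover an unhealthy testbed via console")
    parser.add_argument("-i", "--inventory", dest="inventory", required=True,
                        help="Inventory file name")
    parser.add_argument("-t", "--testbed-name", dest="testbed_name", required=True,
                        help="Testbed name")
    parser.add_argument("--tbfile", dest="tbfile", default="testbed.yaml",
                        help="Testbed definition file")
    # Assumption: verbosity is an integer flag named -v/--verbosity.
    parser.add_argument("-v", "--verbosity", dest="verbosity", type=int,
                        choices=[0, 1, 2, 3], default=2,
                        help="0 silent, 1 brief, 2 detailed, 3 verbose")
    # Assumption: log levels are spelled exactly as listed above.
    parser.add_argument("--log-level", dest="log_level",
                        choices=["Error", "Warning", "Info", "Debug"],
                        default="Debug", help="Log level")
    parser.add_argument("--image", dest="image", required=True,
                        help="URL of the golden image")
    parser.add_argument("--hwsku", dest="hwsku", required=True,
                        help="Hardware SKU of the DUT")
    return parser
```

Calling `build_parser().parse_args([...])` with only the required arguments falls back to the documented defaults for `tbfile`, `verbosity`, and `log-level`.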

## How to run the script
The script should be run from the `sonic-mgmt/ansible` directory with the following command:
`python3 ../.azure-pipelines/recover_testbed/recover_testbed.py -i {inventory} -t {tbname} --tbfile {tbfile} --log-level {log-level} --image {image url} --hwsku {hwsku}`
61 changes: 40 additions & 21 deletions .azure-pipelines/recover_testbed/common.py
@@ -6,8 +6,8 @@
import time
import pexpect
import ipaddress
from constants import OS_VERSION_IN_GRUB, ONIE_ENTRY_IN_GRUB, INSTALL_OS_IN_ONIE, \
ONIE_START_TO_DISCOVERY, SONIC_PROMPT, MARVELL_ENTRY
from constants import OS_VERSION_IN_GRUB, ONIE_ENTRY_IN_GRUB, ONIE_INSTALL_MODEL, \
ONIE_START_TO_DISCOVERY, SONIC_PROMPT, MARVELL_ENTRY, BOOTING_INSTALL_OS, ONIE_RESCUE_MODEL

_self_dir = os.path.dirname(os.path.abspath(__file__))
base_path = os.path.realpath(os.path.join(_self_dir, "../.."))
@@ -46,7 +46,7 @@ def get_pdu_managers(sonichosts, conn_graph_facts):
return pdu_managers


def posix_shell_onie(dut_console, mgmt_ip, image_url, is_nexus=False, is_nokia=False):
def posix_shell_onie(dut_console, mgmt_ip, image_url, is_nexus=False, is_nokia=False, is_celestica=False):
enter_onie_flag = True
gw_ip = list(ipaddress.ip_interface(mgmt_ip).network.hosts())[0]

@@ -80,38 +80,57 @@ def posix_shell_onie(dut_console, mgmt_ip, image_url, is_nexus=False, is_nokia=F
dut_console.remote_conn.send(b'\x1b[B')
continue

if ONIE_ENTRY_IN_GRUB in x and INSTALL_OS_IN_ONIE not in x:
if ONIE_ENTRY_IN_GRUB in x and ONIE_INSTALL_MODEL not in x and ONIE_RESCUE_MODEL not in x:
dut_console.remote_conn.send("\n")
enter_onie_flag = False

if ONIE_RESCUE_MODEL in x:
dut_console.remote_conn.send(b'\x1b[A')
dut_console.remote_conn.send("\n")

if is_celestica and BOOTING_INSTALL_OS in x:
dut_console.remote_conn.send("\n")

# "ONIE: Starting ONIE Service Discovery"
if ONIE_START_TO_DISCOVERY in x:
dut_console.remote_conn.send("\n")

# TODO: Define a function to send command here
for i in range(5):
dut_console.remote_conn.send('onie-discovery-stop\n')
dut_console.remote_conn.send("\n")
dut_console.remote_conn.send('onie-discovery-stop\n')
dut_console.remote_conn.send("\n")

if is_nokia:
enter_onie_flag = False
dut_console.remote_conn.send('umount /dev/sda2\n')
if is_nokia:
enter_onie_flag = False
dut_console.remote_conn.send('umount /dev/sda2\n')

dut_console.remote_conn.send("ifconfig eth0 {} netmask {}".format(mgmt_ip.split('/')[0],
ipaddress.ip_interface(mgmt_ip).with_netmask.split('/')[1]))
dut_console.remote_conn.send("\n")
dut_console.remote_conn.send("ifconfig eth0 {} netmask {}".format(mgmt_ip.split('/')[0],
ipaddress.ip_interface(mgmt_ip).with_netmask.split('/')[1]))
dut_console.remote_conn.send("\n")

dut_console.remote_conn.send("ip route add default via {}".format(gw_ip))
dut_console.remote_conn.send("\n")
dut_console.remote_conn.send("ip route add default via {}".format(gw_ip))
dut_console.remote_conn.send("\n")

dut_console.remote_conn.send("onie-nos-install {}".format(image_url))
dut_console.remote_conn.send("\n")
# We will wait some time to connect to image server
# Remove the image if it already exists
dut_console.remote_conn.send("rm -f {}".format(image_url.split("/")[-1]))
dut_console.remote_conn.send("\n")

dut_console.remote_conn.send("wget {}".format(image_url))
dut_console.remote_conn.send("\n")

# Wait for the download to finish
for i in range(5):
time.sleep(60)
x = dut_console.remote_conn.recv(1024)
x = x.decode('ISO-8859-9')
# TODO: Give a sample output here
if "ETA" in x:
# If we see "0:00:00", it means we finish downloading sonic image
# Sample output:
# sonic-mellanox-202012 100% |*******************************| 1196M 0:00:00 ETA
if "0:00:00" in x:
break

dut_console.remote_conn.send("onie-nos-install {}".format(image_url.split("/")[-1]))
dut_console.remote_conn.send("\n")

if SONIC_PROMPT in x:
dut_console.remote_conn.close()

@@ -178,7 +197,7 @@ def posix_shell_aboot(dut_console, mgmt_ip, image_url):
dut_console.remote_conn.send("\n")

for i in range(5):
time.sleep(10)
time.sleep(60)
x = dut_console.remote_conn.recv(1024)
x = x.decode('ISO-8859-9')
if "ETA" in x:
11 changes: 9 additions & 2 deletions .azure-pipelines/recover_testbed/constants.py
@@ -38,11 +38,18 @@
# Press enter to boot the selected OS, `e' to edit the commands
# before booting or `c' for a command-line.

INSTALL_OS_IN_ONIE = "Install OS"
ONIE_INSTALL_MODEL = "Install"
ONIE_RESCUE_MODEL = "Rescue"

# While entering into ONIE, we will get some output like
# " Booting `ONIE: Install OS' "
# " OS Install Mode"
BOOTING_INSTALL_OS = "Booting"

# After entering install mode, ONIE starts discovering its configuration
# and eventually prints the string "ONIE: Starting ONIE Service Discovery"
ONIE_START_TO_DISCOVERY = "Discovery"
# To also match the Celestica console output, we match on the substring "covery"
ONIE_START_TO_DISCOVERY = "covery"

# Finally, if the installation succeeds in ONIE, we get the SONiC login prompt
SONIC_PROMPT = "sonic login:"
1 change: 0 additions & 1 deletion .azure-pipelines/recover_testbed/dut_connection.py
@@ -97,7 +97,6 @@ def get_ssh_info(sonichost):
host=sonichost.im.get_hosts(pattern='sonic')[0]).get("ansible_altpassword")
sonic_password = [creds['sonicadmin_password'], sonicadmin_alt_password]
sonic_ip = sonichost.im.get_host(sonichost.hostname).vars['ansible_host']
logging.info("sonic username: {}, password: {}".format(sonic_username, sonic_password))
return sonic_username, sonic_password, sonic_ip


15 changes: 13 additions & 2 deletions .azure-pipelines/recover_testbed/interfaces.j2
@@ -5,9 +5,20 @@ auto eth0
iface eth0 inet static
address {{ addr }}
netmask {{ mask }}
network {{ network }}
broadcast {{ brd_ip }}
################ management network policy routing rules
#### management port up rules"
up ip route add default via {{ gwaddr }} dev eth0 table default
up ip rule add from {{ addr }}/32 table default
up ip -4 route add default via {{ gwaddr }} dev eth0 table default metric 201
up ip -4 route add {{ mgmt_ip }} dev eth0 table default

# management port down rules
pre-down ip -4 route delete default via {{ gwaddr }} dev eth0 table default
pre-down ip -4 route delete {{ mgmt_ip }} dev eth0 table default

#
source /etc/network/interfaces.d/*
#

{% endblock mgmt_interface %}
#
41 changes: 24 additions & 17 deletions .azure-pipelines/recover_testbed/recover_testbed.py
@@ -5,6 +5,7 @@
import os
import sys
import ipaddress
import traceback
from common import do_power_cycle, check_sonic_installer, posix_shell_aboot, posix_shell_onie
from constants import RC_SSH_FAILED

@@ -44,17 +45,18 @@ def recover_via_console(sonichost, conn_graph_facts, localhost, mgmt_ip, image_u
posix_shell_aboot(dut_console, mgmt_ip, image_url)
elif device_type in ["nexus"]:
posix_shell_onie(dut_console, mgmt_ip, image_url, is_nexus=True)
elif device_type in ["mellanox", "cisco", "acs", "celestica"]:
posix_shell_onie(dut_console, mgmt_ip, image_url)
elif device_type in ["mellanox", "cisco", "acs", "celestica", "force10"]:
is_celestica = device_type in ["celestica"]
posix_shell_onie(dut_console, mgmt_ip, image_url, is_celestica=is_celestica)
elif device_type in ["nokia"]:
posix_shell_onie(dut_console, mgmt_ip, image_url, is_nokia=True)
else:
return
raise Exception("We don't support this type of testbed.")

dut_lose_management_ip(sonichost, conn_graph_facts, localhost, mgmt_ip)
except Exception as e:
logger.info(e)
return
traceback.print_exc()
raise Exception(e)


def recover_testbed(sonichosts, conn_graph_facts, localhost, image_url, hwsku):
@@ -69,15 +71,29 @@ def recover_testbed(sonichosts, conn_graph_facts, localhost, image_url, hwsku):
if type(dut_ssh) == tuple:
logger.info("SSH success.")

# May recover from boot loader, need to delete image file
sonichost.shell("sudo rm -f /host/{}".format(image_url.split("/")[-1]),
module_ignore_errors=True)

# Add ip info into /etc/network/interface
extra_vars = {
'addr': mgmt_ip.split('/')[0],
'mask': ipaddress.ip_interface(mgmt_ip).with_netmask.split('/')[1],
'gwaddr': list(ipaddress.ip_interface(mgmt_ip).network.hosts())[0]
'gwaddr': list(ipaddress.ip_interface(mgmt_ip).network.hosts())[0],
'mgmt_ip': mgmt_ip,
'brd_ip': ipaddress.ip_interface(mgmt_ip).network.broadcast_address,
'network': str(ipaddress.ip_interface(mgmt_ip).network).split('/')[0]
}
sonichost.vm.extra_vars.update(extra_vars)
sonichost.template(src="../.azure-pipelines/recover_testbed/interfaces.j2",
dest="/etc/network/interface")
dest="/etc/network/interfaces")

# Add management ip info into config_db.json
sonichost.template(src="../.azure-pipelines/recover_testbed/mgmt_ip.j2",
dest="/etc/sonic/mgmt_ip.json")
sonichost.shell("configlet -u -j {}".format("/etc/sonic/mgmt_ip.json"))

sonichost.shell("sudo config save -y")

sonic_username = dut_ssh[0]
sonic_password = dut_ssh[1]
@@ -94,8 +110,7 @@ def recover_testbed(sonichosts, conn_graph_facts, localhost, image_url, hwsku):
# Do power cycle
need_to_recover = True
else:
logger.info("Authentication failed. Passwords are incorrect.")
return
raise Exception("Authentication failed. Passwords are incorrect.")

if need_to_recover:
recover_via_console(sonichost, conn_graph_facts, localhost, mgmt_ip, image_url, hwsku)
@@ -182,14 +197,6 @@ def main(args):
help="Loglevel"
)

parser.add_argument(
"-o", "--output",
type=str,
dest="output",
required=False,
help="Output duts version to the specified file."
)

parser.add_argument(
"--image",
type=str,
1 change: 1 addition & 0 deletions .azure-pipelines/recover_testbed/testbed_status.py
@@ -22,4 +22,5 @@ def dut_lose_management_ip(sonichost, conn_graph_facts, localhost, mgmt_ip):
logging.info(e)
finally:
logger.info("=====Recover finish=====")
localhost.pause(seconds=120, prompt="Wait for SONiC initialization")
dut_console.disconnect()