[sai_failure_dump]Invoking dump during SAI failure #2633

Merged 2 commits on Feb 7, 2023
64 changes: 51 additions & 13 deletions scripts/generate_dump
@@ -1053,21 +1053,26 @@ collect_mellanox() {
local sai_dump_folder="/tmp/saisdkdump"
local sai_dump_filename="${sai_dump_folder}/sai_sdk_dump_$(date +"%m_%d_%Y_%I_%M_%p")"

${CMD_PREFIX}docker exec syncd mkdir -p $sai_dump_folder
${CMD_PREFIX}docker exec syncd saisdkdump -f $sai_dump_filename

if [ $? != 0 ]; then
echo "Failed to collect saisdkdump."
fi
if [[ "$( docker container inspect -f '{{.State.Running}}' syncd )" == "true" ]]; then
Contributor:

Why is this change part of this PR? It is not related to the feature, correct?
I suggest reverting this change.

Collaborator (Author):

saisdkdump runs as part of the regular techsupport flow, which is also invoked during auto techsupport. During an orchagent abort, syncd is not running, so this flow throws errors. That is the main reason we are introducing this feature. Since we are already taking the SAI failure dump, a check is added here to ensure that saisdkdump is not called in the regular techsupport flow when syncd is not running.
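
The guard described in this reply can be sketched in isolation as follows. This is a minimal illustration under stated assumptions, not the code in the PR: the helper names `syncd_is_running`, `port_init_done`, and `should_run_saisdkdump`, and the `DOCKER` / `SONIC_DB_CLI` override variables, are hypothetical and exist only so the check can be exercised off a switch. The two underlying commands are the ones that appear in the diff.

```shell
# Sketch only: the helper names and the DOCKER / SONIC_DB_CLI overrides are
# hypothetical and are not part of scripts/generate_dump.
DOCKER="${DOCKER:-docker}"
SONIC_DB_CLI="${SONIC_DB_CLI:-sonic-db-cli}"

# Succeeds only when the syncd container reports State.Running == true.
# If the container (or docker itself) is missing, the output is empty and
# the comparison fails, so the caller skips the dump.
syncd_is_running() {
    [ "$("$DOCKER" container inspect -f '{{.State.Running}}' syncd 2>/dev/null)" = "true" ]
}

# Succeeds only when PORT_TABLE:PortInitDone exists in APPL_DB, which the
# diff uses as the signal that create_switch completed.
port_init_done() {
    [ "$("$SONIC_DB_CLI" APPL_DB EXISTS PORT_TABLE:PortInitDone 2>/dev/null)" = "1" ]
}

# Both conditions must hold before saisdkdump is attempted.
should_run_saisdkdump() {
    syncd_is_running && port_init_done
}

if should_run_saisdkdump; then
    echo "syncd is up and ports are initialized: saisdkdump can run"
else
    echo "skipping saisdkdump"
fi
```

Splitting the two conditions into separate predicates keeps each failure mode (container down, switch not yet created) visible on its own; the diff expresses the same logic as two nested `if` statements.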

if [[ x"$(sonic-db-cli APPL_DB EXISTS PORT_TABLE:PortInitDone)" == x"1" ]]; then
# Run saisdkdump only after the create_switch is known to be successful
${CMD_PREFIX}docker exec syncd mkdir -p $sai_dump_folder
${CMD_PREFIX}docker exec syncd saisdkdump -f $sai_dump_filename

if [ $? != 0 ]; then
echo "Failed to collect saisdkdump."
fi

copy_from_docker syncd $sai_dump_folder $sai_dump_folder
echo "$sai_dump_folder"
for file in `ls $sai_dump_folder`; do
save_file ${sai_dump_folder}/${file} sai_sdk_dump true
done
copy_from_docker syncd $sai_dump_folder $sai_dump_folder
echo "$sai_dump_folder"
for file in `ls $sai_dump_folder`; do
save_file ${sai_dump_folder}/${file} sai_sdk_dump true
done

${CMD_PREFIX}rm -rf $sai_dump_folder
${CMD_PREFIX}docker exec syncd rm -rf $sai_dump_folder
${CMD_PREFIX}rm -rf $sai_dump_folder
${CMD_PREFIX}docker exec syncd rm -rf $sai_dump_folder
fi
fi

# run 'hw-management-generate-dump.sh' script and save the result file
HW_DUMP_FILE=/usr/bin/hw-management-generate-dump.sh
@@ -1429,6 +1434,38 @@ save_crash_files() {
    fi
}

###############################################################################
# Collect SAI failure dump files under /var/log/sai_failure_dump/. These files are
# created because of the orchagent abort triggered by SAI programming failure.
# Globals:
#  None
# Arguments:
#  None
# Returns:
#  None
###############################################################################
save_sai_failure_dump(){
    for file in $(find_files "/var/log/sai_failure_dump/"); do
        if $TAR -tf $TARFILE | grep $BASE/log/$(basename $file); then
            # if the files are already collected under the log/ dir
            # just add a symbolic link
            if [ ! -z "${file##*.gz}" ]; then
                # files saved under log/ are zipped with gz
                file=$file.gz
            fi
            ${CMD_PREFIX}save_symlink ${file} sai_failure_dump log
        else
            if [ ! -z "${file##*.gz}" ]; then
                ${CMD_PREFIX}save_file ${file} sai_failure_dump true
            else
                ${CMD_PREFIX}save_file ${file} sai_failure_dump false
            fi
        fi
        # Clean up the file once it's part of tech support
        rm -f $file
    done
}
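
The `[ ! -z "${file##*.gz}" ]` test in the function above relies on a parameter-expansion idiom that is easy to misread. `${file##*.gz}` removes the longest prefix matching `*.gz`, so the result is empty exactly when the name ends in `.gz`, and the test is therefore true for files that are not yet compressed. In the diff, that is the case where `save_file` is passed `true` and where `.gz` is appended before creating the symlink. A minimal sketch, assuming a hypothetical helper `needs_gzip` and made-up file names:

```shell
# Sketch only: needs_gzip and the file names below are hypothetical and are
# not part of scripts/generate_dump.

# Succeeds when the name does NOT end in ".gz".
# ${1##*.gz} strips the longest prefix matching "*.gz"; for a name ending in
# ".gz" that prefix is the whole string, leaving nothing behind.
needs_gzip() {
    [ -n "${1##*.gz}" ]
}

for f in sai_failure_dump.log sai_failure_dump.log.gz; do
    if needs_gzip "$f"; then
        echo "$f: not compressed yet"
    else
        echo "$f: already gzipped"
    fi
done
```

`[ -n ... ]` is equivalent to the `[ ! -z ... ]` form used in the diff; it is used here only because it reads without the double negative.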

###############################################################################
# Get number of ASICs in the platform
# Globals:
@@ -1706,6 +1743,7 @@ main() {
    save_log_files
    save_crash_files
    save_warmboot_files
    save_sai_failure_dump

    if [[ "$asic" = "mellanox" ]]; then
        collect_mellanox_dfw_dumps