Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding global-timeout, individual command timeout, log files collection #1249

Merged
merged 4 commits into from
Jan 4, 2021

Conversation

FuzailBrcm
Copy link
Contributor

and some other enhancements to techsupport

- What I did
Following is the brief description of the changes,

  • Adding a ‘--silent’ option to ‘show techsupport’ command. Various tar/untar, addition and removal logs appear on the console by default. This option would disable above logs.
  • Adding global and per-command timeouts. This would provide more user control on ‘show techsupport’ CLI.
  • Adding time profiling information for the commands in techsupport. Time profiling information would be part of the tarball and helps to analyse the time consumption per command.
  • Sometimes ‘syncd’ docker is down and bcmshell is unavailable. In such cases all the bcmcmd commands would timeout and result in tremendous increase in the total techsupport collection time. We provided an option to skip rest of the bcmcmd commands once one command times out.
  • Added ‘show services’, ‘show reboot-cause’ and various BGP, BFD, bcm shell and other commands
  • Optimised the /var/log files collection. If the number of files are large in /var/log folder, it takes a long time to add each individually to the tarball. If the folder is tar'ed at once, the time taken reduces significantly.
  • Following error was observed while tar'ing softlinks inside .etc folder.
    ** Tar append operation failed. Aborting for safety. **
    This issue was due to softlinks present at /etc folder where the destination file is absent. Fixed this issue by deleting such softlinks before adding them to the tarball.

- How I did it

  • Added new options to the CLICK command 'show techsupport'
  • Modified the 'generate_dump' script to accomodate other changes

- How to verify it
Here are some outputs,
root@sonic:/home/admin# show techsupport --silent
Techsupport is running with silent option. This command might take a long time.

HW Mgmt dump script /usr/bin/hw-management-generate-dump.sh does not exist
/var/dump/sonic_dump_sonic_20201117_161246.tar.gz
root@sonic:/home/admin#
root@sonic:~# show techsupport -h
Usage: show techsupport [OPTIONS]

Gather information for troubleshooting

Options:
--since TEXT Collect logs and core files since given date
-g, --global-timeout INTEGER Global timeout for techsupport in minutes.
Default 30 mins
-c, --cmd-timeout INTEGER Command timeout for techsupport in minutes.
Default 5 mins
--verbose Enable verbose output
--silent Run techsupport in silent mode
-?, -h, --help Show this message and exit.
root@sonic:~#

- Previous command output (if the output of a command-line utility has changed)

  • Previous command "show techsupport" works as is
    - New command output (if the output of a command-line utility has changed)

@FuzailBrcm
Copy link
Contributor Author

retest default please

@FuzailBrcm
Copy link
Contributor Author

retest this please

1 similar comment
@FuzailBrcm
Copy link
Contributor Author

retest this please

@smaheshm
Copy link
Contributor

smaheshm commented Dec 9, 2020

Thanks for enhancing the tech support collection. One suggestion I have is next time PRs like this can be split into multiple PRs with each PR having a specific fix or enhancement. This makes it easier to revert a PR when there's an issue, also the PR looks cleaner and easier on the reviewer.

@smaheshm
Copy link
Contributor

smaheshm commented Dec 9, 2020

Have you raised or going to raise a PR for tech support unit tests
https://github.com/Azure/sonic-mgmt/tree/master/tests/show_techsupport

scripts/generate_dump Outdated Show resolved Hide resolved
scripts/generate_dump Outdated Show resolved Hide resolved
scripts/generate_dump Outdated Show resolved Hide resolved
scripts/generate_dump Outdated Show resolved Hide resolved
scripts/generate_dump Outdated Show resolved Hide resolved
scripts/generate_dump Outdated Show resolved Hide resolved
scripts/generate_dump Outdated Show resolved Hide resolved
scripts/generate_dump Outdated Show resolved Hide resolved
scripts/generate_dump Outdated Show resolved Hide resolved
scripts/generate_dump Outdated Show resolved Hide resolved
scripts/generate_dump Outdated Show resolved Hide resolved
scripts/generate_dump Outdated Show resolved Hide resolved
scripts/generate_dump Outdated Show resolved Hide resolved
scripts/generate_dump Outdated Show resolved Hide resolved
Copy link
Contributor

@smaheshm smaheshm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd suggest to break this into multiple PRs, up to you.

Copy link
Contributor Author

@FuzailBrcm FuzailBrcm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • As most of the review comments are regarding the multi_asic, I would like to keep all this in single PR.
  • I have done manual testing
  • I am adding some tests in sonic-mgmt for the changes

scripts/generate_dump Outdated Show resolved Hide resolved
@smaheshm
Copy link
Contributor

Thanks for making the changes. Looks good.
My main concern is the unit test for existing test cases. Would like to see the output of pytest.

Since you added new features, in case you are still working on writing unit tests for the new features, would like to see the output. For example.. the output of "global timeout" case and the "command timeout" case. I'm sure you've tested it, just 'copy/paste' the output.

Regarding "I am adding some tests in sonic-mgmt for the changes" -- have you raised a PR or are working on it.

Adding @qiluo-msft for another pair of eyes.

Rest looks good to me.

show/main.py Outdated Show resolved Hide resolved
scripts/generate_dump Outdated Show resolved Hide resolved
show/main.py Outdated Show resolved Hide resolved
@FuzailBrcm
Copy link
Contributor Author

retest this please

1 similar comment
@FuzailBrcm
Copy link
Contributor Author

retest this please

smaheshm
smaheshm previously approved these changes Dec 18, 2020
Copy link
Contributor

@smaheshm smaheshm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Please wait for one more approval.

@ben-gale
Copy link
Collaborator

@lguohan - Can this one merge now pls?

local filepath="${LOGDIR}/$filename"
local do_gzip=${3:-false}
local tarpath="${BASE}/dump/$filename"
local timeout_cmd="timeout --foreground ${TIMEOUT_MIN}m"
Copy link
Contributor

@qiluo-msft qiluo-msft Dec 19, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

timeout [](start = 23, length = 7)

bcmcmd support timeout -t. Could you use that instead of timeout command? Could you use that and maybe together with timeout command

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@qiluo-msft
We can't predict the behaviour of 'bcmcmd' in case syncd is crashed/stuck/in bad shape. The change you are suggesting might require very extensive testing.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the my only concern, others lgtm.


In reply to: 546269345 [](ancestors = 546269345)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

'timeout' supersedes the '-t' option of 'bcmcmd'. Is the concern regarding 'bcmcmd' may end up in a bad state like leaving the socket open?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bcmcmd is not well tested with signal termination. I am not sure.


In reply to: 550679331 [](ancestors = 550679331,546269345)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@qiluo-msft
"Could you use that and maybe together with timeout command" --- The bcmcmd has a default timeout of 30 sec. If no argument is given using -t option, '-t 30' is automatically taken. So the outside 'timeout' command is being used together with bcmcmd timeout option.

root@sonic:/home/admin# bcmcmd
USAGE: bcmcmd [-f <sun_path>] -v <cmd>
  -v                         verbose mode
  -f                         domain socket filename, default /var/run/sswsyncd/sswsyncd.socket
  -t                         timeout in seconds, default 30
RETURN VALUE:
    0                        success
    5                        socket io error
   22                        invalid argument
   62                        timeout

root@sonic:/home/admin#

scripts/generate_dump Outdated Show resolved Hide resolved

handle_signal()
{
echo "Generate Dump received interrupt"
Copy link
Contributor

@qiluo-msft qiluo-msft Dec 22, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

echo [](start = 4, length = 4)

echo to stderr? #Closed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

handle_signal()
{
echo "Generate Dump received interrupt"
$RM $V -rf $TARDIR
Copy link
Contributor

@qiluo-msft qiluo-msft Dec 22, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

$RM $V [](start = 4, length = 6)

These variables should be defined before this function. #Closed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

scripts/generate_dump Outdated Show resolved Hide resolved
if [[ ( "$NUM_ASICS" > 1 ) ]]; then
for (( i=0; i<$NUM_ASICS; i++ ))
do
local cmd="bcmcmd -n $i $1"
Copy link
Contributor

@qiluo-msft qiluo-msft Dec 22, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bcmcmd -n [](start = 23, length = 9)

bcmcmd does not have -n option. #Closed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change was requested by @smaheshm.

bcmcmd has been modified to include the ASIC number. sonic-net/sonic-buildimage#4926

@@ -135,7 +233,7 @@ get_vtysh_namespace() {
else
ns=" -n ${asic_id}"
fi
echo "$ns"
#echo "$ns"
Copy link
Contributor

@qiluo-msft qiluo-msft Dec 22, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#echo "$ns" [](start = 4, length = 11)

The function stdout is changed. #Closed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the comment

save_vtysh "show bfd peers json" "frr.bfd.peers.json"
save_vtysh "show bfd peers counters json" "frr.bfd.peers.counters.json"

}
Copy link
Contributor

@qiluo-msft qiluo-msft Dec 22, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove extra blank line #Closed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -335,13 +512,19 @@ save_proc() {
###############################################################################
# Dumps all fields and values from given Redis DB.
# Arguments:
# DB id: id of DB for sonic-db-cli
Copy link
Contributor

@qiluo-msft qiluo-msft Dec 22, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DB id [](start = 3, length = 5)

The option has nothing to do with DB id. Why add? Could you remove? #Closed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed the comments appropriately

scripts/generate_dump Outdated Show resolved Hide resolved
@FuzailBrcm
Copy link
Contributor Author

retest this please

2 similar comments
@smaheshm
Copy link
Contributor

retest this please

@FuzailBrcm
Copy link
Contributor Author

retest this please

@ben-gale
Copy link
Collaborator

@qiluo-msft - can this proceed to approve/merge now? All checks are passing and I believe review comments have been addressed.

@ben-gale
Copy link
Collaborator

ben-gale commented Jan 4, 2021

Pls merge

@qiluo-msft qiluo-msft merged commit 22d79f3 into sonic-net:master Jan 4, 2021
anand-kumar-subramanian pushed a commit to anand-kumar-subramanian/sonic-utilities that referenced this pull request Mar 2, 2021
…on (sonic-net#1249)

 and some other enhancements to techsupport

**- What I did**
Following is the brief description of the changes,
- Adding a ‘--silent’ option to ‘show techsupport’ command. Various  tar/untar, addition and removal logs appear on the console by default. This option would disable above logs.
- Adding global and per-command timeouts. This would provide more user control on ‘show techsupport’ CLI.
- Adding time profiling information for the commands in techsupport. Time profiling information would be part of the tarball and helps to analyse the time consumption per command.
- Sometimes ‘syncd’ docker is down and bcmshell is unavailable. In such cases all the bcmcmd commands would timeout and result in tremendous increase in the total techsupport collection time. We provided an option to skip rest of the bcmcmd commands once one command times out.
- Added ‘show services’, ‘show reboot-cause’ and various BGP, BFD, bcm shell and other commands
- Optimised the /var/log files collection. If the number of files are large in /var/log folder, it takes a long time to add each individually to the tarball. If the folder is tar'ed at once, the time taken reduces significantly. 
- Following error was observed while tar'ing softlinks inside .etc folder. 
  ** Tar append operation failed. Aborting for safety. **
  This issue was due to softlinks present at /etc folder where the destination file is absent. Fixed this issue by deleting such softlinks before adding them to the tarball. 

**- How I did it**
- Added new options to the CLICK command 'show techsupport'
- Modified the 'generate_dump' script to accomodate other changes

**- How to verify it**
Here are some outputs,
root@sonic:/home/admin# show techsupport --silent
Techsupport is running with silent option. This command might take a long time.

HW Mgmt dump script /usr/bin/hw-management-generate-dump.sh does not exist
/var/dump/sonic_dump_sonic_20201117_161246.tar.gz
root@sonic:/home/admin#
root@sonic:~# show techsupport -h
Usage: show techsupport [OPTIONS]

  Gather information for troubleshooting

Options:
  --since TEXT                  Collect logs and core files since given date
  -g, --global-timeout INTEGER  Global timeout for techsupport in minutes.
                                Default 30 mins
  -c, --cmd-timeout INTEGER     Command timeout for techsupport in minutes.
                                Default 5 mins
  --verbose                     Enable verbose output
  --silent                      Run techsupport in silent mode
  -?, -h, --help                Show this message and exit.
root@sonic:~#

**- Previous command output (if the output of a command-line utility has changed)**
- Previous command "show techsupport" works as is
**- New command output (if the output of a command-line utility has changed)**
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants