[techsupport] Removed interactive option for docker commands and Improved Error Reporting #1723

Merged: 12 commits, Sep 16, 2021

@vivekrnv vivekrnv commented Jul 24, 2021

Why I did

Recently, a bug related to saisdkdump was seen; it showed up specifically when show techsupport was invoked. Although it was fixed, the sonic-mgmt test failed to catch it beforehand.

This highlighted a few shortcomings of the generate_dump script. This PR addresses those, along with a few additional issues that were seen; each fix is explained in the next section.

What I did

1) Remove the "Interactive option (-i) for the docker invocation commands"
This was the reason the bug was not captured previously. When techsupport was invoked remotely (e.g. using sshpass), the docker exec -it <docker> <cmd> command would fail with 'the input device is not a TTY'. Hence the (-i) option was removed.
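
As a rough illustration of the failure condition (not code from this PR), a script can check whether its stdin is a terminal before asking docker for an interactive TTY. The helper name docker_exec_flags is hypothetical; the sketch only relies on the standard `[ -t 0 ]` test and the `-i`/`-t` flags of docker exec, and it prints the command instead of running it.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the flags that are safe to pass to `docker exec`.
docker_exec_flags() {
  if [ -t 0 ]; then
    # stdin is a terminal, so allocating a pseudo-TTY works
    echo "-it"
  else
    # no terminal (e.g. sshpass, cron): -t would fail with
    # "the input device is not a TTY", so pass no interactive flags
    echo ""
  fi
}

# Build the command line for a non-interactive caller without executing it.
echo "docker exec $(docker_exec_flags </dev/null) <docker> <cmd>"
```

Run from a remote, non-TTY session, the helper yields no flags, which matches the unconditional removal chosen in this PR.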

2) Change the Return Code
Currently, the script doesn't return any non-zero error codes for most of the intermediate steps (even though they fail), which makes validation hard.

To handle this, a helper function and a trap command are used.

handle_error() {
  if [ "$1" != "0" ]; then
    echo "ERR: RC:-$1 observed on line $2" >&2
    RETURN_CODE=1
  fi
}
trap 'handle_error $? $LINENO' ERR # This would trap any executions with non-zero return codes

The global variable RETURN_CODE is set whenever the handler is called, and the same RETURN_CODE is returned when the generate_dump process exits.

You may notice that the trap is set in multiple functions instead of once at the top of the script. This is because bash does not propagate an ERR trap into shell functions unless errtrace (set -E) is enabled, so each function needs its own explicit trap command.

When a command fails, this logic writes a corresponding log line to stderr:
ERR: RC:-1 observed on line 729
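
The effect of setting the trap per function can be checked with a small standalone script. This is a sketch, not code from generate_dump: handle_error is copied from above, while ERR_COUNT, step_without_trap and step_with_trap are made-up names used only for illustration.

```shell
#!/usr/bin/env bash
RETURN_CODE=0
ERR_COUNT=0   # illustration only: counts how often the handler fired

handle_error() {
  if [ "$1" != "0" ]; then
    echo "ERR: RC:-$1 observed on line $2" >&2
    RETURN_CODE=1
    ERR_COUNT=$((ERR_COUNT + 1))
  fi
}

# Trap set once at the top level; errtrace (set -E) is NOT enabled.
trap 'handle_error $? $LINENO' ERR

step_without_trap() {
  false   # failure goes unreported: the ERR trap is not active in here
  true
}

step_with_trap() {
  trap 'handle_error $? $LINENO' ERR   # explicit trap, as done in this PR
  false   # failure is reported and RETURN_CODE is set
  true    # execution continues after the failure
}

step_without_trap
echo "after step_without_trap: ERR_COUNT=$ERR_COUNT"
step_with_trap
echo "after step_with_trap: ERR_COUNT=$ERR_COUNT RETURN_CODE=$RETURN_CODE"
```

An alternative would be to enable errtrace with set -E and keep a single trap at the top of the script; that is not what this PR does.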

3) Improve Error Reporting for save_cmd function

Currently, any error written to stderr by the intermediate calls is redirected to the same location as stdout, which is usually the file seen under the dump/ dir. This is perfectly fine, but the sonic-mgmt test only parses the text seen on stdout.

So, a new option (-r) is added to the generate_dump script, and subsequently to show techsupport, to redirect any intermediate errors to generate_dump's stderr.

With this option enabled, these sorts of errors can be captured on stderr.

root@sonic:/home/admin# show techsupport -r
..........
timeout --foreground 5m show queue counters > /var/dump/sonic_dump_r-tigon-04_20210714_062239/dump/queue.counters_1
Traceback (most recent call last):
  File "/usr/local/bin/queuestat", line 373, in <module>
    main()
  File "/usr/local/bin/queuestat", line 368, in main
    queuestat.get_print_all_stat(json_opt)
  File "/usr/local/bin/queuestat", line 239, in get_print_all_stat
    cnstat_dict = self.get_cnstat(self.port_queues_map[port])
  File "/usr/local/bin/queuestat", line 168, in get_cnstat
    cnstat_dict[queue] = get_counters(queue_map[queue])
  File "/usr/local/bin/queuestat", line 158, in get_counters
    fields[pos] = str(int(counter_data))
ValueError: invalid literal for int() with base 10: ''
handle_error $? $LINENO
ERR: RC:-1 observed on line 199
Command: show queue counters timedout after 5 minutes.
.............

Without that option, this'll be the output seen

root@sonic:/home/admin# show techsupport 
..........
timeout --foreground 5m show queue counters &> /var/dump/sonic_dump_r-tigon-04_20210714_062239/dump/queue.counters_1
handle_error $? $LINENO
ERR: RC:-1 observed on line 199
Command: show queue counters timedout after 5 minutes.
.............
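
The difference between the two runs above comes down to whether stderr is redirected into the dump file together with stdout. A minimal sketch of that switch follows, assuming a global SAVE_STDERR flag that -r sets to false; run_and_save is a hypothetical stand-in, not the real save_cmd.

```shell
#!/usr/bin/env bash
SAVE_STDERR=true   # assumed global default; -r would flip it to false

# Hypothetical stand-in for save_cmd: run a command and save its output.
run_and_save() {
  local cmd="$1"
  local filepath="$2"
  if [ "$SAVE_STDERR" = true ]; then
    # stdout and stderr both land in the dump file (same effect as &>)
    eval "$cmd" > "$filepath" 2>&1
  else
    # only stdout is saved; stderr flows to the caller's stderr
    eval "$cmd" > "$filepath"
  fi
}

out_file="$(mktemp)"
SAVE_STDERR=false
run_and_save 'echo data; echo oops >&2' "$out_file"   # "oops" goes to stderr
cat "$out_file"                                       # file holds only stdout
rm -f "$out_file"
```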

4) Minor Error in sdk-dump collection logic handled
save_file is now called only for the files found in sdk_dump_path, not for directories. Previously, the following error was seen:

cp: -r not specified; omitting directory '/tmp/sdk-dumps'
handle_error $? $LINENO
ERR: RC:-1 observed on line 729
tar: sonic_dump_r-tigon-04_20210714_062239/sai_sdk_dump/sdk-dumps: Cannot stat: No such file or directory
tar: Exiting with failure status due to previous errors
tar append operation failed. Aborting to prevent data loss.

The reason is that find /tmp/sdk-dumps returns ["/tmp/sdk-dumps"] even if the dir is empty. In the next step, save_file is applied to that directory, hence the error. This is handled by the change described above.
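
The find behaviour described here is easy to reproduce. The sketch below uses a temporary directory as a stand-in for an empty /tmp/sdk-dumps and compares a plain find with one restricted to regular files; the variable names are illustrative, and the exact filter used in generate_dump may differ.

```shell
#!/usr/bin/env bash
dump_dir="$(mktemp -d)"   # stand-in for an empty /tmp/sdk-dumps

all_entries="$(find "$dump_dir")"          # lists the directory itself
only_files="$(find "$dump_dir" -type f)"   # empty for an empty directory

echo "find:         '${all_entries}'"
echo "find -type f: '${only_files}'"

rmdir "$dump_dir"
```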

5) Minor Error in custom plugins logic handled
Added a condition to check if the dir exists before proceeding forward.

if [[ -d ${PLUGINS_DIR} ]]; then
    local -r dump_plugins="$(find ${PLUGINS_DIR} -type f -executable)"
    for plugin in $dump_plugins; do
        # save stdout output of plugin and gzip it
        save_cmd "$plugin" "$(basename $plugin)" true false
    done
fi

Otherwise, the find command might fail with:

root@sonic:/home/admin# find /usr/local/bin/debug-dump -type f -executable
find: ‘/usr/local/bin/debug-dump’: No such file or directory

NOTE: The last two issues were found thanks to the error reporting logic added in this PR.

How I did it

How to verify it

Previous command output (if the output of a command-line utility has changed)

New command output (if the output of a command-line utility has changed)

Signed-off-by: Vivek Reddy Karri <[email protected]>
dgsudharsan
dgsudharsan previously approved these changes Jul 31, 2021
liat-grozovik
liat-grozovik previously approved these changes Aug 26, 2021
scripts/generate_dump (outdated)
local start_t=$(date +%s%3N)
local end_t=0
local cmd="$1"
local filename=$2
local filepath="${LOGDIR}/$filename"
local do_gzip=${3:-false}
-local save_stderr=${4:-true}
+local save_stderr=${4:-$SAVE_STDERR}
Contributor

@qiluo-msft qiluo-msft Aug 31, 2021

save_stderr

The intention of save_stderr is different from your new feature -r.

Actually I don't see the point of save_stderr=False.

Suggest you

  1. remove old feature, always save stderr
  2. add a new option like inherit_stderr, and still keep save stderr. #Closed

Contributor Author

The use case of stderr=False, as I see it, is only for the sonic-mgmt auto-techsupport test.

If this is set to false, any intermediate stderr will be redirected to the stderr of techsupport, and the test can be enhanced to capture that stderr. That way we have a single location to view the errors reported by any of the intermediate steps.

Contributor

> The use case of stderr=False, as I see it, is only for the sonic-mgmt auto-techsupport test.

Could you give a link to the usage in the sonic-mgmt repo?

Contributor Author

@vivekrnv vivekrnv Sep 2, 2021

No, it's not there yet. But we are planning to update it once this gets merged.

scripts/generate_dump (outdated)
local start_t=$(date +%s%3N)
local end_t=0
local docker=$1
local filename=$2
local dstpath=$3
local timeout_cmd="timeout --foreground ${TIMEOUT_MIN}m"

local touch_cmd="sudo docker exec -i ${docker} touch ${filename}"
Contributor

@qiluo-msft qiluo-msft Aug 31, 2021

-i

Could you detect the environment and add -i as the default. #Closed

Contributor Author

But why do we require -i for a touch command? In fact, I've looked at the other docker exec commands used in this script. All of them just write to stdout, so I don't see why -i has to be retained.

Contributor Author

Let me know what you think about this.

scripts/generate_dump (outdated)
Signed-off-by: Vivek Reddy Karri <[email protected]>
Signed-off-by: Vivek Reddy Karri <[email protected]>
local start_t=$(date +%s%3N)
local end_t=0
local cmd="$1"
local filename=$2
local filepath="${LOGDIR}/$filename"
local do_gzip=${3:-false}
-local save_stderr=${4:-true}
+local save_stderr=${4:-$SAVE_STDERR}
Contributor

@qiluo-msft qiluo-msft Sep 3, 2021

SAVE_STDERR

You should add $SAVE_STDERR to caller place as an argument, and keep true as default parameter value. #Closed

Contributor Author

From what I understand, wouldn't that require adding an extra argument to all the save_cmd invocations across the script? That would be fine if this argument were something like filename, which might change between invocations.

But since the scope of this flag is global, I think it makes sense to have it as the default parameter value. Let me know if you think otherwise.

Contributor

I understand your point. Would it make more sense to use only the global variable and remove $4?

Contributor Author

The stderr arg was first introduced in #1335.

For some reason, it was set to false in the normal execution itself (probably because the author only wanted to collect stdout). I did not want to disturb that, so I retained the argument, but it is read from the global variable if not provided.

Contributor

Could you discuss with @stepanblyschak in the same company to understand the original intention? To me, using a global variable as a default value is weird.

Contributor Author

@stepanblyschak confirmed that this local variable $4 can be removed. I'll remove it and keep only the global variable.

Contributor Author

DONE

@dgsudharsan
Collaborator

@qiluo-msft Can you please review?

Signed-off-by: Vivek Reddy Karri <[email protected]>
@qiluo-msft qiluo-msft merged commit 826311c into sonic-net:master Sep 16, 2021
@qiluo-msft
Contributor

This PR could not be cleanly cherry-picked to 202012. Please submit another PR.

@vivekrnv
Contributor Author

> This PR could not be cleanly cherry-picked to 202012. Please submit another PR.

Raised a separate PR: #1833

qiluo-msft pushed a commit that referenced this pull request Sep 21, 2021
…oved Error Reporting (#1833)

PR #1723 cannot be cherry-picked directly to 202012. Thus raised a separate PR
@judyjoseph
Contributor

This PR could not be cleanly cherry-picked to 202106. Please submit another PR.

@vivekrnv
Contributor Author

> This PR could not be cleanly cherry-picked to 202106. Please submit another PR.

PR: #1843

yxieca added a commit that referenced this pull request Sep 28, 2021
What I did
Fix: sonic-net/sonic-buildimage#8850

Issue was introduced by #1723, #1833, and #1843 (pending merge)

The error_handler is a great idea but the bash script needs to be error free first.

How I did it
Fix bash script errors.

How to verify it
Run the show techsupport test.

Signed-off-by: Ying Xie <[email protected]>
qiluo-msft pushed a commit that referenced this pull request Sep 29, 2021
@qiluo-msft
Contributor

@vivekreddynv @dgsudharsan The new behavior is that the script fails immediately if any command returns an error. In practice, this is not convenient. Suppose there is an image bug, some command fails, or this script itself has a bug: we still need to collect as much data as possible. Could you improve this and let the subsequent commands continue running?

@vivekrnv
Contributor Author

> @vivekreddynv @dgsudharsan The new behavior is that the script fails immediately if any command returns an error. In practice, this is not convenient. Suppose there is an image bug, some command fails, or this script itself has a bug: we still need to collect as much data as possible. Could you improve this and let the subsequent commands continue running?

Hi @qiluo-msft,

The script will run in its entirety (it will create the archive with all the dumps) even if any of the intermediate steps fail. At the end, it will exit with rc=1.

But I understand the issue seen in #1844. When a timeout happens, the script returns with a non-zero exit code and bypasses the "Command timedout" error log. I believe that is fixed in that PR.

Is exiting with a non-zero code not okay in the command timed-out case?
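
For reference, GNU timeout exits with status 124 when the time limit is hit and otherwise passes through the command's own exit status, so the two cases can be told apart. The sketch below shows one way a wrapper could log them differently while still returning the status to the caller; run_step and its messages are hypothetical and are not the logic used in generate_dump.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: run a command under a time limit and classify the result.
run_step() {
  local limit="$1"
  shift
  timeout --foreground "$limit" "$@"
  local rc=$?
  if [ "$rc" -eq 124 ]; then
    # 124 is the status GNU timeout uses when the limit expires
    echo "Command: $* timed out after $limit" >&2
  elif [ "$rc" -ne 0 ]; then
    # any other non-zero status comes from the command itself
    echo "Command: $* failed with RC $rc" >&2
  fi
  return "$rc"
}

run_step 1s sleep 3   # hits the limit, returns 124
run_step 5s false     # fails on its own, returns 1
run_step 5s true      # succeeds, returns 0
```

Because the wrapper only returns the status and never calls exit, a caller can record the failure and move on to the next collection step.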

@qiluo-msft
Contributor

@yxieca to check

yxieca added a commit to sonic-net/sonic-mgmt that referenced this pull request Oct 5, 2021
…#4409)

What is the motivation for this PR?
show tech support command test case is failing.

The change was from: sonic-net/sonic-utilities#1723

How did you do it?
Update the command format

How did you verify/test it?
Run show_techsupport test.

Signed-off-by: Ying Xie [email protected]
vivekrnv pushed a commit to vivekrnv/sonic-utilities that referenced this pull request Oct 16, 2021
…#1844)

What I did
Fix: sonic-net/sonic-buildimage#8850

Issue was introduced by sonic-net#1723, sonic-net#1833, and sonic-net#1843 (pending merge)

The error_handler is a great idea but the bash script needs to be error free first.

How I did it
Fix bash script errors.

How to verify it
run show techsupport test..

Signed-off-by: Ying Xie <[email protected]>
yxieca pushed a commit that referenced this pull request Oct 25, 2021
…oved Error Reporting (#1843)

qiluo-msft pushed a commit that referenced this pull request Nov 8, 2021
#### What I did

This PR includes some fixes that were missed while manually porting the error reporting PR onto 202012 (#1833).

i.e. removing the -it option from the docker exec commands. To understand why the -it option was removed, refer to #1723.

This also includes another fix, which removes -d from the show ip interface command; the command fails otherwise.

**Note:** The -d option for "show ip interface" works on master and 202106, but not on 202012. So this change is specific to 202012.

Master:
```
admin@sonic-master-imge:~$ show ip interfaces -d all
Interface        Master    IPv4 address/mask    Admin/Oper    BGP Neighbor    Neighbor IP
---------------  --------  -------------------  ------------  --------------  -------------
Loopback0                  10.1.0.32/32         up/up         N/A             N/A
PortChannel0001            10.0.0.56/31         up/up         ARISTA01T1      10.0.0.57
PortChannel0002            10.0.0.58/31         up/up         ARISTA02T1      10.0.0.59
PortChannel0003            10.0.0.60/31         up/up         ARISTA03T1      10.0.0.61
PortChannel0004            10.0.0.62/31         up/up         ARISTA04T1      10.0.0.63
Vlan1000                   192.168.0.1/21       up/up         N/A             N/A
docker0                    240.127.1.1/24       up/down       N/A             N/A
eth0                       10.75.206.180/24     up/up         N/A             N/A
lo                         127.0.0.1/16         up/up         N/A             N/A
```

202012:
```
admin@sonic-202012-image:~$ show ip interfaces -d all
Usage: show ip interfaces [OPTIONS]
Try "show ip interfaces -h" for help.

Error: no such option: -d
```



#### How I did it

#### How to verify it

- Run show techsupport and check the return status. It should be zero. (At least, it was on the Mellanox platform. I couldn't check the platform-specific functions.)
- Run the "show techsupport" test.
liat-grozovik pushed a commit that referenced this pull request Nov 10, 2021
- What I did
This PR includes some fixes that were missed for #1723,
i.e. removing the -t option from the docker exec commands. To understand why the -it option was removed, refer to #1723.
Also, show techsupport exits with $RETURN_CODE only when the --redirect-stderr option is used.

Signed-off-by: Vivek Reddy Karri <[email protected]>
@qiluo-msft
Contributor

This commit could not be cleanly cherry-picked to 202012. Please submit another PR.

praveen-li pushed a commit to praveen-li/sonic-utilities that referenced this pull request Feb 8, 2022
…oved Error Reporting (sonic-net#1833)

PR sonic-net#1723 cannot be cherry-picked directly to 202012. Thus raised a separate PR
praveen-li pushed a commit to praveen-li/sonic-utilities that referenced this pull request Feb 8, 2022
…#1844)
