-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve error handling #130
Improve error handling #130
Conversation
Error is silently ignored at the moment, which causes the process to continue even if the Terraform executable could not be moved to /usr/local/bin.
Separate quick operations (copy a file) from long-running, background operations (run ngrok). Use context to cancel background tasks.
Will fix the metalint errors later on today. |
Codecov Report
@@ Coverage Diff @@
## master #130 +/- ##
=======================================
Coverage 63.82% 63.82%
=======================================
Files 40 40
Lines 1957 1957
=======================================
Hits 1249 1249
Misses 644 644
Partials 64 64 Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks so much! And nice work debugging your original issue without any logs!
I ended up getting into a rabbit hole reviewing this so sorry for all the comments. In the end the main gist of my review is that I don't think we should use the cancelFn
and instead just use cmd.Process.Kill()
bootstrap/utils.go
Outdated
@@ -134,11 +135,23 @@ func downloadAndUnzip(url string, path string, target string) error { | |||
return unzip(path, target) | |||
} | |||
|
|||
func executeCmd(cmd string, args []string) (*exec.Cmd, error) { | |||
// Executes a command, waits for it to finish and returns any errors. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
comments should have the name of the function at the start
Notice this comment is a complete sentence that begins with the name of the element it describes
from https://blog.golang.org/godoc-documenting-go-code
So this should read
// executeCmd executes a command, waits for it to finish and returns any errors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My bad. I'm aware of this convention and its implications of course, however for some reason I remembered this was only for exported functions, that is - functions whose name begins with a capital letter. Will fix!
bootstrap/utils.go
Outdated
return nil, err | ||
} | ||
return command, nil | ||
return command.Run() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like changing this to .Run()
. Makes a lot more sense then calling .Wait()
!
The problem with just returning the error from .Run()
though is that if the command exits with code 1 then the only error you get is exit status 1
but you don't get the output from the program.
For example if I ran
err := executeCmd("mv", []string{"/tmp/terraform", "/usr/local/bin/"})
When there was no file /tmp/terraform
then err.Error() == exit status 1
.
Instead, I think we should capture the output as well and return an error that has the output too:
func executeCmd(cmd string, args []string) error {
command := exec.Command(cmd, args...) // #nosec
bytes, err := command.CombinedOutput()
if err != nil {
return fmt.Errorf("%s: %s", err, bytes)
}
return nil
}
Then you'd get: exit status 1: mv: /tmp/terraform: No such file or directory
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup, I actually saw that but blamed mv
for not presenting a more meaningful message... I didn't know about CombinedOutput()
- nice. I think your suggestion is a creative solution to the problem. I'll verify this doesn't introduce other problems and change accordingly.
bootstrap/utils.go
Outdated
return command.Run() | ||
} | ||
|
||
// Executes a command in the background. Returns a context so that the caller may cancel the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make sure to start the comment with executeBackgroundCmd
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will fix.
bootstrap/utils.go
Outdated
|
||
// Executes a command in the background. Returns a context so that the caller may cancel the | ||
// command prematurely if necessary, as well as an errors channel. | ||
func executeBackgroundCmd(cmd string, args []string) (context.CancelFunc, <-chan error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this new function but I've just been testing killing the command using cancel()
and in my tests it actually doesn't kill the process! There's a couple bugs about this floating around in the golang docs.
I've got my test case below but as a result of this issue. I think this function should return the command instead of the cancelFn and then callers can use defer func() {command.Process.Kill()}
:
func executeBackgroundCmd(cmd string, args []string) (*exec.Cmd, <-chan error) {
command := exec.Command(cmd, args...) // #nosec
errChan := make(chan error, 1)
go func() {
bytes, err := command.CombinedOutput()
errChan <- fmt.Errorf("%s: %s", err, bytes)
}()
return command, errChan
}
The weird part about this one is that I kept getting segmentation faults when I would Ctrl-C
the process during the defer ngrokCmd.Process.Kill()
functions. I found out that's because before .CombinedOutput()
is executed, cmd.Process
is nil
. And because the defer
gets evaluated immediately (although not executed) then it would be nil
. The workaround is to wrap the .Process.Kill()
in another func so that it's evaluated once the process gets started. So basically it should be used like:
ngrokCmd, ngrokErrors := executeBackgroundCmd("/tmp/ngrok", []string{"start", "atlantis", "--config", ngrokConfigFile.Name()})
// We need to wrap the .Process.Kill() call because at runtime .Process might be `nil` since it's started in another goroutine.
defer func() { ngrokCmd.Process.Kill() }()
Long explanation for why I don't think we should use the cancelFn:
First, I created a golang binary that won't exit if it receives a SIGINT
:
package main
import (
"os"
"syscall"
"os/signal"
)
func main() {
// This signal trapping isn't necessary but if I just use the select{} I get
// this error: fatal error: all goroutines are asleep - deadlock!
signalChan := make(chan os.Signal, 1)
signal.Notify(signalChan, syscall.SIGINT, syscall.SIGTERM)
<-signalChan
select{}
}
Then I compiled this into a binary called proctest
and created another golang program to call proctest
:
package main
import (
"os/exec"
"log"
"os"
"syscall"
"os/signal"
"context"
)
func main() {
ctx, cancel := context.WithCancel(context.Background())
firstCmd := exec.CommandContext(ctx, "../proctest/proctest")
if err := firstCmd.Start(); err != nil {
log.Fatal(err)
}
defer cancel()
signalChan := make(chan os.Signal, 1)
signal.Notify(signalChan, syscall.SIGINT, syscall.SIGTERM)
<-signalChan
}
When I run this program, I see that it's forked the proctest
process:
pstree | grep -A 2 "./[m]ain"
| | \-+= 80672 lkysow ./main
| | \--- 80673 lkysow ../proctest/proctest
But when I Ctrl-C
it, the proctest
process keeps running:
ps aux|grep "[p]roctest"
lkysow 80673 0.0 0.0 558432544 1180 s001 S 8:27pm 0:00.00 ../proctest/proctest
If I change the code from defer cancel()
to firstCmd.Process.Kill()
then it kills child properly.
The reason I think this is a bug with golang is because if I run a time.Sleep
after the cancel()
then it kills the child properly so it's a weird timing thing where the parent is executing before it sigkills the child or maybe the OS is doing something weird, I don't know :D
But because of this, I think we should get rid of the context.
Interestingly, most of this is moot because Ctrl-C
will send a SIGINT
to all of the processes in the foreground process group which means ngrok and atlantis server
all get a SIGINT before the defer even needs to execute. The defer is just useful in case those processes ignore the SIGINT.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting. Nice job for catching this. I wouldn't jump to blaming the Golang std library for having a bug so fast, though. To me it makes more sense that we're simply doing something wrong...
I wanted to use context
since it was built exactly for these scenarios (chaining calls and cancelling them properly when something happens). It's definitely possible to kill the process, however this feels to me like the equivalent of flipping on a light switch, then breaking the light bulb to turn off the light instead of flipping the switch back off again :-) Hope that makes sense.
Let me review this behavior to try and pinpoint the issue with using the context library (unless you think it's wrong to use it here, assuming we could get it to kill the process properly).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's the correct way to do it so if you can get it to work then 👍
In the end because all processes get the SIGINT anyway they are most likely already exited by the time the defers run so this is just a safe-guard in case they ignore it. Therefore it's not a huge problem either way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also worth noting that in the end the call to cancel()
just sends SIGKILL
so it's the same thing.
The provided context is used to kill the process (by calling os.Process.Kill) if the context becomes done before the command completes on its own.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True, the end result is the same when cancelling or killing, I just felt that it's more correct Golang-wise to start a background task, then cancel it "properly" instead of killing it "from the outside" if that makes sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm still investigating this, but just wanted to say in the meantime - cmd.Start()
must be followed by cmd.Wait()
, otherwise the command's resources aren't getting freed according to the docs. I actually used Run()
and not Start()
in the executeBackgroundCmd()
function.
Inspection of the source code for Run()
shows that it is actually Start()
followed by Wait()
:
func (c *Cmd) Run() error {
if err := c.Start(); err != nil {
return err
}
return c.Wait()
}
Therefore, I am not sure the test you've performed using Start()
is testing the real behavior we are dealing with. I am, however, indeed seeing that when running a child process which ignores SIGINT, the child is left behind when sending SIGINT to the main process. I'm getting the feeling this has something to do with the main process exiting and not letting defer
s complete. Updates will follow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I think I figured out the problem. At the end of Start() there is the following:
if c.ctx != nil {
c.waitDone = make(chan struct{})
go func() {
select {
case <-c.ctx.Done():
c.Process.Kill()
case <-c.waitDone:
}
}()
}
The Start()
function blocks until either the channel returned by c.ctx.Done()
is closed (explicit cancellation of the context) or until c.waitDone
is closed (command exited on its own).
When the main process returns, it runs the deferred cancel()
which closes c.ctx.Done()
, however the main process then immediately terminates before c.Process.Kill()
had a chance to run. This happens because cancel()
doesn't block until the kill function returns, thus creating a race condition between the main function and the kill function. If the main function wins, the kill function never runs and the child process is leaked.
The solution - actively wait for the cancellation to complete using a WaitGroup:
func executeBackgroundCmd(wg *sync.WaitGroup, cmd string, args ...string) (context.CancelFunc, <-chan error) {
ctx, cancel := context.WithCancel(context.Background())
command := exec.CommandContext(ctx, cmd, args...)
errChan := make(chan error, 1)
go func() {
defer wg.Done()
err := command.Run()
errChan <- err
}()
return cancel, errChan
}
func main() {
var wg sync.WaitGroup
wg.Add(1)
cancel, errChan := executeBackgroundCmd(&wg, "./proctest/proctest")
defer func() {
cancel()
wg.Wait()
}()
stop := make(chan os.Signal, 1)
signal.Notify(stop, syscall.SIGINT, syscall.SIGTERM)
select {
case <-stop:
log.Println("Got stop signal")
case err := <-errChan:
log.Printf("Got error: %v", err)
}
}
This seems to work fine.
bootstrap/bootstrap.go
Outdated
if err != nil { | ||
return errors.Wrapf(err, "creating ngrok tunnel") | ||
cancelNgrok, ngrokErrors := executeBackgroundCmd("/tmp/ngrok", []string{"start", "atlantis", "--config", ngrokConfigFile.Name()}) | ||
// Check if we got a fast error. Move on if we haven't (the command is still running). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In my testing, this happens too fast to catch any errors from the command (for example change start
to ldjsjk
and see that it never hits this select
even though that errors out ngrok immediately.
Honestly what really needs to happen here to ensure ngrok is running is we should watch the logs for this line:
client session established
Then we can be sure ngrok is running properly and we can proceed. This would also fix #92.
But until that happens, if you follow what I say below about changing this command to return (exec.Cmd, error) then here you should just check the error and defer cmd.Process.Kill()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Later down we verify that ngrok came up successfully:
tunnelURL, err := getTunnelAddr()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm. This was not my experience. Maybe it's a timing issue which changes from machine to machine. I'll review this.
bootstrap/bootstrap.go
Outdated
go func() { | ||
atlantisErrChan <- atlantisCmd.Wait() | ||
}() | ||
// Check if we got a fast error. Move on if we haven't (the command is still running). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this doesn't really work as expected (see above)
bootstrap/bootstrap.go
Outdated
if err != nil { | ||
colorstring.Printf("[red]=> opening pull request failed. please go to: %s on the browser", pullRequestURL) | ||
colorstring.Printf("[red]=> opening pull request failed. please go to: %s on the browser\n", pullRequestURL) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🍺
colorstring.Println("\n[green]Thank you for using atlantis :) \n[white]For more information about how to use atlantis in production go to: https://github.com/runatlantis/atlantis") | ||
return nil | ||
|
||
// Keep checking for errors from ngrok or atlantis server. Exit normally on shutdown signal. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! I like how this will exit if ngrok or atlantis has issues now.
- Wait for cancel functions to complete before main function returns. - Make executeCmd and executeBackgroundCmd variadic.
I've addressed all the open issues. We now have a (hopefully) proper handling of background task cancellation using In addition, I've made both |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Excellent investigation! I should have read into the docs more, nice work. This is a lot more robust now.
Ensure we're using the http protocol for ngrok
Hi,
Thanks for the work on this awesome project!
I've encountered a bug running
atlantis bootstrap
on Amazon Linux - the process failed to copy the Terraform binary to/usr/local/bin
but the error got silently ignored. This error caused another error when running the Atlantis server, which got ignored, too.Therefore, I've refactored the overall error handling inside the bootstrap process. It doesn't fix the permissions problem on Amazon Linux (
sudo
needs to be used, which is outside the scope of this PR), but it does show an error to the user instead of reporting that everything is OK.Any comments are welcome :-)