Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve error handling #130

Merged
merged 9 commits into from
May 26, 2018
Merged

Conversation

johananl
Copy link
Contributor

Hi,

Thanks for the work on this awesome project!

I've encountered a bug running atlantis bootstrap on Amazon Linux - the process failed to copy the Terraform binary to /usr/local/bin but the error got silently ignored. This error caused another error when running the Atlantis server, which got ignored, too.

Therefore, I've refactored the overall error handling inside the bootstrap process. It doesn't fix the permissions problem on Amazon Linux (sudo needs to be used, which is outside the scope of this PR), but it does show an error to the user instead of reporting that everything is OK.

Any comments are welcome :-)

johananl added 3 commits May 23, 2018 14:31
Error is silently ignored at the moment, which causes the process
to continue even if the Terraform executable could not be moved
to /usr/local/bin.
Separate quick operations (copy a file) from long-running, background
operations (run ngrok).

Use context to cancel background tasks.
@johananl
Copy link
Contributor Author

Will fix the metalint errors later on today.

@codecov
Copy link

codecov bot commented May 23, 2018

Codecov Report

Merging #130 into master will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##           master     #130   +/-   ##
=======================================
  Coverage   63.82%   63.82%           
=======================================
  Files          40       40           
  Lines        1957     1957           
=======================================
  Hits         1249     1249           
  Misses        644      644           
  Partials       64       64

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f802732...27f9b3a. Read the comment docs.

Copy link
Member

@lkysow lkysow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks so much! And nice work debugging your original issue without any logs!

I ended up getting into a rabbit hole reviewing this so sorry for all the comments. In the end the main gist of my review is that I don't think we should use the cancelFn and instead just use cmd.Process.Kill()

@@ -134,11 +135,23 @@ func downloadAndUnzip(url string, path string, target string) error {
return unzip(path, target)
}

func executeCmd(cmd string, args []string) (*exec.Cmd, error) {
// Executes a command, waits for it to finish and returns any errors.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comments should have the name of the function at the start

Notice this comment is a complete sentence that begins with the name of the element it describes
from https://blog.golang.org/godoc-documenting-go-code

So this should read

// executeCmd executes a command, waits for it to finish and returns any errors.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My bad. I'm aware of this convention and its implications of course, however for some reason I remembered this was only for exported functions, that is - functions whose name begins with a capital letter. Will fix!

return nil, err
}
return command, nil
return command.Run()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like changing this to .Run(). Makes a lot more sense then calling .Wait()!

The problem with just returning the error from .Run() though is that if the command exits with code 1 then the only error you get is exit status 1 but you don't get the output from the program.

For example if I ran

err := executeCmd("mv", []string{"/tmp/terraform", "/usr/local/bin/"})

When there was no file /tmp/terraform then err.Error() == exit status 1.

Instead, I think we should capture the output as well and return an error that has the output too:

func executeCmd(cmd string, args []string) error {
	command := exec.Command(cmd, args...) // #nosec
	bytes, err := command.CombinedOutput()
	if err != nil {
		return fmt.Errorf("%s: %s", err, bytes)
	}
	return nil
}

Then you'd get: exit status 1: mv: /tmp/terraform: No such file or directory

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, I actually saw that but blamed mv for not presenting a more meaningful message... I didn't know about CombinedOutput() - nice. I think your suggestion is a creative solution to the problem. I'll verify this doesn't introduce other problems and change accordingly.

return command.Run()
}

// Executes a command in the background. Returns a context so that the caller may cancel the
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sure to start the comment with executeBackgroundCmd

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will fix.


// Executes a command in the background. Returns a context so that the caller may cancel the
// command prematurely if necessary, as well as an errors channel.
func executeBackgroundCmd(cmd string, args []string) (context.CancelFunc, <-chan error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this new function but I've just been testing killing the command using cancel() and in my tests it actually doesn't kill the process! There's a couple bugs about this floating around in the golang docs.

I've got my test case below but as a result of this issue. I think this function should return the command instead of the cancelFn and then callers can use defer func() {command.Process.Kill()}:

func executeBackgroundCmd(cmd string, args []string) (*exec.Cmd, <-chan error) {
	command := exec.Command(cmd, args...) // #nosec
	errChan := make(chan error, 1)
	go func() {
		bytes, err := command.CombinedOutput()
		errChan <- fmt.Errorf("%s: %s", err, bytes)
	}()

	return command, errChan
}

The weird part about this one is that I kept getting segmentation faults when I would Ctrl-C the process during the defer ngrokCmd.Process.Kill() functions. I found out that's because before .CombinedOutput() is executed, cmd.Process is nil. And because the defer gets evaluated immediately (although not executed) then it would be nil. The workaround is to wrap the .Process.Kill() in another func so that it's evaluated once the process gets started. So basically it should be used like:

	ngrokCmd, ngrokErrors := executeBackgroundCmd("/tmp/ngrok", []string{"start", "atlantis", "--config", ngrokConfigFile.Name()})
        // We need to wrap the .Process.Kill() call because at runtime .Process might be `nil` since it's started in another goroutine.
	defer func() { ngrokCmd.Process.Kill() }()

Long explanation for why I don't think we should use the cancelFn:
First, I created a golang binary that won't exit if it receives a SIGINT:

package main

import (
	"os"
	"syscall"
	"os/signal"
)

func main() {
        // This signal trapping isn't necessary but if I just use the select{} I get
        // this error: fatal error: all goroutines are asleep - deadlock!
	signalChan := make(chan os.Signal, 1)
	signal.Notify(signalChan, syscall.SIGINT, syscall.SIGTERM)
	<-signalChan
	select{}
}

Then I compiled this into a binary called proctest and created another golang program to call proctest:

package main

import (
	"os/exec"
	"log"
	"os"
	"syscall"
	"os/signal"
	"context"
)

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	firstCmd := exec.CommandContext(ctx, "../proctest/proctest")
	if err := firstCmd.Start(); err != nil {
		log.Fatal(err)
	}
	defer cancel()

	signalChan := make(chan os.Signal, 1)
	signal.Notify(signalChan, syscall.SIGINT, syscall.SIGTERM)
	<-signalChan
}

When I run this program, I see that it's forked the proctest process:

pstree | grep -A 2 "./[m]ain"
 | |     \-+= 80672 lkysow ./main
 | |       \--- 80673 lkysow ../proctest/proctest

But when I Ctrl-C it, the proctest process keeps running:

ps aux|grep "[p]roctest"
lkysow           80673   0.0  0.0 558432544   1180 s001  S     8:27pm   0:00.00 ../proctest/proctest

If I change the code from defer cancel() to firstCmd.Process.Kill() then it kills child properly.

The reason I think this is a bug with golang is because if I run a time.Sleep after the cancel() then it kills the child properly so it's a weird timing thing where the parent is executing before it sigkills the child or maybe the OS is doing something weird, I don't know :D

But because of this, I think we should get rid of the context.

Interestingly, most of this is moot because Ctrl-C will send a SIGINT to all of the processes in the foreground process group which means ngrok and atlantis server all get a SIGINT before the defer even needs to execute. The defer is just useful in case those processes ignore the SIGINT.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting. Nice job for catching this. I wouldn't jump to blaming the Golang std library for having a bug so fast, though. To me it makes more sense that we're simply doing something wrong...

I wanted to use context since it was built exactly for these scenarios (chaining calls and cancelling them properly when something happens). It's definitely possible to kill the process, however this feels to me like the equivalent of flipping on a light switch, then breaking the light bulb to turn off the light instead of flipping the switch back off again :-) Hope that makes sense.

Let me review this behavior to try and pinpoint the issue with using the context library (unless you think it's wrong to use it here, assuming we could get it to kill the process properly).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's the correct way to do it so if you can get it to work then 👍

In the end because all processes get the SIGINT anyway they are most likely already exited by the time the defers run so this is just a safe-guard in case they ignore it. Therefore it's not a huge problem either way.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also worth noting that in the end the call to cancel() just sends SIGKILL so it's the same thing.

The provided context is used to kill the process (by calling os.Process.Kill) if the context becomes done before the command completes on its own.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True, the end result is the same when cancelling or killing, I just felt that it's more correct Golang-wise to start a background task, then cancel it "properly" instead of killing it "from the outside" if that makes sense.

Copy link
Contributor Author

@johananl johananl May 26, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still investigating this, but just wanted to say in the meantime - cmd.Start() must be followed by cmd.Wait(), otherwise the command's resources aren't getting freed according to the docs. I actually used Run() and not Start() in the executeBackgroundCmd() function.

Inspection of the source code for Run() shows that it is actually Start() followed by Wait():

func (c *Cmd) Run() error {
	if err := c.Start(); err != nil {
		return err
	}
	return c.Wait()
}

Therefore, I am not sure the test you've performed using Start() is testing the real behavior we are dealing with. I am, however, indeed seeing that when running a child process which ignores SIGINT, the child is left behind when sending SIGINT to the main process. I'm getting the feeling this has something to do with the main process exiting and not letting defers complete. Updates will follow.

Copy link
Contributor Author

@johananl johananl May 26, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I think I figured out the problem. At the end of Start() there is the following:

if c.ctx != nil {
    c.waitDone = make(chan struct{})
    go func() {
        select {
        case <-c.ctx.Done():
            c.Process.Kill()
        case <-c.waitDone:
        }
    }()
}

The Start() function blocks until either the channel returned by c.ctx.Done() is closed (explicit cancellation of the context) or until c.waitDone is closed (command exited on its own).

When the main process returns, it runs the deferred cancel() which closes c.ctx.Done(), however the main process then immediately terminates before c.Process.Kill() had a chance to run. This happens because cancel() doesn't block until the kill function returns, thus creating a race condition between the main function and the kill function. If the main function wins, the kill function never runs and the child process is leaked.

The solution - actively wait for the cancellation to complete using a WaitGroup:

func executeBackgroundCmd(wg *sync.WaitGroup, cmd string, args ...string) (context.CancelFunc, <-chan error) {
	ctx, cancel := context.WithCancel(context.Background())
	command := exec.CommandContext(ctx, cmd, args...)

	errChan := make(chan error, 1)
	go func() {
		defer wg.Done()
		err := command.Run()
		errChan <- err
	}()

	return cancel, errChan
}

func main() {
	var wg sync.WaitGroup

	wg.Add(1)
	cancel, errChan := executeBackgroundCmd(&wg, "./proctest/proctest")
	defer func() {
		cancel()
		wg.Wait()
	}()

	stop := make(chan os.Signal, 1)
	signal.Notify(stop, syscall.SIGINT, syscall.SIGTERM)

	select {
	case <-stop:
		log.Println("Got stop signal")
	case err := <-errChan:
		log.Printf("Got error: %v", err)
	}
}

This seems to work fine.

if err != nil {
return errors.Wrapf(err, "creating ngrok tunnel")
cancelNgrok, ngrokErrors := executeBackgroundCmd("/tmp/ngrok", []string{"start", "atlantis", "--config", ngrokConfigFile.Name()})
// Check if we got a fast error. Move on if we haven't (the command is still running).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my testing, this happens too fast to catch any errors from the command (for example change start to ldjsjk and see that it never hits this select even though that errors out ngrok immediately.

Honestly what really needs to happen here to ensure ngrok is running is we should watch the logs for this line:

client session established

Then we can be sure ngrok is running properly and we can proceed. This would also fix #92.

But until that happens, if you follow what I say below about changing this command to return (exec.Cmd, error) then here you should just check the error and defer cmd.Process.Kill().

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Later down we verify that ngrok came up successfully:

	tunnelURL, err := getTunnelAddr()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. This was not my experience. Maybe it's a timing issue which changes from machine to machine. I'll review this.

go func() {
atlantisErrChan <- atlantisCmd.Wait()
}()
// Check if we got a fast error. Move on if we haven't (the command is still running).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this doesn't really work as expected (see above)

if err != nil {
colorstring.Printf("[red]=> opening pull request failed. please go to: %s on the browser", pullRequestURL)
colorstring.Printf("[red]=> opening pull request failed. please go to: %s on the browser\n", pullRequestURL)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🍺

colorstring.Println("\n[green]Thank you for using atlantis :) \n[white]For more information about how to use atlantis in production go to: https://github.com/runatlantis/atlantis")
return nil

// Keep checking for errors from ngrok or atlantis server. Exit normally on shutdown signal.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! I like how this will exit if ngrok or atlantis has issues now.

johananl added 4 commits May 25, 2018 00:01
- Wait for cancel functions to complete before main function returns.
- Make executeCmd and executeBackgroundCmd variadic.
@johananl
Copy link
Contributor Author

I've addressed all the open issues. We now have a (hopefully) proper handling of background task cancellation using context, as well as both "fast" error checking and continuous error checking at the end of the bootstrap function.

In addition, I've made both executeCmd and executeBackgroundCmd variadic so that you can pass zero or more arguments to the command without having to deal with []string{}. I find this cleaner.

Copy link
Member

@lkysow lkysow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent investigation! I should have read into the docs more, nice work. This is a lot more robust now.

@lkysow lkysow merged commit 373f5fa into runatlantis:master May 26, 2018
meringu pushed a commit to meringu/atlantis that referenced this pull request May 29, 2023
Ensure we're using the http protocol for ngrok
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants