Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

seeing error "waitid: no child processes" from run-to-completion processes while reaping #11

Closed
ahmetb opened this issue Aug 14, 2019 · 17 comments · Fixed by #12
Closed
Assignees

Comments

@ahmetb
Copy link

ahmetb commented Aug 14, 2019

Similar to #2 but a little different.

After doing go reaper.Reap() in my program, I'm doing simple run-to-completion executions, such as:

cmd := exec.Command("some-cmd")
b, err := cmd.CombinedOutput()

this is returning error

waitid: no child processes

When I do not do go reaper.Reap(), things seem to work fine.

For example, more weirdly, this happens only while running:

{/bin/sh [-c npm install --only=production]}

but not while running:

{/bin/sh [-c go build .]}

I have a suspicion that these two commands are different somehow.

How to debug this?

@ahmetb ahmetb changed the title seeing error "waitid: no child processes" from other processes while reaping seeing error "waitid: no child processes" from run-to-completion processes while reaping Aug 14, 2019
@ahmetb
Copy link
Author

ahmetb commented Aug 16, 2019

Yep I'm able to reproduce this fairly easily, most subprocesses I run and wait() while go-reaper was running give me wait: no child processes or waitid: no child processes after the process completes.

@ramr
Copy link
Owner

ramr commented Aug 27, 2019

@ahmetb, hmm ok that does seems a bit weird. One way that waitid is likely going to fail is if the process that you forked off has already exited and your code has consumed the output before the
reaper.Reap code runs. I guess it is possible depending on how the go routines get scheduled.

But I would have thought that those 2 commands you mentioned (go build and npm install)
should have hung around a wee bit - rather than complete instantly but ...

So on the debugging end, I'd check the reaper.reap() flow and see if waiting there till the initialization is complete would help or not.

A simple method would be to do like some old style debugging ala with printf's at:

  1. when initialization gets done: https://github.com/ramr/go-reaper/blob/master/reaper.go#L50
  2. before the reaper reaps at: https://github.com/ramr/go-reaper/blob/master/reaper.go#L61
  3. In your code ala:
cmd := exec.Command("some-cmd")
b, err := cmd.CombinedOutput()
fmt.Printf("script done\n");

And see what the flow is like.

Alternatively, does using

config := reaper.Config{Pid: 0, Options: 0}
go reaper.Start(config)

instead of reaper.Reap() exhibit the same issue?

Thanks.

@ahmetb
Copy link
Author

ahmetb commented Aug 27, 2019

I opted in for using a proper pid 1 init process in my application so I'm no longer using this. The forked processes were definitely running for a while.

@jadolg
Copy link

jadolg commented Jun 1, 2020

@ahmetb could you be more specific? I'm having this issue also.

@ramr
Copy link
Owner

ramr commented Jun 8, 2020

@jadolg do you have a standalone program I can reproduce this with? Thanks.

@ramr ramr self-assigned this Jun 8, 2020
@jadolg
Copy link

jadolg commented Jun 8, 2020

@ramr It's a bit tricky because of the nature of the software I'm making.
This is the issue I'm trying to solve lizardfs/lizardfs-docker-volume-plugin#7
And this is what I'm doing: https://github.com/lizardfs/lizardfs-docker-volume-plugin/blob/FixDefucts/main.go
The application that was started here https://github.com/lizardfs/lizardfs-docker-volume-plugin/blob/6622c905d699a1fe992e5132cd894bc20a0f25e4/main.go#L160 was left dangling after the mountpoint is no longer available and go-reaper is effectively taking care of this. The problem comes then with another application I also need to execute from my application here https://github.com/lizardfs/lizardfs-docker-volume-plugin/blob/6622c905d699a1fe992e5132cd894bc20a0f25e4/main.go#L63 and it looks like it's being killed before it finishes.

@ramr
Copy link
Owner

ramr commented Jun 10, 2020

@jadolg do you try with this fix/workaround added a while back?
Ref: #3
Aka pass the config to reaper.Start ala:

config := reaper.Config{Pid: 0, Options: 0}
go reaper.Start(config)

@ramr
Copy link
Owner

ramr commented Jun 10, 2020

@jadolg Also there's a test in the code which looks like is similar to the issue you are facing.
https://github.com/ramr/go-reaper/blob/master/test/testpid1.go#L18

So try this:

  1. Instead of using the default reaper.Reap(), try passing in the config to the reaper ala:
config := reaper.Config{Pid: 0, Options: 0}
go reaper.Start(config)

and

  1. Set the sysprocattr for both those processes you start via exec.CommandContext to something like:
    cmd.SysProcAttr = &syscall.SysProcAttr{ Setpgid: true, Pgid: 0, });

Or am I misreading your comment in that it is the second call with lizardfs setgoal that's hanging?

If so, you could probably change the code for that slightly ala instead of:
output, err := exec.CommandContext(ctx, "lizardfs", "setgoal", "-r", replicationGoal, volumePath).CombinedOutput()

use something like:

cmd := exec.CommandContext(ctx, "lizardfs", "setgoal", "-r", replicationGoal, volumePath)
cmd.SysProcAttr = &syscall.SysProcAttr{ Setpgid: true, Pgid:    0,}
output, err := cmd.CombinedOutput()
		if err != nil {
...

Does that help?

@jadolg
Copy link

jadolg commented Jun 10, 2020

Hi @ramr
Whenever I do this, the lizardfs application starts behaving ok again as it's no longer killed but go-reaper stops killing the lfsmount process which is left defunct forever.
I end up with something like this

$ ps -ef | grep 2994834
root     2994834 2994804  0 14:22 ?        00:00:00 lizardfs-volume-plugin
root     2994867 2994834  0 14:22 ?        00:00:00 lfsmount /mnt/lizardfs/ -H 192.168.178.26 -P 9421 -S /ppp
root     3001729 2994834  0 14:24 ?        00:00:00 [lfsmount] <defunct>

which is exactly the problem I'm trying to solve as every container that exits will leave behind a defunct lfsmount process eventually filling the process table. When no extra options are set this part works like a charm but then the lizardfs process gets killed before completion.

@ramr
Copy link
Owner

ramr commented Jun 10, 2020

@jadolg what about if you remove setting the SysProcAttr on the lfsmount exec command? That's the one you want the reaper to reap correct?
ala:

cmd := exec.CommandContext(ctx, "lfsmount", params...)
// just comment the call below.
// cmd.SysProcAttr = &syscall.SysProcAttr{Setpgid: true, Pgid: 1}

@jadolg
Copy link

jadolg commented Jun 10, 2020

Whenever I use

config := reaper.Config{Pid: 0, Options: 0}
go reaper.Start(config)

lfsmount does not get killed. It does not matter If I set pgid or not

@ramr
Copy link
Owner

ramr commented Jun 11, 2020

@jadolg is there a way to run this inside a docker image? Basically some image that I can pull down and run to debug it. I tried to build the plugin from the README but I seem to be hitting some docker errors (enabling the plugin - Error response from daemon: error creating mount for propagated mount: no such file or directory). It might be simpler to just pull down a docker image and run it with a different/debug entry-point. Thx

@ramr
Copy link
Owner

ramr commented Jun 11, 2020

@jadolg can you also send/attach the output from ps which shows the process group id of the hanging process? Maybe the forked lfsmount process is using a different process group id.
Which could explain why the default config Pid: -1 is able to reap it. But the config with Pid: 0 is not waiting for that child process as its process group ID doesn't match the caller aka the process your code runs in.
Basically output of something like: $ ps -axo pid,ppid,pgid,sid,comm,args
Thx

@jadolg
Copy link

jadolg commented Jun 11, 2020

It's a bit hard to debug cause the Docker plugins because it even uses a file for socket.
I'm not sure why it does not work the building process for you but the readme might be a bit outdated.
Here I'm starting go-reaper like this:

config := reaper.Config{Pid: 0, Options: 0}
go reaper.Start(config)

and executing lfsmount without setting any SysProcAttr parameter.

 ps -axo pid,ppid,pgid,sid,comm,args | grep lfsmount
3076393 3076362 3076393 3076393 lfsmount        lfsmount /mnt/lizardfs/ -H 192.168.178.26 -P 9421 -S /ppp

which is only the process that mounts the root of my filesystem.
And this is the application running on the docker plugin

$ ps -axo pid,ppid,pgid,sid,comm,args | grep lizardfs-volume-plugin
3076362 3076339 3076362 3076362 lizardfs-volume lizardfs-volume-plugin

now, when I start a container lfsmount will mount a volume on that container and this will be the process list

$ ps -axo pid,ppid,pgid,sid,comm,args | grep lfsmount
3076393 3076362 3076393 3076393 lfsmount        lfsmount /mnt/lizardfs/ -H 192.168.178.26 -P 9421 -S /ppp
3088940 3076362 3088940 3088940 lfsmount        lfsmount /mnt/docker-volumes/test1 -H 192.168.178.26 -P 9421 -S /ppp/test1

and this happens when killing the container

$ ps -axo pid,ppid,pgid,sid,comm,args | grep lfsmount
3076393 3076362 3076393 3076393 lfsmount        lfsmount /mnt/lizardfs/ -H 192.168.178.26 -P 9421 -S /ppp
3088940 3076362 3088940 3088940 lfsmo <defunct> [lfsmount] <defunct>

And here I'm setting SysProcAttr with the pgid option
This is the parent application (needed to restart the plugin to apply the changes):

$ ps -axo pid,ppid,pgid,sid,comm,args | grep lizardfs-volume-plugin
3100820 3100794 3100820 3100820 lizardfs-volume lizardfs-volume-plugin

This is the output when lfsmount is running

ps -axo pid,ppid,pgid,sid,comm,args | grep lfsmount
3100851 3100820 3100851 3100851 lfsmount        lfsmount /mnt/lizardfs/ -H 192.168.178.26 -P 9421 -S /ppp
3104156 3100820 3104156 3104156 lfsmount        lfsmount /mnt/docker-volumes/test1 -H 192.168.178.26 -P 9421 -S /ppp/test1

And this when it's finished

$ ps -axo pid,ppid,pgid,sid,comm,args | grep lfsmount
3100851 3100820 3100851 3100851 lfsmount        lfsmount /mnt/lizardfs/ -H 192.168.178.26 -P 9421 -S /ppp
3104156 3100820 3104156 3104156 lfsmo <defunct> [lfsmount] <defunct>

I hope this can provide enough context for you. My guess is that lfsmount is either ignoring or changing the parameter and that's why it does not get caught. Is there a way to tackle that from go-reaper?

@ramr
Copy link
Owner

ramr commented Jun 12, 2020

@jadolg , ok looks like that is setting the process group id/creating a new session.

So one solution here is to have the pid 1 do the reaping and have your code run inside a
forked process ala:

import "os"
import "syscall"

func main() {
        // Note 1: Am using a environment variable REAPER to indicate that its a worker/parent.
	if _, hasReaper := os.LookupEnv("REAPER"); !hasReaper {
		go reaper.Reap()

		// Note 2: This is really if you want to distinguish the worker/parent ala in ps.
                //   Am not sure how and if it could affect your plugin's as its one extra
                //  argument added at the end. If it does, you can just use os.Args here instead.
                args := append(os.Args, "#worker")

		pwd, err := os.Getwd()
		if err != nil {
                        // Note 3: Better if you can handle it with a default directory ala "/tmp". 
			panic(err)
		}

		workerEnv := []string{fmt.Sprintf("REAPER=%d", os.Getpid())}

		var wstatus syscall.WaitStatus
		pattrs := &syscall.ProcAttr{
			Dir:   pwd,
			Env:   append(os.Environ(), workerEnv...),
			Sys:   &syscall.SysProcAttr{Setsid: true},
			Files: []uintptr{0, 1, 2},
		}
		workerPid, _ := syscall.ForkExec(args[0], args, pattrs)

		// fmt.Printf("worker-pid = %d\n", workerPid)
		syscall.Wait4(workerPid, &wstatus, 0, nil)
		return
	}

        // your code goes here
        your_plugin_code_main()
}

And there's 3 different notes for you to adjust your code accordingly.

Aside: hmm, maybe something can do from an config option perspective. Let me mull on that a bit as we do require an entry point to invoke as well.

@jadolg
Copy link

jadolg commented Jun 12, 2020

@ramr I needed to modify a bit the code calling lizardfs because for some reason it was still randomly failing with the same message but still doing the job 🤷 but now it's all working :-)
Thank you so much for your help.
I think it would be great if we could configure reaper not to reap a specific process or processes list (giving either gpid, pid or name). What do you think about it? Does it make sense?

@ramr
Copy link
Owner

ramr commented Jun 12, 2020

@jadolg Cool. Glad that worked for you.
As re: your questions ... so the reaper config takes a Pid argument which is just passed down to wait4 (value is similar to the pid argument to waitpid). That value can indicate whether to wait for all processes or anything matching the current process group or a specific process group or then a specific pid. Of course for multiple pids, you would need to invoke the reaper with different configs (lil' kludgy but it works).

But that said, I think from a reaper perspective, making this fix somewhat generic might be a better option. Let me mull on that. For now, I will probably just document the above mentioned usage in the README. Thx

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants