Skip to content
This repository has been archived by the owner on May 12, 2021. It is now read-only.

agent: sandbox_pause should not take arguments #585

Merged
merged 1 commit into from
Aug 1, 2019

Conversation

bergwolf
Copy link
Member

@bergwolf bergwolf commented Jun 18, 2019

gcc attribute constructors cannot take arguments and cannot
access main function arguments on musl. Such functionality is glibc
specific and only works on x86 and x86_64. As a result, the pause
binary always quits on musl causing sandbox share pidns to malfunction.

@bergwolf bergwolf added the do-not-merge PR has problems or depends on another label Jun 18, 2019
@devimc
Copy link

devimc commented Jun 19, 2019

I faced a similar issue with other cgo implementations that I did in the past, the solution was to read the arguments from /proc/self/cmdline, @bergwolf what do you think?

@bergwolf
Copy link
Member Author

@devimc Good point! I've updated the PR to follow your suggestion. Thanks!

@bergwolf bergwolf removed the do-not-merge PR has problems or depends on another label Jun 20, 2019
@bergwolf
Copy link
Member Author

Dropping dnm as the new fix does not introduce additional memory cost.

@bergwolf
Copy link
Member Author

/test

@bergwolf bergwolf changed the title [RFC] agent: move pause bin to golang agent: move pause bin to golang Jun 20, 2019
@codecov
Copy link

codecov bot commented Jun 20, 2019

Codecov Report

Merging #585 into master will increase coverage by 0.16%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master     #585      +/-   ##
==========================================
+ Coverage   50.45%   50.62%   +0.16%     
==========================================
  Files          17       17              
  Lines        2392     2392              
==========================================
+ Hits         1207     1211       +4     
+ Misses       1028     1026       -2     
+ Partials      157      155       -2

Copy link
Member

@liwei liwei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool! the commit lgtm with a minor comment inline.

pause.go Outdated Show resolved Hide resolved
pause.go Outdated Show resolved Hide resolved
pause.go Outdated Show resolved Hide resolved
pause.go Outdated Show resolved Hide resolved
pause.go Outdated Show resolved Hide resolved
@bergwolf
Copy link
Member Author

PR updated. PTAL @liwei @devimc

@bergwolf
Copy link
Member Author

/test

pause.go Outdated Show resolved Hide resolved
fclose(f);
free(p);

if (do_pause == 0)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please use braces

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not really necessary right? C code doesn't need to follow go coding style here.

@lifupan
Copy link
Member

lifupan commented Jun 21, 2019

Since this pause process will be an init process in its pid namespace, does it need to do reap the exited orphan child processes?

@bergwolf bergwolf force-pushed the pause branch 2 times, most recently from c7a37f6 to 6b7b6d8 Compare June 21, 2019 02:07
@bergwolf
Copy link
Member Author

@lifupan That's handled by the agent reaper instead.

@bergwolf
Copy link
Member Author

/test

@liwei
Copy link
Member

liwei commented Jun 21, 2019

I think the commit subject needs an update 😉

@bergwolf
Copy link
Member Author

Ouch! Updated. Thanks @liwei !

@bergwolf bergwolf changed the title agent: move pause bin to golang agent: sandbox_pause should not take arguments Jun 21, 2019
@bergwolf
Copy link
Member Author

/test

@lifupan
Copy link
Member

lifupan commented Jun 21, 2019

@lifupan That's handled by the agent reaper instead.

Hi @bergwolf, actually the kata-agent child subreaper cannot
reap a child from different pid namespace. Read from the latest kernel source codes https://elixir.bootlin.com/linux/latest/source/kernel/exit.c#L603, the logic to find a process's reaper is below:

  1. give them to another thread in it's thread group, if such a member exists.
  2. give it to the first ancestor process which prctl'd itself as a
    child_subreaper for its children in the same pid namespace.
  3. give it to the init process (PID 1) in our pid namespace
    That's why kernel reparent the child process to container's (PID 1) process instead of the kata-agent process.

actually after kernel applying this patch, https://lkml.org/lkml/2017/1/30/630 , the kata-agent cannot reap the orphan child processes in sub-namespace.

@bergwolf
Copy link
Member Author

@lifupan Then it's a bigger problem. Let me take a closer look.

@liwei
Copy link
Member

liwei commented Jun 24, 2019

@lifupan @bergwolf Maybe we can fix the process reap issue in a separate PR and let this PR keeps going?

@lifupan
Copy link
Member

lifupan commented Jun 24, 2019

@lifupan @bergwolf Maybe we can fix the process reap issue in a separate PR and let this PR keeps going?

Agree, we can open a separated issue to track it.

@liwei
Copy link
Member

liwei commented Jun 24, 2019

/test

@jodh-intel
Copy link
Contributor

Odd. One of the Travis builds failed with a timeout... building a golang program!

$ go version
go version go1.11.11 linux/ppc64le

    :

go build -buildmode=pie -tags "" -o kata-agent \
	-ldflags "-X main.version=1.8.0-alpha2-f1e4e9d5eda55469d34d1b59f743a059bab8f01b -X main.seccompSupport=no "

No output has been received in the last 10m0s, this potentially indicates a stalled build or something wrong with the build itself.
Check the details on how to adjust your build configuration on: https://docs.travis-ci.com/user/common-build-problems/#Build-times-out-because-no-output-was-received

The build has been terminated

Any thoughts @nitkon?

@nitkon
Copy link
Contributor

nitkon commented Jun 24, 2019

@jodh-intel: That was odd. I have restarted the Travis-CI and now it's green. travis_wait can be used to exceed the timeout limit but need to root-cause why it happens else it just would fail after the exceeded timeout.

@jodh-intel
Copy link
Contributor

Restarted initrd CI which had a transitory failure.

@egernst
Copy link
Member

egernst commented Jul 1, 2019

flake test for initrd?

Summarizing 1 Failure:

[Fail] state container [It] with workload [true], timeWait 5 
/tmp/jenkins/workspace/kata-containers-agent-ubuntu-18-04-PR-initrd/go/src/github.com/kata-containers/tests/functional/state_test.go:45

Ran 3 of 8 Specs in 31.466 seconds
FAIL! -- 2 Passed | 1 Failed | 0 Pending | 5 Skipped --- FAIL: TestFunctional (31.47s)
FAIL

@egernst
Copy link
Member

egernst commented Jul 1, 2019

@bergwolf - this should be backport candidate as well, correct? I think it should be -- please add backport label if you agree.

The commit message seems to indicate the fix is not using args, whereas you are just grabbing the args now via proc sysfs, right? Can you update commit message accordingly?

gcc __attribute__ constructors cannot take arguments and cannot
access main function arguments on musl. Such functionality is glibc
specific and only works on x86 and x86_64. As a result, the pause
binary always quits on musl causing sandbox share pidns to malfunction.

Fix it by grabbing arguments from /proc/self/cmdline instead.

Fixes: kata-containers#584
Signed-off-by: Peng Tao <[email protected]>
@bergwolf
Copy link
Member Author

bergwolf commented Jul 2, 2019

/test

@bergwolf
Copy link
Member Author

bergwolf commented Jul 2, 2019

@egernst Yes, I agree it should go to stable. tag added. Thanks.

@egernst
Copy link
Member

egernst commented Aug 1, 2019

rebuilding ubuntu-18-04-initrd

@devimc
Copy link

devimc commented Aug 1, 2019

functional tests are failing again, nothing related to this pr, I think

• Failure [21.258 seconds]
state
/tmp/jenkins/workspace/kata-containers-agent-ubuntu-18-04-PR-initrd/go/src/github.com/kata-containers/tests/functional/state_test.go:26
  container
  /tmp/jenkins/workspace/kata-containers-agent-ubuntu-18-04-PR-initrd/go/src/github.com/kata-containers/tests/vendor/github.com/onsi/ginkgo/extensions/table/table.go:92
    with workload [true], timeWait 5 [It]
    /tmp/jenkins/workspace/kata-containers-agent-ubuntu-18-04-PR-initrd/go/src/github.com/kata-containers/tests/vendor/github.com/onsi/ginkgo/extensions/table/table_entry.go:46

    Expected
        <int>: 1
    to equal
        <int>: 0

@devimc devimc merged commit d6d8514 into kata-containers:master Aug 1, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants