privileged port issue on Raspberry Pi devices in CI #36847

Closed
Trott opened this issue Jan 8, 2021 · 5 comments
Labels
flaky-test Issues and PRs related to the tests with unstable failures on the CI.

Comments

Trott commented Jan 8, 2021

CI is red until the privileged port issue we're seeing on Raspberry Pi devices is sorted out.

The tests that are failing are `test-cluster-shared-handle-bind-privileged-port` and `test-cluster-bind-privileged-port`.
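For anyone trying to reproduce locally, a minimal sketch (this assumes the tests still live under `test/parallel/` and that you run them as a non-root user against a locally built `./node`):

```
# Run the two failing tests directly against the built binary.
# Both expect binding a port < 1024 as an unprivileged user to fail.
./node test/parallel/test-cluster-bind-privileged-port.js
./node test/parallel/test-cluster-shared-handle-bind-privileged-port.js
```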

@rvagg suspects a Docker update:

Persistent failure, even after restarts of the whole cluster. #36478 was merged into this test yesterday but the parent commit still has the failures.

What has changed is the Docker version. They all got an upgrade to 5:20.10.2~3-0~raspbian-buster and this is all running inside containers. It's going to be the newest version of Docker running in our CI and I wonder whether we're going to see similar failures when we upgrade other hosts or if this is going to be restricted to ARM.

Other than that, I'm not sure what this could be. It seems like a straightforward test that shouldn't fail, maybe Docker has introduced something new for unprivileged port binding inside containers?

I logged into one of the machines over SSH and ran `python -m SimpleHTTPServer 80` to see if this was a case of "oh well, ports less than 1024 don't require root anymore" like we saw a while back on Mojave, but nope. Couldn't bind to port 80 (or 42) as an unprivileged user.
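A quick way to check what the kernel itself allows, as a sketch (assumes a kernel new enough, 4.11+, to expose the `net.ipv4.ip_unprivileged_port_start` sysctl):

```
# 1024 means ports below 1024 still need root/CAP_NET_BIND_SERVICE;
# 0 would mean any user can bind any port.
cat /proc/sys/net/ipv4/ip_unprivileged_port_start

# The check described above, run as an unprivileged user on the host:
python -m SimpleHTTPServer 80   # expect "Permission denied" if low ports are still privileged
```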

Trott commented Jan 8, 2021

@nodejs/testing @nodejs/docker @nodejs/build

aduh95 added the flaky-test label on Jan 8, 2021

jasnell commented Jan 8, 2021

Given that this is blocking landing any code changes, I'd recommend marking this a flaky test on that platform for now and investigating separately.

jasnell added a commit that referenced this issue Jan 8, 2021
Per rvagg:
```
Persistent failure, even after restarts of the whole cluster. #36478 was
merged into this test yesterday but the parent commit still has the
failures.

What has changed is the Docker version. They all got an upgrade to
5:20.10.2~3-0~raspbian-buster and this is all running inside containers.
It's going to be the newest version of Docker running in our CI and I
wonder whether we're going to see similar failures when we upgrade other
hosts or if this is going to be restricted to ARM.

Other than that, I'm not sure what this could be. It seems like a
straightforward test that shouldn't fail, maybe Docker has introduced
something new for unprivileged port binding inside containers?
```

Signed-off-by: James M Snell <[email protected]>

PR-URL: #36850
Refs: #36847
Reviewed-By: Michael Dawson <[email protected]>
Reviewed-By: Mary Marchini <[email protected]>

rvagg commented Jan 9, 2021

@Trott when you log in to a Pi over SSH you're on the actual machine, not in a container. The way we run these machines is a bit complicated due to a combination of resource constraints, security and the need to run multiple OS versions for testing. We have multiple Docker containers running full-time on each of the Pi machines from startup; when we do a CI run, we set it up on the machine and then delegate into a container, which takes over and runs the rest of the script (build and test).

But we can duplicate that behaviour by copying how the CI runs it. As the `iojs` user, make a directory and create a file named `node-ci-exec` (it has to be exactly that name), then run `sudo docker-node-exec.sh -v stretch`. The `iojs` user has sudo access to only this command, and it runs that script inside the specified container (`stretch` in this case) as the container's `iojs` user.

So here's how that looks:

```
iojs@test-requireio--piccoloaiutante-debian10-arm64--pi3-1:~/tmp $ echo 'echo "yep, I am in"' > node-ci-exec
iojs@test-requireio--piccoloaiutante-debian10-arm64--pi3-1:~/tmp $ sudo docker-node-exec.sh -v stretch
+ docker exec node-ci-stretch /bin/sh -c 'cd /home/iojs/tmp && . /home/iojs/tmp/node-ci-exec'
yep, I am in
```

then

```
iojs@test-requireio--piccoloaiutante-debian10-arm64--pi3-1:~/tmp $ echo 'echo "starting"; python -m SimpleHTTPServer 80; echo "ended $?"' > node-ci-exec
iojs@test-requireio--piccoloaiutante-debian10-arm64--pi3-1:~/tmp $ sudo docker-node-exec.sh -v stretch
+ docker exec node-ci-stretch /bin/sh -c 'cd /home/iojs/tmp && . /home/iojs/tmp/node-ci-exec'
starting
...
```

In a separate session:

```
pi@test-requireio--piccoloaiutante-debian10-arm64--pi3-1:~ $ ps auxww | grep python
iojs     23916  1.2  0.9  11932  9064 ?        S    00:02   0:00 python -m SimpleHTTPServer 80
pi       24155  0.0  0.0   7348   500 pts/1    S+   00:03   0:00 grep --color=auto python
```

i.e. it's running and hasn't exited. I can Ctrl-C the original session and it stays running; I have to kill that process to get it to stop.

I think this validates my original guess about the cause. But it also raises questions: is this new Docker behaviour, or is it limited to the ARM or Raspbian version(s)? If it's new behaviour, what are we going to do about it when the new version makes it to our other Docker hosts? I was going to reprovision a couple of the main Docker hosts that run our containerised tests the other day, because they're getting a bit long in the tooth without a full image upgrade, but I suspect if I did that we'd encounter the same thing there. Perhaps this is new Docker behaviour, but there may also be a way to disable it. Needs some research.
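One way to confirm that guess is to compare the sysctl inside a container with the host's value. A rough sketch, assuming the `node-ci-stretch` container from the transcript above is still running and you have direct Docker access:

```
# On the Pi host: prints 1024 if low ports are still privileged there.
cat /proc/sys/net/ipv4/ip_unprivileged_port_start

# Inside the container: 0 would mean the new Docker default sysctl is
# lifting the restriction for every user in the container.
docker exec node-ci-stretch cat /proc/sys/net/ipv4/ip_unprivileged_port_start
```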

rvagg commented Jan 9, 2021

Guessing this is it: https://docs.docker.com/engine/release-notes/#20100, under "Security" for the 20.10.0 @ 2020-12-08 release:

Add default sysctls to allow ping sockets and privileged ports with no capabilities moby/moby#41030

From that moby change (emphasis mine):

Currently default capability CAP_NET_RAW allows users to open ICMP echo
sockets, and CAP_NET_BIND_SERVICE allows binding to ports under 1024.
Both of these are safe operations, and Linux now provides ways that
these can be set, per container, to be allowed without any capabilities
for non-root users. Enable these by default. **Users can revert to the
previous behaviour by overriding the sysctl values explicitly.**

I think if we start the containers with `--sysctl net.ipv4.ip_unprivileged_port_start=1024` we get the original behaviour back, at least as far as port binding is concerned.
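For a standalone container started by hand (outside our init scripts), the override would look something like this; the container name and image here are placeholders, not our actual setup:

```
# Restore the pre-20.10 behaviour for this container only:
# ports < 1024 once again require root/CAP_NET_BIND_SERVICE.
docker run -d --name port-test \
  --sysctl net.ipv4.ip_unprivileged_port_start=1024 \
  debian:stretch sleep infinity
```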

rvagg commented Jan 9, 2021

Yep, that does it. I tried it on a machine that's currently offline, adding that flag to the stretch container but leaving the jessie one alone:

```
iojs@test-requireio--pivotalagency-debian10-arm64--pi3-1:~/tmp $ echo 'echo "starting"; python -m SimpleHTTPServer 80; echo "ended $?"' > node-ci-exec
iojs@test-requireio--pivotalagency-debian10-arm64--pi3-1:~/tmp $ sudo docker-node-exec.sh -v stretch
+ docker exec node-ci-stretch /bin/sh -c 'cd /home/iojs/tmp && . /home/iojs/tmp/node-ci-exec'
starting
Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/usr/lib/python2.7/SimpleHTTPServer.py", line 235, in <module>
    test()
  File "/usr/lib/python2.7/SimpleHTTPServer.py", line 231, in test
    BaseHTTPServer.test(HandlerClass, ServerClass)
  File "/usr/lib/python2.7/BaseHTTPServer.py", line 606, in test
    httpd = ServerClass(server_address, HandlerClass)
  File "/usr/lib/python2.7/SocketServer.py", line 417, in __init__
    self.server_bind()
  File "/usr/lib/python2.7/BaseHTTPServer.py", line 108, in server_bind
    SocketServer.TCPServer.server_bind(self)
  File "/usr/lib/python2.7/SocketServer.py", line 431, in server_bind
    self.socket.bind(self.server_address)
  File "/usr/lib/python2.7/socket.py", line 228, in meth
    return getattr(self._sock,name)(*args)
socket.error: [Errno 13] Permission denied
ended 1
iojs@test-requireio--pivotalagency-debian10-arm64--pi3-1:~/tmp $ sudo docker-node-exec.sh -v jessie
+ docker exec node-ci-jessie /bin/sh -c 'cd /home/iojs/tmp && . /home/iojs/tmp/node-ci-exec'
starting
```

Will move to a nodejs/build PR to propose this as an option. I'm guessing there probably won't be any objection to it for now, although it does suggest some subtle changes in how we're supposed to view "privileged ports" on Linux going forward.

rvagg added a commit to nodejs/build that referenced this issue Jan 9, 2021
Since Docker 20.10.0 @ 2020-12-08, binding to low ports inside containers has
been made unrestricted. This change undoes that by ensuring that ports <1024 are
privileged. Node.js' test suite assumes that binding to a lower port will result
in a privilege failure, so we need to create an environment suitable for that
assumption.

Ref: nodejs/node#36847
rvagg added a commit to rvagg/io.js that referenced this issue Jan 12, 2021
This reverts commit a45a404.

Solved by marking ports <1024 as privileged on Docker containers.

Ref: nodejs#36850
Ref: nodejs#36847
Ref: nodejs/build#2521
rvagg added a commit to nodejs/build that referenced this issue Jan 12, 2021
Since Docker 20.10.0 @ 2020-12-08, binding to low ports inside containers has
been made unrestricted. This change undoes that by ensuring that ports <1024 are
privileged. Node.js' test suite assumes that binding to a lower port will result
in a privilege failure, so we need to create an environment suitable for that
assumption.

Ref: nodejs/node#36847
danielleadams pushed a commit that referenced this issue Jan 12, 2021
Per rvagg:
```
Persistent failure, even after restarts of the whole cluster. #36478 was
merged into this test yesterday but the parent commit still has the
failures.

What has changed is the Docker version. They all got an upgrade to
5:20.10.2~3-0~raspbian-buster and this is all running inside containers.
It's going to be the newest version of Docker running in our CI and I
wonder whether we're going to see similar failures when we upgrade other
hosts or if this is going to be restricted to ARM.

Other than that, I'm not sure what this could be. It seems like a
straightforward test that shouldn't fail, maybe Docker has introduced
something new for unprivileged port binding inside containers?
```

Signed-off-by: James M Snell <[email protected]>

PR-URL: #36850
Refs: #36847
Reviewed-By: Michael Dawson <[email protected]>
Reviewed-By: Mary Marchini <[email protected]>
aduh95 pushed a commit that referenced this issue Jan 18, 2021
This reverts commit a45a404.

Solved by marking ports <1024 as privileged on Docker containers.

Ref: #36850
Ref: #36847
Ref: nodejs/build#2521

PR-URL: #36884
Refs: #36850
Refs: #36847
Refs: nodejs/build#2521
Reviewed-By: Colin Ihrig <[email protected]>
Reviewed-By: Gireesh Punathil <[email protected]>
Reviewed-By: Richard Lau <[email protected]>
Reviewed-By: Daijiro Wachi <[email protected]>
Reviewed-By: Ash Cripps <[email protected]>
Reviewed-By: Luigi Pinca <[email protected]>
Reviewed-By: James M Snell <[email protected]>
Reviewed-By: Rich Trott <[email protected]>
Reviewed-By: Michael Dawson <[email protected]>
ruyadorno pushed a commit that referenced this issue Jan 22, 2021
This reverts commit a45a404.

Solved by marking ports <1024 as privileged on Docker containers.

Ref: #36850
Ref: #36847
Ref: nodejs/build#2521

PR-URL: #36884
Refs: #36850
Refs: #36847
Refs: nodejs/build#2521
Reviewed-By: Colin Ihrig <[email protected]>
Reviewed-By: Gireesh Punathil <[email protected]>
Reviewed-By: Richard Lau <[email protected]>
Reviewed-By: Daijiro Wachi <[email protected]>
Reviewed-By: Ash Cripps <[email protected]>
Reviewed-By: Luigi Pinca <[email protected]>
Reviewed-By: James M Snell <[email protected]>
Reviewed-By: Rich Trott <[email protected]>
Reviewed-By: Michael Dawson <[email protected]>