Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Self hosted runner in Linux container exits job prematurely #921

Open
kabamawutschnik opened this issue Jan 19, 2021 · 11 comments
Open

Self hosted runner in Linux container exits job prematurely #921

kabamawutschnik opened this issue Jan 19, 2021 · 11 comments
Labels
bug Something isn't working needs-investigation Runner Bug Bug fix scope to the runner

Comments

@kabamawutschnik
Copy link

Describe the bug
We are attempting to test our nodejs application on a self hosted runner in a linux container. Job is exiting (github runner is cancelling the job) prematurely, during a memory intensive portion. The step that we are exiting consistently in, is the setup bash script to install node, node modules, and mongodb.

To Reproduce
Steps to reproduce the behavior:

  1. Setup up container using example Dockerfile
  2. Start workflow using example workflow (memory intensive work)
  3. Workflow Exits

Expected behavior
To complete without exiting early.

Runner Version and Platform

2.275.1 - Ubuntu Container

What's not working?

Processes are killed, because oom_score_adj is not able to be edited by github runner.

Job Log Output

Setup Step
Error: The operation was canceled. [debug]System.OperationCanceledException: The operation was canceled.

Runner and Worker's Diagnostic Logs

Worker Logs

Exiting Exception
Caught cancellation exception from step: System.OperationCanceledException: The operation was canceled. at System.Threading.CancellationToken.ThrowOperationCanceledException() at GitHub.Runner.Sdk.ProcessInvoker.ExecuteAsync(String workingDirectory, String fileName, String arguments, IDictionary2 environment, Boolean requireExitCodeZero, Encoding outputEncoding, Boolean killProcessOnCancel, Channel1 redirectStandardIn, Boolean inheritConsoleHandler, Boolean keepStandardInOpen, Boolean highPriorityProcess, CancellationToken cancellationToken) at GitHub.Runner.Common.ProcessInvokerWrapper.ExecuteAsync(String workingDirectory, String fileName, String arguments, IDictionary2 environment, Boolean requireExitCodeZero, Encoding outputEncoding, Boolean killProcessOnCancel, Channel1 redirectStandardIn, Boolean inheritConsoleHandler, Boolean keepStandardInOpen, Boolean highPriorityProcess, CancellationToken cancellationToken) at GitHub.Runner.Worker.Handlers.DefaultStepHost.ExecuteAsync(String workingDirectory, String fileName, String arguments, IDictionary2 environment, Boolean requireExitCodeZero, Encoding outputEncoding, Boolean killProcessOnCancel, Boolean inheritConsoleHandler, CancellationToken cancellationToken)
at GitHub.Runner.Worker.Handlers.ScriptHandler.RunAsync(ActionRunStage stage)
at GitHub.Runner.Worker.ActionRunner.RunAsync()
at GitHub.Runner.Worker.StepsRunner.RunStepAsync(IStep step, CancellationToken jobCancellationToken)`

A bunch of these errors, but this is the last one (during the bash setup.sh script)
Which: 'bash' [2021-01-19 00:21:54Z INFO ScriptHandler] Location: '/usr/bin/bash' [2021-01-19 00:21:54Z INFO ProcessInvokerWrapper] Starting process: [2021-01-19 00:21:54Z INFO ProcessInvokerWrapper] File name: '/usr/bin/bash' [2021-01-19 00:21:54Z INFO ProcessInvokerWrapper] Arguments: '-e /app/project/_work/_temp/4a28742a-a48e-4b22-9af0-403af0e37076.sh' [2021-01-19 00:21:54Z INFO ProcessInvokerWrapper] Working directory: '/app/project/_work/project/project' [2021-01-19 00:21:54Z INFO ProcessInvokerWrapper] Require exit code zero: 'False' [2021-01-19 00:21:54Z INFO ProcessInvokerWrapper] Encoding web name: ; code page: '' [2021-01-19 00:21:54Z INFO ProcessInvokerWrapper] Force kill process on cancellation: 'False' [2021-01-19 00:21:54Z INFO ProcessInvokerWrapper] Redirected STDIN: 'False' [2021-01-19 00:21:54Z INFO ProcessInvokerWrapper] Persist current code page: 'False' [2021-01-19 00:21:54Z INFO ProcessInvokerWrapper] Keep redirected STDIN open: 'False' [2021-01-19 00:21:54Z INFO ProcessInvokerWrapper] High priority process: 'False' [2021-01-19 00:21:54Z INFO ProcessInvokerWrapper] Failed to update oom_score_adj for PID: 2506. [2021-01-19 00:21:54Z INFO ProcessInvokerWrapper] System.UnauthorizedAccessException: Access to the path '/proc/2506/oom_score_adj' is denied. ---> System.IO.IOException: Permission denied --- End of inner exception stack trace --- at System.IO.FileStream.WriteNative(ReadOnlySpan1 source)
at System.IO.FileStream.FlushWriteBuffer()
at System.IO.FileStream.Dispose(Boolean disposing)
at System.IO.Stream.Close()
at System.IO.StreamWriter.CloseStreamFromDispose(Boolean disposing)
at System.IO.StreamWriter.Dispose(Boolean disposing)
at System.IO.TextWriter.Dispose()
at System.IO.File.WriteAllText(String path, String contents)
at GitHub.Runner.Sdk.ProcessInvoker.WriteProcessOomScoreAdj(Int32 processId, Int32 oomScoreAdj)`

Example Container Dockerfile
`FROM ubuntu:latest

ARG DEBIAN_FRONTEND=noninteractive

ENV GITHUB_RUNNER_VERSION="2.275.1"
ENV RUNNER_NAME "runner"
ENV GITHUB_OWNER "owner"
ENV RUNNER_WORKDIR "_work"
ENV GITHUB_REPOSITORY ""
ENV GITHUB_PAT ""

RUN apt-get update -y
&& apt-get upgrade -y
&& apt-get install -y
curl
sudo
git
jq
systemctl
tzdata
mysql-client
python3-pip
&& apt-get clean
&& rm -rf /var/lib/apt/lists/*
&& useradd -m github
&& usermod -aG sudo github
&& echo "%sudo ALL=(ALL) NOPASSWD:ALL" >> /etc/sudoers
&& touch /etc/sudoers.d/github
&& echo "github ALL = (ALL) NOPASSWD: ALL" >> /etc/sudoers.d/github

USER github

WORKDIR /app
COPY --chown=github:github entrypoint.sh ./entrypoint.sh
RUN sudo chown -R github /app
RUN sudo chmod u+x ./entrypoint.sh

ENTRYPOINT [ "/app/entrypoint.sh" ]`

Example entrypoint.sh
`#!/bin/sh

mkdir -p /app/${GITHUB_REPOSITORY}
cd /app/${GITHUB_REPOSITORY}
curl -Ls https://github.com/actions/runner/releases/download/v${GITHUB_RUNNER_VERSION}/actions-runner-linux-x64-${GITHUB_RUNNER_VERSION}.tar.gz | tar xz
&& sudo ./bin/installdependencies.sh

token_url="https://api.github.com/repos/${GITHUB_OWNER}/${GITHUB_REPOSITORY}/actions/runners/registration-token"
registration_url="https://github.com/${GITHUB_OWNER}/${GITHUB_REPOSITORY}"
echo "Requesting token at '${token_url}'"

payload=$(curl -sX POST -H "Authorization: token ${GITHUB_PAT}" ${token_url})
export RUNNER_TOKEN=$(echo "$payload" | jq .token --raw-output)

./config.sh
--name "${GITHUB_REPOSITORY}"
--token "${RUNNER_TOKEN}"
--url "${registration_url}"
--works "${RUNNER_WORKDIR}"
--labels "${RUNNER_LABELS}"
--unattended
--replace

echo "adding actions service"
sudo ./svc.sh install

echo "starting all services with actions."
sudo systemctl start 'actions.
'

tail -f /dev/null
`

Example workflow.yml
`name: nodejs test

on:
push:
branches:
- dev

jobs:
Analysis:
runs-on: self-hosted

env:
  CI: true
  USER: github

strategy:
  matrix:
    node-version: [ 10.19.0 ]
steps:
  - name: Ensure Dependencies
    run: |
      sudo apt update
      sudo apt install -y gcc
      sudo apt install -y make
      sudo apt install -y lsof
      sudo apt install -y curl
      sudo apt install -y psmisc
      sudo apt install -y node-gyp
  - name: Ensure Known Hosts
    run: |
      mkdir -p ~/.ssh/
      touch ~/.ssh/known_hosts
  - name: Pull Repo
    uses: actions/checkout@v2
  - name: Use Node
    uses: actions/setup-node@v1
    with:
      node-version: ${{ matrix.node-version }}
  - name: Run Setup.sh
    run: |
      export CI=true
      ./scripts/setup.sh
  - name: Run Tests
    run: sudo ./scripts/run_all_tests.sh
  - name: Extra Logs if Failed
    if: ${{ failure() }}
    run: |
      npm -v
      node -v
      npm config ls -l
      cat ~/.npm/_logs/*

`

Notes:

@kabamawutschnik kabamawutschnik added the bug Something isn't working label Jan 19, 2021
@TingluoHuang
Copy link
Member

@kabamawutschnik have you check syslog to confirm the runner get killed by OOM Killer?
grep -i kill /var/log/syslog

You can also manually modify the oom_score_adj for the runner process before it takes any jobs.

@kabamawutschnik
Copy link
Author

@TingluoHuang I had to install rsyslog and start it (with a default conf), to start getting syslog. After doing so, I ran the workflow again and ran your grep command. I did not find any results.

I set the oom_score_adj for the github runner service, to -999. Re running the workflow again, resulted in the same job cancellation.

I forgot to note: The exact same workflow runs to completion in a Github hosted runner.

@TingluoHuang
Copy link
Member

@kabamawutschnik do you mind share the full runner diag logs for both runner.listener and runner.worker?

The runner cancel jobs only when

  • user request cancel from server
  • step times out
  • runner can't communicate with the server to report heartbeats.

@kabamawutschnik
Copy link
Author

I believe these are the ones.

Worker_20210119-002009-utc.log
Runner_20210119-002002-utc.log

@TingluoHuang
Copy link
Member

[2021-01-19 00:21:54Z INFO Runner] Received Ctrl-C signal, stop Runner.Listener and Runner.Worker.
[2021-01-19 00:21:54Z INFO HostContext] Runner will be shutdown for UserCancelled

@kabamawutschnik something sends a Ctrl-C to the runner while it's running.

@kabamawutschnik
Copy link
Author

@TingluoHuang Strange.. The exact same workflow runs fully on a github hosted runner. Are github hosted runners some how more resilient to kill commands?

Searching through setup.sh I found a location where we run killall mongod && killall mongos && killall -9 mongos. The job appears to end just after this portion of code.

@kabamawutschnik
Copy link
Author

I did some testing.

I commented out a bunch of ./bin/mongo calls, and made it past the same portion of code that I was exiting on. Job is still exiting the same way, but seems to progress further.

Could this be a result of child process management?

@TingluoHuang
Copy link
Member

@kabamawutschnik try to figure out what process sends the SIGINT?

https://unix.stackexchange.com/a/372581

@kabamawutschnik
Copy link
Author

kabamawutschnik commented Jan 20, 2021

Attached to process 655 (worker process), & tracing all the child processes of 655. Looks like the worker sends SIGINT then SIGTERM.

Note: I notice that our github runner service dies after the exited job. I need to restart it with sudo systemctl start 'actions.servicename'

2671 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=2676, si_uid=1000, si_status=0, si_utime=0, si_stime=0} ---
2679 +++ exited with 0 +++
2666 --- SIGINT {si_signo=SIGINT, si_code=SI_USER, si_pid=655, si_uid=1000} ---
2680 +++ exited with 0 +++
2678 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=2679, si_uid=1000, si_status=0, si_utime=0, si_stime=0} ---
2678 +++ exited with 0 +++

.........

2716 +++ exited with 0 +++
2712 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=2716, si_uid=0, si_status=0, si_utime=2, si_stime=70} ---
2666 --- SIGTERM {si_signo=SIGTERM, si_code=SI_USER, si_pid=655, si_uid=1000} ---
2666 +++ killed by SIGTERM +++
2646 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_KILLED, si_pid=2666, si_uid=1000, si_status=SIGTERM, si_utime=0, si_stime=0} ---
2720 +++ exited with 0 +++
2719 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=2720, si_uid=0, si_status=0, si_utime=358, si_stime=20} ---

strace.log

@kabamawutschnik
Copy link
Author

kabamawutschnik commented Jan 22, 2021

Starting the runner as a foreground process, with run.sh, allowed the job to get past the issue. Not exactly sure why the service version is dying so easily. Perhaps a setting could be changed in the unit file?

@MartinNowak
Copy link

MartinNowak commented Apr 16, 2021

What's definitely confusing is that the job ends up marked as cancelled in GitHub Actions, even though runsvc is notified.

Apr 16 13:04:04 gh-runner-03 systemd[1]: gh-actions-runner.service: A process of this unit has been killed by the OOM killer.
Apr 16 13:04:04 gh-runner-03 runsvc.sh[305080]: Shutting down runner listener
Apr 16 13:04:04 gh-runner-03 runsvc.sh[305080]: Sending SIGINT to runner listener to stop
Apr 16 13:04:04 gh-runner-03 runsvc.sh[305080]: Exiting...
Apr 16 13:04:08 gh-runner-03 runsvc.sh[305080]: 2021-04-16 11:04:08Z: Job run completed with result: Canceled
Apr 16 13:04:08 gh-runner-03 runsvc.sh[305080]: Runner listener exited with error code 0
Apr 16 13:04:08 gh-runner-03 runsvc.sh[305080]: Runner listener exit with 0 return code, stop the service, no retry needed.
Apr 16 13:04:08 gh-runner-03 systemd[1]: gh-actions-runner.service: Failed with result 'oom-kill'.
Apr 16 13:04:08 gh-runner-03 systemd[1]: gh-actions-runner.service: Consumed 4h 54min 39.841s CPU time.
Apr 16 13:04:13 gh-runner-03 systemd[1]: gh-actions-runner.service: Scheduled restart job, restart counter is at 5.
Apr 16 13:04:13 gh-runner-03 systemd[1]: Stopped GitHub Actions Runner.

Job run completed with result: Canceled

Should've been failure when a job is not canceled from GitHub Actions, but just killed/terminated on the server, wouldn't you agree? At lest to me job status is reported from the perspective of GH Actions, not from the underlying runners.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working needs-investigation Runner Bug Bug fix scope to the runner
Projects
None yet
Development

No branches or pull requests

5 participants