Receiving 401 response during atlantis apply when using GitHub App authentication method #2285

Closed
cjbehm opened this issue May 31, 2022 · 11 comments · Fixed by #2479
Labels
bug Something isn't working waiting-on-review Waiting for a review from a maintainer

Comments

@cjbehm

cjbehm commented May 31, 2022

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request. Searching for pre-existing feature requests helps us consolidate datapoints for identical requirements into a single place, thank you!
  • Please do not leave "+1" or other comments that do not add relevant new information or questions, they generate extra noise for issue followers and do not help prioritize the request.
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment.

Overview of the Issue

Beginning May 20, we started receiving the error message below when atlantis apply is run.

{"level":"error","ts":"2022-05-24T12:17:26.429Z","caller":"events/command_runner.go:219","msg":"Unable to check user permissions: non-200 OK status code: 401 Unauthorized body: \"{\\\"message\\\":\\\"Bad credentials\\\",\\\"documentation_url\\\":\\\"https://docs.github.com/graphql\\\"}\"","json":{},"stacktrace":"github.com/runatlantis/atlantis/server/events.(*DefaultCommandRunner).RunCommentCommand\n\tgithub.com/runatlantis/atlantis/server/events/command_runner.go:219"}

This is the same error as in #2187, but we are not using gh-team-allowlist and are getting the error with 0.17.5 (which does not have that feature), so I opened a separate issue.

In that issue and in #2090, I've seen notes about checking the GitHub API rate limit, but I'm not sure that's possible with the OAuth installed application, since I don't have a token to call the rate limit APIs (REST or GraphQL, but this appears to be GraphQL from the error message) as that application user.

We originally noticed this behavior when we upgraded from 0.17.5 to 0.19.2 and thought it was a bug, so we upgraded to 0.19.3 and then downgraded back to 0.17.5, but all three versions have been exhibiting the same behavior.

Recreating the Atlantis container immediately resolves the problem (my guess is that it gets a new token from the OAuth flow). That is obviously not a great workflow, but it is our workaround for the moment.

Reproduction Steps

While we see it regularly, I don't know how to provide steps that someone else could use to reproduce the behavior.

Environment details

  • Atlantis version: 0.17.5, 0.19.2, 0.19.3
  • Atlantis flags:
ATLANTIS_GH_ORG="..."
ATLANTIS_GH_APP_ID="..."
ATLANTIS_GH_WEBHOOK_SECRET="..."
ATLANTIS_REPO_CONFIG="/etc/atlantis/atlantis.yaml"
ATLANTIS_GH_APP_KEY_FILE="/etc/atlantis/github-app-key.pem"
ATLANTIS_WRITE_GIT_CREDS="true"

Additional Context

We're running Atlantis as a GitHub app and in a container, which is how we have run it for the past ~1.5 years. Our cadence of updates to the repo that Atlantis is watching has not increased (if anything, it has slowed some).

@cjbehm cjbehm added the bug Something isn't working label May 31, 2022
@cjbehm
Author

cjbehm commented Jun 3, 2022

An update: we are in fact only seeing this on 0.19.2 and 0.19.3. Our original downgrade back to 0.17.5 didn't take effect initially, so we thought we were still seeing it with the older version.

@cjbehm
Author

cjbehm commented Jun 10, 2022

Because the error is about authentication, we decided to try updating back to 0.19.3 but changing the authentication from GH App to user+token.

Previously the error would occur after a few hours (regardless of activity levels), but so far after changing the authentication, we have not been receiving 401 responses when applying.

@cjbehm
Author

cjbehm commented Jun 14, 2022

We've now been running Atlantis 0.19.3 since June 9 using user+token authentication, with no errors.

We'd very much like to use the GH App route again, but it seems fairly clear there's some sort of issue there.

@cjbehm cjbehm changed the title Receiving 401 response during atlantis apply after Atlantis has run for awhile Receiving 401 response during atlantis apply when using GitHub App authentication method Jun 14, 2022
@shadiramadan

I'm regularly seeing this; I have to restart the Atlantis pod in k8s before it will work again. I would very much like to see this resolved. Could this be related to a credential that isn't being refreshed properly?

@valentindeaconu

valentindeaconu commented Aug 2, 2022

I have also faced this issue and I started investigating it. I think the bug comes from the GitHub App implementation.

Installation access tokens have the permissions configured by the GitHub App and expire after one hour.

Source: GitHub App documentation.

The problem here is that the token is fetched by Atlantis once when the server starts running and also when a new GitHub repository clone is made. This means that every time a new PR is opened a new clone is made and the token is refreshed. This bug appears when Atlantis enters an "idle" state for more than one hour.

I am not familiar with the codebase, but these are the calls I found that led me to this conclusion:

My suggestion for this fix would be the following:

  1. Compute a timestamp at 55 minutes after the moment when the GitHub App token arrived and cache it;
  2. If a new token refresh call is made, compare the current timestamp with the cached timestamp and if the current timestamp is after the cached timestamp, perform the token refresh query (and also refresh the cached timestamp), else skip the call;
  3. Add a token refresh call every time a new plan or apply comment arrives.

I'd ask someone who also knows the codebase to confirm my findings and if everything is correct, I will open a PR to solve this issue.

@valentindeaconu

valentindeaconu commented Aug 5, 2022

@jamengual, could you please take a look at my response?

@stasostrovskyi
Contributor

We see the same issue with v0.19.9-pre.20220822, but surprisingly only when Atlantis makes GraphQL calls, never on normal plan/apply operations.

@jamengual
Contributor

@lilincmu this only happens when Atlantis is configured as a GitHub App.

@jamengual jamengual added the waiting-on-review Waiting for a review from a maintainer label Aug 26, 2022
@jamengual
Contributor

@valentindeaconu we are working on this

@jamengual
Copy link
Contributor

#2469

@rayterrill
Contributor

This is indeed happening only when using GitHub App auth, which has a 1-hour token lifetime (https://docs.github.com/en/developers/apps/building-github-apps/authenticating-with-github-apps#authenticating-as-an-installation), because the token for the GraphQL calls is only created once, during client initialization:

token, err := credentials.GetToken()

The underlying ghinstallation library can refresh tokens when they are near expiration; I just don't believe we have enough information once the client is created to do that, since we need the credentials and the GraphQL URL to make those calls.

I thought about a couple of ways to handle this - the cleanest to me seems to be to remove the initialization of the graphql client from the GithubClient initialization flow, and do that on-the-fly as we're making graphql queries to ensure we always have an up-to-date token.

I'll put up a PR with an example setup for this shortly.
