
GitHub Project LLVM Failing and not indexing #1093

Closed
bigmadkev opened this issue Dec 9, 2023 · 14 comments

Comments

@bigmadkev

With the following config I'm getting a lot of errors, and having left it to run overnight there is no data stored.
I'm using a fine-grained token with the settings for private repos as the README suggests.

---
workspaces:
  - name: llvm
    crawlers:
      - name: github-llvm
        provider:
          github_organization: llvm
          github_repositories: 
            - llvm-project
          github_token: GITHUB_TOKEN
        update_since: '2023-10-01'

I've attached the logs from the docker containers:

api.log
crawler.log
elastic.log

Any help / pointers really appreciated.

@bigmadkev bigmadkev changed the title GitHub GitHub Proect LLVM Failing and not indexing Dec 9, 2023
@bigmadkev bigmadkev changed the title GitHub Proect LLVM Failing and not indexing GitHub Project LLVM Failing and not indexing Dec 9, 2023
@TristanCacqueray
Contributor

From the crawler log, it seems like there is an unexpected error in the GraphQL response, and the process is likely stuck retrying, which would explain why there is no data stored.

The error is:

GQLError {
  message = "The additions count for this commit is unavailable.",
  locations = Just [Position {line = 83, column = 11}],
  path = Just [PropName "repository", PropName "pullRequests", PropName "nodes",
               PropIndex 8, PropName "commits", PropName "nodes", PropIndex 0,
               PropName "commit", PropName "additions"],
  errorType = Just (Custom "SERVICE_UNAVAILABLE"),
  extensions = Nothing
} :| []

The crawler needs a fix to handle that error (e.g. by updating the query/schema, or by reporting this issue to the GitHub API bug tracker).

We could also skip such errors (using DontRetry here) to at least provide partial data, but we don't have a way to report missing data in the user interface.
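
For context, the skip could look roughly like this (a sketch based on the retry package's retryingDynamic; the response type and the predicate are made up for illustration, this is not Monocle's actual code):

import Control.Retry (RetryAction (..), exponentialBackoff, limitRetries, retryingDynamic)

-- Simplified stand-in for the real GraphQL response type (illustration only).
data GQLResponse = GQLResponse
  { gqlErrors :: [String]
  , gqlData   :: Maybe String
  }

-- Made-up predicate matching the error from the log above.
isPermanentError :: GQLResponse -> Bool
isPermanentError resp =
  "The additions count for this commit is unavailable." `elem` gqlErrors resp

fetchWithRetry :: IO GQLResponse -> IO GQLResponse
fetchWithRetry runQuery =
  retryingDynamic
    (exponentialBackoff 1000000 <> limitRetries 5)
    (\_status resp -> pure (decide resp))
    (\_status -> runQuery)
 where
  decide resp
    | null (gqlErrors resp) = DontRetry     -- success: keep the result
    | isPermanentError resp = DontRetry     -- known-permanent error: give up and keep partial data
    | otherwise             = ConsultPolicy -- transient error: back off and retry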

@morucci
Collaborator

morucci commented Dec 10, 2023

That could also be a transient error from the GitHub API ("SERVICE_UNAVAILABLE"). Does it still happen?

As we request Changes in bulk, using DontRetry would skip a lot of Changes (25 AFAIR). We could instead reduce the bulk size before retrying (as we do for the server-side timeout query) and only DontRetry once we reach a bulk size of 1.
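
The shrinking strategy would look roughly like this (illustrative sketch, not the actual crawler code):

-- Try a bulk of n Changes; halve the bulk size on failure, and only
-- skip (DontRetry) once a bulk of 1 still fails.
fetchShrinking :: (Int -> IO (Either String [change])) -> Int -> IO [change]
fetchShrinking fetch n = do
  result <- fetch n
  case result of
    Right changes -> pure changes
    Left _err
      | n > 1     -> fetchShrinking fetch (n `div` 2) -- retry with a smaller bulk
      | otherwise -> pure []                          -- give up on this single Change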

@bigmadkev
Author

bigmadkev commented Dec 11, 2023

I seem to be getting fewer of those errors today, but still nothing is being stored.

Although, hitting the GraphQL API manually, I get an error when trying to fetch 3 or more PRs.

Would it be possible to have that number configurable in the config?

If I knew Haskell I would give it a go (I would normally try to learn and do it myself, but I'm in the middle of 3 months of training and my head becomes a shed).

@morucci
Collaborator

morucci commented Dec 12, 2023

I just did a test with the llvm-project repository on GitHub, and I don't have any issue fetching in bulks of 25 PRs. The crawler reduces by itself the number of PRs it attempts to fetch when it encounters a server-side timeout. With GitHub this can happen a lot when PRs carry plenty of comments and data, but for llvm that does not seem to be the case.

Regarding the other error, I see it, but I don't have a solution for now :( I need some time to experiment with solutions for that issue.

@bigmadkev
Author

bigmadkev commented Dec 13, 2023

I've run the query in the GitHub GraphQL Explorer, with failures as below.

So I've raised a question in the community for help: https://github.com/orgs/community/discussions/79021

GetProjectPullRequests_graphql.txt
GetProjectPullRequests_vars.txt

I'm now getting some data compared to the weekend but not everything :(

"errors": [
    {
      "type": "SERVICE_UNAVAILABLE",
      "path": [
        "repository",
        "pullRequests",
        "nodes",
        0,
        "commits",
        "nodes",
        0,
        "commit",
        "additions"
      ],
      "locations": [
        {
          "line": 84,
          "column": 15
        }
      ],
      "message": "The additions count for this commit is unavailable."
    }
  ]

@bigmadkev
Author

Still playing with the query. Taking the lines

additions
deletions

out of commits > commit, I get no errors now.

@TristanCacqueray
Contributor

Thank you very much for the feedback. It looks like working around this issue from Monocle's side is not going to be easy, as we would need to add an extra query (without the additions/deletions request).

@bigmadkev
Author

I've found an offending pull request: llvm/llvm-project#74806
Even the GitHub front end can't display it at the time of typing this. Is there any way to have the crawler skip a given list of pull requests when this happens? I think the crawler gets to this one and stops, then starts again and gets stuck again, and of course the only pull requests that get pulled in are the ones updated since the last run.

(screenshots attached)

Must have worked at some point for the review to have happened.

@TristanCacqueray
Contributor

The pull request query is defined in this module, and the parameters are documented here (search for pullRequests). It does not seem possible to filter out a given PR, at least not in an efficient way.

As discussed with @morucci, Monocle may be improved by reducing the query size when such an unexpected error happens, and once the size is one, we could perhaps skip the offending items by using the endCursor provided in the error body.

The crawler indexes items in chunks of 500, so you should be getting some data by moving the update_since parameter forward. Perhaps we should also interpret this error as the end of the stream so that the crawler submits everything it got up to that point.
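
The end-of-stream idea would look roughly like this (hand-wavy sketch, not the real crawler loop):

-- Paginate until the cursor runs out; on an unexpected error, end the
-- stream early and return what was accumulated so it can be submitted.
crawlPages
  :: (Maybe cursor -> IO (Either err ([item], Maybe cursor)))
  -> IO [item]
crawlPages fetchPage = go Nothing []
 where
  go cursor acc = do
    result <- fetchPage cursor
    case result of
      Left _err                    -> pure acc               -- error: submit partial data
      Right (items, Nothing)       -> pure (acc ++ items)    -- normal end of pagination
      Right (items, next@(Just _)) -> go next (acc ++ items) -- follow the endCursor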

Thanks again for investigating this issue, it's great feedback.

@bigmadkev
Author

bigmadkev commented Dec 19, 2023

Looking more, I see that there are additions and deletions at two different levels, and I wonder whether they are all needed or could be inferred from data already returned in the files part (see the sketch below)?

  1. repository > pullRequests > nodes > additions & deletions
  2. repository > pullRequests > nodes > commits > nodes > additions & deletions
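
By "inferred" I mean summing the per-file numbers the query already returns, something like this (field names made up, and I don't know Haskell, so treat it only as a sketch of the idea):

-- If only the totals matter, the per-file counts already returned
-- in the files part could be summed instead.
data PRFile = PRFile
  { fileAdditions :: Int
  , fileDeletions :: Int
  }

inferredAdditions, inferredDeletions :: [PRFile] -> Int
inferredAdditions = sum . map fileAdditions
inferredDeletions = sum . map fileDeletions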

Taking out the second ones gives me a clean run through all 8,490 pull requests, using a shabby Python script that makes the same query calls: no failures and 12 timeouts (most towards the end), versus a load of failures before, including one query that never completed even in the GraphQL Explorer. It took 35 minutes to run.

I can't see where that commits-level data is used (I've not clicked through everything to find it).

Although I can see where the Top Level one and the Files ones are used:
change/[ORG]@[REPO]@[CHANGEID]

As I say, I don't know Haskell, but I've managed to get a build running locally. It's taken me far too long, but hey. (Why not Python, so we can all hack 😉? Other than Python being a lot slower than Haskell, google search HvsP. I couldn't see an ADR for that one 😭 and I love to see ADRs being used!)

So if someone is able to point me at how to update the schema to remove the additions and deletions at the nodes > commits level, I would appreciate it; at least then I can get the data loaded up and move on to the next task :-)

Would love to learn Haskell (if only to help on the project, but also to keep my techie side happy), so any pointers to help in 2024 would be awesome. I may be an Agile Coach by trade now, but I'm still a software dev at heart 😆 ❤️ 👨‍💻

@TristanCacqueray
Contributor

Why not python

Good catch, this deserves an ADR. You can learn more about this choice in https://changemetrics.io/posts/2021-06-01-haskell-use-cases.html . The main reason is that the language is statically typed with an advanced type system.

update the schema to remove the additions and deletions

The schema that pulls additions/deletions is shared by two queries and is defined here. If you remove these attributes, the build will fail in the PullRequests and UserPullRequests modules, and you can replace the missing term with 0 to fix the compile errors (see the sketch after these notes). Here are a few notes to help you:

  • The GraphQL datatypes are generated from the query. You can run cabal haddock to get the module documentation, e.g. for Lentille.GitHub.PullRequests
  • The schema between the crawler and the api is defined in protobuf here (and the Haskell datatypes are also generated from this definition, see the Makefile)
  • The usage of the different data types is documented in this module: Monocle.Backend.Documents
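
For example, the substitution could look roughly like this (made-up record and field names, just to show the shape of the change, not the actual module code):

data DocCommit = DocCommit
  { commitSha       :: String
  , commitAdditions :: Int
  , commitDeletions :: Int
  }

-- Where the conversion previously read the generated additions/deletions
-- fields from the GraphQL result, substitute a constant 0 so the rest of
-- the pipeline still typechecks.
fromGraphQL :: String -> DocCommit
fromGraphQL sha =
  DocCommit
    { commitSha = sha
    , commitAdditions = 0 -- was: the generated 'additions' field
    , commitDeletions = 0 -- was: the generated 'deletions' field
    }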

Would love to learn haskell

I would recommend https://learn-haskell.blog/

@morucci
Collaborator

morucci commented Jan 3, 2024

@bigmadkev, the related PR is merged. We believe that the indexing issue is fixed/mitigated so please let us know if we can close that issue.

@bigmadkev
Author

Will clear my cache and let it run, and see if it's able to get back to its current state (just missing 1 pull request out of 9k+).

Cheers

@bigmadkev
Author

bigmadkev commented Jan 4, 2024

My checking GitHub query had the wrong date in for updated-since, so actually all the data is there, whoop whoop!

Thank you so much for the fix!
