This package provides a means to scrape repositories and source code files from GitHub.
See //datasets/github for instructions on how to set up your GitHub credentials.
Create a "clone list" file that lists, for each language, the GitHub repository queries to run:
$ cat ./clone_list.pbtxt
# File: //datasets/github/scrape_repos/proto/scrape_repos.proto
# Proto: scrape_repos.LanguageCloneList
language {
  language: "java"
  destination_directory: "/tmp/phd/datasets/github/scrape_repos/java"
  query {
    string: "language:java sort:stars fork:false"
    max_results: 10
  }
}
The full schema is defined in //datasets/github/scrape_repos/proto/scrape_repos.proto.
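A clone list can cover multiple languages, and each language can carry several queries. The sketch below assumes that language and query are repeated fields, as the message name LanguageCloneList and the nested query block above suggest; the values are illustrative only:

# Hypothetical multi-language clone list. Field names are taken from
# the example above; repository queries and paths are illustrative.
language {
  language: "java"
  destination_directory: "/tmp/phd/datasets/github/scrape_repos/java"
  query {
    string: "language:java sort:stars fork:false"
    max_results: 10
  }
  query {
    string: "language:java topic:android fork:false"
    max_results: 10
  }
}
language {
  language: "c"
  destination_directory: "/tmp/phd/datasets/github/scrape_repos/c"
  query {
    string: "language:c sort:stars fork:false"
    max_results: 10
  }
}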
Scrape GitHub to create GitHubRepositoryMeta messages describing the repositories matched by each query:
$ bazel run //datasets/github/scrape_repos:scraper -- \
--clone_list $PWD/clone_list.pbtxt
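To sanity-check the scraper's output, count the meta files it produced. This is a minimal sketch which assumes the scraper writes one .pbtxt meta file per repository into the clone list's destination_directory; the path matches the example above:

# check_scrape.py - count scraped GitHubRepositoryMeta files.
# Assumes one .pbtxt file per repository in destination_directory.
import pathlib

destination = pathlib.Path("/tmp/phd/datasets/github/scrape_repos/java")
meta_files = sorted(destination.glob("*.pbtxt"))
print(f"{len(meta_files)} repositories scraped")
for path in meta_files[:5]:
    print(" ", path.name)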
Run the cloner to download the repos scraped in the previous step:
$ bazel run //datasets/github/scrape_repos:cloner -- \
--cloner_clone_list $PWD/clone_list.pbtxt
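A quick way to confirm the download is to count the checked-out repositories. A minimal sketch, assuming the cloner places each repository in its own subdirectory of destination_directory:

# check_clones.py - count repositories downloaded by the cloner.
# Assumes each clone is a git checkout under destination_directory.
import pathlib

destination = pathlib.Path("/tmp/phd/datasets/github/scrape_repos/java")
clones = [d for d in destination.iterdir() if (d / ".git").is_dir()]
print(f"{len(clones)} repositories cloned")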
Extract individual source files from the cloned repos and import them into a contentfiles database using:
$ bazel run //datasets/github/scrape_repos:importer -- \
--importer_clone_list $PWD/clone_list.pbtxt
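To take a first look at the resulting database, you can list its tables without knowing the schema. A sketch under two assumptions: the contentfiles database is SQLite, and the file path below is hypothetical (point it at the file the importer actually produced):

# inspect_db.py - list tables in the contentfiles database.
# Assumes a SQLite database; the path below is hypothetical.
import sqlite3

conn = sqlite3.connect("/tmp/phd/datasets/github/scrape_repos/java.db")
tables = conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table'").fetchall()
for (name,) in tables:
    count = conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
    print(f"{name}: {count} rows")
conn.close()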
Export the source files from the contentfiles database to a directory:
$ bazel run //datasets/github/scrape_repos:export_corpus -- \
--clone_list $PWD/clone_list.pbtxt \
--export_path /tmp/phd/datasets/github/scrape_repos/corpuses/java
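To get a rough sense of the exported corpus, tally its files and bytes. The path below matches the --export_path flag used above:

# corpus_stats.py - summarize the exported corpus directory.
import pathlib

export_path = pathlib.Path(
    "/tmp/phd/datasets/github/scrape_repos/corpuses/java")
files = [f for f in export_path.rglob("*") if f.is_file()]
total_bytes = sum(f.stat().st_size for f in files)
print(f"{len(files)} files, {total_bytes / 1e6:.1f} MB")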