Clone Popular GitHub Repos by Language

This package scrapes repositories and source code files from GitHub.

Prerequisites

See //datasets/github for instructions on how to set up your GitHub credentials.

Usage

Create a "clone list" file that lists the languages to scrape and, for each language, the GitHub repository queries to run:

$ cat ./clone_list.pbtxt
# File: //datasets/github/scrape_repos/proto/scrape_repos.proto
# Proto: scrape_repos.LanguageCloneList

language {
  language: "java"
  destination_directory: "/tmp/phd/datasets/github/scrape_repos/java"
  query {
    string: "language:java sort:stars fork:false"
    max_results: 10
  }
}

See the schema defined in //datasets/github/scrape_repos/proto/scrape_repos.proto.
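
When scraping several languages, writing the clone list by hand gets repetitive. The following is a minimal sketch that renders one `language` block per language, assuming only the fields shown in the example above. The helper name `render_clone_list` and its parameters are hypothetical and not part of this package; check the .proto schema for any fields not covered here.

```python
def render_clone_list(languages, root_dir, max_results=10):
    """Render clone list text with one `language` block per language.

    Hypothetical helper: it only emits the fields that appear in the
    example clone list (language, destination_directory, query).
    """
    header = (
        "# File: //datasets/github/scrape_repos/proto/scrape_repos.proto\n"
        "# Proto: scrape_repos.LanguageCloneList\n"
    )
    blocks = []
    for lang in languages:
        blocks.append(
            "language {\n"
            f'  language: "{lang}"\n'
            f'  destination_directory: "{root_dir}/{lang}"\n'
            "  query {\n"
            # Same query shape as the example: most-starred, non-fork repos.
            f'    string: "language:{lang} sort:stars fork:false"\n'
            f"    max_results: {max_results}\n"
            "  }\n"
            "}\n"
        )
    # Blank line after the header and between blocks, as in the example.
    return header + "\n" + "\n".join(blocks)
```

For example, `render_clone_list(["java", "go"], "/tmp/phd/datasets/github/scrape_repos")` returns text that can be written to `clone_list.pbtxt` and passed to the commands in this README.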

Scrape GitHub to create GitHubRepositoryMeta messages for the repos matching each query:

$ bazel run //datasets/github/scrape_repos:scraper -- \
    --clone_list $PWD/clone_list.pbtxt

Run the cloner to download the repos scraped in the previous step:

$ bazel run //datasets/github/scrape_repos:cloner -- \
    --cloner_clone_list $PWD/clone_list.pbtxt

Extract individual source files from the cloned repos and import them into a contentfiles database using:

$ bazel run //datasets/github/scrape_repos:importer -- \
    --importer_clone_list $PWD/clone_list.pbtxt

Export the source files from the corpus database to a directory:

$ bazel run //datasets/github/scrape_repos:export_corpus -- \
    --clone_list $PWD/clone_list.pbtxt \
    --export_path /tmp/phd/datasets/github/scrape_repos/corpuses/java
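
The four stages above run in a fixed order (scrape, clone, import, export), so they can be chained from a small driver script. The sketch below only reuses the bazel targets and flags shown in this README; the function names `build_pipeline_commands` and `run_pipeline` are hypothetical, and the dry-run default is a design choice of this example rather than a feature of the package.

```python
import subprocess

# Bazel package holding the targets used in the commands above.
TARGET_PREFIX = "//datasets/github/scrape_repos"


def build_pipeline_commands(clone_list, export_path):
    """Return the argv list for each stage, in the order they must run."""
    return [
        ["bazel", "run", f"{TARGET_PREFIX}:scraper", "--",
         "--clone_list", clone_list],
        ["bazel", "run", f"{TARGET_PREFIX}:cloner", "--",
         "--cloner_clone_list", clone_list],
        ["bazel", "run", f"{TARGET_PREFIX}:importer", "--",
         "--importer_clone_list", clone_list],
        ["bazel", "run", f"{TARGET_PREFIX}:export_corpus", "--",
         "--clone_list", clone_list, "--export_path", export_path],
    ]


def run_pipeline(commands, dry_run=True):
    """Print each command, or execute it when dry_run is False."""
    for argv in commands:
        if dry_run:
            print(" ".join(argv))
        else:
            # check=True stops the pipeline at the first failing stage.
            subprocess.run(argv, check=True)
```

Calling `run_pipeline(build_pipeline_commands("/abs/path/clone_list.pbtxt", "/tmp/phd/datasets/github/scrape_repos/corpuses/java"))` prints the four commands; pass `dry_run=False` to execute them. Pass the clone list as an absolute path, as the commands above do with `$PWD`.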