25.3 million pull request review comments on GitHub since January 2015 till December 2018.
xz-compressed CSV, with columns:
COMMENT_ID
- identifier of the comment in mother dataset - GH ArchiveCOMMIT_ID
- commit hash to which the review comment is attachedURL
- path to the GitHub pull request the comment comes fromAUTHOR
- GitHub user of the author of the commentCREATED_AT
- creation date of the commentBODY
- raw content of the comment
Python:
# too big for pandas.read_csv
import codecs
import csv
import lzma
with lzma.open("review_comments.csv.xz") as archf:
reader = csv.DictReader(codecs.getreader("utf-8")(archf))
for record in reader:
print(record)
The dataset was generated from GH Archive in the following notebook.
The comments which exceeded Python's csv.field_size_limit
equal to 128KB were discarded (~10 comments).
We gathered some statistics about the dataset.