Releases: mattbierbaum/arxiv-public-datasets
Dataset snapshot release v0.2.0
Release of the arXiv public dataset libraries, with the ability to gather and process:
1. arXiv metadata provided by OAI
2. PDFs downloaded from S3
3. Full plain text generated by pdftotext
4. Internal co-citation network
5. Parsed author lines (v0.2.0)
The binaries available are:
- arxiv-metadata-hash-abstracts-v0.2.0-2019-03-01.json.gz
  Full metadata downloaded from (1) with hashed abstracts in place of the abstract text.
- internal-references-v0.2.0-2019-03-01.json.gz
  Snapshot of the internal co-citation network at the time of release generated with (4).
- authors-parsed-v0.2.0-2019-03-01.json.gz
  Parsed author lines at time of release generated by (5).
- manifest-index-v0.2.0-2019-03-01.json.gz
  A detailed file-level manifest dictionary mapping the tarpdf files in the S3 manifest to the arXiv file
  paths they contain. This can be used to target a subset of the arXiv bulk download, as discussed in
  this issue.
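
As a rough illustration of how the manifest index might be used to target a subset of the bulk download, the Python sketch below loads a gzipped JSON file and picks out the tar files that contain paths of interest. The layout assumed here (a dictionary mapping each tar file name to a list of the file paths it contains) is inferred from the description above and has not been checked against the actual file; `load_manifest`, `tars_containing`, and the sample names are hypothetical.

```python
import gzip
import json


def load_manifest(path):
    """Read a gzipped JSON manifest into a Python object."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        return json.load(fh)


def tars_containing(manifest, wanted_prefixes):
    """Return the sorted tar file names holding at least one wanted path.

    ASSUMPTION: `manifest` maps tar file name -> list of contained file paths.
    """
    prefixes = tuple(wanted_prefixes)
    return sorted(
        tar_name
        for tar_name, paths in manifest.items()
        if any(p.startswith(prefixes) for p in paths)
    )


# Tiny made-up manifest for demonstration only (not real arXiv data).
example_manifest = {
    "pdf/example_0001.tar": ["0001/astro-ph0001001.pdf", "0001/hep-th0001002.pdf"],
    "pdf/example_0002.tar": ["0002/math0002001.pdf"],
}
selected = tars_containing(example_manifest, ["0002/"])
print(selected)
```

With the real file, `load_manifest("manifest-index-v0.2.0-2019-03-01.json.gz")` would take the place of `example_manifest`, provided the assumed layout holds; only the selected tar files would then need to be fetched from S3.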
Dataset snapshot release v0.1.1
Initial release (apart from a security update), with the ability to gather and process:
1. arXiv metadata provided by OAI
2. PDFs downloaded from S3
3. Full plain text generated by pdftotext
4. Internal co-citation network
The binaries available are:
- arxiv-metadata-hash-abstracts-v0.1.1-2019-03-01.json.gz
  Full metadata downloaded from (1) with hashed abstracts in place of the abstract text.
- internal-references-v0.1.1-2019-03-01.json.gz
  Snapshot of the internal co-citation network at the time of release generated with (4).
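
One simple thing a reference snapshot like this could support is counting how often each paper is cited from within arXiv. The Python sketch below shows that count on a toy input. It assumes the snapshot decodes to a dictionary mapping a citing arXiv ID to the list of arXiv IDs it references; that layout is an assumption, not something stated in these notes, and `citation_counts` and the sample IDs are hypothetical.

```python
from collections import Counter


def citation_counts(references):
    """Count how many distinct papers cite each arXiv ID.

    ASSUMPTION: `references` maps a citing arXiv ID to an iterable of the
    arXiv IDs it cites.
    """
    counts = Counter()
    for citing_id, cited_ids in references.items():
        # Use a set so repeated mentions within one paper count once,
        # and drop self-references.
        for cited_id in set(cited_ids) - {citing_id}:
            counts[cited_id] += 1
    return counts


# Made-up IDs for demonstration only (not real arXiv data).
example_refs = {
    "1001.0001": ["0901.0002", "0901.0003", "0901.0002"],
    "1002.0004": ["0901.0002", "1002.0004"],
}
top = citation_counts(example_refs).most_common(1)
print(top)
```

Collapsing each reference list to a set keeps a paper that mentions the same ID several times from inflating the count.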