Very Long scientific papers

This repository contains the code and data for the Very Long Scientific Papers dataset, which is based on arxiv.org. The data is stored under the final/test directory: each paper has a PAPER_ID.main.txt file and a corresponding PAPER_ID.abstract.txt file.
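
As a rough illustration of the layout described above, the following sketch pairs each PAPER_ID.main.txt with its PAPER_ID.abstract.txt. The function name `load_pairs` is hypothetical and not part of this repository, and UTF-8 encoding is an assumption.

```python
from pathlib import Path


def load_pairs(data_dir):
    """Yield (paper_id, main_text, abstract_text) for each complete pair.

    Hypothetical helper: assumes files are named PAPER_ID.main.txt and
    PAPER_ID.abstract.txt and are UTF-8 encoded.
    """
    data_dir = Path(data_dir)
    suffix = ".main.txt"
    for main_path in sorted(data_dir.glob("*" + suffix)):
        paper_id = main_path.name[: -len(suffix)]
        abstract_path = data_dir / f"{paper_id}.abstract.txt"
        if not abstract_path.exists():
            continue  # skip papers that have no matching abstract file
        yield (
            paper_id,
            main_path.read_text(encoding="utf-8"),
            abstract_path.read_text(encoding="utf-8"),
        )
```

It could be used as `for paper_id, body, abstract in load_pairs("final/test"): ...`.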

Data gathering process

The data is gathered by main.py using the following steps:

1. Search for anything containing the word "thesis" in the title using the arXiv API
2. Download the source for these documents
3. Use engrafo to convert the source into HTML
4. Filter the HTML to remove math, images, etc.
5. Find the abstract and separate it (if it cannot be found, skip the document)
6. Convert to txt format
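
The search and download steps above can be sketched as follows. This is a minimal illustration, not the code in main.py: the helper names are hypothetical, and it assumes the public arXiv API query endpoint with a `ti:` title search and the `https://arxiv.org/e-print/` source URL. It only builds URLs and parses a feed that has already been fetched, so it makes no network requests itself.

```python
import urllib.parse
import xml.etree.ElementTree as ET

ARXIV_API = "http://export.arxiv.org/api/query"
ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}


def build_query_url(word="thesis", start=0, max_results=100):
    """Build an arXiv API URL that searches paper titles for `word`."""
    params = {
        "search_query": f"ti:{word}",  # "ti:" restricts the search to titles
        "start": start,
        "max_results": max_results,
    }
    return f"{ARXIV_API}?{urllib.parse.urlencode(params)}"


def extract_ids(atom_feed):
    """Return the arXiv identifiers found in an Atom feed string."""
    root = ET.fromstring(atom_feed)
    ids = []
    for entry in root.findall("atom:entry", ATOM_NS):
        id_url = entry.findtext("atom:id", default="", namespaces=ATOM_NS)
        # Entry ids look like http://arxiv.org/abs/<identifier>
        ids.append(id_url.rsplit("/abs/", 1)[-1])
    return ids


def source_url(paper_id):
    """URL of the source archive for a paper (assumed download location)."""
    return f"https://arxiv.org/e-print/{paper_id}"
```

Paging through all results would mean calling `build_query_url` repeatedly with an increasing `start` value.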

To gather your own data, simply run main.py.
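
For readers who want to adapt the filtering and abstract-separation steps, here is a minimal sketch using only the Python standard library. It is not the implementation in main.py. The set of skipped tags and the `ltx_abstract` class name are assumptions about the HTML that engrafo produces, and `split_html` is a hypothetical name.

```python
from html.parser import HTMLParser

# Assumptions: which elements to drop and how the abstract is marked up.
SKIP_TAGS = {"math", "figure", "table", "script", "style"}
BLOCK_TAGS = {"p", "div", "section", "li", "h1", "h2", "h3", "h4", "h5", "h6"}
ABSTRACT_CLASS = "ltx_abstract"


class TextSplitter(HTMLParser):
    """Collect visible text, routing the abstract and the body separately."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0      # > 0 while inside a skipped element
        self.abstract_depth = 0  # > 0 while inside the abstract <div>
        self.main, self.abstract = [], []

    def _target(self):
        return self.abstract if self.abstract_depth else self.main

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag == "div":
            classes = (dict(attrs).get("class") or "").split()
            if self.abstract_depth:
                self.abstract_depth += 1  # nested <div> inside the abstract
            elif ABSTRACT_CLASS in classes:
                self.abstract_depth = 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if tag in BLOCK_TAGS and not self.skip_depth:
            self._target().append("\n")  # keep block boundaries as whitespace
        if tag == "div" and self.abstract_depth:
            self.abstract_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self._target().append(data)


def split_html(html):
    """Return (main_text, abstract_text), or None if no abstract is found."""
    parser = TextSplitter()
    parser.feed(html)
    parser.close()
    main = " ".join("".join(parser.main).split())
    abstract = " ".join("".join(parser.abstract).split())
    if not abstract:
        return None  # mirrors the "skip document" rule in the steps above
    return main, abstract
```

Images need no special handling here because `<img>` elements carry no text, so collecting only text content drops them.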

ghomasHudson/very_long_scientific_papers
