arXiv Matplotlib Query

Anecdotally, the Matplotlib maintainers were told:

"About 15% of arXiv papers use Matplotlib"

Unfortunately the original analysis of this data was lost. We reproduce it here.

Watermark

Starting in the early 2010s, Matplotlib began including the bytes b"Matplotlib" in every PNG and PDF that it produces. These bytes persist in the output PDFs stored on arXiv, so checking whether a PDF contains a Matplotlib image is simple: scan through every PDF and look for these bytes. No parsing is required.
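As a minimal sketch of the byte scan described above (not necessarily the code used for the original analysis), the check can be done on any binary stream without loading the whole file. The function name `contains_marker` and the `chunk_size` parameter are illustrative choices, not names from this repository.

```python
import io

MARKER = b"Matplotlib"


def contains_marker(stream, marker=MARKER, chunk_size=1 << 20):
    """Return True if `marker` occurs anywhere in the binary stream."""
    keep = len(marker) - 1  # bytes to carry over between chunks
    tail = b""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            return False
        window = tail + chunk
        if marker in window:
            return True
        # Carry the end of this window forward so that a marker split
        # across two chunks is still found.
        tail = window[-keep:] if keep else b""


if __name__ == "__main__":
    # Synthetic stand-in for a PDF; a real run would pass open(path, "rb").
    fake_pdf = io.BytesIO(b"%PDF-1.4 ... /Creator (Matplotlib) ...")
    print(contains_marker(fake_pdf))
```

Reading in chunks keeps memory use bounded for large PDFs; for small files, `marker in f.read()` is equivalent.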

Data

The data is stored in a Requester Pays bucket at s3://arxiv (more information at https://arxiv.org/help/bulk_data_s3) and also on GCS, hosted by Kaggle (more information at https://www.kaggle.com/datasets/Cornell-University/arxiv).

The data is about 1 TB in size, so we use Dask to process it in parallel.
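The following is a rough sketch of how such a scan might be spread across many files with Dask. It assumes the PDFs are bundled into tar archives and uses `dask.bag`; both are assumptions here rather than details confirmed by this README. The names `scan_tar`, `scan_many`, and `open_fn` are hypothetical.

```python
import io
import tarfile

MARKER = b"Matplotlib"


def scan_tar(fileobj, marker=MARKER):
    """Return a list of (member_name, has_marker) for each PDF in a tar archive."""
    results = []
    with tarfile.open(fileobj=fileobj, mode="r") as archive:
        for member in archive:
            # Skip directories and anything that is not a PDF.
            if not member.isfile() or not member.name.lower().endswith(".pdf"):
                continue
            data = archive.extractfile(member).read()
            results.append((member.name, marker in data))
    return results


def scan_many(paths, open_fn=open, npartitions=None):
    """Scan many tar archives in parallel; one Dask task per partition of paths."""
    import dask.bag as db  # imported lazily so scan_tar works without Dask

    def scan_path(path):
        with open_fn(path, "rb") as f:
            return scan_tar(f)

    bag = db.from_sequence(paths, npartitions=npartitions)
    # Each archive yields a list of results; flatten them into one list.
    return bag.map(scan_path).flatten().compute()
```

With local archives, `scan_many(["part1.tar", "part2.tar"])` would return one `(name, bool)` pair per PDF. For the S3 bucket, `open_fn` would need to be replaced by an opener that handles requester-pays access, for example one built on a library such as s3fs.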

