This repository has been archived by the owner on Mar 25, 2024. It is now read-only.

MathML corpus statistics #26

Merged (6 commits) on Mar 24, 2019

Conversation

dginev commented Mar 20, 2019

As requested by the MathML 4 effort, I added a small example script that extracts an element-attribute-value footprint over the <math> elements of a given llamapun corpus.
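To make "element-attribute-value footprint" concrete, here is a minimal Python sketch of the idea. It is not the Rust example script from this PR. It assumes well-formed XML input, and `math_footprint`, `local_name`, and the `name@attr=value` key format are hypothetical choices made only for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample input: one MathML formula inside a small document.
SAMPLE = (
    '<html><body><p>x</p>'
    '<math xmlns="http://www.w3.org/1998/Math/MathML" display="block">'
    '<mrow><mi mathvariant="normal">x</mi><mo>+</mo><mi>y</mi></mrow>'
    '</math></body></html>'
)


def local_name(tag):
    """Drop a '{namespace}' prefix, if any, from an ElementTree name."""
    return tag.rsplit("}", 1)[-1]


def math_footprint(xml_text):
    """Count elements, element-attribute pairs and element-attribute-value
    triples for every node at or below a <math> element."""
    counts = Counter()
    # Explicit stack instead of recursion; the flag records whether the
    # node sits at or below a <math> element.
    stack = [(ET.fromstring(xml_text), False)]
    while stack:
        node, inside_math = stack.pop()
        name = local_name(node.tag)
        inside_math = inside_math or name == "math"
        if inside_math:
            counts[name] += 1
            for attr, value in node.attrib.items():
                attr = local_name(attr)
                counts[f"{name}@{attr}"] += 1
                counts[f"{name}@{attr}={value}"] += 1
        stack.extend((child, inside_math) for child in node)
    return counts


if __name__ == "__main__":
    for key, count in sorted(math_footprint(SAMPLE).items()):
        print(key, count)
```

Keeping all three levels in one counter means a single pass over each document is enough; a corpus-wide report would then be the sum of the per-document counters.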

I am running this over arXiv now, and will leave the PR open until I have vetted that the final data looks good.

Request issue with details here:
w3c/mathml#55

Aside: I also simplified the corpus_node_model example -- which still doesn't have any practical purpose -- to use a Result-returning main().

dginev commented Mar 21, 2019

Happily, the DLMF can be treated as a llamapun corpus if we just filter for the .html5 files and ignore all other auxiliary content. The MathML statistics script can then also be run over the DLMF, which takes a pleasing 15 seconds. So far so good.
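The filtering step can be sketched in a few lines of Python, assuming the corpus is simply a directory tree on disk. `corpus_files` is a hypothetical helper for illustration, not the llamapun corpus API.

```python
import os


def corpus_files(root, extension=".html5"):
    """Yield paths under `root` whose file name ends with `extension`,
    skipping all other (auxiliary) files."""
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # visit subdirectories in a stable order
        for filename in sorted(filenames):
            if filename.endswith(extension):
                yield os.path.join(dirpath, filename)
```

Because this is a generator, the documents can be processed one at a time without first building the full list of paths in memory.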

dginev commented Mar 23, 2019

As a side note, this PR also motivated looking into multi-threading the Rust libxml wrapper (see KWARC/rust-libxml#47) to speed up this type of traversal on modern CPUs. Having a multi-threaded libxml may also help us create DNM at scale faster, but that's something for another day.

Right now, traversing arXiv takes 70-75 hours (for 1.2 million .html files) on my spinning HDD storage. The bottleneck is not IO, and the run uses only a single CPU, so there is definitely room to squeeze out more performance.
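For scale, 70-75 hours over 1.2 million files is roughly 0.2 seconds per file. Since the work is CPU-bound and each file is independent, the traversal can in principle be spread over a worker pool and the per-file counts merged afterwards. The Python sketch below only illustrates that pattern; it is not the rust-libxml change discussed in the linked issue. `count_elements` and `parallel_counts` are hypothetical names, and the input files are assumed to be well-formed XML.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from concurrent.futures import ProcessPoolExecutor


def count_elements(path):
    """Per-file worker: count element names in one well-formed XML file."""
    counts = Counter()
    for node in ET.parse(path).iter():
        # Strip a '{namespace}' prefix, if present.
        counts[node.tag.rsplit("}", 1)[-1]] += 1
    return counts


def parallel_counts(paths, worker=count_elements,
                    executor_cls=ProcessPoolExecutor, max_workers=None):
    """Run `worker` over `paths` in a pool and merge the per-file Counters."""
    total = Counter()
    with executor_cls(max_workers=max_workers) as pool:
        # chunksize batches files per task; thread pools ignore it.
        for counts in pool.map(worker, paths, chunksize=64):
            total.update(counts)
    return total
```

With the default process pool, the call should sit under an `if __name__ == "__main__":` guard on platforms that spawn worker processes. Merging happens in the parent, so workers share no mutable state.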

dginev commented Mar 24, 2019

I have now generated the arXiv report, and generally feel this is a good state for the stats example, so merging here.

dginev merged commit 9a6b545 into master Mar 24, 2019
dginev mentioned this pull request Apr 14, 2019
dginev deleted the tag-statistics branch April 15, 2019 20:32