Rudimentary statistics for the dataset #4

zouharvi · 2022-09-29T09:14:23Z

This PR adds a script that computes some basic statistics about the corpus and displays them in the README. That is as much as I can extract from the fields that are currently available. From an NLP perspective, I'd be mostly interested in the languages and the number of sentences. Let me know if you have more ideas about this.

Semi-related notes:

I also updated the total row count based on what's in data/acl-publication-info.74k.parquet.
A more standard approach to TODOs would be to use GitHub Issues and list them there, instead of keeping them in the README. This would make tracking progress easier and would also make the "title page" less cluttered. What do you think?
S2 provides paper embeddings. We could do some fun stuff with that.
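
For illustration, the kind of statistics described above (a total row count plus per-field coverage) could be sketched roughly as follows. This is not the script from this PR: the function name `basic_stats` and the field names `title`, `abstract`, and `year` are hypothetical, and the rows are assumed to be already loaded as plain dictionaries.

```python
from collections import Counter


def basic_stats(records, fields=("title", "abstract", "year")):
    """Count rows and, per field, how many rows have a non-empty value.

    The field names are placeholders, not the actual corpus schema.
    """
    records = list(records)
    filled = Counter()
    for row in records:
        for field in fields:
            value = row.get(field)
            # Treat None and empty/whitespace-only strings as missing.
            if value is not None and str(value).strip():
                filled[field] += 1
    return {
        "rows": len(records),
        "filled": {field: filled[field] for field in fields},
    }


# Hypothetical toy rows standing in for the real corpus.
sample = [
    {"title": "A", "abstract": "text", "year": 2020},
    {"title": "B", "abstract": "", "year": 2021},
    {"title": "C", "abstract": None, "year": None},
]
stats = basic_stats(sample)
print(stats)
```

Reporting coverage per field makes it easy to see which statistics (e.g. languages or sentence counts) cannot be derived yet because the underlying field is missing or sparsely filled.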

shauryr · 2022-10-06T23:04:41Z

About the TODO: I agree. I will remove the TODO section soon and move it to issues.
Yeah, I have been working with https://huggingface.co/sentence-transformers/allenai-specter, trying to map things and find clusters.

rudimentary statistics for the dataset

afe0cf6

zouharvi changed the title ~~rudimentary statistics for the dataset~~ Rudimentary statistics for the dataset Sep 29, 2022

lint & prettify stats code

649fb0d
