Rudimentary statistics for the dataset #4

Open
wants to merge 2 commits into
base: main


zouharvi

This PR adds a script that computes some basic statistics about the corpus and displays them in the README. This is as much as I can extract from the fields that are currently available. From an NLP perspective, I'd be mostly interested in the languages and the number of sentences. Let me know if you have more ideas about this.
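For readers who want a feel for what such a statistics script might compute, here is a minimal sketch. It is not the script from this PR: the field names `abstract` and `year` are hypothetical, and the toy records stand in for rows of the parquet file (which could presumably be loaded with `pandas.read_parquet(...).to_dict("records")`, assuming pandas and a parquet engine are installed).

```python
from collections import Counter
from statistics import mean, median


def basic_stats(rows, text_field="abstract", year_field="year"):
    """Summarise a list of dict records.

    The field names are assumptions for illustration only; they are
    not taken from the actual dataset schema.
    """
    # Keep only rows that have non-empty text in the chosen field.
    texts = [r[text_field] for r in rows if r.get(text_field)]
    # Whitespace tokenisation is a crude proxy for word count.
    lengths = [len(t.split()) for t in texts]
    # Count rows per year, skipping rows where the year is missing.
    years = Counter(r[year_field] for r in rows if r.get(year_field) is not None)
    return {
        "rows": len(rows),
        "rows_with_text": len(texts),
        "mean_words": mean(lengths) if lengths else 0.0,
        "median_words": median(lengths) if lengths else 0.0,
        "papers_per_year": dict(sorted(years.items())),
    }


# Toy records standing in for rows of the parquet file.
sample = [
    {"abstract": "We present a corpus .", "year": 2020},
    {"abstract": "A short abstract", "year": 2021},
    {"abstract": None, "year": 2021},
]
stats = basic_stats(sample)
print(stats)
```

Counting sentences or detecting languages would need extra tooling (a sentence splitter, a language identifier), which this sketch deliberately leaves out.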

Semi-related notes:

  • I also updated the total row count based on what's in data/acl-publication-info.74k.parquet.
  • A more standard approach to TODOs would be to list them in GitHub Issues instead of keeping them in the README. This would make tracking progress easier and would also make the "title page" less cluttered. What do you think?
  • S2 provides paper embeddings. We could do some fun stuff with that.

@zouharvi zouharvi changed the title rudimentary statistics for the dataset Rudimentary statistics for the dataset Sep 29, 2022
@shauryr
Owner

shauryr commented Oct 6, 2022

  1. About the TODO: I agree. I will remove the TODO section soon and move it to Issues.
  2. Yes, I have been working with https://huggingface.co/sentence-transformers/allenai-specter, trying to map things and find clusters.
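As a rough illustration of what "mapping things" with paper embeddings could look like, the sketch below ranks papers by cosine similarity to a query paper. It uses tiny made-up vectors and hypothetical paper IDs rather than real SPECTER or S2 embeddings, and it is not the approach described above, just one simple building block that clustering or nearest-neighbour search might start from.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    # Return 0.0 for zero vectors instead of dividing by zero.
    return dot / norm if norm else 0.0


def nearest(query_id, embeddings, k=2):
    """Return the IDs of the k papers most similar to `query_id`."""
    q = embeddings[query_id]
    scored = [
        (cosine(q, vec), pid)
        for pid, vec in embeddings.items()
        if pid != query_id
    ]
    # Highest similarity first.
    return [pid for _, pid in sorted(scored, reverse=True)[:k]]


# Made-up 3-dimensional vectors; real embeddings have hundreds of dimensions.
toy = {
    "paper_a": [1.0, 0.0, 0.0],
    "paper_b": [0.9, 0.1, 0.0],
    "paper_c": [0.0, 1.0, 0.0],
}
neighbours = nearest("paper_a", toy, k=1)
print(neighbours)
```

For a corpus of this size, a vectorised library would be far faster than pure Python; the loop form is used here only to keep the idea visible.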

2 participants