
Question: File Counts and Dataset Size #44

Open
darien-schettler opened this issue Mar 25, 2023 · 1 comment

Comments

@darien-schettler

I recently downloaded The Stack (the-stack-dedup) from Hugging Face via Git LFS. I have two questions that I need help with:

  1. The size on disk of the dedup dataset is only around 900GB, much smaller than the 1.5TB indicated on the data card (https://huggingface.co/datasets/bigcode/admin/resolve/main/the-stack-infographic-v11.png). Why the difference?

  2. Is there somewhere where the file counts are listed in full for each dataset by language (dedup and full)?

Essentially, I want to make sure I have downloaded the entire dataset, so I either need to understand the size difference or know how many files there should be for each language so I can validate my download. Ideally both.

Thanks in advance!

@ChenghaoMou
Collaborator

ChenghaoMou commented May 11, 2023

The dataset was compressed with parquet + snappy when uploaded to the Hub. Here is a before/after deduplication comparison in terms of physical (uncompressed) size and number of files:
bquxjob_25a65048_188085aa72f.csv

The last line is the total change, here is the screenshot for quick reference:
[Screenshot: CleanShot 2023-05-10 at 18 15 25]
