Add download Wikidata dump command to CLI #517

Closed
2 tasks done
andrewtavis opened this issue Dec 8, 2024 · 6 comments
Assignees
Labels
-priority- High priority feature New feature or request help wanted Extra attention is needed Outreachy Available for Outreachy participants

Comments

@andrewtavis
Member

Terms

Description

Scribe-Data will be expanding its functionality to work from Wikidata dumps. The first step is to add the ability for the CLI to download Wikidata Lexeme dumps. The following commands should be added in this issue:

# Latest dump:
scribe-data download --wikidata-dump
scribe-data d -wd

# Specific dump:
scribe-data download --wikidata-dump YYYYMMDD
scribe-data d -wd YYYYMMDD

# Specific output directory:
scribe-data download --wikidata-dump --output-dir DIRECTORY_PATH
scribe-data d -wd -od DIRECTORY_PATH

The above will download the dumps from dumps.wikimedia.org/wikidatawiki/entities/. In the first set of commands, the latest .json.bz2 file will be downloaded. In the second, the URL for the given YYYYMMDD stamp will be checked and a .json.bz2 dump will be downloaded to the PWD. The third adds an output directory path, as is done on the get command, but let's not change the file name. We'll just allow the user to put it in a directory 😊
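
A minimal sketch of how the three cases above could map to a URL and a local path. The base URL comes from the description; the file names `latest-lexemes.json.bz2` and `wikidata-YYYYMMDD-lexemes.json.bz2`, along with the helper names `build_dump_url` and `build_output_path`, are assumptions for illustration and not necessarily what Scribe-Data uses.

```python
from pathlib import Path
from typing import Optional

# Base URL taken from the issue description.
DUMP_BASE_URL = "https://dumps.wikimedia.org/wikidatawiki/entities"


def build_dump_url(date_stamp: Optional[str] = None) -> str:
    """Return the lexeme dump URL for a YYYYMMDD stamp, or for the latest dump if None."""
    if date_stamp is None:
        # Assumed file name for the most recent lexeme dump.
        return f"{DUMP_BASE_URL}/latest-lexemes.json.bz2"
    if len(date_stamp) != 8 or not date_stamp.isdigit():
        raise ValueError(f"Expected a YYYYMMDD stamp, got {date_stamp!r}")
    # Assumed layout: one folder per date stamp holding a dated file.
    return f"{DUMP_BASE_URL}/{date_stamp}/wikidata-{date_stamp}-lexemes.json.bz2"


def build_output_path(url: str, output_dir: Optional[str] = None) -> Path:
    """Keep the remote file name; only the directory changes (PWD by default)."""
    directory = Path(output_dir) if output_dir else Path.cwd()
    return directory / url.rsplit("/", 1)[-1]
```

For example, `build_output_path(build_dump_url("20241030"), "dumps")` gives `dumps/wikidata-20241030-lexemes.json.bz2`, so the file name is never changed, only the directory.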

The functionality should be added in a file src/scribe_data/cli/download.py, with the option being added into src/scribe_data/cli/main.py :)
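
One way the option could be wired up, sketched with the standard library's `argparse`. The flag names are the ones proposed in this issue; the parser layout, the `"latest"` sentinel, and the function name `build_parser` are assumptions and may not match the actual code in src/scribe_data/cli/main.py.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser exposing only the download subcommand from this issue."""
    parser = argparse.ArgumentParser(prog="scribe-data")
    subparsers = parser.add_subparsers(dest="command")

    # "d" is the short alias proposed in the issue.
    download_parser = subparsers.add_parser("download", aliases=["d"])

    # nargs="?" makes the date optional: a bare flag yields the "latest"
    # sentinel (an assumption), while a value is kept as the YYYYMMDD stamp.
    download_parser.add_argument(
        "-wd",
        "--wikidata-dump",
        nargs="?",
        const="latest",
        default=None,
        metavar="YYYYMMDD",
        help="Download the latest Wikidata lexeme dump, or the one for YYYYMMDD.",
    )
    download_parser.add_argument(
        "-od",
        "--output-dir",
        default=None,
        metavar="DIRECTORY_PATH",
        help="Directory to save the dump in (file name is unchanged).",
    )
    return parser
```

With this layout, `scribe-data d -wd -od DIRECTORY_PATH` parses to `wikidata_dump="latest"` and `output_dir="DIRECTORY_PATH"`, and the download logic itself can live in a separate module.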

Contribution

Being worked on by @axif0 as a part of Outreachy! 📶🚀

@andrewtavis andrewtavis added the feature New feature or request label Dec 8, 2024
@andrewtavis andrewtavis added help wanted Extra attention is needed -priority- High priority Outreachy Available for Outreachy participants labels Dec 8, 2024
@axif0
Collaborator

axif0 commented Dec 10, 2024

It should look like this:

image

Sorry, my network is too slow to fully download it for now.

@axif0
Collaborator

axif0 commented Dec 10, 2024

Also, should we add functionality so that if the user gives a wrong date, the CLI suggests the closest available older dumps? Here we would also check whether the folder for a given date contains a .json.bz2 file or not. For example, 20241030 has one and 20241122 does not.

The list of closest available older dumps only shows those that have a .json.bz2 file.

image
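
The fallback described in this comment could be sketched roughly as follows. The function name `closest_older_dumps`, the `listing` mapping, and the file names in the usage note are all hypothetical; how the folder listing is actually fetched from the dumps site is left out.

```python
from typing import Dict, List


def closest_older_dumps(
    requested: str, listing: Dict[str, List[str]], limit: int = 3
) -> List[str]:
    """Return up to `limit` date stamps at or before `requested`, newest first.

    `listing` maps a YYYYMMDD folder name to the file names found in it.
    Folders without any .json.bz2 file are skipped.
    """
    candidates = [
        stamp
        for stamp, files in listing.items()
        # YYYYMMDD strings sort chronologically, so plain comparison works.
        if stamp <= requested and any(name.endswith(".json.bz2") for name in files)
    ]
    return sorted(candidates, reverse=True)[:limit]
```

Given a listing where 20241016 and 20241030 each contain a .json.bz2 file and 20241122 does not, a request for 20241125 would yield 20241030 first, then 20241016.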

@axif0
Collaborator

axif0 commented Dec 10, 2024

Apologies in advance if those features are not planned. 😞

@andrewtavis
Member Author

No need to apologize, @axif0! You're a part of planning the features :) :)

This seems good to me! We'll just take the next most recent one? If there's no most recent one we'll take the last most recent one?

@axif0
Collaborator

axif0 commented Dec 11, 2024

> This seems good to me! We'll just take the next most recent one? If there's no most recent one we'll take the last most recent one?

Yes..

@axif0 axif0 mentioned this issue Dec 11, 2024
2 tasks
@andrewtavis
Member Author

Closed by #528 🚀 Thanks for the great work on making a sustainable and performant download functionality, @axif0! 😊

@github-project-automation github-project-automation bot moved this from Todo to Done in Scribe Board Dec 16, 2024