Allow for total (t) CLI functionality to be ran using a Wikidata lexemes dump #520

Open
2 tasks done
andrewtavis opened this issue Dec 8, 2024 · 6 comments
Labels: feature (New feature or request), help wanted (Extra attention is needed), Outreachy (Available for Outreachy participants)

Comments

@andrewtavis
Member

Terms

Description

This issue will be the first issue to add dump processing functionality to the Scribe-Data CLI. In it, we'll do the following:

  • We'll add the --wikidata-dump (-wd) argument to the total command
  • If the user passes this argument, the functionality from #518 (Add check_lexeme_dump_prompt_download function to cli/utils.py) will be run to make sure that a dump is available, or to download one if not
  • From there, the functionality of the total command will be run over the dump rather than via the Wikidata query service
    • This functionality will be added to a file src/scribe_data/wikidata/parse_dump.py and called from the CLI
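
As a rough illustration of the first and third steps, the sketch below wires a --wikidata-dump (-wd) flag into a toy total subcommand with argparse. It is not the Scribe-Data parser: build_parser and choose_source are hypothetical names, the dump check from #518 is only marked with a comment because its signature is not given here, and treating the flag as a simple boolean is an assumption.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a toy parser with a `total` subcommand (hypothetical layout)."""
    parser = argparse.ArgumentParser(prog="scribe-data")
    subparsers = parser.add_subparsers(dest="command")

    total = subparsers.add_parser("total", aliases=["t"])
    total.add_argument("--language", help="Language to count lexemes for.")
    # Assumption: the flag is a plain boolean switch.
    total.add_argument(
        "--wikidata-dump",
        "-wd",
        action="store_true",
        help="Compute totals from a local lexemes dump instead of the query service.",
    )
    return parser


def choose_source(args: argparse.Namespace) -> str:
    """Return which backend the total command should use."""
    if args.wikidata_dump:
        # The dump availability check from #518 would run here before parsing.
        return "dump"
    return "query_service"
```

With this layout, `scribe-data total --language English -wd` would select the dump path, and omitting the flag would keep the current query service behavior.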

Before starting, we should map out the best way to process the dump, with a specific question being whether we need to uncompress the dump or whether we can work directly from the compressed .json.bz2 file.

Contribution

@axif0 will be working on this as a part of Outreachy! 📶🚤

@andrewtavis andrewtavis added the feature New feature or request label Dec 8, 2024
@andrewtavis andrewtavis added help wanted Extra attention is needed -priority- High priority Outreachy Available for Outreachy participants and removed -priority- High priority labels Dec 8, 2024
@andrewtavis andrewtavis changed the title Allow for --total (-t) CLI functionality to be ran using a Wikidata lexemes dump Allow for total (t) CLI functionality to be ran using a Wikidata lexemes dump Dec 8, 2024
@axif0
Collaborator

axif0 commented Dec 11, 2024

> Before starting, we should map out the best way to process the dump, with a specific question being whether we need to uncompress the dump or whether we can work directly from the compressed .json.bz2 file.

We can use the .bz2 file as it is: the bz2 module can open it in text mode ("rt"). I think this is an efficient approach for us, since we can work directly from the compressed .json.bz2 file without uncompressing it first.

Ref:
https://stackoverflow.com/questions/48078567/how-to-parse-wikidata-json-bz2-file-using-python/
https://docs.python.org/2/library/bz2.html
https://www.entropywins.wtf/blog/2015/11/08/wikidata-wikibase-json-dump-reader/
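
A minimal sketch of this approach, assuming the dump follows the usual Wikidata JSON dump layout (one entity per line inside a JSON array, with a trailing comma on each line). The function name iter_lexemes is hypothetical and not part of Scribe-Data.

```python
import bz2
import json
from typing import Iterator


def iter_lexemes(dump_path: str) -> Iterator[dict]:
    """Yield one lexeme dict at a time from a compressed .json.bz2 dump."""
    # Text mode ("rt") decompresses on the fly, so the dump is never
    # fully uncompressed on disk or loaded into memory at once.
    with bz2.open(dump_path, mode="rt", encoding="utf-8") as dump:
        for line in dump:
            line = line.strip().rstrip(",")
            # Skip blank lines and the array brackets that wrap the dump.
            if line in ("", "[", "]"):
                continue
            yield json.loads(line)
```

Because this is a generator, a total can be computed with something like `sum(1 for _ in iter_lexemes(path))` while keeping only one lexeme in memory at a time.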

@andrewtavis
Member Author

Perfect, @axif0! Let's go this route for sure then :) I remember using this before I believe. Feel free to get started here!

@axif0
Collaborator

axif0 commented Dec 18, 2024

For the en code, there are many lexicalCategory values in the lexeme dump (please see blocks 6 and 7 of the Colab notebook).

Also, there are many language codes for English. Should we combine them or show them individually? What would be the best approach here?
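
To make the two options concrete, here is a small sketch that counts lexemes per (language code, lexicalCategory) pair, either keeping codes such as en and en-gb separate or folding them into the base code. It assumes each lexeme dict has a "lemmas" mapping keyed by language code and a "lexicalCategory" QID, as in the Wikidata lexeme JSON. The names base_code and count_categories are hypothetical.

```python
from collections import Counter
from typing import Iterable


def base_code(code: str) -> str:
    """Fold a variant code such as 'en-gb' into its base code 'en'."""
    return code.split("-")[0].lower()


def count_categories(lexemes: Iterable[dict], combine_variants: bool = True) -> Counter:
    """Count lexemes per (language code, lexicalCategory) pair."""
    counts: Counter = Counter()
    for lexeme in lexemes:
        category = lexeme.get("lexicalCategory")
        lemma_codes = lexeme.get("lemmas", {})
        # A set ensures a lexeme with both 'en' and 'en-gb' lemmas is
        # counted once under 'en' when variants are combined.
        codes = {base_code(c) if combine_variants else c for c in lemma_codes}
        for code in codes:
            counts[(code, category)] += 1
    return counts
```

Running it once with combine_variants=True and once with False over the same dump would show how much the totals differ between the two choices.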

@axif0
Collaborator

axif0 commented Dec 18, 2024

Another thing: if the user runs scribe-data total --language English without specifying a source, should we show a message like

Do you want to query total from
- Wikidata query service 
- Lexeme dump
(W/L/S)

If the user runs scribe-data total --language English -qs, then it automatically queries the Wikidata query service.
If the user runs scribe-data total --language English -wdp, then it automatically queries the lexeme dump.

Or should we remove the Wikidata query service feature entirely?
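
The flow described in this comment could look roughly like the sketch below. The function name resolve_source and its parameters are hypothetical, the two booleans stand in for the proposed -qs and -wdp flags, and the prompt only offers W and L because the meaning of the third option in the message above is not specified.

```python
from typing import Callable


def resolve_source(
    use_query_service: bool,
    use_dump: bool,
    ask: Callable[[str], str] = input,
) -> str:
    """Pick the data source from the flags, prompting only if none is given."""
    if use_query_service and use_dump:
        raise ValueError("Pass only one of -qs or -wdp.")
    if use_query_service:
        return "query_service"
    if use_dump:
        return "dump"

    prompt = (
        "Do you want to query totals from the Wikidata query service "
        "or a lexeme dump? (W/L): "
    )
    # Keep asking until the answer is one of the two known options.
    while True:
        answer = ask(prompt).strip().lower()
        if answer == "w":
            return "query_service"
        if answer == "l":
            return "dump"
```

Passing the prompt function as a parameter keeps the interactive path easy to test without a terminal.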

@axif0
Collaborator

axif0 commented Dec 18, 2024

Another suggestion: if a parsed lexeme dump in JSON format is already available in the directory, then scribe-data total --language English -wdp will look for it and count from there.
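
A minimal sketch of that lookup, assuming the parsed output is a single JSON file. The function name load_cached_totals and the default file name lexeme_totals.json are placeholders, not existing Scribe-Data names.

```python
import json
from pathlib import Path
from typing import Optional


def load_cached_totals(
    directory: str, filename: str = "lexeme_totals.json"
) -> Optional[dict]:
    """Return previously parsed totals if the file exists, else None."""
    path = Path(directory) / filename
    if not path.is_file():
        # No parsed dump yet: the caller would fall back to parsing the dump.
        return None
    with path.open(encoding="utf-8") as parsed_file:
        return json.load(parsed_file)
```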

What do you think? @andrewtavis @wkyoshida @henrikth93

@andrewtavis
Member Author

I'd say we can provide the options to the user if they don't pass something and then use the -qs and -wdp arguments as you suggested, @axif0 :) As far as English, I'm fine with combining for now, but what percentage of the English lexemes are en vs. en-gb and the others?
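
One way to answer the percentage question, given per-code lexeme counts gathered from the dump, is sketched below. The function name code_shares is hypothetical, and the input is assumed to be a Counter keyed by lemma language code.

```python
from collections import Counter
from typing import Dict


def code_shares(code_counts: Counter, base: str = "en") -> Dict[str, float]:
    """Return each code's percentage share among the base code and its variants."""
    related = {
        code: count
        for code, count in code_counts.items()
        if code == base or code.startswith(base + "-")
    }
    total = sum(related.values())
    if total == 0:
        return {}
    return {code: round(100 * count / total, 2) for code, count in related.items()}
```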

Projects
Status: Todo
Development

No branches or pull requests

2 participants