Allow for `total (t)` CLI functionality to be ran using a Wikidata lexemes dump #520

andrewtavis · 2024-12-08T23:04:39Z

Terms

I have searched open and closed feature requests
I agree to follow Scribe-Data's Code of Conduct

Description

This issue will be the first issue to add dump processing functionality to the Scribe-Data CLI. In it, we'll do the following:

- We'll add the --wikidata-dump (-wd) argument to the total command.
- If the user passes this argument, the functionality from #518 (Add check_lexeme_dump_prompt_download function to cli/utils.py) will be used to make sure that a dump is available, or to download one.
- From there, the functionality of the total command will be run over the dump rather than via the Wikidata query service.
  - This functionality will be added to a file src/scribe_data/wikidata/parse_dump.py and called from the CLI.
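
The argument wiring described in the list above could look roughly like the following. This is a minimal sketch that assumes an `argparse`-based CLI; the `build_parser` helper, the parser layout, and the choice to make the flag a simple boolean switch are illustrative assumptions, not the actual Scribe-Data implementation.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a toy parser with a `total` subcommand (hypothetical layout)."""
    parser = argparse.ArgumentParser(prog="scribe-data")
    subparsers = parser.add_subparsers(dest="command")

    # `total` with the short alias `t`, as named in the issue title.
    total_parser = subparsers.add_parser("total", aliases=["t"])
    total_parser.add_argument("--language", type=str)

    # Assumption: the flag is a boolean switch; it could instead take a path.
    total_parser.add_argument(
        "--wikidata-dump",
        "-wd",
        action="store_true",
        help="Compute totals from a local Wikidata lexemes dump.",
    )
    return parser


# Example invocation: scribe-data total --language English -wd
args = build_parser().parse_args(["total", "--language", "English", "-wd"])
```

When `args.wikidata_dump` is true, the CLI would first run the dump availability check from #518 and then hand over to the dump parser.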

Before starting, we should map out the best way to process the dump, with a specific question being whether we need to uncompress the dump or whether we can work directly from the compressed .json.bz2 file.

Contribution

@axif0 will be working on this as a part of Outreachy! 📶🚤


axif0 · 2024-12-11T12:40:42Z

> Before starting, we should map out the best way to process the dump, with a specific question being whether we need to uncompress the dump or whether we can work directly from the compressed .json.bz2 file.

We can work directly from the compressed .json.bz2 file: it can be opened in text mode ("rt") using Python's bz2 module, so there is no need to uncompress the dump first. I think this is an efficient approach for us.

Ref:
https://stackoverflow.com/questions/48078567/how-to-parse-wikidata-json-bz2-file-using-python/
https://docs.python.org/2/library/bz2.html
https://www.entropywins.wtf/blog/2015/11/08/wikidata-wikibase-json-dump-reader/
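
To illustrate the text-mode approach, here is a minimal sketch of streaming entities from the compressed file. It relies on the Wikidata JSON dump layout of one entity per line inside a single JSON array, with each entity line ending in a comma. The function name `iter_dump_entities` is hypothetical and not part of Scribe-Data.

```python
import bz2
import json
from typing import Iterator


def iter_dump_entities(dump_path: str) -> Iterator[dict]:
    """Yield one entity dict per line of a compressed .json.bz2 dump."""
    # Text mode ("rt") decompresses on the fly, so the dump never needs
    # to be uncompressed on disk.
    with bz2.open(dump_path, mode="rt", encoding="utf-8") as dump_file:
        for line in dump_file:
            # Each entity line ends with a comma; the first and last
            # lines are the array brackets "[" and "]".
            line = line.strip().rstrip(",")
            if line in ("", "[", "]"):
                continue
            yield json.loads(line)
```

Because this is a generator, only one line is decoded at a time, so memory use stays small regardless of the size of the dump.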

andrewtavis · 2024-12-11T15:09:36Z

Perfect, @axif0! Let's go this route for sure then :) I remember using this before I believe. Feel free to get started here!

axif0 · 2024-12-18T11:31:14Z

For the en code, there are many lexicalCategory values in the lexeme dump (please see blocks 6 and 7 of the Colab notebook).

Also, there are many language codes for English. Should we combine them or show them individually? What would be the best approach?

axif0 · 2024-12-18T11:34:56Z

Another thing: if the user runs scribe-data total --language English, should we show a message like the following?

Do you want to query total from
- Wikidata query service
- Lexeme dump
(W/L/S)

If the user runs `scribe-data total --language English -qs`, then it automatically queries the Wikidata query service.
If the user runs `scribe-data total --language English -wdp`, then it automatically uses the lexeme dump.

Or should we remove the Wikidata query service feature entirely?
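
The selection logic proposed in this comment could be sketched as follows. The function name `choose_total_source`, its parameters, and the returned strings are all hypothetical, and the prompt only handles the W and L answers. The `ask` parameter defaults to `input` and exists only so the prompt can be replaced in tests.

```python
from typing import Callable


def choose_total_source(
    use_query_service: bool = False,
    use_dump: bool = False,
    ask: Callable[[str], str] = input,
) -> str:
    """Return "query_service" or "dump" based on flags or a prompt."""
    if use_query_service and use_dump:
        raise ValueError("Pass only one of -qs and -wdp.")
    if use_query_service:  # -qs was passed
        return "query_service"
    if use_dump:  # -wdp was passed
        return "dump"

    # Neither flag was passed, so ask the user until the answer is valid.
    prompt = (
        "Do you want to query total from\n"
        "- Wikidata query service\n"
        "- Lexeme dump\n"
        "(W/L) "
    )
    while True:
        answer = ask(prompt).strip().lower()
        if answer == "w":
            return "query_service"
        if answer == "l":
            return "dump"
```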

axif0 · 2024-12-18T11:35:03Z

Another suggestion: if a parsed lexeme dump in JSON format is already available in the directory, then scribe-data total --language English -wdp will look for it and count from there.

What do you think? @andrewtavis @wkyoshida @henrikth93

andrewtavis · 2024-12-18T14:11:53Z

I'd say we can provide the options to the user if they don't pass something and then use the -qs and -wdp arguments as you suggested, @axif0 :) As far as English, I'm fine with combining for now, but what percentage of the lexemes are en versus en-gb and the others?
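
One way to answer the percentage question is to count the lemma language codes that appear on each lexeme. This is a rough sketch that assumes each lexeme dict has a `lemmas` mapping keyed by language code, as in the Wikidata lexeme JSON format; the function name `lemma_code_shares` is hypothetical. A lexeme with lemmas under several codes is counted once per code.

```python
from collections import Counter
from typing import Dict, Iterable


def lemma_code_shares(
    lexemes: Iterable[dict], base_code: str = "en"
) -> Dict[str, float]:
    """Return the percentage share of each code matching base_code."""
    counts: Counter = Counter()
    for lexeme in lexemes:
        for code in lexeme.get("lemmas", {}):
            # Match "en" itself and variants such as "en-gb".
            if code == base_code or code.startswith(base_code + "-"):
                counts[code] += 1

    total = sum(counts.values())
    if total == 0:
        return {}
    return {code: 100 * n / total for code, n in counts.items()}
```

The input can be any iterable of lexeme dicts, so a generator that streams the compressed dump line by line would work without loading everything into memory.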

andrewtavis added the feature New feature or request label Dec 8, 2024

andrewtavis assigned axif0 Dec 8, 2024

andrewtavis added this to Scribe Board Dec 8, 2024

github-project-automation bot moved this to Todo in Scribe Board Dec 8, 2024

andrewtavis added help wanted Extra attention is needed -priority- High priority Outreachy Available for Outreachy participants and removed -priority- High priority labels Dec 8, 2024

andrewtavis changed the title ~~Allow for --total (-t) CLI functionality to be ran using a Wikidata lexemes dump~~ Allow for total (t) CLI functionality to be ran using a Wikidata lexemes dump Dec 8, 2024

andrewtavis added a commit that referenced this issue Dec 16, 2024

#520 Add wdp arg to CLI total functionality

ba75af9
