Added download cli cmd #528
Conversation
Thank you for the pull request! The Scribe team will do our best to address your contribution as soon as we can. The following is a checklist for maintainers to make sure this process goes as well as possible. Feel free to address the points below yourself in further commits if you realize that actions are needed :) If you're not already a member of our public Matrix community, please consider joining! We'd suggest using Element as your Matrix client, and definitely join the General and Data rooms once you're in. Also consider joining our bi-weekly Saturday dev syncs. It'd be great to have you!

Maintainer checklist
Thanks so much for this, @axif0! I'll get to reviewing it tonight or tomorrow after work :) I added something to the doc from our meeting on Sunday about also doing testing. Let's continue to experiment and make issues for fixes with tests. We can maybe take some time at the end of the month to do a session where we test all the functionalities 😊
Awesome work @axif0!!
Really cool to see 🚀
tests/cli/test_get.py
output_dir="./test_output",
overwrite=False,
)
# Using sparql based data extract tests
Are these tests no longer valid?
My guess is that they likely are, right? - since the implementation changed from SPARQL to dumps.
Could still be a good idea to write other tests to account for the new implementation. How about that?
@axif0: Can you make an issue with a title like "Outreachy Scribe-Data tests round 1" that is to write tests for all the functionality that's been added so far? The description for the issue can be as simple as:
In my work for Outreachy I've recently added the following functions:
- fxn_name
- another_fxn_name
This issue is to write tests for each of the above functions.
You can then send a PR along that adds the needed tests and then we can check it and close the issue?
You can then write an issue like that for each issue so that you can split the work for functionality and tests up a bit? You can open a testing issue whenever you figure out some functionality that you'll be adding to the package, and we can also talk and make these issues later in Outreachy as we go through the current and new functionality to look for bugs 🐛
You'd also be welcome to open a PR and then send another commit to it to close the related testing issue :) Whatever works for you!
@andrewtavis Sure, I can make an issue to track all the merged functions I made and then add tests by sending along a PR. I think it'll be more convenient.
src/scribe_data/cli/get.py
"[bold red]Parsing lexeme dump feature will be available soon...[/bold red]" | ||
) | ||
|
||
# Using sparql based data extract |
Are we completely moving away from SPARQL?
Could it be interesting to still keep the SPARQL option around and to then allow selecting it via a flag? Are there any pros for SPARQL that are being discarded in favor of going with dumps? Could very well be that there aren't any at all, but just checking 😄
Perhaps this is a question more for @andrewtavis 😊
Very long story 😅
We're not moving away from SPARQL at all. I had a recent meeting with some coworkers where I talked about the planned functionality of Scribe-Data, and the general sentiment was (very supportive and nice) "Cool, yes, but why are you all planning on sending that many queries to the Wikidata query service even on a bi-weekly basis?" Even now we're talking ten minutes of queries, and at scale it's going to be tons of very long queries.
The plan from here (feedback welcome!):
- Scribe-Data develops a Wikidata dump process, not a Wiktionary dump process, as "there's a chance that lots of translations will be added soon". There are also benefits of this regardless - see below.
- Scribe-Server still runs Scribe-Data, but uses the most recent dump such that the query load is 0 and the only load on the WMF side is transferring the dump (the query service is strained, the download process is not).
- Scribe-Data thus has dump and SPARQL processes that mirror one another
  - Results are the same for each
- A dump is required for `scribe-data get -a`
  - The user can still do the equivalent of `scribe-data get -a` with queries if they want via looping through Scribe-Data commands, but for responsible use purposes it requires a dump in the base command
- Using a dump is suggested for `scribe-data get -lang LANGUAGE -a` and `scribe-data get -dt DATA_TYPE -a`, but if the user wants to run the queries then they can
- For commands like `scribe-data get -lang LANGUAGE -dt DATA_TYPE` there is no suggestion to use the dump (`total` functionality as well)
- The beauty of the above is that it allows us to directly solve Add Feature to Extract and Verify All Grammatical Features for a Data Type in a Given Language #513
  - We create a workflow to download the most recent dump and extract `__all__` forms from it for `__all__` Scribe-Data languages
  - We check the forms that we're getting from the dump against the current queries
  - If there are forms that we don't have, we trigger a pull request to Scribe-Data that leverages the src/scribe_data/resources/lexeme_form_metadata.json file – needs to be updated as more property IDs are needed – that says "Hey, couldn't help but notice your LANGUAGE query for DATA_TYPE doesn't have MISSING_FORMS. Here's the query as I've used `lexeme_form_metadata` to translate the Wikidata PIDs into appropriate labels ✨"
  - We thus have a process where the queries are automatically updated to get all data so that people can use them to get the data or the dumps if they so choose
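To make those last bullets a bit more concrete, here is a minimal sketch of the kind of comparison that workflow could run once forms have been extracted from a dump. The names and the data layout are assumptions made for illustration, not Scribe-Data's actual implementation:

```python
# Hypothetical sketch: dump_forms/query_forms and their structure are assumed
# for illustration and are not the actual Scribe-Data data structures.
import json


def missing_forms(dump_forms: dict, query_forms: dict) -> dict:
    """Return grammatical forms found in the dump but absent from the current queries."""
    gaps = {}
    for data_type, forms in dump_forms.items():
        missing = forms - query_forms.get(data_type, set())
        if missing:
            gaps[data_type] = missing

    return gaps


if __name__ == "__main__":
    # Toy stand-ins for forms derived from a Wikidata lexeme dump and from
    # the existing SPARQL queries for a single language.
    dump_forms = {"nouns": {"nominative", "genitive", "plural"}}
    query_forms = {"nouns": {"nominative", "plural"}}

    gaps = missing_forms(dump_forms, query_forms)
    print(json.dumps({k: sorted(v) for k, v in gaps.items()}, indent=2))
    # {"nouns": ["genitive"]} -> these are the forms that would be flagged in an automated PR
```

The real workflow would then use lexeme_form_metadata.json to translate the relevant Wikidata PIDs into labels before the pull request is opened, as described above.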
As stated, long story 😊 Let me know on the above!
src/scribe_data/cli/download.py
) -> None:
    """Download Wikidata dumps.

    Args:
Appreciate the docstring here, @axif0, but let's please use the format that's used in the rest of the package, as this one here is not going to be rendered in the docs. Updating the contribution guide now with some directions here :)
@@ -287,6 +287,33 @@ Scribe does not accept direct edits to the grammar JSON files as they are source

The documentation for Scribe-Data can be found at [scribe-data.readthedocs.io](https://scribe-data.readthedocs.io/en/latest/). Documentation is an invaluable way to contribute to coding projects as it allows others to more easily understand the project structure and contribute. Issues related to documentation are marked with the [`documentation`](https://github.com/scribe-org/Scribe-Data/labels/documentation) label.

### Function Docstrings

Scribe-Data generally follows [NumPy conventions](https://numpydoc.readthedocs.io/en/latest/format.html) for documenting functions and Python code in general. Function docstrings should have the following format:
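For readers of this thread, a rough illustration of what that format looks like follows. The function and its parameters are made up for the example; the authoritative template is the one in the contributing guide:

```python
from pathlib import Path


def download_wiki_dump(target_dir="./dumps", overwrite=False):
    """
    Download a Wikidata lexemes dump to a local directory.

    Parameters
    ----------
    target_dir : str
        The directory that the dump should be saved to.

    overwrite : bool
        Whether an existing dump file should be replaced.

    Returns
    -------
    pathlib.Path
        The path to the downloaded dump file.
    """
    # Placeholder body: the docstring format is the point of this example.
    return Path(target_dir)
```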
@axif0: Just added this to the contributing guide for directions on how to write docstrings that will be rendered properly in the docs. Can we ask you to familiarize yourself with them? :)
@@ -48,6 +48,7 @@ def test_invalid_arguments(self):
# MARK: All Data

@patch("scribe_data.cli.get.query_data")
@patch("builtins.input", lambda _: "N")  # don't use dump
@axif0: Note that this is how we can pass input arguments to tests, with `"N"` in this case being passed to say that we want to query.
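As a self-contained illustration of that pattern, here is a small test that patches `builtins.input`. The prompt function below is a stand-in written for this example, not the code in this PR:

```python
# Minimal illustration of patching builtins.input in a test; ask_use_dump is a
# hypothetical stand-in, not the actual Scribe-Data implementation.
import unittest
from unittest.mock import patch


def ask_use_dump() -> bool:
    """Ask whether a local Wikidata dump should be used instead of queries."""
    return input("Use a local Wikidata dump? [y/N] ").strip().lower() == "y"


class TestDumpPrompt(unittest.TestCase):
    @patch("builtins.input", lambda _: "N")  # simulate the user answering "N"
    def test_prompt_declined(self):
        self.assertFalse(ask_use_dump())


if __name__ == "__main__":
    unittest.main()
```

Because a replacement value is passed to `patch` directly, no extra mock argument is injected into the test method, which keeps the test signature unchanged.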
I moved some functionality into cli download as I was experiencing a circular import with some things I thought we'd need in here. At this point the query all functionality should work for languages and data types, but first the user will be asked if they would like to use a dump or not, and if they respond with "N" then the queries will be used.

@axif0: Can you give the above comments and changes a look so you're familiar with them? From there the following would be great:
From there we'll be ready to merge this in and close three issues already! 🚀🚀 Thanks for the great work so far, @axif0!
I'm getting
Really an amazing start to the Outreachy project, @axif0! 😊 Looking forward to seeing how the more difficult work progresses, as this is exactly what we need at this stage :) Thank you further for the testing issue! Let's make this a habit so the final testing work for the v5.0 release will be easier ✅
Description
New Feature: Download Wikidata Dumps
- `src/scribe_data/cli/download.py`: Added `download_wrapper` function to handle downloading of Wikidata dumps. Added `DEFAULT_DUMP_EXPORT_DIR` and `check_lexeme_dump_prompt_download` function to manage dump files.

Integration with Existing CLI

- `src/scribe_data/cli/get.py`: Modified `get_data` function to use the new `download_wrapper` for handling Wikidata dumps. Added a new argument `--wikidata-dump` to the `get` command to specify the path to a local Wikidata lexemes dump.

Code Cleanup and Get -all Test Remove

- `src/scribe_data/wikidata/query_data.py`: Commented out the main execution block to prevent unintended execution. (Probably this part not needed)

Contributor checklist

- `pytest` command as directed in the testing section of the contributing guide

Related issue

- `check_lexeme_dump_prompt_download` function to cli/utils.py #518
- `scribe-data g -a` #519