Added download cli cmd #528

Merged
merged 17 commits into from
Dec 16, 2024

Conversation

axif0
Collaborator

@axif0 axif0 commented Dec 11, 2024

Description

New Feature: Download Wikidata Dumps

  • src/scribe_data/cli/download.py: Added download_wrapper function to handle downloading of Wikidata dumps.
    Added DEFAULT_DUMP_EXPORT_DIR and check_lexeme_dump_prompt_download function to manage dump files.
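
As a rough illustration of what downloading a dump involves, here is a minimal sketch. It is not the `download_wrapper` from this PR: the function name `download_dump`, the `opener` parameter, and the dump URL are assumptions made for the example.

```python
import shutil
from pathlib import Path
from urllib.request import urlopen

# Assumption: location of the latest lexeme dump on the Wikimedia dump server.
LATEST_LEXEME_DUMP_URL = (
    "https://dumps.wikimedia.org/wikidatawiki/entities/latest-lexemes.json.bz2"
)


def download_dump(url, output_dir, opener=urlopen):
    """Stream the file at ``url`` into ``output_dir`` and return its path.

    ``opener`` is a hypothetical hook so the transfer can be replaced in tests.
    """
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    target = output_dir / url.rsplit("/", 1)[-1]

    # Skip the transfer if the file is already present (dumps are large).
    if target.exists():
        return target

    # Copy in chunks so that the whole dump is never held in memory.
    with opener(url) as response, open(target, "wb") as out_file:
        shutil.copyfileobj(response, out_file)

    return target
```

Checking for an existing file first is a simplification: an interrupted download would leave a partial file behind, so a real implementation would likely also need an overwrite option or a size check.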

Integration with Existing CLI

  • src/scribe_data/cli/get.py: Modified get_data function to use the new download_wrapper for handling Wikidata dumps. Added a new argument --wikidata-dump to the get command to specify the path to a local Wikidata lexemes dump.
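
For readers unfamiliar with how such a flag is wired up, a minimal `argparse` sketch follows. Only `--wikidata-dump`, `-lang` and `-dt` come from this conversation; the long option names, help texts and the example path are assumptions, and the real parser in `src/scribe_data/cli/main.py` may be structured differently.

```python
import argparse

parser = argparse.ArgumentParser(prog="scribe-data")
subparsers = parser.add_subparsers(dest="command")

get_parser = subparsers.add_parser("get", help="Get data for a language and data type.")
# The long option names below are assumptions for this sketch.
get_parser.add_argument("-lang", "--language", help="Language to get data for.")
get_parser.add_argument("-dt", "--data-type", help="Data type to get data for.")
get_parser.add_argument(
    "--wikidata-dump",
    default=None,
    help="Path to a local Wikidata lexemes dump.",
)

# Hypothetical invocation with an example path.
args = parser.parse_args(
    ["get", "-lang", "German", "--wikidata-dump", "./dumps/latest-lexemes.json.bz2"]
)
print(args.wikidata_dump)
```

Note that `argparse` converts the dash in `--wikidata-dump` to an underscore, so the value is read as `args.wikidata_dump`.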

Code Cleanup and Removal of the Get -all Test

Contributor checklist


Related issue

github-actions bot commented Dec 11, 2024

Thank you for the pull request!

The Scribe team will do our best to address your contribution as soon as we can. The following is a checklist for maintainers to make sure this process goes as well as possible. Feel free to address the points below yourself in further commits if you realize that actions are needed :)

If you're not already a member of our public Matrix community, please consider joining! We'd suggest using Element as your Matrix client, and definitely join the General and Data rooms once you're in. Also consider joining our bi-weekly Saturday dev syncs. It'd be great to have you!

Maintainer checklist

  • The linting and formatting workflow within the PR checks do not indicate new errors in the files changed

  • The CHANGELOG has been updated with a description of the changes for the upcoming release and the corresponding issue (if necessary)

@axif0 axif0 requested a review from andrewtavis December 11, 2024 07:40
@andrewtavis
Member

Thanks so much for this, @axif0! I'll get to reviewing it tonight or tomorrow after work :)

I added something to the doc from our meeting on Sunday about also doing testing. Let's continue to experiment and make issues for fixes with tests. We can maybe take some time at the end of the month to do a session where we test all the functionalities 😊

@axif0 axif0 changed the title from "Issue 517" to "Added download cli cmd" Dec 11, 2024
Member

@wkyoshida wkyoshida left a comment

Awesome work @axif0!!

Really cool to see 🚀

src/scribe_data/wiktionary/wikitionary_utils.py Outdated
output_dir="./test_output",
overwrite=False,
)
# Using sparql based data extract tests
Member

Are these tests no longer valid?
My guess is that they're likely no longer valid, right? - since the implementation changed from SPARQL to dumps.

Could still be a good idea to write other tests to account for the new implementation. How about that?

Member

@andrewtavis andrewtavis Dec 16, 2024

@axif0: Can you make an issue with a title like Outreachy Scribe-Data tests round 1 that is to write tests for all the functionality that's been added so far? The description for the issue can be as simple as:

In my work for Outreachy I've recently added the following functions:

- fxn_name
- another_fxn_name

This issue is to write tests for each of the above functions.

You can then send a PR along that adds the needed tests and then we can check it and close the issue?

Member

You can then write an issue like that for each issue so that you can split the work for functionality and tests up a bit? You can open a testing issue whenever you figure out some functionality that you'll be adding to the package, and we can also talk and make these issues later in Outreachy as we go through the current and new functionality to look for bugs 🐛

You'd also be welcome to open a PR and then send another commit to it to close the related testing issue :) Whatever works for you!

Collaborator Author

@andrewtavis Sure, I can make an issue to track all the merged functions I made and then add tests for them by sending along a PR. I think it'll be more convenient.

"[bold red]Parsing lexeme dump feature will be available soon...[/bold red]"
)

# Using sparql based data extract
Member

Are we completely moving away from SPARQL?

Could it be interesting to still keep the SPARQL option around and to then allow selecting it via a flag? Are there any pros for SPARQL that are being discarded in favor of going with dumps? Could very well be that there aren't any at all, but just checking 😄

Perhaps this is a question more for @andrewtavis 😊

Member

@andrewtavis andrewtavis Dec 15, 2024

Very long story 😅

We're not moving away from SPARQL at all. I had a recent meeting with some coworkers where I talked about the planned functionality of Scribe-Data, and the general sentiment was (very supportive and nice) "Cool, yes, but why are you all planning on sending that many queries to the Wikidata query service even on a bi-weekly basis?" Even now we're talking ten minutes of queries, and at scale it's going to be tons of very long queries.

The plan from here (feedback welcome!):

  1. Scribe-Data develops a Wikidata dump process, not a Wiktionary dump process, as "there's a chance that lots of translations will be added soon". There are also benefits of this regardless - see below.
  2. Scribe-Server still runs Scribe-Data, but uses the most recent dump such that the query load is 0 and the only load on the WMF side is transferring the dump (the query service is strained, the download process is not).
  3. Scribe-Data thus has dump and SPARQL processes that mirror one another
    • Results are the same for each
    • A dump is required for scribe-data get -a
      • The user can still do the equivalent of scribe-data get -a with queries if they want by looping through Scribe-Data commands, but for responsible use the base command requires a dump
    • Using a dump is suggested for scribe-data get -lang LANGUAGE -a and scribe-data get -dt DATA_TYPE -a, but if the user wants to run the queries then they can
    • For commands like scribe-data get -lang LANGUAGE -dt DATA_TYPE there is no suggestion to use the dump (total functionality as well)
  4. The beauty of the above is that it allows us to directly solve Add Feature to Extract and Verify All Grammatical Features for a Data Type in a Given Language #513
    • We create a workflow to download the most recent dump and extract all forms from it for all Scribe-Data languages
    • We check the forms that we're getting from the dump against the current queries
    • If there are forms that we don't have, we trigger a pull request to Scribe-Data that leverages the src/scribe_data/resources/lexeme_form_metadata.json file – needs to be updated as more property IDs are needed – that says "Hey couldn't help but notice your LANGUAGE query for DATA_TYPE doesn't have MISSING_FORMS. Here's the query as I've used lexeme_form_metadata to translate the Wikidata PIDs into appropriate labels ✨"
    • We thus have a process where the queries are automatically updated to get all data so that people can use them to get the data or the dumps if they so choose
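
The comparison step in point 4 can be pictured as a set difference per language and data type. The sketch below is only an illustration: `find_missing_forms`, the dictionary layout and the form labels are hypothetical, and it does not read a dump or use `lexeme_form_metadata.json`.

```python
def find_missing_forms(dump_forms, query_forms):
    """Return, per (language, data_type), the dump forms the query does not cover."""
    missing = {}
    for key, forms_in_dump in dump_forms.items():
        not_queried = set(forms_in_dump) - set(query_forms.get(key, ()))
        if not_queried:
            missing[key] = sorted(not_queried)
    return missing


# Hypothetical form labels for illustration only.
dump_forms = {
    ("german", "nouns"): {"nominativeSingular", "nominativePlural", "genitiveSingular"},
    ("french", "verbs"): {"infinitive"},
}
query_forms = {
    ("german", "nouns"): {"nominativeSingular", "nominativePlural"},
    ("french", "verbs"): {"infinitive"},
}

print(find_missing_forms(dump_forms, query_forms))
# {('german', 'nouns'): ['genitiveSingular']}
```

A non-empty result is what would trigger the automated pull request described above.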

As stated, long story 😊 Let me know on the above!
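
The rules in point 3 of the plan could be sketched as a small decision function. The name `dump_recommendation` and its return values are hypothetical. The plan does not say what happens when a language and a data type are both given together with -a, so treating that case as not needing a dump is an assumption.

```python
def dump_recommendation(language=None, data_type=None, all_data=False):
    """Classify a get request as 'required', 'suggested' or 'not_needed'."""
    # A single language and data type pair is a small query: no dump suggested.
    if not all_data or (language and data_type):
        return "not_needed"

    # All data for one language or one data type: a dump is suggested,
    # but the user may still choose to run the queries.
    if language or data_type:
        return "suggested"

    # Plain `get -a`: a dump is required for responsible use.
    return "required"


print(dump_recommendation(all_data=True))                       # required
print(dump_recommendation(language="German", all_data=True))    # suggested
print(dump_recommendation(language="German", data_type="nouns"))  # not_needed
```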

src/scribe_data/cli/download.py Outdated
src/scribe_data/cli/download.py Outdated
src/scribe_data/cli/main.py Outdated
src/scribe_data/wiktionary/wikitionary_utils.py Outdated
src/scribe_data/wiktionary/wikitionary_utils.py Outdated
src/scribe_data/wiktionary/wikitionary_utils.py Outdated
src/scribe_data/utils.py Outdated
) -> None:
"""Download Wikidata dumps.

Args:
Member

Appreciate the docstring here, @axif0, but let's please use the format that's used in the rest of the package, as this one is not going to be rendered in the docs. Updating the contribution guide now with some directions :)

@@ -287,6 +287,33 @@ Scribe does not accept direct edits to the grammar JSON files as they are source

The documentation for Scribe-Data can be found at [scribe-data.readthedocs.io](https://scribe-data.readthedocs.io/en/latest/). Documentation is an invaluable way to contribute to coding projects as it allows others to more easily understand the project structure and contribute. Issues related to documentation are marked with the [`documentation`](https://github.com/scribe-org/Scribe-Data/labels/documentation) label.

### Function Docstrings

Scribe-Data generally follows [NumPy conventions](https://numpydoc.readthedocs.io/en/latest/format.html) for documenting functions and Python code in general. Function docstrings should have the following format:
Member

@axif0: Just added this to the contributing guide for directions on how to write docstrings that will be rendered properly in the docs. Can we ask you to familiarize yourself with them? :)
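
For reference, a NumPy-style docstring generally looks like the sketch below. The function `dump_file_path` and its parameters are hypothetical, and the exact layout in the Scribe-Data contributing guide may differ in details, so the guide remains the authority.

```python
def dump_file_path(dump_date=None, output_dir="./dumps"):
    """
    Build the path a lexeme dump would be saved to (hypothetical example).

    Parameters
    ----------
    dump_date : str, optional
        The date of the dump. The latest dump is assumed if None.

    output_dir : str
        The directory in which the dump should be saved.

    Returns
    -------
    str
        The path that the dump file would be saved to.
    """
    file_name = f"{dump_date or 'latest'}-lexemes.json.bz2"
    return f"{output_dir.rstrip('/')}/{file_name}"
```

The `Args:` header used in the reviewed docstring is Google style, which is why the suggestion is to switch to the `Parameters` and `Returns` sections with underlines.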

@@ -48,6 +48,7 @@ def test_invalid_arguments(self):
# MARK: All Data

@patch("scribe_data.cli.get.query_data")
@patch("builtins.input", lambda _: "N") # don't use dump
Member

@axif0: Note that this is how we can pass user input to tests. In this case "N" is passed to say that we want to query rather than use a dump.
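
A self-contained sketch of the same technique follows. The helper `wants_dump` and its prompt text are hypothetical stand-ins for the real prompt in the package; only the `@patch("builtins.input", ...)` pattern is taken from the diff above.

```python
import unittest
from unittest.mock import patch


def wants_dump():
    """Hypothetical helper: ask the user whether a dump should be used."""
    answer = input("Use a Wikidata dump instead of querying? (y/N): ")
    return answer.strip().lower() == "y"


class TestWantsDump(unittest.TestCase):
    # Passing a replacement as the second argument swaps out `input` for the
    # duration of the test and does not add an extra mock argument.
    @patch("builtins.input", lambda _: "N")  # answer "N": don't use a dump
    def test_declines_dump(self):
        self.assertFalse(wants_dump())

    @patch("builtins.input", lambda _: "y")  # answer "y": use a dump
    def test_accepts_dump(self):
        self.assertTrue(wants_dump())
```

Saved as a test module, this can be run with `python -m unittest`. The tests never block waiting for keyboard input because `input` is replaced while each test runs.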

@andrewtavis
Member

andrewtavis commented Dec 16, 2024

I moved some functionality into cli download as I was experiencing a circular import with some things I thought we'd need in here. At this point the query all functionality should work for languages and data types, but first the user will be asked if they would like to use a dump or not, and if they respond with y then it'll run.

@axif0: Can you give the above comments and changes a look so you're familiar with them? From there the following would be great:

  1. Make a testing issue for the new functions here as mentioned above
  2. Test run the current functionality and report back if all's working, or if not send along a commit to fix things

From there we'll be ready to merge this in and close three issues already! 🚀🚀

Thanks for the great work so far, @axif0!

@axif0
Collaborator Author

axif0 commented Dec 16, 2024

I'm getting an AttributeError for download and get. To debug, I temporarily added print(vars(args)) before calling the functions to check how the arguments are being parsed, and I'll fix it from there.
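
This kind of AttributeError commonly appears with subcommands, because the parsed namespace only carries the arguments of the subcommand that was actually used. The sketch below shows the debugging step with a toy parser; the option names are assumptions and this is not the actual Scribe-Data parser, nor necessarily the cause of the error here.

```python
import argparse

parser = argparse.ArgumentParser(prog="scribe-data")
subparsers = parser.add_subparsers(dest="command")

get_parser = subparsers.add_parser("get")
get_parser.add_argument("-lang", "--language")  # assumed long name

download_parser = subparsers.add_parser("download")
download_parser.add_argument("--output-dir")  # hypothetical option

args = parser.parse_args(["download"])

# Show exactly which attributes exist on the parsed namespace.
print(vars(args))

# `args.language` would raise AttributeError here, because `language` is only
# defined on the `get` subparser. A defensive read avoids the error:
language = getattr(args, "language", None)
print(language)
```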

Member

@andrewtavis andrewtavis left a comment

Really an amazing start to the Outreachy project, @axif0! 😊 Looking forward to seeing how the more difficult work progresses, as this is exactly what we need at this stage :) Thank you further for the testing issue! Let's make this a habit so the final testing work for the v5.0 release will be easier ✅

@andrewtavis andrewtavis merged commit 8c4b557 into scribe-org:main Dec 16, 2024
5 checks passed
3 participants