Added download cli cmd #528

axif0 · 2024-12-11T07:39:10Z

Description

New Feature: Download Wikidata Dumps

src/scribe_data/cli/download.py: Added download_wrapper function to handle downloading of Wikidata dumps.
Added DEFAULT_DUMP_EXPORT_DIR and check_lexeme_dump_prompt_download function to manage dump files.
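The commit history for this PR notes that a dump date can be given as 2024/12/04 or 2024-12-04. The following is a minimal sketch of how such input might be normalized and turned into a dump URL. The names `normalize_dump_date`, `build_dump_url` and `DUMP_BASE_URL` are hypothetical, and the URL layout is an assumption based on the public dumps.wikimedia.org directory structure, not something taken from this PR's code.

```python
from datetime import datetime
from typing import Optional

# Assumed base URL for Wikidata entity dumps; verify before relying on it.
DUMP_BASE_URL = "https://dumps.wikimedia.org/wikidatawiki/entities"


def normalize_dump_date(raw_date: str) -> str:
    """Accept 2024/12/04, 2024-12-04 or 20241204 and return 20241204."""
    for fmt in ("%Y/%m/%d", "%Y-%m-%d", "%Y%m%d"):
        try:
            return datetime.strptime(raw_date.strip(), fmt).strftime("%Y%m%d")
        except ValueError:
            continue  # try the next accepted format
    raise ValueError(f"Unrecognized dump date: {raw_date!r}")


def build_dump_url(raw_date: Optional[str] = None) -> str:
    """Return the URL of the latest lexeme dump, or of a dated one."""
    if raw_date is None:
        return f"{DUMP_BASE_URL}/latest-lexemes.json.bz2"
    date = normalize_dump_date(raw_date)
    return f"{DUMP_BASE_URL}/{date}/wikidata-{date}-lexemes.json.bz2"
```

Keeping the date handling in a pure function like this makes it easy to test without touching the network.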

Integration with Existing CLI

src/scribe_data/cli/get.py: Modified get_data function to use the new download_wrapper for handling Wikidata dumps. Added a new argument --wikidata-dump to the get command to specify the path to a local Wikidata lexemes dump.

Code Cleanup and Removal of get -all Tests

src/scribe_data/wikidata/query_data.py: Commented out the main execution block to prevent unintended execution. (This block is probably no longer needed.)

Contributor checklist

This pull request is on a separate branch and not the main branch
I have tested my code with the pytest command as directed in the testing section of the contributing guide

Related issue

… Issue_517

github-actions · 2024-12-11T07:39:47Z

Thank you for the pull request!

The Scribe team will do our best to address your contribution as soon as we can. The following is a checklist for maintainers to make sure this process goes as well as possible. Feel free to address the points below yourself in further commits if you realize that actions are needed :)

If you're not already a member of our public Matrix community, please consider joining! We'd suggest using Element as your Matrix client, and definitely join the General and Data rooms once you're in. Also consider joining our bi-weekly Saturday dev syncs. It'd be great to have you!

Maintainer checklist

The linting and formatting workflow within the PR checks do not indicate new errors in the files changed
The CHANGELOG has been updated with a description of the changes for the upcoming release and the corresponding issue (if necessary)

andrewtavis · 2024-12-11T08:34:01Z

Thanks so much for this, @axif0! I'll get to reviewing it tonight or tomorrow after work :)

I added something to the doc from our meeting on Sunday about also doing testing. Let's continue to experiment and make issues for fixes with tests. We can maybe take some time at the end of the month to do a session where we test all the functionalities 😊

wkyoshida

Awesome work @axif0!!

Really cool to see 🚀

src/scribe_data/wiktionary/wikitionary_utils.py

wkyoshida · 2024-12-15T19:49:45Z

tests/cli/test_get.py

-            output_dir="./test_output",
-            overwrite=False,
-        )
+    # Using sparql based data extract tests


Are these tests no longer valid?
My guess is that they're likely no longer valid, right? - since the implementation changed from SPARQL to dumps.

Could still be a good idea to write other tests to account for the new implementation. How about that?

@axif0: Can you make an issue with a title like Outreachy Scribe-Data tests round 1 that is to write tests for all the functionality that's been added so far? The description for the issue can be as simple as:

In my work for Outreachy I've recently added the following functions:

- fxn_name
- another_fxn_name

This issue is to write tests for each of the above functions.

You can then send a PR along that adds the needed tests and then we can check it and close the issue?

You can then write an issue like that for each issue so that you can split the work for functionality and tests up a bit? You can open a testing issue whenever you figure out some functionality that you'll be adding to the package, and we can also talk and make these issues later in Outreachy as we go through the current and new functionality to look for bugs 🐛

You'd also be welcome to open a PR and then send another commit to it to close the related testing issue :) Whatever works for you!

@andrewtavis Sure, I can make an issue to track all the merged functions I wrote and then add tests for them in a follow-up PR. I think that'll be more convenient.

wkyoshida · 2024-12-15T19:58:12Z

src/scribe_data/cli/get.py

+                    "[bold red]Parsing lexeme dump feature will be available soon...[/bold red]"
+                )
+
+        # Using sparql based data extract


Are we completely moving away from SPARQL?

Could it be interesting to still keep the SPARQL option around and to then allow selecting it via a flag? Are there any pros for SPARQL that are being discarded in favor of going with dumps? Could very well be that there aren't any at all, but just checking 😄

Perhaps this is a question more for @andrewtavis 😊

Very long story 😅

We're not moving away from SPARQL at all. I had a recent meeting with some coworkers where I talked about the planned functionality of Scribe-Data, and the general sentiment was (very supportive and nice) "Cool, yes, but why are you all planning on sending that many queries to the Wikidata query service even on a bi-weekly basis?" Even now we're talking ten minutes of queries, and at scale it's going to be tons of very long queries.

The plan from here (feedback welcome!):

Scribe-Data develops a Wikidata dump process, not a Wiktionary dump process, as "there's a chance that lots of translations will be added soon". There are also benefits of this regardless - see below.

Scribe-Server still runs Scribe-Data, but uses the most recent dump such that the query load is 0 and the only load on the WMF side is transferring the dump (the query service is strained, the download process is not).

Scribe-Data thus has dump and SPARQL processes that mirror one another

Results are the same for each

A dump is required for scribe-data get -a

The user can still do the equivalent of scribe-data get -a with queries if they want via looping through Scribe-Data commands, but for responsible use purposes it requires a dump in the base command

Using a dump is suggested for scribe-data get -lang LANGUAGE -a and scribe-data get -dt DATA_TYPE -a, but if the user wants to run the queries then they can

For commands like scribe-data get -lang LANGUAGE -dt DATA_TYPE there is no suggestion to use the dump (total functionality as well)

The beauty of the above is that it allows us to directly solve Add Feature to Extract and Verify All Grammatical Features for a Data Type in a Given Language #513

We create a workflow to download the most recent dump and extract __all__ forms from it for __all__ Scribe-Data languages

We check the forms that we're getting from the dump against the current queries

If there are forms that we don't have, we trigger a pull request to Scribe-Data that leverages the src/scribe_data/resources/lexeme_form_metadata.json file – needs to be updated as more property IDs are needed – that says "Hey couldn't help but notice your LANGUAGE query for DATA_TYPE doesn't have MISSING_FORMS. Here's the query as I've used lexeme_form_metadata to translate the Wikidata PIDs into appropriate labels ✨"

We thus have a process where the queries are automatically updated to get all data so that people can use them to get the data or the dumps if they so choose
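The form check described above boils down to a set difference followed by a label lookup. A minimal sketch, assuming forms are represented as sets of Wikidata feature IDs: the IDs below are placeholders rather than real Wikidata identifiers, `FORM_LABELS` is a hypothetical stand-in for the contents of lexeme_form_metadata.json (whose actual structure is not shown in this PR), and the function names are illustrative.

```python
from typing import Dict, FrozenSet, Set

# A form is identified by the set of grammatical feature IDs attached to it.
Form = FrozenSet[str]

# Hypothetical stand-in for lexeme_form_metadata.json: feature ID -> label.
# The IDs are placeholders, not real Wikidata identifiers.
FORM_LABELS: Dict[str, str] = {
    "QID_SINGULAR": "singular",
    "QID_PLURAL": "plural",
    "QID_NOMINATIVE": "nominative",
    "QID_GENITIVE": "genitive",
}


def find_missing_forms(dump_forms: Set[Form], query_forms: Set[Form]) -> Set[Form]:
    """Return forms present in the dump but not covered by the query."""
    return dump_forms - query_forms


def label_form(form: Form) -> str:
    """Translate feature IDs into labels, keeping unknown IDs as-is."""
    return " ".join(sorted(FORM_LABELS.get(fid, fid) for fid in form))


dump_forms = {
    frozenset({"QID_NOMINATIVE", "QID_SINGULAR"}),
    frozenset({"QID_GENITIVE", "QID_PLURAL"}),
}
query_forms = {frozenset({"QID_NOMINATIVE", "QID_SINGULAR"})}

for form in find_missing_forms(dump_forms, query_forms):
    print(f"Query is missing form: {label_form(form)}")
```

In a real workflow the missing forms would feed the automated pull request rather than a print statement.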

As stated, long story 😊 Let me know on the above!
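The command rules listed earlier in this comment (dump required for get -a, suggested for a single language or data type with -a, not suggested for a language and data type pair) can be sketched as a small decision function. `dump_policy` and its parameters are hypothetical names, and the handling of a language and data type pair combined with -a is an assumption since the comment does not cover that case.

```python
from typing import Optional


def dump_policy(language: Optional[str], data_type: Optional[str], get_all: bool) -> str:
    """Return 'required', 'suggested' or 'none' for a get command."""
    if language and data_type:
        # get -lang LANGUAGE -dt DATA_TYPE: no dump suggestion.
        # Treating this the same when -a is also passed is an assumption.
        return "none"
    if get_all:
        if language or data_type:
            # get -lang LANGUAGE -a or get -dt DATA_TYPE -a
            return "suggested"
        # get -a on its own
        return "required"
    return "none"


print(dump_policy(None, None, get_all=True))
print(dump_policy("German", None, get_all=True))
print(dump_policy("German", "nouns", get_all=False))
```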

src/scribe_data/cli/download.py

src/scribe_data/cli/main.py

src/scribe_data/wiktionary/wikitionary_utils.py

src/scribe_data/utils.py

Co-authored-by: Will Yoshida <[email protected]>

andrewtavis · 2024-12-15T23:50:37Z

src/scribe_data/cli/download.py

+) -> None:
+    """Download Wikidata dumps.
+
+    Args:


Appreciate the doc string here, @axif0, but let's please use the format that's used in the rest of the package as this one here is not going to be rendered in the docs. Updating the contribution guide now with some directions here :)

andrewtavis · 2024-12-15T23:55:34Z

CONTRIBUTING.md

@@ -287,6 +287,33 @@ Scribe does not accept direct edits to the grammar JSON files as they are source

 The documentation for Scribe-Data can be found at [scribe-data.readthedocs.io](https://scribe-data.readthedocs.io/en/latest/). Documentation is an invaluable way to contribute to coding projects as it allows others to more easily understand the project structure and contribute. Issues related to documentation are marked with the [`documentation`](https://github.com/scribe-org/Scribe-Data/labels/documentation) label.

+### Function Docstrings
+
+Scribe-Data generally follows [NumPy conventions](https://numpydoc.readthedocs.io/en/latest/format.html) for documenting functions and Python code in general. Function docstrings should have the following format:


@axif0: Just added this to the contributing guide for directions on how to write docstrings that will be rendered properly in the docs. Can we ask you to familiarize yourself with them? :)
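The diff excerpt above cuts off before the format itself. For reference, a NumPy-style docstring generally looks like the sketch below. The function `dump_file_path` and its parameters are hypothetical, and the exact layout required by the Scribe-Data contributing guide may differ in details such as spacing.

```python
from pathlib import Path


def dump_file_path(dump_id: str, output_dir: str = "dumps") -> Path:
    """
    Build the local path for a downloaded lexeme dump.

    Parameters
    ----------
    dump_id : str
        Identifier of the dump, for example ``"latest"`` or ``"20241204"``.

    output_dir : str, default="dumps"
        Directory in which the dump file is stored.

    Returns
    -------
    pathlib.Path
        Path of the form ``<output_dir>/<dump_id>-lexemes.json.bz2``.
    """
    return Path(output_dir) / f"{dump_id}-lexemes.json.bz2"
```

The underlined section headers (`Parameters`, `Returns`) are what documentation tooling keys on, which is why a Google-style `Args:` block would not render the same way.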

andrewtavis · 2024-12-16T02:20:15Z

tests/cli/test_get.py

@@ -48,6 +48,7 @@ def test_invalid_arguments(self):
    # MARK: All Data

    @patch("scribe_data.cli.get.query_data")
+    @patch("builtins.input", lambda _: "N")  # don't use dump


@axif0: Note that this is how we can pass input arguments to tests, with "N" in this case being passed to say that we want to query.
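A self-contained sketch of the same technique: `unittest.mock.patch` replaces `builtins.input` with a lambda so the prompt is answered automatically during the test. The function `choose_source` is a hypothetical stand-in for the CLI prompt, not the actual Scribe-Data code.

```python
import unittest
from unittest.mock import patch


def choose_source() -> str:
    """Hypothetical stand-in for a CLI prompt: return 'dump' or 'query'."""
    answer = input("Use a Wikidata lexeme dump? (y/N): ")
    return "dump" if answer.strip().lower() == "y" else "query"


class TestChooseSource(unittest.TestCase):
    @patch("builtins.input", lambda _: "N")  # answer "N": don't use a dump
    def test_declining_dump_falls_back_to_query(self):
        self.assertEqual(choose_source(), "query")

    @patch("builtins.input", lambda _: "y")  # answer "y": use a dump
    def test_accepting_dump(self):
        self.assertEqual(choose_source(), "dump")
```

Because the replacement object is passed explicitly to `patch`, no extra mock argument is injected into the test method, so it can be stacked with other `@patch` decorators without shifting their argument order.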

andrewtavis · 2024-12-16T02:25:13Z

I moved some functionality into cli download as I was experiencing a circular import with some things I thought we'd need in here. At this point the query all functionality should work for languages and data types, but first the user will be asked if they would like to use a dump or not, and if they respond with y then it'll run.

@axif0: Can you give the above comments and changes a look so you're familiar with them? From there the following would be great:

Make a testing issue for the new functions here as mentioned above
Test run the current functionality and report back if all's working, or if not send along a commit to fix things

From there we'll be ready to merge this in and close three issues already! 🚀🚀

Thanks for the great work so far, @axif0!

axif0 · 2024-12-16T08:33:07Z

I'm getting an AttributeError for download and get. To debug, I temporarily added print(vars(args)) before calling the functions to check how the arguments are being parsed. I'll fix it from there.
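One common way this kind of AttributeError arises with argparse is a mismatch between the attribute name the code reads and the name argparse derives from the flag. The sketch below illustrates that general mechanism and the `vars(args)` debugging step. It is not a reconstruction of the actual bug, and the parser and `dest` name here are illustrative.

```python
import argparse


def build_parser(dest=None) -> argparse.ArgumentParser:
    """Build a demo parser; optionally override the attribute name via dest."""
    parser = argparse.ArgumentParser(prog="demo")
    kwargs = {"dest": dest} if dest else {}
    parser.add_argument("--wikidata-dump", type=str, default=None, **kwargs)
    return parser


args = build_parser().parse_args(["--wikidata-dump", "./dumps"])

# Debugging step: show the attribute names argparse actually created.
# Dashes in the long flag become underscores, so the key is "wikidata_dump".
print(vars(args))

# Reading a name that argparse never created raises AttributeError.
try:
    args.wikidata_dump_path
except AttributeError as err:
    print(f"AttributeError: {err}")

# One possible fix: tell argparse which attribute name to use.
fixed_args = build_parser(dest="wikidata_dump_path").parse_args(
    ["--wikidata-dump", "./dumps"]
)
print(fixed_args.wikidata_dump_path)
```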

andrewtavis

Really an amazing start to the Outreachy project, @axif0! 😊 Looking forward to seeing how the more difficult work progresses, as this is exactly what we need at this stage :) Thank you further for the testing issue! Let's make this a habit so the final testing work for the v5.0 release will be easier ✅

axif0 added 7 commits December 10, 2024 14:31

Added download cli cmd

c0f699f

Merge branch 'Issue_517' of https://github.com/axif0/Scribe-Data into…

51c31cb

… Issue_517

user can download 2024/12/04 or 2024-12-04

4b9a5bb

rename check_existing_lexeme_dump function to check_lexeme_dump_promp…

1b0d6fa

…t_download

final

8ce7744

small issue fix

38d09dc

remove tests for get -all

6191ab9

axif0 requested a review from andrewtavis December 11, 2024 07:40

axif0 changed the title ~~Issue 517~~ Added download cli cmd Dec 11, 2024

wkyoshida reviewed Dec 15, 2024

wkyoshida and others added 4 commits December 15, 2024 17:49

Merge branch 'main' into Issue_517

e1553f8

Apply suggestions from code review

e5d68f7

Co-authored-by: Will Yoshida <[email protected]>

Update var names given PR suggestions + minor changes

29707db

Move files from Wiktionary utils to WD utils - delete Wiktionary dir

27378b5

andrewtavis reviewed Dec 15, 2024

Update all docstrings and add documentation for how to write them

6e70995

andrewtavis reviewed Dec 15, 2024

andrewtavis added 2 commits December 16, 2024 01:40

Comment and file formatting

c074ae2

Re-add get all for lang and dt only + fix tests with input patch

66242f4

andrewtavis reviewed Dec 16, 2024

Fix AttributeError: correct wikidata_dump_path argument mapping in ge…

962684a

…t command.

andrewtavis added 2 commits December 16, 2024 15:15

Update changelog given Wikidata dump functionality

e6d1a71

Fix message to user and dir name in docstrings

a2a9fc3

andrewtavis reviewed Dec 16, 2024

andrewtavis approved these changes Dec 16, 2024

andrewtavis merged commit 8c4b557 into scribe-org:main Dec 16, 2024
5 checks passed

This was referenced Dec 16, 2024

Add download Wikidata dump command to CLI #517

Closed

Add check_lexeme_dump_prompt_download function to cli/utils.py #518

Closed

Make a Wikidata lexemes dump required for scribe-data g -a #519

Closed

axif0 deleted the Issue_517 branch December 22, 2024 08:21
