Moving from Old Language Metadata Structure to Support Sub-languages and Simplified JSON #402

OmarAI2003 · 2024-10-17T11:28:40Z

Contributor checklist

This pull request is on a separate branch and not the main branch

Description

This PR introduces several key updates and improvements to the project’s handling of language metadata, test cases, and language extraction logic:

Refactored Language Metadata Structure:
- Simplified the language_metadata.json file by removing unnecessary nesting and keys, leaving only language, iso, and qid at the top level.
- Organized sub-languages under their respective main languages (e.g., "Norwegian" now has "Nynorsk" and "Bokmål").
New Language Support:
- Added new languages and dialects, including Mandarin, to the language_metadata.json file.
Improved Language Handling:
- Implemented a new format_sublanguage_name function to format sub-languages as mainlang/sublang.
- Refactored the generation of language_map and language_to_qid to accommodate the new structure, ensuring correct directory creation and querying for sub-languages.
- Removed the dependency on the obsolete languages key in the JSON file.
Utility Functions and Minor Fixes:
- Refined get_scribe_languages, list_all_languages, and _find functions to handle both main languages and sub-languages.
- Removed outdated utility functions (get_language_words_to_remove, get_language_words_to_ignore) no longer needed with the simplified JSON structure.
- Enhanced error handling for cases where a language only has sub-languages.
Updated Test Cases:
- Modified test cases to align with the new language metadata structure, including tests for sub-languages.
- Added unit tests for sub-language formatting and listing, ensuring accurate results for all languages.
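To make the new layout concrete, here is a minimal sketch of the flattened metadata and a lookup that covers both main languages and sub-languages. It is not the code from this PR: the `sub_languages` key name, the Norwegian ISO/QID values, and the helper name `find_value` are assumptions for illustration. Only the Yoruba entry is taken from the diff in this thread.

```python
# Hypothetical excerpt mirroring the simplified language_metadata.json.
# Assumption: sub-languages live under a "sub_languages" key; the Norwegian
# values are illustrative, only the Yoruba entry appears in this PR's diff.
LANGUAGE_METADATA = {
    "norwegian": {
        "sub_languages": {
            "nynorsk": {"iso": "nn", "qid": "Q25164"},
            "bokmål": {"iso": "nb", "qid": "Q25167"},
        }
    },
    "yoruba": {"iso": "yo", "qid": "Q34311"},
}


def find_value(language: str, key: str, metadata: dict = LANGUAGE_METADATA) -> str:
    """Return the value of `key` for a main language or a sub-language."""
    language = language.lower()

    # Case 1: the name is a top-level (main) language.
    entry = metadata.get(language)
    if entry is not None:
        if key in entry:
            return entry[key]
        if "sub_languages" in entry:
            # The language only exists through its sub-languages.
            subs = ", ".join(entry["sub_languages"])
            raise ValueError(
                f"'{language}' has no '{key}' of its own; use a sub-language: {subs}"
            )

    # Case 2: the name is a sub-language nested under some main language.
    for main_entry in metadata.values():
        sub = main_entry.get("sub_languages", {}).get(language)
        if sub is not None and key in sub:
            return sub[key]

    raise ValueError(f"'{language}' was not found in the language metadata")
```

With this shape, `find_value("nynorsk", "iso")` resolves through the nested entry, while `find_value("norwegian", "iso")` raises a `ValueError` that lists the available sub-languages.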

I’m open to feedback and welcome any additional suggestions or simple updates to improve the functionality and alignment with the project's overall goals.

Related issue

This PR addresses issue #293

- Enhanced the function to check for both regular languages and their sub-languages.
- Added error handling for cases where a language has only sub-languages, providing informative messages.
- Updated the function's docstring to reflect changes in behavior and usage.

…ON structure
- Updated the logic for building language_map and language_to_qid to handle languages with sub-languages.
- Both main languages and sub-languages are now processed in a single pass, ensuring that:
  - language_map includes all metadata for main and sub-languages.
  - language_to_qid correctly maps both main and sub-languages to their QIDs.
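The single-pass construction described in this commit message could look roughly like the sketch below. The function name `build_language_maps`, the `sub_languages` key, and the sample data are assumptions for illustration, not the PR's actual code. The sketch also assumes that a main language which only has sub-languages contributes no entry of its own.

```python
# Illustrative metadata; the "sub_languages" key name is an assumption.
SAMPLE_METADATA = {
    "norwegian": {
        "sub_languages": {
            "nynorsk": {"iso": "nn", "qid": "Q25164"},
            "bokmål": {"iso": "nb", "qid": "Q25167"},
        }
    },
    "yoruba": {"iso": "yo", "qid": "Q34311"},
}


def build_language_maps(metadata: dict) -> tuple[dict, dict]:
    """Build name -> metadata and name -> QID maps in one pass."""
    language_map: dict = {}
    language_to_qid: dict = {}

    for name, entry in metadata.items():
        if "sub_languages" in entry:
            # Register every sub-language under its own name.
            for sub_name, sub_entry in entry["sub_languages"].items():
                language_map[sub_name] = sub_entry
                language_to_qid[sub_name] = sub_entry["qid"]
        else:
            # A plain main language carries its own iso and qid.
            language_map[name] = entry
            language_to_qid[name] = entry["qid"]

    return language_map, language_to_qid


language_map, language_to_qid = build_language_maps(SAMPLE_METADATA)
```

Keeping both maps in the same loop avoids a second traversal of the metadata and guarantees that the two maps always share the same set of keys.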

…s 'mainlang/sublang'
- Implemented the function to check if a language is a sub-language and format its name as 'mainlang/sublang' for easier searching in language_data_extraction.
- Returns the original language name if it's not a sub-language.
- Added detailed docstring for clarity and usage examples.
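A rough sketch of the formatting step follows. It is named `format_sublanguage_name_sketch` to make clear that it is an illustration and not the function merged in this PR. It assumes a `sub_languages` key, capitalizes both path parts as described later in this thread, and raises `ValueError` for unknown names, which matches a later commit in this PR instead of the behavior of this first version.

```python
# Illustrative metadata; the "sub_languages" key name is an assumption.
SAMPLE_METADATA = {
    "norwegian": {
        "sub_languages": {
            "nynorsk": {"iso": "nn", "qid": "Q25164"},
            "bokmål": {"iso": "nb", "qid": "Q25167"},
        }
    },
    "yoruba": {"iso": "yo", "qid": "Q34311"},
}


def format_sublanguage_name_sketch(lang: str, language_metadata: dict) -> str:
    """Return 'Mainlang/Sublang' for a sub-language, else the capitalized name."""
    wanted = lang.lower()

    for main_lang, entry in language_metadata.items():
        # A main language is returned on its own, capitalized.
        if main_lang == wanted:
            return main_lang.capitalize()

        # A sub-language is returned together with its parent.
        for sub_lang in entry.get("sub_languages", {}):
            if sub_lang == wanted:
                return f"{main_lang.capitalize()}/{sub_lang.capitalize()}"

    raise ValueError(f"{lang!r} is not a known language or sub-language")


print(format_sublanguage_name_sketch("nynorsk", SAMPLE_METADATA))  # Norwegian/Nynorsk
```

The returned string doubles as a relative directory path, which is why the parent name comes first.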

- Wrapped the 'lang' variable with format_sublanguage_name to ensure sub-languages are formatted as 'mainlang/sublang' during data extraction.
- This ensures proper directory creation and querying for sub-languages, aligning with the new language metadata structure.

- Created list_all_languages function to extract both main languages and sub-languages.
- The function checks for sub-languages and compiles a complete list for easier access.
- Updated example usage to demonstrate the new functionality.
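As a sketch of what such a listing helper might do, the function below flattens the nested metadata into a single list of queryable names. The name `list_all_languages_sketch`, the `sub_languages` key, and the choice to sort the result are assumptions, not details confirmed by this PR.

```python
# Illustrative metadata; the "sub_languages" key name is an assumption.
SAMPLE_METADATA = {
    "yoruba": {"iso": "yo", "qid": "Q34311"},
    "norwegian": {
        "sub_languages": {
            "nynorsk": {"iso": "nn", "qid": "Q25164"},
            "bokmål": {"iso": "nb", "qid": "Q25167"},
        }
    },
}


def list_all_languages_sketch(language_metadata: dict) -> list[str]:
    """Return every queryable language name, expanding sub-languages."""
    languages: list[str] = []

    for name, entry in language_metadata.items():
        if "sub_languages" in entry:
            # The parent is represented only by its sub-languages.
            languages.extend(entry["sub_languages"])
        else:
            languages.append(name)

    # Sorting is an assumption made here for a stable, testable order.
    return sorted(languages)


all_languages = list_all_languages_sketch(SAMPLE_METADATA)
```

Note that "norwegian" itself does not appear in the result, in line with the commit note that such languages are represented by their sub-languages only.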

github-actions · 2024-10-17T11:29:07Z

Thank you for the pull request!

The Scribe team will do our best to address your contribution as soon as we can. The following is a checklist for maintainers to make sure this process goes as well as possible. Feel free to address the points below yourself in further commits if you realize that actions are needed :)

If you're not already a member of our public Matrix community, please consider joining! We'd suggest using Element as your Matrix client, and definitely join the General and Data rooms once you're in. Also consider joining our bi-weekly Saturday dev syncs. It'd be great to have you!

Maintainer checklist

The linting and formatting workflow within the PR checks does not indicate new errors in the files changed
The CHANGELOG has been updated with a description of the changes for the upcoming release and the corresponding issue (if necessary)

andrewtavis · 2024-10-17T12:29:14Z

CC @DeleMike, @catreedle, @Ekikereabasi-Nk, @KesharwaniArpita, @axif0

Would be great to get your all's reviews!

DeleMike · 2024-10-17T15:09:04Z

@andrewtavis this is an interesting update. I believe if this is incorporated, we will now have checks to verify languages in the directory and the ones in the json file, which is what #385 is also trying to work on.

#390 is also doing something related.

PR #396 also does something related but it is more about queries. However, could we have a single function to check this?

PR #396 is unaware of the current JSON structure coming from this PR which will break it.

andrewtavis · 2024-10-17T15:20:42Z

I'd like to have one check for queries and one check for structure, but aside from that I agree that we have a bit of a web going here and once we have the current PRs merged we'll have most of the functionality we're after :)

DeleMike · 2024-10-17T15:22:27Z

> I'd like to have one check for queries and one check for structure, but aside from that I agree that we have a bit of a web going here and once we have the current PRs merged we'll have most of the functionality we're after :)

alrighty!

catreedle · 2024-10-17T15:35:23Z

Will this help simplify #385, or does it already cover its functionality? If it does, could you please point out which part of the work addresses it? I'm still trying to understand how it connects, but I'm having difficulty. 😅

OmarAI2003 · 2024-10-17T15:55:25Z

> Will this help simplify #385, or does it already cover its functionality? If it does, could you please point out which part of the work addresses it? I'm still trying to understand how it connects, but I'm having difficulty. 😅

I think this should significantly simplify things, especially with the two newly added functions in utils.py. One function provides a list of all queryable languages from the JSON file, and the other formats language names appropriately. The formatted names are designed to be compatible with the language_data_extraction directory structure by returning main languages capitalized and sub-languages in the format MainLang/SubLang, with both parts capitalized.

catreedle · 2024-10-17T16:24:21Z

> Will this help simplify #385, or does it already cover its functionality? If it does, could you please point out which part of the work addresses it? I'm still trying to understand how it connects, but I'm having difficulty. 😅
>
> I think this should significantly simplify things, especially with the two newly added functions in utils.py. One function provides a list of all queryable languages from the JSON file, and the other formats language names appropriately. The formatted names are designed to be compatible with the language_data_extraction directory structure by returning main languages capitalized and sub-languages in the format MainLang/SubLang, with both parts capitalized.

Thank you! Will look into it. :)

KesharwaniArpita · 2024-10-17T16:22:42Z

src/scribe_data/resources/language_metadata.json

+  "yoruba": {
+    "iso": "yo",
+    "qid": "Q34311"
+  }
 }


Thanks for cleaning up language_metadata.json. It is now much easier to look through.

Thank you for your feedback! I'm glad to hear that the JSON file is easier to work with now. I also want to thank @andrewtavis for his guidance on this.

KesharwaniArpita

Thanks for your work @OmarAI2003 !!!!

KesharwaniArpita · 2024-10-17T16:34:27Z

tests/load/test_update_utils.py

+        "yoruba",
+    ]
+
+    assert utils.list_all_languages() == expected_languages


 def test_get_ios_data_path():


The updated tests look great! Thanks @OmarAI2003

KesharwaniArpita · 2024-10-17T16:41:52Z

src/scribe_data/cli/list.py

+            language_dir = (
+                LANGUAGE_DATA_EXTRACTION_DIR
+                / format_sublanguage_name(lang, language_metadata).capitalize()
+            )
            if language_dir.is_dir():
                data_types.update(f.name for f in language_dir.iterdir() if f.is_dir())



You've made sorting and formatting more efficient by calling list_all_languages() and sorting the languages using get_language_iso() and get_language_qid(). This reduces redundancy and enhances readability. Thank you for this!!!
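For readers who want to try the directory lookup from the hunk above in isolation, here is a self-contained sketch. The helper name `collect_data_types`, the temporary directory layout, and the language and data type folder names are all hypothetical. It only illustrates how a formatted name such as Norwegian/Nynorsk can be joined onto a base path with pathlib.

```python
import tempfile
from pathlib import Path


def collect_data_types(base_dir: Path, language_paths: list[str]) -> list[str]:
    """Collect the names of data type folders found under each language folder."""
    data_types: set[str] = set()

    for relative_path in language_paths:
        # "Norwegian/Nynorsk" becomes base_dir/Norwegian/Nynorsk.
        language_dir = base_dir / relative_path
        if language_dir.is_dir():
            data_types.update(f.name for f in language_dir.iterdir() if f.is_dir())

    return sorted(data_types)


# Build a throwaway directory tree that imitates the extraction layout.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    (base / "Norwegian" / "Nynorsk" / "nouns").mkdir(parents=True)
    (base / "Norwegian" / "Nynorsk" / "verbs").mkdir(parents=True)
    (base / "Yoruba" / "nouns").mkdir(parents=True)

    result = collect_data_types(base, ["Norwegian/Nynorsk", "Yoruba", "Missing"])
```

Language folders that do not exist, such as "Missing" here, are skipped silently, mirroring the `is_dir()` guard in the hunk.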

andrewtavis

Really really great work here, @OmarAI2003 😊 This is going to make so many parts of Scribe-Data so much easier :) Appreciate your efforts here!

OmarAI2003 · 2024-10-18T05:39:44Z

Thank you! @andrewtavis 😊 I really appreciate your guidance throughout this process. I'm excited to contribute even more and keep improving Scribe!

OmarAI2003 added 30 commits October 12, 2024 16:44

Simplified language metadata JSON by removing unnecessary nesting and…

624760d

… keys.
- Removed 'description', 'entry', and 'languages' keys.
- Flattened structure to include only 'language', 'iso', and 'qid' at the top level.

Refactored _load_json function to handle simplified JSON structure.

05ba79d

- Removed 'root' parameter since the JSON is now flat.
- Updated function to return the entire contents of the JSON directly.

Refactor language metadata structure: Include all languages with Norw…

7be7005

…egian having sub-languages
- Removed unnecessary top-level keys
- Organized Norwegian with its sub-languages (Nynorsk and Bokmål)

Update get_scribe_languages to handle sub-languages in JSON structure

046c78d

- Adjusted the function to return both main languages and their sub-languages.
- Ensured that languages like Norwegian are represented by their sub-languages only.
- Enhanced compatibility with the new JSON format.

Merge remote-tracking branch 'upstream/main' into refactor-languages_…

7c00873

…metadata.json-and-rework-references

Merge remote-tracking branch 'upstream/main' into refactor-languages_…

2233e44

…metadata.json-and-rework-references

Remove get_language_words_to_remove and get_language_words_to_ignore …

8f737cd

…due to new language_metadata.json structure

Fix: Update language extraction to match new JSON structure by removi…

6186be9

…ng the 'languages' key reference

Refactor language extraction to use direct keys from language_metadata.

1c959ec

Removed dependency on the 'languages' key in JSON structure.

Removed dependency on the 'languages' key based on the old json struc…

4705414

…ture in cli/total.py file

Refactor to use list_all_languages function for language extraction

8d8f8f5

- Replaced old extraction method with a centralized function.

Enhance language handling by importing utility functions

d9a649b

- Imported list_all_languages and format_sublanguage_name from scribe_data.utils.
- Updated get_datatype_list and print_total_lexemes to improve language name retrieval and formatting.

Update get_language_iso function:

30f97e9

- Refactored to use the user-defined _find function.
- Removed the try-except block as error handling is already implemented in _find.
- Removed the InvalidLanguageValue module as it was imported but unused.

Handle sub-languages in language table generation

ceec187

- Utilized already built helper functions to support sub-languages when retrieving ISO and QID values.
- Updated table printing to correctly format and display both main languages and sub-languages.

Merge remote-tracking branch 'upstream/main' into refactor-languages_…

5345c08

…metadata.json-and-rework-references

adding new languages and their dialects to the language_metadata.json…

540e9d2

… file

Modified the loop that searches languages in the list_data_types func…

f389ab5

…tion to reflect the new JSON structure, ensuring only data types are printed and no sub-languages unlike before.

Capitalize the languages returned by the function 'format_sublanguage…

09944ed

…_name' to align with the directory structure in the language_data_extraction directory.

Implemented minor fixes by utilizing the format_sublanguage_name func…

f602f17

…tion to handle sub_language folders.

Updated the instance variable self.languages in ScribeDataConfig to u…

ba0ed9a

…se list_all_languages, assigning a complete list of all languages.

adding mandarin as a sub language under chinese and updating some qids

c77cb1f

Update test_list_languages to match updated output format

87ec3b0

Merge remote-tracking branch 'upstream/main' into refactor-languages_…

84f8a4b

…metadata.json-and-rework-references

removing .capitalize method since it's already implemented inside lag…

881c055

…uages listing functions

Merge remote-tracking branch 'upstream/main' into refactor-languages_…

15a13fb

…metadata.json-and-rework-references

OmarAI2003 and others added 6 commits October 16, 2024 21:35

Updating test cases in test_list.py file to match newly added languages

fed80b3

Update test cases to include sub-languages

e6140e5

- Updated all test cases to account for sub-languages.
- Removed tests for test_get_language_words_to_remove and test_get_language_words_to_ignore, as these functions were deleted from utils.py and the languages metadata files

Updated the get_language_from_iso function to depend on the JSON file…

22791ce

…. Made the language_metadata parameter optional in two functions. Added a ValueError exception when a language is not found.

Add unit tests for language formatting and listing:

1416134

- Positive and negative tests for format_sublanguage_name
- Test to validate the output of list_all_languages

Merge remote-tracking branch 'upstream/main' into refactor-languages_…

ca9edb4

…metadata.json-and-rework-references

Merge branch 'scribe-org:main' into refactor-languages_metadata.json-…

f3426f1

…and-rework-references

andrewtavis added the hacktoberfest-accepted Accepted as a part of Hacktoberfest label Oct 17, 2024

andrewtavis self-requested a review October 17, 2024 12:26

DeleMike mentioned this pull request Oct 17, 2024

Add Script to Check Consistency Between Data Types in Directories and Metadata #390

Merged

1 task

OmarAI2003 mentioned this pull request Oct 17, 2024

Check language metadata #385

Merged

1 task

KesharwaniArpita reviewed Oct 17, 2024

Edits to language metadata and supporting functions + pr checklist

661b131

andrewtavis approved these changes Oct 18, 2024

andrewtavis merged commit 9df1756 into scribe-org:main Oct 18, 2024
5 checks passed

This was referenced Oct 18, 2024

Update languages metadata file and use of it thoughout project #293

Closed

complete workflow to check sparql queries #396

Merged

Centralizing the emoji keyword generation logic #379

Closed

OmarAI2003 deleted the refactor-languages_metadata.json-and-rework-references branch October 18, 2024 08:11

OmarAI2003 mentioned this pull request Oct 18, 2024

Docs are not building #382

Closed

2 tasks
