Moving from Old Language Metadata Structure to Support Sub-languages and Simplified JSON #402
… keys. - Removed 'description', 'entry', and 'languages' keys. - Flattened structure to include only 'language', 'iso', and 'qid' at the top level.
- Removed 'root' parameter since the JSON is now flat. - Updated function to return the entire contents of the JSON directly.
…egian having sub-languages - Removed unnecessary top-level keys - Organized Norwegian with its sub-languages (Nynorsk and Bokmål)
- Enhanced the function to check for both regular languages and their sub-languages. - Added error handling for cases where a language has only sub-languages, providing informative messages. - Updated the function's docstring to reflect changes in behavior and usage.
- Adjusted the function to return both main languages and their sub-languages. - Ensured that languages like Norwegian are represented by their sub-languages only. - Enhanced compatibility with the new JSON format.
…metadata.json-and-rework-references
…metadata.json-and-rework-references
…due to new language_metadata.json structure
…ON structure - Updated the logic for building language_map and language_to_qid to handle languages with sub-languages. - Both main languages and sub-languages are now processed in a single pass, ensuring that: - language_map includes all metadata for main and sub-languages. - language_to_qid correctly maps both main and sub-languages to their QIDs.
…ng the 'languages' key reference
Removed dependency on the 'languages' key in JSON structure.
…s 'mainlang/sublang' - Implemented the function to check if a language is a sub-language and format its name as 'mainlang/sublang' for easier searching in language_data_extraction. - Returns the original language name if it's not a sub-language. - Added detailed docstring for clarity and usage examples.
- Wrapped 'lang' variable with format_sublanguage_name to ensure sub-languages are formatted as 'mainlang/sublang' during data extraction. - This ensures proper directory creation and querying for sub-languages, aligning with the new language metadata structure.
…ture in cli/total.py file
- Created list_all_languages function to extract both main languages and sub-languages - The function checks for sub-languages and compiles a complete list for easier access. - Updated example usage to demonstrate the new functionality.
- Replaced old extraction method with a centralized function.
- Imported list_all_languages and format_sublanguage_name from scribe_data.utils. - Updated get_datatype_list and print_total_lexemes to improve language name retrieval and formatting.
- Refactored to use the user-defined _find function. - Removed the try-except block as error handling is already implemented in _find. - Removed the InvalidLanguageValue module as it was imported but unused.
- Utilized already built helper functions to support sub-languages when retrieving ISO and QID values. - Updated table printing to correctly format and display both main languages and sub-languages.
…metadata.json-and-rework-references
…tion to reflect the new JSON structure, ensuring only data types are printed and no sub-languages unlike before.
…_name' to align with the directory structure in the language_data_extraction directory.
…tion to handle sub_language folders.
…se list_all_languages, assigning a complete list of all languages.
…metadata.json-and-rework-references
…uages listing functions
…metadata.json-and-rework-references
- Updated all test cases to account for sub-languages. - Removed tests for test_get_language_words_to_remove and test_get_language_words_to_ignore, as these functions were deleted from utils.py and the languages metadata files.
…. Made the language_metadata parameter optional in two functions. Added a ValueError exception when a language is not found.
- Positive and negative tests for format_sublanguage_name - Test to validate the output of list_all_languages
…metadata.json-and-rework-references
…and-rework-references
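For readers skimming the commit list, here is a minimal sketch of how language_map and language_to_qid might be built in a single pass over the flattened metadata. The "sub_languages" key name, the build_language_maps helper, and the Norwegian ISO/QID values are assumptions for illustration; only the Yoruba entry is taken from the diff in this PR.

```python
# Hypothetical metadata in the flattened shape. The "sub_languages" key name
# and the Norwegian values are illustrative assumptions, not the repo's data.
language_metadata = {
    "norwegian": {
        "sub_languages": {
            "bokmål": {"iso": "nb", "qid": "Q25167"},
            "nynorsk": {"iso": "nn", "qid": "Q25164"},
        }
    },
    "yoruba": {"iso": "yo", "qid": "Q34311"},
}


def build_language_maps(metadata):
    """Build both lookup tables in one pass over main and sub-languages."""
    language_map = {}
    language_to_qid = {}

    def add(name, props):
        language_map[name] = {"language": name, **props}
        language_to_qid[name] = props["qid"]

    for lang, props in metadata.items():
        # A main language is only added if it carries its own iso and qid.
        if "iso" in props and "qid" in props:
            add(lang, {"iso": props["iso"], "qid": props["qid"]})
        # Sub-languages are added under their own names.
        for sub_lang, sub_props in props.get("sub_languages", {}).items():
            add(sub_lang, sub_props)

    return language_map, language_to_qid


language_map, language_to_qid = build_language_maps(language_metadata)
```

In this sketch a language that has only sub-languages (Norwegian here) does not appear in either table, which is one way to read "represented by their sub-languages only".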
Thank you for the pull request! The Scribe team will do our best to address your contribution as soon as we can. The following is a checklist for maintainers to make sure this process goes as well as possible. Feel free to address the points below yourself in further commits if you realize that actions are needed :) If you're not already a member of our public Matrix community, please consider joining! We'd suggest using Element as your Matrix client, and definitely join the General and Data rooms once you're in. Also consider joining our bi-weekly Saturday dev syncs. It'd be great to have you! Maintainer checklist
CC @DeleMike, @catreedle, @Ekikereabasi-Nk, @KesharwaniArpita, @axif0 Would be great to get your all's reviews!
@andrewtavis this is an interesting update. I believe if this is incorporated, we will now have checks to verify languages in the directory and the ones in the … #390 is also doing something related. PR #396 also does something related, but it is more about queries. However, could we have a single function to check this? PR #396 is unaware of the current JSON structure coming from this PR, which will break it.
I'd like to have one check for queries and one check for structure, but aside from that I agree that we have a bit of a web going here, and once we have the current PRs merged we'll have most of the functionality we're after :)
alrighty!
Will this help simplify #385, or does it already cover its functionality? If it does, could you please point out which part of the work addresses it? I'm still trying to understand how it connects, but I'm having difficulty. 😅
I think this should significantly simplify things, especially with the two newly added functions in …
Thank you! Will look into it. :)
    "yoruba": {
        "iso": "yo",
        "qid": "Q34311"
    }
}
Thanks for cleaning up language_metadata.json. It is now much easier to look things up.
Thank you for your feedback! I'm glad to hear that the JSON file is easier to work with now. I also want to thank @andrewtavis for his guidance on this.
Thanks for your work @OmarAI2003 !!!!
        "yoruba",
    ]

    assert utils.list_all_languages() == expected_languages


def test_get_ios_data_path():
The updated tests look great! Thanks @OmarAI2003
language_dir = (
    LANGUAGE_DATA_EXTRACTION_DIR
    / format_sublanguage_name(lang, language_metadata).capitalize()
)
if language_dir.is_dir():
    data_types.update(f.name for f in language_dir.iterdir() if f.is_dir())
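To make the snippet above easier to follow, here is a rough sketch of what format_sublanguage_name could look like and how its result feeds into the directory path. It is based only on the behavior described in the commit messages ('mainlang/sublang' for sub-languages, the original name otherwise, ValueError when nothing matches). The "sub_languages" key, the Norwegian values, the language_dir_for helper, and the stand-in directory constant are assumptions.

```python
from pathlib import PurePosixPath

# Hypothetical metadata; the "sub_languages" key name is an assumption.
LANGUAGE_METADATA = {
    "norwegian": {
        "sub_languages": {
            "bokmål": {"iso": "nb", "qid": "Q25167"},
            "nynorsk": {"iso": "nn", "qid": "Q25164"},
        }
    },
    "yoruba": {"iso": "yo", "qid": "Q34311"},
}

# Stand-in for the real constant used in the diff above.
LANGUAGE_DATA_EXTRACTION_DIR = PurePosixPath("language_data_extraction")


def format_sublanguage_name(lang, language_metadata):
    """Return 'mainlang/sublang' for a sub-language, else the name unchanged."""
    lang_lower = lang.lower()
    for main_lang, props in language_metadata.items():
        if main_lang == lang_lower:
            return lang  # a main language is returned as given
        if lang_lower in props.get("sub_languages", {}):
            return f"{main_lang}/{lang_lower}"
    raise ValueError(f"{lang!r} is not a known language or sub-language.")


def language_dir_for(lang):
    """Hypothetical helper mirroring the path expression in the diff."""
    name = format_sublanguage_name(lang, LANGUAGE_METADATA)
    return LANGUAGE_DATA_EXTRACTION_DIR / name.capitalize()
```

One thing to keep in mind: str.capitalize() uppercases only the first character and lowercases the rest, so in this sketch the sub-language segment of the path stays lowercase. Whether that matches the directory names on disk is worth checking.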
You've made sorting and formatting more efficient by calling list_all_languages() and sorting the languages using get_language_iso() and get_language_qid(). This reduces redundancy and enhances readability. Thank you for this!!!
Really really great work here, @OmarAI2003 😊 This is going to make so many parts of Scribe-Data so much easier :) Appreciate your efforts here!
Thank you! @andrewtavis 😊 I really appreciate your guidance throughout this process. I'm excited to contribute even more and keep improving Scribe!
Contributor checklist
Description
This PR introduces several key updates and improvements to the project’s handling of language metadata, test cases, and language extraction logic:
Refactored Language Metadata Structure:
- … language_metadata.json file by removing unnecessary nesting and keys, leaving only 'language', 'iso', and 'qid' at the top level.

New Language Support:
- … language_metadata.json file.

Improved Language Handling:
- Implemented the format_sublanguage_name function to format sub-languages as 'mainlang/sublang'.
- Updated language_map and language_to_qid to accommodate the new structure, ensuring correct directory creation and querying for sub-languages.
- Removed dependency on the 'languages' key in the JSON file.

Utility Functions and Minor Fixes:
- … get_scribe_languages, list_all_languages, and _find functions to handle both main languages and sub-languages.
- … (get_language_words_to_remove, get_language_words_to_ignore) no longer needed with the simplified JSON structure.

Updated Test Cases:
I’m open to feedback and welcome any additional suggestions or simple updates to improve the functionality and alignment with the project's overall goals.
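As a quick illustration of the language listing described above, here is a minimal sketch of a list_all_languages-style helper. It assumes a "sub_languages" key for nested entries and alphabetical output; both are guesses, the Norwegian values are illustrative, and the real function in scribe_data.utils may differ.

```python
# Hypothetical metadata; only the Yoruba entry comes from the diff in this PR.
SAMPLE_METADATA = {
    "norwegian": {
        "sub_languages": {
            "bokmål": {"iso": "nb", "qid": "Q25167"},
            "nynorsk": {"iso": "nn", "qid": "Q25164"},
        }
    },
    "yoruba": {"iso": "yo", "qid": "Q34311"},
}


def list_all_languages(language_metadata):
    """Return main languages plus sub-languages as one sorted list.

    A language that only has sub-languages is represented by those
    sub-languages and is not listed under its own name.
    """
    languages = []
    for lang, props in language_metadata.items():
        sub_languages = props.get("sub_languages")
        if sub_languages:
            languages.extend(sub_languages)  # adds the sub-language names
        else:
            languages.append(lang)
    return sorted(languages)


all_languages = list_all_languages(SAMPLE_METADATA)
```

With the sample data, Norwegian itself is not listed; its two sub-languages stand in for it.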
Related issue
This PR addresses issue #293