Moving from Old Language Metadata Structure to Support Sub-languages and Simplified JSON #402

OmarAI2003
Contributor

@OmarAI2003 OmarAI2003 commented Oct 17, 2024

Contributor checklist


Description

This PR introduces several key updates and improvements to the project’s handling of language metadata, test cases, and language extraction logic:

  1. Refactored Language Metadata Structure:

    • Simplified the language_metadata.json file by removing unnecessary nesting and keys, leaving only language, iso, and qid at the top level.
    • Organized sub-languages under their respective main languages (e.g., "Norwegian" now has "Nynorsk" and "Bokmål").
  2. New Language Support:

    • Added new languages and dialects, including Mandarin and others, to the language_metadata.json file.
  3. Improved Language Handling:

    • Implemented a new format_sublanguage_name function to format sub-languages as mainlang/sublang.
    • Refactored the generation of language_map and language_to_qid to accommodate the new structure, ensuring correct directory creation and querying for sub-languages.
    • Removed the dependency on the obsolete languages key in the JSON file.
  4. Utility Functions and Minor Fixes:

    • Refined get_scribe_languages, list_all_languages, and _find functions to handle both main languages and sub-languages.
    • Removed outdated utility functions (get_language_words_to_remove, get_language_words_to_ignore) no longer needed with the simplified JSON structure.
    • Enhanced error handling for cases where a language only has sub-languages.
  5. Updated Test Cases:

    • Modified test cases to align with the new language metadata structure, including tests for sub-languages.
    • Added unit tests for sub-language formatting and listing, ensuring accurate results for all languages.

I’m open to feedback and welcome any additional suggestions or simple updates to improve the functionality and alignment with the project's overall goals.

Related issue

This PR addresses issue #293

… keys.

- Removed 'description', 'entry', and 'languages' keys.
- Flattened structure to include only 'language', 'iso', and 'qid' at the top level.
- Removed 'root' parameter since the JSON is now flat.
- Updated function to return the entire contents of the JSON directly.
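As a rough illustration of the flattened file, here is a minimal sketch of a loader that returns the whole JSON without a root parameter. The function name load_metadata and the inline sample string are assumptions for illustration only; the yoruba entry mirrors the diff excerpt quoted in this thread.

```python
import json

# Hypothetical excerpt of the flattened file: each language name maps
# straight to its "iso" and "qid" values, with no wrapping keys.
SAMPLE_JSON = """
{
    "yoruba": {
        "iso": "yo",
        "qid": "Q34311"
    }
}
"""


def load_metadata(raw: str) -> dict:
    """Illustrative loader: return the entire parsed JSON, no 'root' lookup."""
    return json.loads(raw)


metadata = load_metadata(SAMPLE_JSON)
print(metadata["yoruba"]["iso"])  # yo
```

With a flat file, callers index by language name directly instead of first stepping into a wrapper key.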
…egian having sub-languages

- Removed unnecessary top-level keys
- Organized Norwegian with its sub-languages (Nynorsk and Bokmål)
- Enhanced the function to check for both regular languages and their sub-languages.
- Added error handling for cases where a language has only sub-languages, providing informative messages.
- Updated the function's docstring to reflect changes in behavior and usage.
- Adjusted the function to return both main languages and their sub-languages.
- Ensured that languages like Norwegian are represented by their sub-languages only.
- Enhanced compatibility with the new JSON format.
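The lookup behaviour described above could be sketched roughly as follows. This is not the project's _find function: the name find_value, the "sub_languages" key and the sample entries are all assumptions made for illustration.

```python
# Hypothetical metadata; "sub_languages" is an assumed key name.
metadata = {
    "french": {"iso": "fr", "qid": "Q150"},
    "norwegian": {
        "sub_languages": {
            "nynorsk": {"iso": "nn", "qid": "Q25164"},
            "bokmål": {"iso": "nb", "qid": "Q25167"},
        }
    },
}


def find_value(language: str, target_key: str, language_metadata: dict) -> str:
    """Illustrative lookup over main languages and their sub-languages."""
    name = language.lower()
    for main_lang, entry in language_metadata.items():
        subs = entry.get("sub_languages", {})
        if main_lang == name:
            if target_key in entry:
                return entry[target_key]
            # The main language has no data of its own, so point to its sub-languages.
            raise ValueError(
                f"'{main_lang}' has only sub-languages: {', '.join(subs)}"
            )
        if name in subs:
            return subs[name][target_key]
    raise ValueError(f"'{language}' was not found in the metadata.")


print(find_value("nynorsk", "iso", metadata))  # nn
```

Asking for "norwegian" directly raises a ValueError that names the available sub-languages, which is the kind of informative message the commit describes.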
…ON structure

- Updated the logic for building language_map and language_to_qid to handle languages with sub-languages.
- Both main languages and sub-languages are now processed in a single pass, ensuring that:
  - language_map includes all metadata for main and sub-languages.
  - language_to_qid correctly maps both main and sub-languages to their QIDs.
- Removed dependency on the 'languages' key in the JSON structure.
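A minimal sketch of the single-pass construction, assuming sub-languages sit under a "sub_languages" key (the key name and the sample entries are assumptions, not taken from the PR diff). It also assumes a language that has sub-languages carries no iso or qid of its own.

```python
# Hypothetical metadata in the new shape.
metadata = {
    "french": {"iso": "fr", "qid": "Q150"},
    "norwegian": {
        "sub_languages": {
            "nynorsk": {"iso": "nn", "qid": "Q25164"},
            "bokmål": {"iso": "nb", "qid": "Q25167"},
        }
    },
}

language_map = {}
language_to_qid = {}

# One pass: each main language contributes either itself or its sub-languages.
for name, entry in metadata.items():
    entries = entry.get("sub_languages", {name: entry})
    for lang, props in entries.items():
        language_map[lang] = props
        language_to_qid[lang] = props["qid"]

print(language_to_qid)
```

In this sketch "norwegian" never appears as a key in either mapping; only "nynorsk" and "bokmål" do.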
…s 'mainlang/sublang'

- Implemented the function to check if a language is a sub-language and format its name as 'mainlang/sublang' for easier searching in language_data_extraction.
- Returns the original language name if it's not a sub-language.
- Added detailed docstring for clarity and usage examples.
- Wrapped 'lang' variable with format_sublanguage_name to ensure sub-languages are formatted as 'mainlang/sublang' during data extraction.
- This ensures proper directory creation and querying for sub-languages, aligning with the new language metadata structure.
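The formatting rule can be sketched as below. This is an illustrative version rather than the function added in utils.py: the "sub_languages" key and the sample metadata are assumptions, and the capitalization follows the MainLang/SubLang description given later in this thread.

```python
# Hypothetical metadata; "sub_languages" is an assumed key name.
metadata = {
    "french": {"iso": "fr", "qid": "Q150"},
    "norwegian": {
        "sub_languages": {
            "nynorsk": {"iso": "nn", "qid": "Q25164"},
            "bokmål": {"iso": "nb", "qid": "Q25167"},
        }
    },
}


def format_sublanguage_name(lang: str, language_metadata: dict) -> str:
    """Illustrative sketch: 'MainLang/SubLang' for a sub-language, else the name."""
    target = lang.lower()
    for main_lang, entry in language_metadata.items():
        if main_lang.lower() == target:
            return main_lang.capitalize()
        for sub_lang in entry.get("sub_languages", {}):
            if sub_lang.lower() == target:
                return f"{main_lang.capitalize()}/{sub_lang.capitalize()}"
    # Unknown names are rejected rather than passed through silently.
    raise ValueError(f"{lang} is not a valid language or sub-language.")


print(format_sublanguage_name("nynorsk", metadata))  # Norwegian/Nynorsk
print(format_sublanguage_name("french", metadata))  # French
```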
- Created list_all_languages function to extract both main languages and sub-languages
- The function checks for sub-languages and compiles a complete list for easier access.
- Updated example usage to demonstrate the new functionality.
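A rough sketch of how such a listing could work, again assuming a "sub_languages" key. Sorting the result is an assumption of this sketch, and the sample entries are illustrative.

```python
# Hypothetical metadata; "sub_languages" is an assumed key name.
metadata = {
    "yoruba": {"iso": "yo", "qid": "Q34311"},
    "norwegian": {
        "sub_languages": {
            "nynorsk": {"iso": "nn", "qid": "Q25164"},
            "bokmål": {"iso": "nb", "qid": "Q25167"},
        }
    },
    "french": {"iso": "fr", "qid": "Q150"},
}


def list_all_languages(language_metadata: dict) -> list:
    """Illustrative sketch: collect main languages and sub-languages in one list."""
    languages = []
    for name, entry in language_metadata.items():
        if "sub_languages" in entry:
            # A language with sub-languages is represented by them only.
            languages.extend(entry["sub_languages"].keys())
        else:
            languages.append(name)
    return sorted(languages)


print(list_all_languages(metadata))
```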
- Replaced old extraction method with a centralized function.
- Imported list_all_languages and format_sublanguage_name from scribe_data.utils.
- Updated get_datatype_list and print_total_lexemes to improve language name retrieval and formatting.
- Refactored to use the user-defined _find function.
- Removed the try-except block as error handling is already implemented in _find.
- Removed the InvalidLanguageValue module as it was imported but unused.
- Utilized already built helper functions to support sub-languages when retrieving ISO and QID values.
- Updated table printing to correctly format and display both main languages and sub-languages.
…tion to reflect the new JSON structure, ensuring only data types are printed and no sub-languages unlike before.
…_name' to align with the directory structure in the language_data_extraction directory.
…se list_all_languages, assigning a complete list of all languages.
OmarAI2003 and others added 6 commits October 16, 2024 21:35
- Updated all test cases to account for sub-languages.
- Removed the tests test_get_language_words_to_remove and test_get_language_words_to_ignore, as these functions were deleted from utils.py and the language metadata files.
…. Made the language_metadata parameter optional in two functions. Added a ValueError exception when a language is not found.
- Positive and negative tests for format_sublanguage_name
- Test to validate the output of list_all_languages

github-actions bot commented Oct 17, 2024

Thank you for the pull request!

The Scribe team will do our best to address your contribution as soon as we can. The following is a checklist for maintainers to make sure this process goes as well as possible. Feel free to address the points below yourself in further commits if you realize that actions are needed :)

If you're not already a member of our public Matrix community, please consider joining! We'd suggest using Element as your Matrix client, and definitely join the General and Data rooms once you're in. Also consider joining our bi-weekly Saturday dev syncs. It'd be great to have you!

Maintainer checklist

  • The linting and formatting workflow within the PR checks do not indicate new errors in the files changed

  • The CHANGELOG has been updated with a description of the changes for the upcoming release and the corresponding issue (if necessary)

@andrewtavis andrewtavis added the hacktoberfest-accepted Accepted as a part of Hacktoberfest label Oct 17, 2024
@andrewtavis andrewtavis self-requested a review October 17, 2024 12:26
@andrewtavis
Member

CC @DeleMike, @catreedle, @Ekikereabasi-Nk, @KesharwaniArpita, @axif0

Would be great to get your all's reviews!

@DeleMike
Contributor

@andrewtavis this is an interesting update. I believe if this is incorporated, we will now have checks to verify languages in the directory and the ones in the json file, which is what #385 is also trying to work on.

#390 is also doing something related.

PR #396 also does something related but it is more about queries. However, could we have a single function to check this?

PR #396 is unaware of the JSON structure introduced by this PR, so merging this one will break it.

@andrewtavis
Member

I'd like to have one check for queries and one check for structure, but aside from that I agree that we have a bit of a web going here and once we have the current PRs merged we'll have most of the functionality we're after :)

@DeleMike
Contributor

> I'd like to have one check for queries and one check for structure, but aside from that I agree that we have a bit of a web going here and once we have the current PRs merged we'll have most of the functionality we're after :)

alrighty!

@catreedle
Contributor

catreedle commented Oct 17, 2024

Will this help simplify #385, or does it already cover its functionality? If it does, could you please point out which part of the work addresses it? I'm still trying to understand how it connects, but I'm having difficulty. 😅

@OmarAI2003
Contributor Author

OmarAI2003 commented Oct 17, 2024

> Will this help simplify #385, or does it already cover its functionality? If it does, could you please point out which part of the work addresses it? I'm still trying to understand how it connects, but I'm having difficulty. 😅

I think this should significantly simplify things, especially with the two newly added functions in utils.py. One function provides a list of all queryable languages from the JSON file, and the other formats language names appropriately. The formatted names are designed to be compatible with the language_data_extraction directory structure by returning main languages capitalized and sub-languages in the format MainLang/SubLang, with both parts capitalized.
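To illustrate why the MainLang/SubLang format lines up with the directory layout, here is a small sketch using pathlib. The directory name comes from this discussion; the two formatted names are sample inputs rather than output from the project's functions.

```python
from pathlib import PurePosixPath

# Base directory name as mentioned in the comment above.
base_dir = PurePosixPath("language_data_extraction")

# Joining a "MainLang/SubLang" string yields a nested sub-directory path.
for formatted_name in ["French", "Norwegian/Nynorsk"]:
    language_dir = base_dir / formatted_name
    print(language_dir.parts)
```

The second path has three parts, so a sub-language resolves to a folder nested inside its main language's folder.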

@OmarAI2003 OmarAI2003 mentioned this pull request Oct 17, 2024
1 task
@catreedle
Contributor

> > Will this help simplify #385, or does it already cover its functionality? If it does, could you please point out which part of the work addresses it? I'm still trying to understand how it connects, but I'm having difficulty. 😅
>
> I think this should significantly simplify things, especially with the two newly added functions in utils.py. One function provides a list of all queryable languages from the JSON file, and the other formats language names appropriately. The formatted names are designed to be compatible with the language_data_extraction directory structure by returning main languages capitalized and sub-languages in the format MainLang/SubLang, with both parts capitalized.

Thank you! Will look into it. :)

    "yoruba": {
        "iso": "yo",
        "qid": "Q34311"
    }
}
Contributor

Thanks for cleaning up language_metadata.json. It is now much easier to look through.

Contributor Author

Thank you for your feedback! I'm glad to hear that the JSON file is easier to work with now. I also want to thank @andrewtavis for his guidance on this.

Contributor

@KesharwaniArpita KesharwaniArpita left a comment

Thanks for your work @OmarAI2003 !!!!

        "yoruba",
    ]

    assert utils.list_all_languages() == expected_languages


def test_get_ios_data_path():
Contributor

The updated tests look great! Thanks @OmarAI2003

language_dir = (
    LANGUAGE_DATA_EXTRACTION_DIR
    / format_sublanguage_name(lang, language_metadata).capitalize()
)
if language_dir.is_dir():
    data_types.update(f.name for f in language_dir.iterdir() if f.is_dir())

Contributor

You've made sorting and formatting more efficient by calling list_all_languages() and sorting the languages using get_language_iso() and get_language_qid(). This reduces redundancy and enhances readability. Thank you for this!!!

Member

@andrewtavis andrewtavis left a comment

Really really great work here, @OmarAI2003 😊 This is going to make so many parts of Scribe-Data so much easier :) Appreciate your efforts here!

@OmarAI2003
Contributor Author

Thank you! @andrewtavis 😊 I really appreciate your guidance throughout this process. I'm excited to contribute even more and keep improving Scribe!

@OmarAI2003 OmarAI2003 deleted the refactor-languages_metadata.json-and-rework-references branch October 18, 2024 08:11
@OmarAI2003 OmarAI2003 mentioned this pull request Oct 18, 2024
2 tasks
Labels
hacktoberfest-accepted Accepted as a part of Hacktoberfest
5 participants