Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Check language metadata #385

Merged
merged 17 commits into from
Oct 24, 2024
Merged

Conversation

catreedle
Copy link
Contributor

Contributor checklist


Description

This PR introduces check_language_metadata.py with the functionality:

  • Check and log the differences between the languages in languages_metadata.json and language_data_extraction.
  • Validate that each language and sub-language in language_metadata.json has the property qid and iso.

This code has been tested locally.

Related issue

Copy link

github-actions bot commented Oct 16, 2024

Thank you for the pull request!

The Scribe team will do our best to address your contribution as soon as we can. The following is a checklist for maintainers to make sure this process goes as well as possible. Feel free to address the points below yourself in further commits if you realize that actions are needed :)

If you're not already a member of our public Matrix community, please consider joining! We'd suggest using Element as your Matrix client, and definitely join the General and Data rooms once you're in. Also consider joining our bi-weekly Saturday dev syncs. It'd be great to have you!

Maintainer checklist

  • The linting and formatting workflow within the PR checks do not indicate new errors in the files changed

  • The CHANGELOG has been updated with a description of the changes for the upcoming release and the corresponding issue (if necessary)

Comment on lines 6 to 14
LANGUAGE_DATA_EXTRACTION_DIR = Path(__file__).parent.parent / "language_data_extraction"

LANGUAGE_METADATA_FILE = (
Path(__file__).parent.parent / "resources" / "language_metadata.json"
)

DATA_TYPE_METADATA_FILE = (
Path(__file__).parent.parent / "resources" / "data_type_metadata.json"
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@catreedle you can get these from src/scribe_data/cli/cli_utils.py - they already exist there

Comment on lines 16 to 36
try:
with LANGUAGE_METADATA_FILE.open("r", encoding="utf-8") as file:
language_metadata = json.load(file)
languages_in_metadata = {
lang["language"]: {"iso": lang["iso"], "qid": lang["qid"]}
for lang in language_metadata["languages"]
} # current language metadata

# languages_in_metadata = { # proposed language metadata
# key.lower(): value for key, value in language_metadata.items()
# } # Normalize keys to lowercase for case-insensitive comparison

except (IOError, json.JSONDecodeError) as e:
print(f"Error reading language metadata: {e}")

try:
with DATA_TYPE_METADATA_FILE.open("r", encoding="utf-8") as file:
data_type_metadata = json.load(file)
all_data_types = tuple(data_type_metadata.keys())

except (IOError, json.JSONDecodeError) as e:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, the cli_utils.py has loaded the language_metadata.json and data_type_metadata.json files for us. You could ignore loading it in here

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Screenshot 2024-10-16 at 15 13 42

see the cli_utils.py contents

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does it mean we can directly use it? how?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, just call it directly. You can experiment and see for yourself 😄

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thankss.. I found it 😊

Comment on lines +230 to +231
if __name__ == "__main__":
check_language_metadata()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is for a test, then it's fine. but we will be calling this in check_project_metadata.yaml file so no need for this :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thank you! good to know 😊

@andrewtavis andrewtavis added the hacktoberfest-accepted Accepted as a part of Hacktoberfest label Oct 16, 2024
@andrewtavis
Copy link
Member

One minor thing here, @catreedle: Could we get consistent function docstirng, and ones similar to what we have in this PR in get_missing_languages? :)

@andrewtavis
Copy link
Member

Making the docstring like that means we can use autodoc in the documentation 😊

@DeleMike
Copy link
Contributor

hey @catreedle , I'm sorry if I have confused you a bit :(

if __name__ == "__main__":
    check_language_metadata()

I just checked and the code you added earlier is quite important if we are gonna be using this file for our workflow. Can you please add it back?

Ty!

@catreedle
Copy link
Contributor Author

One minor thing here, @catreedle: Could we get consistent function docstirng, and ones similar to what we have in this PR in get_missing_languages? :)

on it :)

@catreedle
Copy link
Contributor Author

hey @catreedle , I'm sorry if I have confused you a bit :(

if __name__ == "__main__":
    check_language_metadata()

I just checked and the code you added earlier is quite important if we are gonna be using this file for our workflow. Can you please add it back?

Ty!

sure. no worries :)

@catreedle
Copy link
Contributor Author

updated the docstring. let me know if there's amiss @andrewtavis :)

@andrewtavis
Copy link
Member

Quick note being sent to all the testing PRs, if updates are needed now that #402 has been merged, then it'd be great to get those updates to the branch :) If no updates are needed, then let me know 😊

@catreedle
Copy link
Contributor Author

Quick note being sent to all the testing PRs, if updates are needed now that #402 has been merged, then it'd be great to get those updates to the branch :) If no updates are needed, then let me know 😊

Hi! I've made the necessary updates after the #402 merge. Everything should be good to go now 😊.

If any missing languages or properties are found, the function exits the script with a status code of 1.
"""
languages_in_metadata = {key.lower(): value for key, value in _languages.items()}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this line
languages_in_metadata = {key.lower(): value for key, value in _languages.items()}
give all the languages in _languages, including the sublanguages?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes. but it doesn't lowercase all the sublanguages if any is mistakenly written, I just noticed.

Copy link
Contributor

@OmarAI2003 OmarAI2003 Oct 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is graet. There is also a helper function called list_all_languages at the bottom of utils.py that lists all the languages in the JSON file. You can make use of it here to check for sublanguages; it would be helpful.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestion! Could you explain further how best to use it in this context? I want to include parent languages so I convert the languages directory to match the format of language_metadata.json for easier comparison. Since list_all_languages doesn’t show the parent languages, I’m unsure how to use it effectively for my comparison.

Copy link
Member

@andrewtavis andrewtavis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm bringing this down and integrating the checks for isos and qids into the other workflows 😊 Thanks, @catreedle! :) Really great to have the support on the checks 🚀

@andrewtavis andrewtavis merged commit 0fc2200 into scribe-org:main Oct 24, 2024
5 checks passed
@catreedle
Copy link
Contributor Author

Thank you! @andrewtavis

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
hacktoberfest-accepted Accepted as a part of Hacktoberfest
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants