Add Script to Check Consistency Between Data Types in Directories and Metadata #390

Merged
@andrewtavis merged 8 commits into scribe-org:main from AK-Workflow-Metadata
Oct 24, 2024

Conversation

KesharwaniArpita
Contributor

@KesharwaniArpita KesharwaniArpita commented Oct 16, 2024

Contributor checklist


Description

This PR introduces a new function, check_data_type_metadata, which ensures that data type subdirectories within language directories are accurately reflected in the data_type_metadata.json file. It accounts for meta-languages and compares the data types found in the file system against those in the metadata, flagging any discrepancies such as missing or extra data types.

Key Changes:

  1. New Functionality:

    • check_data_type_metadata(output_file): This function traverses the LANGUAGE_DATA_EXTRACTION_DIR to validate the consistency of data type directories against the data_type_metadata.json.
    • It checks each language's subdirectories and detects:
      • Missing data types in metadata that exist in the directory.
      • Extra data types in metadata that do not exist in the directory.
    • Discrepancies are written to the specified output_file.
  2. Helper Function:

    • check_language_subdirs: A recursive function to handle meta-languages and sub-language directories, ensuring all subdirectories are accounted for during validation.
  3. Discrepancy Reporting:

    • If discrepancies are found, they are stored in the provided output file, and the user is notified.
    • If no discrepancies are found, the script writes a confirmation message that all data type metadata is up to date.
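For readers skimming the thread, here is a minimal sketch of the behaviour described above. It is not the code from this PR. It assumes that data_type_metadata.json is a JSON object whose keys are data type names, that a meta-language keeps its sub-languages in a `sub-languages` folder (the name used in the snippet reviewed further down), and that the directory and metadata paths are passed in as arguments rather than read from LANGUAGE_DATA_EXTRACTION_DIR. The message wording is also illustrative.

```python
import json
from pathlib import Path

SUB_LANG_DIR = "sub-languages"  # assumption: folder name taken from the reviewed snippet


def check_language_subdirs(lang_dir, known_types, label):
    """Return discrepancy messages for one language directory.

    A directory containing SUB_LANG_DIR is treated as a meta-language:
    only its sub-languages are compared, recursively.
    """
    sub_root = lang_dir / SUB_LANG_DIR
    if sub_root.is_dir():
        messages = []
        for sub in sorted(p for p in sub_root.iterdir() if p.is_dir()):
            messages.extend(
                check_language_subdirs(sub, known_types, f"{label}/{sub.name}")
            )
        return messages

    # Data type folders actually present for this language.
    found = {p.name for p in lang_dir.iterdir() if p.is_dir()}
    messages = []
    not_in_metadata = sorted(found - known_types)
    not_in_directory = sorted(known_types - found)
    if not_in_metadata:
        messages.append(
            f"In directory but not in metadata for '{label}': {not_in_metadata}"
        )
    if not_in_directory:
        messages.append(
            f"In metadata but not in directory for '{label}': {not_in_directory}"
        )
    return messages


def check_data_type_metadata(languages_dir, metadata_file, output_file):
    """Write all discrepancies (or a confirmation message) to output_file."""
    # Iterating a dict yields its keys, so this is the set of known data types.
    known_types = set(json.loads(Path(metadata_file).read_text(encoding="utf-8")))
    discrepancies = []
    for lang in sorted(p for p in Path(languages_dir).iterdir() if p.is_dir()):
        discrepancies.extend(check_language_subdirs(lang, known_types, lang.name))
    if discrepancies:
        report = "\n".join(discrepancies)
    else:
        report = "All data type metadata is up to date."
    Path(output_file).write_text(report + "\n", encoding="utf-8")
    return discrepancies
```

Returning the list as well as writing the file is a choice made for this sketch: it lets a caller (for example a CI job) decide whether to fail without re-reading the report.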

Future Scope:

  • Further enhancements could include integrating this validation check into a CI/CD pipeline to ensure metadata consistency in future contributions. I thought about this while working on the PR. Thoughts are welcome!

Related issue


github-actions bot commented Oct 16, 2024

Thank you for the pull request!

The Scribe team will do our best to address your contribution as soon as we can. The following is a checklist for maintainers to make sure this process goes as well as possible. Feel free to address the points below yourself in further commits if you realize that actions are needed :)

If you're not already a member of our public Matrix community, please consider joining! We'd suggest using Element as your Matrix client, and definitely join the General and Data rooms once you're in. Also consider joining our bi-weekly Saturday dev syncs. It'd be great to have you!

Maintainer checklist

  • The linting and formatting workflow within the PR checks do not indicate new errors in the files changed

  • The CHANGELOG has been updated with a description of the changes for the upcoming release and the corresponding issue (if necessary)

@andrewtavis
Member

CC @axif0 - I'm getting some of the issues a bit mixed, but I think that what we have from you and also here would close #390 and #340? Can we discuss here and then I'll integrate both scripts into one?

@andrewtavis andrewtavis added the hacktoberfest-accepted Accepted as a part of Hacktoberfest label Oct 16, 2024
@andrewtavis andrewtavis self-requested a review October 16, 2024 22:44
Comment on lines +33 to +34
if extra_data_types:
    discrepancies.append(f"Extra in directory for '{meta_language}': {extra_data_types}")
Contributor

I think the correct terms would be "Extra in metadata" or "Missing in directory"?
This is the result for English:
Extra in directory for 'english': {'conjunctions', 'pronouns', 'articles', 'postpositions', 'personal_pronouns', 'autosuggestions', 'prepositions'}
But the English directory doesn't have them.

Contributor Author

@KesharwaniArpita KesharwaniArpita Oct 17, 2024

Thanks, @catreedle, but "extra" here denotes any data type that is present in the language folder but not in the metadata file.
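
The two readings in this exchange differ only in the direction of the set difference. The sketch below is illustrative and is not the code from this PR: the function name, the dictionary keys, and the `nouns`/`verbs` folder names are hypothetical, while `prepositions` and `pronouns` come from the output quoted above. Naming each result after both sides of the comparison avoids the ambiguity of a bare "extra" or "missing".

```python
def compare_data_types(in_directory: set[str], in_metadata: set[str]) -> dict[str, list[str]]:
    """Return both one-sided differences, each named after its direction."""
    return {
        # Folders that exist on disk but have no metadata entry.
        "in_directory_not_metadata": sorted(in_directory - in_metadata),
        # Metadata entries that have no folder on disk.
        "in_metadata_not_directory": sorted(in_metadata - in_directory),
    }


english_dirs = {"nouns", "verbs"}  # hypothetical folders under the English directory
metadata = {"nouns", "verbs", "prepositions", "pronouns"}
result = compare_data_types(english_dirs, metadata)
print(result["in_metadata_not_directory"])  # → ['prepositions', 'pronouns']
```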

@DeleMike
Contributor

CC @axif0 - I'm getting some of the issues a bit mixed, but I think that what we have from you and also here would close #390 and #340? Can we discuss here and then I'll integrate both scripts into one?

I agree.

Comment on lines +37 to +40
sub_lang_dir = language / 'sub-languages'
if sub_lang_dir.exists():
    discrepancies.extend(check_language_subdirs(sub_lang_dir, meta_language))

Contributor

hmm...🤔

Contributor

My thought here is how #402 will affect this PR...🤔

Contributor

I see... sub_lang_dir = language / 'sub-languages' is written so that it also supports the new flow coming from #402, yeah? @catreedle

Contributor Author

Yeah, it was written with generalisation in mind.

Contributor

I don't get it. Does this mean the directory structure will change?
Norwegian
--sub-languages
----Nynorsk
----Bokmål

Contributor

I don't get it. Does this mean the directory structure will change? Norwegian --sub-languages ----Nynorsk ----Bokmål

No, @catreedle, it will remain as is.

Contributor Author

does this mean the directory structure will change?

No, the directory structure will remain the same, but the scripts will also be able to check the data types under sub-languages.

@andrewtavis
Member

Quick note being sent to all the testing PRs: if updates are needed now that #402 has been merged, then it'd be great to get those updates to the branch :) If no updates are needed, then let me know 😊

Member

@andrewtavis andrewtavis left a comment

Bringing this down and integrating it into the other checks. Thanks, @KesharwaniArpita!

@andrewtavis andrewtavis merged commit 4ae7453 into scribe-org:main Oct 24, 2024
5 checks passed
@KesharwaniArpita KesharwaniArpita deleted the AK-Workflow-Metadata branch October 26, 2024 17:47
Labels
hacktoberfest-accepted Accepted as a part of Hacktoberfest
5 participants