Add Feature to Extract and Verify All Grammatical Features for a Data Type in a Given Language #513

Open
2 tasks done
OmarAI2003 opened this issue Nov 22, 2024 · 5 comments
Labels: feature (New feature or request), help wanted (Extra attention is needed)

Comments

@OmarAI2003
Contributor

Terms

Description

The language_data_extraction directory organizes supported languages into folders, with each language folder containing subfolders for supported data types (e.g., nouns, verbs, adverbs). Within these subfolders, SPARQL files are used to fetch lexical data for grammatical features. One way to enhance the data extraction process is to implement a mechanism that tracks the forms for each data type directly from Wikidata.

Problem Statement

Currently, we face two key challenges:

  1. Listing all possible grammatical features for a given data type in a specific language (e.g., all forms that nouns or verbs can take).
  2. Verifying that our SPARQL queries account for all of these grammatical features. Any gaps here could lead to incomplete or inconsistent data extraction.

Addressing these challenges is essential for accurately capturing all forms of a data type across languages, ultimately improving data quality and consistency.
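
As a rough illustration of the second challenge, the feature combinations a query already covers could be read from the query text itself. This is a minimal sketch, assuming the queries attach features through `wikibase:grammaticalFeature wd:Q...` clauses. The function name `features_in_query`, the sample query, and the QIDs are hypothetical placeholders and are not taken from the repository.

```python
import re

# Matches the object list of a grammaticalFeature clause, e.g. "wd:Q1, wd:Q2".
FEATURE_CLAUSE = re.compile(r"wikibase:grammaticalFeature\s+((?:wd:Q\d+\s*,?\s*)+)")


def features_in_query(query_text: str) -> set[frozenset[str]]:
    """Return each combination of feature QIDs requested by a SPARQL query."""
    combos = set()
    for match in FEATURE_CLAUSE.finditer(query_text):
        combos.add(frozenset(re.findall(r"Q\d+", match.group(1))))
    return combos


# Hypothetical query fragment; the QIDs are placeholders for illustration.
SAMPLE_QUERY = """
OPTIONAL {
    ?lexeme ontolex:lexicalForm ?formA .
    ?formA ontolex:representation ?a ;
        wikibase:grammaticalFeature wd:Q1, wd:Q2 .
}
OPTIONAL {
    ?lexeme ontolex:lexicalForm ?formB .
    ?formB ontolex:representation ?b ;
        wikibase:grammaticalFeature wd:Q1, wd:Q3 .
}
"""

covered = features_in_query(SAMPLE_QUERY)
```

Storing each combination as a `frozenset` makes the comparison independent of the order in which features are written in a query.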

Contribution

No response

@OmarAI2003 OmarAI2003 added the feature New feature or request label Nov 22, 2024
@andrewtavis
Member

Thanks for making the issue, @OmarAI2003! I'll have more information on this in the coming weeks :)

@andrewtavis andrewtavis moved this to Todo in Scribe Board Nov 22, 2024
@andrewtavis andrewtavis mentioned this issue Dec 15, 2024
2 tasks
@andrewtavis andrewtavis added the help wanted Extra attention is needed label Jan 5, 2025
@andrewtavis
Member

@axif0, now that we have the all-forms functionality for Wikidata lexeme dumps, do you want to start working on the check for this? Basically we'd want a check that gets all the forms for all languages and then compares them against what we have in the queries. If the queries are missing forms, then we'd throw an error 😊 Ideally this could also be triggered manually.
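
The comparison described here could look roughly like the following. It is only a sketch under assumptions: the nested `language -> data type -> feature combinations` layout, the function names, and the sample data are all hypothetical, and it does not read the actual lexeme dump or query files.

```python
import sys

# language -> data type -> set of feature combinations (each a frozenset of QIDs)
FormIndex = dict[str, dict[str, set[frozenset[str]]]]


def find_missing_forms(expected: FormIndex, covered: FormIndex) -> FormIndex:
    """Return the combinations present in `expected` but absent from `covered`."""
    missing: FormIndex = {}
    for language, data_types in expected.items():
        for data_type, combos in data_types.items():
            gap = combos - covered.get(language, {}).get(data_type, set())
            if gap:
                missing.setdefault(language, {})[data_type] = gap
    return missing


def run_check(expected: FormIndex, covered: FormIndex) -> int:
    """Print every gap to stderr and return a non-zero code if any were found."""
    missing = find_missing_forms(expected, covered)
    for language, data_types in sorted(missing.items()):
        for data_type, combos in sorted(data_types.items()):
            for combo in sorted(sorted(c) for c in combos):
                print(f"{language}/{data_type}: missing {' + '.join(combo)}", file=sys.stderr)
    return 1 if missing else 0


# Placeholder data standing in for the dump output and the parsed queries.
expected_forms = {"german": {"nouns": {frozenset({"Q1", "Q2"}), frozenset({"Q1", "Q3"})}}}
covered_forms = {"german": {"nouns": {frozenset({"Q1", "Q2"})}}}

exit_code = run_check(expected_forms, covered_forms)
```

Returning an exit code instead of raising keeps the function usable both from an automated job, via `sys.exit(run_check(...))`, and from a manual run.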

@axif0
Collaborator

axif0 commented Jan 6, 2025

> If the queries are missing forms, then we'd throw an error

Thank you for bringing this up! We can start working on the check for this functionality. To clarify, are you suggesting that if any forms are missing in the queries, we should throw an error rather than just issuing a warning?

@andrewtavis
Member

I would say that ideally what would come from this is a GitHub workflow that fails when forms are missing and, on failure, opens a PR with a corrected query that includes them. That way the work of actually writing the queries is taken care of for us, and we can just review the queries once they're written 😊
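
One piece such a workflow would need is a way to generate the query text for a missing form. The sketch below is hypothetical: `render_optional_block`, `add_missing_form`, the variable naming, and the placeholder QIDs are assumptions for illustration, and the block layout is only a guess at how an existing query might be structured.

```python
def render_optional_block(variable: str, feature_qids: list[str]) -> str:
    """Render a SPARQL OPTIONAL block selecting one form by its features."""
    features = ", ".join(f"wd:{qid}" for qid in sorted(feature_qids))
    return (
        "  OPTIONAL {\n"
        f"    ?lexeme ontolex:lexicalForm ?{variable}Form .\n"
        f"    ?{variable}Form ontolex:representation ?{variable} ;\n"
        f"      wikibase:grammaticalFeature {features} .\n"
        "  }\n"
    )


def add_missing_form(query_text: str, variable: str, feature_qids: list[str]) -> str:
    """Insert a new OPTIONAL block before the final closing brace of the query."""
    head, brace, tail = query_text.rpartition("}")
    if not brace:
        raise ValueError("query has no closing brace to insert before")
    return head + render_optional_block(variable, feature_qids) + brace + tail


# Placeholder query and QIDs, for illustration only.
query = "SELECT ?lexeme WHERE {\n  ?lexeme dct:language wd:Q1 .\n}\n"
updated = add_missing_form(query, "example", ["Q3", "Q2"])
```

This sketch only appends the `OPTIONAL` block. A real fix would also need to add the new variable to the `SELECT` clause and pick a meaningful variable name, which is where the PR review step would come in.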

@axif0
Collaborator

axif0 commented Jan 6, 2025

Automating the process with a GitHub workflow that not only identifies the missing forms but also opens a PR with the corrected queries would indeed save a lot of time and effort. Working on it!
