Title: Comprehensive expansion of Ukrainian lexeme extraction queries (Issue #237 fixed) #424
I'm excited to present a substantial enhancement to our Ukrainian language data extraction pipeline. This pull request significantly expands our SPARQL queries to capture a more comprehensive morphological landscape of Ukrainian lexemes across multiple parts of speech. Let's delve into the technical specifics:

1. Verbs 🔠 (query_verbs.sparql):
   - Implemented extraction of finite verb forms:
     * Present tense: 1st, 2nd, 3rd person singular (wd:Q192613 + wd:Q21714344/wd:Q51929049/wd:Q51929074 + wd:Q110786)
     * Past tense: masculine, feminine, neuter singular (wd:Q1240211 + wd:Q499327/wd:Q1775415/wd:Q1775461 + wd:Q110786)
   - Added imperative mood: 2nd person singular (wd:Q22716 + wd:Q51929049 + wd:Q110786)
   - Retained infinitive form extraction (wd:Q179230)

2. Nouns 📚 (query_nouns.sparql):
   - Extended singular case paradigm:
     * Genitive (wd:Q146233), Dative (wd:Q145599), Accusative (wd:Q146078)
     * Instrumental (wd:Q192997), Locative (wd:Q202142)
   - Maintained plural nominative (wd:Q131105 + wd:Q146786) and gender (wdt:P5185) extraction

3. Adjectives 🏷️ (NEW: query_adjectives.sparql):
   - Implemented comprehensive adjectival paradigm:
     * Singular nominative: masculine (wd:Q499327), feminine (wd:Q1775415), neuter (wd:Q1775461)
     * Plural nominative (wd:Q146786)
   - Included degree forms: comparative (wd:Q14169499) and superlative (wd:Q1817208)

4. Adverbs 🔄 (NEW: query_adverbs.sparql):
   - Established query for adverbial extraction:
     * Base form (lemma)
     * Comparative (wd:Q14169499) and superlative (wd:Q1817208) degrees

5. Prepositions 📍 (query_prepositions.sparql):
   - Optimized existing query structure
   - Enhanced case association extraction (wdt:P5713)

6. Proper Nouns 👤 (query_proper_nouns.sparql):
   - Significantly expanded case paradigm for singular:
     * Nominative (lemma), Genitive (wd:Q146233), Dative (wd:Q145599)
     * Accusative (wd:Q146078), Instrumental (wd:Q192997), Locative (wd:Q202142)
   - Crucially added Vocative case (wd:Q185077), essential for direct address in Ukrainian
   - Retained plural nominative (wd:Q131105 + wd:Q146786) and gender (wdt:P5185) extraction

Technical implementation details:
- Utilized OPTIONAL clauses for all non-lemma forms to ensure query robustness
- Implemented consistent use of wikibase:grammaticalFeature for form specification
- Employed REPLACE(STR(?lexeme), "http://www.wikidata.org/entity/", "") for lexeme ID extraction
- Utilized wikibase:label service for human-readable labels where applicable

This enhancement significantly broadens our morphological coverage of Ukrainian, providing a rich dataset for advanced NLP tasks, including but not limited to:
- Morphological analysis and generation
- Named Entity Recognition (NER) with case-sensitive features
- Machine Translation with deep grammatical understanding
- Linguistic research on Ukrainian morphosyntax

I've rigorously tested these queries on the Wikidata Query Service (https://query.wikidata.org/) to ensure optimal performance and accurate results. However, I welcome meticulous review, particularly focusing on:
1. Correctness of Wikidata QIDs for grammatical features
2. Query efficiency and potential for optimization
3. Completeness of morphological paradigms for each part of speech

This pull request represents a significant stride towards a more nuanced and comprehensive representation of Ukrainian in our data pipeline. I'm eager to discuss any suggestions for further refinements or expansions to our linguistic feature set.
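
For reviewers who want to see the pattern without opening the files, here is a minimal sketch of how the implementation details described in this PR could fit together in a noun query. It is not a copy of query_nouns.sparql. The language and category items (wd:Q8798 for Ukrainian, wd:Q1084 for noun) and all variable names are assumptions for illustration, and only two of the listed forms are shown. The prefixes are the ones predefined on the Wikidata Query Service.

```sparql
# Illustrative sketch only; variable names are hypothetical.
# Assumed items (not listed in the PR text): wd:Q8798 = Ukrainian, wd:Q1084 = noun.
SELECT
  # Strip the entity URI prefix so only the lexeme ID (e.g. "L123") remains.
  (REPLACE(STR(?lexeme), "http://www.wikidata.org/entity/", "") AS ?lexemeID)
  ?nomSingular
  ?genSingular
  ?nomPlural
WHERE {
  # The lemma is required and is treated as the nominative singular.
  ?lexeme dct:language wd:Q8798 ;
          wikibase:lexicalCategory wd:Q1084 ;
          wikibase:lemma ?nomSingular .

  # Genitive singular: OPTIONAL, so lexemes lacking this form are still returned.
  OPTIONAL {
    ?lexeme ontolex:lexicalForm ?genSingularForm .
    ?genSingularForm ontolex:representation ?genSingular ;
                     wikibase:grammaticalFeature wd:Q146233, wd:Q110786 .
  }

  # Nominative plural, matched on both features at once.
  OPTIONAL {
    ?lexeme ontolex:lexicalForm ?nomPluralForm .
    ?nomPluralForm ontolex:representation ?nomPlural ;
                   wikibase:grammaticalFeature wd:Q131105, wd:Q146786 .
  }
}
LIMIT 10
```

Each non-lemma form gets its own OPTIONAL block with its own form variable, so a missing form leaves that column unbound instead of dropping the whole lexeme from the results.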
Thank you for the pull request! The Scribe team will do our best to address your contribution as soon as we can. The following is a checklist for maintainers to make sure this process goes as well as possible. Feel free to address the points below yourself in further commits if you realize that actions are needed :) If you're not already a member of our public Matrix community, please consider joining! We'd suggest using Element as your Matrix client, and definitely join the General and Data rooms once you're in. Also consider joining our bi-weekly Saturday dev syncs. It'd be great to have you!
Maintainer checklist
First PR Commit Check
- The commit messages for the remote branch of a new contributor should be checked to make sure their email is set up correctly so that they receive credit for their contribution
- The contributor's name and icon in remote commits should be the same as what appears in the PR
- If there's a mismatch, the contributor needs to make sure that the email they use for GitHub matches what they have for `git config user.email` in their local Scribe-Data repo
Really appreciate the quality of this PR, @Collins-Webdev 😊 Thanks so much for the care you put into the queries. We have a new SPARQL query writing guide that might give you a bit more information to improve, but really a great first PR :)