[Feature]: Add support for Bavarian NER Dataset (BarNER) #3533

Open
stefan-it opened this issue Aug 20, 2024 · 0 comments
stefan-it commented Aug 20, 2024

Problem statement

Hi,

I've just found a new and very cool NER dataset for Bavarian:

Named Entity Recognition (NER) is a fundamental task to extract key information from texts, but annotated resources are scarce for dialects. This paper introduces the first dialectal NER dataset for German, BarNER, with 161K tokens annotated on Bavarian Wikipedia articles (bar-wiki) and tweets (bar-tweet), using a schema adapted from German CoNLL 2006 and GermEval. The Bavarian dialect differs from standard German in lexical distribution, syntactic construction, and entity information. We conduct in-domain, cross-domain, sequential, and joint experiments on two Bavarian and three German corpora and present the first comprehensive NER results on Bavarian. Incorporating knowledge from the larger German NER (sub-)datasets notably improves on bar-wiki and moderately on bar-tweet. Inversely, training first on Bavarian contributes slightly to the seminal German CoNLL 2006 corpus. Moreover, with gold dialect labels on Bavarian tweets, we assess multi-task learning between five NER and two Bavarian-German dialect identification tasks and achieve NER SOTA on bar-wiki. We substantiate the necessity of our low-resource BarNER corpus and the importance of diversity in dialects, genres, and topics in enhancing model performance.

It is featured in the 2024 LREC-COLING paper "Sebastian, Basti, Wastl?! Recognizing Named Entities in Bavarian Dialectal Data" by Peng et al.

The dataset is already released and available in the BarNER repository in this folder.

I can't wait to fine-tune and release NER models for Bavarian 🥨

Solution

I would implement BarNER as a dataset class named NER_BAVARIAN, NER_BARNER or NER_BAVARIAN_WIKI.

The constructor should take a corpora parameter with the following valid values: wiki, tweet and all. It should return a MultiCorpus, because BarNER itself consists of two different corpora. However, the tweet corpus does not seem to be freely available (probably due to license restrictions of the Twitter/X API).

An additional revision parameter lets the user pass a specific commit/revision of the BarNER repository; the default points to main.

To distinguish between the coarse and the fine-grained NER tagset, we can introduce a fine_grained_classes: bool = False parameter (inspired by GERMEVAL_2018_OFFENSIVE_LANGUAGE).
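
To make the proposed parameters concrete, here is a minimal sketch of how they could be validated and resolved. It is plain Python and deliberately does not touch any library API: `BarNEROptions` and `resolve_barner_options` are hypothetical names used only for illustration, and the actual loader may look quite different.

```python
from dataclasses import dataclass
from typing import Tuple

# Valid values for the proposed `corpora` parameter, as listed above.
VALID_CORPORA = ("wiki", "tweet", "all")


@dataclass(frozen=True)
class BarNEROptions:
    """Hypothetical container for the resolved loader options."""

    corpora: Tuple[str, ...]
    revision: str
    fine_grained_classes: bool


def resolve_barner_options(
    corpora: str = "all",
    revision: str = "main",
    fine_grained_classes: bool = False,
) -> BarNEROptions:
    """Validate the proposed parameters and expand "all" into both sub-corpora."""
    if corpora not in VALID_CORPORA:
        raise ValueError(
            f"Unknown corpora value {corpora!r}, expected one of {VALID_CORPORA}"
        )
    selected = ("wiki", "tweet") if corpora == "all" else (corpora,)
    return BarNEROptions(selected, revision, fine_grained_classes)
```

With `corpora="all"` the loader would build one corpus per entry in `selected` and combine them, which is why a MultiCorpus is the natural return type.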

Additional Context

The dataset loader should be unit tested. For example, Table 1 of the paper lists the total number of sentences, so a basic unit test could compare the number of parsed sentences against it.
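
A rough sketch of such a check, assuming the files are tab-separated with blank lines between sentences and metadata lines starting with `# ` (as in the samples in this issue). `count_sentences` and `check_sentence_count` are hypothetical helpers, and the expected value is left as a parameter because it has to be taken from Table 1 of the paper.

```python
def count_sentences(conll_text: str) -> int:
    """Count blank-line-separated sentences that contain at least one token line."""
    count = 0
    in_sentence = False
    for line in conll_text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            in_sentence = False
        elif line.startswith("# "):
            # Metadata such as "# sent_id = ..." is not a token line.
            continue
        elif not in_sentence:
            # First token line of a new sentence.
            in_sentence = True
            count += 1
    return count


def check_sentence_count(conll_text: str, expected: int) -> None:
    """Fail loudly if the parsed sentence count differs from the reference value."""
    actual = count_sentences(conll_text)
    assert actual == expected, f"expected {expected} sentences, parsed {actual}"
```

In a real test, `expected` would be the per-split sentence count from Table 1, and `conll_text` the content of the corresponding file.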

Unfortunately, the text of the tweet corpus is not available. In the released files the tokens are masked with underscores:

# newdoc id = bar_tweet_1605176258652143616
# sent_id = bar_tweet_1605176258652143616-1
# text = _____________ _____ ___ _
_____________	O
_____	O
___	O
_	O
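
Since masked sentences carry no usable text, a loader may want to detect them. Here is a small sketch, assuming that masking always replaces each token with a run of underscores as in the sample above; `is_masked_token` and `is_masked_sentence` are hypothetical helper names.

```python
from typing import Sequence


def is_masked_token(token: str) -> bool:
    """Return True if the token consists only of underscores."""
    return len(token) > 0 and set(token) == {"_"}


def is_masked_sentence(tokens: Sequence[str]) -> bool:
    """Return True if the sentence is non-empty and every token is masked."""
    return len(tokens) > 0 and all(is_masked_token(token) for token in tokens)
```

Such a check could be used to warn the user, or to skip the tweet corpus, when only the masked version is available.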

However, the data for the wiki corpus is available, and it even includes document boundary information:

# newdoc id = bar_wiki_Automat
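
The document boundaries could be used to group sentences per article. Here is a minimal sketch, assuming token lines have the form `token<TAB>tag`, metadata lines start with `# `, and sentences are separated by blank lines. `group_sentences_by_document` is a hypothetical helper, not an existing loader function.

```python
from typing import Dict, List, Tuple

NEWDOC_PREFIX = "# newdoc id = "

Sentence = List[Tuple[str, str]]


def group_sentences_by_document(conll_text: str) -> Dict[str, List[Sentence]]:
    """Group (token, tag) sentences by the preceding "# newdoc id" value."""
    documents: Dict[str, List[Sentence]] = {}
    doc_id = ""  # sentences seen before any "# newdoc id" line end up here
    sentence: Sentence = []

    # The extra empty line guarantees that the last sentence is flushed.
    for line in conll_text.splitlines() + [""]:
        if line.startswith(NEWDOC_PREFIX):
            if sentence:
                # Close a sentence that was not followed by a blank line.
                documents.setdefault(doc_id, []).append(sentence)
                sentence = []
            doc_id = line[len(NEWDOC_PREFIX):].strip()
        elif line.startswith("# "):
            continue  # other metadata, e.g. "# sent_id = ..." or "# text = ..."
        elif "\t" in line:
            columns = line.split("\t")
            sentence.append((columns[0], columns[1]))
        elif not line.strip() and sentence:
            documents.setdefault(doc_id, []).append(sentence)
            sentence = []
    return documents
```

The returned dictionary maps each document id (for example `bar_wiki_Automat`) to its list of sentences, which is enough to attach document-level context to each sentence later on.
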
stefan-it added the feature label on Aug 20, 2024
stefan-it self-assigned this on Aug 20, 2024