[Feature]: Add support for Bavarian NER Dataset (BarNER) #3533

stefan-it · 2024-08-20T21:00:57Z

Problem statement

Hi,

I've just found a new and very cool NER dataset for Bavarian:

Named Entity Recognition (NER) is a fundamental task to extract key information from texts, but annotated resources are scarce for dialects. This paper introduces the first dialectal NER dataset for German, BarNER, with 161K tokens annotated on Bavarian Wikipedia articles (bar-wiki) and tweets (bar-tweet), using a schema adapted from German CoNLL 2006 and GermEval. The Bavarian dialect differs from standard German in lexical distribution, syntactic construction, and entity information. We conduct in-domain, cross-domain, sequential, and joint experiments on two Bavarian and three German corpora and present the first comprehensive NER results on Bavarian. Incorporating knowledge from the larger German NER (sub-)datasets notably improves on bar-wiki and moderately on bar-tweet. Inversely, training first on Bavarian contributes slightly to the seminal German CoNLL 2006 corpus. Moreover, with gold dialect labels on Bavarian tweets, we assess multi-task learning between five NER and two Bavarian-German dialect identification tasks and achieve NER SOTA on bar-wiki. We substantiate the necessity of our low-resource BarNER corpus and the importance of diversity in dialects, genres, and topics in enhancing model performance.

It was introduced in the 2024 LREC-COLING paper "Sebastian, Basti, Wastl?! Recognizing Named Entities in Bavarian Dialectal Data" by Peng et al.

The dataset is already released and available in the BarNER repository in this folder.

I can't wait to fine-tune and release NER models for Bavarian 🥨

Solution

I would implement BarNER as NER_BAVARIAN, NER_BARNER or NER_BAVARIAN_WIKI.

The constructor should take a corpora argument that accepts wiki, tweet and all. It should return a MultiCorpus, because BarNER itself consists of two different corpora. However, the tweet corpus does not seem to be freely available (probably due to license restrictions of the Twitter/X API).

With an additional revision argument, the user can pass a specific commit/revision; the default points to main.

To distinguish between the coarse and the fine-grained NER tagset, we can introduce a fine_grained_classes: bool = False argument (inspired by GERMEVAL_2018_OFFENSIVE_LANGUAGE).
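The three proposed arguments could be validated roughly as follows. This is only a sketch of the argument handling, not Flair code: the name BarNERConfig, the helper selected_corpora, and the choice of all as the default for corpora are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List

# Values proposed in this issue for the corpora argument.
VALID_CORPORA = ("wiki", "tweet", "all")


@dataclass(frozen=True)
class BarNERConfig:
    """Hypothetical container for the proposed constructor arguments."""

    corpora: str = "all"                # assumption: the issue names no default
    revision: str = "main"              # commit/revision, defaults to main
    fine_grained_classes: bool = False  # coarse tagset unless requested

    def __post_init__(self) -> None:
        # Reject anything other than wiki, tweet or all.
        if self.corpora not in VALID_CORPORA:
            raise ValueError(
                f"corpora must be one of {VALID_CORPORA}, got {self.corpora!r}"
            )

    def selected_corpora(self) -> List[str]:
        # "all" expands to both sub-corpora, which is why the real loader
        # would return a MultiCorpus rather than a single corpus.
        if self.corpora == "all":
            return ["wiki", "tweet"]
        return [self.corpora]
```

A real loader would then build one corpus per entry of selected_corpora() and wrap them in a MultiCorpus.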

Additional Context

The dataset loader should be unit tested. For example, the total number of parsed sentences is listed in Table 1 of the paper and could serve as a basic unit test.

Unfortunately, the data for the tweet corpus is not available; the tokens are masked with underscores:

# newdoc id = bar_tweet_1605176258652143616
# sent_id = bar_tweet_1605176258652143616-1
# text = _____________ _____ ___ _
_____________	O
_____	O
___	O
_	O

However, data for the wiki corpus is available and it even has document boundary information:

# newdoc id = bar_wiki_Automat
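
A minimal sketch of how the format shown in the samples above could be read, including the document boundaries and a check for masked tweets. It assumes tab-separated token/tag columns and blank lines between sentences, as in the tweet sample; the names parse_sentences, is_masked and Sentence are hypothetical and not part of Flair or the BarNER repository.

```python
from typing import Iterable, Iterator, List, NamedTuple, Optional, Tuple

DOC_PREFIX = "# newdoc id = "
SENT_PREFIX = "# sent_id = "


class Sentence(NamedTuple):
    doc_id: Optional[str]
    sent_id: Optional[str]
    tokens: List[Tuple[str, str]]  # (token, tag) pairs


def parse_sentences(lines: Iterable[str]) -> Iterator[Sentence]:
    """Yield sentences from BarNER-style lines (assumed two-column format)."""
    doc_id: Optional[str] = None
    sent_id: Optional[str] = None
    tokens: List[Tuple[str, str]] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith(DOC_PREFIX):
            doc_id = line[len(DOC_PREFIX):]      # document boundary
        elif line.startswith(SENT_PREFIX):
            sent_id = line[len(SENT_PREFIX):]
        elif line.startswith("#") and "\t" not in line:
            continue                              # other comments, e.g. "# text = ..."
        elif not line.strip():
            if tokens:                            # blank line closes a sentence
                yield Sentence(doc_id, sent_id, tokens)
                tokens, sent_id = [], None
        else:
            # Token lines; a "#hashtag<TAB>O" line lands here because it has a tab.
            token, tag = line.split("\t")[:2]
            tokens.append((token, tag))
    if tokens:                                    # file may lack a trailing blank line
        yield Sentence(doc_id, sent_id, tokens)


def is_masked(sentence: Sentence) -> bool:
    """True if every token consists only of underscores, as in the tweet sample."""
    return all(tok != "" and set(tok) == {"_"} for tok, _ in sentence.tokens)
```

Counting the sentences yielded by parse_sentences for each split would give the number to compare against Table 1 of the paper, and is_masked could be used to skip or flag the unavailable tweet data.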


stefan-it added the feature A new feature label Aug 20, 2024

stefan-it self-assigned this Aug 20, 2024

stefan-it mentioned this issue Aug 20, 2024: Status of Twitter/X Data mainlp/BarNER#1 (Closed)
