Problem statement
Hi,
I've just found a new and very cool NER dataset for Bavarian. It is featured in the 2024 LREC-COLING paper "Sebastian, Basti, Wastl?! Recognizing Named Entities in Bavarian Dialectal Data" by Peng et al.:
Named Entity Recognition (NER) is a fundamental task to extract key information from texts, but annotated resources are scarce for dialects. This paper introduces the first dialectal NER dataset for German, BarNER, with 161K tokens annotated on Bavarian Wikipedia articles (bar-wiki) and tweets (bar-tweet), using a schema adapted from German CoNLL 2006 and GermEval. The Bavarian dialect differs from standard German in lexical distribution, syntactic construction, and entity information. We conduct in-domain, cross-domain, sequential, and joint experiments on two Bavarian and three German corpora and present the first comprehensive NER results on Bavarian. Incorporating knowledge from the larger German NER (sub-)datasets notably improves on bar-wiki and moderately on bar-tweet. Inversely, training first on Bavarian contributes slightly to the seminal German CoNLL 2006 corpus. Moreover, with gold dialect labels on Bavarian tweets, we assess multi-task learning between five NER and two Bavarian-German dialect identification tasks and achieve NER SOTA on bar-wiki. We substantiate the necessity of our low-resource BarNER corpus and the importance of diversity in dialects, genres, and topics in enhancing model performance.
The dataset is already released and available in the BarNER repository in this folder.
I can't wait to fine-tune and release NER models for Bavarian 🥨
Solution
I would implement BarNER as NER_BAVARIAN, NER_BARNER or NER_BAVARIAN_WIKI.
The constructor should have a corpora attribute with the following valid corpora: wiki, tweet and all. It should then return a MultiCorpus, because BarNER itself consists of two different corpora. However, the tweet corpus does not seem to be freely available (probably due to license restrictions of the Twitter/X API).
With an additional revision attribute, the user can pass a specific commit/revision, with the default pointing to main.
To distinguish between the coarse-grained and the fine-grained NER tagset, we can introduce a fine_grained_classes: bool = False attribute (inspired by GERMEVAL_2018_OFFENSIVE_LANGUAGE). A rough sketch of the proposed interface follows below.
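As a starting point, here is a minimal sketch of how the loader could look, modeled on existing flair.datasets classes. Everything specific to BarNER here is an assumption from this proposal, not existing Flair API: the class name, the corpora/revision/fine_grained_classes parameters, the placeholder file names, and the column indices would all have to be adjusted to the released files.

```python
from pathlib import Path
from typing import List

import flair
from flair.data import MultiCorpus
from flair.datasets import ColumnCorpus


class NER_BAVARIAN(MultiCorpus):
    """Proposed BarNER loader (sketch only, not part of Flair yet)."""

    def __init__(
        self,
        corpora: List[str] = ["wiki"],        # valid values: "wiki", "tweet", "all"
        revision: str = "main",               # commit/tag/branch of the BarNER repository
        fine_grained_classes: bool = False,   # coarse vs. fine-grained tagset
        **corpusargs,
    ):
        if "all" in corpora:
            corpora = ["wiki", "tweet"]

        data_folder = Path(flair.cache_root) / "datasets" / "barner" / revision
        # ... download the data files for the requested revision into data_folder here ...

        # Assumption: coarse and fine-grained tags live in different columns;
        # the indices must be adapted to the actual file layout.
        ner_column = 2 if fine_grained_classes else 1
        column_format = {0: "text", ner_column: "ner"}

        parts = []
        for name in corpora:
            parts.append(
                ColumnCorpus(
                    data_folder / name,
                    column_format,
                    train_file=f"bar_{name}_train.conllup",  # placeholder file names
                    dev_file=f"bar_{name}_dev.conllup",
                    test_file=f"bar_{name}_test.conllup",
                    **corpusargs,
                )
            )

        super().__init__(parts, name="barner")
```

Usage would then look like corpus = NER_BAVARIAN(corpora=["wiki"], fine_grained_classes=True), mirroring the other Flair dataset loaders.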
Additional Context
The dataset loader should be unit tested. For example, the total number of parsed sentences is reported in Table 1 of the paper; comparing against those counts could serve as a basic unit test (see the sketch below).
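A possible shape for such a test, following the pattern of Flair's existing dataset tests. The expected count is left as a placeholder to be filled in from Table 1, and NER_BAVARIAN is the class proposed above, which does not exist in Flair yet:

```python
import pytest

from flair.datasets import NER_BAVARIAN  # proposed above, not in Flair yet

# Placeholder: replace with the bar-wiki sentence count reported in Table 1 of the paper.
EXPECTED_WIKI_SENTENCES = None


@pytest.mark.skipif(EXPECTED_WIKI_SENTENCES is None, reason="fill in the count from Table 1")
def test_barner_wiki_sentence_count():
    corpus = NER_BAVARIAN(corpora=["wiki"])
    total = len(corpus.train) + len(corpus.dev) + len(corpus.test)
    assert total == EXPECTED_WIKI_SENTENCES
```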
Unfortunately, the data for the tweet corpus is not available:
# newdoc id = bar_tweet_1605176258652143616
# sent_id = bar_tweet_1605176258652143616-1
# text = _____________ _____ ___ _
_____________ O
_____ O
___ O
_ O
However, data for the wiki corpus is available and it even has document boundary information:
# newdoc id = bar_wiki_Automat
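For the file layout shown above, a rough sketch of reading the wiki split with a plain ColumnCorpus, assuming its comment handling fits the format. The path and column indices are placeholders, and turning the "# newdoc id = ..." markers into real document boundaries (e.g. for document-level context) would need extra logic in the proposed loader:

```python
from flair.datasets import ColumnCorpus

# Sketch only: read the two-column wiki files and skip the "# newdoc id = ...",
# "# sent_id = ..." and "# text = ..." comment lines.
corpus = ColumnCorpus(
    "path/to/barner/wiki",                # placeholder path
    column_format={0: "text", 1: "ner"},
    comment_symbol="#",
)
```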