This repository is an ongoing effort to compile a comprehensive set of NLP benchmarks for Norwegian, together with evaluations of various Norwegian language models on these tasks.
We list the existing test sets and their recommended evaluation metrics,
and provide links to the original evaluation code (where available).
The `evaluation_scripts` directory contains ready-to-use evaluation scripts
for the main tasks (warning: work in progress!).
The tasks that are currently used in the NorBench leaderboard are emphasized in bold, and the benchmark is publicly hosted on HuggingFace.
See more details in our paper: NorBench -- A Benchmark for Norwegian Language Models (Samuel et al., NoDaLiDa 2023).
The paper also introduces a range of novel language models for Norwegian; see our HuggingFace page.
Task | Test Set | Metrics | Evaluation code |
---|---|---|---|
PoS tagging | Bokmaal / Nynorsk / Dialects | Macro F1 UPOS/XPOS | CoNLL 2018 shared task evaluation script |
Dependency parsing | Bokmaal / Nynorsk / Dialects | Unlabeled/Labeled Attachment Score (UAS/LAS) | CoNLL 2018 shared task evaluation script |
Named entity recognition | NorNE Bokmaal and Nynorsk | Entity-level exact match F1 (strict) | script |
Targeted sentiment analysis | NoReC_tsa | Entity-level exact match F1 (strict) | script |
Linguistic acceptability | NoCoLA | Matthews correlation coefficient (MCC) | - script for encoder models - script for encoder-decoder (text-to-text) models |
Question answering | NorQuaD | token-level F1 | script |
Machine translation from Bokmål to Nynorsk | Sample from Omsetjingsminne frå Nynorsk pressekontor and Omsetjingsminne frå Målfrid | SacreBLEU | script |
Structured sentiment analysis | NoReC_fine | Sentiment Graph F1 | Semeval 2022 evaluation script |
Negation cues and scopes | NoReC_neg | StarSem Full Negation F1 | StarSem Perl script |
Co-reference resolution | NARC (annotation ongoing) | MUC | evaluation notebook for NARC (using corefeval) |
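Several tasks in the table above (NER, targeted sentiment analysis) are scored with strict entity-level F1: a predicted entity only counts as correct if both its label and its exact span match the gold annotation. A minimal pure-Python sketch of this scoring (the entity tuples and example data are illustrative; real evaluation should use the linked scripts):

```python
# Strict entity-level (micro) F1: an entity counts as a true positive
# only if its label, start, and end all match a gold entity exactly.
# Entities are hypothetical (label, start, end) tuples for illustration.

def entity_f1(gold, pred):
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)                       # exact matches only
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = [("PER", 0, 2), ("LOC", 5, 6)]
pred = [("PER", 0, 2), ("LOC", 5, 7)]  # wrong span boundary -> no credit
print(entity_f1(gold, pred))  # 0.5
```

Note that partial overlaps earn no credit under this strict regime, which is why boundary errors are penalized as hard as label errors.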
Task | Test Set | Metrics | Evaluation code |
---|---|---|---|
Sentence-level polarity (ternary sentiment classification) | NoReC_sentence | Macro averaged F1 | - script for encoder models - script for encoder-decoder (text-to-text) models |
Document-level polarity (ternary sentiment classification) | Norwegian Review Corpus | Macro averaged F1 | - script for encoder models - script for encoder-decoder (text-to-text) models |
Political affiliation detection | Talk of Norway | ||
Dialect classification in tweets | NorDial | Macro averaged F1 | sklearn |
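The classification tasks above are scored with macro-averaged F1: F1 is computed per class and then averaged without weighting, so small classes count as much as large ones. A pure-Python sketch of what `sklearn.metrics.f1_score(..., average="macro")` computes (the labels and example data are illustrative):

```python
# Macro-averaged F1: per-class F1, then an unweighted mean over classes.
# Equivalent to sklearn.metrics.f1_score(y_true, y_pred, average="macro").

def macro_f1(y_true, y_pred):
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        scores.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(scores) / len(scores)  # unweighted mean over classes

y_true = ["pos", "neg", "neu", "pos"]  # hypothetical ternary labels
y_pred = ["pos", "neg", "pos", "pos"]
print(round(macro_f1(y_true, y_pred), 3))  # 0.6
```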
Task | Test Set | Metrics | Evaluation code |
---|---|---|---|
Synonym detection | Norwegian Synonymy | Can be used for contextualized models with lexical substitution | |
Analogical reasoning* | Norwegian Analogy | ||
Word-level polarity* | NorSentLex | Accuracy | sklearn |
Word sense disambiguation in context (WSDiC) | Norwegian WordNet | Averaged macro F1 | Preliminary example |
Lexical semantic change detection (LSCD) | NorDiaChange | Spearman correlation, Accuracy | SemEval'2020 |
* Type-based (static) models only
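For the LSCD task above, system-predicted change scores are correlated with graded gold judgements using Spearman correlation. A pure-Python sketch for the tie-free case, using the classic rank-difference formula (the example scores are illustrative; `scipy.stats.spearmanr` handles ties and is what evaluation code would normally use):

```python
# Spearman's rho for data without ties: 1 - 6 * sum(d^2) / (n * (n^2 - 1)),
# where d is the per-item difference between the two rankings.

def spearman(x, y):
    def ranks(values):
        order = sorted(range(len(values)), key=values.__getitem__)
        r = [0] * len(values)
        for rank, idx in enumerate(order, start=1):
            r[idx] = rank
        return r

    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

gold = [0.1, 0.4, 0.2, 0.9]     # hypothetical graded change judgements
system = [0.05, 0.5, 0.3, 0.8]  # same ranking as gold -> rho = 1.0
print(spearman(gold, system))  # 1.0
```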