Enable label alignment for token classification datasets #4277

lewtun · 2022-05-04T07:15:16Z

This PR extends the Dataset.align_labels_with_mapping() method to support alignment of label mappings between datasets and models for token classification (e.g. NER).

Example of usage:

from datasets import load_dataset

ner_ds = load_dataset("conll2003", split="train")
# returns [3, 0, 7, 0, 0, 0, 7, 0, 0]
ner_ds[0]["ner_tags"]
# hypothetical model mapping with O and B-LOC swapped
label2id = {
    "B-LOC": 0,
    "B-MISC": 7,
    "B-ORG": 3,
    "B-PER": 1,
    "I-LOC": 6,
    "I-MISC": 8,
    "I-ORG": 4,
    "I-PER": 2,
    "O": 5,
}
ner_aligned_ds = ner_ds.align_labels_with_mapping(label2id, "ner_tags")
# returns [3, 5, 7, 5, 5, 5, 7, 5, 5]
ner_aligned_ds[0]["ner_tags"]
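For readers who want the idea without downloading the dataset, here is a rough pure-Python sketch of the remapping: each dataset ID is turned into its label name, and that name is looked up in the model's label2id. The helper align_label_ids is hypothetical and is not the code in arrow_dataset.py; the case-insensitive name matching and the int() coercion are assumptions of this sketch.

```python
# Hypothetical stand-in for the remapping step; not the code in arrow_dataset.py.
def align_label_ids(rows, dataset_names, label2id):
    """Map each dataset label ID to its name, then to the model's ID for that name."""
    # Assumption of this sketch: names are compared case-insensitively,
    # and IDs given as strings (e.g. "5") are coerced to integers.
    model_ids = {name.lower(): int(idx) for name, idx in label2id.items()}
    return [[model_ids[dataset_names[i].lower()] for i in row] for row in rows]


# Label names in the order conll2003 uses for "ner_tags".
conll_names = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "B-MISC", "I-MISC"]
# Same hypothetical model mapping as above (O and B-LOC swapped).
label2id = {
    "B-LOC": 0, "B-MISC": 7, "B-ORG": 3, "B-PER": 1, "I-LOC": 6,
    "I-MISC": 8, "I-ORG": 4, "I-PER": 2, "O": 5,
}

aligned = align_label_ids([[3, 0, 7, 0, 0, 0, 7, 0, 0]], conll_names, label2id)
print(aligned)  # [[3, 5, 7, 5, 5, 5, 7, 5, 5]]
```

Only the O tags change here (0 becomes 5), because B-ORG and B-MISC have the same ID in both mappings.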

Context: we need this in AutoTrain to automatically align datasets / models during evaluation. cc @abhishekkrthakur

HuggingFaceDocBuilderDev · 2022-05-04T07:26:09Z

The documentation is not available anymore as the PR was closed or merged.

mariosasko

Two nits.

PS: It makes sense to also add support for the Sequence type to class_encode_column. I can work on that.
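To make the idea concrete, here is a minimal sketch of what class-encoding a column of label sequences could look like in plain Python. The helper class_encode_sequences is hypothetical and not part of datasets; it assumes the vocabulary is the sorted set of distinct label strings.

```python
# Hypothetical helper; a sketch of the idea, not an implementation from `datasets`.
def class_encode_sequences(rows):
    """Encode nested string labels as integer IDs; returns (encoded_rows, names)."""
    # Assumption: the vocabulary is the sorted set of distinct label strings.
    names = sorted({label for row in rows for label in row})
    str2int = {name: idx for idx, name in enumerate(names)}
    encoded = [[str2int[label] for label in row] for row in rows]
    return encoded, names


rows = [["B-ORG", "O", "B-MISC"], ["B-PER", "I-PER"]]
encoded, names = class_encode_sequences(rows)
print(names)    # ['B-MISC', 'B-ORG', 'B-PER', 'I-PER', 'O']
print(encoded)  # [[1, 4, 0], [2, 3]]
```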

src/datasets/arrow_dataset.py

Co-authored-by: Mario Šaško

lewtun · 2022-05-04T12:54:30Z

Hmm, not sure why the Windows tests are failing with:

Did not find path entry C:\tools\miniconda3\bin
C:\tools\miniconda3\envs\py37\python.exe: No module named pytest

Edit: running the CI again fixed the problem 🙃

mariosasko

One last nit and we can merge then

src/datasets/arrow_dataset.py

Co-authored-by: Mario Šaško

lewtun · 2022-05-05T05:15:46Z

> One last nit and we can merge then

Thanks, done!

mariosasko

Looks good now. Thanks!

lewtun added 3 commits May 3, 2022 17:58

Extend align_labels_with_mapping for token classification

988672e

Use list comprehensions

cf0d8f0

Add test

f181bfe

lewtun requested review from mariosasko and lhoestq May 4, 2022 07:18

mariosasko reviewed May 4, 2022

lewtun and others added 2 commits May 4, 2022 14:35

Apply suggestions from code review

a51ab12

Co-authored-by: Mario Šaško

Fix style

afdd2ed

mariosasko reviewed May 4, 2022

Apply suggestions from code review

6dbae97

Co-authored-by: Mario Šaško

mariosasko approved these changes May 6, 2022

lewtun merged commit 8e20ff5 into master May 6, 2022

lewtun deleted the align-ner branch May 6, 2022 15:36

lhoestq mentioned this pull request May 19, 2022

to_tf_dataset rewrite #4170

Merged

mariosasko mentioned this pull request Oct 26, 2023

Multi label class encoding #6267

Open
