Enable label alignment for token classification datasets #4277

Merged: 6 commits, May 6, 2022

lewtun (Member) commented May 4, 2022

This PR extends the Dataset.align_labels_with_mapping() method to support alignment of label mappings between datasets and models for token classification (e.g. NER).

Example of usage:

from datasets import load_dataset

ner_ds = load_dataset("conll2003", split="train")
# returns [3, 0, 7, 0, 0, 0, 7, 0, 0]
ner_ds[0]["ner_tags"]
# hypothetical model mapping with O <--> B-LOC
label2id = {
    "B-LOC": "0",
    "B-MISC": "7",
    "B-ORG": "3",
    "B-PER": "1",
    "I-LOC": "6",
    "I-MISC": "8",
    "I-ORG": "4",
    "I-PER": "2",
    "O": "5"
}
ner_aligned_ds = ner_ds.align_labels_with_mapping(label2id, "ner_tags")
# returns [3, 5, 7, 5, 5, 5, 7, 5, 5]
ner_aligned_ds[0]["ner_tags"]
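
For intuition, the remapping can be sketched in plain Python without the library. This is only an illustration, not the implementation in this PR: `align_tags` is a hypothetical helper, and the `dataset_names` order is assumed to match the `ner_tags` feature of conll2003 (it can be checked with `ner_ds.features["ner_tags"].feature.names`).

```python
# Label names indexed by the dataset's own label IDs.
# Assumption: this is the order used by conll2003's "ner_tags" feature.
dataset_names = [
    "O", "B-PER", "I-PER", "B-ORG", "I-ORG",
    "B-LOC", "I-LOC", "B-MISC", "I-MISC",
]

# Hypothetical model mapping from the example above (O <--> B-LOC swapped).
label2id = {
    "B-LOC": "0",
    "B-MISC": "7",
    "B-ORG": "3",
    "B-PER": "1",
    "I-LOC": "6",
    "I-MISC": "8",
    "I-ORG": "4",
    "I-PER": "2",
    "O": "5",
}


def align_tags(tags, dataset_names, label2id):
    """Translate dataset label IDs into model label IDs via the shared label names."""
    # Dataset ID -> label name -> model ID; int() handles IDs stored as strings.
    return [int(label2id[dataset_names[tag]]) for tag in tags]


aligned = align_tags([3, 0, 7, 0, 0, 0, 7, 0, 0], dataset_names, label2id)
print(aligned)  # [3, 5, 7, 5, 5, 5, 7, 5, 5]
```

Only the `O` tags change here (0 becomes 5), because `B-ORG` and `B-MISC` have the same ID in both mappings.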

Context: we need this in AutoTrain to automatically align datasets / models during evaluation. cc @abhishekkrthakur

lewtun requested review from mariosasko and lhoestq May 4, 2022 07:18
mariosasko (Collaborator) left a comment

Two nits.

PS: It makes sense to also add support for the Sequence type to class_encode_column. I can work on that.

lewtun (Member, Author) commented May 4, 2022

Hmm, not sure why the Windows tests are failing with:

Did not find path entry C:\tools\miniconda3\bin
C:\tools\miniconda3\envs\py37\python.exe: No module named pytest

Edit: running the CI again fixed the problem 🙃

mariosasko (Collaborator) left a comment

One last nit and we can merge then

lewtun (Member, Author) commented May 5, 2022

> One last nit and we can merge then

Thanks, done!

mariosasko (Collaborator) left a comment

Looks good now. Thanks!

lewtun merged commit 8e20ff5 into master May 6, 2022
lewtun deleted the align-ner branch May 6, 2022 15:36
lhoestq mentioned this pull request May 19, 2022