Remove config names as yaml keys #4367

Merged: 9 commits, May 20, 2022

lhoestq (Member) commented May 18, 2022

Many datasets have dots in their config names. However, this causes issues with the YAML tags of the dataset cards, since we can't have dots in YAML keys.

To fix this, I removed the per-config separation of tags completely, so there is now a single flat YAML for all configurations. Dataset search doesn't use this info anyway. I removed all the config names used as YAML keys and moved them under a new configs: key.
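To illustrate the flattening, here is a minimal sketch, not the script used in this PR. It assumes the old layout nested config names under each tag (tag, then config name, then a list of values) and that the metadata has already been parsed into plain dicts. The function name `flatten_card_metadata` is hypothetical.

```python
from typing import Any, Dict, List


def flatten_card_metadata(old: Dict[str, Any]) -> Dict[str, Any]:
    """Merge per-config tag values into one flat mapping plus a `configs` list.

    Assumption: a tag whose value is a dict maps config names to lists of
    values. Any other value is copied unchanged.
    """
    flat: Dict[str, Any] = {}
    configs: List[str] = []
    for tag, value in old.items():
        if isinstance(value, dict):
            merged: List[Any] = []
            for config_name, values in value.items():
                if config_name not in configs:
                    configs.append(config_name)
                for item in values:
                    if item not in merged:  # keep first-seen order, no duplicates
                        merged.append(item)
            flat[tag] = merged
        else:
            flat[tag] = value
    if configs:
        # Config names are now plain string values, no longer YAML keys.
        flat["configs"] = configs
    return flat


old = {
    "languages": {"v1.0": ["en"], "v2.0": ["en", "fr"]},
    "pretty_name": "Demo",
}
new = flatten_card_metadata(old)
```

With this input, `new` holds the merged `languages` list, the untouched `pretty_name`, and a `configs` list with the two dotted config names as ordinary strings.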

This is related to #2362 (internal https://github.com/huggingface/moon-landing/issues/946).

Removing the dots from the YAML keys also allows us to do as in #4302, which removes a hack that replaces all the dots with underscores in the YAML tags.

I also added a CI test that checks all the YAML tags to make sure that:

  • they can be parsed using a YAML parser
  • they contain only valid YAML tags like languages or task_ids
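The two checks above could be sketched roughly as follows. This is an illustration under stated assumptions, not the test added in this PR: the names `extract_front_matter`, `check_card`, and `ALLOWED_KEYS` are hypothetical, the allow-list is simply taken from the keys of the example card in this thread, and the YAML parser is passed in as a callable (for instance `yaml.safe_load` from PyYAML) so the sketch does not depend on a specific library.

```python
from typing import Any, Callable, Iterable, List

# Assumption: allow-list copied from the keys of the example card in this PR,
# not from the actual CI configuration.
ALLOWED_KEYS = {
    "annotations_creators", "language_creators", "languages", "licenses",
    "multilinguality", "size_categories", "source_datasets",
    "task_categories", "task_ids", "paperswithcode_id", "pretty_name",
    "train-eval-index", "configs",
}


def extract_front_matter(readme_text: str) -> str:
    """Return the text between the leading '---' line and the next '---' line."""
    lines = readme_text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("README does not start with a '---' YAML block")
    for i, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            return "\n".join(lines[1:i])
    raise ValueError("YAML block is not closed by '---'")


def check_card(
    readme_text: str,
    parse_yaml: Callable[[str], Any],
    allowed_keys: Iterable[str] = ALLOWED_KEYS,
) -> List[str]:
    """Return a list of error messages; an empty list means the card passed."""
    allowed = set(allowed_keys)
    try:
        metadata = parse_yaml(extract_front_matter(readme_text))
    except Exception as err:  # check 1: the block must be parseable
        return [f"YAML block could not be parsed: {err}"]
    if not isinstance(metadata, dict):
        return ["YAML block is not a mapping"]
    # check 2: only known top-level tags are allowed
    return [f"unknown YAML tag: {key}" for key in metadata if key not in allowed]
```

A CI job could then call `check_card(readme_text, yaml.safe_load)` on each changed dataset card and fail when the returned list is not empty.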

lhoestq (Member, Author) commented May 18, 2022

I included the change from #4302 directly in this PR. This way the datasets will be updated right away in the CI (the CI is only triggered when a dataset card is changed).

HuggingFaceDocBuilderDev commented May 18, 2022

The documentation is not available anymore as the PR was closed or merged.

lhoestq (Member, Author) commented May 18, 2022

Alright it's ready now :)

Here is an example for the ade_corpus_v2 dataset card. Notice the new configs key:

---
annotations_creators:
- expert-generated
language_creators:
- found
languages:
- en
licenses:
- unknown
multilinguality:
- monolingual
size_categories:
- 10K<n<100K
- 1K<n<10K
- n<1K
source_datasets:
- original
task_categories:
- text-classification
- token-classification
task_ids:
- coreference-resolution
- fact-checking
paperswithcode_id: null
pretty_name: Adverse Drug Reaction Data v2
train-eval-index:
- config: Ade_corpus_v2_classification
  task: text-classification
  task_id: multi_class_classification
  splits:
    train_split: train
  col_mapping:
    text: text
    label: target
  metrics:
  - type: accuracy
    name: Accuracy
  - type: f1
    name: F1 macro
    args:
      average: macro
  - type: f1
    name: F1 micro
    args:
      average: micro
  - type: f1
    name: F1 weighted
    args:
      average: weighted
  - type: precision
    name: Precision macro
    args:
      average: macro
  - type: precision
    name: Precision micro
    args:
      average: micro
  - type: precision
    name: Precision weighted
    args:
      average: weighted
  - type: recall
    name: Recall macro
    args:
      average: macro
  - type: recall
    name: Recall micro
    args:
      average: micro
  - type: recall
    name: Recall weighted
    args:
      average: weighted
configs:
- Ade_corpus_v2_classification
- Ade_corpus_v2_drug_ade_relation
- Ade_corpus_v2_drug_dosage_relation
---

CI failures are only related to dataset cards missing some content.

lhoestq merged commit 3f30134 into master May 20, 2022
lhoestq deleted the remove-config-names-as-yaml-keys branch May 20, 2022 09:27
albertvillanova linked an issue May 20, 2022 that may be closed by this pull request
Successfully merging this pull request may close these issues.

Normalise license names