HuggingFace Flax-Community project. Forum link: https://discuss.huggingface.co/t/pretrain-roberta-from-scratch-in-hindi

Pretrain RoBERTa for the Hindi language and compare its performance on various downstream tasks with existing models.
To create a stable pipeline for building a large corpus of Hindi text, we brought the different datasets available on HuggingFace's dataset hub and Kaggle under a single wrapper using the `datasets` library. Dataset loading scripts for all the Hindi datasets we could find are available in the `datasets/` directory.
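Each loading script can be consumed directly with `datasets.load_dataset`; a minimal sketch (the script filename below is a hypothetical placeholder, substitute any script from `datasets/`):

```python
from datasets import load_dataset

# Load one Hindi corpus via its local loading script.
# "hindi_corpus.py" is a placeholder; use any script from the datasets/ directory.
hindi_corpus = load_dataset("datasets/hindi_corpus.py", split="train")
print(hindi_corpus[0]["text"])  # assumes the script exposes a "text" column
```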
`concatenate_datasets_and_clean.py` merges all the datasets (with cleaning hooks) into a single large dataset that can be fed seamlessly into the HuggingFace transformers MLM training loop.
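Under the hood this amounts to `datasets.concatenate_datasets` plus per-dataset cleaning hooks applied via `map`; a minimal sketch, with assumed script names and a simplified cleaning hook (not the exact logic in `concatenate_datasets_and_clean.py`):

```python
import re
from datasets import load_dataset, concatenate_datasets

def clean_text(example):
    # Simplified cleaning hook: collapse whitespace; the real hooks may do more.
    return {"text": re.sub(r"\s+", " ", example["text"]).strip()}

# Placeholder script names standing in for the scripts under datasets/.
parts = [
    load_dataset("datasets/hindi_corpus_a.py", split="train"),
    load_dataset("datasets/hindi_corpus_b.py", split="train"),
]
# Keep only the shared "text" column so the datasets can be concatenated.
cleaned = [
    ds.map(clean_text).remove_columns([c for c in ds.column_names if c != "text"])
    for ds in parts
]
merged = concatenate_datasets(cleaned)
merged.save_to_disk("hindi_mlm_corpus")  # ready for the MLM training loop
```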
One minor fix: the WikiNER Hindi downstream task in the HuggingFace `datasets` library was buggy, so `indic_glue_dataset_fixed/indic_glue.py` needs to be loaded instead when performing downstream evaluation.
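Concretely, downstream evaluation should point `load_dataset` at the fixed local script instead of the hub copy; for example (the `wiki-ner.hi` config name follows IndicGLUE's naming convention):

```python
from datasets import load_dataset

# Load WikiNER Hindi from the patched local script rather than the buggy hub version.
wikiner_hi = load_dataset("indic_glue_dataset_fixed/indic_glue.py", "wiki-ner.hi")
print(wikiner_hi["train"][0])
```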
Built on the IndicGLUE benchmarks, the `benchmarks/` directory contains helpers for fine-tuning and downstream evaluation on the following tasks (for multiple reference models available on the model hub, with results logged to wandb); a minimal fine-tuning sketch follows the list:
- BBCA news classification
- IITP product reviews
- IITP movie reviews
- WikiNER (Named entity recognition)
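The sketch below shows what such a helper does, fine-tuning on IITP product reviews with the `Trainer` API and wandb logging. The `iitp-pr.hi` config name, column names, and label count follow IndicGLUE conventions and are assumptions here, not the exact code in `benchmarks/`:

```python
import numpy as np
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

dataset = load_dataset("indic_glue_dataset_fixed/indic_glue.py", "iitp-pr.hi")
tokenizer = AutoTokenizer.from_pretrained("flax-community/roberta-hindi")
model = AutoModelForSequenceClassification.from_pretrained(
    "flax-community/roberta-hindi", num_labels=3)  # assumed: positive/neutral/negative

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

dataset = dataset.map(tokenize, batched=True)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    return {"accuracy": (np.argmax(logits, axis=-1) == labels).mean()}

args = TrainingArguments(
    output_dir="iitp-pr-roberta-hindi",
    per_device_train_batch_size=32,
    num_train_epochs=3,
    report_to="wandb",  # log metrics to Weights & Biases
)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    tokenizer=tokenizer,  # enables dynamic padding via the default collator
    compute_metrics=compute_metrics,
)
trainer.train()
print(trainer.evaluate(dataset["test"]))
```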
Note: We are still in the process of cleaning up some Colab notebooks and will commit an end-to-end working script soon.
You can use the pretrained model directly with a `fill-mask` pipeline (the prompt means "We wish you a pleasant <mask>"):

```python
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='flax-community/roberta-hindi')
>>> unmasker("हम आपके सुखद <mask> की कामना करते हैं")
[{'score': 0.3310680091381073,
  'sequence': 'हम आपके सुखद सफर की कामना करते हैं',
  'token': 1349,
  'token_str': ' सफर'},
 {'score': 0.15317578613758087,
  'sequence': 'हम आपके सुखद पल की कामना करते हैं',
  'token': 848,
  'token_str': ' पल'},
 {'score': 0.07826550304889679,
  'sequence': 'हम आपके सुखद समय की कामना करते हैं',
  'token': 453,
  'token_str': ' समय'},
 {'score': 0.06304813921451569,
  'sequence': 'हम आपके सुखद पहल की कामना करते हैं',
  'token': 404,
  'token_str': ' पहल'},
 {'score': 0.058322224766016006,
  'sequence': 'हम आपके सुखद अवसर की कामना करते हैं',
  'token': 857,
  'token_str': ' अवसर'}]
```
The RoBERTa Hindi model was pretrained on the union of the following datasets:
- OSCAR is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
- mC4 is a multilingual colossal, cleaned version of Common Crawl's web crawl corpus.
- IndicGLUE is a natural language understanding benchmark.
- Samanantar is a parallel corpora collection for Indic languages.
- Hindi Text Short and Large Summarization Corpus is a collection of ~180k articles with their headlines and summaries collected from Hindi news websites.
- Hindi Text Short Summarization Corpus is a collection of ~330k articles with their headlines collected from Hindi news websites.
- Old Newspapers Hindi is a cleaned subset of HC Corpora newspapers.
RoBERTa Hindi is evaluated on various downstream tasks. The results are summarized below.
| Task | Task Type | IndicBERT | HindiBERTa | Indic Transformers Hindi BERT | RoBERTa Hindi Guj San | RoBERTa Hindi |
|---|---|---|---|---|---|---|
| BBC News Classification | Genre Classification | 76.44 | 66.86 | 77.6 | 64.9 | 73.67 |
| WikiNER | Token Classification | - | 90.68 | 95.09 | 89.61 | 92.76 |
| IITP Product Reviews | Sentiment Analysis | 78.01 | 73.23 | 78.39 | 66.16 | 75.53 |
| IITP Movie Reviews | Sentiment Analysis | 60.97 | 52.26 | 70.65 | 49.35 | 61.29 |
Team members:
- Aman K (amankhandelia)
- Haswanth Aekula (hassiahk)
- Kartik Godawat (dk-crazydiv)
- Prateek Agrawal (prateekagrawal)
- Rahul Dev (mlkorra)
Project goal: create the best Hindi language model for use in real-world applications.