Give Feedback 📑: DSFSI Resource Feedback Form
The dataset contains annotated, categorised news data from Dikgang - Daily News (https://dailynews.gov.bw/news-list/srccategory/10). The data is in Setswana.
See the Data Statement for full details.
This dataset contains machine-readable data extracted from online news articles, from https://dailynews.gov.bw/news-list/srccategory/10, provided by the Botswana Government. While efforts were made to ensure the accuracy and completeness of this data, there may be errors or discrepancies between the original publications and this dataset. No warranties, guarantees or representations are given in relation to the information contained in the dataset. The members of the Data Science for Societal Impact Research Group bear no responsibility and/or liability for any such errors or discrepancies in this dataset. The Botswana Government bears no responsibility and/or liability for any such errors or discrepancies in this dataset. It is recommended that users verify all information contained herein before making decisions based upon this information.
- Vukosi Marivate - @vukosi
- Valencia Wagner
BibTeX Reference
@inproceedings{marivate2023puoberta,
title = {PuoBERTa: Training and evaluation of a curated language model for Setswana},
author = {Vukosi Marivate and Moseli Mots'Oehli and Valencia Wagner and Richard Lastrucci and Isheanesu Dzingirai},
year = {2023},
booktitle= {Artificial Intelligence Research. SACAIR 2023. Communications in Computer and Information Science},
url= {https://link.springer.com/chapter/10.1007/978-3-031-49002-6_17},
keywords = {NLP},
preprint_url = {https://arxiv.org/abs/2310.09141},
dataset_url = {https://github.com/dsfsi/PuoBERTa},
software_url = {https://huggingface.co/dsfsi/PuoBERTa}
}
The News Categorisation dataset is licensed under CC-BY-SA-4.0. The monolingual data have different licenses, depending on the license of each source news website.
- License for Data - CC-BY-SA-4.0