Two news article datasets, originating from BBC News, are provided for use as benchmarks for machine learning research. These datasets are made available for non-commercial and research purposes only. If you make use of these datasets, please consider citing the publication:
D. Greene and P. Cunningham. "Practical Solutions to the Problem of Diagonal Dominance in Kernel Document Clustering", in Proc. 23rd International Conference on Machine Learning (ICML'06), 2006.
All rights, including copyright, in the content of the original articles are owned by the BBC.
- Consists of 2225 documents from the BBC news website corresponding to articles in five topical areas published during 2004-2005.
- 5 annotated class labels: business, entertainment, politics, sport, tech
- Each article is stored in a separate text file and articles are divided into sub-directories by class.
All rights, including copyright, in the content of the original articles are owned by the BBC.
- Consists of 737 documents from the BBC Sport website corresponding to sports news articles in five topical areas published during 2004-2005.
- 5 annotated class labels: athletics, cricket, football, rugby, tennis
- Each article is stored in a separate text file and articles are divided into sub-directories by class.
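Because both datasets use the same layout (one text file per article, one sub-directory per class), a single loader can read either of them. The following is a minimal sketch, not an official loader: the function name `load_articles`, the `bbc` directory path, the `.txt` file extension, and the choice to replace undecodable bytes are all assumptions, since the description above does not state the archive name, file extension, or encoding.

```python
from collections import Counter
from pathlib import Path


def load_articles(root):
    """Return (texts, labels) for a directory with one sub-directory per class."""
    texts, labels = [], []
    root = Path(root)
    # Each sub-directory name is taken as the class label.
    for class_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        # Assumption: articles use a .txt extension.
        for path in sorted(class_dir.glob("*.txt")):
            # Assumption: the encoding is not documented, so undecodable
            # bytes are replaced instead of raising an error.
            texts.append(path.read_text(encoding="utf-8", errors="replace"))
            labels.append(class_dir.name)
    return texts, labels


if __name__ == "__main__":
    data_dir = Path("bbc")  # hypothetical location of the unpacked dataset
    if data_dir.is_dir():
        texts, labels = load_articles(data_dir)
        # Print the total number of articles and the count per class.
        print(len(texts), Counter(labels))
```

Pointing `load_articles` at the unpacked BBC Sport directory instead should yield the five sport labels in the same way, provided the layout matches the description.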
For further information, please contact Derek Greene.