derekgreene/bbc-datasets

bbc-datasets

Two news article datasets, originating from BBC News, provided for use as benchmarks for machine learning research. These datasets are made available for non-commercial and research purposes only. If you make use of these datasets, please consider citing the publication:

D. Greene and P. Cunningham. "Practical Solutions to the Problem of Diagonal Dominance in Kernel Document Clustering", in Proc. 23rd International Conference on Machine Learning (ICML'06), 2006.

Dataset: BBC

All rights, including copyright, in the content of the original articles are owned by the BBC.

  • Consists of 2225 documents from the BBC news website corresponding to articles in five topical areas published during 2004-2005.
  • 5 annotated class labels: business, entertainment, politics, sport, tech
  • Each article is stored in a separate text file and articles are divided into sub-directories by class.
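The layout described above (one text file per article, one sub-directory per class) can be read into parallel lists of texts and labels. The following is a minimal sketch, not part of the dataset distribution: the function name `load_labelled_articles`, the root path you pass in, and the choice of UTF-8 with replacement of undecodable bytes are all assumptions, since the README does not state the file encoding or file extension.

```python
from pathlib import Path


def load_labelled_articles(root, encoding="utf-8"):
    """Return (texts, labels) from a root whose sub-directories are class names.

    Assumption: every regular file inside a class sub-directory is one article.
    Files placed directly in the root directory are ignored.
    """
    texts, labels = [], []
    root = Path(root)
    # Sort directories and files so the output order is reproducible.
    for class_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        for article in sorted(p for p in class_dir.iterdir() if p.is_file()):
            # errors="replace" avoids a crash if a file is not valid UTF-8;
            # the actual encoding of the articles is an assumption here.
            texts.append(article.read_text(encoding=encoding, errors="replace"))
            labels.append(class_dir.name)
    return texts, labels
```

Calling `load_labelled_articles("bbc")` (where `"bbc"` is a hypothetical path to the unpacked dataset) would yield one label per article, taken from the name of the sub-directory that contains it.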

Dataset: BBCSport

All rights, including copyright, in the content of the original articles are owned by the BBC.

  • Consists of 737 documents from the BBC Sport website corresponding to sports news articles in five topical areas published during 2004-2005.
  • 5 annotated class labels: athletics, cricket, football, rugby, tennis
  • Each article is stored in a separate text file and articles are divided into sub-directories by class.
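Because the class labels and the document total are stated above, a local copy can be sanity-checked against them. The sketch below is illustrative only: the helper names `count_articles_per_class` and `check_layout` are hypothetical, and it assumes that every regular file inside a class sub-directory is an article.

```python
from collections import Counter
from pathlib import Path

# Values taken from the BBCSport description above.
BBCSPORT_LABELS = {"athletics", "cricket", "football", "rugby", "tennis"}
BBCSPORT_TOTAL = 737


def count_articles_per_class(root):
    """Count regular files in each class sub-directory of root."""
    counts = Counter()
    for class_dir in Path(root).iterdir():
        if class_dir.is_dir():
            counts[class_dir.name] = sum(1 for p in class_dir.iterdir() if p.is_file())
    return counts


def check_layout(root, expected_labels, expected_total):
    """Return a list of human-readable problems; an empty list means the copy matches."""
    counts = count_articles_per_class(root)
    problems = []
    if set(counts) != set(expected_labels):
        problems.append(
            f"labels differ: found {sorted(counts)}, expected {sorted(expected_labels)}"
        )
    total = sum(counts.values())
    if total != expected_total:
        problems.append(f"document count differs: found {total}, expected {expected_total}")
    return problems
```

For example, `check_layout("bbcsport", BBCSPORT_LABELS, BBCSPORT_TOTAL)` (with `"bbcsport"` a hypothetical local path) returns an empty list when the directory holds exactly the five documented classes and 737 files in total.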

Contact

For further information, please contact Derek Greene.
