The project scrapes articles from a Malayalam newspaper website to create a corpus. A set of queries is created and the corresponding ground-truth answers are retrieved. The result can serve as a dataset for evaluating future tools such as Malayalam stemmers, stopword removers, and lemmatizers.

ABHISHEKVALSAN/Malayalam-Newspaper-Article-Dataset

Malayalam-Newspaper-Article-Dataset

This project scraped articles from the website of a Malayalam newspaper (Janmabhumi) to create a corpus of news articles. A set of queries was also created, and the corresponding ground-truth answers were retrieved using a combination of the BM25 and TF-IDF methods. The dataset can be useful for building tools such as stemmers, stopword removers, and lemmatizers.

The dataset includes news articles from 2014 to 2018.
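The README does not say how the BM25 and TF-IDF scores were combined to pick ground-truth answers. The following is a minimal sketch of one plausible approach, not the project's actual code: it scores each document with both methods, min-max normalises the two score lists, and averages them. The function names, the whitespace tokenizer, the `k1` and `b` values, the averaging rule, and the English placeholder documents are all assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: plain whitespace tokenization, no stemming or stopword removal.
    return text.lower().split()


def idf_table(docs_tokens):
    # Smoothed IDF that stays positive even for very common terms.
    n = len(docs_tokens)
    df = Counter(term for doc in docs_tokens for term in set(doc))
    return {t: math.log(1 + (n - d + 0.5) / (d + 0.5)) for t, d in df.items()}


def tfidf_scores(query, docs_tokens, idf):
    # Sum of (relative term frequency * IDF) over the query terms.
    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        length = len(doc) or 1
        scores.append(sum(tf[t] / length * idf.get(t, 0.0) for t in query))
    return scores


def bm25_scores(query, docs_tokens, idf, k1=1.5, b=0.75):
    # Okapi BM25 with length normalisation against the average document length.
    avgdl = (sum(len(d) for d in docs_tokens) / len(docs_tokens)) or 1
    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        norm = k1 * (1 - b + b * len(doc) / avgdl)
        scores.append(
            sum(idf.get(t, 0.0) * tf[t] * (k1 + 1) / (tf[t] + norm) for t in query)
        )
    return scores


def normalise(scores):
    # Min-max scaling to [0, 1] so the two methods are comparable.
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]


def rank(query_text, docs, top_k=3):
    # Assumption: the combined score is the mean of the two normalised scores.
    docs_tokens = [tokenize(d) for d in docs]
    query = tokenize(query_text)
    idf = idf_table(docs_tokens)
    tfidf = normalise(tfidf_scores(query, docs_tokens, idf))
    bm25 = normalise(bm25_scores(query, docs_tokens, idf))
    combined = [(x + y) / 2 for x, y in zip(tfidf, bm25)]
    order = sorted(range(len(docs)), key=lambda i: combined[i], reverse=True)
    return [(i, combined[i]) for i in order[:top_k]]


if __name__ == "__main__":
    # Placeholder English documents standing in for Malayalam articles.
    docs = [
        "kerala floods relief camps opened",
        "cricket team wins final match",
        "relief funds announced for kerala floods victims",
    ]
    for doc_id, score in rank("kerala floods relief", docs):
        print(doc_id, round(score, 3))
```

Averaging normalised scores is only one option; taking the intersection of each method's top-k results would be an equally reasonable reading of "combination".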

Note

This repository is obsolete, and scraping no longer works on the site mentioned above.

DATASET

Directly download the complete dataset from Dropbox

Email : [email protected]

Related Works

A similar repository with a Telugu dataset can be found here.
