
dvguruprasad/tidings


tidings

The deluge of content (text, images, videos, audio, etc.) available on the web on every subject, be it science, politics, sports or history, makes it difficult to find content related to a specific field of interest. What is required is a system that can source content from the web, classify it and present it in a manner that is easier to consume. tidings is one such system, and it deals with text content.

tidings pulls text content from a set of RSS feeds and automatically classifies it into predetermined categories.
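As a rough illustration of the sourcing step, the sketch below turns the items of an RSS 2.0 feed into plain-text documents using only the Python standard library. The inline `SAMPLE_RSS` feed and the `extract_documents` helper are hypothetical and are not taken from tidings; a real run would first download the XML from a feed URL.

```python
import xml.etree.ElementTree as ET

# Hypothetical inline feed used only for illustration.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>New exoplanet found</title>
      <description>Astronomers report a planet orbiting a nearby star.</description>
    </item>
    <item>
      <title>Cup final tonight</title>
      <description>Both teams named unchanged squads.</description>
    </item>
  </channel>
</rss>"""


def extract_documents(rss_xml):
    """Return one text document per <item>: its title plus its description."""
    root = ET.fromstring(rss_xml)
    documents = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        description = item.findtext("description", default="")
        documents.append(f"{title} {description}".strip())
    return documents


documents = extract_documents(SAMPLE_RSS)
for document in documents:
    print(document)
```

Each resulting string can then be handed to the classifier as a single document.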

The content (document) classification part of tidings is a Naive Bayes classifier. Naive Bayes is a supervised, probabilistic text classification model. The idea is to predict the category of a document from the known categories of many other documents. The probability of the document belonging to each category is computed, and the document is assigned to the category with the highest probability.

The probability of a document D with n words W(1), ..., W(n) belonging to a category C is proportional to:

$$P(C \mid D) \propto P(C) \prod_{i=1}^{n} P(W(i) \mid C)$$

where P(W(i)|C) is the probability of word W(i) occurring in a document of category C, and P(C) is the probability of a document belonging to category C. The normalizing term P(D) is the same for every category, so it can be dropped when categories are compared. The two factors are estimated as:

$$P(W(i) \mid C) = \frac{\text{number of occurrences of } W(i) \text{ in documents categorized as } C}{\text{total number of words in documents categorized as } C}$$

$$P(C) = \frac{\text{number of documents categorized as } C}{\text{total number of documents}}$$

P(W(i)|C) and P(C) are computed from a training set, which is the labelled data collected during training.
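The two estimates can be sketched in a few lines of Python. This is a minimal illustration, not the tidings implementation: the tiny `TRAINING_SET`, the whitespace `tokenize` helper and the `estimate` function are all assumptions made for the example.

```python
from collections import Counter, defaultdict

# Hypothetical labelled training set: (category, document text) pairs.
TRAINING_SET = [
    ("Science", "the telescope observed a distant galaxy"),
    ("Science", "the experiment measured the particle"),
    ("Sports", "the team won the match"),
]


def tokenize(text):
    """Naive whitespace tokenizer (an assumption, not tidings' own)."""
    return text.lower().split()


def estimate(training_set):
    """Return (priors, likelihoods) computed by simple counting."""
    word_counts = defaultdict(Counter)  # category -> word -> occurrences
    doc_counts = Counter()              # category -> number of documents
    for category, text in training_set:
        doc_counts[category] += 1
        word_counts[category].update(tokenize(text))

    total_docs = sum(doc_counts.values())
    # P(C) = documents categorized as C / total documents
    priors = {c: n / total_docs for c, n in doc_counts.items()}

    likelihoods = {}
    for c, counts in word_counts.items():
        total_words = sum(counts.values())
        # P(W|C) = occurrences of W in C / total words in C
        likelihoods[c] = {w: n / total_words for w, n in counts.items()}
    return priors, likelihoods


priors, likelihoods = estimate(TRAINING_SET)
print(priors["Science"])              # 2 of the 3 documents
print(likelihoods["Science"]["the"])  # 3 of the 11 Science words
```

With these plain counts, a word that never appeared in a category has no entry at all, which corresponds to a probability of zero for that category.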

Classification in tidings can thus be divided into two phases:

Training: The act of feeding the system documents that have already been labelled as belonging to different categories. During this process, the word frequencies of each document, that is, the number of occurrences of every word, are captured against the given category. Once word frequencies are captured, the probability P(W(i)|C) is computed for each word across all categories.

Classification: The act of assigning a document to a category. Here, the probability of the document belonging to each category, P(C|D), is computed and the document is assigned to the category with the highest probability.
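The two phases can be sketched together as a small Python class. This is an illustrative sketch under stated assumptions rather than the tidings code: the `NaiveBayes` class and its method names are hypothetical, scores are summed as log probabilities to avoid numeric underflow on long documents, and add-one (Laplace) smoothing is swapped in so that a word unseen in a category does not force the score to zero. Neither of those two choices is described in this README.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Hypothetical two-phase classifier: train(), then classify()."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # category -> word -> count
        self.doc_counts = Counter()              # category -> documents seen
        self.vocabulary = set()                  # all words seen in training

    def train(self, category, text):
        """Training phase: capture word frequencies against a category."""
        words = text.lower().split()
        self.doc_counts[category] += 1
        self.word_counts[category].update(words)
        self.vocabulary.update(words)

    def log_score(self, category, text):
        """log P(C) + sum of log P(W(i)|C), up to a constant."""
        total_docs = sum(self.doc_counts.values())
        total_words = sum(self.word_counts[category].values())
        score = math.log(self.doc_counts[category] / total_docs)
        for word in text.lower().split():
            count = self.word_counts[category][word]
            # Add-one smoothing: an assumption, not part of the README formula.
            score += math.log((count + 1) / (total_words + len(self.vocabulary)))
        return score

    def classify(self, text):
        """Classification phase: pick the highest-scoring category."""
        return max(self.doc_counts, key=lambda c: self.log_score(c, text))


classifier = NaiveBayes()
classifier.train("Science", "telescope observed a distant galaxy")
classifier.train("Science", "experiment measured the particle energy")
classifier.train("Sports", "the team won the match")
classifier.train("Sports", "striker scored a late goal")

print(classifier.classify("the striker scored in the match"))
print(classifier.classify("galaxy telescope experiment"))
```

Because the logarithm is monotonic, choosing the highest log score selects the same category as choosing the highest probability.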

tidings currently classifies documents into four categories: Science, Entertainment, Sports and Software development. We collected 68k words as part of training. After training on suitably labelled documents, the system can be used to classify documents into any set of categories.

About

A Content Classification system
