Skip to content

ravi-kumar-yadav/apache-nutch-1.7-plain

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

70 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

apache-nutch-1.7-plain

Code is downloaded from apache site and being modified for MTP2

My MTP2 Project

Suggestions By Swapnil

  1. mere time pe 2 class svm use kiya tha with linear kernel, try with gaussian kernel, linear kaam nahi karega as data increases (ganesh aur sunita ne two class svm use karne bola tha)
  2. make sure koi bhi tourism page ko positive me daalo
  3. humne negative set sirf health ka liya tha... which not enough... saare orthogonal categories dhundo aur sabke thode thode urls base set me daalo. humne india ke bahar waalon ko -ve mark kiya tha, which causes error... meri report me error analysis [padh lena...exact details mil jaayenge
  4. one class better fit hoga yahaa shayad, try to get answer to this question as well, isse jaldi implement karna...10 depths tak results aane ko it takes around 10-12 hrs

Rough Idea

  1. Try Guassian Kernel
  2. Build Larger Negative Training set
  3. Koi bhi tourism page ko positive me daalo

Features

  1. Pos URL Tokens :: Percentage of overlapping URL tokens in the already crawled URLs set.
  2. Pos Parent URL Tokens :: Percentage of overlapping parents URL tokens in the already crawled URLs parent tokens set.
  3. Pos Anchor Text of URL :: Percentage of overlapping anchor texts of the URL in the already crawled URLs anchor text set.
  4. Neg URL Tokens :: Percentage of overlapping URL tokens in the already discarded URL tokens set.
  5. Neg Parent URL Tokens :: Percentage of overlapping parents URL tokens in the already discarded URLs parent tokens set.
  6. Neg Anchor Text of URL :: Percentage of overlapping anchor texts of the URL in the already discarded URLs anchor text set.
  7. Average Parent Score :: Average parent scores of the URL.

About

Code is downloaded from apache site and being modified for MTP2

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published