EggplantElf/news_crawler

news_crawler

A news crawler that reads only from RSS feeds. URLs found inside the news articles are not followed; this might change later.
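As a rough illustration of what "crawling from RSS feeds only" means, the sketch below pulls the title, link and publication date out of each `<item>` of an RSS 2.0 document using only the standard library. It is not taken from crawler.py: the function name `parse_rss_items`, the `SAMPLE` feed and the returned dictionary keys are all made up for this example, and the real crawler may well use a dedicated feed-parsing package instead.

```python
import xml.etree.ElementTree as ET


def parse_rss_items(xml_text):
    """Return one dict per <item> element of an RSS 2.0 document.

    Hypothetical helper for illustration; missing fields become "".
    """
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": (item.findtext("title") or "").strip(),
            "link": (item.findtext("link") or "").strip(),
            "published": (item.findtext("pubDate") or "").strip(),
        })
    return items


# A tiny made-up feed, so the example runs without network access.
SAMPLE = """<?xml version="1.0"?>
<rss version="2.0"><channel>
  <title>Example feed</title>
  <item><title>First story</title><link>http://example.com/1</link>
        <pubDate>Mon, 01 Jan 2024 00:00:00 GMT</pubDate></item>
  <item><title>Second story</title><link>http://example.com/2</link></item>
</channel></rss>"""

if __name__ == "__main__":
    for entry in parse_rss_items(SAMPLE):
        print(entry["link"], entry["title"])
```

Only the links listed in the feed itself are collected here; links that appear inside the article pages are ignored, which matches the limitation described above.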

dependencies

To run the crawler, you probably need to install some Python packages, all of which are available through pip.

The command-line tool lynx is also used, so it must be installed and on your PATH.
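The README does not say what lynx is used for. A common use in crawlers is turning a fetched HTML page into plain text, and the sketch below assumes exactly that. The helper names `build_lynx_command` and `page_text` are hypothetical and not part of crawler.py; the `-dump` and `-nolist` options are standard lynx flags.

```python
import shutil
import subprocess


def build_lynx_command(url):
    """Build the lynx invocation (hypothetical helper).

    -dump   prints the rendered page as plain text on stdout
    -nolist leaves out the numbered list of links at the end
    """
    return ["lynx", "-dump", "-nolist", url]


def page_text(url, timeout=30):
    """Return the plain-text rendering of a page, assuming lynx is installed."""
    if shutil.which("lynx") is None:
        raise RuntimeError("lynx is not installed or not on PATH")
    result = subprocess.run(
        build_lynx_command(url),
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return result.stdout
```

Calling `page_text("http://example.com/")` should return the visible text of the page, provided lynx is available on the machine.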

usage

  • the file "feeds.txt" contains some news feeds; you can add more
  • start the MongoDB daemon (@IMS people: please kindly ask Edgar to update MongoDB on the servers, the version is far too old)
$ mongod --dbpath [DATABASE-PATH]
  • run the crawler, or use -h for help
$ python crawler.py -t [NUM-OF-THREADS] -d [DATABASE-NAME] -f [FEEDS-FILE]
  • you can use any scheduling tool to run the crawler just once or twice a day, since RSS feeds do not update that fast
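To make the three command-line options above more concrete, here is a minimal sketch of how such an interface could be wired up. It is not the actual crawler.py. The default values, the one-URL-per-line layout assumed for the feeds file, and the names `build_parser`, `read_feeds` and `crawl_all` are all assumptions made for this example.

```python
import argparse
from concurrent.futures import ThreadPoolExecutor


def build_parser():
    """Mirror the -t / -d / -f options from the usage line (defaults are assumed)."""
    parser = argparse.ArgumentParser(description="crawl news from RSS feeds")
    parser.add_argument("-t", dest="threads", type=int, default=4,
                        help="number of worker threads")
    parser.add_argument("-d", dest="database", default="news",
                        help="name of the MongoDB database")
    parser.add_argument("-f", dest="feeds_file", default="feeds.txt",
                        help="file listing the feeds to crawl")
    return parser


def read_feeds(path):
    """Return the non-blank lines of the feeds file (assumed: one feed URL per line)."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def crawl_all(feed_urls, fetch, num_threads):
    """Apply fetch to every feed URL on a thread pool, keeping the input order."""
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        return list(pool.map(fetch, feed_urls))


if __name__ == "__main__":
    args = build_parser().parse_args()
    feeds = read_feeds(args.feeds_file)
    # len stands in for a real download-and-store function.
    print(crawl_all(feeds, len, args.threads))
```

A thread pool is a reasonable fit here because fetching feeds is dominated by network waiting, so a handful of threads can keep several downloads in flight at once.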

License

Free to use at your own risk; the author is cowardly not responsible for any unpleasant consequences.
