News crawler that reads only RSS feeds. URLs found inside the news articles are not followed; this might change later.
To run the crawler, you probably need to install some Python packages, all available via pip.
The command-line tool lynx is also used, so it needs to be installed as well.
- The file "feeds.txt" contains some news feeds; you can add more feeds to it.
- Start the MongoDB daemon (@IMS people: please kindly ask Edgar to update MongoDB on the servers, the version is far too old):
$ mongod --dbpath [DATABASE-PATH]
- Run the crawler, or use -h for help:
$ python crawler.py -t [NUM-OF-THREADS] -d [DATABASE-NAME] -f [FEEDS-FILE]
- You can use any scheduling tool (e.g. cron) to run the crawler just once or twice a day, since RSS feeds do not update that fast.
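
For orientation, here is a rough sketch of how the steps above could fit together. It is not the actual code in crawler.py. Only the -t, -d and -f flags come from this README; the long option names, the defaults, the helper names (parse_args, read_feeds, extract_items, crawl) and the one-URL-per-line layout of the feeds file are assumptions made for illustration. Downloading the feeds and writing to MongoDB are left out so the sketch needs only the standard library.

```python
import argparse
import xml.etree.ElementTree as ET
from concurrent.futures import ThreadPoolExecutor


def parse_args(argv=None):
    """Parse the -t / -d / -f flags; long names and defaults are hypothetical."""
    parser = argparse.ArgumentParser(description="RSS-only news crawler (sketch)")
    parser.add_argument("-t", "--threads", type=int, default=4,
                        help="number of worker threads")
    parser.add_argument("-d", "--database", default="news",
                        help="MongoDB database name")
    parser.add_argument("-f", "--feeds", default="feeds.txt",
                        help="file listing the feed URLs")
    return parser.parse_args(argv)


def read_feeds(lines):
    """Return feed URLs, assuming one URL per line.

    Blank lines and lines starting with '#' are skipped (an assumed convention).
    """
    urls = []
    for line in lines:
        line = line.strip()
        if line and not line.startswith("#"):
            urls.append(line)
    return urls


def extract_items(rss_xml):
    """Return (title, link) pairs for every <item> in an RSS 2.0 document."""
    root = ET.fromstring(rss_xml)
    return [(item.findtext("title", default=""),
             item.findtext("link", default=""))
            for item in root.iter("item")]


def crawl(urls, fetch, num_threads):
    """Fetch all feeds in parallel and collect their items.

    `fetch` is any callable mapping a feed URL to its RSS text, so the
    download step can be swapped out or faked.
    """
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        documents = list(pool.map(fetch, urls))
    return [item for doc in documents for item in extract_items(doc)]
```

In a real run, `fetch` would download the feed and the collected items would be stored in the database given with -d. Passing `fetch` in as a parameter keeps the threading logic independent of the network code.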
#License: free to use at your own risk; the author, cowardly, is not responsible for any unpleasant consequences.