news_crawler

A news crawler that reads only from RSS feeds. URLs found inside the news items are not followed; this might change later.

dependencies

To run the crawler, you probably need to install some packages, available via pip:

The command-line tool lynx is also used.

The file "feeds.txt" contains some news feeds; you can add more.
Start the MongoDB daemon (@IMS people: please kindly ask Edgar to update MongoDB on the servers, the version is far too old):

$ mongod --dbpath [DATABASE-PATH]

$ python crawler.py -t [NUM-OF-THREADS] -d [DATABASE-NAME] -f [FEEDS-FILE]
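
As a rough illustration of the command above, here is a minimal sketch of how the three options could be parsed and how a feeds file could be read. It is not taken from crawler.py: the function names `parse_args` and `load_feeds`, the default of 4 threads, and the one-URL-per-line layout with `#` comments are all assumptions made for this example.

```python
import argparse


def parse_args(argv=None):
    """Parse the -t, -d and -f options shown in the command above."""
    parser = argparse.ArgumentParser(description="RSS news crawler (sketch)")
    # The default of 4 threads is an assumption, not the crawler's real default.
    parser.add_argument("-t", dest="threads", type=int, default=4,
                        help="number of worker threads")
    parser.add_argument("-d", dest="database", required=True,
                        help="MongoDB database name")
    parser.add_argument("-f", dest="feeds_file", default="feeds.txt",
                        help="file listing the feeds")
    return parser.parse_args(argv)


def load_feeds(lines):
    """Return feed URLs, assuming one URL per line.

    Blank lines and lines starting with '#' are skipped (an assumed convention).
    """
    feeds = []
    for line in lines:
        line = line.strip()
        if line and not line.startswith("#"):
            feeds.append(line)
    return feeds
```

For example, `parse_args(["-t", "8", "-d", "news", "-f", "feeds.txt"])` yields a namespace with `threads=8`, and `load_feeds(open("feeds.txt"))` returns the list of feed URLs under the assumed file layout.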

You can use any scheduling tool to run the crawler only once or twice a day, since RSS feeds do not update that fast.
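
If no scheduling tool is at hand, the same effect can be approximated with a small wrapper script. The sketch below is only one possible approach, and `build_command` and `run_periodically` are hypothetical helpers that are not part of this repository. It reuses the command line shown above and sleeps between runs.

```python
import subprocess
import time


def build_command(threads, database, feeds_file):
    """Assemble the crawler invocation shown in this README."""
    return ["python", "crawler.py",
            "-t", str(threads), "-d", database, "-f", feeds_file]


def run_periodically(command, interval_seconds, max_runs=None,
                     runner=subprocess.call, sleep=time.sleep):
    """Run the command, wait, and repeat.

    Stops after max_runs runs, or loops forever if max_runs is None.
    runner and sleep are parameters so they can be replaced in tests.
    """
    runs = 0
    while max_runs is None or runs < max_runs:
        runner(command)
        runs += 1
        # Only wait if another run will follow.
        if max_runs is None or runs < max_runs:
            sleep(interval_seconds)
    return runs
```

Calling `run_periodically(build_command(4, "news", "feeds.txt"), 12 * 60 * 60)` would start a crawl every twelve hours; the thread count and database name here are placeholders. A system scheduler such as cron is usually the more robust choice, since it survives reboots.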

License

Free to use at your own risk; the author is cowardly not responsible for any unpleasant consequences.
