crawling-for-nomore404

Crawling-related code for no-more-404s projects.

This repository contains multiple projects that are mostly independent of each other. Here is a summary of each project. See the README in each project's subdirectory for more details.

wikipedia

This project scrapes the Wikipedia IRC channel for updated articles, extracts newly added citations, and feeds those URLs for crawling. The scraper and the crawl-scheduler communicate through Kafka messaging, so other apps can also read the feed of new citations as well as the original IRC notifications.
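
The "extract newly added citations" step can be illustrated with a minimal sketch. It is not the project's actual code: it assumes citations can be approximated as the http(s) URLs found in an article's wikitext, and that both the previous and the current revision text are available. The names `extract_urls` and `new_citation_urls` are hypothetical.

```python
import re

# Assumption: a "citation" is approximated as any http(s) URL in the wikitext.
# The character class stops at whitespace and common wikitext delimiters.
URL_RE = re.compile(r'https?://[^\s\]|<>"}]+')


def extract_urls(wikitext):
    """Return the URLs in wikitext, in order of first appearance, deduplicated."""
    seen = set()
    urls = []
    for match in URL_RE.finditer(wikitext):
        url = match.group(0).rstrip(".,;")  # drop trailing sentence punctuation
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


def new_citation_urls(old_text, new_text):
    """Return URLs present in new_text but absent from old_text (hypothetical helper)."""
    old_urls = set(extract_urls(old_text))
    return [url for url in extract_urls(new_text) if url not in old_urls]


old_rev = "See [http://example.com/a Example A]."
new_rev = ("See [http://example.com/a Example A] and "
           "<ref>{{cite web|url=https://example.org/b|title=B}}</ref>.")
added = new_citation_urls(old_rev, new_rev)
```

In a pipeline like the one described, each URL in `added` would then be published as a message for the crawl-scheduler to consume. The topic layout and message format are defined by the project itself and are not assumed here.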

wordpress

This project reads WordPress's official blog update stream and schedules the permalink URL of each new post for crawling. It is currently implemented as a single application.
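
As a rough illustration of the "read stream, schedule permalink" loop, here is a small sketch. The update stream's real format is not described in this README, so the input shape used here (one JSON object per line with a `permalink` field) is purely an assumption, and `permalinks_to_schedule` is a hypothetical name.

```python
import json


def permalinks_to_schedule(lines, seen):
    """Yield each not-yet-seen permalink from an iterable of JSON lines.

    Assumption: every update is a JSON object with a "permalink" key.
    `seen` is a set that the caller keeps across calls to avoid rescheduling.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank keep-alive lines
        try:
            entry = json.loads(line)
        except ValueError:
            continue  # skip malformed entries instead of stopping the reader
        url = entry.get("permalink") if isinstance(entry, dict) else None
        if url and url not in seen:
            seen.add(url)
            yield url


sample_stream = [
    '{"permalink": "http://a.example/p1"}',
    "",
    "not json",
    '{"permalink": "http://a.example/p1"}',
    '{"permalink": "http://b.example/p2"}',
]
scheduled = list(permalinks_to_schedule(sample_stream, set()))
```

Keeping the deduplication set outside the function lets a long-running reader reconnect to the stream without scheduling the same post twice.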
