List of official, publicly available crawlers.
Missing something? Feel free to open an issue.
- craiglist.com - microcrawler-crawler-craiglist.com
- firmy.cz - microcrawler-crawler-firmy.cz
- google.com - microcrawler-crawler-google.com
- news.ycombinator.com - microcrawler-crawler-news.ycombinator.com
- sreality.cz - microcrawler-crawler-sreality.cz
- xkcd.com - microcrawler-crawler-xkcd.com
- yelp.com - microcrawler-crawler-yelp.com
- youjizz.com - microcrawler-crawler-youjizz.com
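The names on the right are npm package names, so an individual crawler can presumably be installed globally the same way as the core tool below; a minimal sketch, assuming the package is published on npmjs.org (how installed crawlers are discovered afterwards is not covered here):
# Install a single crawler package globally (illustrative example)
npm install -g microcrawler-crawler-xkcd.com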
From npmjs.org (the easy way)
This is the easiest way to install microcrawler; the prerequisites still need to be satisfied.
npm install -g microcrawler
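The prerequisites themselves are not listed in this section; at minimum a working Node.js and npm installation is assumed, which can be checked like this:
# Verify that Node.js and npm are available
node --version
npm --version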
From github.com (the hard way)
This is useful if you want to tweak the source code, implement a new crawler, etc.
# Clone repository
git clone https://github.com/ApolloCrawler/microcrawler.git
# Enter folder
cd microcrawler
# Install required packages - dependencies
npm install
# Install from local sources
npm install -g .
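To pick up upstream changes later, the same steps can be repeated from the cloned folder; a minimal sketch:
# Update the clone and refresh the globally installed copy
git pull
npm install
npm install -g .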
$ microcrawler
Usage: microcrawler [options] [command]
Commands:

    collector [args]   Run data collector
    config [args]      Run config
    exporter [args]    Run data exporter
    worker [args]      Run crawler worker
    crawl [args]       Crawl specified site
    help [cmd]         display help for [cmd]

Options:

    -h, --help         output usage information
    -V, --version      output the version number
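As the usage above shows, help for an individual subcommand goes through the help command, for example (output omitted here):
$ microcrawler help crawl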
$ microcrawler --version
0.1.27
$ microcrawler config init
2016-09-03T10:45:13.105Z - info: Creating config file "/Users/tomaskorcak/.microcrawler/config.json"
{
"client": "superagent",
"timeout": 10000,
"throttler": {
"enabled": false,
"active": true,
"rate": 20,
"ratePer": 1000,
"concurrent": 8
},
"retry": {
"count": 2
},
"headers": {
"Accept": "*/*",
"User-Agent": "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
"From": "googlebot(at)googlebot.com"
},
"proxy": {
"enabled": false,
"list": [
"https://168.63.20.19:8145"
]
},
"natFaker": {
"enabled": true,
"base": "192.168.1.1",
"bits": 16
},
"amqp": {
"uri": "amqp://localhost",
"queues": {
"collector": "collector",
"worker": "worker"
},
"options": {
"heartbeat": 60
}
},
"couchbase": {
"uri": "couchbase://localhost:8091",
"bucket": "microcrawler",
"username": "Administrator",
"password": "Administrator",
"connectionTimeout": 60000000,
"durabilityTimeout": 60000000,
"managementTimeout": 60000000,
"nodeConnectionTimeout": 10000000,
"operationTimeout": 10000000,
"viewTimeout": 10000000
},
"elasticsearch": {
"uri": "localhost:9200",
"index": "microcrawler",
"log": "debug"
}
}
$ vim ~/.microcrawler/config.json
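In this walkthrough the edit only replaces the localhost endpoints for amqp, couchbase, and elasticsearch with example.com; a quick way to confirm what was changed before running config show:
# List the edited endpoints in the config file
grep -n "example.com" ~/.microcrawler/config.json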
$ microcrawler config show
{
"client": "superagent",
"timeout": 10000,
"throttler": {
"enabled": false,
"active": true,
"rate": 20,
"ratePer": 1000,
"concurrent": 8
},
"retry": {
"count": 2
},
"headers": {
"Accept": "*/*",
"User-Agent": "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
"From": "googlebot(at)googlebot.com"
},
"proxy": {
"enabled": false,
"list": [
"https://168.63.20.19:8145"
]
},
"natFaker": {
"enabled": true,
"base": "192.168.1.1",
"bits": 16
},
"amqp": {
"uri": "amqp://example.com",
"queues": {
"collector": "collector",
"worker": "worker"
},
"options": {
"heartbeat": 60
}
},
"couchbase": {
"uri": "couchbase://example.com:8091",
"bucket": "microcrawler",
"username": "Administrator",
"password": "Administrator",
"connectionTimeout": 60000000,
"durabilityTimeout": 60000000,
"managementTimeout": 60000000,
"nodeConnectionTimeout": 10000000,
"operationTimeout": 10000000,
"viewTimeout": 10000000
},
"elasticsearch": {
"uri": "example.com:9200",
"index": "microcrawler",
"log": "debug"
}
}
TBD
TBD
TBD
TBD
microcrawler crawl craiglist.index http://sfbay.craigslist.org/sfc/sss/
microcrawler crawl firmy.cz.index "https://www.firmy.cz?_escaped_fragment_="
microcrawler crawl google.index "http://google.com/search?q=Buena+Vista"
microcrawler crawl hackernews.index https://news.ycombinator.com/
microcrawler crawl xkcd.index http://xkcd.com
microcrawler crawl yelp.index "http://www.yelp.com/search?find_desc=restaurants&find_loc=Los+Angeles%2C+CA&ns=1&ls=f4de31e623458437"
microcrawler crawl youjizz.com.index http://youjizz.com
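All of the examples above follow the same shape, a crawler entry point followed by a start URL; note that URLs containing shell metacharacters such as & are quoted. As a generic sketch (crawler name and URL are placeholders):
microcrawler crawl <crawler>.index "<start URL>"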
- @pavelbinar for QA and much more.