Repository is contains Taiwan Hot-Tops news media data crawler. 使用GO Colly 抓取台灣頭條新聞媒體即時頭條新聞
Using golang with gocolly to crawling data from internet and save to monogDB. The media is include TVBS, Mirrormedia, Chinatimes, STEN, Ettoday
V 1.5 - release in 2021/3/1
V 1.8 - release in 2021/7/30
github.com/PuerkitoBio/goquery v1.8.0
github.com/antchfx/htmlquery v1.2.4
github.com/antchfx/xmlquery v1.3.10
github.com/gobwas/glob v0.2.3
github.com/gocolly/colly v1.2.0
github.com/kennygrant/sanitize v1.2.4
github.com/saintfish/chardet v0.0.0-20120816061221-3af4cd4741ca
github.com/sirupsen/logrus v1.8.1
github.com/temoto/robotstxt v1.1.2
golang.org/x/net v0.0.0-20220425223048-2871e0cb64e4
google.golang.org/appengine v1.6.7
gopkg.in/mgo.v2 v2.0.0-20190816093944-a6b53ec6cb22
- build - the final execution file
- crawler - data crawlering program
- exec - execute entry main program
- go.mod - module manage files
- build_program.sh - build go program to be executtion file
- Using Mongodb V 4.0.3
- it need to setup envvironment varibable enable program can connect to mongoDB
export MONGODB_HOST="localhost"
export MONGODB_PORT="27017"
export MONGODB_USERNAME="admin"
export MONGODB_PASSWD="password"
export MONGODB_DB="go_crawler_data"
./build_program.sh
License is opensource. Please refer to Apache License 2.0(Apache 2.0)
Project is done by first version, but it will keep developing.