站酷(ZCOOL),中国设计师互动平台。深耕设计领域十五年,站酷聚集了1300万设计师、摄影师、插画师、艺术家、创意人,设计创意群体中具有较高的影响力与号召力。本仓库是用Go语言爬取站酷上的插画作品集(本人喜欢插画,很喜欢站酷上的作品,代码仅供学习交流!)
- go version go1.18.3 linux/amd64
- GoLand(2021.2.2)
- Google Chrome
- Windows 11
- Ubuntu 22.x
- Docker
- ElasticSearch 7.x
Base Instructions:(copy!copy!copy!)
docker pull docker.elastic.co/elasticsearch/elasticsearch:7.17.5
docker run -p 127.0.0.1:9200:9200 -p 127.0.0.1:9300:9300 -v /home/yinlei/Desktop/goCrawlerData:/usr/share/elasticsearch/data -e "discovery.type=single-node" docker.elastic.co/elasticsearch/elasticsearch:7.17.5
you must add:
You can clone this project, and replace some category's url which is written at main.go
.
Of course, you only need some logical code to achieve your target. for example, zcool
folder which you can refer.
I also add some intersting website in future. Douyin
is a good target.
- 站酷分类解析器
- 站酷列表解析器
- 站酷帖子解析器
// 解析器ParserGo(go1.17.7 windows/amd64)
// 输入:utf-8编码的文本
// 输出:Request{URL, 对应Parser}列表,Item列表
单任务版的瓶颈在于每次fetcher并解析后才轮到下一个解析,这里通过Goroutine实现并发版本。
Scheduler实现思路:
- 思路1:所有Worker公用一个输入(Request送入Worker让他们自己去抢,谁抢到谁做)
- 思路2:并发分发Request(Scheduler为每个Request创建goroutine)
- 思路3[最终改进版]:Request队列、Worker队列(scheduler engine worker都互相通过Channel通信)