Can Weibo crawling still fetch ten million posts per day for a single keyword? How should the settings be configured for large-scale crawling? #515
Comments
Not anymore. Some parameters have stopped working, so a single date can only yield around 20,000 posts.
So if I want thorough coverage, I can only crawl day by day, not a whole date range at once?
It's best to crawl one date per run; otherwise a long date range may miss some posts. In fact, when a range returns many results, the program automatically splits it and crawls day by day anyway. The problem is that the search page sometimes randomly returns a blank page, which makes the program think it has finished and exit before all the configured dates have been crawled. Crawling a single day at a time (setting START_DATE and END_DATE to the same day) reduces the impact of this.
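For reference, a minimal sketch of the single-day setup described above. Only the START_DATE and END_DATE setting names come from this thread; the file path and the yyyy-mm-dd string format are assumptions to verify against your own settings.py:

```python
# weibo/settings.py (sketch; the file path and date format are assumptions)

# Crawl exactly one day by pointing both bounds at the same date,
# which limits the damage from the random blank-page early exit
# described above to a single day's run.
START_DATE = '2023-05-10'
END_DATE = '2023-05-10'
```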
Got it, thanks a lot for the explanation.
One more question: say I finish crawling the data for the 10th, change the date in settings to the 11th, and then start `$ scrapy crawl search -s JOBDIR=crawls/search` from the README again. The terminal still starts crawling data for the 10th. Only after I delete the `search` folder under `crawls` and rerun the README command does it crawl the 11th. How do I fix this?
You can run `scrapy crawl search` from the command line (i.e., without the JOBDIR option).
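For context, this behavior comes from Scrapy's job persistence: `JOBDIR` stores the pending request queue and the duplicate-request filter on disk, so rerunning with the same `JOBDIR` resumes the previous job (still targeting the old date) rather than starting fresh with the new settings. A hedged sketch of the two ways around it:

```bash
# Option 1: start a fresh job by omitting JOBDIR
# (you lose Scrapy's pause/resume support for this run)
scrapy crawl search

# Option 2: keep pause/resume, but clear the old job state first,
# as the questioner did by deleting the search folder under crawls
rm -rf crawls/search
scrapy crawl search -s JOBDIR=crawls/search
```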