guhui@guhuideMacBook-Pro temp_api % scrapy crawl temp_api -o data.json
2024-04-03 22:13:54 [scrapy.utils.log] INFO: Scrapy 2.11.1 started (bot: temp_api)
2024-04-03 22:13:54 [scrapy.utils.log] INFO: Versions: lxml 5.2.0.0, libxml2 2.12.6, cssselect 1.2.0, parsel 1.9.0, w3lib 2.1.2, Twisted 24.3.0, Python 3.11.7 (main, Dec 4 2023, 18:10:11) [Clang 15.0.0 (clang-1500.1.0.2.5)], pyOpenSSL 24.1.0 (OpenSSL 3.1.4 24 Oct 2023), cryptography 41.0.7, Platform macOS-14.2.1-arm64-arm-64bit
2024-04-03 22:13:54 [scrapy.addons] INFO: Enabled addons:
[]
2024-04-03 22:13:54 [asyncio] DEBUG: Using selector: KqueueSelector
2024-04-03 22:13:54 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor
2024-04-03 22:13:54 [scrapy.utils.log] DEBUG: Using asyncio event loop: asyncio.unix_events._UnixSelectorEventLoop
2024-04-03 22:13:54 [scrapy.extensions.telnet] INFO: Telnet Password: c00d3fa735530f6c
2024-04-03 22:13:54 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.logstats.LogStats']
2024-04-03 22:13:54 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'temp_api',
'FEED_EXPORT_ENCODING': 'utf-8',
'NEWSPIDER_MODULE': 'temp_api.spiders',
'REQUEST_FINGERPRINTER_IMPLEMENTATION': '2.7',
'ROBOTSTXT_OBEY': True,
'SPIDER_MODULES': ['temp_api.spiders'],
'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'}
2024-04-03 22:13:54 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2024-04-03 22:13:54 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2024-04-03 22:13:54 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2024-04-03 22:13:54 [scrapy.core.engine] INFO: Spider opened
2024-04-03 22:13:54 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-04-03 22:13:54 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6024
2024-04-03 22:13:54 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET http://www.weather.com.cn/other/weather_error_404.html?r=sh.weather.com.cn> from <GET http://sh.weather.com.cn/robots.txt>
2024-04-03 22:13:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.weather.com.cn/other/weather_error_404.html?r=sh.weather.com.cn> (referer: None)
2024-04-03 22:13:55 [protego] DEBUG: Rule at line 4 without any user agent to enforce it on.
2024-04-03 22:13:55 [protego] DEBUG: Rule at line 9 without any user agent to enforce it on.
2024-04-03 22:13:55 [protego] DEBUG: Rule at line 10 without any user agent to enforce it on.
2024-04-03 22:13:55 [protego] DEBUG: Rule at line 11 without any user agent to enforce it on.
2024-04-03 22:13:55 [protego] DEBUG: Rule at line 12 without any user agent to enforce it on.
2024-04-03 22:13:55 [protego] DEBUG: Rule at line 13 without any user agent to enforce it on.
2024-04-03 22:13:55 [protego] DEBUG: Rule at line 19 without any user agent to enforce it on.
2024-04-03 22:13:55 [protego] DEBUG: Rule at line 27 without any user agent to enforce it on.
2024-04-03 22:13:55 [protego] DEBUG: Rule at line 28 without any user agent to enforce it on.
2024-04-03 22:13:55 [protego] DEBUG: Rule at line 29 without any user agent to enforce it on.
2024-04-03 22:13:55 [protego] DEBUG: Rule at line 30 without any user agent to enforce it on.
2024-04-03 22:13:55 [protego] DEBUG: Rule at line 31 without any user agent to enforce it on.
2024-04-03 22:13:55 [protego] DEBUG: Rule at line 32 without any user agent to enforce it on.
2024-04-03 22:13:55 [protego] DEBUG: Rule at line 33 without any user agent to enforce it on.
2024-04-03 22:13:55 [protego] DEBUG: Rule at line 34 without any user agent to enforce it on.
2024-04-03 22:13:55 [protego] DEBUG: Rule at line 35 without any user agent to enforce it on.
2024-04-03 22:13:55 [protego] DEBUG: Rule at line 36 without any user agent to enforce it on.
2024-04-03 22:13:55 [protego] DEBUG: Rule at line 37 without any user agent to enforce it on.
2024-04-03 22:13:55 [protego] DEBUG: Rule at line 39 without any user agent to enforce it on.
2024-04-03 22:13:55 [protego] DEBUG: Rule at line 40 without any user agent to enforce it on.
2024-04-03 22:13:55 [protego] DEBUG: Rule at line 41 without any user agent to enforce it on.
2024-04-03 22:13:55 [protego] DEBUG: Rule at line 42 without any user agent to enforce it on.
2024-04-03 22:13:55 [protego] DEBUG: Rule at line 43 without any user agent to enforce it on.
2024-04-03 22:13:55 [protego] DEBUG: Rule at line 44 without any user agent to enforce it on.
2024-04-03 22:13:55 [protego] DEBUG: Rule at line 45 without any user agent to enforce it on.
2024-04-03 22:13:55 [protego] DEBUG: Rule at line 46 without any user agent to enforce it on.
2024-04-03 22:13:55 [protego] DEBUG: Rule at line 47 without any user agent to enforce it on.
2024-04-03 22:13:55 [protego] DEBUG: Rule at line 48 without any user agent to enforce it on.
2024-04-03 22:13:55 [protego] DEBUG: Rule at line 50 without any user agent to enforce it on.
2024-04-03 22:13:55 [protego] DEBUG: Rule at line 51 without any user agent to enforce it on.
2024-04-03 22:13:55 [protego] DEBUG: Rule at line 52 without any user agent to enforce it on.
2024-04-03 22:13:55 [protego] DEBUG: Rule at line 54 without any user agent to enforce it on.
2024-04-03 22:13:55 [protego] DEBUG: Rule at line 55 without any user agent to enforce it on.
2024-04-03 22:13:55 [protego] DEBUG: Rule at line 57 without any user agent to enforce it on.
2024-04-03 22:13:55 [protego] DEBUG: Rule at line 58 without any user agent to enforce it on.
2024-04-03 22:13:55 [protego] DEBUG: Rule at line 59 without any user agent to enforce it on.
2024-04-03 22:13:55 [protego] DEBUG: Rule at line 60 without any user agent to enforce it on.
2024-04-03 22:13:55 [protego] DEBUG: Rule at line 61 without any user agent to enforce it on.
2024-04-03 22:13:55 [protego] DEBUG: Rule at line 62 without any user agent to enforce it on.
2024-04-03 22:13:55 [protego] DEBUG: Rule at line 71 without any user agent to enforce it on.
2024-04-03 22:13:55 [protego] DEBUG: Rule at line 72 without any user agent to enforce it on.
2024-04-03 22:13:55 [protego] DEBUG: Rule at line 73 without any user agent to enforce it on.
2024-04-03 22:13:55 [protego] DEBUG: Rule at line 75 without any user agent to enforce it on.
2024-04-03 22:13:55 [protego] DEBUG: Rule at line 76 without any user agent to enforce it on.
2024-04-03 22:13:55 [protego] DEBUG: Rule at line 78 without any user agent to enforce it on.
2024-04-03 22:13:55 [protego] DEBUG: Rule at line 79 without any user agent to enforce it on.
2024-04-03 22:13:55 [protego] DEBUG: Rule at line 80 without any user agent to enforce it on.
2024-04-03 22:13:55 [protego] DEBUG: Rule at line 81 without any user agent to enforce it on.
2024-04-03 22:13:55 [protego] DEBUG: Rule at line 82 without any user agent to enforce it on.
2024-04-03 22:13:55 [protego] DEBUG: Rule at line 99 without any user agent to enforce it on.
2024-04-03 22:13:55 [protego] DEBUG: Rule at line 106 without any user agent to enforce it on.
2024-04-03 22:13:55 [protego] DEBUG: Rule at line 110 without any user agent to enforce it on.
2024-04-03 22:13:55 [protego] DEBUG: Rule at line 112 without any user agent to enforce it on.
2024-04-03 22:13:55 [protego] DEBUG: Rule at line 114 without any user agent to enforce it on.
2024-04-03 22:13:55 [protego] DEBUG: Rule at line 119 without any user agent to enforce it on.
2024-04-03 22:13:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://sh.weather.com.cn/> (referer: None)
2024-04-03 22:13:55 [temp_api] DEBUG: http://d1.weather.com.cn/sk_2d/101020100.html?_=1564045644284
2024-04-03 22:13:55 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET http://www.weather.com.cn/contacts_api.html> from <GET http://d1.weather.com.cn/robots.txt>
2024-04-03 22:13:56 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.weather.com.cn/contacts_api.html> (referer: None)
2024-04-03 22:13:56 [protego] DEBUG: Rule at line 7 without any user agent to enforce it on.
2024-04-03 22:13:56 [protego] DEBUG: Rule at line 8 without any user agent to enforce it on.
2024-04-03 22:13:56 [protego] DEBUG: Rule at line 9 without any user agent to enforce it on.
2024-04-03 22:13:56 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://d1.weather.com.cn/sk_2d/101020100.html?_=1564045644284> (referer: http://sh.weather.com.cn/)
2024-04-03 22:13:56 [scrapy.core.scraper] DEBUG: Scraped from <200 http://d1.weather.com.cn/sk_2d/101020100.html?_=1564045644284>
{'nameen': 'shanghai', 'cityname': '上海', 'city': '101020100', 'temp': '14.2', 'tempf': '57.6', 'WD': '西北风', 'wde': 'NW', 'WS': '1级', 'wse': '1km/h', 'SD': '84%', 'sd': '84%', 'qy': '1015', 'njd': '8km', 'time': '22:00', 'rain': '0', 'rain24h': '0', 'aqi': '26', 'aqi_pm25': '26', 'weather': '多云', 'weathere': 'Cloudy', 'weathercode': 'd01', 'limitnumber': '', 'date': '04月03日(星期三)'}
2024-04-03 22:13:56 [scrapy.core.engine] INFO: Closing spider (finished)
2024-04-03 22:13:56 [scrapy.extensions.feedexport] INFO: Stored json feed (1 items) in: data.json
2024-04-03 22:13:56 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1485,
'downloader/request_count': 6,
'downloader/request_method_count/GET': 6,
'downloader/response_bytes': 65486,
'downloader/response_count': 6,
'downloader/response_status_count/200': 4,
'downloader/response_status_count/301': 1,
'downloader/response_status_count/302': 1,
'elapsed_time_seconds': 1.636783,
'feedexport/success_count/FileFeedStorage': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2024, 4, 3, 14, 13, 56, 279235, tzinfo=datetime.timezone.utc),
'httpcompression/response_bytes': 11343,
'httpcompression/response_count': 3,
'item_scraped_count': 1,
'log_count/DEBUG': 69,
'log_count/INFO': 11,
'memusage/max': 64684032,
'memusage/startup': 64684032,
'request_depth_max': 1,
'response_received_count': 4,
'robotstxt/request_count': 2,
'robotstxt/response_count': 2,
'robotstxt/response_status_count/200': 2,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2024, 4, 3, 14, 13, 54, 642452, tzinfo=datetime.timezone.utc)}
2024-04-03 22:13:56 [scrapy.core.engine] INFO: Spider closed (finished)
guhui@guhuideMacBook-Pro temp_api %
Preface
Web crawlers have quite a reputation, but never having actually used one felt a bit behind the times, so I took the opportunity of a research task to play around with one.
First, a goal: grab the current temperature reading from a weather website.
Let's give it a try.
Choosing the technology
Without much deliberation I picked Scrapy, currently the most popular Python crawling framework, not knowing whether it's actually any good. Going by the numbers, though, anything with that many users shouldn't have problems that are too serious.
Implementation
2. Analyzing the page data
Poking around the page, locating the element via class="sk-temp" looks like the convenient route. Scrapy has a rather practical tool for exactly this kind of probing: its interactive shell lets you fetch a page and debug selectors live, without writing a single line of code.
Let's give it a spin first and talk about impressions afterwards.
The node structure comes back, but the data doesn't seem to be in it.
Calling view(response) opens the fetched page in a browser for a preview.
The problem: the data is loaded dynamically, and the static page simply doesn't contain it.
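A rough reconstruction of that shell session (the selector output here is illustrative, not verbatim):

```
$ scrapy shell "http://sh.weather.com.cn/"
...
>>> response.css(".sk-temp")                  # the node structure is there...
[<Selector query='...' data='<div class="sk-temp">...'>]
>>> response.css(".sk-temp::text").getall()   # ...but no temperature inside
[]
>>> view(response)                            # preview the fetched static page
True
```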
There seem to be only two ways forward at this point: trace the underlying request that actually delivers the data, or render the page the way a real browser would (more on that split in the summary).
I opted to go straight for the API data first.
5. Instantly humbled
A large site like this has presumably added anti-scraping measures.
Hitting the API head-on doesn't work!
The Cookie value is just a red herring!
The real reason I couldn't fetch it outside the page: the Referer header wasn't being sent.
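A minimal sketch of that check with requests (the URL is the one captured in the log; treating Referer as the gatekeeper is the finding above):

```python
import requests

API_URL = "http://d1.weather.com.cn/sk_2d/101020100.html?_=1564045644284"

# Without a Referer the server turns the request away; with one pointing
# back at the host page, it responds normally.
resp = requests.get(API_URL, headers={"Referer": "http://sh.weather.com.cn/"})
print(resp.status_code, resp.text[:100])
```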
6. Creating the spider project
Since the API request URL is embedded in the main page's HTML and there's no guarantee it stays fixed, for a measure of stability it has to be extracted from the body dynamically.
After a pile of trial and error,
I ended up grabbing it with response.css('script[src*=sk_2d]').attrib["src"].
As you can see, sk_2d in the script URL is the keyword, and it's relatively easy to pick out.
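Putting the pieces together, the spider might look roughly like this (a sketch: the class name and the parse_api callback are my assumptions; the selector and the Referer trick come from the steps above):

```python
import scrapy

class TempApiSpider(scrapy.Spider):
    name = "temp_api"
    start_urls = ["http://sh.weather.com.cn/"]

    def parse(self, response):
        # find the <script> whose src contains the sk_2d keyword
        api_url = response.css('script[src*=sk_2d]').attrib["src"]
        self.logger.debug(api_url)
        # the API rejects requests without a Referer, so send one
        yield scrapy.Request(
            api_url,
            headers={"Referer": response.url},
            callback=self.parse_api,
        )
```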
8. Parsing the second request's data
The second response is actually JavaScript, not plain JSON.
The Scrapy documentation recommends chompjs for this.
Note that response.body is a bytes object and has to be decoded before it can be used.
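The callback could then look something like this (a sketch; chompjs.parse_js_object extracts the first object literal it finds in a JS string):

```python
import chompjs  # at module top, alongside the scrapy import

def parse_api(self, response):  # method of TempApiSpider above
    # response.body is bytes; decode it before handing it to chompjs
    js_text = response.body.decode("utf-8")
    # the payload is JavaScript wrapping an object literal;
    # parse_js_object pulls that object out as a Python dict
    yield chompjs.parse_js_object(js_text)
```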
Run the spider and export the results to data.json.
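The invocation, as captured in the session log at the top:

```
scrapy crawl temp_api -o data.json
```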
Check the data:
It matches what the page shows. Perfect.
Summary
First impressions of Scrapy: pretty good.
On how a crawler should get at its data, thinking seems to split into two schools.
The first school is about tracing everything back to the source: find the request where the data truly originates.
Pros: "every grievance has its culprit, every debt its debtor" — swift and satisfying, and the data's provenance is completely clear.
Cons: the investigation cost is high, especially against large sites with anti-scraping defenses.
The second school is about "simulating" a browser's natural behavior: however a user naturally gets the data is how you get the data.
Pros: low investigation cost — you only need to search the DOM tree.
Cons: standing up a local browser environment is somewhat more expensive.
Whether this "let it happen naturally" style is actually any good can only be known by using it, so that's for the next chapter.