Getting Started with Web Crawlers (Part 1): A First Look at Scrapy #240

soapgu opened this issue Apr 2, 2024

  • Preface

Web crawlers have quite a reputation, but never having actually used one felt a bit behind the times, so I took a research task as an excuse to play with one and see.
First, a goal: fetch the current temperature from a weather website.
[screenshot: weather site showing the current temperature]
Let's try to grab that temperature reading.

  • Technology choice

No deliberation here: I picked Python's Scrapy, currently the most popular framework, without knowing whether it is actually any good. Going by the numbers, with this many users the problems shouldn't be too serious.

  • Implementation

  1. Installation

pip install Scrapy
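
If the install succeeded, the scrapy command should be on the PATH; a quick sanity check (the version will of course vary by environment):

scrapy version
Scrapy 2.11.1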

  2. Analyze the target page's data

[screenshot: the page markup in the browser devtools] It looks like the element with class="sk-temp" is the convenient handle to search for.
  3. "Debug" the scrape from the command line

This is one of Scrapy's genuinely practical tools: without writing a single line of code, you can attempt the scrape interactively and debug it in real time.
Let's try it first, then talk about how it feels.

scrapy shell 'http://sh.weather.com.cn/'
>>> response.css('p.sk-temp').get()
'<p class="sk-temp"><span></span><em>℃</em></p>'

We got the structure, but there doesn't seem to be any data in it.
Calling view(response) opens a preview of the page:
[screenshot: view(response) preview]
The problem is that the data is loaded dynamically; the static page doesn't contain it.

  4. We hit a problem
    There seem to be only two ways forward:
  • Execute the JavaScript in a locally simulated browser and then scrape the result; scrapy-splash is an existing middleware for this, but the cost looks nontrivial, so shelve it for now
  • Or trace the relevant network requests and take the API data directly

Let's try going straight for the API data first.

  5. Instantly slapped in the face
[screenshot: failed API requests in devtools]
A site this large has presumably deployed anti-scraping measures:
[screenshot: request headers and cookies in devtools]

  • The request route is hard to pin down
  • The parameters are hard to pin down
  • The cookies show no pattern at all

Brute force clearly isn't going to work!

Except it turns out I had been fooled! The cookie values are just a smokescreen; my requests failed because the Referer header was missing (a quick check of this is sketched below).
[screenshot: the request succeeding once Referer is set]
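
To convince yourself, a minimal standalone check is enough. This sketch uses the requests library (not part of the original setup, purely illustrative) against the sk_2d endpoint found in the next steps:

import requests

# Hypothesis under test: the endpoint only serves the data script
# when the Referer header is present.
url = "http://d1.weather.com.cn/sk_2d/101020100.html"
headers = {"Referer": "http://sh.weather.com.cn/"}
resp = requests.get(url, headers=headers)
print(resp.status_code, resp.text[:60])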

  6. Create the spider project

scrapy startproject temp_api
[screenshot: startproject output]
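
The generated skeleton is the standard Scrapy layout; roughly (details vary slightly by version):

temp_api/
    scrapy.cfg
    temp_api/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py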
  7. Find the dynamic request URL
    Since the API request is embedded in the main HTML and there's no guarantee it stays fixed, for robustness the URL should be extracted from the body dynamically.
[screenshot: the sk_2d script tag in the page source]

After a pile of trial and error:

>>> response.css('script[src*=sk_2d]').get()
'<script type="text/javascript" src="http://d1.weather.com.cn/sk_2d/101020100.html?_=1564045644284"></script>'
>>> response.css('script[src*=sk_2d]').attrib["src"].get()
Traceback (most recent call last):
  File "<console>", line 1, in <module>
AttributeError: 'str' object has no attribute 'get'
>>> response.css('script[src*=sk_2d]').attrib["src"]
'http://d1.weather.com.cn/sk_2d/101020100.html?_=1564045644284'
>>> quit()

In the end, response.css('script[src*=sk_2d]').attrib["src"] does the job.
As you can see, the sk_2d token in the script URL is the distinctive part, and it is fairly easy to match on.
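
For the record, Scrapy's ::attr() pseudo-element spells the same extraction slightly differently and sidesteps the str-vs-SelectorList confusion in the session above:

>>> response.css('script[src*=sk_2d]::attr(src)').get()
'http://d1.weather.com.cn/sk_2d/101020100.html?_=1564045644284'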

  8. Parse the second request's payload

The second response is actually JavaScript.
[screenshot: the JavaScript response body]
The Scrapy documentation recommends chompjs for this.
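
Roughly, chompjs takes raw JavaScript source and extracts the first object literal from it as a Python dict. A toy example; the variable name dataSK and the fields here are just an assumption about the payload's shape:

>>> import chompjs
>>> chompjs.parse_js_object('var dataSK = {"cityname": "上海", "temp": "14.2"};')
{'cityname': '上海', 'temp': '14.2'}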

  9. Complete the Spider

import scrapy
import chompjs


class MySpider(scrapy.Spider):
    name = "temp_api"
    start_urls = [
        "http://sh.weather.com.cn/",
    ]

    def parse(self, response):
        # Locate the dynamically injected sk_2d script in the main page
        next_url = response.css('script[src*=sk_2d]').attrib["src"]
        self.log(next_url)
        # The endpoint only serves data when the Referer header is present
        headers = {"Referer": "http://sh.weather.com.cn/"}
        yield scrapy.Request(next_url, callback=self.parse_script, headers=headers)

    def parse_script(self, response):
        # The body is JavaScript, not JSON; chompjs extracts the embedded object
        data = chompjs.parse_js_object(response.body.decode("utf-8"))
        yield data

Note that response.body here is a bytes object and has to be decoded before use (Scrapy's response.text would do the decoding for you as well).

  10. Verify

Run the spider and export the result to data.json:

guhui@guhuideMacBook-Pro temp_api % scrapy crawl temp_api -o data.json  
2024-04-03 22:13:54 [scrapy.utils.log] INFO: Scrapy 2.11.1 started (bot: temp_api)
2024-04-03 22:13:54 [scrapy.utils.log] INFO: Versions: lxml 5.2.0.0, libxml2 2.12.6, cssselect 1.2.0, parsel 1.9.0, w3lib 2.1.2, Twisted 24.3.0, Python 3.11.7 (main, Dec  4 2023, 18:10:11) [Clang 15.0.0 (clang-1500.1.0.2.5)], pyOpenSSL 24.1.0 (OpenSSL 3.1.4 24 Oct 2023), cryptography 41.0.7, Platform macOS-14.2.1-arm64-arm-64bit
2024-04-03 22:13:54 [scrapy.addons] INFO: Enabled addons:
[]
2024-04-03 22:13:54 [asyncio] DEBUG: Using selector: KqueueSelector
2024-04-03 22:13:54 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor
2024-04-03 22:13:54 [scrapy.utils.log] DEBUG: Using asyncio event loop: asyncio.unix_events._UnixSelectorEventLoop
2024-04-03 22:13:54 [scrapy.extensions.telnet] INFO: Telnet Password: c00d3fa735530f6c
2024-04-03 22:13:54 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats']
2024-04-03 22:13:54 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'temp_api',
 'FEED_EXPORT_ENCODING': 'utf-8',
 'NEWSPIDER_MODULE': 'temp_api.spiders',
 'REQUEST_FINGERPRINTER_IMPLEMENTATION': '2.7',
 'ROBOTSTXT_OBEY': True,
 'SPIDER_MODULES': ['temp_api.spiders'],
 'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'}
2024-04-03 22:13:54 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2024-04-03 22:13:54 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2024-04-03 22:13:54 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2024-04-03 22:13:54 [scrapy.core.engine] INFO: Spider opened
2024-04-03 22:13:54 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-04-03 22:13:54 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6024
2024-04-03 22:13:54 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET http://www.weather.com.cn/other/weather_error_404.html?r=sh.weather.com.cn> from <GET http://sh.weather.com.cn/robots.txt>
2024-04-03 22:13:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.weather.com.cn/other/weather_error_404.html?r=sh.weather.com.cn> (referer: None)
2024-04-03 22:13:55 [protego] DEBUG: Rule at line 4 without any user agent to enforce it on.
(... dozens of similar protego DEBUG lines about robots.txt rules, trimmed ...)
2024-04-03 22:13:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://sh.weather.com.cn/> (referer: None)
2024-04-03 22:13:55 [temp_api] DEBUG: http://d1.weather.com.cn/sk_2d/101020100.html?_=1564045644284
2024-04-03 22:13:55 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET http://www.weather.com.cn/contacts_api.html> from <GET http://d1.weather.com.cn/robots.txt>
2024-04-03 22:13:56 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.weather.com.cn/contacts_api.html> (referer: None)
2024-04-03 22:13:56 [protego] DEBUG: Rule at line 7 without any user agent to enforce it on.
2024-04-03 22:13:56 [protego] DEBUG: Rule at line 8 without any user agent to enforce it on.
2024-04-03 22:13:56 [protego] DEBUG: Rule at line 9 without any user agent to enforce it on.
2024-04-03 22:13:56 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://d1.weather.com.cn/sk_2d/101020100.html?_=1564045644284> (referer: http://sh.weather.com.cn/)
2024-04-03 22:13:56 [scrapy.core.scraper] DEBUG: Scraped from <200 http://d1.weather.com.cn/sk_2d/101020100.html?_=1564045644284>
{'nameen': 'shanghai', 'cityname': '上海', 'city': '101020100', 'temp': '14.2', 'tempf': '57.6', 'WD': '西北风', 'wde': 'NW', 'WS': '1级', 'wse': '1km/h', 'SD': '84%', 'sd': '84%', 'qy': '1015', 'njd': '8km', 'time': '22:00', 'rain': '0', 'rain24h': '0', 'aqi': '26', 'aqi_pm25': '26', 'weather': '多云', 'weathere': 'Cloudy', 'weathercode': 'd01', 'limitnumber': '', 'date': '04月03日(星期三)'}
2024-04-03 22:13:56 [scrapy.core.engine] INFO: Closing spider (finished)
2024-04-03 22:13:56 [scrapy.extensions.feedexport] INFO: Stored json feed (1 items) in: data.json
2024-04-03 22:13:56 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1485,
 'downloader/request_count': 6,
 'downloader/request_method_count/GET': 6,
 'downloader/response_bytes': 65486,
 'downloader/response_count': 6,
 'downloader/response_status_count/200': 4,
 'downloader/response_status_count/301': 1,
 'downloader/response_status_count/302': 1,
 'elapsed_time_seconds': 1.636783,
 'feedexport/success_count/FileFeedStorage': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2024, 4, 3, 14, 13, 56, 279235, tzinfo=datetime.timezone.utc),
 'httpcompression/response_bytes': 11343,
 'httpcompression/response_count': 3,
 'item_scraped_count': 1,
 'log_count/DEBUG': 69,
 'log_count/INFO': 11,
 'memusage/max': 64684032,
 'memusage/startup': 64684032,
 'request_depth_max': 1,
 'response_received_count': 4,
 'robotstxt/request_count': 2,
 'robotstxt/response_count': 2,
 'robotstxt/response_status_count/200': 2,
 'scheduler/dequeued': 2,
 'scheduler/dequeued/memory': 2,
 'scheduler/enqueued': 2,
 'scheduler/enqueued/memory': 2,
 'start_time': datetime.datetime(2024, 4, 3, 14, 13, 54, 642452, tzinfo=datetime.timezone.utc)}
2024-04-03 22:13:56 [scrapy.core.engine] INFO: Spider closed (finished)
guhui@guhuideMacBook-Pro temp_api % 

Looking at the data:

[
{"nameen": "shanghai", "cityname": "上海", "city": "101020100", "temp": "14.2", "tempf": "57.6", "WD": "西北风", "wde": "NW", "WS": "1级", "wse": "1km/h", "SD": "84%", "sd": "84%", "qy": "1015", "njd": "8km", "time": "22:00", "rain": "0", "rain24h": "0", "aqi": "26", "aqi_pm25": "26", "weather": "多云", "weathere": "Cloudy", "weathercode": "d01", "limitnumber": "", "date": "04月03日(星期三)"}
]

It matches what the page shows. Perfect.
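
As one last sanity check, a few lines of plain Python can read the exported feed back (a sketch, assuming the data.json produced above):

import json

# Load the feed Scrapy exported and spot-check the temperature field
with open("data.json", encoding="utf-8") as f:
    items = json.load(f)
print(items[0]["temp"])  # -> '14.2'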

  • Summary

First impressions of Scrapy are pretty good.
On how a crawler gets at its data, the craft seems to split into two schools:

  • The analytical, trace-to-the-source school


This school's philosophy is to trace everything back to its origin and find the request that actually delivers the data.
Pros: "every grievance has its culprit"; the provenance of the data is clear and satisfying to pin down.
Cons: the investigation is expensive, and large sites with anti-scraping defenses make it genuinely challenging.

  • The "let it happen naturally" school

This school "simulates" a browser behaving naturally: however a real user's browser gets the data, get it the same way.
Pros: low investigation cost; you only need to search the DOM tree.
Cons: standing up a local browser environment costs more.

Whether the "natural" approach is actually any good we'll only know after using it, so that is the topic of the next installment.
