
[Python Web Scraping with Scrapy] Part 1: A Getting-Started Example with the Scrapy Framework

Date: 2023-05-06 19:54:17

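Contents

I. Installing Scrapy
II. Generating a Scrapy Project
III. Scraping Wallpaper Image Links
  1. Modifying the settings file
  2. Writing the item file
  3. Writing the spider file
  4. Writing the pipelines file
  5. Running the crawler project
IV. A Promising Future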

I. Installing Scrapy

Installing with Anaconda

If your Python was installed through Anaconda, you can use this method.

conda install Scrapy

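Installing on Windows

If your Python was downloaded from the official website, you need to install the following libraries first:

- lxml
- pyOpenSSL
- Twisted
- PyWin32

Once these libraries are installed, you can install Scrapy with the following command: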

pip install Scrapy

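My own Python was installed through Anaconda; the Windows method is taken from Cui Qingcai's book 《Python3网络爬虫开发实战》 (Python 3 Web Crawler Development in Practice).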
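Before running pip install Scrapy on Windows, it can help to check which of the prerequisite libraries are already importable. The following is a minimal sketch using only the standard library. The helper name missing_packages is made up for this example, and the mapping from package name to import name (for instance pyOpenSSL importing as OpenSSL, and PyWin32 providing win32api) is an assumption about how these packages are usually laid out.

```python
from importlib.util import find_spec

# Assumed mapping: package name as installed -> top-level module it provides.
PREREQUISITES = {
    "lxml": "lxml",
    "pyOpenSSL": "OpenSSL",
    "Twisted": "twisted",
    "PyWin32": "win32api",
}


def missing_packages(prerequisites=PREREQUISITES):
    """Return the package names whose module cannot be found for import."""
    return [
        package
        for package, module in prerequisites.items()
        if find_spec(module) is None
    ]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Still to install:", ", ".join(missing))
    else:
        print("All prerequisites found; Scrapy can be installed with pip.")
```

The check only looks up whether each module can be located; it does not import anything or verify versions.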

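II. Generating a Scrapy Project

You decide where the project is generated. For example, I put mine in the scrapy_test folder on drive D.

The steps are as follows:

Press Win+R, type cmd in the Run dialog, and click OK to open a command prompt. Then enter the following commands one at a time to switch to the path you want (adjust them to your own setup). The first switches to drive D, and the second enters the scrapy_test folder on that drive:

d:
cd scrapy_test

Run scrapy startproject <project name> to create the project folder.

For example: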

scrapy startproject firstpro

Switch to the newly created folder:

cd firstpro

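Run scrapy genspider <spider name> <domain of the site to crawl> to create the spider file.

For example, with the spider named scenery (the site's domain, omitted here, follows the spider name):

scrapy genspider scenery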

At this point, the Scrapy project has been created.
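To confirm that both commands did their job, you can check that the generated files exist. The sketch below assumes the layout that recent Scrapy versions produce for a project named firstpro with a spider named scenery. The exact file list may differ between versions, and the helper missing_files is a name invented for this example.

```python
from pathlib import Path

# Files expected after `scrapy startproject firstpro` and
# `scrapy genspider scenery ...` (assumed layout; may vary by Scrapy version).
EXPECTED_FILES = [
    "scrapy.cfg",
    "firstpro/__init__.py",
    "firstpro/items.py",
    "firstpro/middlewares.py",
    "firstpro/pipelines.py",
    "firstpro/settings.py",
    "firstpro/spiders/__init__.py",
    "firstpro/spiders/scenery.py",
]


def missing_files(project_root, expected=EXPECTED_FILES):
    """Return the expected files that are not present under project_root."""
    root = Path(project_root)
    return [name for name in expected if not (root / name).is_file()]


if __name__ == "__main__":
    # Run from inside the outer firstpro folder (the one containing scrapy.cfg).
    print(missing_files("."))
```

An empty list means every expected file was found.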

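III. Scraping Wallpaper Image Links

1. Modifying the settings file

Open settings.py and make four changes (the line numbers refer to the generated file and may differ between Scrapy versions):

- Line 20: change the robots.txt setting.
- Line 28: change the download delay (it is commented out by default; uncommenting it gives 3 seconds, which is too long, so set it to 1 second).
- Line 40: add a request header.
- Line 66: enable an item pipeline.

The changes in detail: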

ROBOTSTXT_OBEY = False

DOWNLOAD_DELAY = 1

DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.94 Safari/537.36',
}

ITEM_PIPELINES = {
    'firstpro.pipelines.FirstproPipeline': 300,
}
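The number 300 in ITEM_PIPELINES is a priority: when several pipelines are enabled, Scrapy passes each item through them in ascending order of this value, which is conventionally chosen between 0 and 1000. The sketch below illustrates that ordering in plain Python. The second entry, CleanNamePipeline, is hypothetical and does not exist in this project, and run_order is a helper written only for this example.

```python
# ITEM_PIPELINES as it might look with a second pipeline enabled.
# 'firstpro.pipelines.CleanNamePipeline' is hypothetical, for illustration only.
ITEM_PIPELINES = {
    "firstpro.pipelines.FirstproPipeline": 300,
    "firstpro.pipelines.CleanNamePipeline": 100,
}


def run_order(pipelines):
    """Return pipeline paths sorted by priority value, lowest first."""
    return sorted(pipelines, key=pipelines.get)


if __name__ == "__main__":
    for path in run_order(ITEM_PIPELINES):
        print(ITEM_PIPELINES[path], path)
```

With these values the hypothetical cleaning pipeline would see each item before FirstproPipeline does, because 100 is lower than 300.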

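2. Writing the item file

Open items.py.

I want to scrape the name and link of each image, so I created two fields, name and link.

Field() essentially just creates a dictionary: scrapy.Field is a subclass of Python's dict.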

# Define here the models for your scraped items
#
# See documentation in:
# /en/latest/topics/items.html

import scrapy


class FirstproItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    name = scrapy.Field()  # image name
    link = scrapy.Field()  # image link

3、写爬虫文件

Open scenery.py (open your own spider file; mine is used as the example here).

import scrapy

from ..items import FirstproItem


class ScenerySpider(scrapy.Spider):
    name = 'scenery'
    allowed_domains = ['']
    start_urls = ['/4kfengjing/']  # starting URL
    page = 1

    def parse(self, response):
        # Each <li> under .clearfix holds one image
        for li in response.css('.clearfix li'):
            item = FirstproItem()  # a fresh item per image, so yielded items are not shared
            item['name'] = li.css('a img::attr(alt)').extract_first()  # image name
            item['link'] = li.css('a img::attr(src)').extract_first()  # image link
            yield item

        if self.page < 10:  # crawl 10 pages in total
            self.page += 1
            url = f'/4kfengjing/index_{self.page}.html'  # build the next page's URL
            yield scrapy.Request(url=url, callback=self.parse)  # parse the next page with this same method

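Building the URL

Page 2 link: /4kfengjing/index_2.html

Page 3 link: /4kfengjing/index_3.html

From the links for pages 2 and 3 it is easy to see that the only part that changes is the number after index_, and that it increases by 1 with each page.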
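The pattern can be sketched as a small function using only the standard library. Two assumptions are made here: the first page is the bare /4kfengjing/ path (as in the spider's start_urls), and BASE_URL is a placeholder, since the real site address is not given in this article. The names page_path and page_urls are invented for this example.

```python
from urllib.parse import urljoin

# Placeholder only: the real site address is not given in this article.
BASE_URL = "https://example.com"


def page_path(page):
    """Return the path of the given listing page (1-based)."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    if page == 1:
        return "/4kfengjing/"  # the first page has no index_N suffix
    return f"/4kfengjing/index_{page}.html"


def page_urls(last_page, base=BASE_URL):
    """Return absolute URLs for pages 1..last_page."""
    return [urljoin(base, page_path(p)) for p in range(1, last_page + 1)]


if __name__ == "__main__":
    for url in page_urls(3):
        print(url)
```

Building the full list up front is an alternative to incrementing a counter inside the spider; which one fits better depends on whether the number of pages is known in advance.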

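CSS selectors

Scrapy's selectors support CSS selectors, so I used them to locate elements. ::attr() gets an attribute's value, and extract_first() returns the first element of the result list (or None if nothing matched).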

4. Writing the pipelines file

Open pipelines.py.

In the pipeline we can process the extracted data. To keep things simple, I just print it.

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: /en/latest/topics/item-pipeline.html


# useful for handling different item types with a single interface
from itemadapter import ItemAdapter


class FirstproPipeline:
    def process_item(self, item, spider):
        print(item)
        return item
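Printing is enough for a first run, but a pipeline can just as easily save the items. The sketch below is one possible alternative, not part of the original project: a hypothetical JsonLinesPipeline that writes each item as one JSON object per line. The class name and the output file name scenery.jsonl are assumptions. open_spider, close_spider, and process_item are the hook methods Scrapy calls on a pipeline.

```python
import json


class JsonLinesPipeline:
    """Hypothetical alternative to FirstproPipeline: one JSON object per line."""

    def __init__(self, path="scenery.jsonl"):
        self.path = path  # output file name is an assumption for this sketch
        self.file = None

    def open_spider(self, spider):
        # Called once when the spider starts
        self.file = open(self.path, "w", encoding="utf-8")

    def close_spider(self, spider):
        # Called once when the spider finishes
        self.file.close()

    def process_item(self, item, spider):
        # dict(item) works for both scrapy.Item objects and plain dicts
        line = json.dumps(dict(item), ensure_ascii=False)
        self.file.write(line + "\n")
        return item  # pass the item on to any later pipeline
```

To try it, the class would go into pipelines.py and be registered in ITEM_PIPELINES under firstpro.pipelines.JsonLinesPipeline, in the same way as FirstproPipeline.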