**Solved: Python crawler error 10054, "An existing connection was forcibly closed by the remote host"**
Possible causes:
1. Network problems: check whether your own machine or the target website is having connectivity trouble.
2. Requesting pages too frequently in a short period of time.
3. The website detected non-human behavior and closed the connection.
Solutions:
The most effective safeguard is catching the exception with try/except!
1. Check whether the network is at fault; if it is, switch to a stable connection.
2. Set a download delay.
Add the following to settings.py:
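On Windows, error 10054 surfaces in Python 3 as a `ConnectionResetError`, so the try/except advice above can be wrapped into a small retry helper. This is a minimal sketch: `fetch` is a hypothetical stand-in for whatever function actually performs the request, and the simulated flaky fetch only exists to demonstrate the retry loop.

```python
import time


def fetch_with_retry(fetch, retries=3, delay=1.0):
    """Call fetch() and retry when the connection is reset (WinError 10054)."""
    for attempt in range(1, retries + 1):
        try:
            return fetch()
        except ConnectionResetError:
            if attempt == retries:
                raise  # out of retries, let the caller see the error
            time.sleep(delay)  # back off before retrying


# Simulated flaky fetch: raises twice, then succeeds.
calls = {"n": 0}


def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionResetError(10054, "remote host closed the connection")
    return "page content"


print(fetch_with_retry(flaky_fetch, delay=0))  # prints: page content
```

In a real spider you would pass your actual request function as `fetch` and keep a nonzero `delay`.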
```python
# Configure a delay for requests for the same website (default: 0)
# See /en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 3
DOWNLOAD_TIMEOUT = 60
```

`DOWNLOAD_DELAY = 3` makes the downloader wait 3 seconds between requests.
`DOWNLOAD_TIMEOUT = 60` sets a 60-second download timeout: some pages load very slowly, and this setting tells Scrapy to give up on any response that still has not arrived after 60 seconds.
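Outside of Scrapy, `DOWNLOAD_DELAY` amounts to sleeping between consecutive requests. A minimal sketch of the same idea, where `fetch` is a hypothetical stand-in for a real request function:

```python
import time


def polite_fetch_all(urls, fetch, delay=3.0):
    """Fetch each URL with a fixed pause in between (what DOWNLOAD_DELAY does in Scrapy)."""
    results = []
    for i, url in enumerate(urls):
        if i:
            time.sleep(delay)  # wait before every request after the first
        results.append(fetch(url))
    return results


# Dummy fetch and a tiny delay, just to show the call shape.
pages = polite_fetch_all(["a", "b", "c"], fetch=lambda u: "page:" + u, delay=0.01)
print(pages)  # prints: ['page:a', 'page:b', 'page:c']
```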
3. Set a User-Agent.
There are several ways to set the UA:
1) Add it directly in the spider:
```python
import scrapy

# COOKIES is assumed to be defined or imported elsewhere in the project.


class KespiderSpider(scrapy.Spider):
    name = 'kespider'
    allowed_domains = ['']
    headers = {
        'user-agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                      "(KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36",
    }
    cookies = COOKIES[0]
    industry_list = ["E_COMMERCE", "SOCIAL_NETWORK", ]
    phase_list = ["ANGEL", ]

    def start_requests(self):
        for industry in self.industry_list:
            for phase in self.phase_list:
                for i in range(15, 16):
                    try:
                        url = "/n/api/column/0/company?phase={}&industry={}&sortField=HOT_SCORE&p={}".format(
                            phase, industry, str(i))
                        print(url)
                        yield scrapy.Request(url=url, headers=self.headers,
                                             cookies=self.cookies,
                                             callback=self.parse, dont_filter=True)
                    except Exception as e:
                        print("Error:", e)
```
2) Add it in settings.py:

```python
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = '******(+)'
```
3) Use a downloader middleware. Scrapy's source code already ships a number of ready-made middlewares, including a User-Agent one.
Copy useragent.py (from scrapy/downloadermiddlewares/ in the Scrapy source) into your project's root directory.
Open useragent.py; you can see that by default it reads the UA from settings.py:
```python
"""Set User-Agent header per spider or use a default value from settings"""

from scrapy import signals


class UserAgentMiddleware(object):
    """This middleware allows spiders to override the user_agent"""

    def __init__(self, user_agent='Scrapy'):
        self.user_agent = user_agent

    @classmethod
    def from_crawler(cls, crawler):
        o = cls(crawler.settings['USER_AGENT'])
        crawler.signals.connect(o.spider_opened, signal=signals.spider_opened)
        return o

    def spider_opened(self, spider):
        self.user_agent = getattr(spider, 'user_agent', self.user_agent)

    def process_request(self, request, spider):
        if self.user_agent:
            request.headers.setdefault(b'User-Agent', self.user_agent)
```
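The key detail in that middleware is `setdefault`: it only fills in a User-Agent when the request does not already carry one, so headers set per-request in the spider take precedence over the middleware's default. A plain-dict illustration of the same semantics:

```python
# Dict stand-in for request.headers, showing why setdefault matters.
headers = {"User-Agent": "my-spider/1.0"}   # UA already set per-request
headers.setdefault("User-Agent", "Scrapy")  # middleware default is ignored
print(headers["User-Agent"])  # prints: my-spider/1.0

empty = {}
empty.setdefault("User-Agent", "Scrapy")    # no UA yet, so the default applies
print(empty["User-Agent"])  # prints: Scrapy
```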
To switch to a random UA, first install the fake-useragent library from the command line: `pip install fake-useragent`.
Then replace the code in useragent.py with the following (copy-paste is fine):
```python
"""Set a random User-Agent header on every request using fake-useragent"""

from fake_useragent import UserAgent


class RandomUserAgentMiddleware(object):
    """This middleware gives each request a randomly chosen user agent"""

    def __init__(self, crawler):
        super(RandomUserAgentMiddleware, self).__init__()
        self.user_agent = UserAgent()
        # RANDOM_UA_TYPE may be "random", "chrome", "firefox", etc.
        self.user_agent_type = crawler.settings.get("RANDOM_UA_TYPE", "random")

    @classmethod
    def from_crawler(cls, crawler):
        # Must return a middleware instance (the original returned `crawler`, a bug)
        return cls(crawler)

    def process_request(self, request, spider):
        def get_user_agent():
            return getattr(self.user_agent, self.user_agent_type)

        request.headers.setdefault("User-Agent", get_user_agent())
```
Finally, don't forget to enable the middleware in settings.py:

```python
DOWNLOADER_MIDDLEWARES = {
    # 'dangdang.middlewares.DangdangDownloaderMiddleware': 543,
    'dangdang.useragent.RandomUserAgentMiddleware': 544,
}
```
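If fake-useragent is unavailable (its UA database is fetched over the network), the same rotation idea can be done with a hardcoded pool. This is a sketch, not part of the original middleware; the pool contents are illustrative samples:

```python
import random

# A few sample desktop UA strings (illustrative, not exhaustive).
UA_POOL = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/12.1.1 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Firefox/68.0",
]


def random_user_agent(pool=UA_POOL):
    """Pick one UA string at random; a drop-in fallback for fake_useragent."""
    return random.choice(pool)


print(random_user_agent())
```

Inside the middleware you would call `random_user_agent()` in `process_request` instead of `getattr(self.user_agent, ...)`.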
Bonus:
Scrapy can save the crawl state automatically, so a job can be paused and resumed (handy alongside scrapy_redis): run the spider from the command line with a JOBDIR setting:

```
scrapy crawl somespider -s JOBDIR=crawls/somespider-1
```

Note: replace somespider with the name of your own spider.
Done!