
Scrapy crawler: error 10054, the remote host forcibly closed an existing connection

Posted: 2022-08-25 07:19:31



Fix: Python crawler error 10054, the remote host forcibly closed an existing connection

Possible causes:

1. Network problems.

Check whether the network issue is on your own machine or on the target website.

2. Pages are requested too frequently within a short period of time.

3. The website detects non-human behavior and closes the connection.

How to fix it:

The most effective safety net is exception handling with try/except (see the spider example and the errback sketch below).

1. Check whether the network itself is at fault; if it is, switch to a more stable connection.

2. Set a download delay.

Add the following to settings.py:

# Configure a delay for requests for the same website (default: 0)
# See /en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 3
DOWNLOAD_TIMEOUT = 60

DOWNLOAD_DELAY = 3

A 3-second delay between downloads.

DOWNLOAD_TIMEOUT = 60

A 60-second download timeout: some pages load very slowly, and this setting tells Scrapy to give up on any request that has not finished within 60 seconds.
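The comment block in settings.py also points at AutoThrottle, which adjusts the delay automatically based on how quickly the server responds. A minimal sketch of related options (the values here are illustrative, not from the original post):

# settings.py, optional throttling-related settings (illustrative values)
RANDOMIZE_DOWNLOAD_DELAY = True   # vary each delay between 0.5x and 1.5x of DOWNLOAD_DELAY
AUTOTHROTTLE_ENABLED = True       # let Scrapy tune the delay from observed latencies
AUTOTHROTTLE_START_DELAY = 3      # initial delay in seconds
AUTOTHROTTLE_MAX_DELAY = 60       # maximum delay when the server is slow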

3. Set the User-Agent (UA):

There are several ways to set the UA:

1) Add it directly in the spider:

import scrapy


class KespiderSpider(scrapy.Spider):
    name = 'kespider'
    allowed_domains = ['']
    headers = {
        'user-agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36",
    }
    cookies = COOKIES[0]  # COOKIES is defined elsewhere in the project
    industry_list = ["E_COMMERCE", "SOCIAL_NETWORK", ]
    phase_list = ["ANGEL", ]

    def start_requests(self):
        for industry in self.industry_list:
            for phase in self.phase_list:
                for i in range(15, 16):
                    try:
                        url = "/n/api/column/0/company?phase={}&industry={}&sortField=HOT_SCORE&p={}".format(
                            phase, industry, str(i))
                        print(url)
                        yield scrapy.Request(url=url, headers=self.headers, cookies=self.cookies,
                                             callback=self.parse, dont_filter=True)
                    except Exception as e:
                        print("Error:", e)
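One caveat: the try/except above only catches errors raised while the request object is being built; the download itself runs later inside Scrapy's downloader, so a connection reset (10054) will not surface there. To react to download failures you can also attach an errback to the request. A minimal sketch of the two methods, to be placed inside the spider above (the method name on_error is my own, not from the original post):

    def start_requests(self):
        # ... build url, headers and cookies exactly as in the spider above ...
        yield scrapy.Request(url=url, headers=self.headers, cookies=self.cookies,
                             callback=self.parse, errback=self.on_error, dont_filter=True)

    def on_error(self, failure):
        # failure wraps the download error, e.g. a connection reset or a timeout
        self.logger.error("Request failed: %s", repr(failure))

Scrapy's built-in RetryMiddleware also retries failed downloads a couple of times by default; the RETRY_TIMES setting controls how many attempts are made.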

2) Add it in settings.py:

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = '******(+)'
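The value above is only a masked placeholder from the project template; in practice you would set it to a real browser UA string, for example the one already used in the spider (a sketch):

USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36"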

3) Use a downloader middleware. Scrapy itself ships with many built-in middlewares; in the Scrapy source they live in the scrapy/downloadermiddlewares/ directory, and the UA one is useragent.py.

Copy useragent.py into your own project package (the directory that contains settings.py).

Open useragent.py and you can see that, by default, it reads the UA from settings.py:

"""Set User-Agent header per spider or use a default value from settings"""from scrapy import signalsclass UserAgentMiddleware(object):"""This middleware allows spiders to override the user_agent"""def __init__(self, user_agent='Scrapy'):self.user_agent = user_agent@classmethoddef from_crawler(cls, crawler):o = cls(crawler.settings['USER_AGENT'])crawler.signals.connect(o.spider_opened, signal=signals.spider_opened)return odef spider_opened(self, spider):self.user_agent = getattr(spider, 'user_agent', self.user_agent)def process_request(self, request, spider):if self.user_agent:request.headers.setdefault(b'User-Agent', self.user_agent)

To use a random UA, first install the fake-useragent library from the command line: pip install fake-useragent.

Then replace the code in useragent.py; copying and pasting the version below is the easiest way.

"""Set User-Agent header per spider or use a default value from settings"""from scrapy import signalsfrom fake_useragent import UserAgentclass RandomUserAgentMiddleware(object):"""This middleware allows spiders to override the user_agent"""def __init__(self, crawler):super(RandomUserAgentMiddleware, self).__init__()self.user_agent = UserAgent()self.user_agent_type = crawler.settings.get("RANDOM_UA_TYPE", "random")@classmethoddef from_crawler(cls, crawler):return crawlerdef process_request(self, request, spider):def get_user_agent():return getattr(self.user_agent, self.user_agent_type)request.headers.setdefault("User-Agent", get_user_agent())

Finally, do not forget to enable the middleware in settings.py:

DOWNLOADER_MIDDLEWARES = {
    # 'dangdang.middlewares.DangdangDownloaderMiddleware': 543,
    'dangdang.useragent.RandomUserAgentMiddleware': 544,
}
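Since the middleware reads RANDOM_UA_TYPE from the settings and falls back to "random", you can optionally pin the browser family in settings.py as well. A sketch, assuming the middleware code shown above:

RANDOM_UA_TYPE = "chrome"   # or "random", "firefox", ...; read by RandomUserAgentMiddleware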

Extra tip:

To persist the crawler's state (for example when running long jobs with scrapy_redis), run the spider from the command line with the JOBDIR argument; Scrapy will then save its state automatically:

scrapy crawl somespider -s JOBDIR=crawls/somespider-1

Note: replace somespider with the name of your own spider. Running the same command again with the same JOBDIR resumes an interrupted crawl.

Done!
