Crawling all posts and comments from a Discuz forum board
Discuz is an open-source forum application written in PHP.
URL of the page to crawl:
Create the project
scrapy startproject discuz
C:\Users\PeiJingbo\Desktop\discuz>scrapy startproject discuz
New Scrapy project discuz, using template directory c:\program files\python37\lib\site-packages\scrapy\templates\project, created in:
C:\Users\PeiJingbo\Desktop\discuz\discuz
You can start your first spider with:
cd discuz
scrapy genspider example
C:\Users\PeiJingbo\Desktop\discuz>
cd discuz
Create the spider
scrapy genspider discuz_spider discuz.net
C:\Users\PeiJingbo\Desktop\discuz\discuz>scrapy genspider discuz_spider discuz.net
Created spider discuz_spider using template basic in module:
discuz.spiders.discuz_spider
Open the project
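Open the directory that the startproject command generated. If you open the directory one level below it instead, module imports will no longer resolve.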
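The import problem can be reproduced outside any editor. The sketch below is only an illustration. It assumes the editor adds the opened directory to Python's module search path, and it builds a bare-bones stand-in for the generated tree in a temporary directory rather than running Scrapy. The helper names make_layout and can_import are hypothetical.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def make_layout(base: Path) -> Path:
    """Create a minimal stand-in for the tree `scrapy startproject discuz` generates."""
    root = base / "discuz"                 # outer directory: the project root
    spiders = root / "discuz" / "spiders"  # inner package and its spiders subpackage
    spiders.mkdir(parents=True)
    (root / "scrapy.cfg").write_text("")
    (root / "discuz" / "__init__.py").write_text("")
    (spiders / "__init__.py").write_text("")
    (spiders / "discuz_spider.py").write_text("name = 'discuz_spider'\n")
    return root


def can_import(module: str, source_root: Path) -> bool:
    """Report whether `module` imports with `source_root` on the search path."""
    # Forget earlier imports so each attempt starts from a clean state.
    for cached in [m for m in sys.modules if m == "discuz" or m.startswith("discuz.")]:
        del sys.modules[cached]
    importlib.invalidate_caches()
    sys.path.insert(0, str(source_root))
    try:
        importlib.import_module(module)
        return True
    except ImportError:
        return False
    finally:
        sys.path.remove(str(source_root))


with tempfile.TemporaryDirectory() as tmp:
    root = make_layout(Path(tmp))
    # Opened at the directory startproject created: the package is found.
    from_root = can_import("discuz.spiders.discuz_spider", root)
    # Opened one level too deep: there is no `discuz` package below this path.
    from_inner = can_import("discuz.spiders.discuz_spider", root / "discuz")

print(from_root, from_inner)
```

With the outer directory on the search path the discuz package is visible. With the inner directory, Python looks for a discuz package inside discuz\discuz, finds none, and the import fails.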
Modify the configuration
settings.py
ROBOTSTXT_OBEY = False  # do not obey robots.txt
DEFAULT_REQUEST_HEADERS = {  # default request headers
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'user-agent': 'Mozilla/