
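Scraping the Visible Data of Douban's Top 250 Movies with Python and Saving It as an Excel File

Date: 2022-04-01 08:28:48

The third-party requests library fetches the page elements, and the openpyxl library then writes the collected information to a spreadsheet.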

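Approach

1. Analyze the page's request headers.

2. Analyze the page's JS.

3. Access the page with a simulated user agent, using a GET request.

4. Work out the XPath expressions for the page, then loop in Python to collect the extracted data into a list.

5. Loop over that list and write each entry to a spreadsheet with openpyxl.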
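Step 4 can be tried offline before touching the real site. The following is a minimal sketch under stated assumptions: it uses the standard library's xml.etree.ElementTree (which supports only a subset of XPath) in place of lxml, and the markup, class names and the helper extract_rows are invented for illustration, not Douban's actual page structure.

```python
import xml.etree.ElementTree as ET

# Hand-written, well-formed stand-in for a result list (NOT real Douban markup).
SAMPLE = """
<ol>
  <li><em>1</em><span class="title">Movie A</span><span class="quote">Quote A</span></li>
  <li><em>2</em><span class="title">Movie B</span></li>
</ol>
""".strip()


def extract_rows(markup):
    """Return one list per <li>: [rank, title] plus the quote when present."""
    root = ET.fromstring(markup)
    rows = []
    for li in root.findall("./li"):
        rank = li.findtext("./em")
        title = li.findtext("./span[@class='title']")
        quote = li.findtext("./span[@class='quote']")  # None when the item has no quote
        row = [rank, title]
        if quote is not None:
            row.append(quote)
        rows.append(row)
    return rows


rows = extract_rows(SAMPLE)
for row in rows:
    print(row)
```

Querying each list item separately, rather than pulling one flat list per field, keeps an item's fields together even when an optional field such as the quote is missing.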

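Analyzing the headers and requesting the page

In the Network tab of the browser's developer tools, click any green bar to bring up the list of requested files, then click any one of them to inspect its request headers.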

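import requests

# The User-Agent value is left blank in the original; paste your own browser's string here.
headers = {'User-Agent': ''}


def getDb_text(url, headers):
    """Fetch a page with the given headers and return its HTML as text."""
    response = requests.get(url, headers=headers)
    response.encoding = 'utf-8'
    return response.text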
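To see what a request carrying a custom User-Agent looks like without sending anything, here is a minimal sketch that uses the standard library's urllib.request instead of requests. The URL, the User-Agent string and the helper build_request are placeholders chosen for illustration.

```python
from urllib.request import Request


def build_request(url, user_agent):
    """Build (but do not send) a GET request with a custom User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})


# Placeholder values; substitute the real page URL and your browser's User-Agent.
req = build_request("https://example.com/top250?start=0&filter=",
                    "Mozilla/5.0 (placeholder)")
print(req.get_method(), req.full_url)
print(req.header_items())
```

Note that urllib stores header names in capitalized form, so the header is looked up as "User-agent" on the request object.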


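Core code

A loop visits each page in turn and adds its visible data to one master list; the per-page lists are rebuilt on every iteration.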

for start in range(0, 226, 25):  # start = 0, 25, ..., 225: ten pages of 25 films
    # The site host is missing from the URL in the original; prepend it before running.
    url = '/top250?start=' + str(start) + '&filter='
    Dbhtml = etree.HTML(getDb_text(url, headers))
    DbNo = Dbhtml.xpath('//*[@id="content"]/div/div[1]/ol/li/div/div[1]/em/text()')  # rank
    Dbname = Dbhtml.xpath('//*[@id="content"]/div/div[1]/ol/li/div/div[2]/div[1]/a/span[1]/text()')  # title
    Dbpf = Dbhtml.xpath('//*[@id="content"]/div/div[1]/ol/li/div/div[2]/div[2]/div/span[2]/text()')  # rating
    Dbpeople = Dbhtml.xpath('//*[@id="content"]/div/div[1]/ol/li/div/div[2]/div[2]/div/span[4]/text()')  # number of ratings
    for i in range(len(DbNo)):
        list_a = [DbNo[i], Dbname[i], Dbpf[i], Dbpeople[i]]
        # Recommendation quote of the (i+1)-th film; an empty list when the film has none
        Tjy1 = '//*[@id="content"]/div/div[1]/ol/li[' + str(i + 1) + ']/div/div[2]/div[2]/p[2]/span/text()'
        Tjy2 = Dbhtml.xpath(Tjy1)
        list_a = list_a + Tjy2
        print(list_a)
        list_2.append(list_a)  # list_2 is the master list created in the complete code
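The paging arithmetic in the loop can be checked in isolation. This is a small sketch, assuming a hypothetical helper page_urls and a placeholder host; only the path and query string come from the article.

```python
def page_urls(base, total=250, page_size=25):
    """Build one URL per result page; start runs 0, 25, ..., total - page_size."""
    return [f"{base}/top250?start={start}&filter="
            for start in range(0, total, page_size)]


# "https://example.com" is a placeholder host, not the real site address.
urls = page_urls("https://example.com")
print(len(urls))
print(urls[0])
print(urls[-1])
```

With the defaults this produces ten URLs, the same offsets that range(0, 226, 25) yields.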

Complete code

import requests
from lxml import etree
from openpyxl import Workbook


# Write all collected rows to the main sheet
def writer():
    ws1 = wb.create_sheet('sheet1', index=0)
    ws1.column_dimensions['B'].width = 25
    ws1.column_dimensions['D'].width = 23
    ws1.column_dimensions['E'].width = 100
    ws1['A1'] = 'Rank'
    ws1['B1'] = 'Title'
    ws1['C1'] = 'Rating'
    ws1['D1'] = 'Number of ratings'
    ws1['E1'] = 'Recommendation quote'
    ws1['F1'] = 'Recommendation index'  # header only; main() collects no data for this column
    for iu in list_2:
        ws1.append(iu)


# Write the rows that have no recommendation quote to a second sheet
def writer1():
    ws2 = wb.create_sheet('no_desc_movie', index=0)
    ws2.column_dimensions['B'].width = 25
    ws2.column_dimensions['D'].width = 23
    ws2['A1'] = 'Rank'
    ws2['B1'] = 'Title'
    ws2['C1'] = 'Rating'
    ws2['D1'] = 'Number of ratings'
    for anyone in list_2:
        if len(anyone) < 5:  # fewer than five fields means the quote is missing
            ws2.append(anyone)


# Request headers; the User-Agent value is left blank in the original
headers = {'User-Agent': ''}


def getDb_text(url, headers):
    """Fetch a page with the given headers and return its HTML as text."""
    response = requests.get(url, headers=headers)
    response.encoding = 'utf-8'
    return response.text


# Collect the film data from all ten pages
def main():
    for start in range(0, 226, 25):
        # The site host is missing from the URL in the original; prepend it before running.
        url = '/top250?start=' + str(start) + '&filter='
        Dbhtml = etree.HTML(getDb_text(url, headers))
        DbNo = Dbhtml.xpath('//*[@id="content"]/div/div[1]/ol/li/div/div[1]/em/text()')  # rank
        Dbname = Dbhtml.xpath('//*[@id="content"]/div/div[1]/ol/li/div/div[2]/div[1]/a/span[1]/text()')  # title
        Dbpf = Dbhtml.xpath('//*[@id="content"]/div/div[1]/ol/li/div/div[2]/div[2]/div/span[2]/text()')  # rating
        Dbpeople = Dbhtml.xpath('//*[@id="content"]/div/div[1]/ol/li/div/div[2]/div[2]/div/span[4]/text()')  # number of ratings
        for i in range(len(DbNo)):
            list_a = [DbNo[i], Dbname[i], Dbpf[i], Dbpeople[i]]
            # Recommendation quote of the (i+1)-th film; an empty list when the film has none
            Tjy1 = '//*[@id="content"]/div/div[1]/ol/li[' + str(i + 1) + ']/div/div[2]/div[2]/p[2]/span/text()'
            Tjy2 = Dbhtml.xpath(Tjy1)
            list_a = list_a + Tjy2
            print(list_a)
            list_2.append(list_a)


if __name__ == '__main__':
    excel_path = ' '  # path where the Excel file is saved (left blank in the original)
    wb = Workbook()  # create the workbook
    list_2 = []      # master list of rows shared by main(), writer() and writer1()
    main()
    writer()
    writer1()
    wb.save(excel_path)
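The script stores every field as raw text and detects a missing quote by row length. As a possible refinement, the sketch below converts the fields to numbers and splits the rows the same way writer1 does. The sample rows are invented, the helpers clean_row and split_by_quote are hypothetical, and the "digits followed by a text suffix" shape of the count field is an assumption about the scraped text.

```python
import re


def clean_row(row):
    """Convert rank, rating and count to numbers; use '' for a missing quote."""
    rank, title, rating, votes = row[:4]
    quote = row[4] if len(row) > 4 else ""
    digits = re.sub(r"\D", "", votes)  # keep only the digits of the count text
    return [int(rank), title, float(rating), int(digits) if digits else None, quote]


def split_by_quote(rows):
    """Separate rows that carry a quote (5+ fields) from those that do not."""
    with_quote = [r for r in rows if len(r) >= 5]
    without_quote = [r for r in rows if len(r) < 5]
    return with_quote, without_quote


# Invented sample rows in the same shape as the entries of list_2.
sample = [
    ["1", "Movie A", "9.7", "1234人评价", "Quote A"],
    ["2", "Movie B", "9.6", "567人评价"],
]
with_quote, without_quote = split_by_quote(sample)
cleaned = [clean_row(r) for r in sample]
print(cleaned)
```

Numeric cells let the spreadsheet sort and filter by rating or count, which text cells do not do reliably.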

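Summary

This article is not for commercial use; it simply records the code I wrote while learning.

Writing a Python crawler is not hard, but you must respect the HTTP protocol and fetch only the resources you need, in moderation, without harming the network environment.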
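One simple way to crawl in moderation is to pause between requests. The following is a minimal sketch, not part of the original script: fetch_all is a hypothetical helper, the one-second default delay is an arbitrary assumption, and the fetch function is passed in so the loop can be tried without any network access.

```python
import time


def fetch_all(urls, fetch, delay=1.0, sleep=time.sleep):
    """Call fetch(url) for each URL, sleeping `delay` seconds between calls."""
    pages = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay)  # wait before every request except the first
        pages.append(fetch(url))
    return pages


# Stand-in fetch function (str.upper) so the example runs offline.
pages = fetch_all(["a", "b", "c"], fetch=str.upper, delay=0.0)
print(pages)
```

In the script above, getDb_text with the headers bound in could play the role of fetch.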
