Web Scraping: Fetching the Thread List from the 52 Forum

Time: 2024-07-29 05:03:00

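1. Preface

I've had a bit of free time these past two weeks, and I'm quite interested in web scraping. Python and PyCharm were already installed, so I just got started.

Starting from what I actually wanted: at first I planned to write an image scraper to download some nice anime pictures, which would be really useful. I had already collected the site and image URLs, only to find that downloading requires logging in. Handling login is still a bit beyond me, so I shelved that for now and looked for something simpler: scraping some text.

Incidentally, the requests and BeautifulSoup libraries both seem to give me only the page source rather than the page elements, and the source often looks different from the elements. I'd really like to know how to extract the elements, that is, what F12 shows. For now I have to go by the source.

I browse the 52 forum now and then to see whether any useful tools have been posted, so I thought about scraping thread titles and links in bulk and putting them into a spreadsheet. I looked at the page source, and the structure is fairly clean.

The main sections on the home page are 新鲜出炉 (New Posts), 技术分享 (Tech Sharing), 人气热门 (Popular), and 精华采撷 (Featured).

I won't post screenshots; they would get me in trouble.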

2. Code

```python
# Environment: Python 3.8.5 + PyCharm .1.4 (Community Edition)
# Remember to change the URL (the "xxxxxxx" placeholder) before running!
import os    # check for and open the output file
import re    # regular expressions
import time  # delays between requests

import openpyxl  # write the results to a spreadsheet
import requests  # fetch the pages
from openpyxl.styles import Alignment, Font   # cell formatting
from openpyxl.utils import get_column_letter  # column number -> column letter

# Things you may need to change: the Cookie, the output path, the forum URL.
BASE_URL = 'xxxxxxx'  # placeholder: replace with the forum's full address

HEADERS = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/103.0.5060.134 Safari/537.36 Edg/103.0.1264.71',
    'Cookie': 'htVC_2132_saltkey=c5l7HvVS; htVC_2132_lastvisit=1658727423; '
              'wzws_cid=b9aba2f4673e085e665772733d04f8f46c133784f5725df85be854e0b7fe8e61873a747a11841693c1c14ee847ff3c2db5a45228b783f31332c723dde5db273580fd7af6f83df369943e1c6bc3be579b; '
              'htVC_2132_lastact=1658742319%09forum.php%09; '
              'Hm_lvt_46d556462595ed05e05f009cdafff31a=1658453541,1658454628,1658734624,1658744492; '
              'Hm_lpvt_46d556462595ed05e05f009cdafff31a=1658744492',
}

# Group 1 is the thread href, group 2 is the thread title.
SEARCH_TAG = r'<a href="(.*?)" target="_blank" class="xst".*?>(.*?)</a>'


def RequestWeb_52(section_name, main_url, pages):
    """Fetch `pages` pages of one section.

    Returns a list of [title, post_id, url] entries.
    """
    print('---------- Scraping section {} ----------'.format(section_name))
    time.sleep(1)
    post_info = []  # grows as needed, so there is no fixed page limit
    for page in range(pages):
        time.sleep(1)
        web_url = main_url + '&page=' + str(page + 1) + '.html'
        r = requests.get(web_url, headers=HEADERS)
        # r.encoding = r.apparent_encoding
        r.encoding = 'gbk'
        web_code = r.text
        # Each match is an (href, title) tuple.
        matches = re.findall(SEARCH_TAG, web_code)
        for href, title in matches:
            title = title.replace('&amp;', '&')
            href = href[:href.find('.html') + 5]
            # Assumes hrefs of the form thread-<id>-1-1.html
            post_id = href[href.find('thread-') + 7:href.find('.html') - 4]
            post_info.append([title, post_id, BASE_URL + '/' + href])
        print('Fetched page {}: {} posts.'.format(page + 1, len(matches)))

    # Remove duplicates while keeping the original order.
    repeat = 0
    unique = []
    for entry in post_info:
        if entry in unique:
            repeat += 1
        else:
            unique.append(entry)
    # Sanity check: drop entries whose ID does not appear in their URL.
    # Building a new list avoids deleting from a list while indexing into it.
    checked = [entry for entry in unique if entry[1] in entry[2]]
    repeat += len(unique) - len(checked)
    print('Removed {} duplicates; section {} yielded {} posts in total.'
          .format(repeat, section_name, len(checked)))
    return checked


def OutputExcel(sheet_title, post_info, output_doc):
    """Write the list returned by RequestWeb_52 to a worksheet in output_doc."""
    if not os.path.exists(output_doc):
        wb = openpyxl.Workbook()
        ws = wb.create_sheet(sheet_title)
        del wb['Sheet']
    else:
        wb = openpyxl.load_workbook(output_doc)
        ws = wb.create_sheet(sheet_title)

    centered = Alignment(horizontal='center', vertical='center')
    # Header row
    header_font = Font(name='微软雅黑', size=18, color='000000', bold=True)
    for col, text in enumerate(('Title', 'Link'), start=1):
        cell = ws.cell(1, col, value=text)
        cell.font = header_font
        cell.alignment = centered

    # Data rows: title in column A, post ID in column B.
    body_font = Font(name='微软雅黑', size=12, color='000000')
    body_align = Alignment(horizontal='left', vertical='center')
    for row, (title, post_id, _url) in enumerate(post_info, start=2):
        for col, value in ((1, title), (2, post_id)):
            cell = ws.cell(row, col, value=value)
            cell.font = body_font
            cell.alignment = body_align

    # Auto-fit the column widths.
    # Reference: /qq_33704787/article/details/124722917
    dims = {}  # column number -> widest cell seen so far
    for row in ws.rows:
        for cell in row:
            if cell.value:
                text = str(cell.value)
                # Chinese characters are wider than Latin ones, so each one
                # adds an extra 1.1 on top of its base length of 1.
                # [\u4e00-\u9fa5] matches most Chinese characters.
                cell_len = 1.1 * len(re.findall('([\u4e00-\u9fa5])', text)) + len(text)
                dims[cell.column] = max(dims.get(cell.column, 0), cell_len)
    for col, value in dims.items():
        # The +5 is padding to tune the final look.
        ws.column_dimensions[get_column_letter(col)].width = value + 5

    # Turn column B into hyperlinks that display the post ID.
    link_font = Font(name='微软雅黑', size=12, underline='single', color='0000ff')
    for row, (_title, post_id, url) in enumerate(post_info, start=2):
        cell = ws.cell(row, 2)
        cell.value = '=HYPERLINK("{}", "{}")'.format(url, post_id)
        cell.font = link_font
        cell.alignment = centered
    wb.save(output_doc)


if __name__ == "__main__":
    # To append the scrape time to the file name:
    # current_time = time.strftime('%Y%m%d_%H%M%S', time.localtime(time.time()))
    # output_doc = 'D:\\52_' + current_time + '.xlsx'
    output_doc = 'D:\\52.xlsx'
    # Sections to scrape: name -> [path, number of pages]
    section_52 = {
        '人气热门': ['/forum.php?mod=guide&view=hot', 3],
        '技术分享': ['/forum.php?mod=guide&view=tech', 5],
        '新鲜出炉': ['/forum.php?mod=guide&view=newthread', 8],
        '精华采撷': ['/forum.php?mod=guide&view=digest', 2],
    }
    for name, (path, pages) in section_52.items():
        post_info = RequestWeb_52(name, BASE_URL + path, pages)
        OutputExcel(name, post_info, output_doc)
    print('---------- Done. The output file has been created. ----------')
    os.startfile(output_doc)  # Windows only
```

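At first I only scraped the 人气热门 (Popular) section, and later extended it a bit.

Let me walk through the approach step by step.

First, requests fetches the page source, and a regular expression passed to re.findall pulls out the useful information: the thread titles and thread links.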
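The extraction step can be tried offline. The snippet below is a minimal sketch that applies the same pattern to a made-up fragment of list markup; the sample HTML and the names SAMPLE_HTML and THREAD_PATTERN are assumptions for illustration, and the real page source may differ.

```python
import re

# Made-up sample of forum list markup; the real page source may differ.
SAMPLE_HTML = '''
<th><a href="thread-1662751-1-1.html" target="_blank" class="xst">Disk xxxxxx</a></th>
<th><a href="thread-1660647-1-1.html" target="_blank" class="xst" style="color: red">Deep clean xxxxxx</a></th>
<th><a href="forum-2-1.html" target="_blank">Not a thread title</a></th>
'''

# Same pattern as in the script: group 1 is the href, group 2 is the link text.
THREAD_PATTERN = re.compile(
    r'<a href="(.*?)" target="_blank" class="xst".*?>(.*?)</a>'
)

# findall returns one (href, title) tuple per matching anchor.
matches = THREAD_PATTERN.findall(SAMPLE_HTML)
for href, title in matches:
    print(href, '|', title)
```

The third anchor has no class="xst" attribute, so it is skipped. Note that the pattern depends on the exact attribute order in the markup, which is the price of using a regex instead of a parser.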

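I also tried the BeautifulSoup library. It works from the page source too: a tag lookup such as bf.find_all('div', class_ = 'bookname') grabs several lines of text around the useful information, and I would still have to clean that up with a regex. So why not just use re.findall directly? It pulls in less junk.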
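For comparison, a tag-based parser can also return the link and title directly when the lookup targets the anchor itself rather than its container. The sketch below uses the standard library's html.parser as a stand-in for BeautifulSoup so that it runs without extra packages; the ThreadLinkParser class and the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class ThreadLinkParser(HTMLParser):
    """Collect (href, title) pairs from <a> tags whose class includes 'xst'."""

    def __init__(self):
        super().__init__()
        self.links = []    # finished (href, title) pairs
        self._href = None  # href of the anchor currently being read, if any
        self._text = []    # text fragments seen inside that anchor

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get('class') or '').split()
        if tag == 'a' and 'xst' in classes:
            self._href = attrs.get('href')
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._href is not None:
            self.links.append((self._href, ''.join(self._text).strip()))
            self._href = None


# Made-up sample markup; the real page source may differ.
SAMPLE_HTML = (
    '<th><a href="thread-1662751-1-1.html" target="_blank" class="xst">Disk &amp; xxxxxx</a></th>'
    '<th><a href="forum-2-1.html" target="_blank">Not a thread title</a></th>'
    '<th><a href="thread-1660647-1-1.html" class="s xst">Deep clean xxxxxx</a></th>'
)

parser = ThreadLinkParser()
parser.feed(SAMPLE_HTML)
parser.close()
print(parser.links)
```

Compared with the regex, this does not care about attribute order, and the parser decodes entities such as &amp;amp; on its own. It is more code, though, so for a page this regular the regex remains a reasonable choice.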

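Initially, post_info holds something like [('thread-1662751-1-1.html', '硬盘xxxxxx'), ('thread-1660647-1-1.html', '深度清理xxxxxx')]. As you can see, element [0] is the partial thread link: add the site prefix and you get the full link, strip everything else and you get the thread ID. Element [1] is the thread title.

After processing, post_info looks like this: [['硬盘xxxxxx', '1662751', '/thread-1662751-1-1.html'], ['深度清理xxxxxx', '1660647', '/thread-1660647-1-1.html']].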
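The conversion from raw tuples to [title, ID, link] entries can be sketched as a small standalone function. This is an illustration rather than the script's exact code: the function name normalize is hypothetical, https://example.com is a placeholder prefix, and the ID is taken with a regex instead of string slicing.

```python
import re
from html import unescape

BASE_URL = 'https://example.com'  # placeholder prefix, not the real forum address


def normalize(raw_posts, base_url=BASE_URL):
    """Turn (href, title) tuples into [title, post_id, url] lists."""
    rows = []
    for href, title in raw_posts:
        found = re.search(r'thread-(\d+)', href)
        if found is None:
            continue  # not a thread link, skip it
        # unescape() turns entities such as &amp; back into plain characters.
        rows.append([unescape(title), found.group(1), base_url + '/' + href])
    return rows


raw = [
    ('thread-1662751-1-1.html', 'Disk &amp; xxxxxx'),
    ('thread-1660647-1-1.html', 'Deep clean xxxxxx'),
    ('forum-2-1.html', 'Not a thread'),
]
rows = normalize(raw)
for row in rows:
    print(row)
```

Taking the ID with a regex avoids relying on the "-1-1" suffix always being four characters long.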

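Each section spans several pages, and after scraping them all there are duplicate threads. I don't know why, but either way a deduplication step is needed.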
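One way to deduplicate while keeping the original order is to track the thread IDs already seen in a set. This is a sketch under the assumption that the ID at index 1 uniquely identifies a thread; the script itself compares whole entries instead, and the name dedupe_posts is hypothetical.

```python
def dedupe_posts(posts):
    """Return (unique_posts, removed_count), keeping first occurrences in order."""
    seen_ids = set()
    unique = []
    for post in posts:
        post_id = post[1]  # entries are [title, post_id, url]
        if post_id in seen_ids:
            continue
        seen_ids.add(post_id)
        unique.append(post)
    return unique, len(posts) - len(unique)


posts = [
    ['Disk xxxxxx', '1662751', '/thread-1662751-1-1.html'],
    ['Deep clean xxxxxx', '1660647', '/thread-1660647-1-1.html'],
    ['Disk xxxxxx', '1662751', '/thread-1662751-1-1.html'],
]
unique_posts, removed = dedupe_posts(posts)
print(removed, unique_posts)
```

A set lookup keeps the cost linear in the number of posts, whereas checking membership in a growing list is quadratic. For a few hundred threads either approach is fast enough.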

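With the titles, IDs, and links all processed, the list is handed to the spreadsheet step for output.

There is nothing special about the spreadsheet step: add a header row, write the entries one by one, set hyperlinks for the URLs, and adjust the styles, formatting, and column widths so the result is a bit nicer to look at and use. Thanks to the author of the column-width snippet; it works well.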
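The column-width rule and the hyperlink formula can be checked without openpyxl. The sketch below reproduces the arithmetic on plain Python lists; the function names are hypothetical, and the 1.1 and 5 constants are the ones used in the script.

```python
import re

CJK = re.compile(r'[\u4e00-\u9fa5]')  # matches most Chinese characters


def display_width(value, cjk_extra=1.1, padding=5):
    """Estimate a column width: text length, extra room per Chinese character, padding."""
    text = str(value)
    return len(text) + cjk_extra * len(CJK.findall(text)) + padding


def column_widths(rows):
    """Map 1-based column numbers to the widest estimate in that column."""
    widths = {}
    for row in rows:
        for col, value in enumerate(row, start=1):
            if value:
                widths[col] = max(widths.get(col, 0), display_width(value))
    return widths


def hyperlink_formula(url, label):
    """Build the spreadsheet formula that shows `label` and links to `url`."""
    return '=HYPERLINK("{}", "{}")'.format(url, label)


table = [['Title', 'Link'], ['硬盘xxxxxx', '1662751']]
widths = column_widths(table)
print(widths)
print(hyperlink_formula('https://example.com/thread-1662751-1-1.html', '1662751'))
```

In the script these widths are assigned to ws.column_dimensions[...].width, and the formula string is written as the cell value. The example.com address is a placeholder.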

Run log: [screenshot]

Generated spreadsheet: [screenshot]