Using BeautifulSoup in Python to scrape the main content of a web page and export it as a PDF

Time: 2019-05-08 16:07:35

1. First, install the required modules by running the following commands in order:

    pip install requests
    pip install html5lib
    pip install beautifulsoup4

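Note that installing requests also downloads its dependencies. When this was written, the first command installed the following packages: certifi-.10.15 chardet-3.0.4 requests-2.20.0 urllib3-1.24.1. The second command installed: html5lib-1.0.1 webencodings-0.5.1. For the third command, pay attention to the package name: writing it as pip install beautifulsoup makes the installation fail with an error, so use beautifulsoup4. Once all three commands have finished, the environment is ready.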

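2. Import the modules. html5lib does not need to be imported, but it must be installed, otherwise the code below raises an error at runtime. Besides pdfkit, only two modules need to be imported (for pdfkit, see the introduction in my previous post, /u012561176/article/details/83655247; it is not covered again here). The code is as follows: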

    import pdfkit
    import requests
    from bs4 import BeautifulSoup
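Because html5lib is never imported directly, a missing installation only surfaces later, when BeautifulSoup is asked to use it as a parser. The following is a minimal sketch of an up-front check that uses only the standard library. The helper name missing_modules and the REQUIRED list are hypothetical and not part of the original code; note that the beautifulsoup4 package is imported under the name bs4.

```python
import importlib.util


def missing_modules(names):
    """Return the names from `names` that cannot be located for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Import names used in this article (assumed list, for illustration only).
# beautifulsoup4 is installed under that name but imported as "bs4".
REQUIRED = ["requests", "html5lib", "bs4", "pdfkit"]

if __name__ == "__main__":
    missing = missing_modules(REQUIRED)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are installed.")
```

Running such a check once at startup gives a clearer message than a parser error raised in the middle of a request.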

3. The code implemented in the view is given below:

    from django.http import FileResponse  # used below; not among the imports listed in step 2


    def export_pdf(request):
        # Path to the local wkhtmltopdf executable
        config = pdfkit.configuration(
            wkhtmltopdf='D:\\SoftWare\\wkhtmltopdf\\bin\\wkhtmltopdf.exe')
        options = {
            'page-size': 'A3',
            'margin-top': '0.75in',
            'margin-right': '0.75in',
            'margin-bottom': '0.75in',
            'margin-left': '0.75in',
            'encoding': "UTF-8",
            'no-outline': None,
            'custom-header': [('Accept-Encoding', 'gzip')],
            'cookie': [
                ('cookie-name1', 'cookie-value1'),
                ('cookie-name2', 'cookie-value2'),
            ],
            'outline-depth': 10,
        }

        # Fetch the page and keep only the first element with class "content"
        url = 'http://127.0.0.1:8000/test/6/'
        response = requests.get(url)
        soup = BeautifulSoup(response.content, "html5lib")
        body = soup.find_all(class_="content")[0]
        html = str(body)

        # The file is opened in binary mode, so write UTF-8 encoded bytes
        with open("content.html", 'wb') as f:
            f.write(html.encode(encoding="utf-8"))

        # Convert the saved fragment to a PDF
        pdfkit.from_url('content.html', 'test.pdf',
                        configuration=config, options=options)

        # Return the PDF as a file download
        file = open('test.pdf', 'rb')
        response = FileResponse(file)
        response['Content-Type'] = 'application/pdf'
        response['Content-Disposition'] = 'attachment; filename="test.pdf"'
        return response

The core scraping code is the following few lines:

    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html5lib")
    body = soup.find_all(class_="content")[0]
    html = str(body)
    with open("content.html", 'wb') as f:
        f.write(html.encode(encoding="utf-8"))

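Here requests fetches the page at the given url, and a BeautifulSoup object is created from the response. The code then finds the first element on the page whose class is content and takes that element, together with everything inside it, as the scraped content. This content is written to content.html and saved for the PDF export that follows. Note that because the file is opened in binary mode ('wb'), the content must be written as bytes, not str; otherwise Python raises this error: a bytes-like object is required, not 'str'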
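The extraction and write steps can be tried without a running server. The sketch below rests on several assumptions that are not in the original: it parses an inline HTML string instead of a fetched page, it uses Python's built-in "html.parser" so that html5lib is not required for the demonstration, and the helper name extract_first is hypothetical. Unlike find_all(...)[0], which raises IndexError when nothing matches, find returns None, so the missing case can be handled explicitly.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Inline sample page (illustrative only; the article fetches a real URL).
SAMPLE_HTML = """
<html><body>
  <div class="nav">menu</div>
  <div class="content"><p>Hello, PDF</p></div>
</body></html>
"""


def extract_first(html, css_class):
    """Return the first element with the given class as a string, or None."""
    soup = BeautifulSoup(html, "html.parser")
    element = soup.find(class_=css_class)  # None when nothing matches
    return None if element is None else str(element)


fragment = extract_first(SAMPLE_HTML, "content")

# The file is opened in binary mode ('wb'), so the str is encoded to bytes first.
path = os.path.join(tempfile.mkdtemp(), "content.html")
with open(path, "wb") as f:
    f.write(fragment.encode("utf-8"))
```

An alternative is to open the file in text mode, open(path, "w", encoding="utf-8"), in which case the str can be written directly without calling encode.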

4. The above is for study and reference only. Thank you!
