Reading Paragraphs or Tables from Word Documents with Python

Date: 2023-09-02 18:29:21

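Parsing Word documents with Python

1. Install and import the dependencies
2. Convert Word .doc files to .docx
3. Parse the paragraphs and tables of a Word document
4. Faster table reading: collect the table cells into a list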

1. Install and import the dependencies

Install the dependencies

    python -m pip install pypiwin32
    pip install python-docx

Imports

    from win32com import client as wc
    from docx import Document

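Alternatively, read the Word document directly from a byte stream without saving it to a file:

    from io import BytesIO
    import requests
    from docx import Document

    # url is the address of the .docx file to read
    resp = requests.get(url)
    doc = Document(BytesIO(resp.content))
    # xlrd.open_workbook(file_contents=resp.content)  # the equivalent for Excel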

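2. Convert Word .doc files to .docx

python-docx can only read .docx files, so every .doc file must first be converted to .docx before it can be parsed. The function download_word below downloads the file and performs the conversion through Word itself, which means it needs Windows with Microsoft Word installed: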

    # encoding: utf-8
    import requests
    from win32com import client as wc
    from docx import Document


    def download_word(file_url):
        """
        Download a Word file and return the path of a .docx copy of it.

        Document object, represents a Word document: doc = Document("xxx.docx")
        doc.paragraphs: the paragraphs of the document
        doc.tables: the tables of the document
        """
        # Download the Word file
        resp = requests.get(file_url, timeout=10, headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36"})
        file_path_name = f"E:\\{file_url.split('/')[-1]}"
        with open(file_path_name, "wb") as f:
            f.write(resp.content)
        if file_url.endswith("docx"):
            # Already a .docx file, no conversion needed
            return file_path_name

        # Convert the .doc file to .docx (12 is Word's wdFormatXMLDocument)
        w = wc.Dispatch('Word.Application')
        doc_object = w.Documents.Open(file_path_name)
        doc_object.SaveAs(f"{file_path_name}x", 12)
        doc_object.Close()
        w.Quit()
        return f"{file_path_name}x"
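download_word takes the local file name from file_url.split('/')[-1] and decides whether to convert by testing whether the URL ends with "docx". Both steps misfire when the URL carries a query string or an upper-case extension. The following is a minimal sketch of a more tolerant check using only the standard library; the helper names and the example.com URLs are hypothetical, not part of the original script.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def local_name_from_url(file_url: str) -> str:
    """Return the last path segment of the URL, ignoring query string and fragment."""
    return PurePosixPath(urlparse(file_url).path).name


def needs_conversion(file_url: str) -> bool:
    """True when the URL path ends in .doc (legacy format), False otherwise."""
    suffix = PurePosixPath(urlparse(file_url).path).suffix.lower()
    return suffix == ".doc"


if __name__ == "__main__":
    sample = "http://example.com/files/report.doc?token=abc"  # hypothetical URL
    print(local_name_from_url(sample), needs_conversion(sample))
```

The check looks only at the URL, so a server that serves a .doc file under a different name would still need to be handled separately.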

3. Parse the paragraphs and tables of a Word document

    # encoding: utf-8
    import requests
    from win32com import client as wc
    from docx import Document


    def download_word(file_url):
        """
        Download a Word file and return the path of a .docx copy of it.

        Document object, represents a Word document: doc = Document("xxx.docx")
        doc.paragraphs: the paragraphs of the document
        doc.tables: the tables of the document
        """
        # Download the Word file
        resp = requests.get(file_url, timeout=10, headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36"})
        file_path_name = f"E:\\{file_url.split('/')[-1]}"
        with open(file_path_name, "wb") as f:
            f.write(resp.content)
        if file_url.endswith("docx"):
            # Already a .docx file, no conversion needed
            return file_path_name

        # Convert the .doc file to .docx (12 is Word's wdFormatXMLDocument)
        w = wc.Dispatch('Word.Application')
        doc_object = w.Documents.Open(file_path_name)
        doc_object.SaveAs(f"{file_path_name}x", 12)
        doc_object.Close()
        w.Quit()
        return f"{file_path_name}x"


    # A .doc document that consists of text
    file_name = download_word("http://hd.kw./vipchat/kw/k/n10344235.doc")
    doc = Document(file_name)
    for p in doc.paragraphs:
        print(p.text)

    # A .doc document that consists of tables
    file_name = download_word("http://www./kjb/tzgg/12/27bcf546d8364243987fd54370611e3e/files/b19bd297367a4438907d5b724f1ebf62.doc")
    doc = Document(file_name)
    for table in doc.tables:
        for row in table.rows:
            line = [cell.text for cell in row.cells]
            print(line)
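The table loop produces each row as a list of cell strings. When the first row of a table is a header row, which is an assumption about the document and not something python-docx knows, those lists can be turned into dictionaries keyed by column name. A small sketch, with rows_to_records as a hypothetical helper:

```python
def rows_to_records(rows: list) -> list:
    """Turn [[header...], [values...], ...] into a list of dicts keyed by header."""
    if not rows:
        return []
    header = [name.strip() for name in rows[0]]
    records = []
    for row in rows[1:]:
        # zip() stops at the shorter sequence, so a short row does not raise
        records.append(dict(zip(header, (value.strip() for value in row))))
    return records


if __name__ == "__main__":
    # Sample data standing in for the lists built from row.cells
    sample_rows = [["Name ", "Score"], ["Alice", " 90"], ["Bob", "85"]]
    print(rows_to_records(sample_rows))
```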

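4. Faster table reading: collect the table cells into a list

The following loop should be replaced, because reading the cells this way is extremely slow: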

    for row in table.rows:
        line = [cell.text.strip() for cell in row.cells]
        p_list.append(line)

Replace it with the following:

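    import logging
    from io import BytesIO

    from docx import Document


    def parse_word_table(file_name=None, word_content=None) -> list:
        """
        Parse the table content of a Word document.

        :param file_name: path of a .docx file
        :param word_content: raw bytes of a .docx file; takes precedence over file_name
        :return: list of rows, each row a list of cell texts
        """
        try:
            if word_content:
                doc = Document(BytesIO(word_content))
            else:
                doc = Document(file_name)
            p_list = []
            for table in doc.tables:
                # _cells and _column_count are private python-docx attributes:
                # fetch the flat cell list once per table instead of once per row
                table_cells = table._cells
                cols_count = table._column_count
                for i in range(len(table.rows)):
                    line = [table_cells[i * cols_count + num].text for num in range(cols_count)]
                    p_list.append(line)
                # for row in table.rows:
                #     line = [cell.text.strip() for cell in row.cells]
                #     p_list.append(line)
            return p_list
        except Exception as err:
            logging.exception(f"Warning! {file_name} parse word table fail {err}")
            return []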
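The replacement works because the cells are available as one flat list in row-major order, so the cell in row i and column num sits at index i*cols_count+num. A likely reason for the speed-up is that the flat list is built once per table, where the original loop asks for row.cells again on every row. The sketch below shows only the index arithmetic, on plain strings and independent of python-docx; chunk_rows is a hypothetical helper.

```python
def chunk_rows(flat_cells: list, cols_count: int) -> list:
    """Split a row-major flat list into rows of cols_count items each."""
    if cols_count <= 0:
        return []
    rows_count = len(flat_cells) // cols_count
    return [
        [flat_cells[i * cols_count + num] for num in range(cols_count)]
        for i in range(rows_count)
    ]


if __name__ == "__main__":
    flat = ["a1", "b1", "c1", "a2", "b2", "c2"]
    print(chunk_rows(flat, 3))  # [['a1', 'b1', 'c1'], ['a2', 'b2', 'c2']]
```

Since _cells and _column_count are private attributes, they may change between python-docx releases, so it is worth re-checking the function after an upgrade.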
