Reading and writing Word documents in Python: usage examples for the docx and docx2txt packages

Date: 2019-02-12 03:14:45

Introduction

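doc is Microsoft's proprietary file format. docx is the format used by Microsoft Office 2007 and later; it is a compressed file format based on the Office Open XML standard and usually takes up less space than the equivalent doc file. A docx file is essentially a ZIP archive, so you can rename a .docx file to .zip and unpack it. Inside, word/document.xml holds most of the document's content, and image files are stored under word/media.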
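Because a .docx file is an ordinary ZIP archive, the standard library is enough to look inside one. The following is a minimal sketch and not part of the original article; the helper name list_docx_parts is a placeholder chosen for illustration.

```python
import zipfile


def list_docx_parts(path):
    """Return (all entry names, media entry names, size of word/document.xml in bytes).

    `path` may be a file name or a file-like object, as accepted by zipfile.ZipFile.
    """
    with zipfile.ZipFile(path) as archive:
        names = archive.namelist()
        # Embedded images live under word/media/
        media = [name for name in names if name.startswith("word/media/")]
        # word/document.xml holds the main body of the document
        body_size = len(archive.read("word/document.xml"))
    return names, media, body_size
```

Calling list_docx_parts("demo.docx") on any existing .docx file (the file name here is only an example) returns the archive entries without renaming or unpacking anything on disk.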

The docx package

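python-docx does not support .doc files. The workaround is to convert the .doc file to .docx in code first, and then open the result with python-docx.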
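python-docx itself offers no conversion, so the .doc to .docx step needs an external tool. One common option is LibreOffice in headless mode. The sketch below assumes LibreOffice is installed and that its soffice executable is on PATH; the function names are illustrative and do not come from the original article.

```python
import subprocess
from pathlib import Path


def build_convert_command(doc_path, out_dir):
    """Build the LibreOffice command line that converts one .doc file to .docx."""
    return [
        "soffice",  # assumption: the LibreOffice executable is on PATH
        "--headless",
        "--convert-to", "docx",
        "--outdir", str(out_dir),
        str(doc_path),
    ]


def convert_doc_to_docx(doc_path, out_dir="."):
    """Run the conversion and return the expected path of the new .docx file."""
    subprocess.run(build_convert_command(doc_path, out_dir), check=True)
    # LibreOffice keeps the base name and swaps the extension
    return Path(out_dir) / (Path(doc_path).stem + ".docx")
```

The returned path can then be passed to Document(...). On Windows, driving Word itself through COM automation is another route, but that depends on Word being installed.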

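The python-docx module treats the paragraphs, text, fonts, and other parts of a Word document as objects, and you work with the document by manipulating those objects.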

Document object: a Word document
Paragraph object: one paragraph in the Word document
text attribute of a Paragraph object: the text content of that paragraph

Installation

pip install python-docx

Usage example 1: reading

from docx import Document


def readDocx(fileName):
    doc = Document(fileName)
    # Write the output as UTF-8. With the platform default codec the write can fail with:
    # UnicodeEncodeError: 'gbk' codec can't encode character '\xef' ...
    with open("a." + fileName + ".txt", "w", encoding="utf-8") as outFile:
        # for para in doc.paragraphs:
        #     print(para.text)

        # Index and content of every paragraph
        for i, para in enumerate(doc.paragraphs):
            outFile.write(str(i) + " " + para.text + "\n")

        # Tables
        for tb in doc.tables:
            # Rows
            for row in tb.rows:
                # Columns
                for cell in row.cells:
                    outFile.write(cell.text + "\t")
                    # The cell text can also be collected like this:
                    # text = ''
                    # for p in cell.paragraphs:
                    #     text += p.text
                    # print(text)
                outFile.write("\n")

Usage example 2: writing

from docx import Document
from docx.shared import Inches


def createDocx():
    document = Document()

    # Add a heading and set its level (range 0-9, default 1)
    document.add_heading("Title", 0)

    p = document.add_paragraph("a plain paragraph lalalal")
    # Append text to the end of the paragraph and set its style.
    # The runs are appended directly, with no separator added.
    p.add_run("bold").bold = True
    p.add_run(" test ")
    p.add_run("italic.").italic = True

    for i in range(10):
        document.add_heading("heading, level " + str(i), level=i)

    document.add_paragraph("intense quote", style="Intense Quote")

    # Add an unordered (bulleted) list
    document.add_paragraph("first item in unordered list", style="List Bullet")
    document.add_paragraph("second item in unordered list", style="List Bullet")

    # Add an ordered (numbered) list
    document.add_paragraph("first item in ordered list", style="List Number")
    document.add_paragraph("second item in ordered list", style="List Number")

    # Add a picture (test.PNG must exist in the working directory)
    document.add_picture("test.PNG", width=Inches(1.25))

    records = (
        (3, "101", "Spam"),
        (7, "422", "Eggs"),
        (4, "631", "Spam, spam, eggs, and spam"),
    )

    # Add a table with one row and three columns.
    # Possible values for the table style include:
    #   Normal Table
    #   Table Grid
    #   Light Shading, Light Shading Accent 1 to Light Shading Accent 6
    #   Light List, Light List Accent 1 to Light List Accent 6
    #   Light Grid, Light Grid Accent 1 to Light Grid Accent 6
    #   (there are many more; the rest are omitted)
    table = document.add_table(rows=1, cols=3, style="Light Shading Accent 1")

    # Get the list of cells in the first row and fill in the header text
    hdr_cells = table.rows[0].cells
    hdr_cells[0].text = "Qty"
    hdr_cells[1].text = "Id"
    hdr_cells[2].text = "Desc"

    for qty, item_id, desc in records:
        # add_row() appends a row; .cells is the list of cells in that row
        row_cells = table.add_row().cells
        row_cells[0].text = str(qty)
        row_cells[1].text = item_id
        row_cells[2].text = desc

    document.add_page_break()

    # Save the .docx document
    document.save("demo.docx")

The docx2txt package

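docx2txt is used here because python-docx, at least in the version used for this article, does not return the text inside hyperlinks when reading paragraph text. docx2txt extracts text directly from the document's XML, so hyperlink text is included in its output.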

import docx2txt


def read_docx(fileName):
    # process() returns the text of the whole document as a single string
    text = docx2txt.process(fileName)
    with open("b." + fileName + ".txt", "w", encoding="utf-8") as outFile:
        outFile.write(text)
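To illustrate why working at the XML level picks up hyperlink text, here is a minimal standard-library sketch. It is not docx2txt's actual implementation, and the function name extract_paragraph_texts is made up for this example. It simply collects every w:t text node under each w:p paragraph element in word/document.xml, including nodes nested inside w:hyperlink.

```python
import zipfile
import xml.etree.ElementTree as ET

# Main WordprocessingML namespace used by word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def extract_paragraph_texts(path):
    """Return the text of every paragraph, including text inside hyperlinks."""
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    texts = []
    for para in root.iter(W_NS + "p"):
        # iter() walks all descendants, so <w:t> nodes nested in
        # <w:hyperlink> elements are collected as well
        texts.append("".join(t.text or "" for t in para.iter(W_NS + "t")))
    return texts
```

This sketch ignores headers, footers, tabs, and line breaks, and it may count text twice for paragraphs nested inside text boxes, so for real use a maintained package such as docx2txt is the safer choice.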
