
Speed test: converting PDF pages to images with PyMuPDF

Date: 2019-08-06 00:13:41


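The goal is to read a PDF file and convert each rendered page into an np.array that PaddleOCR can consume. This post measures how fast the different conversion routes are.

Required packages: paddleocr, pyinstrument, pymupdf, memory_profiler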

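Update: I got a reply from the PyMuPDF developer (GitHub link) pointing to a more efficient method. pix.samples_mv gives direct access to the pixmap's memory ("which is a memoryview to that internal area (without copying)"). The speedup for the buffer-to-array step is substantial: it drops from the millisecond range to the microsecond range, roughly 3000 times faster (15.7 ms versus 5.22 µs in the results below).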
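To make the idea concrete, here is a minimal sketch of wrapping a memoryview as a NumPy array without copying. It does not open a real PDF: FakePixmap is a hypothetical stand-in that only mimics the attributes used here (samples_mv, width, height, n), and pixmap_to_array is a helper name I made up. Using pix.n instead of a hardcoded 3 is my own variation, so that a pixmap with an alpha channel would still reshape correctly.

```python
import numpy as np


def pixmap_to_array(pix):
    """Wrap a pixmap-like object's sample buffer as a (height, width, n) uint8 array.

    np.frombuffer does not copy: the array is a view onto the same memory.
    """
    flat = np.frombuffer(pix.samples_mv, dtype=np.uint8)
    return flat.reshape(pix.height, pix.width, pix.n)


class FakePixmap:
    """Hypothetical stand-in for a pixmap; NOT part of PyMuPDF."""

    def __init__(self, width, height, n=3):
        self.width, self.height, self.n = width, height, n
        self._buf = bytearray(width * height * n)
        self.samples_mv = memoryview(self._buf)


pix = FakePixmap(width=4, height=2)
arr = pixmap_to_array(pix)

# Writing to the underlying buffer is visible through the array,
# which shows that no copy was made.
pix._buf[0] = 255
print(arr.shape, arr[0, 0, 0])
```

One consequence of sharing memory, as far as I understand it, is that the array is only valid while the pixmap is alive, so the pixmaps should be kept around (as the pixs list does in the test below) or the array copied with arr.copy() before they are released.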

Here are the test results:

images = []
pixs = [page.get_pixmap(dpi=300) for page in doc]

%timeit [np.frombuffer(pix.samples_mv, dtype=np.uint8).reshape((pix.height, pix.width, 3)) for pix in pixs]
5.22 µs ± 188 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

%timeit [np.frombuffer(buffer=pix.samples, dtype=np.uint8).reshape((pix.height, pix.width, 3)) for pix in pixs]
15.7 ms ± 77.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit [np.array(Image.frombytes("RGB", (pix.width, pix.height), pix.samples), dtype=np.uint8) for pix in pixs]
105 ms ± 4.08 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit [np.array(Image.open(io.BytesIO(pix.pil_tobytes("JPEG")))) for pix in pixs]
179 ms ± 3.03 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit [cv2.imdecode(np.frombuffer(bytearray(pix.pil_tobytes("JPEG")), dtype=np.uint8), cv2.IMREAD_COLOR) for pix in pixs]
182 ms ± 2.07 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit [cv2.imdecode(np.frombuffer(pix.tobytes(), dtype='uint8'),cv2.IMREAD_COLOR) for pix in pixs]
394 ms ± 1.36 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
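Outside IPython, a comparison like the first two measurements can be approximated with the standard timeit module. The sketch below is not the original benchmark. It uses plain bytearray buffers of a made-up size as stand-ins for pixmaps, with bytes(b) playing the role of a copying accessor (my understanding of what pix.samples does) and memoryview(b) playing the role of pix.samples_mv. The names via_view, via_copy and best_of are mine.

```python
import timeit

import numpy as np

# Assumed stand-in dimensions; a real 300 dpi page would be considerably larger.
H, W, N = 1000, 1000, 3
buffers = [bytearray(H * W * N) for _ in range(3)]


def via_view():
    """Wrap each buffer through a memoryview (no copy)."""
    return [np.frombuffer(memoryview(b), dtype=np.uint8).reshape(H, W, N) for b in buffers]


def via_copy():
    """Copy each buffer to bytes first, then wrap the copy."""
    return [np.frombuffer(bytes(b), dtype=np.uint8).reshape(H, W, N) for b in buffers]


def best_of(func, number=5, repeat=3):
    """Best per-call time in seconds over several repeats."""
    return min(timeit.repeat(func, number=number, repeat=repeat)) / number


t_view = best_of(via_view)
t_copy = best_of(via_copy)
print(f"view: {t_view * 1e6:.1f} µs per call, copy: {t_copy * 1e6:.1f} µs per call")
```

Taking the minimum over repeats is a common way to reduce noise from other processes. The absolute numbers will differ from the ones above, since no rendering and no real pixmap is involved.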

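The test machine is an i7-8700 and the PDF is a randomly chosen 110 KB file. Larger files will be correspondingly slower. Note that if the dpi passed to get_pixmap is set too high, the generated images become very large.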
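How large is "very large"? A rough estimate is easy to compute, because a PDF page is measured in points (72 per inch) and an uncompressed RGB image needs 3 bytes per pixel. The sketch below assumes an A4 page of 595 x 842 points, which is only a guess since I did not record the page size of the test PDF, and the rounding may differ slightly from what PyMuPDF actually does. estimate_raw_size is a name I made up.

```python
POINTS_PER_INCH = 72  # PDF page dimensions are given in points


def estimate_raw_size(width_pt, height_pt, dpi, channels=3):
    """Approximate pixel dimensions and uncompressed byte size of a rendered page."""
    width_px = round(width_pt * dpi / POINTS_PER_INCH)
    height_px = round(height_pt * dpi / POINTS_PER_INCH)
    return width_px, height_px, width_px * height_px * channels


# Assumed A4 page (595 x 842 pt) at a few resolutions.
for dpi in (72, 150, 300, 600):
    w, h, nbytes = estimate_raw_size(595, 842, dpi)
    print(f"dpi={dpi}: {w} x {h} px, about {nbytes / 2**20:.1f} MiB of raw RGB")
```

Under these assumptions a 300 dpi page comes out at about 24.9 MiB, which is in the same range as the roughly 25 MiB per page that memory_profiler reports further down. Doubling the dpi quadruples the size.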

Memory use also went down somewhat:

%load_ext memory_profiler

%memit images = [np.frombuffer(pix.samples_mv, dtype=np.uint8).reshape((pix.height, pix.width, 3)) for pix in pixs]
peak memory: 346.82 MiB, increment: 0.07 MiB

%memit [np.frombuffer(buffer=pix.samples, dtype=np.uint8).reshape((pix.height, pix.width, 3)) for pix in pixs]
peak memory: 396.67 MiB, increment: 49.83 MiB
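The difference comes from copying versus viewing. This can be illustrated with nothing but the standard library, using tracemalloc to count the bytes allocated when a buffer is copied to bytes and when it is wrapped in a memoryview. The 10 MiB bytearray is an arbitrary stand-in for one page of pixel data, not a real pixmap, and traced_cost is a helper name of my own.

```python
import tracemalloc

SIZE = 10 * 1024 * 1024  # arbitrary 10 MiB stand-in for one page's pixel data
pixel_buffer = bytearray(SIZE)


def traced_cost(make):
    """Create an object and return it with the bytes newly allocated for it."""
    before, _ = tracemalloc.get_traced_memory()
    obj = make()
    after, _ = tracemalloc.get_traced_memory()
    return obj, after - before


tracemalloc.start()
copied, copy_cost = traced_cost(lambda: bytes(pixel_buffer))       # full copy
viewed, view_cost = traced_cost(lambda: memoryview(pixel_buffer))  # view only
tracemalloc.stop()

print(f"bytes copy: {copy_cost / 2**20:.2f} MiB, memoryview: {view_cost} bytes")
```

The copy costs about as much as the buffer itself, while the view only costs a small wrapper object.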

The rest of this post can be skipped. It was written earlier and the testing was not rigorous.

import io

import cv2
import fitz
import numpy as np
from PIL import Image
from paddleocr import PaddleOCR
from pyinstrument import Profiler
from memory_profiler import profile

pdf_file = "./测试文档.pdf"
doc = fitz.open(pdf_file)
ocr = PaddleOCR(use_angle_cls=True, use_gpu=False, lang="ch")


# Decorator that measures the running time of a test function
def test(func):
    def _call():
        profiler = Profiler()
        profiler.start()
        func()
        profiler.stop()
        print(profiler.output_text(unicode=True, color=True))
    return _call


@test
# @profile
def test1():
    images = []
    for page in doc:
        pix = page.get_pixmap(dpi=300)  # dpi=300 turned out to be a suitable size in testing; higher values make the image too large
        image = Image.frombytes("RGB", (pix.width, pix.height), pix.samples)
        image = np.array(image, dtype=np.uint8)
        images.append(image)
    # [print(ocr.ocr(image)) for image in images]  # confirm that the images can be read by ocr


@test
# @profile
def test2():
    images = []
    for page in doc:
        pix = page.get_pixmap(dpi=300)
        image = np.frombuffer(buffer=pix.samples, dtype=np.uint8).reshape((pix.height, pix.width, 3))
        images.append(image)
    # [print(ocr.ocr(image)) for image in images]


@test
# @profile
def test3():
    images = []
    for page in doc:
        pix = page.get_pixmap(dpi=300)
        image = cv2.imdecode(np.frombuffer(bytearray(pix.pil_tobytes("JPEG")), dtype=np.uint8), cv2.IMREAD_COLOR)
        images.append(image)
    # [print(ocr.ocr(image)) for image in images]


@test
# @profile
def test4():
    images = []
    for page in doc:
        pix = page.get_pixmap(dpi=300)
        image = cv2.imdecode(np.frombuffer(pix.tobytes(), dtype='uint8'), cv2.IMREAD_COLOR)
        images.append(image)
    # [print(ocr.ocr(image)) for image in images]


@test
# @profile
def test5():
    images = []
    for page in doc:
        pix = page.get_pixmap(dpi=300)
        image = np.array(Image.open(io.BytesIO(pix.pil_tobytes("JPEG"))))
        images.append(image)
    # [print(ocr.ocr(image)) for image in images]


@test
# @profile
def test2_Comprehensions():
    images = []
    pixs = [page.get_pixmap(dpi=300) for page in doc]
    # List comprehensions can improve efficiency
    images = [np.frombuffer(buffer=pix.samples, dtype=np.uint8).reshape((pix.height, pix.width, 3)) for pix in pixs]


test1()
test2()
test3()
test4()
test5()
test2_Comprehensions()

Timing results:

Recorded: 11:10:09  Samples: 112
Duration: 0.391  CPU time: 0.391
v4.1.1
Program: I:/ocr_test/性能测试.py

0.381 _call  性能测试.py:17
├─ 0.374 test1  性能测试.py:27
│  ├─ 0.128 get_pixmap  fitz\utils.py:812
│  │     [7 frames hidden]  fitz, <built-in>
│  │        0.125 DisplayList_get_pixmap  <built-in>:0
│  ├─ 0.126 __array__  PIL\Image.py:705
│  │     [17 frames hidden]  PIL, <built-in>
│  ├─ 0.056 frombytes  PIL\Image.py:2788
│  │     [7 frames hidden]  PIL, <built-in>
│  ├─ 0.038 array  <built-in>:0
│  │     [2 frames hidden]  <built-in>
│  ├─ 0.023 samples  fitz\fitz.py:7468
│  │     [2 frames hidden]  fitz
│  └─ 0.005 [self]
└─ 0.007 [self]

Recorded: 11:10:10  Samples: 15
Duration: 0.141  CPU time: 0.156
v4.1.1
Program: I:/ocr_test/性能测试.py

0.141 _call  性能测试.py:17
├─ 0.136 test2  性能测试.py:38
│  ├─ 0.109 get_pixmap  fitz\utils.py:812
│  │     [4 frames hidden]  fitz, <built-in>
│  │        0.109 DisplayList_get_pixmap  <built-in>:0
│  ├─ 0.024 samples  fitz\fitz.py:7468
│  │     [2 frames hidden]  fitz
│  └─ 0.003 __del__  fitz\fitz.py:7494
│        [3 frames hidden]  fitz, <built-in>
└─ 0.004 [self]

Recorded: 11:10:10  Samples: 56
Duration: 0.611  CPU time: 0.609
v4.1.1
Program: I:/ocr_test/性能测试.py

0.607 _call  性能测试.py:17
└─ 0.607 test3  性能测试.py:48
   ├─ 0.296 imdecode  <built-in>:0
   │     [2 frames hidden]  <built-in>
   ├─ 0.196 pil_tobytes  fitz\fitz.py:7279
   │     [33 frames hidden]  fitz, PIL, <built-in>, ntpath, generi...
   └─ 0.113 get_pixmap  fitz\utils.py:812
         [4 frames hidden]  fitz, <built-in>

Recorded: 11:10:10  Samples: 21
Duration: 1.545  CPU time: 1.531
v4.1.1
性能测试.py

1.540 _call  性能测试.py:17
└─ 1.535 test4  性能测试.py:58
   ├─ 1.120 tobytes  fitz\fitz.py:7146
   │     [4 frames hidden]  fitz, <built-in>
   │        1.120 Pixmap__tobytes  <built-in>:0
   ├─ 0.306 imdecode  <built-in>:0
   │     [2 frames hidden]  <built-in>
   └─ 0.109 get_pixmap  fitz\utils.py:812
         [4 frames hidden]  fitz, <built-in>

Recorded: 11:10:12  Samples: 146
Duration: 0.561  CPU time: 0.562
v4.1.1
Program: I:/ocr_test/性能测试.py

0.567 _call  性能测试.py:17
└─ 0.567 test5  性能测试.py:68
   ├─ 0.232 __array__  PIL\Image.py:705
   │     [13 frames hidden]  PIL, <built-in>
   ├─ 0.185 pil_tobytes  fitz\fitz.py:7279
   │     [24 frames hidden]  fitz, PIL, <built-in>, ntpath, generi...
   ├─ 0.111 get_pixmap  fitz\utils.py:812
   │     [4 frames hidden]  fitz, <built-in>
   ├─ 0.032 array  <built-in>:0
   │     [2 frames hidden]  <built-in>
   └─ 0.006 [self]

Recorded: 11:10:12  Samples: 16
Duration: 0.136  CPU time: 0.141
v4.1.1
Program: I:/ocr_test/性能测试.py

0.143 _call  性能测试.py:17
├─ 0.131 test2_Comprehensions  性能测试.py:78
│  ├─ 0.109 <listcomp>  性能测试.py:82
│  │  └─ 0.109 get_pixmap  fitz\utils.py:812
│  │        [4 frames hidden]  fitz, <built-in>
│  │           0.109 DisplayList_get_pixmap  <built-in>:0
│  └─ 0.023 <listcomp>  性能测试.py:83
│     └─ 0.023 samples  fitz\fitz.py:7468
│           [2 frames hidden]  fitz
├─ 0.006 __del__  fitz\fitz.py:7494
│     [3 frames hidden]  fitz, <built-in>
└─ 0.006 [self]

Memory results:

Filename: I:\ocr_test\性能测试.py

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    30    290.2 MiB    290.2 MiB           1   @profile
    31                                         def test1():
    32    290.2 MiB      0.0 MiB           1       images = []
    33    449.1 MiB      0.1 MiB           6       for page in doc:
    34    424.0 MiB     29.5 MiB           5           pix = page.get_pixmap(dpi=300)  # dpi=300 turned out to be a suitable size in testing; higher values make the image too large
    35    457.2 MiB    166.1 MiB           5           image = Image.frombytes("RGB", (pix.width, pix.height), pix.samples)
    36    449.1 MiB    -36.8 MiB           5           image = np.array(image, dtype=np.uint8)
    37    449.1 MiB      0.0 MiB           5           images.append(image)

Filename: I:\ocr_test\性能测试.py

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    41    299.7 MiB    299.7 MiB           1   @profile
    42                                         def test2():
    43    299.7 MiB      0.0 MiB           1       images = []
    44    449.3 MiB      0.0 MiB           6       for page in doc:
    45    424.4 MiB     25.1 MiB           5           pix = page.get_pixmap(dpi=300)
    46    449.3 MiB    124.5 MiB           5           image = np.frombuffer(buffer=pix.samples, dtype=np.uint8).reshape((pix.height, pix.width, 3))
    47    449.3 MiB      0.0 MiB           5           images.append(image)

Filename: I:\ocr_test\性能测试.py

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    51    299.9 MiB    299.9 MiB           1   @profile
    52                                         def test3():
    53    299.9 MiB      0.0 MiB           1       images = []
    54    716.0 MiB      0.0 MiB           6       for page in doc:
    55    657.8 MiB    124.6 MiB           5           pix = page.get_pixmap(dpi=300)
    56    716.0 MiB    291.5 MiB           5           image = cv2.imdecode(np.frombuffer(bytearray(pix.pil_tobytes("JPEG")), dtype=np.uint8), cv2.IMREAD_COLOR)
    57    716.0 MiB      0.0 MiB           5           images.append(image)

Filename: I:\ocr_test\性能测试.py

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    61    300.9 MiB    300.9 MiB           1   @profile
    62                                         def test4():
    63    300.9 MiB      0.0 MiB           1       images = []
    64    450.3 MiB      0.0 MiB           6       for page in doc:
    65    425.6 MiB     24.8 MiB           5           pix = page.get_pixmap(dpi=300)
    66    450.3 MiB    124.6 MiB           5           image = cv2.imdecode(np.frombuffer(pix.tobytes(), dtype='uint8'),cv2.IMREAD_COLOR)
    67    450.3 MiB      0.0 MiB           5           images.append(image)

Filename: I:\ocr_test\性能测试.py

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    71    300.9 MiB    300.9 MiB           1   @profile
    72                                         def test5():
    73    300.9 MiB      0.0 MiB           1       images = []
    74    716.3 MiB      0.0 MiB           6       for page in doc:
    75    657.1 MiB    124.7 MiB           5           pix = page.get_pixmap(dpi=300)
    76    716.3 MiB    290.7 MiB           5           image = np.array(Image.open(io.BytesIO(pix.pil_tobytes("JPEG"))))
    77    716.3 MiB      0.0 MiB           5           images.append(image)

Filename: I:\ocr_test\性能测试.py

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    81    301.2 MiB    301.2 MiB           1   @profile
    82                                         def test2_Comprehensions():
    83    301.2 MiB      0.0 MiB           1       imaegs = []
    84    425.9 MiB    124.7 MiB           6       def a(pix):
    85    450.8 MiB    124.5 MiB           5           return np.frombuffer(buffer=pix.samples, dtype=np.uint8).reshape((pix.height, pix.width, 3))
    86    425.9 MiB   -124.5 MiB           8       images = [a(page.get_pixmap(dpi=300)) for page in doc]

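As the results show, test4 is the slowest (about 1.5 s), and test3 and test5 use the most memory (about 716 MiB at peak). test2 comes out best overall: it is the fastest (about 0.14 s) and its memory use is among the lowest. Using a list comprehension should in theory be a little faster and use less memory, but the speed gain is limited (0.136 s versus 0.141 s here).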

During testing I also found that Image.open may raise PIL.Image.DecompressionBombError if the image is too large.
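As far as I know, Pillow makes this decision from the pixel count alone: above Image.MAX_IMAGE_PIXELS it emits a DecompressionBombWarning, and above twice that value it raises DecompressionBombError. The sketch below reproduces that rule without importing Pillow, so the limit is hardcoded to what I believe is the default and may not match every Pillow version. The A4 page size (595 x 842 points) is again an assumption, and pixel_count and bomb_check are names I made up.

```python
# Believed default of PIL.Image.MAX_IMAGE_PIXELS, hardcoded so the sketch has no dependencies.
MAX_IMAGE_PIXELS = 1024 * 1024 * 1024 // 4 // 3


def pixel_count(width_pt, height_pt, dpi):
    """Approximate number of pixels for a page rendered at the given dpi (72 pt per inch)."""
    return round(width_pt * dpi / 72) * round(height_pt * dpi / 72)


def bomb_check(pixels, limit=MAX_IMAGE_PIXELS):
    """Mimic the warning/error thresholds described above."""
    if pixels > 2 * limit:
        return "error"
    if pixels > limit:
        return "warning"
    return "ok"


# Assumed A4 page at increasing resolutions.
for dpi in (300, 1200, 1500):
    pixels = pixel_count(595, 842, dpi)
    print(f"dpi={dpi}: {pixels:,} pixels -> {bomb_check(pixels)}")
```

This is another reason to keep the dpi moderate, or to skip the Image.open round trip altogether and build the array straight from the pixmap's buffer.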
