Extracting text, tables, and images from PDF files with Python: three PDF libraries (pypdf, pdfplumber, PyMuPDF)
This post uses Python 3.10.
The following sample files serve as the common benchmark for the PDF extraction tests:
https://css4.pub/2015/usenix/example.pdf
https://css4.pub/2015/textbook/somatosensory.pdf
Create a Pure Python project in PyCharm.
```python
from pypdf import PdfReader

reader = PdfReader('example.pdf')
print("Page count:" + str(len(reader.pages)))

# Get the plain text of the first PDF page
page = reader.pages[0]
print("===================Page1.Context below===================")
print(page.extract_text())

# Walk every page: print its text, then save any images it exposes
for idxPage in range(len(reader.pages)):
    page = reader.pages[idxPage]
    print(f"====================Page.{idxPage + 1}Context below================")
    print(page.extract_text())
    for imgItem in page.images:
        # imgItem.name is the file name pypdf assigns; imgItem.data holds the raw bytes
        with open(imgItem.name, 'wb') as imgFile:
            imgFile.write(imgItem.data)
```
https://css4.pub/2015/usenix/Floppy_icon.svg
https://css4.pub/2015/usenix/Computer_keyboard_Danish_layout.svg
After switching to somatosensory.pdf, however, the JPG images were extracted successfully.
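One thing to watch when saving images this way: the file name comes straight from `imgItem.name`, and the same name may appear on more than one page, in which case a later image would overwrite an earlier one. The sketch below is one possible way around that, not part of the original test. The helper names `unique_image_name` and `save_page_images` and the `page_<n>_img_<m>_` prefix are assumptions made for illustration; the only pypdf attributes relied on are `reader.pages`, `page.images`, `.name`, and `.data`, as used in the code above.

```python
import os


def unique_image_name(page_number, image_index, original_name):
    """Build a file name that stays unique across pages.

    The page_<n>_img_<m>_ prefix is a naming convention assumed for this sketch.
    """
    return f"page_{page_number}_img_{image_index}_{os.path.basename(original_name)}"


def save_page_images(reader, output_folder):
    """Save every image exposed by reader.pages[*].images and return the paths."""
    os.makedirs(output_folder, exist_ok=True)
    saved = []
    for page_number, page in enumerate(reader.pages, start=1):
        for image_index, img in enumerate(page.images, start=1):
            file_name = unique_image_name(page_number, image_index, img.name)
            path = os.path.join(output_folder, file_name)
            with open(path, "wb") as img_file:
                img_file.write(img.data)  # raw bytes as provided by pypdf
            saved.append(path)
    return saved
```

It would be called as `save_page_images(PdfReader('somatosensory.pdf'), 'pypdf_images')`, where the folder name is again only an example.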
Extracting each page's plain text and table data with pdfplumber
```python
import pdfplumber

with pdfplumber.open('example.pdf') as pdf:
    for page in pdf.pages:
        #print(f"====================Page.{page.page_number}Contexts below================")
        #print(page.extract_text())
        print(f"====================Page.{page.page_number}Tables below================")
        # extract_table() returns one table on the page, or None when it finds none
        print(page.extract_table())
```
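pdfplumber also supports extracting table structures from a PDF, although the results in this test were less than ideal.
In example.pdf, the first page happens to contain no table, while the second and third pages do.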
In this test, only the table extracted from the third page came out reasonably complete.
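When a table comes back incomplete, one thing that may be worth trying is pdfplumber's `table_settings` argument. By default the library looks for drawn lines; the `"text"` strategies instead infer cell borders from how the words line up. The sketch below was not part of the original test and is not guaranteed to improve the result for example.pdf. The names `TEXT_TABLE_SETTINGS`, `extract_tables_both_ways`, and `filled_ratio` are hypothetical helpers introduced here.

```python
# Text-based strategies infer cell borders from word alignment instead of drawn lines.
TEXT_TABLE_SETTINGS = {
    "vertical_strategy": "text",
    "horizontal_strategy": "text",
}


def extract_tables_both_ways(page):
    """Return (tables from the default settings, tables from the text strategies)."""
    by_lines = page.extract_tables()
    by_text = page.extract_tables(table_settings=TEXT_TABLE_SETTINGS)
    return by_lines, by_text


def filled_ratio(table):
    """Fraction of cells holding non-empty text; a rough completeness signal."""
    cells = [cell for row in table for cell in row]
    if not cells:
        return 0.0
    filled = sum(1 for cell in cells if cell is not None and cell.strip())
    return filled / len(cells)
```

Passing a page such as `pdf.pages[1]` to `extract_tables_both_ways` and comparing `filled_ratio` for each returned table gives a quick, if crude, way to see which strategy recovers more cells.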
```python
import pdfplumber

all_tables = []
with pdfplumber.open('example.pdf') as pdf:
    for page in pdf.pages:
        print(f"====================Page.{page.page_number}Contexts below================")
        print(page.extract_text())

        print(f"====================Page.{page.page_number}words below================")
        print(page.extract_words())

        print(f"====================Page.{page.page_number}Tables below================")
        #print(page.extract_table())
        # extract_tables() returns every table on the page as a list of rows
        tables = page.extract_tables()
        all_tables.extend(tables)

# Print each collected table row by row
print("Print tables each row")
for idTable, table in enumerate(all_tables, start=1):
    print(f"=======Table.{idTable}=======")
    for row in table:
        print(row)
```
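Printing rows is fine for a quick look, but the collected tables are plain lists of rows, so they can also be written out with the standard `csv` module. The following is a minimal sketch under a few assumptions: the helper names `clean_cell` and `tables_to_csv`, the `table_<n>.csv` naming, and the `utf-8-sig` encoding are all choices made here for illustration. Empty cells, which pdfplumber reports as `None`, are written as empty strings.

```python
import csv
import os


def clean_cell(cell):
    """Turn None into an empty string and flatten line breaks inside a cell."""
    if cell is None:
        return ""
    return " ".join(str(cell).split())


def tables_to_csv(tables, output_folder):
    """Write each table to <output_folder>/table_<n>.csv and return the paths."""
    os.makedirs(output_folder, exist_ok=True)
    paths = []
    for table_id, table in enumerate(tables, start=1):
        path = os.path.join(output_folder, f"table_{table_id}.csv")
        # utf-8-sig is an assumption so that spreadsheet apps detect the encoding.
        with open(path, "w", newline="", encoding="utf-8-sig") as csv_file:
            writer = csv.writer(csv_file)
            for row in table:
                writer.writerow([clean_cell(cell) for cell in row])
        paths.append(path)
    return paths
```

With the script above, `tables_to_csv(all_tables, 'tables')` would produce one CSV file per extracted table.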
Installing and testing the PyMuPDF package (the recommended option for extracting images from a PDF)
```python
import fitz  # PyMuPDF
from PIL import Image
import io
import os

def extract_text_from_pdf(pdf_path):
    """
    Extract the text content of every page in the PDF.
    """
    pdf_document = fitz.open(pdf_path)
    text_content = []

    for page_number in range(len(pdf_document)):
        page = pdf_document[page_number]
        text = page.get_text()  # extract the text
        text_content.append((page_number + 1, text))

    return text_content

def extract_images_from_pdf(pdf_path, output_folder="extracted_images"):
    """
    Extract the images from every page of the PDF and save them to the given folder.
    """
    # Create the output folder
    if not os.path.exists(output_folder):
        os.makedirs(output_folder)

    pdf_document = fitz.open(pdf_path)
    images = []

    for page_number in range(len(pdf_document)):
        page = pdf_document[page_number]

        # Extract the images referenced on this page
        for img_index, img in enumerate(page.get_images(full=True)):
            xref = img[0]
            base_image = pdf_document.extract_image(xref)
            image_bytes = base_image["image"]
            image_ext = base_image["ext"]  # image file extension
            image = Image.open(io.BytesIO(image_bytes))

            # Save the image
            image_filename = f"{output_folder}/page_{page_number + 1}_image_{img_index + 1}.{image_ext}"
            image.save(image_filename)
            images.append(image_filename)

    return images

pdf_path = "somatosensory.pdf"  # replace with the path to your PDF file
text_output = extract_text_from_pdf(pdf_path)
print("Extracted text content:")
for page_num, text in text_output:
    print(f"Page {page_num} content:\n{text}\n{'-' * 50}")

print("Extracting images...")
images = extract_images_from_pdf(pdf_path)
print(f"Extracted images saved to: {images}")
```
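The script above routes every image through Pillow, which re-encodes it on save and only works for formats Pillow can open. Since `extract_image` already returns the original bytes together with a file extension, a lighter variant is to write those bytes directly. The sketch below is an alternative under stated assumptions rather than the approach tested in this post: `save_raw_images` and the `xref_<n>` file naming are hypothetical, and it skips an xref it has already seen because the same image object can be referenced from several pages.

```python
import os


def save_raw_images(pdf_document, output_folder="extracted_images"):
    """Write each embedded image once, using the bytes and extension PyMuPDF reports."""
    os.makedirs(output_folder, exist_ok=True)
    seen_xrefs = set()
    saved = []
    for page_index in range(len(pdf_document)):
        page = pdf_document[page_index]
        for img in page.get_images(full=True):
            xref = img[0]  # first item of each entry is the image's xref number
            if xref in seen_xrefs:
                continue  # already written while handling an earlier page
            seen_xrefs.add(xref)
            base_image = pdf_document.extract_image(xref)
            path = os.path.join(output_folder, f"xref_{xref}.{base_image['ext']}")
            with open(path, "wb") as image_file:
                image_file.write(base_image["image"])  # original bytes, no re-encoding
            saved.append(path)
    return saved
```

It takes an already opened document, for example `save_raw_images(fitz.open('somatosensory.pdf'))`.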