Extract Information from PDFs via Free Python Library
A free Python API that enables developers to extract information from PDF documents, convert PDFs into other formats, and perform automatic layout analysis.
PDFMiner is an easy-to-use open source Python library for processing PDF files without any other dependencies. PDFMiner.six is a community-maintained fork of the original PDFMiner library. The library provides powerful features for extracting information from PDF documents. It offers a command-line utility for non-programmers and an API for programmers. A PDF converter is also part of the library and helps users transform PDF files into other text formats such as HTML.
PDFMiner is a pure Python library that can extract all text that is rendered programmatically in a PDF file. It also extracts the corresponding location, font name and size, and writing direction (horizontal or vertical) of each text segment. It supports the PDF-1.7 specification and can extract content from password-protected PDF documents. The library includes several other important features, such as parsing, analyzing, and converting PDF documents, extracting content as HTML or hOCR, support for vertical writing scripts, RC4 and AES encryption support, table of contents extraction, tagged content extraction, automatic layout analysis, and so on.
Getting Started with PDFMiner
PDFMiner requires Python 3.6 or higher. You can install it with pip using the following command.
Install PDFMiner via pip
pip install pdfminer.six
You can also download the source from the GitHub repository and install it manually.
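To confirm that the installation worked, a short check like the one below can help. It is only a sketch: it assumes Python 3.8 or newer (for importlib.metadata) and that the package was installed from PyPI under the distribution name pdfminer.six. The helper name installed_version is made up for this example.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package="pdfminer.six"):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version()
    if found is None:
        print("pdfminer.six is not installed")
    else:
        print("pdfminer.six version:", found)
```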
Extract Text from PDF File via Python
The open source Pdfminer.six library lets software developers extract text from a PDF file with just a couple of lines of Python code. The library focuses on getting and analyzing text data, and it extracts the text of a page directly from the source code of the PDF. It also allows developers to extract images (JPG, JBIG2, bitmaps) from a PDF file, and to read the font name or size of each individual character. The following example shows how to extract the text from a PDF file and print it on the screen.
Open & Manipulate PDF Documents via Python
from pdfminer.high_level import extract_pages, extract_text
# Extract text from a PDF.
text = extract_text('example.pdf')
print(text)
# Extract an iterable of LTPage objects.
pages = extract_pages('example.pdf')
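The font name and size of each character, mentioned above, can be read by walking the layout objects that extract_pages yields. The following is a minimal sketch, assuming pdfminer.six is installed. The helper names iter_char_fonts and summarize_fonts are made up for this example, and the pdfminer imports sit inside the function so that the counting helper can be used on its own.

```python
from collections import Counter


def iter_char_fonts(pdf_path):
    """Yield (font name, size) for each character found by layout analysis."""
    # Imported here so a missing pdfminer.six only matters when a PDF is read.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTChar, LTTextContainer, LTTextLine

    for page_layout in extract_pages(pdf_path):
        for element in page_layout:
            if not isinstance(element, LTTextContainer):
                continue  # skip figures, curves, and other non-text objects
            for text_line in element:
                if not isinstance(text_line, LTTextLine):
                    continue
                for obj in text_line:
                    # Lines also hold LTAnno objects (inserted spaces and
                    # newlines), which carry no font information.
                    if isinstance(obj, LTChar):
                        yield obj.fontname, round(obj.size, 1)


def summarize_fonts(char_fonts):
    """Count how many characters use each (font name, size) pair."""
    return Counter(char_fonts)


if __name__ == "__main__":
    # 'example.pdf' is a placeholder path.
    usage = summarize_fonts(iter_char_fonts("example.pdf"))
    for (fontname, size), count in usage.most_common():
        print(fontname, size, count)
```

Rounding the size to one decimal place is a choice made for this sketch, so that near-identical sizes group together in the summary.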
Convert PDF File to hOCR via Python API
hOCR is an open standard for representing formatted text obtained from optical character recognition (OCR). The free Pdfminer.six library allows software developers to convert PDF files to hOCR format with just a couple of lines of Python code. The library is easy to use: it extracts the explicit text information from PDFs that contain it and uses that information to generate a basic hOCR representation.
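One way to sketch such a conversion is with extract_text_to_fp from pdfminer.high_level and its output_type argument. This assumes a pdfminer.six release recent enough to accept 'hocr' as an output type, since older releases may only offer 'text', 'html', 'xml', and 'tag'. The function name convert_pdf and the SUPPORTED_TYPES set are made up for this example.

```python
SUPPORTED_TYPES = {"text", "html", "xml", "tag", "hocr"}


def convert_pdf(pdf_path, out_path, output_type="hocr"):
    """Write the contents of pdf_path to out_path in the requested format."""
    if output_type not in SUPPORTED_TYPES:
        raise ValueError(f"unsupported output type: {output_type!r}")

    # Imported here so the argument check above works without pdfminer.six.
    from pdfminer.high_level import extract_text_to_fp
    from pdfminer.layout import LAParams

    # The output file is opened in binary mode because the converters encode
    # their output with the default utf-8 codec before writing.
    with open(pdf_path, "rb") as pdf_file, open(out_path, "wb") as out_file:
        extract_text_to_fp(
            pdf_file,
            out_file,
            output_type=output_type,
            laparams=LAParams(),  # enable layout analysis
        )


if __name__ == "__main__":
    # Placeholder file names.
    convert_pdf("example.pdf", "example.hocr", output_type="hocr")
    convert_pdf("example.pdf", "example.html", output_type="html")
```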
Convert PDF File to Text via Python
The library includes a rich feature set that lets you go beyond basic PDF processing. The open source Pdfminer.six library lets Python developers convert PDF documents to text with just a couple of simple commands. First, you need to provide the path to the PDF file as well as to the text file. If the document is password protected, you also need to provide its password. The following code example can be used to achieve this: given a filename, it returns the text of the PDF as a string, which you can then easily save to a file.
Convert PDF File to Text Format via Python API
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import StringIO
def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""  # set this for password-protected documents
    maxpages = 0  # 0 means no page limit
    caching = True
    pagenos = set()  # an empty set means all pages
    # The with block closes the PDF file even if a page fails to parse.
    with open(path, 'rb') as fp:
        for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password, caching=caching, check_extractable=True):
            interpreter.process_page(page)
    text = retstr.getvalue()
    device.close()
    retstr.close()
    return text
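For the common case, the high-level extract_text function accepts a password argument, so the same result can likely be reached with less code. The sketch below also saves the text to a file, as described above. It assumes pdfminer.six is installed, and the function name pdf_to_text_file is made up for this example.

```python
from pathlib import Path


def pdf_to_text_file(pdf_path, txt_path, password=""):
    """Extract text from a PDF, save it as UTF-8, and return the character count."""
    pdf_path = Path(pdf_path)
    if not pdf_path.is_file():
        raise FileNotFoundError(pdf_path)

    # Imported here so the path check above works without pdfminer.six.
    from pdfminer.high_level import extract_text

    # An empty password is the default used for unprotected documents.
    text = extract_text(str(pdf_path), password=password)
    Path(txt_path).write_text(text, encoding="utf-8")
    return len(text)


if __name__ == "__main__":
    # Placeholder file names and password.
    count = pdf_to_text_file("protected.pdf", "protected.txt", password="secret")
    print("characters written:", count)
```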