Open Source Python Library for Converting PDF Files
Free Python API that lets developers export, rotate, merge, and concatenate PDF files, and extract data and elements from PDFs.
pdfrw is an open source, pure Python library that lets software developers read and write PDF files without installing any external software. The library is simple to use, and its source code is well documented and easy to understand. It includes proper Unicode support for text strings in PDFs, along with what the project describes as the fastest pure Python PDF parser.
The pdfrw library supports several important PDF operations, such as merging PDFs, modifying metadata, concatenating multiple PDFs, extracting images, printing PDFs, rotating PDF pages, creating a new PDF, adding a PDF watermark image, and more.
Getting Started with pdfrw
pdfrw supports Python 2.6, 2.7, 3.3, 3.4, 3.5, and 3.6. You can install it with pip using the following command.
Install pdfrw via pip
python -m pip install pdfrw
Create PDF Documents via Python Library
The pdfrw library lets software developers create PDF documents inside their own Python applications with just a couple of lines of code. It also supports accessing and modifying existing PDF files, so you can insert new pages, graphics components, or text elements into an existing PDF. pdfrw can find the pages in the PDF files you read in and write a set of those pages back out to a new PDF file.
Create & Alter PDF Documents via Python
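# Alter an existing PDF: set a new title and write the result to a new file
import sys
import os
from pdfrw import PdfReader, PdfWriter
# Expect exactly one command-line argument: the input PDF path
inpfn, = sys.argv[1:]
outfn = 'alter.' + os.path.basename(inpfn)
# PdfReader returns the document trailer, which holds the Info dictionary
trailer = PdfReader(inpfn)
trailer.Info.Title = 'My New Title Goes Here'
PdfWriter(outfn, trailer=trailer).write()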
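The paragraph above mentions writing a chosen set of pages back out to a new file. A minimal sketch of that idea follows. The helper names `parse_page_spec` and `write_subset`, and the `'1-3,5'` page-spec format, are illustrative assumptions and not part of pdfrw; only `PdfReader`, its `pages` list, `PdfWriter.addpage`, and `PdfWriter.write` come from the library.

```python
def parse_page_spec(spec):
    """Turn a spec such as '1-3,5' into 1-based page numbers [1, 2, 3, 5]."""
    numbers = []
    for part in spec.split(','):
        part = part.strip()
        if not part:
            continue
        first, _, last = part.partition('-')
        start = int(first)
        stop = int(last) if last else start
        numbers.extend(range(start, stop + 1))
    return numbers


def write_subset(inpfn, outfn, spec):
    """Copy the pages named by spec from inpfn into a new PDF at outfn."""
    # Imported here so parse_page_spec stays usable without pdfrw installed.
    from pdfrw import PdfReader, PdfWriter

    pages = PdfReader(inpfn).pages
    writer = PdfWriter()
    for number in parse_page_spec(spec):
        if not 1 <= number <= len(pages):
            raise ValueError('page %d is out of range' % number)
        writer.addpage(pages[number - 1])  # pages is a 0-based list
    writer.write(outfn)
```

A call such as `write_subset('input.pdf', 'subset.pdf', '1-3,5')` would then copy pages 1, 2, 3, and 5, assuming `input.pdf` exists and has at least five pages.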
Reading PDF Files via Python
The pdfrw library lets software developers access and read different parts of a PDF document from Python applications, and it gives access to the entire document. The library supports retrieving file information, size, and more. The reader exposes a special attribute named pages, which lists all the pages of a PDF document. It also exposes the document information object (Info), from which you can pull details such as the author and title.
Access & Read PDF Files via Python
# Reading PDF files
from pdfrw import PdfReader
def get_pdf_info(path):
    pdf = PdfReader(path)
    print(pdf.keys())       # top-level trailer keys
    print(pdf.Info)         # document information dictionary
    print(pdf.Root.keys())  # document catalog keys
    print('PDF has {} pages'.format(len(pdf.pages)))
if __name__ == '__main__':
    get_pdf_info('w9.pdf')
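The information object prints with PDF-style keys such as `/Title`. The sketch below shows one way to turn it into a plain dictionary for further processing. The helper names `info_to_dict` and `read_info` are hypothetical, and the code assumes that pdfrw string values expose a `decode()` method; verify that assumption against your installed version.

```python
def info_to_dict(info):
    """Return a plain dict with keys like 'Title' instead of '/Title'."""
    result = {}
    for key, value in (info or {}).items():
        # Assumption: pdfrw string objects offer decode(); other values
        # (plain str, numbers) are kept unchanged.
        if hasattr(value, 'decode'):
            value = value.decode()
        result[str(key).lstrip('/')] = value
    return result


def read_info(path):
    """Read a PDF and return its document information as a plain dict."""
    # Imported here so info_to_dict stays usable without pdfrw installed.
    from pdfrw import PdfReader

    # Info is None when the PDF has no information dictionary.
    return info_to_dict(PdfReader(path).Info)
```

With this in place, `read_info('w9.pdf').get('Author')` would return the author if the file records one, and `None` otherwise.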
Adding or Modifying Metadata
pdfrw allows software developers to add or modify the metadata of PDF files inside their own Python applications. You can alter a single metadata item in a PDF and write the result to a new PDF, or concatenate multiple files and add metadata of your own to the output PDF file.
Modify PDF Metadata via Python
# Modifying PDF metadata: set a new title and write the result to a new file
import sys
import os
from pdfrw import PdfReader, PdfWriter
inpfn, = sys.argv[1:]
outfn = 'alter.' + os.path.basename(inpfn)
trailer = PdfReader(inpfn)
trailer.Info.Title = 'My New Title Goes Here'
PdfWriter(outfn, trailer=trailer).write()
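The section also mentions concatenating several files and attaching metadata to the result. A minimal sketch of that follows, modeled on the usage pdfrw documents for `PdfWriter.addpages` and `IndirectPdfDict`. The function names `default_output_name` and `concatenate`, and the `'cat.'` naming convention, are assumptions made for illustration.

```python
import os


def default_output_name(inputs):
    """Name the result after the first input, e.g. 'cat.a.pdf'."""
    return 'cat.' + os.path.basename(inputs[0])


def concatenate(inputs, outfn=None, **metadata):
    """Join the pages of all input PDFs and set Info entries on the output."""
    # Imported here so default_output_name stays usable without pdfrw.
    from pdfrw import PdfReader, PdfWriter, IndirectPdfDict

    outfn = outfn or default_output_name(inputs)
    writer = PdfWriter()
    for inpfn in inputs:
        writer.addpages(PdfReader(inpfn).pages)
    # Replace the document information dictionary on the output trailer.
    writer.trailer.Info = IndirectPdfDict(**metadata)
    writer.write(outfn)
    return outfn
```

For example, `concatenate(['a.pdf', 'b.pdf'], Title='Combined report', Author='Jane Doe')` would write `cat.a.pdf`, assuming both input files exist.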
Splitting PDF Documents
pdfrw allows software developers to programmatically split PDF documents inside their applications. You may need to extract a specific part of a PDF book or divide it into multiple PDFs instead of storing everything in one file. With pdfrw this is straightforward: provide an input PDF file path, the number of pages that you want to extract, and the output path.
Split PDF File to Multiple PDFs via Python
# Splitting a PDF: copy the first pages of a file into a new PDF
from pdfrw import PdfReader, PdfWriter
def split(path, number_of_pages, output):
    pdf_obj = PdfReader(path)
    total_pages = len(pdf_obj.pages)
    writer = PdfWriter()
    for page in range(number_of_pages):
        # Page indices are 0-based, so stop before total_pages
        if page < total_pages:
            writer.addpage(pdf_obj.pages[page])
    writer.write(output)
if __name__ == '__main__':
    split('reportlab-sample.pdf', 10, 'subset.pdf')
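The example above copies only the leading pages into a single file. To divide a whole document into several files of equal size, a sketch along the following lines may help. The names `chunk_ranges` and `split_into_files`, and the `part-001.pdf` naming scheme, are assumptions made for illustration; the pdfrw calls are limited to `PdfReader`, `PdfWriter.addpages`, and `PdfWriter.write`.

```python
def chunk_ranges(total_pages, pages_per_file):
    """Return (start, stop) index pairs covering every page, stop exclusive."""
    if pages_per_file < 1:
        raise ValueError('pages_per_file must be at least 1')
    return [(start, min(start + pages_per_file, total_pages))
            for start in range(0, total_pages, pages_per_file)]


def split_into_files(path, pages_per_file, prefix='part'):
    """Write consecutive groups of pages to part-001.pdf, part-002.pdf, ..."""
    # Imported here so chunk_ranges stays usable without pdfrw installed.
    from pdfrw import PdfReader, PdfWriter

    pages = PdfReader(path).pages
    outputs = []
    ranges = chunk_ranges(len(pages), pages_per_file)
    for index, (start, stop) in enumerate(ranges, 1):
        outfn = '%s-%03d.pdf' % (prefix, index)
        writer = PdfWriter()
        writer.addpages(pages[start:stop])
        writer.write(outfn)
        outputs.append(outfn)
    return outputs
```

Calling `split_into_files('reportlab-sample.pdf', 4)` on a ten-page file would produce three outputs holding four, four, and two pages.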