Automating PDF Hyperlinks
Background
$BUSINESS is an engineering firm with an extensive library of manuals and guides stored as PDFs. These files reference related documents, such as sources of background information, troubleshooting guides, and assembly instructions.
Problem
Although references to related documents appear in the familiar blue hyperlink color, as if you could click on them, they are in fact just text. The PDF files have been stripped of all inter-document hyperlinks. As a result, navigating between documents is time-consuming, since the name of the referenced document has to be typed or copy-pasted into the search bar of the document store.
The documents originally had hyperlinks, but they were generated by a supplier who is no longer available. Without access to the original documents, rebuilding the PDF files from the source code, if you will, is not an option. New hyperlinks need to be added to the existing PDF files.
Lucky Breaks
The solution is actually quite straightforward, thanks to a few lucky breaks.
1. The files are text-based
Unlike image-based PDF files, which are what you get when you scan a physical document, text-based PDF files give easy access to the text and to other information about each element or character in the document. We can simply read the text from the document with readily available tools. This way we avoid needing computer vision and OCR, which would be more error-prone and would require extra steps to verify or clean the results.
2. The hyperlinks have a blue color
As mentioned above, we can use readily available tools to scan the documents and locate all of the blue characters. Only former hyperlinks have this shade of blue, so it is easy to separate them from the other text in the document.
3. The exact document name is the hyperlink
The hyperlinks used the exact name of the referenced document and never phrases such as “The guide here” or “additional information can be found here”. Whenever a related document is mentioned, the text states its name explicitly, for example “See INI-054 Assembly Guide for details”. We therefore do not need to infer targets or do additional analysis to find the related documents that best fit the context of the document we are working on. As you can imagine, this turns the project into a simple script rather than the full-on job it would have been if someone with a technical background in the engineering subjects had to verify each and every suspected hyperlink target.
Solution
First, we recursively walk the directories containing the PDF files, collecting file names and full file paths as we go.
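The directory walk can be sketched with the standard library alone. This is a minimal illustration, not the original script: the function name `index_pdfs` and the choice to key the index by file name without the extension are assumptions made for the example.

```python
import os
from pathlib import Path


def index_pdfs(root):
    """Map each PDF's name (without extension) to its full path.

    Hypothetical helper: if two files share a name, the one visited
    last wins, so a real script may want to detect duplicates.
    """
    index = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            # Accept ".pdf" and ".PDF" alike.
            if filename.lower().endswith(".pdf"):
                path = Path(dirpath) / filename
                index[path.stem] = path
    return index
```

Calling `index_pdfs` on the top-level folder of the document store returns a dictionary that later steps can use to look up a referenced document by name.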
Then, we scan through the documents using pdfplumber to identify the former hyperlinks and collect the text of the hyperlink to match with a known file name. We also grab the position of the hyperlink characters on the page.
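Matching the collected link text against known file names could look like the sketch below. It assumes a dictionary mapping document names to paths is already available (however it was built), and it assumes that collapsing whitespace and ignoring case is an acceptable way to compare names. Both `build_lookup` and `resolve_target` are hypothetical helpers.

```python
import re


def normalize(name):
    """Collapse runs of whitespace and ignore case (an assumed matching rule)."""
    return re.sub(r"\s+", " ", name).strip().casefold()


def build_lookup(index):
    """Turn a {document name: path} mapping into a normalized lookup table."""
    return {normalize(name): path for name, path in index.items()}


def resolve_target(link_text, lookup):
    """Return the path for a former hyperlink's text, or None if unknown."""
    return lookup.get(normalize(link_text))
```

Returning `None` for unknown names makes it easy to log references that point at documents missing from the library instead of silently skipping them.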
Lastly, using PyPDF2, we overlay each blue former hyperlink with a new hyperlink at the coordinates collected in the previous step.
Technical details
Finding former hyperlink text isn’t completely straightforward but it was relatively easy in this case.
The first step is to filter the characters on the page down to just the characters in the distinctive blue shade.
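A color filter along these lines might look as follows. pdfplumber exposes each character as a dictionary, and the fill color is stored under the `non_stroking_color` key. The exact shade is not given here, so `LINK_BLUE` is a placeholder assumption (pure RGB blue), and the helper names are hypothetical.

```python
# Assumed shade: replace with the color actually found in the documents.
LINK_BLUE = (0.0, 0.0, 1.0)


def is_link_blue(char, target=LINK_BLUE, tol=0.01):
    """True if the character's fill color matches the target within a tolerance."""
    color = char.get("non_stroking_color")
    # Grayscale or CMYK colors have a different number of components.
    if not isinstance(color, (tuple, list)) or len(color) != len(target):
        return False
    return all(
        isinstance(c, (int, float)) and abs(c - t) <= tol
        for c, t in zip(color, target)
    )


def blue_chars(chars):
    """Keep only the characters drawn in the link color."""
    return [c for c in chars if is_link_blue(c)]


def blue_chars_in_pdf(pdf_path):
    """Yield (page index, blue characters) for every page of a PDF."""
    import pdfplumber  # imported here so the pure helpers work without it

    with pdfplumber.open(pdf_path) as pdf:
        for page_index, page in enumerate(pdf.pages):
            yield page_index, blue_chars(page.chars)
```

Printing the distinct `non_stroking_color` values found on a sample page is a quick way to discover the real shade before settling on `LINK_BLUE`.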
A page may contain many former hyperlinks, so this collection of blue characters comes from all parts of the page and needs to be grouped back into the referenced document names as they read on the page. Each character object returned by pdfplumber contains coordinates for the position of that character on the page. By sheer luck, we also found only one referenced document per row, so we can group the characters that share a horizontal row on the page to reconstruct each hyperlink’s group of characters.
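The row grouping can be sketched as below. It relies on the `text`, `x0`, `x1`, `top` and `bottom` keys that pdfplumber provides for each character. The tolerance of 2 points and the helper names are assumptions, and the sketch only holds under the one-link-per-row condition described above. It also assumes the spaces inside a document name are present among the blue characters.

```python
def group_by_row(chars, tol=2.0):
    """Group characters whose `top` coordinates lie within `tol` points."""
    rows = []
    for ch in sorted(chars, key=lambda c: (c["top"], c["x0"])):
        # Compare against the first character of the current row.
        if rows and abs(ch["top"] - rows[-1][0]["top"]) <= tol:
            rows[-1].append(ch)
        else:
            rows.append([ch])
    return rows


def row_to_link(row):
    """Rebuild the link text and bounding box from one row of characters."""
    row = sorted(row, key=lambda c: c["x0"])  # left-to-right reading order
    return {
        "text": "".join(c["text"] for c in row).strip(),
        "x0": min(c["x0"] for c in row),
        "x1": max(c["x1"] for c in row),
        "top": min(c["top"] for c in row),
        "bottom": max(c["bottom"] for c in row),
    }
```

If a document ever places two references on the same row, the grouping would also need to split a row wherever the horizontal gap between neighboring characters is unusually large.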
Finally, we overlay the referenced document text with a hyperlink annotation using the position of the characters in the link.
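One detail worth spelling out is the coordinate conversion. pdfplumber measures `top` and `bottom` from the top of the page, while a PDF annotation rectangle is measured from the bottom-left corner. The sketch below assumes a PyPDF2 version with the snake_case API (2.x or 3.x), an unrotated page whose media box starts at the origin, and link targets expressed as URIs under a hypothetical `uri` key. The original script may have linked to its targets differently.

```python
def to_pdf_rect(link, page_height):
    """Convert a top-left-origin box to a PDF [x0, y0, x1, y1] rectangle."""
    return [
        link["x0"],
        page_height - link["bottom"],  # lower edge, measured from page bottom
        link["x1"],
        page_height - link["top"],     # upper edge, measured from page bottom
    ]


def add_links(src_path, dst_path, links_by_page):
    """Copy a PDF and add one link annotation per entry.

    `links_by_page` maps a zero-based page index to a list of dicts with
    the keys x0, x1, top, bottom and uri (an assumed structure).
    """
    from PyPDF2 import PdfReader, PdfWriter
    from PyPDF2.generic import RectangleObject

    reader = PdfReader(src_path)
    writer = PdfWriter()
    for page in reader.pages:
        writer.add_page(page)

    for page_index, links in links_by_page.items():
        height = float(reader.pages[page_index].mediabox.height)
        for link in links:
            rect = RectangleObject(to_pdf_rect(link, height))
            writer.add_uri(page_index, link["uri"], rect)

    with open(dst_path, "wb") as fh:
        writer.write(fh)
```

Writing to a new file rather than overwriting the source keeps the original PDFs intact while the results are checked.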
The task sounded more complex when I first heard the problem described, but after seeing the documents and finding a few lucky breaks, the job turned out to be remarkably simple for such an improvement to engineers’ productivity.
- automation
- python