Automating PDF Hyperlinks

Background

$BUSINESS is an engineering firm with an extensive library of manuals and guides stored as PDFs. Each PDF file refers to related documents, such as background material for the current document, troubleshooting guides, and assembly instructions.

Problem

Although references to related documents look clickable because of the familiar blue hyperlink color, they are in fact just text. The PDF files have been stripped of all inter-document hyperlinks. As a result, navigating between documents is time-consuming, since the name of the referenced document has to be typed or copy-pasted into the search bar of the document store.

The documents originally had hyperlinks, but they were generated by a supplier who is no longer available. We have no access to the original source documents, so rebuilding the PDF files from source is not an option. New hyperlinks need to be added to the existing PDF files.

Lucky Breaks

The solution turned out to be quite straightforward, thanks to three lucky breaks.

1. The files are text-based

Unlike image-based PDF files, which are what you get when you scan a physical document, text-based PDF files give direct access to the text and to details about every character in the document. We can simply read the text with readily available tools. This avoids computer vision and OCR, which would be more error-prone and would need extra steps to verify or clean the results.

2. The hyperlinks have a blue color

As mentioned above, the former hyperlinks kept their blue color, so we can use the same readily available tools to scan the documents and locate every blue character. Only former hyperlinks use this shade of blue, so it is easy to separate them from the rest of the text.

3. The exact document name is the hyperlink

The hyperlinks always used the exact name of the referenced document, never a phrase such as “The guide here” or “additional information can be found here”. Whenever a related document is mentioned, it is named explicitly, as in “See INI-054 Assembly Guide for details”. We therefore do not need to infer targets or analyze the surrounding context to find the best-matching document. This turns the project into a simple script rather than the full-scale job it would have been if someone with a technical background in the engineering subjects had to verify every suspected hyperlink target.

Solution

First, we recursively walk the directories that hold the PDF files, collecting file names and full file paths as we go.
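
The directory walk can be sketched with the standard library alone. This is a minimal illustration, not the script used on the project: the function name index_pdf_files and the choice to key the index by file name without the extension are assumptions made for this example.

```python
from pathlib import Path
from typing import Dict


def index_pdf_files(root: Path) -> Dict[str, Path]:
    """Map each PDF's name (without the .pdf suffix) to its full path."""
    index: Dict[str, Path] = {}
    # rglob descends into every subdirectory below root
    for path in sorted(root.rglob("*.pdf")):
        # If two files share a name, keep the first one in sorted order
        index.setdefault(path.stem, path.resolve())
    return index
```

Note that the "*.pdf" pattern is case-sensitive on most Linux file systems, so files ending in ".PDF" would need a second pattern.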

Then, we scan through the documents using pdfplumber to identify the former hyperlinks and collect the text of the hyperlink to match with a known file name. We also grab the position of the hyperlink characters on the page.
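
Matching the collected link text against the known file names can be as small as a dictionary lookup. The sketch below assumes an index that maps document names to paths; the names normalize and match_link_target are hypothetical, and the whitespace and case normalization is a precaution added for this example rather than something the documents required.

```python
from pathlib import Path
from typing import Dict, Optional


def normalize(name: str) -> str:
    """Collapse runs of whitespace and ignore case."""
    return " ".join(name.split()).casefold()


def match_link_target(link_text: str, index: Dict[str, Path]) -> Optional[Path]:
    """Return the path whose document name equals the link text, or None."""
    lookup = {normalize(name): path for name, path in index.items()}
    return lookup.get(normalize(link_text))
```

A link whose text matches no known file name returns None, which makes it easy to log the misses for manual review instead of writing a broken link.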

Lastly, we use PyPDF2 to overlay each blue former hyperlink with a new hyperlink, using the coordinates collected in the previous step.

Technical Details

Finding former hyperlink text is not always straightforward, but it was relatively easy in this case.

The first step is to filter the characters on the page down to just the characters containing the distinctive blue shade.

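from pathlib import Path
from typing import Any, Dict, List, Tuple

import pdfplumber

# RGB fill color that marks a former hyperlink
LINK_BLUE = (0.0, 0.4, 0.8)

def extract_blue_characters(pdf_file: Path) -> List[Tuple[int, List[Dict[str, Any]]]]:
    """
    Find all characters in the PDF file that use the unique blue color
    that distinguishes the text as a former hyperlink.

    Returns one (zero-based page number, blue characters) pair per page.
    """
    pages = []
    with pdfplumber.open(pdf_file) as pdf:
        for num, page in enumerate(pdf.pages):
            # non_stroking_color is the fill color of the character
            chars = [
                char for char in page.chars
                if char.get("non_stroking_color") == LINK_BLUE
            ]
            pages.append((num, chars))
    return pages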
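
The exact comparison above worked for these documents. As a hedged variation, the sketch below compares colors with a small tolerance and skips characters whose color is missing or has a different number of components (grayscale text, for instance). The helper names is_link_blue and filter_blue and the tolerance value are assumptions for illustration; the functions operate on plain dictionaries shaped like pdfplumber's character objects.

```python
import math
from typing import Any, Dict, List, Sequence

LINK_BLUE = (0.0, 0.4, 0.8)  # RGB shade used by the former hyperlinks


def is_link_blue(color: Any, target: Sequence[float] = LINK_BLUE, tol: float = 1e-3) -> bool:
    """True if color is an RGB value within tol of the target shade."""
    # Reject None, single numbers, and colors with a different component count
    if not isinstance(color, (tuple, list)) or len(color) != len(target):
        return False
    return all(math.isclose(c, t, abs_tol=tol) for c, t in zip(color, target))


def filter_blue(chars: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep only the characters filled with the link color."""
    return [c for c in chars if is_link_blue(c.get("non_stroking_color"))]
```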

A page may contain many former hyperlinks, so this collection of blue characters spans all parts of the page and needs to be grouped back into the referenced document names as they read on the page. Each character object returned by pdfplumber carries the coordinates of that character on the page. By sheer luck, we also found at most one referenced document per row, so grouping the characters that share the same horizontal row reconstructs each hyperlink’s characters.

from dataclasses import dataclass
from typing import Any, Dict, List, Tuple

@dataclass
class PDFLink:
    top: float  # pdfplumber reports coordinates as floats
    text: str
    chars: List[Dict[str, Any]]
    position: Tuple[int, int, int, int]  # (left, top, right, bottom)

def find_links(chars: List[Dict[str, Any]]) -> List[PDFLink]:
    """
    Find distinct links in the list of characters.
    This works easily since the PDF file is text based
    so horizontal rows are perfectly level and we are
    assuming a single line only contains a single link.
    """
    y_positions = {c["top"] for c in chars}
    links = []
    for y0 in y_positions:
        # Characters on this row, ordered left to right
        link_chars = sorted(
            (c for c in chars if c["top"] == y0), key=lambda c: c["x0"]
        )
        text = "".join(c["text"] for c in link_chars)
        left = int(link_chars[0]["x0"])
        right = int(link_chars[-1]["x1"])
        top = int(link_chars[0]["top"])
        bottom = int(link_chars[0]["bottom"])
        links.append(
            PDFLink(
                top=y0,
                text=text,
                chars=link_chars,
                position=(left, top, right, bottom),
            )
        )
    # Return the links in reading order, top of the page first
    return sorted(links, key=lambda link: link.top)
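
Grouping on an exactly equal "top" value was enough here. For documents where characters on one visual line differ slightly in height, a tolerance-based grouping is one possible fallback. The sketch below is an assumption-laden illustration: the function name group_by_row and the default tolerance of one point are hypothetical, and it works on plain dictionaries with "top", "x0", and "text" keys.

```python
from typing import Any, Dict, List


def group_by_row(chars: List[Dict[str, Any]], tol: float = 1.0) -> List[List[Dict[str, Any]]]:
    """Group characters whose top is within tol of the first character in the row."""
    rows: List[List[Dict[str, Any]]] = []
    for char in sorted(chars, key=lambda c: (c["top"], c["x0"])):
        # Compare against the first character of the row so the
        # tolerance does not drift as more characters are added
        if rows and abs(char["top"] - rows[-1][0]["top"]) <= tol:
            rows[-1].append(char)
        else:
            rows.append([char])
    # Restore left-to-right order within each row
    return [sorted(row, key=lambda c: c["x0"]) for row in rows]
```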

Finally, we overlay the referenced document text with a hyperlink annotation using the position of the characters in the link.

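from typing import List

from PyPDF2.generic import AnnotationBuilder

def write_links(page_number, pageObj, writerObj, links: List[PDFLink], url):
    """
    Copy the page into the writer and cover each link with a clickable
    annotation. Every link in `links` is pointed at the same `url`.
    """
    writerObj.add_page(pageObj)
    # mediabox[3] is the y coordinate of the top edge of the page
    # (`mediabox` is the current name; camelCase `mediaBox` is deprecated)
    height = float(pageObj.mediabox[3])
    for link in links:
        # pdfplumber measures top and bottom from the top left corner
        # of the page, while PDF annotations measure from the bottom
        # left corner, so flip the vertical coordinates
        left, top, right, bottom = link.position
        # Lower left corner first, then upper right corner
        box = (left, int(height - bottom), right, int(height - top))
        # Add the hyperlink
        annotation = AnnotationBuilder.link(
            rect=box,
            url=url,
        )
        writerObj.add_annotation(page_number=page_number, annotation=annotation)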
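
The coordinate flip is the step most likely to go wrong, so it can help to isolate it in a small pure function that is easy to check by hand. This is a sketch under the assumption that positions are (left, top, right, bottom) measured from the top of the page, as in the code above; the name to_pdf_rect is hypothetical.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]


def to_pdf_rect(position: Box, page_top: float) -> Box:
    """
    Convert (left, top, right, bottom), measured down from the top of
    the page, into (x0, y0, x1, y1), measured up from the bottom.
    """
    left, top, right, bottom = position
    # The lower edge of the text becomes y0 and the upper edge becomes y1
    return (left, page_top - bottom, right, page_top - top)
```

For example, on a US Letter page 792 points tall, text that spans from 100 to 112 points below the top edge ends up between 680 and 692 points above the bottom edge.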

The task sounded complex when I first heard the problem described, but after seeing the documents and finding a few lucky breaks, it turned out to be a remarkably small job for such a large improvement to the engineers’ productivity.