Scraping PDFs With Computer Vision
I am building a dashboard to explore, among other things, the relationship between fuel prices and utility rates in Vanuatu. So far the data comes from publicly released reports, so the first step in building the dashboard is to turn those reports into something machine readable.
The Problem
One data point I want to collect is the amount of energy produced by each source, whether diesel generators or some form of renewable energy. A new report is released each month, and I want to grab the figure from the top of each one, but the figures are not positioned or labeled consistently.
For the sake of demonstrating this data scraping technique, let’s ignore the fact that I could construct the pie chart myself from the table below the figure.
The Plan
The figure we want is not in the same absolute position each month, nor is it the same size or shape, so we can immediately rule out a simple image crop strategy. If we also rule out reconstructing the figure from the text embedded in the PDF (pretend the PDF is a scanned document with no extractable text), that leads us to a computer vision approach to locate the figure.
So how do we find the figure with computer vision?
What we do know is that the figure is always near the top of the first page, so we can limit our search to roughly the top half of the first page of each report. This reduces false positives and speeds up the search, since we are scanning less area. The figure also has a uniquely large box-shaped border, which will be the primary feature our computer vision script searches for. As a backup, the figure’s consistent and unique color palette could also help us find it, though we won’t need that detail for this use case.
The Script
First, we use a Python package called pdf2image to open each PDF file and convert each page to an image, which we load into OpenCV.
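A minimal sketch of that step, assuming the reports sit in a local `reports/` directory (the filename is hypothetical):

```python
import cv2
import numpy as np
from pdf2image import convert_from_path

# Render only the first page of the PDF to a PIL image.
pages = convert_from_path("reports/2023-01.pdf", dpi=200, first_page=1, last_page=1)

# Convert the PIL image (RGB) to an OpenCV BGR array.
page = cv2.cvtColor(np.array(pages[0]), cv2.COLOR_RGB2BGR)
```

Note that pdf2image relies on the poppler utilities being installed on the system.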
Next, we crop the image down to our region of interest, the top half of the first page, since that is where we know the figure can be found.
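Since OpenCV images are just NumPy arrays, the crop is a plain slice:

```python
# Keep only the top half of the page, where the figure always appears.
height = page.shape[0]
roi = page[: height // 2, :]
```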
Then we apply the Canny edge detection algorithm to find the edges of features within our region of interest.
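A sketch of the edge detection; the 50/150 thresholds are illustrative assumptions that may need tuning per document:

```python
# Canny works on single-channel images, so convert to grayscale first.
gray = cv2.cvtColor(roi, cv2.COLOR_BGR2GRAY)
edges = cv2.Canny(gray, 50, 150)
```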
With the edges detected, we find all of the contours therein and filter the list down to those with approximately four sides, since we are searching for the figure’s rectangular border.
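A sketch of the contour filtering, assuming OpenCV 4 (whose findContours returns two values) and the common trick of approximating each contour as a polygon and keeping those with four vertices:

```python
# Find contours in the edge map.
contours, _ = cv2.findContours(edges, cv2.RETR_LIST, cv2.CHAIN_APPROX_SIMPLE)

# Keep contours whose polygonal approximation has four vertices.
boxes = []
for contour in contours:
    perimeter = cv2.arcLength(contour, True)
    approx = cv2.approxPolyDP(contour, 0.02 * perimeter, True)
    if len(approx) == 4:
        boxes.append(contour)
```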
Luckily for our use case, there is only one very large four-sided contour, and it happens to contain the figure we want to extract. So the final step is to select the contour with the greatest internal area and extract that section from our original region of interest, leaving us with an image of just the figure.
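Continuing the sketches above, the selection and extraction might look like this:

```python
# Pick the four-sided contour with the largest area: the figure's border.
figure_box = max(boxes, key=cv2.contourArea)

# Crop that contour's bounding rectangle out of the region of interest.
x, y, w, h = cv2.boundingRect(figure_box)
figure = roi[y : y + h, x : x + w]
cv2.imwrite("figure.png", figure)
```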
The Result
In just seconds, the figure is extracted from all three years of reports.
We could then follow up with OCR to read the values off the figure. Or, if no percentages were printed on the figure but a key were provided, we could perform a similar contour analysis on each colored slice of the pie chart to determine its share of the circle’s area and approximate the percentages. But that’s completely unnecessary for my simple project.
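As a rough sketch of that fallback idea, using simple color masking and pixel counting as a stand-in for per-slice contour analysis (the key colors below are made up, and `figure` is the crop from the previous step):

```python
# Hypothetical key: slice label -> approximate BGR color read from the legend.
key = {"diesel": (40, 40, 200), "solar": (0, 200, 240), "hydro": (200, 120, 0)}
tolerance = np.array([25, 25, 25])

# Count the pixels matching each slice's color...
areas = {}
for label, color in key.items():
    lower = np.clip(np.array(color) - tolerance, 0, 255)
    upper = np.clip(np.array(color) + tolerance, 0, 255)
    mask = cv2.inRange(figure, lower, upper)
    areas[label] = cv2.countNonZero(mask)

# ...and approximate each percentage as that slice's share of all slice pixels.
total = sum(areas.values())
percentages = {label: 100 * area / total for label, area in areas.items()}
```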
- python
- computer-vision