The scraper will process what I will call a Session. Each session has a name for reference, a list of web pages to scrape, and a function to process/format the results. The reasoning here is that some of the data I want to collect requires multiple pages of raw data to construct, so a Session handles collecting and processing that data in an all-or-nothing fashion.
Each Page has a path to the URL it represents and a function to process/scrape the page. The output of each page's process function is later passed to the Session's process function, which aggregates the data from multiple pages.
So, put together, the session for the weekly forecast, which requires two pages of data to create, looks like the following.
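In simplified form (the dataclass fields and the page paths here are illustrative stand-ins, not the exact code):

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Page:
    path: str                            # URL path the page represents
    process: Callable[[str], Any]        # scrapes the fetched HTML into raw data

@dataclass
class Session:
    name: str                            # name for reference
    pages: list[Page]                    # every page needed to build the data
    process: Callable[[list[Any]], Any]  # aggregates the output of all pages

def scrape_forecast_page(html: str) -> Any:
    ...  # placeholder; the real scrape functions are covered further down

def aggregate_weekly_forecast(results: list[Any]) -> Any:
    ...  # placeholder; combines both pages into one weekly forecast

forecast_weekly = Session(
    name="forecast_weekly",
    pages=[
        Page(path="/forecast-page-1", process=scrape_forecast_page),
        Page(path="/forecast-page-2", process=scrape_forecast_page),
    ],
    process=aggregate_weekly_forecast,
)
```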
Handling Sessions
The main loop of the web scraper handles many Session objects, and because network requests are slow, we set up an async loop to process all the sessions concurrently.
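In rough form, assuming asyncio (process_session_mapping is covered next):

```python
import asyncio

async def run_all_sessions(sessions: list[Session]) -> None:
    # Run every session concurrently so the slow network requests overlap
    # instead of running back to back.
    await asyncio.gather(
        *(process_session_mapping(session) for session in sessions),
        return_exceptions=True,  # one failed session shouldn't stop the rest
    )
```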
Each Session is handled by the process_session_mapping function. Below is a condensed version of that function.
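In sketch form, assuming a SQLAlchemy-style async session; the model names and the save_results helper here are simplified stand-ins:

```python
from datetime import datetime, timezone

async def process_session_mapping(session: Session) -> None:
    # One database transaction wraps the whole session: if anything below
    # raises, the rollback leaves the row's `end` column as NULL, which is
    # how a failed session is recorded.
    async with db_session.begin():
        session_row = SessionRecord(name=session.name,
                                    start=datetime.now(timezone.utc))
        db_session.add(session_row)

        # Fetch and scrape every page this session depends on,
        # saving each page's raw data as we go.
        results = []
        for page in session.pages:
            scraping_result = await process_page_mapping(page)
            db_session.add(PageRecord(path=page.path, raw=scraping_result.raw_data))
            results.append(scraping_result)

        # Aggregate the pages into the final dataset and persist it.
        save_results(db_session, session.process(results))

        # Only a fully successful session gets an `end` timestamp.
        session_row.end = datetime.now(timezone.utc)
```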
Of note is db_session.begin(), which starts a database transaction; the entire session is handled within this context. That way, if any error is raised while processing the session or its pages, the transaction is cleanly rolled back and we are left with a NULL in the session's end column, marking it as a failed session.
| ID   | session         | start                      | end                        |
|------|-----------------|----------------------------|----------------------------|
| 1136 | forecast_weekly | 2023-03-26 8:35:28.235400  | NULL                       |
| 1137 | forecast_weekly | 2023-03-26 9:45:11.695478  | 2023-03-26 9:45:20.502400  |
Handling Pages & Errors
Processing a page is quite simple, but it is wrapped with lots of error handling so we can accurately record errors for future reference. First, we attempt to fetch the page and handle any network errors we might encounter. Then, we attempt to scrape the HTML page, again handling any errors that come up. Finally, we pass the scraping_result back to process_session_mapping so we can save the page's raw data and then run the process function of the Session this page belongs to.
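A sketch of that flow, assuming an async HTTP client like httpx; the error-handler signature, BASE_URL, and the description codes other than TIMEOUT are my illustrative guesses:

```python
import httpx

async def process_page_mapping(page: Page) -> ScrapeResult:
    # Step 1: fetch the page, translating network failures into recorded errors.
    try:
        async with httpx.AsyncClient(base_url=BASE_URL, timeout=30) as client:
            response = await client.get(page.path)
            response.raise_for_status()
    except httpx.TimeoutException as exc:
        handle_processing_page_mapping_error(page, "TIMEOUT", exc)
    except httpx.HTTPError as exc:
        handle_processing_page_mapping_error(page, "FETCH_ERROR", exc)

    # Step 2: scrape the HTML, catching anything the fragile parsing throws.
    try:
        scraping_result = page.process(response.text)
    except Exception as exc:
        handle_processing_page_mapping_error(page, "SCRAPE_ERROR", exc,
                                             html=response.text)

    # Step 3: hand the result back to process_session_mapping.
    # (The error handler re-raises, so we only get here on success.)
    return scraping_result
```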
In the event of an error, our handle_processing_page_mapping_error function records what went wrong for later review, including saving any HTML files that failed to be scraped, and then the error is re-raised so the database transaction in process_session_mapping rolls back. I've also added logic to stack repeated errors in the count column so repeated errors don't need multiple rows.
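The stacking itself is essentially an upsert. A rough sketch of the idea (the PageError model is a stand-in, and I'm assuming the error row is committed on its own session so it survives the rolled-back scraping transaction):

```python
from datetime import datetime, timezone

def record_page_error(error_db, page_name: str, description: str) -> None:
    # If an identical error was already recorded for this page, bump its
    # count and `updated` timestamp instead of inserting a duplicate row.
    existing = (
        error_db.query(PageError)
        .filter_by(page=page_name, description=description)
        .one_or_none()
    )
    if existing is not None:
        existing.count += 1
        existing.updated = datetime.now(timezone.utc)
    else:
        error_db.add(PageError(page=page_name, description=description, count=1))
    error_db.commit()  # committed separately from the scraping transaction
```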
| ID | created                    | updated                   | page          | description | exception | html_hash | json_data | errors | count |
|----|----------------------------|---------------------------|---------------|-------------|-----------|-----------|-----------|--------|-------|
| 2  | 2023-03-24 07:00:08.811612 | 2023-03-26 8:00:11.226044 | forecast_week | TIMEOUT     | NULL      | NULL      | NULL      | NULL   | 3     |
Scraping and Validating Data
Lastly, we reach the real work being done. Each web page has a purpose-built function to extract the data from the HTML. The functions are full of hardcoded strings, fragile tree traversals, and excessive use of a function I called strip_html_text, which just does .strip().replace("\n", " ").replace("\t", "").replace("\xa0", ""). If the VMGD pages were to change, this is where code would need to be rewritten.
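For reference, that helper is nothing more than:

```python
def strip_html_text(text: str) -> str:
    # Normalize the messy whitespace and non-breaking spaces in VMGD's HTML.
    return text.strip().replace("\n", " ").replace("\t", "").replace("\xa0", "")
```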
Our ScrapeResult data should have a predictable format/schema, but HTML scraping can introduce subtle errors if you aren't careful. To solve this problem, I use Cerberus to define the structure and types of the data I expect.
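For example, a trimmed-down schema and validation step might look like this (the field names are illustrative, not the real forecast schema):

```python
from cerberus import Validator

forecast_schema = {
    "location": {"type": "string", "required": True},
    "date": {"type": "string", "required": True},
    "min_temp": {"type": "integer", "required": True},
    "max_temp": {"type": "integer", "required": True},
}

validator = Validator(forecast_schema)
scraped = {"location": "Port Vila", "date": "2023-03-26", "min_temp": 23, "max_temp": 30}

if not validator.validate(scraped):
    # A subtle scraping mistake surfaces here as a schema error
    # instead of silently ending up in the database.
    raise ValueError(validator.errors)
```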
After this step, the session-handling function runs the Session's process function to write the results to the database.
Conclusion
At the time of writing, I have been collecting data for the past four months without error, building up a nice little history of weather data.
It took a bit of time to lay out the right abstractions, but in the end I believe they were the right ones for this purpose-built web scraper. Next time, I would like to look at the big web-scraping frameworks for some other project ideas I have, to see how they compare and how much better I could make my own scraper.