In my previous post, I talked about open data and how making the Nigerian postcode data open and more accessible has a wide potential for powering several applications. I’ve received several comments on Facebook with even more examples on how that data could be useful.
In this post, I will share how this extraction was done and how you can write similar extraction scripts, or scrapers.
The first step in every scraping project I begin is to understand the HTTP dialog of the website I want to scrape, so I start by answering questions like these:
- Does the application need me to log in?
- Is it sensitive to certain HTTP features like cookies or referrers?
- What URLs do I access to view the content I want to extract?
- What variables can I set to change the view and specify what I want?
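Once answered, these questions translate almost directly into the request your script will make. Here is a minimal sketch using Python's standard library; the URL, query parameters, and header values are hypothetical placeholders, not the actual endpoints used for the postcode extraction:

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical content URL and the variables that change the view
base = "http://example.com/postcodes"
params = {"state": "Lagos", "page": 1}
url = base + "?" + urlencode(params)

# Attach the HTTP features the site turned out to be sensitive to,
# e.g. a referrer header and a session cookie (placeholder values)
req = Request(url, headers={
    "Referer": "http://example.com/search",
    "Cookie": "session=abc123",
})
print(req.full_url)  # http://example.com/postcodes?state=Lagos&page=1
```

Passing `req` to `urllib.request.urlopen` would then fetch the page with exactly those parameters and headers.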
You can answer these questions with tools that let you inspect this dialog. I personally like to use Firebug for the task.
After you’ve determined the HTTP dialog, you can then write your script to do the extraction. You can write scrapers in any language, provided it can retrieve HTTP resources and parse HTML. The parsing aspect of a scraper is usually the most interesting part, because a lot of parsing libraries choke when they encounter badly formed HTML.
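One way to stay tolerant of broken markup, sketched here with Python's standard library, is an event-driven parser: since it fires callbacks on each tag and text chunk rather than insisting on a well-balanced tree, unclosed tags don't make it choke. The sample markup and the `CellExtractor` class are illustrative, not the actual extraction code:

```python
from html.parser import HTMLParser

# Badly formed HTML: unclosed <td> and <tr> tags, as often found in the wild
MESSY = "<table><tr><td>900001<td>Garki<tr><td>900002<td>Wuse"

class CellExtractor(HTMLParser):
    """Collects the text inside <td> cells without requiring end tags."""
    def __init__(self):
        super().__init__()
        self.in_td = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        # A new <td> (or any other tag) implicitly ends the cell being read
        self.in_td = (tag == "td")

    def handle_data(self, data):
        if self.in_td and data.strip():
            self.cells.append(data.strip())

parser = CellExtractor()
parser.feed(MESSY)
print(parser.cells)  # ['900001', 'Garki', '900002', 'Wuse']
```

Libraries like BeautifulSoup take the same forgiving approach while also building a queryable tree, which is why they are a common choice for scrapers.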
The code is available on GitHub, and although it changes as more functionality is added, you can view the revision log of the gist to see the history of changes.