25 Jul 2010

In my previous post, I talked about open data and how making the Nigerian postcode data open and more accessible has wide potential for powering several applications. I’ve received several comments on Facebook with even more examples of how that data could be useful.
In this post, I will share how this extraction was done and how similar extraction scripts or scrapers could be written.
The first step in every scraping project I begin is to understand the HTTP dialog for the website I want to scrape. So I attempt to answer questions like these:
You can answer these questions with tools that let you view this dialog. I personally like to use Firebug for this task.
After you’ve determined the HTTP dialog, you can then write your script to do the extraction. You can write scrapers in any language, provided it supports retrieving HTTP resources and parsing HTML. The parsing aspect of a scraper is usually the most interesting part because a lot of parsing libraries choke when they encounter badly formed HTML.
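Once a tool like Firebug shows what the browser sends, the same request can be rebuilt in a script. Here is a minimal sketch, assuming Python 3's standard library rather than urllib2. The URL and the form field names (`state`, `town`) are hypothetical placeholders, not the ones the postal service site uses, so substitute whatever the HTTP dialog actually shows.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint; replace with the URL seen in the HTTP dialog.
SEARCH_URL = "http://example.com/postcode/search"


def build_search_request(state, town):
    """Build (but do not send) a form POST like the one a browser would submit."""
    # Hypothetical field names; use the ones the real form posts.
    form = urlencode({"state": state, "town": town}).encode("ascii")
    headers = {
        "Content-Type": "application/x-www-form-urlencoded",
        "User-Agent": "Mozilla/5.0 (compatible; example-scraper)",
    }
    # Supplying a body makes urllib treat this as a POST request.
    return Request(SEARCH_URL, data=form, headers=headers)


if __name__ == "__main__":
    req = build_search_request("Lagos", "Ikeja")
    print(req.get_method(), req.full_url)
    print(req.data.decode("ascii"))
```

Passing the returned object to `urllib.request.urlopen` would actually send it. Building the request separately makes it easy to compare against what the browser sent before hitting the site.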
For this extraction, I used BeautifulSoup for parsing the HTML and Python’s urllib2 for the HTTP communication.
The code is available on GitHub and although it changes as more functionality is added, you can view the revision log of the gist to see the history of changes.
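Separately from that gist, the parsing step can be illustrated with a small sketch. It is not the gist's code: it swaps BeautifulSoup for Python 3's built-in `html.parser` so it runs without extra installs, and the class name, helper names and sample rows are all made up for illustration. It shows one way to pull rows out of a table whose closing tags are missing, which is the kind of badly formed HTML that trips up stricter parsers.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of each table row's cells, tolerating missing close tags."""

    def __init__(self):
        super().__init__()
        self.rows = []     # completed rows, each a list of cell strings
        self._row = None   # cells of the row being read, or None outside a row
        self._cell = None  # text fragments of the cell being read, or None

    def _end_cell(self):
        if self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None

    def _end_row(self):
        self._end_cell()
        if self._row:
            self.rows.append(self._row)
        self._row = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._end_row()   # a new <tr> implicitly closes the previous row
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._end_cell()  # a new cell implicitly closes the previous cell
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._end_cell()
        elif tag in ("tr", "table"):
            self._end_row()

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)


def parse_rows(html):
    """Return the table rows found in html as lists of stripped cell text."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    parser._end_row()  # flush a row left open at the end of the input
    return parser.rows


# Made-up sample with unclosed <tr>, <th> and <td> tags.
SAMPLE = """
<table>
<tr><th>Area<th>Postcode
<tr><td>Area A<td>000001</td>
<tr><td>Area B <td> 000002
</table>
"""

if __name__ == "__main__":
    for row in parse_rows(SAMPLE):
        print(row)
```

The trick is to treat every new `<tr>` or `<td>` as closing whatever was open before it, so the parser never depends on the page's closing tags being present.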
Tim Akinbo's Weblog is the personal weblog of Tim Akinbo. Here he discusses issues relating to technology. Special interests include the web, mobile technology and location based services.