How it was done – Nigerian postcode data extraction

In: Open Data|Programming

25 Jul 2010

In my previous post, I talked about open data and how making the Nigerian postcode data open and more accessible has wide potential for powering many applications. I’ve received several comments on Facebook with even more examples of how that data could be useful.

In this post, I will share how this extraction was done and how similar extraction scripts or scrapers can be written.

The first step in every scraping project I begin is to understand the HTTP dialog for the website I want to scrape. So I attempt to answer questions like these:

  1. Does the application need me to log in?
  2. Is it sensitive to certain HTTP features like cookies or referrers?
  3. What URLs do I access to view the content I want to extract?
  4. What variables can I set to change the view and specify what I want?

You can answer these questions by using tools that let you view this dialog. I personally like to use Firebug for this task.
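To illustrate how the answers to these questions end up in code, here is a minimal sketch that builds a request carrying cookies, a Referer header and query variables. It uses Python 3's standard library, not the urllib2 module used for the actual extraction. The URL, the variable names (`state`, `page`) and the helper names `make_opener` and `build_request` are hypothetical and are not taken from the postal service site.

```python
import urllib.parse
import urllib.request
from http.cookiejar import CookieJar

# Hypothetical URL: replace it with what you observed in the HTTP dialog.
BASE_URL = "http://example.com/postcodes/lookup"


def make_opener():
    """Return an opener that stores and resends cookies (question 2)."""
    jar = CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def build_request(base_url, variables, referer=None):
    """Build a GET request whose query string carries the view variables
    (questions 3 and 4)."""
    query = urllib.parse.urlencode(variables)
    request = urllib.request.Request(base_url + "?" + query)
    if referer:
        # Some sites reject requests that lack the expected Referer header.
        request.add_header("Referer", referer)
    return request


if __name__ == "__main__":
    opener, jar = make_opener()
    request = build_request(BASE_URL, {"state": "Lagos", "page": 1}, referer=BASE_URL)
    print(request.full_url)
    # opener.open(request) would send it; omitted so the sketch runs offline.
```

If the site requires a login, the same opener can first be used to submit the login form, so that the session cookie it receives is sent with every later request.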

HTTP Dialog


After you’ve determined the HTTP dialog, you can then write your script to do the extraction. You can write scrapers in any language, provided it can retrieve HTTP resources and parse HTML. The parsing aspect of a scraper is usually the most interesting part because a lot of parsing libraries choke when they encounter badly formed HTML.
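To give a feel for what tolerating badly formed HTML means, here is a small sketch that pulls table cells out of markup with missing closing tags. It uses the standard library's `html.parser` so that it runs without extra packages, and it is only an illustration of the idea. The class name `CellCollector`, the function `extract_rows` and the sample markup are all made up.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect the text of table cells, row by row, tolerating unclosed tags."""

    def __init__(self):
        super().__init__()
        self.rows = []     # completed rows
        self._row = None   # cells of the row being read
        self._cell = None  # text fragments of the cell being read

    def _close_cell(self):
        if self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
        self._cell = None

    def _close_row(self):
        self._close_cell()
        if self._row:
            self.rows.append(self._row)
        self._row = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._close_row()   # a new <tr> implicitly ends the previous row
            self._row = []
        elif tag in ("td", "th"):
            self._close_cell()  # a new cell implicitly ends the previous one
            if self._row is None:
                self._row = []
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._close_cell()
        elif tag in ("tr", "table"):
            self._close_row()

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def close(self):
        super().close()
        self._close_row()       # flush a row left open at end of input


def extract_rows(html):
    parser = CellCollector()
    parser.feed(html)
    parser.close()
    return parser.rows


# Made-up sample with missing </td> and </tr> tags.
SAMPLE = "<table><tr><td>Alpha<td>000001<tr><td>Beta</td><td>000002</table>"

if __name__ == "__main__":
    for row in extract_rows(SAMPLE):
        print(row)
```

The approach here is to treat every new `<tr>` or cell tag as an implicit close of the previous one, which is roughly what a browser does when it repairs such markup.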

For this extraction, I used BeautifulSoup for parsing the HTML and Python’s urllib2 for the HTTP communication.

The code is available on GitHub as a gist. It changes as more functionality is added, but you can view the gist's revision log to see the history of changes.

About this blog

Tim Akinbo's Weblog is the personal weblog of Tim Akinbo. Here he discusses issues relating to technology. Special interests include the web, mobile technology and location based services.
