Data Deconstructed: Which Production Company won the most Oscars?

This was a fun weekend experiment – more about getting the data than about the deconstruction itself.

The question I’m going to cover is straightforward (and completely random): since the Academy Awards started in the late 1920s, which production company has won the most Oscars?

If you work with public data sets like I do, you’ll quickly realize that most data comes in formats that aren’t ready to be used with visual analytics tools, spreadsheet software, or stats packages. Working with different data formats and pulling data through APIs requires some basic coding skills (unfortunately for a lot of analysts).

For a while I’ve been trying to avoid the programming route – it really isn’t my natural ability. But I’ve reached the point where it’s no longer worth not knowing how to write some basic code, for two reasons: a) the flexibility of working with different formats, and more importantly b) the ability to create your own data sets by pulling data directly from sites and APIs – that’s exciting!

This week, after unsuccessfully trying to find an interesting data set in a workable format, I decided to create my own. I started Googling random things around movies, found the Wikipedia page listing all the Best Picture Academy Award winners, and came up with a straightforward question.

Data sourcing and cleaning:

Not a lot of people talk about data sourcing and cleaning, probably because it’s the most tedious part of the analytics process. Today’s problem would actually be a piece of cake if I had the data in front of me in the right format. In future blog posts, I’ll tinker with a few different data formats.

Tools used: The programming language I decided to use was Python. The only reason was that I had played with the basics before (though I eventually realized I had mostly forgotten them), so I figured it was better to go with the one I had a minuscule head start in.

Note: My Python and programming knowledge going into this was extremely basic. I had read Learn Python the Hard Way with the sole intention of creating random patterns in NodeBox, which I did for a couple of months. So even if you’ve never touched code before, I probably didn’t have much of a head start over you.

Step 1 – Getting the data:

Sounds easy, right? It’s actually not that bad.

When I started looking into how to scrape a web page, I found the one thing that may encourage you to learn some coding, if you’re thinking about getting into it: most of the hard work has already been done. Amazing people have written tons of useful stuff, and I quickly realized the skills that are really useful and time-saving are a) figuring out whether someone’s code is useful by reading the documentation, and b) importing it and using it.

The library I worked with for scraping this page is Beautiful Soup, which can parse HTML and XML. It’s amazing – with a few lines of code you can get to practically any data element on a web page. The only bridge you have to cross is figuring out how to use Beautiful Soup so it works for you. That’s it.
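To make that concrete, here is a minimal sketch of the idea. Note that it uses the modern bs4 package name (my script used the older BeautifulSoup 3 import), and the table fragment is made up, not the actual Wikipedia markup:

```python
from bs4 import BeautifulSoup

# A made-up fragment standing in for the Wikipedia table.
html = """<table>
<tr><td>Wings</td><td>Paramount</td><td>1927/28</td></tr>
<tr><td>The Broadway Melody</td><td>MGM</td><td>1928/29</td></tr>
</table>"""

soup = BeautifulSoup(html, "html.parser")
# Pull the text out of every <td> cell, in document order.
cells = [td.get_text() for td in soup.find_all("td")]
print(cells)
```

Once you have every cell as plain text in a flat list, the rest is ordinary list manipulation.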

Notes on the screen scraping process:

–  You first need to understand the basic HTML structure of the page and of the data you want – in my case, a table. An easy way to understand page structure is Firebug: click Inspect Element and you can see the HTML associated with each element.

– What you’re looking for is how that element (or similar elements) can be uniquely identified – it may be a tag or an attribute. Since my target was a table, the first thing I did was group all the elements with the <td> tag.

–  Find patterns: Since I only wanted the data in the Production Company column, I needed a pattern for reaching it. The pattern turned out to be simple – the production company is the 2nd, 5th, 8th, 11th… cell of the table, an arithmetic progression with a common difference of 3. The only caveat is that more tables follow after the actual data ends, so you need a condition for where the loop should stop.
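The column-picking step above is just list slicing once the cells are in a flat list. A sketch with placeholder values, assuming the company sits in every third cell starting with the 2nd:

```python
# Flat list of <td> text in document order (placeholder values).
cells = ["Wings", "Paramount", "1927/28",
         "The Broadway Melody", "MGM", "1928/29",
         "All Quiet on the Western Front", "Universal", "1929/30"]

# The 2nd, 5th, 8th... cell in 1-based terms is index 1 with
# step 3 in Python's 0-based slicing.
companies = cells[1::3]
print(companies)
```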

– Scraping Wikipedia pages also requires passing a User-Agent header with the request (otherwise you get a 403 error).

Here are the first few lines of my script, which address this:

import urllib2
from BeautifulSoup import BeautifulSoup

# Wikipedia returns a 403 error unless a User-Agent header is passed.
req = urllib2.Request("", headers={'User-Agent': "Magic Browser"})
con = urllib2.urlopen(req)

I scraped the data elements and wrote them to a .txt file. The magic of Python and coding: fetching these 484 elements and creating the file takes a fraction of a second. That’s so much time saved once you’re familiar with it.
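The write-out step is just a loop over the list. A minimal sketch (the file name and sample values are my own placeholders):

```python
# One scraped value per line (placeholder names).
companies = ["Paramount", "MGM", "Universal"]

with open("companies.txt", "w") as f:
    for name in companies:
        f.write(name + "\n")
```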

Also, there is nothing more satisfying than finally getting a script to work!

Data Cleaning:

The data you get out of Beautiful Soup will need some slight additional cleaning – you’ll see stray <b> tags around some elements. I could probably have cleaned them out with more refined code, but I opted for a cleaning tool instead: Google Refine. You can build filters, edit cells, and apply those edits to the entire data set.
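For what it’s worth, those stray tags could also be stripped in Python itself. A sketch using the standard re module (the sample values are made up):

```python
import re

raw = ["<b>Paramount</b>", "MGM", "<b>Universal</b>"]
# Remove opening and closing <b> tags, leaving the text intact.
cleaned = [re.sub(r"</?b>", "", cell) for cell in raw]
print(cleaned)
```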

The answer:

The answer to the question is straightforward once you have the data. Instead of binning and categorizing it in a bar chart, I decided to opt for a word cloud. Here’s what it looks like:

Over roughly nine decades, it’s really just five production houses that have dominated the Oscars. I’m not actually surprised by this.
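For the record, the tally itself is one line of code once the names are in a list. A sketch with collections.Counter – the names and counts here are placeholders, not the real totals:

```python
from collections import Counter

# Placeholder list standing in for the 484 scraped values.
companies = ["MGM", "Paramount", "MGM", "Columbia", "MGM", "Paramount"]

counts = Counter(companies)
print(counts.most_common(3))  # [('MGM', 3), ('Paramount', 2), ('Columbia', 1)]
```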

Anyway, fun experiment – I’ll be back next week with another data set, this time in another format!

