Data Deconstructed: Which Production Company Won the Most Oscars?

This was a fun weekend experiment, more about getting the data than deconstructing it.

The question I’m going to cover is straightforward (and completely random): since the Academy Awards started in the 1920s, which production company has won the most Oscars?

If you’re working with public data sets like I am, you’ll realize that most data is available in formats that aren’t ready to be used with most visual analytics, spreadsheet or stats tools. Working with different data formats and pulling data through APIs requires some basic coding skills (unfortunately for a lot of analysts).

For a while I’ve been trying to avoid the programming route; it really isn’t my natural ability. I think I’ve come to a point where it’s no longer worth not knowing how to write some basic code, for two reasons: a) the flexibility of working with different formats and, more importantly, b) the ability to create your own data sets, through APIs or by pulling data directly from sites. That’s exciting!

This week, after unsuccessfully trying to find an interesting data set in a workable format, I decided to create my own. I started Googling random stuff around movies, found this page on Wikipedia listing all the Best Picture Academy Award winners, and came up with a straightforward question.

Data Sourcing and data cleaning:

Not a lot of people talk about data cleaning and sourcing because it’s probably the most tedious part of the analytics process. The problem I have today would actually be a piece of cake if I had the data in front of me in the right format. In future blog posts, we will tinker with a few different data formats.

Tools used: The programming language I decided to use was Python. The only reason behind this was that I had played with some basics before (I eventually realized I had mostly forgotten them), so I figured it was probably better to go with the one I had a minuscule head start in.

Note: My Python and programming knowledge going into this was extremely basic. I have read Learn Python the Hard Way with the sole intention of creating random patterns on NodeBox, which I did for a couple of months. So even if you’ve never touched code before, I probably didn’t have much of a head start over you.

Step 1 – Getting the data:

Sounds easy, right? It’s actually not that bad.

When I started looking into how to scrape a web page, I found one thing that may encourage you to learn some coding, if you’re thinking about getting into it: most of the hard work has already been done. Amazing people have written tons of useful stuff already. I quickly realized that the skills which are really useful and time saving are a) figuring out whether someone’s code is useful by reading some documentation and b) importing it and using it.

The library I worked with for scraping this page is Beautiful Soup, which can parse HTML and XML. It’s amazing: a few lines of code and you can probably get to any data element on a web page! Here’s the bridge you have to cross: figuring out how to use Beautiful Soup so you can make it work for you. That’s it.

Notes on the screen scraping process:

– You first need to understand the basic HTML structure of the page and the data you want. In my case it’s a table. An easy way to understand page structure is to use Firebug: click on Inspect Element and you can see the HTML associated with each element.

– What you’re looking for is how that element or similar elements can be uniquely identified; it may be a tag or an attribute. Since in my case it’s a table, what I did first was group all the elements with the <td> tag.

– Find patterns: Since I was looking for data in the Production Company column only, I needed to find a pattern to get to it. What I did was simple: the production company is the 2nd, 5th, 8th, 11th … element of the table, which is a simple arithmetic progression. The only caveat here is that there are more tables after the actual data ends, so you need to put a condition on where the loop should end.

– Scraping Wikipedia pages also requires passing a User-Agent header with the request (otherwise it will give you a 403 error).
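The arithmetic-progression pattern can be sketched with a plain list standing in for the scraped <td> cells. This is only an illustration: the cell values, the three-column layout and the `DATA_END` index are made up, and `pick_column` is a hypothetical helper, not part of Beautiful Soup or of my script.

```python
# Hypothetical flattened <td> cells: three columns per row
# (film, production company, producers), followed by cells that
# belong to an unrelated table further down the page.
cells = [
    "Film A", "Studio X", "Producer 1",
    "Film B", "Studio Y", "Producer 2",
    "Film C", "Studio X", "Producer 3",
    "Unrelated 1", "Unrelated 2", "Unrelated 3",
]

DATA_END = 9  # assumed index of the first cell after the real data


def pick_column(cells, start, step, stop):
    """Return every `step`-th cell from `start`, ignoring cells at or after `stop`."""
    return cells[start:stop:step]


# The 2nd, 5th, 8th ... cells are indexes 1, 4, 7 ... when counting from zero.
companies = pick_column(cells, start=1, step=3, stop=DATA_END)
print(companies)
```

The slice does the same job as a loop with a stop condition: it starts at the second cell, jumps three cells at a time, and never reads past the end of the real data.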

Here are the first few lines of my script, which take care of the headers:

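import urllib2  # Python 2 standard library
from BeautifulSoup import BeautifulSoup  # Beautiful Soup 3

# Wikipedia returns a 403 unless the request carries a User-Agent header
req = urllib2.Request("http://en.wikipedia.org/wiki/Academy_Award_for_Best_Picture", headers={"User-Agent": "Magic Browser"})
con = urllib2.urlopen(req)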

I scraped the data elements and wrote them to a .txt file. The magic of Python and coding: it takes a fraction of a second to pull these 484 elements and create the file. That’s so much time saved once you’re familiar with it.

Also, there is nothing more satisfying than finally getting a script to work!
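The urllib2 module only exists in Python 2. If you are on Python 3, the same request can probably be sketched with urllib.request, roughly as below. The function names `build_request` and `fetch_html` are my own for illustration, and the parsing step (Beautiful Soup, now packaged as bs4) is left out.

```python
import urllib.request

URL = "http://en.wikipedia.org/wiki/Academy_Award_for_Best_Picture"


def build_request(url, user_agent="Magic Browser"):
    """Build a request that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_html(url):
    """Download the page and return its raw bytes (needs network access)."""
    with urllib.request.urlopen(build_request(url)) as response:
        return response.read()


# Building the request does not touch the network, so it is safe to inspect.
req = build_request(URL)
print(req.get_header("User-agent"))  # urllib stores the header name capitalized this way
```

Calling fetch_html(URL) performs the actual download; the bytes it returns are what you would hand to the parser.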

Data Cleaning:

The data you get from Beautiful Soup will require some slight additional cleaning. You’ll see <b> tags around some elements. I could probably have cleaned them out with more refined code, but decided to opt for a cleaning tool instead: Google Refine. You can form filters, edit cells and apply those edits to the entire data set.
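For what it’s worth, the “more refined code” route could look something like the sketch below. It is not what I did (I used Google Refine), the sample values are made up, and a regular expression is only a reasonable shortcut here because the stray markup is a simple, known tag.

```python
import re

# Made-up scraped values, some still wrapped in <b> tags.
raw_values = ["<b>Studio X</b>", "Studio Y", " <b>Studio X</b> "]


def strip_bold(value):
    """Remove <b> and </b> tags, then trim surrounding whitespace."""
    return re.sub(r"</?b>", "", value).strip()


cleaned = [strip_bold(value) for value in raw_values]
print(cleaned)
```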

The answer:

The answer to the question is straightforward once you have the data. Instead of binning and categorizing it in a bar chart, I opted for a word cloud. Here’s what it looks like:

Over the last 90 years – it’s really just 5 production houses which have dominated the Oscars. Not surprised by this, actually.
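If you would rather see numbers than a word cloud, the tally itself can be sketched with collections.Counter from the standard library. The company names below are placeholders, not the real results.

```python
from collections import Counter

# Placeholder list of cleaned production-company names, one per winning film.
companies = ["Studio X", "Studio Y", "Studio X", "Studio Z", "Studio X", "Studio Y"]

counts = Counter(companies)

# most_common(n) returns the n most frequent names with their counts.
for name, wins in counts.most_common(2):
    print(name, wins)
```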

Anyway, fun experiment. I’ll be back next week with another data set, this time in another format!

 
