Data Deconstructed – Exploring Farmers Markets

So it’s been a while since I’ve done my data posts. I was hunting for some real world data to work through ggplot2 (a graphics package in R) and I found an interesting one on Farmers Markets in the US. Figured this might make for a more fun post than an R tutorial. I need to do a ggplot2 tutorial when I get a better grasp – it’s amazing so far.

What interested me was investigating a correlation of being healthy with the number of farmer markets in the area.

About the Data set – This data set has about 7000 farmer markets in the US with the address, zip code, state, latitude and longitude where the farmer market is located.

Tools used – R

Side Note: How do you decide which tool you are going to use for data analysis? My answer is simple – whatever gets the job done. What I’m finding with using a programming language is that it’s helpful  because of the re-usability and the ease with which you can do certain things. It’s just a preference.  What causes great analysis is the ask from the data set, what questions we are asking from it. Everything is secondary if you aren’t asking the right questions to start with.

1. Let’s start with finding out which states have the most farmer markets.

Process –  I’m choosing to talk about the process here because the data set on inspection shows lots of missing values in zip code. Wherever the zip code is missing the latitude and longitude is put in as zero.  Data cleaning may not be the most fun thing in the world – which is why it’s not talked about enough. But in most data sets there is a considerable time spent on it.

Here, the actual state names show no missing values so I chose to start with that. However when I plotted it I found a “none” column showing up, the state Massachusetts in three different bars with different spellings. A small highlight as to an advantage of using R/a programming languauge  here – the code is reusable  when I’m doing a back and forth with data cleaning.

Post cleaning and plotting – here’s what the distribution of states looks like –

So the results are clear – the top 3 are 1. California 2. New York and 3. Michigan

Now if I’m trying to make a correlation between a healthy state and the number of farmer markets – is it fair to say that the state with the most number of farmer markets are the most healthy? Not really – there may be a state with a much smaller population – so comparing health of states with just numbers of farmer markets doesn’t really make sense.

Let’s normalize the data – into finding what’s the density of farmer markets in every state.

Side Note: Normalize, normalize , normalize. It’s important to remove the underlying variables which may have an effect of what we’re trying to compare.

So I take some external data of US populations by state and then divide the number of farmer markets by state. This gives me the population in each state which on an average has at least one farmers markets.

Let’s see what we find –

There –  we see normalizing the data completely flips the results. California goes somewhere down the list, Texas has the least density of farmers markets.

Can we establish a correlation between either the density and the health of a state? We’ll find out next time.




Data Deconstructed : Which Production Company won the most Oscars?

This was a fun weekend experiment- more in getting the data rather than data deconstruction.

The question I’m going to cover is straightforward (and completely random): Since the Academy Awards started in 1920’s – which production company won the most Oscars?

If you’re working with public data sets like I am you will realize that mostly data is available in formats – which isn’t ready to be used with most visual analytics / spreadsheet software /stats tools. Working with different data formats and getting them through API’s requires some basic coding skills (unfortunately for a lot of analysts).

For a while I’ve been trying to avoid the programming route – it really isn’t my natural ability.  I think I came to a point where it’s not worth not know how to write some basic code anymore. For two reasons a) The flexibility of working with different formats and more importantly b) The ability to create your own data sets – through API’s/ pulling data directly through sites – that’s exciting!

This week after unfortunately trying to find an interesting data set in a workable format – I decided to create my own. I started looking Googling random stuff around movies and found this page on Wikipedia on a list of all best movie academy award winners and came up with a straightforward question.

Data Sourcing and data cleaning:

Not a lot of people talk about data cleaning and sourcing because it’s probably the most tedious part of the analytics process. The problem I have today is actually a piece of cake if I had the data in front of me in the right format. In future, I will explore a few blog posts where we will tinker with different data formats.

Tools used: The programming language I decided to use was Python. The only reason behind this was that I had played with some basics before (I eventually realized I had mostly forgotten them) so I figured it was probably better to go with the one you had a miniscule head start in.

Note: My python and programming knowledge going into this was extremely basic. I have read Learn Python the Hard way with the sole intention of creating random patterns on Nodebox – which I did for a couple of months. So even if you’ve never touched code before, I probably didn’t have much a of headstart over you.

Step 1 – Getting the data:

Sounds easy, right? It’s actually not that bad.

When I started looking into how I’ll scrape a web page here’s one thing that may encourage you to learn some coding,if you’re thinking about getting into it – Most of the hard work has actually been done.  Amazing people have written tons of useful stuff already – I quickly realized the skills which are really useful and time saving are a)Figuring out whether someone’s code is useful by reading some documentation and b) Importing it and using it.

The library I worked with for scraping this page is Beautiful Soup, it can parse html and xml. It’s amazing – a few lines of code and you can probably get to any data element on a web page! Here’s the bridge you have to cross – figuring out how to use Beautiful Soup so you can make it work for you. That’s it.

Notes on the screen scraping process:

–  You first need to understand basic html structure of the page and the data you want. In my case it’s a table. An easy way to understand page structure is to use Firebug, click on Inspect Element and you can see the html associated with each element.

– What you’re looking for is how that element or similar elements can be uniquely identified – it may be a tag or an attribute. Since in my case it’s a table what I did was first grouped all the elements with the <td> tag.

–  Find patterns: Since I was looking for data in the Production Company column only, I needed to find a pattern to see how I get to it. What I did was simple – since the production company is the 2nd, 5th, 8th, 11th …. element of the table -a simple arithmetic progression. The only caveat here is that there are more tables after the actual data ends- so you need to put a condition on where the loop should end at.

– Scraping Wikipedia pages will also require an additional condition of passing these headers (otherwise it will give you a 403 Error)

Here’s my first few lines of the script which address this:

import urllib2
from BeautifulSoup import BeautifulSoup
req = urllib2.Request(“”, headers = {‘User-Agent’ : “Magic Browser”})
con = urllib2.urlopen(req)

I scraped the data element and wrote it in a .txt file. The magic of Python and coding: to get these 484 elements it takes a fraction of a second to create this file. That’s so much time saved once you’re familiar with it.

Also there is nothing more satisfying than finally getting a script to finally work!

Data Cleaning:

The data you get from Beautiful Soup will require some slight additional cleaning. You’ll see some <b> tags around some elements. I could have probably cleaned it out with more refined code – but decided to opt for a cleaning tool instead: Google refine. You can form filters, edit cells and apply those edits to the entire data set.

The answer:

The answer to the question is straightforward once you have the data. Instead of binning it and categorizing it using a bar chart, I decided to opt for a word cloud instead.Here’s what it looks like :

Over the last 90 years – it’s really just 5 production houses which have dominated the Oscars. Not surprised by this, actually.

Anyway fun experiment – will be back next week with another data set, this time in another format!


Data Deconstructed – Physician Payments

Finally back with a data post after a long time! I moved to San Francisco to join the team – took me almost a month to find my way around this city.

The data for today’s post is super interesting – it’s only available by request from the Medical Data Informatics association. Thanks to Amaresh Tripathy for introducing this to me!

About the data set:  This data is described by the website as follows –

The  “Physician Payments Sunshine Provision” of the Health Care and Education Reconciliation Act of 2010 requires that medical device, pharmaceutical and biotech companies begin reporting gifts or payments  starting  January 1, 2012 with detailed specifications as to how payment data is to be submitted made available October 1st 2011.

Wow – so, essentially all the payments made by pharmaceutical companies and manufacturing device manufacturing companies  to physicians. As gifts or otherwise.

The name of the physician or organization to which the payment was made was not disclosed by the Medical Data Informatics to me. It’s available. Also more details of the data set and tools used at the end!


Let’s formulate a set of questions which I’m interested to find out. The analysis today is centered  around exploration.

Note: The more interesting the data set is – the more interesting exploration is. If a data set is generic it can be made more interesting by coupling it with another data set.

1. Which pharmaceutical and medical device manufacturing companies make out the most payments?

2. Which companies spend the most per “gift” or payment?

3. As a physician which state will give you a higher probability of receiving a payment from these companies?

The answers:

1. Which pharmaceutical companies make out the most payments?

To answer this question I want to reduce the size of the raw data set. The way I will do this is break it down to large categories (types of payments, types of companies, state) and then convert it into smaller data sets.

Here is the answer to the first question:

As you see this is a sorted distribution of the 22 companies. The answers to this one –

Nearly 2 billion dollars in payments were made out to physicians by these 22 companies in those 4 years. That’s incredible. The leading companies are.

1. Zimmer – $750 million

2. Pfizer – $250 million

3. Eli Lilly – $160 million

Let’s not stop here but try and understand how the payments by these three payments are reported.

Looking into Zimmer most of the larger amounts are just not disclosed. Only the smaller amounts – less than a thousand dollars have the type classified- lodging, meals , gifts. Not much insight there.

In Pfizer – the larger payments are in a category called Pfizer sponsored research. Looks like (of course, contingent on them telling the truth) that Pfizer is spending a lot of it’s resources on research – hopefully a good thing.

In Eli Lily – most of the larger amounts are in a category called Healthcare Education. Again, hopefully a good thing.

A quick note on the display here- none of the charts today are interactive. The reason being I didn’t have access to a data visualization software which supports interactivity of really large data sets – to be published publicly. If you want to go down the interactive route here the best way is to use one of the JS frameworks.

2. Which companies spend the most per “gift” or payment?

I could have alternatively worded this as “As a physician which company will you be making the most out per payment you get”!

As you can see here are the top 3:

1. Zimmer – Almost 300k on an average per payment!

2. Medtronic – Over 100k per payment on an average.

3. Stryker-  Nearly 100k per payment on an average.

More interesting stuff- These answers are interesting but we can drill down into even more interesting answers here. The first is from the caveat: these are average payments and by itself don’t mean much. Hypothetically, there can be lots of $1 payments or some really high ones. And from what I’ve seen practically – that’s usually the case. It’s always a lot more more interesting to drill into average numbers.

Exploring each category a bit more I find the highest payment ever made in this data set was by Zimmer – a sum of 18 million over six months in 2007! Wow, that’s insane. The type of this payment wasn’t even disclosed. I need to ask the folks at Medical data informatics who the organization/person was.

Going along the thread to illustrate how misleading averages can be: only 52 out of 2620 records are over 300k! This actually implies that if you were to bet on an unknown payment by Zimmer, you have less than 2% chance it will be above the average. Averages are wrong 98% of the time here!

The most interesting derivative for me from this question-  Zimmer made 19 payments in 2007 – above 11 million dollars where the type was not disclosed!

3. As a physician which state should you belong to have a higher probability of receiving a payment from a company?

Most payments are made out to California, Texas and NY!

Additional exploration- I looked into payments declared as “gifts” and the largest was by Zimmer for just $258! I’m pretty cynical about those numbers reported by those companies! Even most companies travel expenses seemed fairly accurate- no nuggets of gold there.

Data set and tools used:

The data set is fairly huge – contains 550,000 rows of information of each payment made from Jan 1 2007 -Jan 1 2011. The data fields breakdown  is as follows – The  company name  who made the payment, the year the payment was made, the period between which the payment was made, the city and state in which the physician was located and the estimated amount.

Tools used- The tools I used for this analysis were Tableau – the desktop version and Excel – pivot tables to summarize some of the initial data. Most tools and spreadsheet software will not handle either over 5Mb of data or over 100k rows.The best alternate way is to run this is R and work on the data from there. Or even reducing the scope and size in SQL and then feed it into another analysis tool.

Side Note: A huge thanks to Mark Hubbard and Julia from Medical Data Informatics for giving me access, answering my questions and patiently waiting till I settle down in this city.

Data Deconstructed – Refugee Statistics

Today’s post uses this data set – UNHCR 2011 Refugee Statistics. I found this analyzed the day before on one of my favorite data journalism blogs: the UK Guardian.

I’m going to take a different approach with a data set today: Instead of coming up with questions and then figuring out the answers, I will take you through my ideas on data visualization on a data set already analyzed by someone else. I will answer the same questions as the Guardian did in their analysis, with my take on it. This is not to critique what they have done, they have of course done an awesome job but to bring to light of how I would have  done it (hard to do now that I have read the answers!). I would first recommend reading the Guardian post – it’s a short one.

Note: There is usually no wrong or right with what you can show with a data set – just different things you can bring to light.

About the data set – This data was set released by the UN Refugee Agency and is super interesting. A line about the UNHCR – it’s broad (and amazing) mission is to protect refugees worldwide. This data set has information about a) Where all the refugees currently are b) Where they come from c) Very interesting context – host country capacities, settlements and camps, inflows in 10 spreadsheets.

The questions – Like always, let’s first try and understand what the Guardian wants to uncover.

Their post is basically answering two questions:

1. Where do the all refugees come from?

2. Where do all the refugees from Afghanistan go to?

Process: So now let’s solve this my way –

1. Where do all the refugees come from?

This is fairly straightforward. The way I would explore this data subset is juxtaposing it against a world map.  The colors in the map are proportional to the number, the darker the orange – the greater the # of refugees. Not surprising, the darkest is Afghanistan. As usual, click on the “Click to Interact” to play with it.

Let’s compare it to what the Guardian did. The Guardian did it pretty much the same way, they used a different tool – Google fusion maps. If you look carefully – The only difference here is that if you hover over their map you cannot see the data labels – the number of refugees or the name of the countries. If your geography is as bad mine in recognizing all the countries – this is not extremely useful to understand the complete data set. The labels are only visible when you zoom out and the color shadings are replaced by bubbles.

A side thought here – are data labels important? The answer  – it depends on the question you’re asking. If you’re looking for the place with the maximum refugees, all you need to do is look for the darkest colored region(s). However if your question is to understand the distribution of all refugees throughout the world – then yes, data labels are important.

Let’s goto question 2.

2. Where do all the refugees from Afghanistan goto? For this I separated a small chunk of the data. All the refugees from Afghanistan goto 9 different countries. It’s clear that the maximum number goto Pakistan! Now how can I best communicate these results. Let’s take a look:

For this I chose a treemap. The way this could have been visualized is by two ways a) Pie Chart b) Treemap. Both of these are typically used when we are representing parts of a whole. For those not familiar with a treemap – a treemap is very similar to a pie chart, but clearer and also unlike a pie chart we can add another parameter of color like we did with the world map. As usual, click on the “Click to Interact” to play with it.

Let’s compare this to what the Guardian did.They used a pie-cart. The reason I didn’t go for a pie chart here is because after Pakistan and Iran it’s really hard to distinguish between the rest of the countries! My rule of thumb is when there are more than 4 or 5 segments, I never use a pie-chart and usually opt for a treemap instead.

A side thought: is showing a pie chart wrong in this case? Not necessarily – it depends on the question/intention of the analyst. If the question is where do most of the refugees go from Afghanistan – the pie chart works just fine. It immediately tells us that most of them goto Pakistan and Iran. However if you’re looking to understand a clear distribution of the data, it’s not sufficient.

Additional ramblings on Data Visualization-

I want to talk about my thoughts on the value of data visualization and approach to how I use it when I play with data. I see a lot of data visualization being termed as just “cool” and of very little value.  Data visualization can potentially be of little value, not because of the visualization itself but of the intention and the questions we ask of the data. If my data visualizations are not answering the questions which I want to understand- then yes, it’s just cool (if it looks good!) and not valuable. If I’m looking at someone else’s visualization and fail to understand the intention then I’ll think of it as having very little value. But if used and understood correctly it can be extremely valuable.

Word of caution: I think I’ve referred to this before, but it’s very easy with data graphics to choose something that looks sophisticated but doesn’t answer your questions very well. Examples are – going for a 3d graph instead of 2d, not choosing a simple bar chart etc. The way I personally try to avoid the error of choosing coolness over functionality is by relentlessly looking for value.

I drew a chart of the data analysis process here –

If used well, data visualization can be immensely valuable in exploration and communication. Exploration – it can make it so much easier to look at numbers (like I did in some of my last posts) and Communication  – it usually makes a bigger impact to show a powerful graphic with the final results than text.

Would absolutely love to hear your thoughts. Today was fun. Will be back next week with another data set and another topic!


Data Deconstructed – US Visas (Part 2)

So this post is in continuation with the last one – analyzing last week’s data set on how US visas are distributed. You will probably make the most of this one if you first read the last one!

We left the last post at an interesting juncture – where we had just discovered that the US had given out over 7 million visas in the year 2000 and nearly six million in 2009. To solve this question and figure out why there was a drop, is a fairly basic introduction to data segmentation which I will cover in this post. A heads up: this post is data analytics 101, feel free to skip it / solve the problem directly if you’re looking for something more advanced.

Before we start solving the question let’s go over a couple of basic concepts –

The first basic rule with dealing with data – Aggregates and averages never convey anything meaningful.

If I’m looking at aggregate or average trends over time, they always answer the question of “what happened” but will never give any numerical insight or answer “why”. Insights pretty much come out of asking “why”.

Now let’s define data segmentation before we try it out in this problem (so we know what we’re doing) – Data segmentation is to drill down into an aggregate or average metric to understand the numerical cause(s). I keep saying “numerical” cause because that’s what a data point can explain by itself, the actual qualitative cause is understood by connecting the numerical dots with something more than the data set.

In our case our aggregate metric is – total number of visas. The question is to drill into this aggregate metric to figure out what cause of the drop is.

We have visas listed in 10 data sets – one data set for each year, classified by over 80 types of visas and over 100 countries. How do we drill into this most effectively? Of course, if we had the luxury of unlimited time (and infinite patience) we can compare every single country and then every single type, but the key to great data segmentation is to drill into the data in the shortest number of possible steps.

Note: In our case the drilling down by any method may not take a long time, but I’m going to write my thoughts down on data segmentation because these are the steps I’d follow for massive data sets as well.

So this brings us to the next question – How do we shorten the number of steps in data segmentation? There are three ways we do this –

1. By forming an initial hypotheses of what the cause is and then looking into it first –

This is more important than it looks like. You can usually form a really good hypotheses if you’re a domain expert. A domain expert by nature implies you’ve developed great intuition to have an idea why the visas would drop. Good intuition is a key here.

Since I’m nowhere close to a domain expert in visas, I will more than likely not get my hypotheses verified the first time. But it’s still important to have a starting point if there is anything you want to test out – I have two 1. The drop happened in tourist visas 2. The drop was more in Asia and the Middle East than in other countries. We will test both of them out in the process.

2. By drilling down into the next highest aggregate –

All this means is to come up with clusters you can drill down into as opposed to comparing the actual data points . So for instance here – instead of comparing over 100 countries, how about I first look at the trends in continents? I could possibly find some continents with a constant or increasing number of visas in the last ten years and then maybe some with a large drop – and then drill into those specific continents which have dropped. Much easier this way. Caution – this also  depends on how the data is organized, it’s much easier with the data I have because it’s already organized by continents.

I can also make similar clusters for visas – there are three kinds of work visas: I could group all of them together and so on.

So the rule here is – Look and come up with the next highest aggregates. Compare them first before going down. Instead of going straight into the countries and types of visas.

3. Use a good data visualization tool –

A good data visualization tool is a great way to quickly drill into data and play with trends. Effective data visualization is used in two parts of a data analytics problem 1. Exploration (Initial stage) 2. Communication of the results (Final stage).  Using data visualization to your advantage can make your life infinitely easier in the exploration stage.


If you look at the trend in the total # of visas you will see the biggest drop was in the year 2003 where the visas dropped to less than 5 million. It’s picked up a bit since but still less than million when compared to the year 2000. Note- You can look at the visa trends in my last post.

So what I’m going to do here is first test my first hypotheses – which is tourist visas. Since there are lesser types of visas than countries, let’s first see the effect of the types of visas in the year 2000 compared to 2009.

The process for this involved data cleaning and then I created another subset by merging the two data sets of the years 2000 and 2009. To visualize things better I introduced a third parameter called “difference”, so that we can easily identify the cause for the type of visa.

This is a simple way to visualize : you can use multiple techniques to fit this 1. Bar charts 2. Bubbles 3. Treemap for comparisons 4. Stacked charts (I personally avoid them).

Let me go with bubbles this time, just because I haven’t shown them in these posts before! Click on the “click to interact” to see how it will look. What we are looking for is the largest bubble (when the parameter “Difference” is selected). Hovering over the bubbles will give you the actual data points. In this visualization only the largest bubbles are important for us to look at, so readability doesn’t matter as much. If it did I would have probably chosen a longer bar chart. Always choose functionality over coolness :).

If you select “Difference” in the dropdown you will see that the maximum difference in visas is from a category called B1/2 – BCC. It is not a tourist visa (unsurprisingly my hypotheses is proved wrong – I’m not a domain expert!).

Reading up on B1/2 -BCC visas I find out that they are border crossing visas (also called laser visas). Here is the definition taken from the official department of immigration site – “The biometric border-crossing card (BCC/B-1/B-2 NIV) is a laminated, credit card-style document with many security features. It has a ten-year validity period. The card is commonly called a “laser visa.” Most Mexican visitors to the U.S., whether traveling to the border region or beyond, receive a laser visa.” More here.

This is the other thing I love about analyzing these public data sets – you get to learn something new in a different industry every week!

There has been a decrease of more than 1.13 million of such visas from 2000 to 2009!

I need to drill down again, this time around my job is easy (by definition of the visa) to figure out which country caused this the most. I look at continent trends first and immediately see the biggest difference in North America. One more step further down takes me directly to the only responsible country for the decrease – Mexico!

The answer to why there was such a huge drop – mostly because US granted 1.5 million visas less to Mexico in 2009 than in 2000. Wow.

One more thought – How do you know how many steps to drill into in data segmentation? The answer is till you’ve got till the lowest possible segment. So if there are n possible ways of classifying the data – that’s how far you need to go.

The other important side takeaway from this post is that individual trends may be completely different than aggregate trends. In this case the total number of visas went down by a million, but we saw a decrease of 1.13 million in just B1/2 -BCC visas. Some went up and some went down. Coming to conclusions of individual trends just based on average trends can be very dangerous!

That’s the end of this data segmentation! Hope you had fun, see you next week with another data set.