Dissecting usual media streams today of who’s right and wrong and all the reasons behind it.
Side Note: One of the days when I have a data-backed graph. Got this data from Google Insights for Search – fascinating for me that at one point last year the intensity of searches on cooking turkey before Thanksgiving was the same as the intensity of searches on leftover turkey after it!
I have a theory on why that happens – it’s because no one searches for how to cook the right amount of turkey for Thanksgiving. Hope everyone had a great Thanksgiving!
So it’s been a while since I’ve done my data posts. I was hunting for some real world data to work through ggplot2 (a graphics package in R) and I found an interesting one on Farmers Markets in the US. Figured this might make for a more fun post than an R tutorial. I need to do a ggplot2 tutorial when I get a better grasp – it’s amazing so far.
What interested me was investigating whether how healthy an area is correlates with the number of farmers markets in it.
About the Data set – This data set has about 7000 farmers markets in the US, with the address, zip code, state, latitude and longitude of each market.
Tools used – R
Side Note: How do you decide which tool you are going to use for data analysis? My answer is simple – whatever gets the job done. What I’m finding with using a programming language is that it’s helpful because of the re-usability and the ease with which you can do certain things. It’s just a preference. What causes great analysis is the ask from the data set, what questions we are asking from it. Everything is secondary if you aren’t asking the right questions to start with.
1. Let’s start with finding out which states have the most farmer markets.
Process – I’m choosing to talk about the process here because the data set, on inspection, shows lots of missing values in zip code. Wherever the zip code is missing, the latitude and longitude are put in as zero. Data cleaning may not be the most fun thing in the world – which is why it’s not talked about enough. But in most data sets a considerable amount of time is spent on it.
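Here’s a minimal sketch of how those placeholder zeros could be flagged as missing in base R. The data frame and its column names (`zip`, `lat`, `lon`) are made up for illustration; the real file’s columns may well be named differently.

```r
# Toy rows standing in for the real file; names and values are assumptions.
markets <- data.frame(
  name = c("Market A", "Market B", "Market C"),
  zip  = c("94110", "", NA),
  lat  = c(37.75, 0, 0),
  lon  = c(-122.42, 0, 0),
  stringsAsFactors = FALSE
)

# Treat empty zip strings as missing values.
markets$zip[!is.na(markets$zip) & markets$zip == ""] <- NA

# A latitude and longitude of exactly zero is a placeholder, not a location.
placeholder <- markets$lat == 0 & markets$lon == 0
markets$lat[placeholder] <- NA
markets$lon[placeholder] <- NA

# Count how much is actually missing before deciding what to plot.
missing_zip    <- sum(is.na(markets$zip))
missing_coords <- sum(is.na(markets$lat))
```

Marking the zeros as `NA` keeps them from quietly ending up as points at (0, 0) on a map later.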
Here, the actual state names show no missing values, so I chose to start with that. However, when I plotted it I found a “none” column showing up, and the state Massachusetts in three different bars with different spellings. A small highlight of an advantage of using R (or any programming language) here – the code is reusable when I’m doing a back and forth with data cleaning.
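A rough sketch of that kind of reusable cleaning step in base R is below. The misspellings in `fixes` are hypothetical stand-ins (not the variants that were actually in the file), and `clean_states` is just a helper name picked for this example.

```r
# Toy vector of state names; the misspelled variants are invented.
raw_states <- c("California", "Massachusetts", "Massachusets", "massachusetts",
                "none", "New York", "California", "Michigan")

# Hypothetical lookup from a bad spelling to the canonical name.
fixes <- c("Massachusets"  = "Massachusetts",
           "massachusetts" = "Massachusetts")

clean_states <- function(x, fixes) {
  x <- trimws(x)                      # drop stray whitespace
  x[x %in% "none"] <- NA              # "none" is not a state
  hit <- x %in% names(fixes)          # rows with a known bad spelling
  x[hit] <- unname(fixes[x[hit]])     # swap in the canonical name
  x
}

cleaned <- clean_states(raw_states, fixes)

# Markets per state, largest first; table() drops the NA entries.
counts <- sort(table(cleaned), decreasing = TRUE)
```

Because the fixes live in one lookup vector, each new oddity spotted in a plot is a one-line addition followed by a re-run.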
Post cleaning and plotting – here’s what the distribution of states looks like –
So the results are clear – the top 3 are 1. California 2. New York and 3. Michigan
Now if I’m trying to make a correlation between a healthy state and the number of farmers markets – is it fair to say that the states with the most farmers markets are the healthiest? Not really – a state with a much smaller population will naturally have fewer markets, so comparing the health of states using just the raw number of farmers markets doesn’t really make sense.
Let’s normalize the data – by finding the density of farmers markets in every state.
Side Note: Normalize, normalize, normalize. It’s important to remove the underlying variables which may have an effect on what we’re trying to compare.
So I take some external data on US population by state and divide each state’s population by its number of farmers markets. This gives me the number of people per farmers market in each state – the lower that number, the higher the density of markets.
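The normalization can be sketched in base R roughly like this. The states and numbers are deliberately fake placeholders ("State A" and so on), not the real market counts or census figures, and the column names are assumptions.

```r
# Made-up market counts per state (placeholders, not real data).
market_counts <- data.frame(
  state   = c("State A", "State B", "State C"),
  markets = c(700, 80, 150),
  stringsAsFactors = FALSE
)

# Made-up populations for the same placeholder states.
population <- data.frame(
  state      = c("State A", "State B", "State C"),
  population = c(35000000, 800000, 24000000),
  stringsAsFactors = FALSE
)

# Join the two tables on the state name.
merged <- merge(market_counts, population, by = "state")

# People per market: a lower value means a higher density of markets.
merged$people_per_market <- merged$population / merged$markets

# Sort from the densest state to the least dense one.
merged <- merged[order(merged$people_per_market), ]
```

In this toy example State A has by far the most markets but lands in the middle once population is taken into account, which is the same kind of flip the normalization is meant to expose.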
Let’s see what we find –
There – we see that normalizing the data completely flips the results. California drops well down the list, and Texas has the lowest density of farmers markets.
Can we establish a correlation between the density of farmers markets and the health of a state? We’ll find out next time.