I’m trying out a new side project on this blog: a) writing about and b) analyzing one public data set each week.
The data sets will be public and the analysis tools will mostly be public too, and I will be exploring each one to hopefully uncover something interesting. I will try to take you through as much of the analysis as I can while keeping it fun. All future blog posts related to data sets will be under the category “Looking at Data”. I think this will be fun! Of course, I’ll still be putting up all the graphs I draw, so don’t expect this to affect the pace of that at all.
Now let’s get into the data set I picked for this week: earthquake data from April 3rd to April 10th. The data set is available on data.gov.
Now in the time period mentioned, there were 866 earthquakes in the world!
The biggest challenge for me with exploratory data analysis is that you aren’t solving a problem. You’re looking for something to stand out from a mass of data, which is immensely harder.
The way I overcome this challenge is to form certain hypotheses or questions before I start, and then look for answers.
So when I look at this data set (which has 10 parameters of information about each of these 866 earthquakes) these are the questions that I want to answer:
1. Which places have come really close to having an earthquake and not realized it?
2. Which have been the more earthquake-prone areas?
3. Are earthquakes reported in the media for the richer countries and do the poorer ones go unnoticed? Like “OMG! We are so close to dying in California,” while we just ignored the exact same thing which happened in Africa. (Thanks to Amaresh!)
I will address question 2 first: the more earthquake-prone areas (so you can all move out of those places immediately). This data is best understood visually, and it suits a chart well since I have the right parameters (latitude, longitude, magnitude, state):
Click on “Click to interact” to see how I came up with this one:
Here’s what I decided: I put latitude on the y axis and longitude on the x axis to give as close a view to the world map as I could. A refresher in geography: latitude ranges from -90 to +90 and longitude ranges from -180 to +180.
The size of each circle represents the magnitude, so spot the larger ones, and play with it since this is interactive :). Hovering over a circle will give you the exact location of the earthquake.
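If you would rather script a chart like this than build it in an interactive tool, here is a rough sketch in Python. It only illustrates the same idea and is not the tool behind the chart above. It assumes matplotlib is installed, and the column names (Lat, Lon, Magnitude) and the sample rows are made up for illustration, so they may not match the real file.

```python
import csv
import io

# Made-up rows purely for illustration; the column names are assumptions.
SAMPLE_CSV = """Lat,Lon,Magnitude
10.0,20.0,2.0
-35.5,140.25,4.5
60.0,-150.0,6.0
"""

def load_points(csv_text, lat_col="Lat", lon_col="Lon", mag_col="Magnitude"):
    """Return (longitudes, latitudes, magnitudes) as lists of floats."""
    lons, lats, mags = [], [], []
    for row in csv.DictReader(io.StringIO(csv_text)):
        lats.append(float(row[lat_col]))
        lons.append(float(row[lon_col]))
        mags.append(float(row[mag_col]))
    return lons, lats, mags

def marker_sizes(mags, scale=4.0):
    """Grow circle area with the square of magnitude so big quakes stand out."""
    return [scale * m ** 2 for m in mags]

def plot_quakes(lons, lats, mags):
    """Scatter plot with longitude on x and latitude on y, like a flat world map."""
    import matplotlib.pyplot as plt  # imported here so the helpers work without it

    fig, ax = plt.subplots(figsize=(10, 5))
    ax.scatter(lons, lats, s=marker_sizes(mags), alpha=0.5)
    ax.set_xlim(-180, 180)  # full longitude range
    ax.set_ylim(-90, 90)    # full latitude range
    ax.set_xlabel("Longitude")
    ax.set_ylabel("Latitude")
    plt.show()
```

Calling `plot_quakes(*load_points(SAMPLE_CSV))` draws the three sample points. The squared scaling in `marker_sizes` is just one choice for making larger earthquakes easier to spot.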
The answer to question 2: the more earthquake-prone areas that week were California, Alaska, Indonesia and Japan.
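Reading the answer off a chart works, but a quick tally is a useful cross-check. The sketch below counts earthquakes per region in Python. The `Region` key and the placeholder rows are assumptions for illustration, not values from the real data set.

```python
from collections import Counter

def quakes_per_region(rows, region_key="Region"):
    """Count how many earthquakes were recorded in each region."""
    return Counter(row[region_key] for row in rows)

# Placeholder rows; a real run would load these from the downloaded file.
sample_rows = [
    {"Region": "Region A", "Magnitude": 2.1},
    {"Region": "Region B", "Magnitude": 4.0},
    {"Region": "Region A", "Magnitude": 3.3},
    {"Region": "Region A", "Magnitude": 1.8},
    {"Region": "Region B", "Magnitude": 5.2},
    {"Region": "Region C", "Magnitude": 2.9},
]

counts = quakes_per_region(sample_rows)
for region, n in counts.most_common(2):
    print(region, n)
```

`most_common` sorts regions by count, so the most active areas come out first.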
Now let’s address question 1: Which places came very close to being damaged by an earthquake and didn’t realize it? For this the data needs context. The context here is that damage usually occurs when the magnitude is above 7 (one of the earthquakes that hit Japan last month was a 9.1). So my filter here keeps the earthquakes above 6, the ones that came close to that level:
There were four: two in Indonesia (Indonesia was hit by a 6.7 earthquake... just saved!), one in Japan and one in Mexico.
The way I did this is simple: I filtered it in Excel. If you’re analyzing it in SQL, run a similar query.
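For the SQL route, here is a minimal sketch using Python’s built-in sqlite3 module. The table name, column names and sample rows are all made up for illustration. Only the filter itself (magnitude above 6) comes from the analysis above.

```python
import sqlite3

def strong_quakes(rows, threshold=6.0):
    """Return (region, magnitude) rows above the threshold, strongest first."""
    conn = sqlite3.connect(":memory:")  # throwaway in-memory database
    conn.execute("CREATE TABLE quakes (region TEXT, magnitude REAL)")
    conn.executemany("INSERT INTO quakes VALUES (?, ?)", rows)
    result = conn.execute(
        "SELECT region, magnitude FROM quakes "
        "WHERE magnitude > ? ORDER BY magnitude DESC",
        (threshold,),
    ).fetchall()
    conn.close()
    return result

# Placeholder rows, not real measurements.
sample_rows = [
    ("Region A", 2.3),
    ("Region B", 6.7),
    ("Region C", 6.1),
    ("Region D", 5.9),
]

print(strong_quakes(sample_rows))
```

The `WHERE magnitude > ?` clause does the same job as the Excel filter, and the threshold is a parameter so you can try 7 as well.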
Now let’s get to the more interesting question 3: Are earthquakes in richer countries reported more than those in poorer countries? For this I need to combine this data set with another one, from the NYTimes. The NYTimes has a very friendly API for non-programmers in their developer network. The documentation is really good as well.
Using the API, I queried it to see which earthquakes were reported by the NYTimes in that time period. A snapshot of the data it returns and the queries I used:
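For anyone who wants to script a query like this, here is a hedged sketch in Python. It targets the NYTimes Article Search endpoint as I understand it today, which may not match the version used for the queries above. The parameter names (`q`, `begin_date`, `end_date`, `api-key`) and the `response.docs[].headline.main` layout are assumptions to check against the current documentation. Dates go in as YYYYMMDD strings, and you need your own API key.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint for the Article Search API; verify against the docs.
BASE_URL = "https://api.nytimes.com/svc/search/v2/articlesearch.json"

def build_search_url(query, begin_date, end_date, api_key):
    """Build the request URL; dates are strings in YYYYMMDD form."""
    params = {
        "q": query,
        "begin_date": begin_date,
        "end_date": end_date,
        "api-key": api_key,
    }
    return BASE_URL + "?" + urllib.parse.urlencode(params)

def headlines(response):
    """Pull the main headline out of each article in a parsed response."""
    docs = response.get("response", {}).get("docs", [])
    return [doc.get("headline", {}).get("main", "") for doc in docs]

def search(query, begin_date, end_date, api_key):
    """Run the query over the network and return the parsed JSON."""
    url = build_search_url(query, begin_date, end_date, api_key)
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)
```

Passing the result of `search("earthquake", ...)` to `headlines` gives a list of titles you can scan by eye for the countries you care about, which is roughly the quick inspection described here.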
After a quick inspection I found that the only relevant reporting was the earthquake in Japan. Indonesia and Mexico were not reported.
I’m not sure I can infer that earthquakes in poorer countries go unnoticed, even though the data is consistent with the hypothesis (this sample seems a little small)!
Ok I’m done with my analysis! This was fun. I’ll be back next week with another data set.