This post is a continuation of the last one – analyzing last week’s data set on how US visas are distributed. You will probably get the most out of this one if you read the last one first!
We left the last post at an interesting juncture – we had just discovered that the US had given out over 7 million visas in the year 2000 and nearly six million in 2009. Figuring out why there was a drop is a fairly basic exercise in data segmentation, which I will cover in this post. A heads up: this post is data analytics 101 – feel free to skip it / solve the problem directly if you’re looking for something more advanced.
Before we start solving the question let’s go over a couple of basic concepts -
The first basic rule of dealing with data – aggregates and averages by themselves never convey anything meaningful.
Aggregate or average trends over time answer the question of “what happened”, but they will never give any insight into “why”. Insights pretty much come out of asking “why”.
Now let’s define data segmentation before we try it out on this problem (so we know what we’re doing) – data segmentation means drilling down into an aggregate or average metric to understand the numerical cause(s). I keep saying “numerical” cause because that’s all a data point can explain by itself; the actual qualitative cause is understood by connecting the numerical dots with something beyond the data set.
In our case the aggregate metric is the total number of visas. The task is to drill into this metric to figure out what the cause of the drop is.
We have visas listed in 10 data sets – one for each year, classified by over 80 types of visas and over 100 countries. How do we drill into this most effectively? Of course, if we had the luxury of unlimited time (and infinite patience) we could compare every single country and then every single type, but the key to great data segmentation is to drill into the data in the smallest possible number of steps.
Note: In our case the drilling down by any method may not take a long time, but I’m going to write my thoughts down on data segmentation because these are the steps I’d follow for massive data sets as well.
So this brings us to the next question – How do we shorten the number of steps in data segmentation? There are three ways we do this -
1. By forming an initial hypothesis of what the cause is and then looking into it first -
This is more important than it looks. You can usually form a really good hypothesis if you’re a domain expert – being one implies you’ve developed the intuition to have an idea of why the visas would drop. Good intuition is key here.
Since I’m nowhere close to a domain expert in visas, I will more than likely not get my hypotheses verified the first time. But it’s still important to have a starting point if there is anything you want to test out – I have two: 1. The drop happened in tourist visas 2. The drop was larger in Asia and the Middle East than elsewhere. We will test both of them out in the process.
2. By drilling down into the next highest aggregate -
All this means is coming up with clusters you can drill down into, as opposed to comparing the actual data points. So for instance here – instead of comparing over 100 countries, how about I first look at the trends by continent? I could possibly find some continents with a constant or increasing number of visas over the last ten years, and some with a large drop – and then drill into only those continents which have dropped. Much easier this way. Caution – this also depends on how the data is organized; it’s much easier with the data I have because it’s already organized by continent.
I can also make similar clusters for visas – for instance, there are three kinds of work visas, so I could group them all together, and so on.
So the rule here is – come up with the next highest aggregates and compare them first, instead of going straight into the individual countries and types of visas.
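The “next highest aggregate” idea can be sketched in a few lines of pandas. The country names, column names and visa counts below are all made up for illustration (the real data set has 100+ countries) – this just shows the mechanics of rolling countries up into continents before comparing anything country by country.

```python
import pandas as pd

# Hypothetical visa counts by country -- names and numbers are invented,
# the real data set has over 100 countries.
visas = pd.DataFrame({
    "country":   ["Mexico", "Canada", "India", "China", "Germany"],
    "continent": ["North America", "North America", "Asia", "Asia", "Europe"],
    "y2000":     [2_200_000, 90_000, 400_000, 250_000, 150_000],
    "y2009":     [700_000, 95_000, 550_000, 400_000, 140_000],
})

# Compare the next-highest aggregate (continents) before drilling
# into individual countries.
by_continent = visas.groupby("continent")[["y2000", "y2009"]].sum()
by_continent["difference"] = by_continent["y2000"] - by_continent["y2009"]
print(by_continent.sort_values("difference", ascending=False))
```

With five continents instead of 100+ countries, one glance at the sorted table tells you where to drill next.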
3. Use a good data visualization tool -
A good data visualization tool is a great way to quickly drill into data and play with trends. Effective data visualization is used in two parts of a data analytics problem 1. Exploration (Initial stage) 2. Communication of the results (Final stage). Using data visualization to your advantage can make your life infinitely easier in the exploration stage.
If you look at the trend in the total # of visas you will see the biggest drop was in the year 2003, when the visas dropped to less than 5 million. It has picked up a bit since, but is still about a million less than in the year 2000. Note – you can look at the visa trends in my last post.
So what I’m going to do here is first test my first hypothesis – tourist visas. Since there are fewer types of visas than countries, let’s first look at the effect by type of visa, comparing the year 2000 to 2009.
The process for this involved cleaning the data and then merging the two data sets for the years 2000 and 2009 into a single subset. To visualize things better I introduced a third parameter called “difference”, so that we can easily identify which type of visa caused the drop.
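The merge-and-difference step looks roughly like this in pandas. The visa types and counts here are invented stand-ins (the real data has 80+ types), but the shape of the operation – join the two years on visa type, then subtract – is the same.

```python
import pandas as pd

# Hypothetical per-type totals for two years; types and counts are
# made up, the real data set has over 80 visa types.
y2000 = pd.DataFrame({"visa_type": ["B1/2 - BCC", "H-1B", "F-1"],
                      "count": [2_000_000, 130_000, 280_000]})
y2009 = pd.DataFrame({"visa_type": ["B1/2 - BCC", "H-1B", "F-1"],
                      "count": [870_000, 110_000, 330_000]})

# Merge the two years on visa type, then add a "difference" column.
merged = y2000.merge(y2009, on="visa_type", suffixes=("_2000", "_2009"))
merged["difference"] = merged["count_2000"] - merged["count_2009"]
print(merged.sort_values("difference", ascending=False))
```

Sorting by “difference” surfaces the biggest drop immediately – the same signal the bubble chart below conveys visually.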
This is a simple comparison to visualize, and multiple techniques fit: 1. Bar charts 2. Bubbles 3. Treemaps 4. Stacked charts (I personally avoid them).
Let me go with bubbles this time, just because I haven’t shown them in these posts before! Click on “click to interact” to see how it will look. What we are looking for is the largest bubble (when the parameter “Difference” is selected). Hovering over the bubbles will give you the actual data points. In this visualization only the largest bubbles matter, so readability isn’t a priority. If it were, I would probably have chosen a long bar chart. Always choose functionality over coolness.
If you select “Difference” in the dropdown you will see that the maximum difference in visas is from a category called B1/2 – BCC. It is not a tourist visa (unsurprisingly, my hypothesis is proved wrong – I’m not a domain expert!).
Reading up on B1/2 -BCC visas I find out that they are border crossing visas (also called laser visas). Here is the definition taken from the official department of immigration site – “The biometric border-crossing card (BCC/B-1/B-2 NIV) is a laminated, credit card-style document with many security features. It has a ten-year validity period. The card is commonly called a “laser visa.” Most Mexican visitors to the U.S., whether traveling to the border region or beyond, receive a laser visa.” More here.
This is the other thing I love about analyzing these public data sets – you get to learn something new in a different industry every week!
There has been a decrease of more than 1.13 million of such visas from 2000 to 2009!
I need to drill down again; this time my job is easy (by the definition of the visa). To figure out which country contributed the most, I look at continent trends first and immediately see the biggest difference in North America. One more step down takes me directly to the one country responsible for the decrease – Mexico!
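This two-step drill-down (continent first, then country within the continent) can be sketched the same way. Again, the counts are invented for illustration – only Mexico’s dominance of the BCC category mirrors what the post found.

```python
import pandas as pd

# Hypothetical BCC visa counts by country; figures are invented,
# though in the real data Mexico dominates this category.
bcc = pd.DataFrame({
    "country":   ["Mexico", "Canada", "Brazil"],
    "continent": ["North America", "North America", "South America"],
    "y2000":     [2_000_000, 5_000, 3_000],
    "y2009":     [870_000, 4_000, 2_500],
})
bcc["difference"] = bcc["y2000"] - bcc["y2009"]

# Step 1: which continent accounts for most of the drop?
top_continent = bcc.groupby("continent")["difference"].sum().idxmax()

# Step 2: within that continent, which country?
top_country = (bcc[bcc["continent"] == top_continent]
               .set_index("country")["difference"].idxmax())
print(top_continent, "->", top_country)
```

Each step discards most of the data, which is why segmenting through aggregates is so much faster than comparing every country directly.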
The answer to why there was such a huge drop – mostly because the US granted 1.5 million fewer visas to Mexico in 2009 than in 2000. Wow.
One more thought – how do you know how many steps to drill down in data segmentation? The answer: until you’ve reached the lowest possible segment. So if there are n possible ways of classifying the data, that’s how many levels you may need to go.
The other important takeaway from this post is that individual trends may be completely different from aggregate trends. In this case the total number of visas went down by a million, but we saw a decrease of 1.13 million in just B1/2 – BCC visas: some categories went up and some went down. Drawing conclusions about individual trends based only on aggregate trends can be very dangerous!
That’s the end of this data segmentation! Hope you had fun, see you next week with another data set.