Data Deconstructed – US Visas

 

Today’s analysis is of a really fun data set – the non-immigrant visas issued by the US government in the last 10 years!

A bit more on the data set: it covers the types of visas issued by the US government from 1997 to 2009. The data is segmented by a) type of visa and b) country and continent. A first look tells me the US issues more than 80 types of visas each year to over 100 countries. Wow.

Since there are more than 80 kinds of visas here, I will take a specific look at 2 of them for this blog post  – F1: student visas – issued to students wanting to study in the US, H1B: work visas – for people wanting to work in the US.

Questions: These are the two questions I was pretty curious to understand –

1. What is the breakdown of the visas in 2009? This is to give more context to the data set and help in exploring it.

2. What does the trend of the ratio of F1/H1 visas look like in the last ten years?

Note: If you’re practicing data analysis in this form, you will realize that asking the best questions is probably the hardest part. The second really hard part is actually cutting down the # of questions (if you’re as curious as me and as insane to find almost everything interesting!). The way I approach this is to write down all the questions I want to ask, then pick the two or three most interesting and figure out the answers. For instance, in this one I was interested in student visas, work visas, tourist visas (B2), the impact on the tourism industry, H4 visas – for people who get married to people working here and are not eligible to work in the US (sacrifices of getting married!) and even asylum visas! But then I just picked the first two and decided to maybe continue with one of the others later. The third hard and sometimes frustrating thing is realizing that you may not find data to answer your questions, so you always need to work within the constraints of the data you have and can find.

Answers:

1. What is the breakdown of the visas in 2009?

Here goes – I put the 2009 data on a world map so you can see where the visa recipients come from. Over a million visas were given out last year across more than 80 categories. Whenever I’m looking at data where the country is a variable, a world map is an almost perfect way to visualize it.

Click on “Click to interact” and you will be able to select a visa class and then see which parts of the world get the most visas. You can play with it for a long time, it’s so interesting. I love how, when you select H1, you see that the most visas go to India!

2. What does the trend of the  ratio of F1/H1 visas look like in the last ten years?

To answer this question, I really had to clean up the data set (of 14 sheets) and create a new one to work with. This is a perfect example of how the best answers to your questions sometimes come not from the existing data set but from a new one you create from it. I introduced 3 parameters here: % H1B’s of the total, % F1’s of the total and the ratio of F1/H1, which gives a rough indication of how many of the students who come here end up wanting and getting a job in the US. Again click on “Click to interact” to see what we find.
What you will see is that the # of student visas is trending gently upwards, while the % of student visas of the total has gone up. The # of work visas has been fairly constant (within statistical limits) over the last ten years and the % of work visas of the total has gone up slightly. The ratio of student visas to work visas has been on the rise and in 2009 was 3, implying that for every 3 students who came to study here, less than 1 wanted and got accepted to work here! I say less than because quite a few work visas are given out to people who did not study in the US.
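A minimal sketch of how the three parameters could be computed for each year. The function name and the counts below are placeholders made up for illustration; they are not figures from the visa data set.

```python
def visa_shares(f1, h1b, total):
    """Return (% F1 of total, % H1B of total, F1/H1B ratio) for one year."""
    if total <= 0 or h1b <= 0:
        raise ValueError("total and h1b must be positive")
    return 100 * f1 / total, 100 * h1b / total, f1 / h1b


# Placeholder counts, NOT the real figures from the data set.
counts_by_year = {
    "year A": {"f1": 200, "h1b": 100, "total": 4000},
    "year B": {"f1": 300, "h1b": 100, "total": 5000},
}

for year, c in counts_by_year.items():
    pct_f1, pct_h1b, ratio = visa_shares(c["f1"], c["h1b"], c["total"])
    print(f"{year}: F1 {pct_f1:.1f}% of total, H1B {pct_h1b:.1f}% of total, F1/H1B = {ratio:.2f}")
```

A ratio of 3 in this sketch means three student visas for every work visa issued in that year. It says nothing about whether the same people are behind both numbers.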

But what is most interesting is that the total # of visas has decreased quite a bit! Over 7 million in 2000 and just nearly 6 million in the last couple of years.

What could have been the cause of this? Maybe it’s post 9/11 or maybe something else.. we’ll find out. I will save it for next week’s data post. I think it will be a perfect introduction into data segmentation as I go through the steps to uncover the answer!

The New Job Market Rulers – Data Scientists

There is a new wave in the data analysis job market. If you’re a data person, the hottest new job title to own is Data Scientist.

So what on earth is a data scientist? A data scientist is someone who usually has an advanced statistics (or similar) degree but can also handle data, is proficient at extracting it (with the coding skills that requires) and combines all this with business skills to generate insights from massive amounts of data. That is different from a data engineer.

Data scientist = Data guy + Programming skills (handling databases) + Business skills.

I’ve illustrated where the skill set of a data scientist lies-

Data Scientists are very hard to find right now and I think it’s just going to keep getting harder. Right now, this is what the trend is looking like –

So why are data scientists so hard to find? I think it comes down to one reason – most statistics/math/related field graduates, even with PhD’s, are not trained in programming and database skills in college. The curriculum is designed so they are good at data mining and statistics, but it’s harder to find the self-learner who takes the plunge into programming to become a data scientist. This is eventually going to create a huge scarcity and we will probably see newer college degrees coming up to address the need. Till then, if you’re currently a good data scientist, you’ll probably have a field day!

The connectors in data are becoming more valuable than the specialists.

 

Data Explained – Why my last post is wrong

This week instead of deconstructing another data set, I want to talk about why my last data post was wrong. The goal of this post is to touch upon a few basic points to consider when either analyzing data or reading an analysis – by showing an example of how I was wrong.

If you want to make the most out of this post, it’s probably a good idea to first read the last one.

In data, it’s usually easy for anyone to make you believe something by showing evidence. We see presidential poll predictions all the time where big news channels fail repeatedly (thankfully) to make the right predictions. Usually it’s the classic case where things “seem right” but aren’t. Like my previous data blog post.

Having said that let’s go straight into three factors where people usually go wrong in data analysis –

1. Correlation and causation – So much has already been said about this point, but it is still the easiest one to get wrong. That’s probably because it’s very easy to casually infer causation, which is exactly what I did in my last analysis. The three countries I was looking at (India, Ethiopia and Indonesia) had child mortality rates which were falling during the period when the US loaned money, so I casually assumed that the US assistance had an effect. A fatal error.

Here’s a simple rule – if we are looking at the influence of one data set on another, we are looking at causation. Correlation might hint at causation, but more often than not that is not the case. To prove causation you need more mathematical techniques.

The actual solution – In my case the actual solution would be to look into all the aid the three countries got, then run a suitable form of regression, then prove or disprove the hypothesis statistically. There is no shortcut to arriving at these conclusions. That may be another reason why the media gets it wrong all the time: it requires a certain level of understanding of statistics to prove causation, and the people who draw these conclusions may not have the technical competence.
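A quick sanity check, far short of the regression described above, is to compare the correlation of the raw yearly levels with the correlation of the year-over-year changes. Two series that both simply trend over time will look strongly related in levels even when their changes tell a different story. The sketch below is an illustration only: the helper names are hypothetical and the numbers are made up, not real aid or mortality figures.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def yearly_changes(series):
    """Year-over-year differences, which strip out a steady trend."""
    return [later - earlier for earlier, later in zip(series, series[1:])]


# Made-up yearly values for illustration only, NOT real data.
aid = [10, 30, 20, 40, 30, 50]
mortality = [100, 96, 90, 86, 80, 76]

print("levels: ", round(pearson(aid, mortality), 2))
print("changes:", round(pearson(yearly_changes(aid), yearly_changes(mortality)), 2))
```

In this toy data the levels are strongly negatively correlated (aid goes up, mortality goes down), yet the yearly changes are positively correlated: mortality fell less in the years when aid rose. Neither number proves or disproves causation; it only shows how much of the apparent relationship can come from a shared trend.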

2. The goal of proving a hypothesis – This second point has more to do with the impatience of human intuition. Many times during an analysis we want to prove a hypothesis and keep trying to find evidence for it, as opposed to discarding the original hypothesis altogether. Things sometimes seem logically true but may be completely inaccurate. These kinds of mistakes happen when our goal becomes proving our intuition right instead of finding the truth.

In my case, I was looking for evidence to show that all the money the US had been lending worked. It may not have and I came to a hurried conclusion.

3. Interdependence – It’s very logical to say that aid affects child mortality, but maybe it’s also the other way around?

For instance, there might be a cut-off point in the data: once child mortality reaches a threshold, aid is cut off. That might have been the reason for the US to cut off aid as well.

That’s it. I hope this helps you, when looking at an analysis, to understand why it may be wrong or why it may have jumped to conclusions too quickly.

I will be back next week or mid next week with a new data set.

Side Note: A huge thank you to Shreemoy Mishra for his feedback!

Data Deconstructed – Is US helping the world build a better place?

Two weeks ago I looked at the data of US Foreign Assistance and explored it. Today I want to look at its influence in the category “Child Health and Survival” and answer this question – Is the US helping build a better world? Have all these loans affected child mortality in these countries?

I filtered out all the “Child Health and Survival” data from the US Foreign Assistance data set. Total spending in this category is 12.9 billion in the last 10 years. Preliminary exploration shows that the US has helped 106 countries with child health in the period 1999-2009. I found it interesting that the US only started foreign assistance in this category in 1999 (I have data available from the year 1954 for all US foreign assistance). Wonder what made them decide in 1999 to start aid in this category!

Here is the first visualization I created with the Child Health and Survival subset. I superimposed the spending on a world map. The size of each bubble is proportional to the amount of spending, so you can see where the most money has been spent.

The influence – To look at how much effect this money spent had, I’m going to look at the mortality rates of the top three specific countries where the US has given the most amount of money. These are *imaginary drum roll* –

1. India (389 million) 2. Indonesia (299 million) 3. Ethiopia (273 million)

Wow. These numbers are huge! What stands out now is that in 2009 the US stopped child healthcare aid to both Indonesia and Ethiopia and also drastically cut its aid to India. What’s the reason behind that? Maybe it wasn’t effective enough or maybe it’s the war.. we’ll find out soon. The graph illustrates this. The colors of the bars are chosen so that you can take a single year and compare the same color across the countries, essentially to make comparisons easier.

Next I went on to find external data to tell me what the child mortality rates in these countries have been over the last ten years. Researching a bit more into child mortality told me there are various kinds of mortality – perinatal mortality, neonatal mortality, post-neonatal mortality and under-5 mortality. I decided to look specifically into under-5 mortality, which is the broadest category.

This is a good example of connecting a data set to a third party data set.  I stumbled upon the data set I wanted on ChildInfo.org. This data set has under-five mortality values from 1960 to 2009 for 197 countries across the world. I decided to filter out not just the three countries we were looking at but also the USA. I took the US into account because I wanted to give some context to the data at hand.
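Connecting two data sets like this usually comes down to joining them on shared keys, here country and year. A minimal sketch follows, assuming both sources have been loaded as lists of dictionaries. The field names (`country`, `year`, `aid`, `under5_per_1000`) and the rows are hypothetical placeholders, not the actual column names or values from either source.

```python
def join_on_country_year(aid_rows, mortality_rows):
    """Inner-join two lists of dicts on the (country, year) pair."""
    mortality = {
        (row["country"], row["year"]): row["under5_per_1000"]
        for row in mortality_rows
    }
    joined = []
    for row in aid_rows:
        key = (row["country"], row["year"])
        if key in mortality:  # keep only rows present in both sources
            joined.append({**row, "under5_per_1000": mortality[key]})
    return joined


# Placeholder rows for illustration only, NOT real figures.
aid_rows = [
    {"country": "Country A", "year": 2001, "aid": 5.0},
    {"country": "Country A", "year": 2002, "aid": 7.0},
    {"country": "Country B", "year": 2001, "aid": 3.0},
]
mortality_rows = [
    {"country": "Country A", "year": 2001, "under5_per_1000": 90},
    {"country": "Country A", "year": 2002, "under5_per_1000": 88},
    {"country": "Country C", "year": 2001, "under5_per_1000": 40},
]

for row in join_on_country_year(aid_rows, mortality_rows):
    print(row)
```

In practice the fiddly part tends to be the keys themselves: two sources rarely spell every country name the same way, so rows that silently drop out of the join are worth checking.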

I digress a bit here – I couldn’t help exploring the child mortality data set a bit. It just reflected such a huge inequality. There were 30 countries with higher child mortality rates than Ethiopia. Most of them were African; Chad, Somalia and Afghanistan had almost 20%! In sharp contrast, and unsurprisingly, the US, Canada and Europe were among the lowest, ranging from 0.4% to 0.8%.

Here’s what I found – visualized below. This is data from 2000-2009.

These numbers are deaths per 1000 babies born. The US has a constant and very low mortality rate (0.8%) over the last decade. Ethiopia has such a high rate – currently almost 10% of all babies die before reaching the age of 5. That’s insane. How can there be so much inequality?
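Since the chart is in deaths per 1000 and the text quotes percentages, it helps to be explicit about the conversion: a percentage is per 100, so a per-1000 rate is divided by 10. A tiny sketch with a hypothetical helper name:

```python
def per_thousand_to_percent(deaths_per_1000):
    """Convert a rate expressed per 1000 births into a percentage."""
    return deaths_per_1000 / 10


print(per_thousand_to_percent(8))    # 8 per 1000 is 0.8%
print(per_thousand_to_percent(100))  # 100 per 1000 is 10.0%
```

So a rate of 0.8% corresponds to 8 deaths per 1000 births, and a rate of almost 10% to almost 100 deaths per 1000.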

The downward-trending bars do show that the aid is indeed having a positive effect on all three countries! It definitely helped create a better world. It’s just sad that they decided to discontinue it… maybe it was because spending increased in all other areas.

I hope the US reinstates the aid, especially to Ethiopia and some other African countries where the child mortality rates are so high.

Thanks for reading! I will be back again next week with another data set!

Side note: My friend Graham brought up a good point. He wanted to see what the trend was before 2000 for these countries. Here it is. It isn’t what you’d expect – the trend was already going down before US aid. But that is true for all 197 countries (almost every one improved its mortality rate), probably thanks to advances in medical science and technology. Two thoughts here: 1. It becomes exponentially harder to improve child mortality rates. 2. Which is why lots of African countries improved before 2000 but have stayed at the same mortality rates post 2000. Some, like Somalia, are at around 18%.

Data Deconstructed – US Foreign Assistance

Time for my weekly data set analysis. The choice this week is a breakdown of US foreign aid (loans and grants) to all the countries of the world. Of course, the data set is public; this one is taken from data.gov.

I’ve always been curious to know who the US lends money to and to explore the data further. But I think the more interesting question is “Does all the charity and goodwill from the US have any effect, and how much?” (Thanks to Amaresh!). For instance, you may loan billions of dollars to India for child health, but did it affect India’s child mortality rates? It will be really cool to explore that and integrate more data sources with this one.

This week the plan is to explore the data set; next week I will look into the broader influence of this data and merge it with other data sources. I need 2 long-ish blog posts to do this data set justice.

I digress a bit: Every week I will try to make the data analysis more challenging for myself. This week’s data set is larger and I will also be using Tableau Public to do some data exploration. Tableau Public is a fun way to share open data visualizations, I totally recommend trying it out.

Let’s start the analysis. As done previously, I will start with formulating questions.

Before I do, let me take a quick look at the data set. At first glance it tells me there are more than 100 countries the US has been generous to, and that the data covers the years 1946-2009 in more than 10 categories. Wow.

The areas I want to explore – I’m going to pick two here since I’m continuing this next week as well. All will be related to exploring the current data set.

1. What are the top 3 spending areas for US aid in the last 10 years?

2. Which are the top countries the US gave money to last year? In the last 10 years?

The answers (and process):

Before I start, I’m going to share a visualization which will help you look at and understand the data better. In this one I took each category of spending and showed it on a world map. This way you can just scroll down and the labels will tell you where the US has been spending money. This is one of those rare data sets which are so interesting to look at (just because so many interesting things stand out). One of the things that stood out for me is how little the US has given to China over the years. Of course, China has not been as rich over the years as Europe or Australia (which don’t need the assistance as much as China does).

1. For the first question, I came to these results by a) introducing another variable in the data set which sums up all the spending over the last 10 years and b) segmenting the total money given by program name. Here is the graph which shows that.

Looking at this graph (ignore the Title1, title 2) you realize that the major spending areas have been economic assistance, defense, food (everything under USDA), helping other countries cope with country-specific problems (narcotics, refugee assistance) and broader philanthropy, where I include child health, helping with AIDS etc.

The top spending areas have been:

1. Economic support assistance (it will be very interesting to understand how it has affected the GDP of the recipient countries, I might explore that next week as well). The US has spent slightly above 16 billion in this area.

2. The next highest spending is voluntary contributions to international organizations – just above 15 billion.

3. The third area of spending is a tie between “Development Assistance” and “Narcotics Control”. The US has spent almost 12 billion dollars in each of these two areas.
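The two steps used for this first question (summing spending over a 10-year window, then segmenting by program name) amount to a filtered group-by. A minimal sketch, assuming the data has been loaded as a list of dictionaries; the field names and rows are placeholders made up for illustration, not the real data.gov columns or amounts.

```python
from collections import defaultdict


def total_by_program(rows, first_year, last_year):
    """Sum amounts per program for years in [first_year, last_year]."""
    totals = defaultdict(float)
    for row in rows:
        if first_year <= row["year"] <= last_year:
            totals[row["program"]] += row["amount"]
    # Largest total first, so the top spending areas lead the list.
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Placeholder rows for illustration only, NOT real figures.
rows = [
    {"program": "Program A", "year": 1995, "amount": 9.0},
    {"program": "Program A", "year": 2005, "amount": 2.0},
    {"program": "Program B", "year": 2003, "amount": 1.5},
    {"program": "Program B", "year": 2008, "amount": 2.5},
]

print(total_by_program(rows, 2000, 2009))
```

Note how the 1995 row is left out of the window: Program A has the larger all-time total, but Program B comes first for 2000-2009.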

2. Now for the second question, which countries did the US loan the most to a) last year and b) in the last 10 years? I segmented again by country name and two variables: 1. Money loaned in the last year 2. Money loaned in the last 10 years. Take a look:

Here it looks like the top 3 beneficiaries last year were 1. Afghanistan 2. Iraq 3. Pakistan! Not surprising at all. The data should probably include a new category for “War against Terrorism aid”.

This also makes me wonder, has US economic assistance to other countries gone down because it decided to engage in war? That would suggest that other countries (who depend heavily on US assistance) may have been adversely affected by the war by not getting assistance last year, which is something we don’t know yet. Another angle to explore next week.

And the top 3 beneficiaries overall were 1. Russia (over 8 billion) 2. Colombia (6.6 billion) 3. Jordan (slightly over 4 billion).
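The two variables per country (money in the last year, money in the last 10 years) and the resulting top-3 lists can be sketched as below. The function names, field names and rows are hypothetical placeholders; the real analysis was done in a visualization tool, so this is only one way to get the same kind of ranking.

```python
def country_totals(rows, last_year, window=10):
    """Per country: (amount in last_year, amount over the window ending in last_year)."""
    first_year = last_year - window + 1
    totals = {}
    for row in rows:
        latest, decade = totals.get(row["country"], (0.0, 0.0))
        if row["year"] == last_year:
            latest += row["amount"]
        if first_year <= row["year"] <= last_year:
            decade += row["amount"]
        totals[row["country"]] = (latest, decade)
    return totals


def top_countries(totals, column, n=3):
    """Names of the n countries with the largest value in column 0 or 1."""
    ranked = sorted(totals, key=lambda name: totals[name][column], reverse=True)
    return ranked[:n]


# Placeholder rows for illustration only, NOT real figures.
rows = [
    {"country": "Country X", "year": 2009, "amount": 5.0},
    {"country": "Country X", "year": 2001, "amount": 1.0},
    {"country": "Country Y", "year": 2009, "amount": 1.0},
    {"country": "Country Y", "year": 2004, "amount": 10.0},
    {"country": "Country Z", "year": 1990, "amount": 100.0},
    {"country": "Country Z", "year": 2009, "amount": 2.0},
]

totals = country_totals(rows, last_year=2009)
print("last year:    ", top_countries(totals, column=0, n=2))
print("last 10 years:", top_countries(totals, column=1, n=2))
```

The two rankings can differ a lot, which matches what the post finds: the countries at the top last year are not the same ones at the top over the whole decade.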

I hope this week you got a feel for the data and explored it! I will continue to drill into Childcare loans next week. Have the child mortality rates dropped in those countries where the US has been spending so much money? Need to hunt for more data for that one!

PS: Had an exceptionally hard time writing and structuring this post.. it took me 4 complete rewrites and I still don’t like it. Sigh.. Just decided to ship. Some days you just can’t write or draw.