Today’s post uses this data set – UNHCR 2011 Refugee Statistics. I found this analyzed the day before on one of my favorite data journalism blogs: the UK Guardian.
I’m going to take a different approach with a data set today. Instead of coming up with questions and then figuring out the answers, I will take you through my ideas on data visualization using a data set already analyzed by someone else. I will answer the same questions the Guardian did in their analysis, with my own take. This is not to critique what they have done (they have, of course, done an awesome job) but to show how I would have done it (hard to do now that I have read the answers!). I would recommend reading the Guardian post first – it’s a short one.
Note: There is usually no wrong or right with what you can show with a data set – just different things you can bring to light.
About the data set – This data set was released by the UN Refugee Agency and is super interesting. A line about the UNHCR – its broad (and amazing) mission is to protect refugees worldwide. This data set has information about a) where all the refugees currently are, b) where they come from, and c) very interesting context: host country capacities, settlements and camps, and inflows, all in 10 spreadsheets.
The questions – Like always, let’s first try and understand what the Guardian wants to uncover.
Their post is basically answering two questions:
1. Where do all the refugees come from?
2. Where do all the refugees from Afghanistan go to?
Process: So now let’s solve this my way –
1. Where do all the refugees come from?
This is fairly straightforward. The way I would explore this data subset is by juxtaposing it against a world map. The colors in the map are proportional to the number: the darker the orange, the greater the number of refugees. Not surprisingly, the darkest is Afghanistan. As usual, click on the “Click to Interact” to play with it.
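To make the "darker means more" idea concrete, here is a minimal sketch of a linear color scale. The function name `orange_shade`, the two endpoint colors, and the country counts are all assumptions made up for illustration; they are not UNHCR figures and not necessarily how the map tool does its shading.

```python
def orange_shade(count, max_count):
    """Map a count to a hex color between a light and a dark orange."""
    light = (255, 237, 213)  # assumed color for zero refugees
    dark = (194, 65, 12)     # assumed color for the largest count
    # Position of this count on a 0..1 scale, clamped to stay in range.
    t = 0.0 if max_count <= 0 else max(0.0, min(1.0, count / max_count))
    # Interpolate each RGB channel linearly between light and dark.
    rgb = tuple(round(l + (d - l) * t) for l, d in zip(light, dark))
    return "#{:02x}{:02x}{:02x}".format(*rgb)


# Placeholder counts for illustration only (not real data).
counts = {"Country A": 1000, "Country B": 250, "Country C": 0}
max_count = max(counts.values())
shades = {name: orange_shade(n, max_count) for name, n in counts.items()}
```

A linear scale like this is the simplest choice. When one country dominates the counts, most of the others end up nearly the same pale shade, which is one more reason data labels help.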
Let’s compare it to what the Guardian did. They did it pretty much the same way but with a different tool – Google fusion maps. If you look carefully, the only difference is that when you hover over their map you cannot see the data labels: the number of refugees or the names of the countries. If your geography is as bad as mine at recognizing all the countries, this is not very useful for understanding the complete data set. The labels are only visible when you zoom out and the color shadings are replaced by bubbles.
A side thought here – are data labels important? The answer – it depends on the question you’re asking. If you’re looking for the place with the most refugees, all you need to do is look for the darkest colored region(s). However, if your question is to understand the distribution of all refugees throughout the world, then yes, data labels are important.
Let’s go to question 2.
2. Where do all the refugees from Afghanistan go to? For this I separated a small chunk of the data. All the refugees from Afghanistan go to 9 different countries. It’s clear that the maximum number go to Pakistan! Now, how can I best communicate these results? Let’s take a look:
For this I chose a treemap. This could have been visualized in two ways: a) a pie chart or b) a treemap. Both are typically used to represent parts of a whole. For those not familiar with it, a treemap is very similar to a pie chart but clearer, and unlike a pie chart it lets us add another parameter, color, like we did with the world map. As usual, click on the “Click to Interact” to play with it.
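As a rough sketch of the preparation behind a parts-of-a-whole chart like this, the snippet below separates the rows for one origin, totals them by destination, and turns the totals into shares of the whole. The row layout `(origin, destination, refugees)`, the helper names, and every value are hypothetical placeholders; the real spreadsheets are laid out differently and the numbers here are not UNHCR figures.

```python
from collections import defaultdict

# Hypothetical rows: (origin, destination, refugees). Placeholder values only.
rows = [
    ("Origin X", "Host 1", 500),
    ("Origin X", "Host 2", 300),
    ("Origin Y", "Host 1", 200),
    ("Origin X", "Host 1", 100),
]


def destinations_for(rows, origin):
    """Total refugees per destination for one origin, largest first."""
    totals = defaultdict(int)
    for row_origin, destination, refugees in rows:
        if row_origin == origin:
            totals[destination] += refugees
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


def shares(totals):
    """Convert (name, total) pairs into (name, fraction of the whole)."""
    whole = sum(total for _, total in totals)
    if whole == 0:
        return [(name, 0.0) for name, _ in totals]
    return [(name, total / whole) for name, total in totals]


subset = destinations_for(rows, "Origin X")
parts = shares(subset)
```

Either chart type is drawn from the same shares: a pie chart turns each fraction into an angle, while a treemap turns it into a rectangle's area.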
Let’s compare this to what the Guardian did. They used a pie chart. The reason I didn’t go for a pie chart here is that after Pakistan and Iran it’s really hard to distinguish between the rest of the countries! My rule of thumb is that when there are more than 4 or 5 segments, I never use a pie chart and usually opt for a treemap instead.
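The rule of thumb is simple enough to write down as a tiny helper. This is only a sketch: `suggest_chart` and its `max_pie_segments` parameter are names invented here, and the default of 5 is an assumption that takes the upper end of the "4 or 5" above.

```python
def suggest_chart(values, max_pie_segments=5):
    """Suggest a parts-of-a-whole chart based on the number of segments."""
    # Only segments with a positive value would actually be drawn.
    segments = [value for value in values if value > 0]
    if len(segments) <= max_pie_segments:
        return "pie chart"
    return "treemap"


# Nine placeholder values, standing in for nine destination countries.
nine_destinations = [9, 8, 7, 6, 5, 4, 3, 2, 1]
choice = suggest_chart(nine_destinations)
```

With nine destination countries this suggests a treemap. Passing `max_pie_segments=4` gives the stricter version of the rule.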
A side thought: is showing a pie chart wrong in this case? Not necessarily – it depends on the question/intention of the analyst. If the question is where most of the refugees from Afghanistan go, the pie chart works just fine. It immediately tells us that most of them go to Pakistan and Iran. However, if you’re looking to understand the full distribution of the data, it’s not sufficient.
Additional ramblings on Data Visualization –
I want to talk about my thoughts on the value of data visualization and how I approach it when I play with data. I see a lot of data visualization being dismissed as just “cool” and of very little value. Data visualization can indeed be of little value, not because of the visualization itself but because of the intention behind it and the questions we ask of the data. If my data visualizations are not answering the questions I want to understand, then yes, they’re just cool (if they look good!) and not valuable. If I’m looking at someone else’s visualization and fail to understand the intention, then I’ll think of it as having very little value. But if used and understood correctly, it can be extremely valuable.
Word of caution: I think I’ve referred to this before, but it’s very easy with data graphics to choose something that looks sophisticated but doesn’t answer your questions very well. Examples are going for a 3D graph instead of 2D, or passing over a simple bar chart. The way I personally try to avoid choosing coolness over functionality is by relentlessly looking for value.
I drew a chart of the data analysis process here –
If used well, data visualization can be immensely valuable in both exploration and communication. Exploration: it can make it so much easier to look at numbers (like I did in some of my last posts). Communication: a powerful graphic with the final results usually makes a bigger impact than text.
Would absolutely love to hear your thoughts. Today was fun. Will be back next week with another data set and another topic!