Mind the Gap

March 21st, 2008

(This post is the first in a three-part series on data sharing websites. However, I reserve the right to add more parts should I come across more worthy pages.)

While doing research on new data sources, I came across an excellent site I hadn’t heard of before called Gapminder. I suspect the name is a play on the classic London Underground slogan, “Mind the Gap”, but I can’t be sure, especially since the site is run by a Swede named Hans Rosling (who apparently is a sword swallower in his spare time - try saying “Swede sword swallower” five times fast). Gapminder is an interesting data sharing site powered by Hans’ pioneering Trendalyzer software, which was acquired by Google last year.

So what is Trendalyzer? It’s basically a data visualization package with a fully functional (and easy-to-use) graphical user interface. There’s a beta version of the software that you can try out here, something I’ll discuss in more detail later. It comes preloaded with a bunch of interesting health and economics-related data (like infant mortality and GDP per country) which can be plotted vis-a-vis each other on the X and Y axes. I could say more, but it’s much easier to watch one of Hans’ “Gapcasts” to see the software in action for yourself. (Or go ahead and try out that program link I put up).

That’s not the only reason to watch the Gapcasts, though. Because while there’s a lot of great content and links on Gapminder, I think much of the real meat of the site is in Hans’ Gapcasts, which Gapminder releases for free over the web. At the time of this writing, there are 10 Gapcasts available for download. In my never-ending quest to fill my free time with educational activities, I sat down and watched every one of them, all but one of which are hosted by the man himself.

The Gapcasts get off to a bit of a shaky start (of course, this often happens with new endeavors, including this blog), but later on the presentations are much cleaner and more to the point. They’re all worth watching, if only to see Hans’ great Trendalyzer software in action. (I particularly liked 5, 6, and 10 if you’re looking for any recommendations.)

As I watched the Gapcasts, I took notes about what was good and what could’ve been better. Speaking of which, if you haven’t watched any of them yet, I encourage you to do so now, since what I say will make a lot more sense if you’ve seen a couple.

Go ahead and watch one or two. It’s ok, I’ll wait.

Back? Good.

Before we start, I should mention that I’m only going to talk about the Trendalyzer software here, since it’s the focus of all the Gapcasts. And if you’d like to see the basis for my data visualization comments below, check out my previous post on the subject here (which draws largely on the work of Edward Tufte and Steven Few).

One neat thing I noticed immediately about Trendalyzer is that, with the dataset Hans includes, there are a lot of non-time-series pairings (that is, where time is not a variable on either axis) that Edward Tufte would no doubt approve of. Interesting side note: it took many years after the invention of the statistical graphic for people to regularly use something other than time on the X axis. Though David Bowie once sang “Time may change me, but I can’t trace time”, it sure hasn’t stopped us from trying.

As it happens, time actually is included as a third variable (if you hit a certain button) resulting in a series of “small multiples” (over time in this case), another Tuftean principle. In addition, Hans offers many compelling narratives with these data, like showing the deep statistical similarities between France and Turkey in public health. Interestingly, those narratives often turn out to be extremely short; many are only a few minutes long. A good chart is worth a thousand words, it seems.

But Hans is succinct in other ways, too. He adds yet another dimension to the data by use of color coding (usually for the region of the world a country is from), a practice Steven Few recommends. This means that every bit of data you see in the Gapcasts has at least four dimensions on a single graph (x, y, time, and color), which makes for dense sets of data that maximize the “data-ink” ratio Tufte talks about so much. Ironically enough, the “good” example from that last link is from Steven Few himself. (You may also notice my debt to Few in this blog; a good number of graphs on the Mine Shaft look just like that, since I follow many of Few’s recommendations directly.)

When you add to all this the ability to isolate and animate data points (with trails) at will, you have yourself a pretty snazzy package on the whole. It’s no surprise Google bought it. Trendalyzer is a very interactive and user-friendly web application that can really help you to visualize and understand complex datasets in a hurry. And whereas changing the data in charts to cross-compare is usually a chore at best, here you can seamlessly do so in a couple of clicks.

Before you start to think of me as some sort of Google shill, I do actually have some criticisms to make of Trendalyzer. First of all, Trendalyzer tries its best to add a fifth dimension to the data via the size of the data points, which typically suggests the population of each country in the datasets. Trying to raise the data-ink ratio still higher is an admirable goal, but the problem is that circles are hard to compare with each other in a meaningful way. It’s the same reason why Tufte says pie charts are a bad idea.

The core problem with circles is that their area grows with the square of their radius, which is a nonlinear relationship: double the radius and you quadruple the area. As I learned from my research into data visualization, people have a much easier time comparing things linearly. That’s why pizzas are sold as small, medium, and large, and not as 14″, 16″, and 17″ pizzas. (Though, rappers usually talk about their rims in terms of their diameter, like 20s. Maybe they’re secretly great mathematicians?)
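
To put numbers on the pizza example, here’s a quick Python sketch. The three diameters come from the paragraph above; the function name is just something I made up for illustration.

```python
import math

def circle_area(diameter):
    """Area of a circle, given its diameter."""
    return math.pi * (diameter / 2) ** 2

# Pizza diameters in inches, taken from the example above.
diameters = [14, 16, 17]
areas = {d: circle_area(d) for d in diameters}
base = areas[14]

for d in diameters:
    # Compare how much wider vs. how much bigger each pizza is than the 14".
    print(f'{d}" pizza: {d / 14:.2f}x the diameter, {areas[d] / base:.2f}x the area')
```

The 17″ pizza is only about 21% wider than the 14″ one, but it has about 47% more pizza on it. That mismatch between the width you see and the area you’re supposed to judge is exactly what makes circles hard to compare.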

When you compare circles that are even modestly different in size, your estimates of their area will often vary widely from their true size. In the context of a graph, that’s not very good. The best you can probably do with circles is pizza-like groupings of small, medium, and large. Even if you adjust the scale and sizing of the circles (which Trendalyzer allows you to do), it only partially deals with the problem.

To me, it would be better if Trendalyzer used vertical bars or skinny vertical rectangles to show population size, which you could easily and accurately compare with each other. This approach might not be as eye-catching, but it would probably be easier to judge.

Trendalyzer could also take the tried-and-true map approach of having different symbols for differently-sized populations, another unambiguous way to show them. This would make it harder to judge the size of all the countries at a glance, but at least it would be extremely clear what’s going on.

Anyway, as far as the Gapcasts themselves go, there’s a lot of chartjunk, bad backgrounds, and needless animation, especially in the first few. Thankfully, Hans seems to have learned his lesson quickly, and most of that stuff has been cut out of later Gapcasts. (Would that make him a Clever Hans? I think the Gapcast Hans has a little more insight into his datasets, though.)

Another problem with the Gapcasts in particular is their use of logarithmic scales. On the one hand, logarithmic scales help people see exponential trends as linear ones, which is good. However, if you’re trying to compare data points and see trends over time, the difference in scale is a major issue. Hans mentions in the Gapcasts that logarithmic scales are justified because population growth and other trends he studies are exponential, but I was never sold on the idea. I read the same notion in Few’s book and in William Cleveland’s “Visualizing Data” and I wasn’t convinced then either.
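
For anyone who wants to see why a logarithmic scale turns an exponential trend into a straight line, here’s a small Python sketch. The numbers are made up (a hypothetical quantity starting at one million and growing 3% a year); they are not from Hans’ datasets.

```python
import math

# Hypothetical series: starts at 1,000,000 and grows 3% per year.
values = [1_000_000 * 1.03 ** year for year in range(6)]

# Year-over-year steps on a linear scale keep getting bigger...
linear_steps = [b - a for a, b in zip(values, values[1:])]

# ...but on a log10 scale every step is the same size, i.e. a straight line.
log_steps = [math.log10(b) - math.log10(a) for a, b in zip(values, values[1:])]

print([round(s) for s in linear_steps])
print([round(s, 5) for s in log_steps])
```

The catch is the flip side of the same property: equal distances on a log axis mean equal ratios, not equal amounts, so two gaps that look identical on the chart can be wildly different in absolute terms.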

From what I’ve read (and from my own experience), people just don’t get logarithmic scales, and they are much less intuitive in general. No amount of drilling the decibel and Richter scales into my brain has given me an intuitive understanding of how they work. I always have to remind myself that an 8 on the Richter scale releases roughly 1000 times more energy than a 6, and shakes with 100 times the amplitude. (In fact, I had to look it up just now.)
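
Since I can never keep the Richter arithmetic in my head, here it is as a Python sketch. It assumes the standard textbook relationships: each whole magnitude step is a 10x increase in measured amplitude, and radiated energy scales as 10 to the power of 1.5 times the magnitude. The function names are my own.

```python
def amplitude_ratio(m_big, m_small):
    """How many times larger the measured shaking amplitude is."""
    return 10 ** (m_big - m_small)

def energy_ratio(m_big, m_small):
    """How many times more energy is released (energy ~ 10^(1.5 * M))."""
    return 10 ** (1.5 * (m_big - m_small))

print(amplitude_ratio(8, 6))  # two magnitude steps in amplitude
print(energy_ratio(8, 6))     # two magnitude steps in energy
```

So the jump from 6 to 8 is a factor of 100 in amplitude and about 1000 in energy, which is not something the innocent-looking difference of “2” suggests.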

So I say use linear scales whenever possible. Thankfully, you can do just that if you want. If you check out that beta version of Trendalyzer I linked, you can set the scales to be linear or logarithmic on any variable. This is no trivial change. I think if you go back and do the same data analysis that Hans does in his Gapcasts, but with linear scaling, many of his points are easier to understand (and also less justified). You can still get the possible exponential effects if you imagine large jumps on the graph, and you can control the viewable range of the data points to accommodate these wild swings as well. Controlling the range can also help in banking the graph to 45 degrees, which Cleveland proposes as the best aspect ratio for viewing data trends.
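
If you’re curious what “banking to 45 degrees” looks like in practice, here’s a rough Python sketch. It’s a simplified take on the median-absolute-slope idea: pick the height-to-width ratio that makes the typical line segment sit at 45 degrees on screen. The function name, the sample data, and the choice to skip flat or vertical segments are all my own assumptions, not Cleveland’s exact procedure.

```python
from statistics import median

def bank_to_45(xs, ys):
    """Return a height/width aspect ratio so the median segment is drawn at 45 degrees."""
    x_range = max(xs) - min(xs)
    y_range = max(ys) - min(ys)
    slopes = []
    for (x0, y0), (x1, y1) in zip(zip(xs, ys), zip(xs[1:], ys[1:])):
        if x1 == x0 or y1 == y0:
            continue  # assumption: ignore vertical and flat segments
        # Slope measured in "fraction of the plot" units rather than data units.
        slopes.append(abs(((y1 - y0) / y_range) / ((x1 - x0) / x_range)))
    return 1 / median(slopes)

# Hypothetical zigzag series: it goes up and down by 1 at every step.
print(bank_to_45([0, 1, 2, 3, 4], [0, 1, 0, 1, 0]))
```

For the zigzag above the answer is 0.25, meaning the plot should be four times as wide as it is tall. Draw it that way and every segment rises or falls at 45 degrees.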

Thus, with some scale changes and creative range-finding, you can fix most of my criticisms of Trendalyzer. There’s not much to be done about the circles, but given how innovative and great the beta version of this software is, how can I really complain? (I’m not trying to pump up my search ranking here, honest - I just genuinely like the product.) Frankly, like these people, I’m just waiting on a version of Trendalyzer where I can use it with my own data. Given Hans’ commitment to opening up data sources, and the fact that Trendalyzer is owned by Google, I have faith this will eventually happen, something that would make me, a self-confessed “data vulture”, very happy.

Speaking of being a data vulture, if you play around with the Trendalyzer software, you’ll see that you can easily dump any bit of Hans’ datasets as a spreadsheet (as well as lots of other links and facts about the graphs). For a guy like me, this is like digging ditches for a living and having someone drive by and throw you a bag of money. I was almost not sure how to react. I kind of thought to myself, “Wow, that’s it? You mean I already have the data in a spreadsheet and it’s done? Don’t I need to clean this up or link it to something else or copy and paste 80 times?” It’s almost too easy. If information wants to be free, I think Hans Rosling and Trendalyzer will end up being a part of the jailbreak.

3 Responses to “Mind the Gap”

  1. Ben Says:

    Hans is in fact a sword swallower, I think you would enjoy this link.

    http://www.ted.com/index.php/talks/view/id/140

    Thanks for bringing that to my attention, though I’d actually seen it before. Pretty crazy stuff!

    - Dave

  2. The Data Mine Shaft » Blog Archives » I Swivel Data Says:

    […] (This post is the second in a three-part series on data-sharing websites. Part one is here.) […]

  3. The Data Mine Shaft » Blog Archives » Eyes on the Prize, Part One Says:

    […] post is also the third in a series on data-sharing websites. Here are parts one and two of that […]
