Eyes on the Prize, Part One
April 11th, 2008
(This week is the first of a two-part post on the data sharing site “Many Eyes”. It was such a big site with so much content that I had to break this post into two parts to make it manageable. In part one, I’ll cover general trends with an overview. In part two, I’ll upload a couple of datasets to talk about, and describe more specific findings.
This post is also the third in a series on data-sharing websites. Here are parts one and two of that series.)
“Many Eyes” is a data-sharing and visualization website, a result of a research and development (R & D) project funded by IBM. Their basic goal is to improve data visualization techniques and to “democratize” data analysis and visualization.
Because it appears to be a research project (as opposed to a business venture), Many Eyes has a much more sprawling and ambitious nature compared to a more commercial data-sharing site like Swivel, a startup I blogged about recently. If you look at both sites, you can see pretty quickly that Swivel has a much different “vibe” than Many Eyes.
For example, it’s easier to import your data to Swivel (as you might expect), but Many Eyes has a much deeper and greater variety of data visualizations. If you check out that page I linked, you’ll see a dizzying array of different ways to present your data, whether it’s quantitative or text. Each one is deep enough, in fact, for me to go through each visualization here in turn. (Describing these visualizations will also take up the vast majority of this post.) In that way, this post is as much about charting as it is about Many Eyes and data sharing, which I figure is a nice bonus (if you’re a data nerd like me, anyway).
The first set of visualizations are geographic, where you overlay datasets on top of a whopping 14 different countries (for the world map) or states/provinces/prefectures (for the country maps).
Anyway, if you have data that’s organized by country or region, this kind of map helps you visualize it easily and see regional trends. As I’ve learned elsewhere (scroll down to the second page) in the work of Edward Tufte, geographic visualization works especially well with large datasets. That is, it can be a very “data-dense” style of visualization, maximizing Tufte’s famous “data-ink ratio”. (I’m sure I’m abusing the “data-pixel” ratio myself, if you extend this concept into the digital realm.)
However, you might wonder (as I did) how Many Eyes can standardize the country/region data. They bring up the example of “The Bahamas” vs. “Bahamas, The” being in the same dataset. Obviously, you’d want both those data points to be grouped together, even though they are nominally different. Apparently Many Eyes has a “built-in facility” for dealing with these difficulties (a nice touch), but they also built in a disambiguation tool where you work with Many Eyes to determine what to do with difficult bits of data. You can see a screenshot of this tool on the geographic visualizations page, which looks pretty cool to me. I don’t happen to have any good geographic data handy, so I won’t be exploring these visualizations directly, but it seems like they did a good job with them here.
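To make the “The Bahamas” vs. “Bahamas, The” problem concrete, here’s a minimal Python sketch of one way such names could be folded together before grouping. To be clear, this is not how Many Eyes does it (I don’t know how their facility works); the function names and the two rules below are assumptions I made up for illustration.

```python
from collections import defaultdict


def normalize_place(name: str) -> str:
    """Return a canonical key for a place name (hypothetical rules)."""
    name = " ".join(name.split())  # collapse stray whitespace
    # Turn "Bahamas, The" into "The Bahamas".
    if "," in name:
        head, tail = (part.strip() for part in name.split(",", 1))
        if tail.lower() == "the":
            name = f"{tail} {head}"
    # Drop a leading article so "The Bahamas" and "Bahamas" also match.
    if name.lower().startswith("the "):
        name = name[4:]
    return name.casefold()


def group_by_place(rows):
    """Sum the values of rows whose place names normalize to the same key."""
    totals = defaultdict(float)
    for place, value in rows:
        totals[normalize_place(place)] += value
    return dict(totals)


# Made-up data: two spellings of the same country plus one other.
rows = [("The Bahamas", 10.0), ("Bahamas, The", 5.0), ("Japan", 7.0)]
totals = group_by_place(rows)
```

Rules like these only catch the easy cases, which is presumably why a manual disambiguation step is still useful for the rest.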
The next set of visualizations are line graphs and stacked graphs. Line graphs are a classic way to visualize data over time, among other things. Regular readers of this blog will notice that I make liberal use of them (I’m guessing they’re the most common type of graphic that I use), and thus I’m a big fan. In fact, the dataset I uploaded to Swivel was charted as a line graph. Many Eyes’ line graphs follow a lot of charting “best practices”, where there are only a few muted grid lines, with calm colors and horizontal labels on each axis, etc. I don’t want to get too much into specifics here (that’s what part two is for), but the line graphs seem like they are implemented fairly well.
If you read the bottom of the page on line graphs, you’ll notice a section called “expert notes”. Those are extremely low-key sets of notes at the end of most of these visualization descriptions. Fellow data nerds, take note: they’re amazing. As someone who’s read a fair amount about how to visually present data, I am stunned at the high-quality information they cram into these few sentences. The “expert notes” sections do an incredible job of summarizing a huge amount of visualization research in just a paragraph or two. If you care about data visualization at all, I suggest you go out of your way to read all of them. Ironically enough, except for the general conclusions, I agree with everything they say in each set of notes. (More on that later.) In this case, I think the conclusions on line graphs are right on the money, so for now it’s a moot point.
The other type of visualization here is stacked charts, with or without categories. Stacked charts are line charts with the area beneath the line filled in so you can easily see the values of the data (via the area beneath the line) at different intervals. These lines, with their filled-in areas, are stacked on top of each other for a cumulative effect. This way you can visualize “part-to-whole” relationships as well as the value of any particular data series.
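For the curious, the “stacking” is just a running sum. Here’s a short Python sketch of how the bottom and top edge of each filled-in band could be computed; the function name and the sample numbers are made up, and this is not taken from Many Eyes.

```python
def stack_series(series):
    """Return a (bottom, top) pair of edges for each series in a stacked chart.

    series: a list of equal-length lists of non-negative numbers,
    given in the order they are stacked (first one at the bottom).
    """
    baseline = [0.0] * len(series[0])
    layers = []
    for values in series:
        # Each band sits on top of everything stacked before it.
        top = [b + v for b, v in zip(baseline, values)]
        layers.append((baseline, top))
        baseline = top
    return layers


# Three made-up data series measured at three points in time.
sales = [[1, 2, 3], [2, 2, 2], [4, 1, 0]]
layers = stack_series(sales)
```

Notice that only the first band starts from a flat baseline of zero. Every other band rides on top of whatever is below it, so its own values are the gap between two moving lines.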
In the expert notes, they claim that “Stack graphs are a standard, useful chart.” That’s the only sentence I disagree with on the page, as it turns out. They specifically mention that stacked charts make it hard to see what individual values are, as well as make it hard to compare values, but they seem to suggest that these are not fatal flaws. From what I’ve read in the works of Edward Tufte, Stephen Few, and elsewhere, I beg to differ. The main purpose of a stacked chart is to compare parts to a whole. In anything but the most general terms, they fail at this. Try it for yourself and you’ll see that you just can’t get exact comparisons, because you lack any good frame of reference.
Even worse, it’s hard to get individual values at a glance by looking at a stacked chart. As Tufte and others have suggested, you can easily fix these problems by having lots of line graphs next to each other instead, or by clever use of many sets of bar graphs, or even just using a table with percentages. In short, stacked charts are for the birds. (And corporate annual reports.) Nevertheless, they’re there if you want to use them.
After these are the visualizations for comparisons. Leading the way is the humble bar chart. Every bit the equal of a line chart, it’s a classic chart that’s been in use practically as long as charting itself has. Bar charts simply plot one axis vs. another and make a bar to represent that relationship. They’re probably the second-most-common chart you’ll see on this site, and I think they’re great, of course. You can use them with vertical bars (good for time-based data) or with horizontal bars (better for data in categories). Either is extremely useful in its own context. (Again, more on the specifics later.) And as the “expert notes” mention, watch out for datasets that require a large number of bars.
The close cousin of the bar chart is the block histogram. It’s like a bar chart that splits the bars into “bins”, or standardized squares stacked on each other to make bars. They’re great for discrete datasets (like the number of students per classroom) and for showing the distribution of some variable. As Many Eyes says, they’re a bit like stem-and-leaf plots, which you’re probably familiar with if you’ve ever had a class in statistics. (Sorry if I brought up any bad memories here. Don’t worry - there won’t be a test later.) Block histograms may not be as versatile as line or bar charts, but they’re great for specific situations.
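If you’ve never seen one, here’s a rough text-only version in Python, drawn sideways with one “#” per data point. It’s only a sketch of the idea with made-up classroom sizes, not a reproduction of the Many Eyes chart, and the function name is my own.

```python
from collections import Counter


def block_histogram(values, block="#"):
    """Draw one block per data point, one row per integer value."""
    counts = Counter(values)
    lines = []
    # Include empty rows so gaps in the distribution stay visible.
    for v in range(min(counts), max(counts) + 1):
        lines.append(f"{v:>3} | {block * counts[v]}")
    return "\n".join(lines)


# Made-up data: number of students in each of seven classrooms.
class_sizes = [18, 20, 20, 21, 21, 21, 23]
print(block_histogram(class_sizes))
```

Because every block is a single data point, you can count the blocks in a row to get an exact value, which is the part that feels like a stem-and-leaf plot.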
After this, the charts start becoming more niche and experimental, so I’ll cover them a bit differently. The other two sets of comparison charts are bubble charts and matrix charts. If you read my previous post on Gapminder and Trendalyzer, you’ll see that bubble charts are a lot like the animated charts in Trendalyzer. The sizes of the bubbles are themselves data points, showing a data value through the size of each circle. Those bubbles can be plotted against both an x and a y axis as necessary, allowing for three dimensions, or pieces of data, per data point. In this way, bubble charts are more “data dense” than more typical charts like line and bar charts.
The trouble with bubble charts is that, as Many Eyes explains, they are “performing a visual square-root transform on the data set” (much like viewing a chart on a logarithmic scale). That is, the area of a circle grows with the square of its radius (since the area of a circle is pi * radius squared). So when you compare two circles, you probably won’t realize that one with even a slightly larger radius can have a much larger area (i.e. data value) than the other.
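To put numbers on that square-root effect, here’s a small Python sketch. It assumes the data value is mapped to the circle’s area; the function names and the sample values are mine, not anything from Many Eyes.

```python
import math


def radius_for_value(value, scale=1.0):
    """Radius of a circle whose area is proportional to the data value."""
    return scale * math.sqrt(value / math.pi)


def area(radius):
    """Area of a circle: pi * radius squared."""
    return math.pi * radius ** 2


# Two made-up data values, one four times the other.
r_small = radius_for_value(100)
r_large = radius_for_value(400)
ratio = r_large / r_small  # how much wider the larger bubble is
```

Under that assumption, a value four times larger gets a bubble only twice as wide, so judging bubbles by their width will understate the real difference.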
If you’ve read my other posts on charting, this is familiar territory. It’s the same reason I dislike pie charts - a topic I’ll revisit here in a bit - which is that the bubbles are hard to compare on anything but the most basic level, and it’s hard to figure out the areas of the circles. Not to burst your bubble (hey, stop groaning), but it’s better to use horizontal bar charts, or color instead of size as the variable, or shapes other than bubbles, like lines or small rectangles (which don’t have these issues).
The last type of data comparison chart is the matrix chart. The best way to describe these, as they mention, is as a sort of visual/chart-based pivot table. If you find that description a bit baffling, an excellent example that most of you should be familiar with is the circle ratings in Consumer Reports. Matrix charts are an excellent way to see the same category-based information you’d see in a pivot table but in a more visual format.
There’s nothing wrong with matrix charts in theory, but in practice they are usually plagued by the same circle comparison problems that bubble and pie charts have. Consumer Reports actually side-steps this issue elegantly with their 5-level rating system, using two circles to make each of the 5 levels clear. This allows, along with their use of color, for an extremely dense and visual layout of the data. Most matrix charts will probably not be so well-designed, though, so watch out.
Next up are more abstract relational charts. The last of the “classic charts”, the scatter plot, is in this category. Like all classic charts, the premise is simple. All you do is plot data points on two related axes. Scatter plots are great, but more experimental since they can be extremely data dense and they often do not have time as one of their two axes/variables. Once you get used to them, though, they’re really useful. (You’ll see me use them here from time to time.)
In addition, Many Eyes allows you to add in size to the data points to increase the number of dimensions per data point. In this way, these scatter plots function almost exactly like Trendalyzer charts, except they can’t be animated on the spot to show the passage of time (too bad, since this is a pretty cool feature). Of course, that brings up the same issues of comparing circle size. (Am I beating a dead horse yet? It’s an important and underappreciated point, though, so I feel it bears repeating.)
Network diagrams are the other type of abstract relational chart. All they do is visually show the relationship between two points by connecting them with lines (with arrows to show the relationship direction, if necessary). They remind me of Planarity, the vertex untangling game I played some months back, where the goal was to make all the lines not overlap. The game is way more fun than it has any right to be, if you ask me. (Yep, I’m a nerd in more ways than one.) If you put in a lot of vertices, it can get quite challenging. However, once I read (via the social bookmarking site Reddit) a general strategy for solving any vertex map, it was less exciting. (True nerds reading this post should try and figure out the algorithm on their own.)
There’s a lot to be learned about network diagrams (it’s a branch of graph theory in mathematics, a topic familiar to any computer science folks reading this) and Many Eyes is actively researching the topic. As it turns out, I actually participated in some of their research, where I attempted to provide some order to a random permutation of a network diagram. It was an interesting experience, taking nodes and dragging them around without any direction. I actually had to spend a lot of time coming up with a coherent order. In all, though, these are still fairly experimental charts with fairly limited uses. Still, keep your eyes peeled for these in the future.
The last set of charts covers part-to-whole data, namely pie charts and treemaps. Most people have seen a pie chart before. The “slices” of the pie tell you what proportion each bit of data (represented by the area of the slice) has in relation to the whole.
If you’ve been paying attention (or even if you haven’t been, really), you should know by now how I feel about pie charts. The “expert notes” for pie charts are telling. They mention ominously that pie charts have “a mixed reputation”. Because of the same problems I mentioned earlier, they are horrible for picking out data points, and comparing pie slices is even harder than comparing circles.
What else could they be good for? Many Eyes suggests that pie charts may change the way someone thinks about problems involving these datasets, but given the fact that they can screw up your intuition so badly, this may in fact be a really bad thing. There just isn’t an excuse for using them, as far as I’m concerned. As I suggested before, far better to use horizontal bar charts or color instead of size as the variable, or shapes other than bubbles, like lines or small rectangles. Or just about anything else, really; even using a data table or slicing up a rectangle would be better (like with treemaps, which I talk about next) for comparison purposes. If you must use a pie chart, keep the number of data points small, and your slices above a certain size. (And for God’s sake don’t add 3D effects to it later. 3D pies are for eating, not modeling!)
Right next to pie charts on the list are treemaps, which are a far better candidate for part-to-whole comparisons. Like matrix charts, these are a lot like visual pivot tables. Treemaps basically take data and organize it into rectangles, which represent the size of the data. Rectangles which encompass other rectangles represent categories. This is much better explained visually, so go check out that Many Eyes treemap link if you haven’t.
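If you’d rather see the idea in code, here’s a Python sketch of about the simplest treemap layout I can think of: each category’s rectangle is cut into strips for its children, in proportion to their sizes, switching between side-by-side and top-to-bottom cuts at each level. This is only an illustration; I don’t know which layout Many Eyes actually uses, and the function names and the little “disk” example are made up.

```python
def total(node):
    """Size of a node: its own size for a leaf, the sum of its children otherwise."""
    name, payload = node
    if isinstance(payload, list):
        return sum(total(child) for child in payload)
    return payload


def layout(node, x, y, w, h, horizontal=True, out=None):
    """Assign every node a rectangle (x, y, width, height).

    A node is (name, size) for a leaf or (name, [children]) for a category.
    Names are assumed to be unique, since they are used as dictionary keys.
    """
    if out is None:
        out = {}
    name, payload = node
    out[name] = (x, y, w, h)
    if isinstance(payload, list):
        whole = total(node)
        offset = 0.0  # fraction of the parent already used up
        for child in payload:
            share = total(child) / whole
            if horizontal:
                # Place children side by side across the parent's width.
                layout(child, x + offset * w, y, w * share, h, False, out)
            else:
                # Place children top to bottom down the parent's height.
                layout(child, x, y + offset * h, w, h * share, True, out)
            offset += share
    return out


# A made-up folder tree with file sizes.
disk = ("root", [
    ("music", [("a.mp3", 30), ("b.mp3", 10)]),
    ("video.avi", 60),
])
rects = layout(disk, 0, 0, 100, 100)
```

Every rectangle’s area ends up proportional to its size, and the “music” rectangle exactly contains its two files, which is the nesting described above.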
I was first exposed to treemaps through an excellent program called SequoiaView, which visually shows the size of files of your entire hard drive (or any subfolder) at a glance. I later found a similar program called WinDirStat which is even better for this task - I think it makes better use of color and has some interesting tools. If you check out either of these programs, not only will you get an excellent utility for analyzing your hard disk usage (double-clicking to drill down into subfolders is awesome), but you’ll also see an exceptionally powerful and useful example of treemaps in action. I have found many huge files just taking up space by firing up WinDirStat for 5 minutes here and there. I hear computer techs at places like Geek Squad make good use of these tools, so they’re not just academic curiosities.
The final set of visualization tools are for text. First, there’s the oh-so-common tag cloud, which, as Many Eyes notes, has often been used without any regard for context or how useful it is. Still, tag clouds are extremely simple and data-dense, so in the right contexts they are quite useful. (Probably when there are a few, really interesting and frequently-used words.)
Thankfully, Many Eyes filters out “stop words”, which are extremely common words like “the” and “about” that occur all the time but don’t add much to your tag clouds. The phrase “stop words” is new to me, though I am familiar with the idea. (And if you thought “love” was a “stop word” then you should probably stop listening to The Supremes quite so much.) Still, I wish I had known what these were called back when I did my post on video game names, as it would have allowed me to remove a more comprehensive list of useless words, instead of the 8 that I came up with.
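Here’s roughly what stop-word filtering looks like in Python, as a sketch. The tiny stop-word list below is my own illustrative one (real lists are much longer, and I don’t know which one Many Eyes uses), and the sample title is made up.

```python
import re
from collections import Counter

# A deliberately tiny, illustrative stop-word list.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "about"}


def word_counts(text, stop_words=STOP_WORDS):
    """Count how often each word appears, skipping the stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in stop_words)


counts = word_counts("The Legend of the Lost Sword and the Sword of Legend")
```

The counts that survive the filter are what a tag cloud would then turn into font sizes.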
Even better, about a week ago Many Eyes came out with a tag cloud that compares 2 texts at the same time. The size of each word shows you the frequency, and the color (red or blue) tells you which of the texts it came from. Thus you can tell at a glance how often words come up in each body of text.
I think comparative tag clouds are extremely cool and useful, and I will definitely be getting some mileage out of them in the future. I haven’t really seen anything like these before, so I can’t wait to give them a try. (They added this feature on April 1st, but it’s not a joke. It made me wonder for a bit, though, when I first read about it. As you might have noticed, data nerd jokes tend to be a bit on the dry side.)
Last on the list is another text comparison tool, the word tree. You plug in a body of text, then when you search for a word, it shows you via a tree-like structure all the phrases that come after it. Many Eyes uses punctuation to help make sense of how to order the “branches”. More common phrases make larger branches.
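To give a feel for the data structure behind a word tree, here’s a bare-bones Python sketch. It builds a nested dictionary of the words that follow a search word, counting how often each branch occurs. Unlike Many Eyes, it ignores punctuation entirely and simply stops after a fixed number of words; the function name, the `depth` parameter, and the sample text are all my own inventions.

```python
import re


def word_tree(text, root, depth=2):
    """Nested dict of the words that follow `root`, up to `depth` words deep.

    Each entry maps a word to [count, children], where children is
    another dict of the same shape.
    """
    words = re.findall(r"[a-z]+", text.lower())
    tree = {}
    for i, word in enumerate(words):
        if word != root:
            continue
        node = tree
        # Walk down the tree, one following word at a time.
        for following in words[i + 1 : i + 1 + depth]:
            entry = node.setdefault(following, [0, {}])
            entry[0] += 1
            node = entry[1]
    return tree


text = "We like charts. We like maps. We like charts a lot. We need data."
tree = word_tree(text, "we")
```

In this example the “like” branch has a count of 3 and splits into “charts” (2) and “maps” (1), so it would be drawn as the biggest branch.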
It’s very experimental (for example, they use a few techniques to truncate the branches), but also very cool. With a simple search, you can really see the structure of texts at a glance. It’s kind of like being able to read something in parallel instead of serially (like how you read a book). You can tell that these innovative chart types are where Many Eyes really excels.
I’m not sure exactly how to make use of word trees, but I have a few ideas. I bet you could find some interesting trees searching through propaganda speeches or George Orwell’s futurist novel “1984”. Considering how technology is usually used in 1984, though, maybe that’s a double-plus ungood idea after all.
Whew! As you can tell, there are a lot of different ways to visualize your data on Many Eyes. It took me a whole post just to go over them all! Next time, in part two, I’ll go over what it’s like to upload and use specific datasets on Many Eyes. Stay tuned.