Front Loaded
February 1st, 2008
A few months ago I wrote about video game titles. I tried to see which words were popular in video game titles and which weren’t. Sometimes, though, I wonder about other aspects of titles. Anybody who’s ever looked in a phone book has no doubt noticed businesses whose names start with early letters of the alphabet on purpose. (Though supposedly this doesn’t help those businesses much. Sorry, AAAAAAdvantage Plumbing.)
I think this applies to names of people, too. Take a look at this graph, so that you can see what I mean. “A” is the most popular first letter in girls’ and boys’ names, by far. (If there are any Aarons out there reading this, I hope you’re feeling pretty good right about now.) This is despite the fact that “A” is only the third most common letter used in English, and the third most common first letter in words, as you can see here. (Plus, both of those rankings include the extremely common words “a”, “an”, and “and”.)
But are there biases in other titles as well? That’s something I’ve wondered about for a while, though I’ve never had a great way to check. However, the other day my friend Dan pointed out that you can download an entire list of movie titles from the Internet Movie Database (IMDB), as well as lots of other tidbits about those movies. Off I went, then, to download their massive datasets.
I quickly found out how unwieldy they were. There are hundreds of thousands of movies and TV shows tracked on the IMDB, even if you take out all the foreign films and repeats. (For example, a TV show will have separate entries for each episode, and some movies have tons of supporting DVDs, like Harry Potter.) To try and keep things manageable, I went and downloaded their ratings dataset, which seemed to constrain the number of entries somewhat.
After a lot of typical data wrangling (yee-haw), I got the text files into a spreadsheet format and stripped out all the repeats and obviously unimportant entries (like blocks of foreign movies), resulting in a pared-down list of about 150,000 movies (and TV shows). Then I grabbed the first letter of each movie title and counted them up.
The filtered data set I used wasn’t perfect, though. Even though I was using English-language metrics like the word frequency link I mentioned earlier, some of the IMDB titles were obviously for foreign films. Other languages (even those which use Roman characters) might have different patterns for first letters in titles, so this was a bit of an issue. And some movies began with numbers, quotes, apostrophes, and other special characters. I tried to save those names where I could, but I had to throw most of them out.
Still, I didn’t throw out that many movie titles (no more than, say, 5%-10%) since most movies have more conventional names. On the plus side, the IMDB puts articles and the like at the end of titles, like “Godfather, The”, which made their data more meaningful for my analysis.
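For anyone who wants to try the counting step themselves, here is a minimal sketch in Python. It is not the spreadsheet process I actually used. The function name `first_letter_counts` and the sample titles are made up for illustration, and stripping leading quotes is just one guess at how to "save" some of the oddball titles. Anything that still doesn't start with a plain A-Z letter is counted as skipped.

```python
from collections import Counter

def first_letter_counts(titles):
    """Count titles by first letter; return (counts, number_skipped)."""
    counts = Counter()
    skipped = 0
    for title in titles:
        # Assumption: drop surrounding whitespace and leading quote marks
        # so that entries like '"Seinfeld"' can still be counted.
        cleaned = title.strip().lstrip("\"'")
        first = cleaned[:1].upper()
        # Keep only plain A-Z; digits, symbols, accented letters and
        # empty titles are thrown out.
        if len(first) == 1 and "A" <= first <= "Z":
            counts[first] += 1
        else:
            skipped += 1
    return counts, skipped

# Hypothetical sample, written the way the IMDB lists titles
# (article moved to the end).
sample_titles = [
    "Godfather, The",
    "Se7en",
    "12 Angry Men",
    '"Seinfeld"',
    "Matrix, The",
]
counts, skipped = first_letter_counts(sample_titles)
print(counts.most_common())
print("skipped:", skipped)
```

Because the articles are already at the end of each title, no special handling for "The" or "A" is needed here. With a differently formatted list, you would have to strip them first.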
Along the way, I couldn’t help but notice some of the more humorous titles. Some movies have unintentionally humorous names, like “The Gay Deception” (which probably had a very different meaning in 1935). Others were just bad, like “The Tony Blair Witch Project” (#9 worst movie on the IMDB), which apparently features references to “Deliverance” (I’m sure I can guess the scene) and “Cannibal Holocaust”, one of the most violent and disturbing movies out there. (Unless you see a lot of pig gore in your daily life, I guess.) They even track pornographic movies, which I didn’t know about. Porn names like the classic “Pump Friction” were hard to miss as I scanned through the lists. (You have to log in to see that one, by the way. Or the OTHER movie with the same name - I guess even porn directors get writer’s block sometimes.)
Cheeky movie titles aside, what did I find out? Not much, actually. When I ranked all the first letters in movie titles, “A” came in a disappointing 8th (despite being much more common generally) and even “B”, “C”, and “D” were just 3rd, 4th, and 5th respectively. (Though that is kind of a weird pattern, I must say. Maybe people get lazier entering movies as they go? I can see it now: “There are HOW MANY ‘Blair Witch Project’ movies? No way, I’m done with this.”)
I thought of a lot of different ways to show what I was saying in a table or chart. I had to rack my brain for almost an hour to think of a good approach. In the end, as usual, I went with something simple. I came up with a basic chart that plots a letter’s “rank” in the alphabet (“A” = 1, “Z” = 26) against how common it was as the first letter in titles. In this case, “S” was the most-used first letter, followed by “M”, which roughly matched that link I posted on boys’ and girls’ names. (Sounds funny to say that lots of people like “S” and “M”, though.) Anyway, here’s the chart of frequency vs. “rank” in the alphabet:
If you can see a clear pattern in there, you’re a better person than me. That chart is all over the place. If there was a bias towards earlier letters, you would see something like a line angling upward, especially at the beginning. Instead, the chart looks more like noise. There is maybe a slight upward trend at the end, but this is probably because “Q”, “X”, “Y”, and “Z” are so infrequently used anyway, and they are all towards the back of the alphabet. When I compared the common first letters in movie titles to common first letters in words and letter frequency as a whole, there was no discernible pattern either.
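If eyeballing a chart feels too hand-wavy, one way to put a number on "is there a line in there?" is a rank correlation between a letter's position in the alphabet and its frequency rank (1 = most common). The sketch below is not something I ran on the real data. The function names and the toy counts are invented for illustration, and it assumes the counts come in as a dict keyed by uppercase letter. A result near +1 would mean early letters are the most common, near -1 would mean late letters are, and near 0 would mean noise.

```python
from string import ascii_uppercase

def average_ranks(values):
    """Rank values so the largest gets rank 1; ties share their average rank."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def pearson(xs, ys):
    """Plain Pearson correlation; fails if either list is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def alphabet_bias(counts):
    """Spearman correlation of alphabet position vs. frequency rank."""
    positions = list(range(1, 27))  # "A" = 1, "Z" = 26
    freq_ranks = average_ranks([counts.get(c, 0) for c in ascii_uppercase])
    # Pearson computed on ranks is the Spearman rank correlation.
    return pearson(positions, freq_ranks)

# Toy data (not real IMDB counts): perfectly front-loaded, "A" most common.
front_loaded = {c: 26 - i for i, c in enumerate(ascii_uppercase)}
print(alphabet_bias(front_loaded))
```

Using ranks rather than raw counts keeps a runaway letter like “S” from dominating the result, which seemed like the right fit for a chart that is already about rank.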
Issues with foreign movies and titles with special characters notwithstanding, I’d say there’s little evidence that people name their movies with letters at the beginning of the alphabet any more than you’d expect (intentionally or otherwise). In a way, though, that’s probably for the best. Who wants to go see a movie called “AAAAAAASpiderman”? It could work for horror though, I guess. Who’s ready for the screamfest “Aardvarks Attack!”?
(I didn’t include the dataset this week because of the licensing restrictions on the use of IMDB data. Though I probably wouldn’t run afoul of their policies by posting it, I wasn’t entirely sure I wasn’t violating some usage restrictions. They specifically encourage people to spread their data around and use it, but I figured it wasn’t worth taking any chances.)
February 8th, 2008 at 10:05 pm
I think it is cool that you made this post. Posts about things where there is no correlation are interesting as well. Not publishing negative studies is a problem in scientific literature. No one will be able to accuse Dave of the “file drawer effect”.
Thanks, Bob. Always appreciate the kind words.
- Dave