Feeling Chatty

November 30th, 2007

Everyone knows people speak differently in online chat than they do in person, or even in emails. (Except for those people who say stuff like “LOL” in person. I’m afraid there’s no hope for y’all.) The question is, how much do we speak differently in chat?

A quick Google search of mine didn’t turn up much. I mostly found stuff about what chat acronyms mean and search engine optimization and the like. (Try getting anything meaningful out of a search for “ROFL”. Go ahead, I dare you.) I did have another source of data at my fingertips, however: almost a year’s worth of chat logs from my favorite chat client, AOL Instant Messenger (AIM). The results of 78 different chats I had were stored as HTML, but HTML isn’t an easily manipulated format for me, so I found a nifty program that converts HTML to Excel spreadsheets, and then found some cool VBA code that I used to put all the data into one spreadsheet. (That code doesn’t work “out of the box”, I think, so if you use it you’ll probably have to tweak it some. And I have a weird definition of the word “cool”.)

After that, I simply applied some formatting/data processing techniques, much like what I did for my analysis of video game names. That is, I wrote some nifty VBA macros and made liberal use of the “Text-to-Columns” feature to parse out all the words in my various chats. The format of the chat logs was very convenient; I could tell who said what and I could parse it all without much trouble.

As usual, the parsing didn’t work 100% perfectly. (The formatting of the HTML left a little to be desired.) Reading through the few intact phrases that were left was pretty hilarious, though, out of context. My favorites were “my casual crack habit” and “what would Hillary do”.

On to my findings. First, the words used in general, then the differences between me and my friends. Here are the Top 30 words used in my chats, followed by the Top 30 words I used, and the Top 30 my friends used:

All chats             Me                    Friends
Rank  Word    Count   Rank  Word    Count   Rank  Word    Count
1     I       1407    1     ha      848     1     I       867
2     ha      883     2     I       540     2     It      362
3     You     745     3     you     457     3     That    323
4     It      699     4     it      340     4     is      300
5     That    554     5     yeah    246     5     You     298
6     is      530     6     is      230     6     Yeah    212
7     Yeah    458     7     that    230     7     have    176
8     have    309     8     it’s    168     8     like    163
9     so      300     9     so      137     9     so      163
10    like    293     10    have    133     10    Not     152
11    Not     267     11    like    130     11    on      134
12    it’s    246     12    not     115     12    My      131
13    on      231     13    that’s  112     13    was     124
14    about   219     14    about   104     14    be      123
15    just    218     15    lol     104     15    me      122
16    be      213     16    if      100     16    just    121
17    was     212     17    can     99      17    about   115
18    My      205     18    all     97      18    with    109
19    If      198     19    just    97      19    Are     106
20    Are     189     20    on      97      20    don’t   99
21    me      188     21    be      90      21    If      98
22    with    188     22    do      90      22    They    97
23    do      185     23    I’m     90      23    do      95
24    can     183     24    was     88      24    this    92
25    all     177     25    are     83      25    up      89
26    I’m     176     26    out     82      26    I’m     86
27    up      171     27    up      82      27    get     85
28    good    162     28    ok      80      28    can     84
29    They    158     29    with    79      29    good    84
30    don’t   156     30    good    78      30    some    81
Total         10120   Total         5226    Total         5091



(Note that these don’t add up exactly due to noise in the parsing method I chose.)

Nothing mindblowing there. Those lists don’t differ much from these Top 100 and Top 1,000 words used in English. As with my video game post, though, I filtered out 9 common articles, conjunctions, and prepositions: a, an, and, but, for, in, of, to, the. Hopefully that makes these lists more interesting. (All of these 9 words would appear on the Top 30 lists if I didn’t filter them.)
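If you’d rather not wrestle with VBA, here’s a rough Python sketch of the same counting-and-filtering idea. It is not the code behind this post. The (speaker, message) input format, the function names, the choice to lowercase everything, and the sample lines are all assumptions made up for illustration.

```python
import re
from collections import Counter

# The 9 articles, conjunctions, and prepositions filtered out in this post.
FILTERED = {"a", "an", "and", "but", "for", "in", "of", "to", "the"}

def tokenize(message):
    """Lowercase a message and split it into words, keeping apostrophes."""
    return re.findall(r"[a-z0-9]+(?:['’][a-z0-9]+)*", message.lower())

def count_words(lines):
    """Count words per speaker.

    lines: iterable of (speaker, message) pairs (an assumed format).
    Returns a dict mapping each speaker to a Counter of their words.
    """
    counts = {}
    for speaker, message in lines:
        words = [w for w in tokenize(message) if w not in FILTERED]
        counts.setdefault(speaker, Counter()).update(words)
    return counts

# Made-up sample lines, not taken from the real chat logs.
sample = [
    ("me", "Ha, that's the best thing I have heard all day"),
    ("friend", "Yeah, I think it's going to be a good one"),
    ("me", "ha ha I know"),
]

by_speaker = count_words(sample)
overall = sum(by_speaker.values(), Counter())  # combine everyone's counts

print(by_speaker["me"].most_common(3))
print(overall.most_common(3))
```

Calling `most_common(30)` on the per-speaker and overall counters would give lists in the same shape as the three Top 30 columns above.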

Even considering my filtering, though, the word rankings on those top lists are markedly different from the ones in my chats. I’m guessing this is because the language of chatting is different from writing as a whole, with more pronouns and informal usage. For example, I read a long time ago that when researchers analyzed hundreds of phone conversations, the most popular word was “I”, followed by “you”. My own bizarre usage of the word “ha” for anything funny notwithstanding, I found the same was true for chat. (My friends use the word “you” a lot less than me, though, and certainly not the second most. I kind of skew things here - I guess I like to externalize my problems, ha ha.) I had a lot of trouble finding statistics on word usage in internet chat, but I did find this link, which seems to corroborate what I found. I’m guessing me and my friends are fairly representative of chatters as a whole.

One thing you’ll notice, as they mention in that Top 1,000 words link, is that very few words (under 100) account for the vast majority (65 percent or more) of words used. Here’s a graphical representation of said fact, based on my data:

Word Frequency Chart
Total Word Frequency, Sorted by Occurrence in Descending Order



There’s something weird about this finding, like somehow our everyday speech and chats are so much simpler than what we’d expect. I guess we’re mostly saying and typing the same few words over and over. (And over. And over. And…all right, I’ll be good.)
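The “few words cover most of what we say” claim is easy to check once you have word counts. Here’s a minimal sketch, assuming the counts live in a Python Counter. The `coverage` helper and the toy numbers are made up for illustration and are not my real data.

```python
from collections import Counter

def coverage(counts, top_n):
    """Fraction of all word occurrences covered by the top_n most common words."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    top = sum(count for _, count in counts.most_common(top_n))
    return top / total

# Toy counts for illustration only; not the real chat data.
toy = Counter({"i": 50, "you": 30, "ha": 10, "yikes": 5, "hardcore": 3, "aquinas": 2})

for n in (1, 2, 3):
    print(n, round(coverage(toy, n), 2))
```

Plotting `coverage` against `top_n` gives a curve that climbs quickly and then flattens out, which is the cumulative view of the same skew the chart above shows.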

There are a few other interesting facts to note. In my data, “lol” is the most frequent chat acronym, though at #32 in usage it’s less common than you’d expect. (“Lmao” was #90, mostly due to me.) “Yeah” is the most common interjection, coming in at #7. Surprisingly, I couldn’t find any common unintentional misspellings that were used at least 5 times. This is partially due to the fact that words are often misspelled in different ways, and partially because me and my friends are fairly anal when it comes to spelling. (For a sampling of the average spelling level online, try searching for “teh” along with any other phrase in Google and you’ll get to see the scope of the phenomenon first-hand.)

For the scatologically-minded, “shit” is easily the most common curse word, coming in at #88 with 71 uses. The next closest is “fuck”, at #205 with 26 uses. The best part was looking up these, because I found countless variations like “batshit” and “assfuck”. Searching through them rapid-fire was good for a laugh.

“2” is the most common number by far, as I also discovered in my video game post. Also, the number “6” was in there 6 times. If that came out of 6 separate chats I fear this post would have the mark of the beast. (I feel OK though, because “Aquinas” came up 6 times as well. He’s got my back, I think.)

How’s my chatting different from my friends’, then? Well, first of all, collectively they use more words more often than I do. (Big surprise, I know. But how many of them are writing blogs with their fancy-pants vocabularies, huh? Huh?) Here are two charts showing you what I mean:

Me vs My Friends Chart, All Words
Word Usage, Sorted by How Often I Use The Words in Descending Order



(Note that the words considered had to be used at least 5 times overall. Some of the lesser used words were stuff like links and things I’m trying to filter out.)

Me vs My Friends Chart, Common Words by How Often I Use Them
Common Word Usage, Sorted by How Often I Use The Words in Descending Order



(You can see how anomalous my use of the word “Ha” is, since I marked it. What can I say, I like the word. Ha!)

If you look at the “inflection point” on each chart (around a third of the way across) you’ll notice that my friends use a greater diversity of the same words. The first chart shows that this is true for most words I use. The second chart shows this is true for even the top 30 words we use.

It’s not that their vocabulary is much larger than mine. (At least not in chat. Me maybe have problem with write or talk.) We share 797/866 = 92% of words. There are 69/866 = 8% of words that me and my friends don’t share. I have 15, they have 54. So while they use over 3 times as many distinct words as I do, they’re both a small fraction of the total. (My 15 distinct words are 1.7% of the 866 total, compared to 6.2% for their 54.) You can see the same point made in this chart:

Me vs My Friends Chart, Common Words by Word Frequency
Common Word Usage, Sorted by Word Frequency in Descending Order



That chart says we may differ in how much we use the 30 most common words, but there aren’t a lot of obvious patterns, and me and my friends obviously get a fair amount of mileage out of all of them. They just happen to use the greater variety of words we share more often than I do. Nevertheless, the words that are distinct between us are very telling. I prefer words like “hardcore” and “yikes” and they prefer “pissed” and “yea”, although the distinct words were often alternate spellings of words we shared.
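The shared-versus-distinct arithmetic from a couple of paragraphs up can be double-checked in a few lines. This is only a sketch: the `overlap` and `pct` helpers and the tiny example vocabularies are made up for illustration, while the counts (797, 15, 54) are the ones reported in this post. It also shows that the percentages shift a little depending on whether you divide by the 866-word total or by each side’s own vocabulary.

```python
def overlap(mine, theirs):
    """Return (shared, mine_only, theirs_only) as sets of words."""
    mine, theirs = set(mine), set(theirs)
    return mine & theirs, mine - theirs, theirs - mine

def pct(part, whole):
    """Percentage of whole, rounded to one decimal place."""
    return round(100 * part / whole, 1)

# Made-up vocabularies, just to show the set operations.
shared, mine_only, theirs_only = overlap(
    ["ha", "yeah", "hardcore", "yikes"],
    ["ha", "yeah", "pissed", "yea"],
)
print(sorted(shared), sorted(mine_only), sorted(theirs_only))

# The counts reported in this post.
n_shared, n_mine, n_theirs = 797, 15, 54
n_total = n_shared + n_mine + n_theirs  # 866 distinct words overall

# Share of all 866 words that both sides use.
print(pct(n_shared, n_total))

# Distinct words as a share of the 866-word total.
print(pct(n_mine, n_total), pct(n_theirs, n_total))

# Distinct words as a share of each side's own vocabulary.
print(pct(n_mine, n_shared + n_mine), pct(n_theirs, n_shared + n_theirs))
```

Either way you slice it, the distinct words stay a small fraction of the whole.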

I guess the main takeaway for this post is that my friends and I talk a lot alike. But what about chatters as a whole? Well, I can’t tell you that. (Though I suspect it’s a lot like us in many ways, as you can see.) What I can tell you, though, is how common the words in this blog title are. Which is to say, not very. (Even including the subtitle.) Here’s a great link about popular words in blog titles. “Data”, “mine”, and “shaft” are nowhere to be found in the top 100. Guess that means I’m part of the “long tail” of blog names. That’s pretty hardcore. Ha!

(Sorry for not linking to my source data in this post. Normally I would, but I don’t want to give out sensitive information accidentally or anything.)


5 Responses to “Feeling Chatty”

  1. Matt Says:

    Great post Dave. I really liked the graph showing the frequency of word usage of you vs friends.

    Thanks man, I always appreciate feedback and kind words.

    - Dave

  2. Kim Says:

    ““Aquinas” came up 6 times as well. He’s got my back, I think.”

    Hmm, wonder whose fault that is? :P

    I enjoyed this post too. I think your graph of you vs. friends chart is cool, but could use a little better labelling or explanation in the text. Matt had to explain it to me. Maybe I’m just tired? Once I got it, I thought it was pretty cool.

    Overall, there’s a really interesting mix of information and unique data analysis in this post. Well done!

    Yeah ha ha that was definitely your influence with the Aquinas. :)

    You’re really kind and complimentary as usual, Kim. Thanks. I will try to explain the charts more in the future. I personally felt they were a bit underexplained, but I wasn’t sure and I hate to beat topics to death. I appreciate the feedback.

    - Dave

  3. Kenny Says:

    I see that “heh” is not overly used, which leads me to believe that I am not a large sample for the data in this study. My heh is your ha, although I don’t have saved chat logs to prove it.

    I know that personally, I tend to use more of my vocabulary when typing, writing, or chatting than I do in spoken communication. I could venture a guess as to why that is, but I wouldn’t know for sure.

    I’m also seeing that even in the replies and comments here, we all tend to use “I” very frequently. Maybe it’s because we’re being analytical when we make responses, and maybe it’s because we try to relate what we read to our own experiences. I’d speculate that part of it is also that we’ve learned to communicate in a certain way to try and be complimentary, and that we interpret as not being overly commanding. Instead of just saying “Your article says…”, we might say “I think that your article says…”, and by doing that we go from telling you what you’re trying to tell us to telling you what we interpret it is that you are trying to tell us.

    You’ve done the chat study.. now you could expand it to a sampling of blog studies, and see if the two sets of data still compare or if they diverge. Interesting stuff.

    I guess so, Kenny. I agree with you, I also use more of my vocab in written media. Guess it’s just more formal and you have more time to think.

    “I” is such a common word. Probably because blog communication is so personal, which I think is the same point you’re making. I’d like to analyse blog posts from other places, Kenny. That’s a great idea for a follow up post, thanks. Gotta make sure not to go too crazy with the data, though, since I prefer to use a spreadsheet.

    - Dave

  4. Dan D. Says:

    Dave,

    I really like your charts, thumbnails and tables in this post. Nice job.

    –Dan D.

    Thanks Dan, that’s really nice of you to say.

    - Dave

  5. Dara Says:

    Haha

    The best part was looking up these, because I found countless variations like “batshit” and “assfuck”

    This made me LOL ;)

    I see a lot of new internet slang on FunAdvice every day…makes me feel old. One thing I’m surprised about in your results is that “like” isn’t higher on the list. I think it would be if we studied verbal usage instead of Internet usage.

    You might be right, Dara. Or my friends might have less “valley girl” in them. :) I like the take y’all have on Yahoo Answers on your site by the way, it’s interesting.

    - Dave
