If the Shoe Overfits… Part Two

August 28th, 2008

(This is part two of a two-part series on criticism of MBTI personality theory. In part one, I set the stage and covered some basic criticisms. In part two, I’ll cover some in-depth criticisms and wrap things up.)

When we left off before, I was addressing various criticisms of MBTI personality tests, specifically the David Keirsey variety. But the concern that looms largest to me (which I haven’t mentioned yet) is the question of consistency.

I, like many other people, wonder how consistent the MBTI tests really are. Lots of research has been done on the topic, so allow me to quote liberally from a few sources. (Please bear with me through these. The discussion ahead will be a little more technical than is usual here. I’ll explain any difficult jargon after each quote.)

From the British Medical Journal:

“The continuous MBTI dimensions have reasonable split half reliability ~.84, and proponents of the MBTI have been keen to imply that this validates the MBTI [e.g. 8]. However, these test-retest measures are highly sensitive to inter-test interval (ITI). With an ITI of less than nine months test-retest reliability is ~.80, but over nine months it is ~.65 [8]. This is not what we would expect of a trait that is supposedly stable over time!

The dichotomous classifications are actually much less reliable than the measures for the continuous dimensions would imply. This is because using mid-distribution dichotomous cut-offs actually requires even more reliable continuous measures than trait instruments [8]. Only about 50% tested within nine months score the same on all four dimensional dichotomies (i.e. remain the same type) and around 36% remain the same after nine months. Within each scale ~83% retain the same categorisation when retested within nine months, and ~75% when tested after nine months [8]. This is not good for a test supposed to detect categorical type, fixed over a lifetime, and undermines the use of a typological classification, particularly given the many revisions to the scoring system. Form M of the MBTI, scored by IRT, is reported to show an overall type agreement with the previous Form G of 60% [14]

In true-type studies (where the personal evaluation of the MBTI type is compared with the score allocated MBTI type), Carskadon & Cook [2] found that 50% of people picked their MBTI profile, while 13% picked the completely opposite profile.”

“Split-half reliability” is the idea that if you split a test’s questions into two comparable halves, the scores on each half should correlate with each other. Simple enough. “Test-retest reliability,” by contrast, compares the same person’s results across two sittings. These quotes highlight the fact that MBTI results are fairly consistent when the tests are taken close together in time. The longer the gap between tests, the more the results vary, both on each individual dimension and for the overall type. Also, people tend to identify less with their tested personality types than you might expect.
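For the curious, here is a minimal Python sketch of how a split-half reliability could be computed. The answer data is made up for illustration (1 means the respondent picked the first option on an item, 0 the second), the odd/even split is just one common way to halve a test, and the function names are mine, not anything from the MBTI manual. The last step applies the standard Spearman-Brown formula, which estimates the reliability of the full-length test from the correlation between its two halves.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(ss_x * ss_y)


def split_half_reliability(responses):
    """Correlate each respondent's score on the odd items with the even items."""
    half_a = [sum(answers[0::2]) for answers in responses]
    half_b = [sum(answers[1::2]) for answers in responses]
    return pearson(half_a, half_b)


def spearman_brown(half_correlation):
    """Estimate full-test reliability from the half-test correlation."""
    return 2 * half_correlation / (1 + half_correlation)


# Hypothetical answers: six respondents, eight items on a single scale.
responses = [
    [1, 1, 1, 1, 1, 0, 1, 1],
    [0, 0, 1, 0, 0, 0, 0, 1],
    [1, 0, 1, 1, 0, 1, 1, 1],
    [0, 0, 0, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 1, 1, 1, 0],
    [0, 1, 0, 0, 0, 1, 0, 0],
]

half_r = split_half_reliability(responses)
print(f"correlation between halves: {half_r:.2f}")
print(f"Spearman-Brown estimate for the full test: {spearman_brown(half_r):.2f}")
```

Note that this only says whether the two halves of one sitting agree with each other. It says nothing about whether the same person would score the same way months later, which is the test-retest question.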

From the Center for Applications of Psychological Type:

“1. Reliabilities (when scores are treated as continuous scores, as in most other psychological instruments) are as good or better than other personality instruments.
2. On retest, people come out with three to four type preferences the same 75-90% of the time.
3. When people change their type on retest, it is usually on one scale, and in scales where the preference clarity was low.
4. The reliabilities are quite good across age and ethnic groups, although reliabilities on some scales with some groups may be somewhat lower. The T-F scale tends to have the lowest reliability of the four scales.
5. There are some groups for whom reliabilities are especially low, and caution needs to be exercised in thinking about using the MBTI® instrument with these groups. (For example, children)”

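This is the other half of the story. Treated as continuous scores, MBTI results are about as reliable as those of other personality instruments, and most people keep three or four of their four preferences on retest. Strength of preference is important, too: changes usually happen on a scale where the preference was weak to begin with. On the flip side, the T-F (“thinking-feeling”) scale tends to be the least reliable. (You can read more about the specific dimensions of personality on Wikipedia if you’re unfamiliar with them.) Also, the tests are less reliable for some groups, such as children.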

And finally, from Wikipedia:

“Split-half reliability of the MBTI scales is good, although test-retest reliability is sensitive to the time between tests. However, because the MBTI dichotomies scores in the middle of the distribution, type allocations are less reliable. Within each scale, as measured on Form G, about 83% of categorizations remain the same when retested within nine months, and around 75% when retested after nine months. About 50% of people tested within nine months remain the same overall type and 36% remain the same after nine months.[21] For Form M (the most current form of the MBTI instrument) these scores are higher (see MBTI Manual, p. 163, Table 8.6).

For example, some researchers expected that scores would show a bimodal distribution with peaks near the ends of the scales, but found that scores on the individual subscales were actually distributed in a centrally peaked manner similar to a normal distribution. A cut-off exists at the center of the subscale such that a score on one side is classified as one type, and a score on the other side as the opposite type. This fails to support the concept of type–the norm is for people to lie near the middle of the subscale.[5][6][23][24][7] Nevertheless, “the absence of bimodal score distributions does not necessarily prove that the ‘type’-based approach is incorrect.”[24]

It has been estimated that between a third and a half of the published material on the MBTI has been produced for conferences of the Center for the Application of Psychological Type (which provides training in the MBTI) or as papers in the Journal of Psychological Type (which is edited by Myers-Briggs advocates).[25] It has been argued that this reflects a lack of critical scrutiny.[23][25] Estimations on the research related to the most utilized tool published in fifty years (e.g. 40 million administrations) is affected by the popularity of the instrument.

Some researchers have interpreted the reliability of the test as being low, with test takers who retake the test often being assigned a different type. According to surveys performed by the proponents of Myers-Briggs, the highest percentage of people who fell into the same category on the second test is only 47%.[citation needed] Furthermore, a wide range of 39% - 76% of those tested fall into different types upon retesting weeks or years later.[23][7]”

These quotes cover much of the same territory as the others. Wikipedia also mentions that MBTI scores tend to cluster around the middle of each of the four personality dimensions, much like a bell curve or normal distribution, and the middle is exactly where the cut-off between the two types sits. Also, a large share of the published research on MBTI was produced for CAPT conferences (CAPT provides MBTI training) or for a journal edited by Myers-Briggs advocates.

This reveals three main criticisms about the reliability and consistency of MBTI tests, all of which are related. The first is that people’s MBTI scores tend to vary a lot over time, despite Keirsey’s claim that personality is inborn and invariant. I think the response to this is easy. Personality isn’t fixed at birth, and I see no reason why it should be. My guess is that people have a genetic tendency toward a certain personality, but that it can change over time. Plus, for each personality dimension, MBTI scores tell you how far you lean one way or the other, as a percentage. I would say, as others do, that how far you lean is how strong your personality is in that dimension. Not everyone has a strong personality, and those who don’t will probably cross the dividing line more often on a retest. I don’t see why that should be a problem. In fact, it would help explain some of the variation over time.

The second criticism is that although type theory implies a bimodal distribution (two distinct camps on each dimension), people actually cluster around the boundary line for each dimension of personality. Classifying people into one group or the other, then, is essentially arbitrary for anyone near the middle. Again, I don’t think this is a problem. It just signals that most people don’t have strong personalities. I see this as a strength of MBTI tests, actually. Like in politics, I think most people have moderate (rather than extreme) personalities, and that they’re quite capable of being influenced by others.
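How much does a cut-off in the middle of a bell curve hurt? Here is a rough sketch, assuming the test and retest scores on one dimension follow a bivariate normal distribution, the cut-off sits exactly at the mean, and the test-retest reliability is the correlation between the two sittings. Under those assumptions (mine, not the quoted sources’), a standard result for the bivariate normal gives the chance of landing on the same side both times as 1/2 + arcsin(r)/π.

```python
import math


def same_side_probability(reliability):
    """Chance that test and retest fall on the same side of a cut at the mean.

    Assumes bivariate normal scores whose correlation equals the reliability.
    """
    return 0.5 + math.asin(reliability) / math.pi


# Reliabilities taken from the first quote: ~.80 within nine months, ~.65 after.
for label, r in [("within nine months", 0.80), ("after nine months", 0.65)]:
    share = same_side_probability(r)
    print(f"{label}: reliability {r:.2f} -> same category about {share:.1%} of the time")
```

With those two reliabilities this works out to roughly 80% and 73%, in the same neighborhood as the ~83% and ~75% per-scale figures in the quote. So a fair amount of category switching is what you would expect from a centered cut-off alone, even when the continuous scores are reasonably reliable.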

The final criticism is that MBTI tests are too specific. Why think there are 16 separate personality types? Well, I’ve already said how I feel about that. 16 is too many, especially when you’re trying to analyze something as fuzzy, complicated, and malleable as personality. Keirsey’s four temperaments seem like a much better bet. We’ve already seen that MBTI tests can have issues with reliability over time. To me, that looks like a classic symptom of “overfitting” in a model. Overfitting happens when your model has too many rules for too little data. Such a model may do well on the data it was built around, but when you throw new data at it, it will often perform poorly.
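To make overfitting concrete, here is a toy Python sketch that has nothing to do with MBTI data. The four training points are hand-picked: they follow the simple rule y = x, nudged up or down by 1 to stand in for noise. One model is a straight line (two “rules”). The other is a curve with enough freedom to pass through every training point exactly (four “rules” for four points). The held-out points sit on the underlying rule y = x. All names and numbers here are made up for illustration.

```python
def fit_line(xs, ys):
    """Least-squares straight line: the simple model."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def fit_interpolant(xs, ys):
    """Lagrange polynomial through every training point: the overfitted model."""
    def predict(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            weight = 1.0
            for j, xj in enumerate(xs):
                if j != i:
                    weight *= (x - xj) / (xi - xj)
            total += yi * weight
        return total
    return predict


def mean_squared_error(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


train_x = [0.0, 1.0, 2.0, 3.0]
train_y = [1.0, 0.0, 3.0, 2.0]   # y = x with hand-picked noise of +1, -1, +1, -1
test_x = [0.5, 1.5, 2.5]
test_y = [0.5, 1.5, 2.5]         # new points on the underlying rule y = x

for name, model in [("straight line", fit_line(train_x, train_y)),
                    ("through-every-point curve", fit_interpolant(train_x, train_y))]:
    train_err = mean_squared_error(model, train_x, train_y)
    test_err = mean_squared_error(model, test_x, test_y)
    print(f"{name}: training error {train_err:.2f}, new-data error {test_err:.2f}")
```

The curve looks perfect on the data it was built around (zero training error, against 0.80 for the line), but on the new points its error is about 0.67 while the line’s is about 0.11. The extra flexibility went into memorizing the noise.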

Imagine you’re teaching someone how to play basketball. If you try to be too specific, you won’t be able to cover the large range of situations that can come up in-game. If you try to tell them how to dribble, which way to go, exactly how to shoot, etc., they’ll get lost in the individual details and won’t be able to play the game very well. That’s what overfitting is like. Far better to give them a few basic rules and let them experiment. This idea is often summed up by “Occam’s Razor” or the Principle of Parsimony. As the saying often attributed to Einstein goes, “Make everything as simple as possible, but not simpler.”

Four temperaments is a lot simpler than 16 personality types. A quarter as many categories, in fact. Not only that, but if you think 16 personality types is overfitting and personality is a moving target, four temperaments will do a much better job of modeling personality over time. There’s more wiggle room in the temperaments, and thus more stability. I couldn’t find data on exactly how reliable temperaments are compared to types, but my suspicion is that they hold up a lot better. The theory is a lot more sound, anyway.

However, those four temperaments are only decided by 3 of the 4 dimensions of personality; Introversion vs. Extraversion is superfluous. I’m OK with that. Introversion vs. Extraversion, as I’ve read, has always been less important for determining your personality and more just a particular flavor of it. It’s good to know, but not critical.
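As a back-of-the-envelope check on that suspicion (not a substitute for real data), here is a Python sketch that estimates how often a retest would return the same four-letter type versus the same temperament. It assumes each letter independently stays the same with the per-scale rates quoted earlier (about 83% within nine months, about 75% after), and that the rate is the same on every scale. Both assumptions are simplifications of mine; the quotes themselves suggest T-F is shakier than the rest. The temperament mapping follows Keirsey’s scheme: SJ, SP, NF, NT.

```python
from itertools import product

OPPOSITE = {"E": "I", "I": "E", "S": "N", "N": "S",
            "T": "F", "F": "T", "J": "P", "P": "J"}


def temperament(mbti_type):
    """Keirsey temperament for a four-letter type: SJ, SP, NF, or NT."""
    _, sn, tf, jp = mbti_type
    return sn + (jp if sn == "S" else tf)


def retest_agreement(start_type, per_scale_agreement):
    """Exact chance of the same type and the same temperament on retest.

    Enumerates all 16 patterns of which letters flip, treating the four
    scales as independent with one shared agreement rate (an assumption).
    """
    same_type = 0.0
    same_temperament = 0.0
    for flips in product([False, True], repeat=4):
        prob = 1.0
        letters = []
        for letter, flipped in zip(start_type, flips):
            prob *= (1 - per_scale_agreement) if flipped else per_scale_agreement
            letters.append(OPPOSITE[letter] if flipped else letter)
        retest_type = "".join(letters)
        if retest_type == start_type:
            same_type += prob
        if temperament(retest_type) == temperament(start_type):
            same_temperament += prob
    return same_type, same_temperament


for p in (0.83, 0.75):
    type_rate, temperament_rate = retest_agreement("INTP", p)
    print(f"per-scale agreement {p:.0%}: same type {type_rate:.0%}, "
          f"same temperament {temperament_rate:.0%}")
```

At 83% per scale, this gives about 47% for keeping the same type (close to the quoted figure of about 50%) and about 69% for keeping the same temperament. At 75%, the figures are about 32% and 56%. The temperament only has to survive changes on two letters, not four, which is where the extra stability comes from in this simplified picture.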

Given that the Feeling-Thinking distinction is the weakest, that also means that there’s less of a difference between Idealists (NFs) and Rationals (NTs) than between Guardians (SJs) and Artisans (SPs). As before, that’s OK with me. There was a fairly long stretch where I tested as NF, so that goes along with my experience. (However, over time I consistently test NT much more than anything else.)

So, if you strip down MBTI theory to its barest bones as Keirsey’s 4-temperament theory, I think you have something. Otherwise, I’m not so sure. As Keirsey mentions, you’ll see these four temperaments pop up again and again in history. They were mentioned, in one form or another, by Plato (“The Republic”), Aristotle, the “Old Testament” (Ezekiel), the “New Testament” (the four Evangelists/Gospels), Galen, Paracelsus, Shakespeare, “Harry Potter”, “The Wizard of Oz”, “Sex and the City”, many modern psychologists, and maybe even the “Teenage Mutant Ninja Turtles”. Even hardcore skeptics like the essayist Michel de Montaigne and the philosopher David Hume accepted the temperaments as a basic fact. Moreover, there are abundant examples in history, literature, and pop culture. I’m not just taking Keirsey at his word, either. I’ve actually read at least three-quarters of the source material he’s talking about, and from what I know I’m inclined to agree with him. By contrast, I doubt I’ve ever seen a reference to 16 personality types.

Another thing I like about Keirsey is how applied he is. Myers, though she tried out her tests extensively, was very much against using research to guide the types themselves. Keirsey, on the other hand, let his temperaments grow out of more than 20 years of research and working directly with kids. Though I’ve linked to this before in my blog, Keirsey’s method for controlling problem children called “Abuse It — Lose It” was fascinating to me and was based directly on a four-temperament framework. Keirsey writes about the experience here.

There’s a lot more I could say on the topic, but in the spirit of the Principle of Parsimony (and being an efficiency-loving Rational), maybe I should stop here. And I’m sure many of you reading would agree, since Rationals tend to be overrepresented online as well. So from one Rational writer to all you Rational readers out there, that’s it for this week. See you next time!


One Response to “If the Shoe Overfits… Part Two”

  1. Bob Says:

    Being a good skeptic I have to question personality tests, but in a lot of ways I want to believe in every one of them, without too much critical thought. I mean whenever you take one you feel good about yourself, for the reasons mentioned in the previous post of confirmation bias, and the fact that they rarely say anything bad. I think the less specific you get with them, as you have mentioned, the more accurate they can be.

    My main concern with personality tests now is one more of functionality than accuracy. If you assume that any given person matches their type 90%, I am not sure that gives you enough of a guideline to deal with them. People are really complex, and if you try to interact with them only in the way that it has been deemed you should treat their type, I think it will lead to more problems than if you had no information on them. It seems that even if you use the type as a “guideline” you will not have any more information on the person than if you knew nothing.

    For me, the scariest application of this is the idea that you can place someone in a job based solely on their personality type. This is a concern for me because I think in Myers-Briggs I am exactly 180° out of phase with the typical engineering archetype. If taking a personality test were a requisite for my job there would be a 0% chance I would be accepted. Now, you can argue that I am not a great engineer and the company would have been better off without me, but I am still not sure that a personality type should make that decision.

    Actually Bob, it’s funny. MBTI was specifically designed not to be pejorative (it’s like horoscopes in that way, I admit). Many other personality tests are not that way, oddly enough. I see this as a strength because people will never identify with a negatively perceived personality type, and plus it brings the whole morality question into play. We both agree, though, that the tests should not be really specific. However, I still think they should be specific enough to not be “horoscopey”, which I think MBTI is, actually.

    Do personality types work as useful guidelines, then? My experience has been that they do. I believe they are one of many “tools” in the toolbox for understanding and predicting other people’s behavior. Missing out on these large-scale trends in people’s personalities really hamstrings you, I feel. And if they weren’t useful in understanding people, why have they loomed so large for so long in literature and history? This has been borne out by my personal experience as well.

    The job placement thing is another situation entirely. That’s iffy territory. Long ago, my original inspiration for reading the Occupational Handbook was to classify every job in rough MBTI type terms. Though it was a very interesting exercise, I found some professions were much easier to classify than others, and plenty had vague or even non-existent categories. What personality type is a fast food worker? A busboy? A manager? It’s not obvious for some jobs what the associated Keirsey type should be. After finding that out I somewhat abandoned the project. However, like before, I will say that MBTI can be a good tool for jobs that require specific, well-established, strong personality types. Even then, it should not be used as a screening tool but as one to aid in interviewing and hiring, I think. (Which is exactly how many are used. Sadly, some are also used for screening. Some personality tests are also used in team formation, which is OK I guess.)

    - Dave
