Why We Should Not Discretize Continuous Variables

This first post on regression interactions is designed to explain why we should not discretize continuous variables to use an ANOVA-based model, but should instead examine continuous interactions via a regression analysis.

The data that I use in this example is based on Language Expectancy Theory (LET; Burgoon, 1990; Burgoon, Denning, & Roberts, 2002; Burgoon, Jones, & Stewart, 1975; Burgoon & Miller, 1985), which is a message-centered theory of persuasion that explains why certain linguistic formats in persuasive messages promote or inhibit persuasion. LET focuses on expectations for language use and how violations of these expectations impact persuasive outcomes.

When both independent variables (IVs) are categorical, a $2 \times 2$ analysis of variance (ANOVA) is an appropriate way to test for an interaction. For example, say we want to test Language Expectancy Theory's proposition that language use (intense, neutral) and gender of the source (male, female) interact to affect persuasive outcomes. Both IVs are categorical, so an ANOVA-based model is an appropriate choice for testing this interaction.

However, say we are instead interested in Language Expectancy Theory's predictions about the interaction between language condition (intense, neutral) and author credibility on message effectiveness, which is one measure of persuasive outcomes. In my thesis data, language condition was a categorical variable, and author credibility was measured as a continuous variable (see note 1).

IV: Author Credibility (Continuous)

IV: Language Condition (Categorical: Intense, Neutral)

DV: Message Effectiveness

In this type of situation, with one categorical IV and one continuous IV, people often force continuous variables into categories (e.g., by using a median split) and then use an ANOVA-based model to analyze their data. A median split on the continuous IV used in this example, author credibility, would involve splitting the data into a high credibility and low credibility group based on the median, which is depicted as follows:

Discretizing continuous data this way, however, should NOT be done for several reasons:

First, discretizing continuous data is conceptually questionable. To illustrate this point, let's again examine the distribution of author credibility:

If we were to dichotomize (i.e., turn into two categories) author credibility using a median split, we would be saying that B is more similar to C than it is to A. As you can see, this is not the case, which is why discretizing continuous variables is conceptually questionable.
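
To make the A, B, C argument concrete, here is a minimal sketch using made-up credibility scores. The values for A, B, and C and the toy sample are hypothetical, chosen only so that A and B sit just on either side of the median while C sits far above it; they are not taken from the thesis data.

```python
from statistics import median

# Hypothetical credibility scores (illustrative only, not the thesis data).
scores = {"A": 3.9, "B": 4.1, "C": 6.8}

# A toy sample that contains A, B, and C, used to locate the median.
sample = [1.5, 2.2, 3.0, 3.9, 4.1, 5.0, 5.9, 6.8]
cut = median(sample)

# Median split: everything above the cut is "high", everything else is "low".
groups = {name: ("high" if x > cut else "low") for name, x in scores.items()}

# Distances on the original continuous scale.
dist_b_a = abs(scores["B"] - scores["A"])
dist_b_c = abs(scores["B"] - scores["C"])

print(groups)
print(f"|B - A| = {dist_b_a:.1f}, |B - C| = {dist_b_c:.1f}")
```

In this toy case the split puts B in the same group as C and a different group from A, even though B is far closer to A on the original scale.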

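Second, discretizing continuous data results in a loss of information, a loss of variance, and a loss of power. A continuous variable records how far each case sits from every other case. When you dichotomize it, you keep only which side of the cut each case falls on, and you throw the rest of that information and variance away: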

The estimated correlation, $\hat{\rho}$, with message effectiveness and:

The correlation drops in magnitude because discretizing the variable discards variance. A weaker observed relationship, in turn, means less statistical power to detect it.
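
The drop in correlation is easy to reproduce with simulated data. The sketch below is only an illustration under stated assumptions: the variables are synthetic and normally distributed, the slope of 0.5 is arbitrary, and `pearson` is a small helper written for this example rather than part of the original analysis.

```python
import random
from statistics import median


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


rng = random.Random(0)  # fixed seed so the run is repeatable
n = 10_000

# Synthetic stand-ins for the two variables (not the thesis data).
credibility = [rng.gauss(0, 1) for _ in range(n)]
effectiveness = [0.5 * c + rng.gauss(0, 1) for c in credibility]

# Median split: 1.0 for the "high credibility" group, 0.0 for "low".
cut = median(credibility)
high_group = [1.0 if c > cut else 0.0 for c in credibility]

r_continuous = pearson(credibility, effectiveness)
r_split = pearson(high_group, effectiveness)

print(f"continuous predictor:   r = {r_continuous:.3f}")
print(f"median-split predictor: r = {r_split:.3f}")
print(f"ratio: {r_split / r_continuous:.3f}")
```

For a normally distributed predictor split at its median, the expected shrinkage factor is $\sqrt{2/\pi} \approx .80$, so the ratio printed here should land close to that value.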

Third, discretizing continuous data results in misclassification. If you perform a median split on any continuous scale, measurement error can place a case in one group when its true score belongs in the other. The author credibility scale in this example has a Cronbach's $\alpha = .90$, which means approximately $10\%$ of the variance reflects error variance. With a sample size of $N = 218$, roughly $10\%$ of cases, or about 22 people, would be expected to be misclassified with a median split!
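
The expected misclassification rate can be worked out directly. This is a sketch under assumptions that are mine, not Hayes's: true and observed scores are bivariate normal, Cronbach's $\alpha$ is treated as the reliability, and, following classical test theory, the correlation between true and observed scores is the square root of the reliability. Under those assumptions, the chance that a case lands on the wrong side of the median is $\arccos(\sqrt{\text{reliability}})/\pi$. The function name `misclassification_rate` is hypothetical.

```python
import math
import random


def misclassification_rate(reliability):
    """Probability that true and observed scores fall on opposite sides of
    the median, assuming bivariate normality and
    corr(true, observed) = sqrt(reliability)."""
    return math.acos(math.sqrt(reliability)) / math.pi


reliability = 0.90  # Cronbach's alpha from the post, treated as reliability
n = 218             # sample size from the post

rate = misclassification_rate(reliability)
expected_misclassified = rate * n

# Simulation check: true score ~ N(0, 1), observed = true + error, with the
# error variance chosen so that var(true) / var(observed) = reliability.
rng = random.Random(1)
trials = 100_000
error_sd = math.sqrt((1 - reliability) / reliability)
flips = 0
for _ in range(trials):
    true_score = rng.gauss(0, 1)
    observed = true_score + rng.gauss(0, error_sd)
    # Both distributions have a population median of 0.
    flips += (true_score > 0) != (observed > 0)
simulated_rate = flips / trials

print(f"analytic rate:  {rate:.4f}")
print(f"simulated rate: {simulated_rate:.4f}")
print(f"expected misclassified out of {n}: {expected_misclassified:.1f}")
```

With a reliability of .90 the analytic rate comes out near $10\%$, which is why the $10\%$ error-variance figure and the share of misclassified cases happen to be so close in this example. The two quantities are not equal in general, so rerun the function with other reliabilities before reusing that shortcut.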

To illustrate this point, Hayes (2005) created a table that displays the estimated percent of cases misclassified as a result of discretization:

As you can see, if we were able to measure the latent variable perfectly, none of the cases would be misclassified. However, we are never able to measure a latent variable perfectly, and as the reliability of the measure decreases, the percent of cases misclassified with a median split increases. The same is true of trichotomization (i.e., splitting the continuous variable into three categories), which should also be avoided for the same reasons.

So, do NOT discretize continuous variables to analyze a hypothesized interaction. Instead, you should use a regression interaction analysis!
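
As a preview of what that looks like, here is a minimal sketch of the interaction model $\text{effectiveness} = b_0 + b_1 \cdot \text{credibility} + b_2 \cdot \text{intense} + b_3 \cdot (\text{credibility} \times \text{intense})$, where `intense` is a 0/1 dummy for language condition. Everything in it is an assumption made for illustration: the data are synthetic, the generating coefficients are arbitrary, and `simple_ols` is a helper written for this example. It relies on the fact that, with one binary IV, the point estimates of the full interaction model match those from fitting a separate regression line in each condition.

```python
import random


def simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return my - slope * mx, slope


rng = random.Random(2)
n = 2000

# Synthetic data (not the thesis data): alternate conditions, random credibility.
intense = [i % 2 for i in range(n)]              # 0 = neutral, 1 = intense
credibility = [rng.gauss(0, 1) for _ in range(n)]
# Arbitrary generating model: slope 0.2 under neutral, 0.2 + 0.6 under intense.
effectiveness = [1.0 + 0.2 * c + 0.3 * d + 0.6 * c * d + rng.gauss(0, 1)
                 for c, d in zip(credibility, intense)]

rows = list(zip(credibility, intense, effectiveness))
neutral_rows = [(c, y) for c, d, y in rows if d == 0]
intense_rows = [(c, y) for c, d, y in rows if d == 1]

a0, s0 = simple_ols(*zip(*neutral_rows))   # line within the neutral condition
a1, s1 = simple_ols(*zip(*intense_rows))   # line within the intense condition

b0, b1 = a0, s0   # intercept and credibility slope when intense = 0
b2 = a1 - a0      # shift in intercept for the intense condition
b3 = s1 - s0      # interaction: change in credibility slope under intense

print(f"b0={b0:.2f}  b1={b1:.2f}  b2={b2:.2f}  b3={b3:.2f}")
```

Credibility stays continuous throughout, so no cases are regrouped and no variance is discarded. This sketch gives point estimates only; testing whether $b_3$ differs from zero also requires its standard error, which a full regression routine would report.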

References

Burgoon, M. (1990). Social influence. In H. Giles & P. Robinson (Eds.), Handbook of language and social psychology (pp. 51-72). London: Wiley.

Burgoon, M., Denning, V. P., & Roberts, L. (2002). Language expectancy theory. In J. P. Dillard & M. Pfau (Eds.), The persuasion handbook: Developments in theory and practice (pp. 117-136). London: Sage.

Burgoon, M., Jones, S. B., & Stewart, D. (1975). Toward a message-centered theory of persuasion: Three empirical investigations of language intensity. Human Communication Research, 1(3), 240-256.

Burgoon, M., & Miller, G. R. (1985). An expectancy interpretation of language and persuasion. In H. Giles & R. St. Clair (Eds.), Recent advances in language, communication and social psychology (pp. 199-229). London: Lawrence Erlbaum.

Hayes, A. F. (2005). Statistical methods for communication science. Mahwah, NJ: Erlbaum.


  1. Note that in the actual experiment, author biographical information was held constant, and the author credibility scale was used as a manipulation check. I am using the data here only as an example.