Journal of Applied Mathematics and Physics, 2019, 7, 1519-1530
ISSN 2327-4352 | Scientific Research Publishing
https://doi.org/10.4236/jamp.2019.77103

Why Quantitative Variables Should Not Be Recoded as Categorical

Antônio Fernandes, Caio Malaquias, Dalson Figueiredo, Enivaldo da Rocha*, Rodrigo Lins
Department of Political Science, Federal University of Pernambuco (UFPE), Recife, Brazil

Received: May 13, 2019; Accepted: July 19, 2019; Published: July 22, 2019

Copyright © 2019 by authors and Scientific Research Publishing Inc. This work is licensed under the Creative Commons Attribution International License (CC BY). http://creativecommons.org/licenses/by/4.0/

The transformation of quantitative variables into categories is a common practice in both experimental and observational studies. The typical procedure is to create groups by splitting the original variable distribution at some cut point on the scale of measurement (e.g. mean, median, mode). Allegedly, dichotomization improves causal inference by simplifying statistical analyses. In this article, we address some of the adverse consequences of recoding quantitative variables into categories. In particular, we provide evidence that categorization usually leads to inefficient and biased estimates. We believe that considerable progress in our understanding of data analysis can occur if scholars follow the recommendations presented in this article. The recodification of quantitative variables as categorical is a poor methodological strategy, and scientists must stay away from it.

Keywords: Dichotomization, Inefficiency, Bias
1. Introduction

Imagine a political scientist wants to estimate the effect of income, measured as continuous yearly revenue, on partisanship. Before performing the analysis, she decides to split income into three levels: low, medium, and high. Similarly, suppose a physician wants to examine the effect of age on the likelihood of developing coronary heart disease. Before running the model, she recodes age into four groups. In this article, we address some of the adverse consequences of dichotomizing quantitative variables. Technically, categorization always implies a loss of information, and it usually leads to misleading results. To make our case, we reproduce data from two published studies. In addition, we employ basic simulations to show how dichotomization generates inefficiency and bias. To increase transparency, we report all computational scripts used to generate the statistical analyses.

Our target audience is graduate students in the early stages of training and scholars with a minimal mathematical background. For this reason, we minimize algebraic derivations to facilitate understanding of the original content. In particular, the paper fills a gap in the political methodology literature: we reviewed 24 articles on dichotomization published in 20 journals from 1983 to 2017, and none of them appeared in a political science journal (see Appendix Table A1). Because the categorization of quantitative variables is a common practice not only in the social sciences but also in the health sciences, we believe that considerable progress in our understanding of data analysis can occur if scholars follow the recommendations presented in this article.

The remainder of the paper is structured as follows. The next section reviews the literature on categorization. The second section replicates data from different studies to show how transforming quantitative variables into categories may lead to wrong conclusions. The third section uses basic simulations to highlight the shortcomings of dichotomization, focusing on both bias and efficiency. The final section concludes.

2. What Is the Problem?

Information loss, inefficiency, and bias: concisely, these are the main problems generated by the categorization of quantitative variables. Despite its widespread use, the scholarly literature has accumulated systematic evidence on why scholars should avoid dichotomization. Discretization reduces measurement accuracy, underestimates the magnitude of the coefficients of bivariate relationships, and lowers statistical power. The artificial transformation of quantitative measures into groups may also lead to biased coefficients and unreliable standard errors in multivariate models.

Methodological pleas against dichotomization are not new. For example,  showed that dichotomizing one of the variables at its mean reduces the population correlation coefficient by 20% on average.  estimated the effects of dichotomization in the context of analysis of variance (ANOVA). Similarly,  argues that dichotomization leads to a loss of one-fifth to two-thirds of the variance that may be accounted for in the original variables.  showed that the transformation of quantitative measures into categories underestimates both effect sizes and statistical power. Table 1 summarizes scholarly work against dichotomization.

Table 1. Literature against dichotomization.

Author (year) | Warning
“The use of the pseudo-orthogonal design biases the differences in means for the main effects relative to the differences in those means that would be obtained in a single-factor experiment” (p. 464).
“Dichotomizing one variable at the mean results in the reduction in variance accounted for to 0.647 r2; and dichotomizing both at the mean, to 0.405 r2” (p. 249).
“Analyses with categorized continuous variables required greater than 40% more patients for the same power as that achieved using continuous variables” (p. 138).
“Dichotomizing a continuous predictor variable can be conceptualized as adding an error of measurement to the variable. As a result, the effects of dichotomization are similar to the effects of random error of measurement” (p. 186).
“Dichotomization of continuous data is unnecessary for statistical analysis and in particular should not be applied to explanatory variables in regression models” (abstract).
“Dichotomizing a continuous variable is known to result in the loss of information, lower statistical power, and lower reliability” (abstract).
(Dichotomization) “(…) is harmful from the viewpoint of statistical estimation and hypothesis testing” (abstract).
“Modern regression models do not require categorization. In general, continuous variables should remain continuous in regression models designed to study the effects of the variable on the outcome of interest” (p. 3).
“Undesirable effects occur from dichotomization of both independent and dependent variables. The problem gets worse when multiple independent variables are split; for example, residual confounding is introduced, and spurious interaction effects may be seen” (p. 225).
“Simply dichotomizing continuous variables without previously referring to the original distributions by plotting them and checking consequences of dichotomization is a bad idea and should be discouraged” (p. 78).

Note: We reviewed 24 papers published in 20 journals from 1983 to 2017.

Another criticism against dichotomization comes from the measurement literature. According to  , “dichotomizing adds errors of discreteness. That is, the amount of unmeasured true scores variance for the cases at each of the points of the dichotomy is necessarily greater than it would be for cases at each of the multiple points in the original scale” (p. 249). Similarly,  argue that the categorization of quantitative variables into groups is equivalent to adding measurement error to the variable. Therefore, dichotomization increases the difference between true scores and measured values, which is likely to produce unreliable estimates. Figure 1 shows the relationship between dichotomization and measurement error.

B and C have similar scores when X is measured continuously. However, dichotomization leads to an inefficient aggregation of A and B vis-à-vis C and D. Comparatively, the least harmful procedure is to split a normal variable at its mean, which reduces the observed correlation by about 20% on average. However, perfectly normal distributions are rare in practice. Therefore, depending on the shape of the distribution, categorization will lead to even greater information loss. In short, the categorization of quantitative variables always generates information loss, which in turn reduces the efficiency of the estimates. In some cases, in addition to inefficiency, dichotomization can lead to biased estimates, as we show in the next section.
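The roughly 20% reduction can be checked numerically. The sketch below is our own illustration (not one of the paper's scripts; the seed is arbitrary): it draws a large bivariate normal sample, splits X at its mean, and compares the observed attenuation with the theoretical factor sqrt(2/π) ≈ 0.798.

```python
import numpy as np

# Numerical check: splitting a normal X at its mean multiplies its
# correlation with Y by sqrt(2/pi) ~= 0.798, i.e. roughly a 20% loss.
rng = np.random.default_rng(42)
n, rho = 1_000_000, 0.6

x = rng.standard_normal(n)
y = rho * x + np.sqrt(1 - rho**2) * rng.standard_normal(n)
x_dich = (x > x.mean()).astype(float)   # below mean -> 0, above -> 1

r_original = np.corrcoef(x, y)[0, 1]
r_dichotomized = np.corrcoef(x_dich, y)[0, 1]

print(r_original)                    # ~0.600
print(r_dichotomized)                # ~0.479
print(r_dichotomized / r_original)   # ~0.798 = sqrt(2/pi)
```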

3. Replication

In this section, we replicate two secondary datasets to show some of the adverse consequences of dichotomizing quantitative variables. The first example comes from  . They created a hypothetical example to represent the relationship between the number of errors made in a cognitive laboratory task (X1), the speed of response during the task (X2), and the score on a standardized ability test (Y). Figure 2 shows the Pearson correlation coefficients among those variables.

To explore the impact of categorization,  dichotomized both independent variables at their respective medians (13). Then, they estimated a 2 × 2 ANOVA, which revealed an effect of X1 and X2 on the mean of Y. According to  , “the bivariate dichotomization of X1, and X2 has led to a situation in which the estimated effects of X1 and X2 on Y are biased” (p. 183). In a simple linear regression, the effect of X2 on Y vanishes after we control for X1. In short, these results indicate that categorization may lead to misleading results.

The second example comes from  . He simulated five different scatterplots that yield an identical fourfold table when X and Y are dichotomized at cut point 0, misleadingly suggesting no association between the variables. Figure 3 replicates data from  .

Dichotomization leads us to overlook the true nature of the relationship between X and Y. According to  , “simply dichotomizing continuous variables without previously referring to the original distributions by plotting them and checking consequences of dichotomization is a bad idea and should be discouraged” (p. 3). These two examples show how dichotomization can lead scholars to wrong inferences.
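The same phenomenon can be reproduced with a small constructed example (ours, not the cited author's data): two datasets that yield exactly the same fourfold table when dichotomized at 0, yet have very different Pearson correlations.

```python
import numpy as np

def fourfold(x, y):
    """2x2 table obtained by dichotomizing x and y at the cut point 0."""
    return np.array([[np.sum((x < 0) & (y < 0)), np.sum((x < 0) & (y >= 0))],
                     [np.sum((x >= 0) & (y < 0)), np.sum((x >= 0) & (y >= 0))]])

m = 25
t = np.linspace(0.1, 1.0, m)

# Dataset A: strong linear trend -- quadrant-agreeing points sit far
# from the origin, disagreeing points hug it.
xa = np.concatenate([3 * t, -3 * t, 0.1 * t, -0.1 * t])
ya = np.concatenate([3 * t, -3 * t, -0.1 * t, 0.1 * t])

# Dataset B: no linear trend -- comparable clouds in every quadrant.
xb = np.concatenate([t, -t, t, -t])
yb = np.concatenate([t, -t, -t, t])

print(fourfold(xa, ya))           # [[25 25] [25 25]]
print(fourfold(xb, yb))           # [[25 25] [25 25]]  (identical table)
print(np.corrcoef(xa, ya)[0, 1])  # ~0.998: strong association
print(np.corrcoef(xb, yb)[0, 1])  # ~0.0: no association
```

The dichotomized tables are indistinguishable even though the underlying relationships are entirely different, which is precisely why plotting the original distributions matters.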

4. Simulation

To stress our distrust of dichotomization, we employ a basic simulation to show how the transformation of quantitative variables into categories produces inefficiency. First, we generate two normal variables (X and Y) correlated at 0.6 for a sample size of 300 cases. Then, we recode X at its mean (0) into two groups, below the average and above the average, to produce a dummy variable (0 or 1). Figure 4 shows the distribution of X and its dichotomization cutpoint at 0.

Figure 5 shows the correlation between X and Y, and between the categorized X and Y, for all cases (n = 300) and for a small sample of observations (n = 30).

The true correlation coefficient is 0.600. By dichotomizing X at its mean, we observe a linear association of 0.475, which represents a 20.83% difference from the known parameter. For a small sample size (n = 30), the Pearson correlation using the original variables is 0.465, which is closer to the true parameter value than the estimate from the dichotomized model. In short, regardless of the sample size, dichotomization leads to information loss, which decreases the efficiency of the estimates. Table 2 shows the estimates of two linear regression models.
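A minimal Python sketch of this simulation follows. It is our own re-creation with an arbitrary seed, so the exact figures differ slightly from those reported in Table 2, but the pattern (inflated standard error, deflated r2) is the same.

```python
import numpy as np

# Re-creation of the simulation: X and Y correlated at 0.6, n = 300.
rng = np.random.default_rng(0)
n, rho = 300, 0.6

x = rng.standard_normal(n)
y = rho * x + np.sqrt(1 - rho**2) * rng.standard_normal(n)

def simple_ols(x, y):
    """Return slope, its standard error, and r^2 for y = a + b*x + e."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    s2 = resid @ resid / (len(y) - 2)          # residual variance
    se_b = np.sqrt(s2 * np.linalg.inv(X.T @ X)[1, 1])
    r2 = 1 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    return beta[1], se_b, r2

# Dichotomize X at its mean, as in the text.
x_dich = (x > x.mean()).astype(float)

b_orig, se_orig, r2_orig = simple_ols(x, y)
b_dich, se_dich, r2_dich = simple_ols(x_dich, y)
print(b_orig, se_orig, r2_orig)   # slope near 0.6, r^2 near 0.36
print(b_dich, se_dich, r2_dich)   # larger standard error, smaller r^2
```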

Considering all cases (n = 300), the standard error of the dichotomized model is twice as large as that of the model using the original variables. For a bivariate linear regression, the coefficient of determination is the square of the Pearson correlation coefficient (0.6), which is 36%. In the dichotomized model, we observe an r2 close to 23%, which underestimates the goodness of fit of the model. For n = 30, the categorization of the independent variable leads to the incorrect retention of the null hypothesis at the 5% level (p-value = 0.052). Although our simulation deals with only two variables, the same reasoning applies to multiple linear regression, which is widely used in empirical research in both the human and natural sciences.

Now let’s consider a slightly more complicated case. We simulate the following model:

Y = 100 + 0.20*X1 − 0.40*X2 + ε (1)

where X1 follows a normal distribution (0, 1), X2 follows an exponential distribution (λ = 2), and ε has mean zero and standard deviation one, for a population of 100 observations. Table 3 compares the results of a linear regression using the original variables to a model in which both independent variables are dichotomized at their means.

Table 2. Linear regression of Y on X (original vs. dichotomized X).

Sample size                  n = 300                          n = 30
Level of measurement of X    Beta (Std. Error)   t      r2    Beta (Std. Error)   t      r2
Original                     0.600 (0.046)       12.95  0.360 0.437 (0.157)       2.78   0.216
Dichotomized                 0.948 (0.102)       9.31   0.225 0.609 (0.300)       2.03   0.128

Note: we estimated two linear regression models. The first one was estimated with both variables at their original level of measurement (continuous). The second model used X dichotomized at its mean (0).

Table 3. Linear regression (original vs. dichotomized variables).

Measurement    Model   β        Std. Error   p-value   Lower     Upper
Original       α       100.12   0.148        0.000     99.83     100.41
               X1      0.400    0.100        0.000     0.202     0.598
               X2      −0.527   0.191        0.000     −0.907    −0.147
               F = 11.465; r2 = 0.191
Dichotomized   α       99.71    0.182        0.000     99.352    100.07
               X1      0.543    0.224        0.017     0.098     0.988
               X2      −0.230   0.233        0.325     −0.693    0.232
               F = 3.924; r2 = 0.075

Source: authors.

The dichotomized model displays a lower r2 and F statistic, suggesting poorer goodness of fit. When the variables are used at their original level of measurement, the regression coefficients are unbiased estimates of the population parameters. However, when both variables are dichotomized at their means, X2 is no longer statistically significant, which leads us to retain the null hypothesis of no effect incorrectly. In public policy, the conclusion would be to cut resources; in medical research, the inference would be that the treatment has no impact on health. Figure 6 depicts the residual diagnostics of the dichotomized model.
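The two-predictor simulation can be sketched as follows. This is our own illustration of Equation (1), not the authors' original script; we use n = 1000 rather than the paper's 100 (and an arbitrary seed) so that the qualitative pattern is stable.

```python
import numpy as np

# Sketch of Equation (1): Y = 100 + 0.20*X1 - 0.40*X2 + e.
rng = np.random.default_rng(1)
n = 1_000

x1 = rng.standard_normal(n)                  # X1 ~ N(0, 1)
x2 = rng.exponential(scale=1 / 2, size=n)    # X2 ~ Exp(lambda = 2)
y = 100 + 0.20 * x1 - 0.40 * x2 + rng.standard_normal(n)

def ols(y, *predictors):
    """Coefficients, standard errors, and r^2 of an OLS fit with intercept."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    s2 = resid @ resid / (len(y) - X.shape[1])
    se = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))
    r2 = 1 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    return beta, se, r2

# Dichotomize both predictors at their respective means.
d1 = (x1 > x1.mean()).astype(float)
d2 = (x2 > x2.mean()).astype(float)

beta_o, se_o, r2_o = ols(y, x1, x2)
beta_d, se_d, r2_d = ols(y, d1, d2)
print(r2_o, r2_d)   # the dichotomized model explains less variance
```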

5. Conclusions

Despite criticisms from the scholarly community, dichotomization is still a common practice in empirical research. Unfortunately, many researchers categorize quantitative variables before running data analyses. This is true from biology to psychology, from medical research to sociology. Before the development of statistical software and computers, categorization played an essential role in science by simplifying mathematical modeling. That is no longer the case. Since we now have more appropriate tools, there is no reason to transform quantitative measures into categories. More than 30 years ago,  argued that “scientific questions are better decided by empirical evidence than by methodological default” (p. 833).

Categorization usually leads to misleading results. It can deceive us by increasing inefficiency and affecting the probability of type I and type II errors. Dichotomization also generates biased coefficients, since it can hide the correct functional form of the observed relationship. In some cases, when two or more independent variables are dichotomized, a truly null effect will likely reach statistical significance. The artificial transformation of quantitative variables into groups reduces the power of statistical tests and increases errors of discreteness. What happens if both the independent and dependent variables are categorized? Double dichotomization using the mean as the cutpoint is equivalent to losing almost half of the sample cases. In short, dichotomization leads to a systematic loss of information, which has detrimental effects on the reliability of statistical estimates.

In sum, the recodification of quantitative variables as categorical is a poor methodological strategy, and scholars must stay away from it. Dichotomization undoubtedly simplifies data analysis, but the costs are too high to bear. Today, categorization is neither appropriate nor justifiable. Continuous variables are fine as they are. Let's be cool about it and leave quantitative variables alone.

Conflicts of Interest

The authors declare no conflicts of interest regarding the publication of this paper.

Cite this paper

Fernandes, A., Malaquias, C., Figueiredo, D., da Rocha, E. and Lins, R. (2019) Why Quantitative Variables Should Not Be Recoded as Categorical. Journal of Applied Mathematics and Physics, 7, 1519-1530. https://doi.org/10.4236/jamp.2019.77103

Appendix

Table A1. Literature review per area.

Author (year) | Journal
Applied Psychological Measurement
Journal of Applied Psychology
British Journal of Cancer
American Journal of Epidemiology
Epidemiology
Psychological Bulletin
Journal of Educational and Behavioral Statistics
Development and Psychopathology
Journal of Multivariate Analysis
Psychological Methods
Journal of Marketing Research
Journal of the American Statistical Association
British Medical Journal
Statistics in Medicine
Neuroepidemiology
Pharmaceutical Statistics