
In item response theory (IRT), the scaling constant D = 1.7 is used to scale a discrimination coefficient a estimated with the logistic model to the normal metric. Empirical verification is provided that Savalei’s [1] proposed scaling constant of D = 1.749, based on Kullback-Leibler divergence, appears to give the best empirical approximation. However, understanding this issue as one of the accuracy of the approximation is incorrect for two reasons. First, scaling does not affect the fit of the logistic model to the data. Second, the best scaling constant to the normal metric varies with item difficulty, and the constant D = 1.749 is best thought of as the average of scaling transformations across items. The traditional scaling with D = 1.7 is used simply because it preserves the historical interpretation of the metric of item discrimination parameters.

Two common families of models used in item response theory (IRT) are based on the normal and logistic distribution functions. Both are used extensively: the logistic model is more common in ongoing assessment programs, while the normal model tends to be used more in research studies. It is thought, with some justification, that the two models yield coefficient estimates that are practically indistinguishable after a simple multiplicative scaling. Below, this claim is explained in detail and investigated more thoroughly.

Assume a set of test items j = 1, …, J administered to subjects i = 1, …, N, where the items are dichotomously scored: correct responses are scored Y_{ij} = 1, and incorrect responses Y_{ij} = 0. Also, assume there is a single latent variable θ_{i}, known as examinee ability or proficiency, that accounts for an examinee’s observed item responses. While θ can be multidimensional, only the unidimensional case is considered here.

With the two parameter normal ogive function (2PN), a correct response on item j presented to examinee i is modeled by

$$P_{ij}(Y_{ij} = 1 \mid \eta_{ij}) = G(\eta_{ij}), \qquad \eta_{ij} = a_j(\theta_i - b_j), \tag{1}$$

where G is the cumulative normal distribution function defined as

$$G(\eta) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\eta} \exp(-t^2/2)\, dt. \tag{2}$$

In this two-parameter IRT model, a_{j} is an item discrimination parameter, and b_{j} is an item difficulty parameter. The person parameter θ_{i} is defined above. In the two-parameter logistic model (2PL), correct responses are modeled by

$$P_{ij}(Y_{ij} = 1 \mid \eta_{ij}) = F(\eta_{ij}) = \frac{\exp(\eta_{ij})}{1 + \exp(\eta_{ij})}, \tag{3}$$

where again η_{ij} = a_{j}(θ_{i} − b_{j}). Note that the parameterizations of both the 2PN and 2PL IRT models are in terms of a_{j}, b_{j}, and θ_{i}, and the interpretations of model coefficients as item discrimination, item difficulty, and person ability are identical. In short, both the 2PN and 2PL models for dichotomous items provide an estimate of the probability that an examinee will answer correctly, and both models are based on the same item and person parameters.

In this context, it has been known for some time that the logistic distribution is very similar to the normal distribution because, conditional on θ, 1) both distributions are determined by a location and a scale parameter, and 2) both distributions are bell-shaped. At the same time, the logistic distribution has heavier tails than the normal distribution. The question then arises of how well G(η) can be approximated with a scaled version of the logistic distribution, F(Dη) = F(D a*(θ − b)), where D is known as the scaling constant. This produces the scaled discrimination a* = a/D, which is purportedly interpretable in the normal metric. Note also that D = a/a*, so that D can be conceptualized as the ratio of the logistic-metric discrimination to the normal-metric discrimination.
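One way to make the question of closeness concrete is to tabulate the largest absolute gap between G(x) and F(Dx) for a few candidate constants. The following is a minimal sketch, not part of the original study; the function names, the grid limits, and the choice of the maximum absolute gap as the criterion are assumptions made for illustration.

```python
import math

def normal_cdf(x):
    """Standard normal CDF G(x), computed from the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def logistic_cdf(x):
    """Standard logistic CDF F(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def max_abs_gap(D, lo=-6.0, hi=6.0, steps=12000):
    """Largest |G(x) - F(D x)| on an evenly spaced grid over [lo, hi]."""
    width = (hi - lo) / steps
    return max(
        abs(normal_cdf(lo + k * width) - logistic_cdf(D * (lo + k * width)))
        for k in range(steps + 1)
    )

# Candidate constants taken from the text; the grid is an illustrative choice.
for D in (1.7, 1.749, 1.814):
    print(f"D = {D}: max |G - F| = {max_abs_gap(D):.4f}")
```

The worst-case gap is only one possible criterion, and it need not rank the candidate constants in the same order as a divergence-based criterion.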


Several informal suggestions have included D = 1.814 [^{1}. Savalei [1] proposed choosing D to minimize the Kullback-Leibler divergence

$$K(g, f) = \int \ln\left[\frac{g(x)}{f(Dx)}\right] g(x)\, dx, \tag{4}$$


where g and f are the normal and logistic density functions, and integration is over ℝ. This results in the scaling constant D = 1.749. More recently, Pingel [^{2}. Minima of the expected value of other distance functions provide similar results. For example, minimizing the average absolute difference leads to D = 1.701. With this panoply of scaling options, the question arises: “Which one is best?” The scaling constant D = 1.749 seems to work best, but this answer is somewhat misleading, as shown below.
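The minimization behind Equation (4) can be reproduced numerically. The sketch below is an illustration under stated assumptions rather than the procedure used in [1]: it takes the density of the rescaled logistic variable to be D f(Dx), so that it integrates to one, approximates the integral with the trapezoid rule on [−10, 10], and locates the minimum with a golden-section search on [1, 3]. All function names are hypothetical.

```python
import math

def log_normal_pdf(x):
    """Log density of the standard normal distribution."""
    return -0.5 * x * x - 0.5 * math.log(2.0 * math.pi)

def log_logistic_pdf(y):
    """Log density of the standard logistic, written to avoid overflow."""
    t = abs(y)
    return -t - 2.0 * math.log1p(math.exp(-t))

def kl_normal_to_scaled_logistic(D, half_width=10.0, steps=4000):
    """Trapezoid-rule KL divergence from N(0, 1) to the logistic rescaled by D.

    Assumption: the rescaled logistic density is D * f(D x).
    """
    h = 2.0 * half_width / steps
    total = 0.0
    for k in range(steps + 1):
        x = -half_width + k * h
        log_g = log_normal_pdf(x)
        integrand = math.exp(log_g) * (log_g - math.log(D) - log_logistic_pdf(D * x))
        total += (0.5 if k in (0, steps) else 1.0) * integrand
    return total * h

def golden_section_min(func, lo, hi, tol=1e-6):
    """Minimize a unimodal function on [lo, hi] by golden-section search."""
    ratio = (math.sqrt(5.0) - 1.0) / 2.0
    c = hi - ratio * (hi - lo)
    d = lo + ratio * (hi - lo)
    fc, fd = func(c), func(d)
    while hi - lo > tol:
        if fc < fd:
            # The minimum lies in [lo, d]; the old c becomes the new d.
            hi, d, fd = d, c, fc
            c = hi - ratio * (hi - lo)
            fc = func(c)
        else:
            # The minimum lies in [c, hi]; the old d becomes the new c.
            lo, c, fc = c, d, fd
            d = lo + ratio * (hi - lo)
            fd = func(d)
    return 0.5 * (lo + hi)

D_kl = golden_section_min(kl_normal_to_scaled_logistic, 1.0, 3.0)
print(f"KL-minimizing D is approximately {D_kl:.3f}")
```

The search interval and integration grid are generous for this problem; narrower choices would likely give the same answer to three decimals.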

While it is theoretically established that the use of a scaling constant results in a close match between F and G, the current paper provides empirical verification and a new result. For this purpose, a simulation study was designed to compare estimates obtained with a logistic model (2PL) for data generated with a normal model (2PN). In theory, the estimated 2PL parameters scaled with D should be very close to the known 2PN parameters. In addition, the ability estimates of θ obtained with the 2PL model should closely match those of the known 2PN model. The steps in the simulation were as follows:

1) For 100 items, generate item parameters with discriminations a ~ lognormal (0.25, 0.25) and difficulties b ~ normal (0, 0.85). These generating distributions give adequate approximations to observed empirical distributions of item parameter estimates.

2) For 100,000 persons, generate ability parameters with θ ~ normal (0, 1). A large sample size is used to minimize the effects of estimation errors on the estimation of a scaling coefficient.

3) Estimate 2PL model parameters (a, b, θ) using the EM algorithm with 61 quadrature points, and

a) Compare â to the normal generating value a* for each item. If the scaling constant is accurate, it should be the case that D ≈ â/a* for all items.

b) Compare b̂ to the normal generating value of b for each item.

c) Compare θ̂ to the normal generating value of θ for each person.

The idea here is to obtain the ratio of the estimate of the 2PL a to its normal generating parameter. This ratio is the empirical scaling value D. This process should reveal which of the proposed scaling values is most accurate. Note that the b and θ parameters do not need to be scaled; their metric is typically defined by the identification restriction θ ~ normal (0, 1), which is employed in many IRT software packages.
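The logic of the ratio can be illustrated on a single item. The sketch below is a deliberate simplification and not the procedure used in the study: it treats θ as known and fits an ordinary logistic regression by Newton-Raphson instead of estimating a full 2PL model by EM. The generating values a* = 1 and b ∈ {0, 1}, the sample size, the seed, and all function names are assumptions chosen for illustration.

```python
import math
import random

def normal_cdf(x):
    """Standard normal CDF G(x)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def simulate_item(a_star, b, thetas, rng):
    """Draw 0/1 responses from the 2PN model: Y = 1 when U < G(a*(theta - b))."""
    return [1 if rng.random() < normal_cdf(a_star * (t - b)) else 0 for t in thetas]

def fit_logistic(thetas, ys, iterations=15):
    """Newton-Raphson fit of P(Y = 1) = F(c + a * theta); returns (c, a)."""
    c, a = 0.0, 1.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for t, y in zip(thetas, ys):
            p = 1.0 / (1.0 + math.exp(-(c + a * t)))
            w = p * (1.0 - p)
            g0 += y - p          # score for the intercept
            g1 += (y - p) * t    # score for the slope
            h00 += w             # entries of the information matrix
            h01 += w * t
            h11 += w * t * t
        det = h00 * h11 - h01 * h01
        c += (h11 * g0 - h01 * g1) / det
        a += (h00 * g1 - h01 * g0) / det
    return c, a

rng = random.Random(2017)
thetas = [rng.gauss(0.0, 1.0) for _ in range(20000)]  # theta ~ normal(0, 1)
for a_star, b in [(1.0, 0.0), (1.0, 1.0)]:
    ys = simulate_item(a_star, b, thetas, rng)
    c_hat, a_hat = fit_logistic(thetas, ys)
    b_hat = -c_hat / a_hat  # since c + a * theta = a * (theta - b)
    print(f"a* = {a_star}, b = {b}: a_hat / a* = {a_hat / a_star:.3f}, b_hat = {b_hat:.3f}")
```

Because θ is treated as known here, the ratios from this sketch are not directly comparable to marginal maximum likelihood results; they only show how an empirical scaling value arises as a ratio of a logistic estimate to a normal generating value.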

For the purpose of this simulation, the R software was used to randomly generate item responses Y_{ij} by (a) computing G(η_{ij}) in Equation (1) from the simulated parameters obtained in steps 1 and 2 above, (b) drawing a uniform random variate U on [0, 1], and (c) setting Y_{ij} = 1 if U < G(η_{ij}) and Y_{ij} = 0 otherwise. Parameter estimates for the 2PL model were obtained using flexMIRT [

To compare discriminations, the ratio of the logistic estimate â to its normal generating parameter a* was taken for each item. In theory, this ratio should be close to D for all items. Across items, the median ratio was 1.751 and the mean ratio was 1.756. This is very close to Savalei’s [

The b parameter estimates on average differed from the 2PN generating values by 0.002 on a unit normal scale, with a minimum difference of −0.021 and a maximum difference of 0.062. This indicates the 2PL IRT model provides estimates that are empirically very similar to those of a 2PN model. To study the ability parameter θ, estimated values were regressed on 2PN generating values. This resulted in an intercept of −0.003 and a slope of 0.981. A linear correlation of r = 0.989 was obtained. The plot (not shown) provided no evidence of nonlinearity.
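The regression check described above amounts to a simple least-squares fit plus a correlation. A minimal sketch follows; the stand-in estimates are generated with an arbitrary shrinkage factor and noise level that are purely hypothetical and are not the study's values, and the function name is an assumption.

```python
import math
import random

def simple_regression(x, y):
    """Return (intercept, slope, pearson_r) for the least-squares line of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return intercept, slope, r

# Hypothetical stand-in data: generating abilities plus noisy "estimates".
rng = random.Random(0)
theta_true = [rng.gauss(0.0, 1.0) for _ in range(5000)]
theta_hat = [0.98 * t + rng.gauss(0.0, 0.15) for t in theta_true]

intercept, slope, r = simple_regression(theta_true, theta_hat)
print(f"intercept = {intercept:.3f}, slope = {slope:.3f}, r = {r:.3f}")
```

An intercept near zero, a slope near one, and a high correlation would together indicate that the estimated abilities recover the generating values on the same metric.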

The empirical conclusion based on these results is nearly the same as the theoretical expectation: there is little difference between the 2PN and the scaled 2PL estimates of item parameters for items having nonextreme values of b. In a simulation not shown, this result was also verified for IRT models for partial credit data, using the logistic model [

So far, this paper has omitted consideration of the most fundamental question: Why scale? The term “accuracy” implies the normal metric is the correct one for obtaining IRT parameters, but this fundamental assumption is rarely recognized, let alone tested. The logistic function, due to its heavier tails, may even be preferable in situations involving noisy data. Ironically, one could even argue that D^{−1} should be used to scale 2PN item parameters to the logistic metric. In short, there is no necessary relationship between scaling and the accuracy of the IRT model.

The best choice of scaling in logistic IRT models would be not to scale at all, an approach taken in some current IRT software packages such as flexMIRT. The sole rationale for scaling with D is to establish historical continuity in interpreting the magnitude of item parameter estimates. More than two generations have passed since Alan Birnbaum’s suggested use of scaling [

Camilli, G. (2017) The Scaling Constant D in Item Response Theory. Open Journal of Statistics, 7, 780-785. https://doi.org/10.4236/ojs.2017.75055