Effects of Differential Item Discriminations between Individual-Level and Cluster-Level under the Multilevel Item Response Theory Model

This study examined how to interpret differential item discriminations between the individual and cluster levels by focusing on the patterns and magnitudes of item discriminations under the 2PL multilevel IRT model across a variety of simulation conditions. The consistency between the mean of the individual-level ability estimates and the cluster-level ability estimates was evaluated by the correlation between them. The two were highly correlated when the patterns of item discriminations were the same at both the individual and cluster levels. The magnitudes of the item discriminations had little effect on the correlations as long as the patterns were the same at the two levels. However, the correlation became lower when the patterns of item discriminations differed between the individual and cluster levels. The results also showed that the mean of the estimated individual-level abilities is not necessarily a good representation of the cluster-level ability when the patterns differ between the two levels.


Introduction
Multilevel modeling has become a popular data analysis technique in psychological and educational measurement. Traditional psychometric models, such as classical test theory and item response theory (IRT) models, do not account for a nested data structure. Multilevel modeling becomes important when researchers analyze nested data, because it takes into account both the within-cluster and between-cluster variation in the data. One popular multilevel modeling technique is the hierarchical generalized linear model (HGLM). However, when HGLM is applied to multilevel IRT [1], one limitation is that all item discriminations are assumed to be equal. In other words, the relationships between the observed measurement indicators and the latent factor are assumed to be equal for all items in a test, which is sometimes an unrealistic assumption. If item discriminations are allowed to vary freely, the model may more closely resemble the observed data.
IRT models define the relationship between observed item scores and latent constructs for dichotomous and polytomous item response data. The IRT framework has been extended to multilevel data structures [1]-[3]. A multilevel IRT model is desirable when item response data have been collected from a sample with a nested data structure. In addition to the benefit of modeling variation at both the between-cluster and within-cluster levels, relationships between variables at different levels can also be estimated better.
One popular form of an IRT model for dichotomously scored items is the 2-parameter logistic (2PL) model, where the probability of an individual correctly responding to an item depends on the individual's ability, the item difficulty, and the item discrimination. The 2PL IRT model can be written as

$$P(\theta_i) = \frac{\exp[a_j(\theta_i - b_j)]}{1 + \exp[a_j(\theta_i - b_j)]}, \tag{1}$$

where $P(\theta_i)$ is the probability that individual i with ability $\theta_i$ answers item j correctly, $\theta_i$ is the ability level for individual i, $a_j$ is the item discrimination parameter for item j, and $b_j$ is the item difficulty parameter for item j. The numerator and the second term in the denominator are the same quantity, $\exp[a_j(\theta_i - b_j)]$. The 2PL IRT model also allows the item discriminations to vary freely across items in a test. Its extension to a multilevel IRT model has been investigated and documented by several authors [4]-[9]. However, not much attention has been given to what it means when the patterns and/or magnitudes of item discriminations differ between the individual and cluster levels. For example, Fox [4] illustrated the importance of taking measurement errors from different sources into account in a multilevel analysis. Although the author presented a new approach to multilevel modeling with three real mathematics data sets using the 2PL IRT model, differences in item discriminations between levels were beyond his focus. Fox [5] focused on measuring latent dependent and independent variables of a multilevel model, where the manifest variables consist of binary, ordinal, or graded responses. This extension made it possible to model relationships between observed and latent variables at different levels using dichotomous and polytomous IRT models. However, different item discrimination patterns between levels were not a focus of that study. Natesan [7] studied how test length, sample size, the correlation between the predictor variable and the ability parameter, and the distributional shape of the predictor variable interact to affect the accuracy and precision of the item parameter estimates of the 2PL multilevel IRT model. Again, differential item discrimination patterns between levels were not a focus of that study.
Some authors have investigated cluster-level IRT modeling. For example, Mislevy [6] proposed the notion of a cluster-level IRT model by exploring the relationships between individual-level and cluster-level IRT models, as well as the parameter estimates under the cluster-level IRT model. Results showed that when item response data are gathered in a design of one item per scale from each individual, it is possible to define a cluster-level IRT model. The cluster-level ability estimate was analogous to the individual-level ability estimates. The cluster-level IRT model parameters specified the probability of a correct response to a given item from an individual selected at random from a given cluster. Estimation of the cluster-level IRT model parameters was a straightforward generalization of the individual-level IRT technique. Tate [8] used a simulation study to examine whether the two-parameter dichotomous cluster-level IRT model was robust to typical violations of distributional assumptions. Results showed that the estimated precision was always either approximately consistent with the actual precision or a conservative estimate of it. When the items were replaced to target high-ability schools, the conservatism increased as school ability moved away from the target ability range. Tate [9] extended this work to a similar study with a polytomous model, with results similar to the findings from [8]. It is notable that Tate [9] concluded that the estimate of cluster-level ability for a specific cluster could be viewed as the mean of the individual-level abilities of all individuals in that cluster if the individual abilities within each cluster are normally distributed with a mean equal to the cluster-level ability, and the cluster-level ability is also normally distributed. However, no discussion was provided regarding the effect of differential item discriminations between the individual and cluster levels. Our concern was whether the patterns and the magnitudes of the item discriminations between levels would affect the estimates of individual-level or cluster-level abilities differently.
Suppose we fit a 1PL multilevel IRT model. The pattern of item discriminations would be exactly the same for both the individual and cluster levels. In other words, only one pattern of item discriminations is obtained for both levels. The magnitudes of the item discriminations would also be exactly the same for both levels (e.g., 1.0 for all items across both levels). On the other hand, once the 2PL IRT model is applied to the multilevel model, the patterns and the magnitudes of the item discriminations could differ across levels. The item discriminations may have the same patterns and the same magnitudes at both levels, or the same patterns but different magnitudes between the individual and cluster levels. They may also have different patterns and different magnitudes between the two levels. However, the effects of differing patterns and magnitudes of item discriminations have not yet been demonstrated in the literature from the multilevel IRT modeling perspective.
If differential patterns or magnitudes of item discriminations between levels affect the estimates of individual-level and cluster-level abilities differently, the conclusion drawn by Tate [9] may not always be correct. Namely, the estimated cluster-level ability should not always be viewed as the mean of the individual-level abilities in the cluster. In that case, the mean of the individual-level abilities may over- or underestimate the cluster-level ability. In practice, it is not uncommon to estimate the cluster-level ability from the mean of the individual-level abilities of all individuals in that cluster. For example, it is very common to evaluate school-level performance by computing the mean of the estimated student abilities in each school. For these reasons, we attempted to investigate how the patterns and the magnitudes of the item discriminations at the individual and cluster levels would behave under the 2PL multilevel IRT model. It was also our intention to provide some insight into how we should interpret differential item discriminations between the individual and cluster levels.
An item discrimination indicates the quality of an item, because it dictates how strongly the item correlates with the ability being measured. A higher discrimination corresponds to a greater correlation with the ability, which also leads to a higher scoring weight for the item. Therefore, individuals who correctly answer items with high discriminations would have a higher estimated ability than individuals who correctly answer items with lower discriminations, given the same raw scores. In other words, it matters not only how many items were answered correctly, but also which items were answered correctly. Under the 2PL single-level IRT model, the individual ability estimates are weighted by item discriminations, such that

$$\sum_{j=1}^{N} a_j u_j = \sum_{j=1}^{N} a_j \frac{\exp[a_j(\theta_i - b_j)]}{1 + \exp[a_j(\theta_i - b_j)]}, \tag{2}$$

where $\theta_i$ is the ability level for individual i, $a_j$ is the item discrimination parameter for item j, $b_j$ is the item difficulty parameter for item j, and $u_j$ takes the values 0 and 1 corresponding to incorrect and correct responses of individual i on item j (j = 1, 2, …, N). However, this may or may not be the case under the 2PL multilevel IRT model.
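The weighting argument above can be illustrated numerically. The following is a minimal sketch, assuming that the ability estimate is the maximum likelihood solution of the standard 2PL likelihood equation, sum over j of a_j (u_j − P_j(θ)) = 0; the four-item test, the function names, and the bisection bounds are hypothetical and are not taken from the study.

```python
import math

def p_correct(theta, a, b):
    """2PL probability of a correct response for ability theta."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def ml_ability(responses, a, b, lo=-6.0, hi=6.0, tol=1e-8):
    """Solve sum_j a_j * (u_j - P_j(theta)) = 0 for theta by bisection.

    The left-hand side decreases in theta, so a positive value means
    the root lies above the midpoint.
    """
    def score(theta):
        return sum(aj * (uj - p_correct(theta, aj, bj))
                   for uj, aj, bj in zip(responses, a, b))
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if score(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Hypothetical four-item test: two weakly and two strongly discriminating items.
a = [0.8, 0.8, 2.2, 2.2]
b = [-0.5, 0.5, -0.5, 0.5]

# Two response patterns with the same raw score of 2.
low_items = [1, 1, 0, 0]    # correct on the low-discrimination items
high_items = [0, 0, 1, 1]   # correct on the high-discrimination items

theta_low = ml_ability(low_items, a, b)
theta_high = ml_ability(high_items, a, b)
print(f"raw score 2, low-a items correct:  theta = {theta_low:.3f}")
print(f"raw score 2, high-a items correct: theta = {theta_high:.3f}")
```

Both patterns have the same raw score, but the weighted score (the sum of a_j u_j) is larger for the second pattern, so its estimated ability is higher.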
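Since the 2PL multilevel IRT model allows item discriminations to vary at both the individual and cluster levels, we hypothesized that different patterns of item discriminations between the levels lead to different patterns of scoring weights between the levels. For this reason, our hypothesis was that the aggregated mean individual-level abilities would be close to the estimated cluster-level ability if the patterns of item discriminations are the same between the individual and cluster levels. However, if the patterns of item discriminations are different between the levels, we hypothesized that the aggregated mean individual-level abilities would not be close to the estimated cluster-level ability. We evaluated these hypotheses by computing the Pearson product-moment correlation between the aggregated mean individual-level abilities and the estimated cluster-level abilities. Note that the literature on the cluster-level IRT model [6] [8] [9] has cautioned against the use of the aggregated mean individual-level abilities due to inappropriate estimation of its standard errors. However, this study focused on the relationship between the aggregated mean individual-level abilities and the estimated cluster-level abilities with respect to the patterns of item discriminations at the individual and cluster levels, as an attempt to provide insight into how one should interpret differential item discrimination parameters between the levels.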
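The comparison between aggregated individual-level abilities and cluster-level abilities can be sketched as follows. This is an illustration only: the ability values are synthetic stand-ins for model estimates (individual values scattered around their cluster's value), and the helper names are hypothetical. It shows the aggregation and correlation steps, not the study's estimation procedure.

```python
import math
import random
import statistics

def pearson(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(2024)
n_clusters, cluster_size = 50, 50

# Synthetic placeholders for estimated abilities (assumption, not study output).
cluster_ability = [rng.gauss(0.0, 1.0) for _ in range(n_clusters)]
individual_ability = [
    [theta_k + rng.gauss(0.0, 1.0) for _ in range(cluster_size)]
    for theta_k in cluster_ability
]

# Aggregate the individual-level values to one mean per cluster.
cluster_means = [statistics.fmean(members) for members in individual_ability]

r = pearson(cluster_means, cluster_ability)
print(f"correlation between cluster means and cluster-level ability: {r:.3f}")
```

A correlation near 1 would indicate that the cluster mean of individual-level abilities tracks the cluster-level ability closely; lower values would indicate that the mean is a poorer proxy.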

Modeling
The 2PL multilevel IRT model was used in this study for both data generation and model fitting. In IRT, the response outcome data are treated as categorical with binomial distributions. The idea of a 2PL multilevel IRT model is to measure each latent variable incorporated in the multilevel model with the IRT model. The 2PL multilevel IRT model can be written as

$$y^{*}_{ijk} = a^{(i)}_{j}\,\theta^{(i)}_{ik} + a^{(k)}_{j}\,\theta^{(k)}_{k} - b_j + e_{ijk}, \tag{3}$$

where $y^{*}_{ijk}$ is the latent continuous response variable, with the observed response $y_{ijk} = 1$ if $y^{*}_{ijk} > 0$ and $y_{ijk} = 0$ otherwise, $a^{(i)}_{j}$ is the item discrimination parameter at the individual level, $\theta^{(i)}_{ik}$ is the unobserved latent ability level for individual i, $a^{(k)}_{j}$ is the item discrimination parameter at the cluster level, $\theta^{(k)}_{k}$ is the unobserved latent ability level for cluster k, $b_j$ is the item difficulty parameter for item j, and $e_{ijk}$ is the error term with the logistic distribution. Both $\theta^{(i)}_{ik}$ and $\theta^{(k)}_{k}$ were generated from the standard normal distribution, with a mean of zero and a standard deviation of 1.0.
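Data generation under a model of this kind can be sketched as below. This is a minimal sketch under stated assumptions, not the study's generating code: it assumes the latent response is the sum of an individual-level term, a cluster-level term, minus the item difficulty, plus a standard logistic error, with the observed response equal to 1 when the latent response is positive. The discrimination values, the evenly spaced difficulties, and the function name are hypothetical.

```python
import math
import random

def simulate_responses(a_ind, a_clu, b, n_clusters, cluster_size, seed=0):
    """Generate dichotomous responses from a two-level 2PL-type model.

    Returns a list of (cluster_index, responses) tuples, one per individual.
    """
    rng = random.Random(seed)
    data = []
    for k in range(n_clusters):
        theta_k = rng.gauss(0.0, 1.0)        # cluster-level ability
        for _ in range(cluster_size):
            theta_i = rng.gauss(0.0, 1.0)    # individual-level ability
            row = []
            for aj_i, aj_k, bj in zip(a_ind, a_clu, b):
                # Standard logistic error via the inverse CDF;
                # clamp u away from 0 and 1 to keep the log finite.
                u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
                e = math.log(u / (1.0 - u))
                y_star = aj_i * theta_i + aj_k * theta_k - bj + e
                row.append(1 if y_star > 0 else 0)
            data.append((k, row))
    return data

# Hypothetical 12-item test; evenly spaced difficulties are an assumption.
b = [-2.0 + 4.0 * j / 11 for j in range(12)]
a_ind = [0.8, 1.0, 1.2, 1.6, 2.2, 1.0, 0.8, 1.0, 1.2, 1.6, 2.2, 1.0]
a_clu = list(a_ind)   # same pattern and magnitude at both levels

data = simulate_responses(a_ind, a_clu, b, n_clusters=50, cluster_size=50)
print(len(data), "individuals,", len(data[0][1]), "items each")
```

Changing `a_clu` relative to `a_ind` (for example, rescaling it or reordering it) gives the same-pattern/different-magnitude and different-pattern cases discussed in the text.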

Simulation Conditions
It was assumed that the test consisted of 12 items with item difficulties ranging from −2.0 to 2.0. These item difficulties were fixed across conditions. The biserial correlations between the ability estimate and the propensity for a correct response were used to set up the item discriminations for the individual and cluster levels. Five item discriminations (0.8, 1.0, 1.2, 1.6, and 2.2) represented biserial correlations of 0.40, 0.50, 0.55, 0.66, and 0.77 between the latent trait and the propensity for a correct answer. From these values, 16 sets of item discriminations were created and classified into three patterns (see Table 1). The first pattern (Pattern A) was for conditions with the same patterns and the same magnitudes of discriminations at both levels. The second pattern (Pattern B) was for conditions with the same patterns but different magnitudes of discriminations between the levels, and the third pattern (Pattern C) was for conditions with different patterns and different magnitudes of discriminations between the levels. In addition, the number of clusters (2 levels: 50 and 100) and the cluster size (2 levels: 50 and 100) were manipulated. These three simulation factors were fully crossed, creating a total of 16 × 2 × 2 = 64 simulation conditions. Based on these specifications, dichotomous item response data were randomly generated and fit by the 2PL multilevel IRT model shown in Equation (3). The Mplus software, using the maximum likelihood estimator with robust standard errors, was used to estimate the model parameters.
First, for conditions with the same patterns and the same magnitudes of item discriminations at both levels (Pattern A), and for conditions with the same patterns of item discriminations but higher magnitudes at the individual level (Sets 4, 7, and 8 under Pattern B), the correlations were very high, in the range of [0.757, 0.928]. These results indicated that the patterns and the magnitudes of item discriminations between levels affected the estimates and the correlations between the aggregated mean individual-level abilities and the estimated cluster-level abilities.
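The mapping between the biserial correlations and the item discriminations in the simulation design is not spelled out in the text. One conversion that reproduces the reported values to one decimal place is a = (π/√3) · ρ / √(1 − ρ²), which rescales a normal-ogive discrimination by the standard deviation of the standard logistic distribution. The sketch below treats this formula as an assumption and also enumerates the crossed design; the function name is hypothetical.

```python
import itertools
import math

def discrimination_from_biserial(rho):
    """Assumed conversion from a biserial correlation to a logistic-metric
    discrimination: normal-ogive slope rho / sqrt(1 - rho^2), rescaled by
    pi / sqrt(3), the standard deviation of the standard logistic."""
    return (math.pi / math.sqrt(3.0)) * rho / math.sqrt(1.0 - rho ** 2)

biserials = [0.40, 0.50, 0.55, 0.66, 0.77]
reported = [0.8, 1.0, 1.2, 1.6, 2.2]   # discriminations stated in the text
implied = [discrimination_from_biserial(r) for r in biserials]

for rho, a_rep, a_imp in zip(biserials, reported, implied):
    print(f"rho = {rho:.2f}  reported a = {a_rep:.1f}  implied a = {a_imp:.3f}")

# Fully crossed design: 16 discrimination sets x 2 cluster counts x 2 cluster sizes.
conditions = list(itertools.product(range(1, 17), [50, 100], [50, 100]))
print(len(conditions), "simulation conditions")
```

If the authors used a different scaling constant or rounding rule, the implied values would shift slightly, so the agreement here should be read as a consistency check rather than as confirmation of their procedure.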