Item Response Theory Modeling of High School Students’ Behavior in a High-Stakes Exam

1. Introduction
Admissions to higher education institutions in Brazil are traditionally made through an entrance exam called the “vestibular”. Since 1995, however, a three-stage evaluation process has been adopted by some universities as an alternative to the vestibular. Here, we consider one such evaluation, administered by the University of Brasilia, called the “Serial Evaluation Program”, or PAS. In particular, we take recently released public data for the third stage of PAS for the years 2006 to 2008 and focus on the exam given on 7 December 2008 to 10,822 last-year high school participants.
The third stage of the PAS exam involves two sections: 1) a foreign language section; and 2) a general knowledge section covering Portuguese, math, physics, biology, chemistry, the arts, philosophy, geography, history, literature and sociology. Here, we concentrate on the second section and its 100 true-or-false items. The dataset is available at Figshare (https://doi.org/10.6084/m9.figshare.5882377.v1).
This is a high-stakes environment for the applicants [1] [2] [3] because the exam payoff means entering a top university. Under such circumstances, participants are expected to behave strategically [4].
This work considers item response theory [5] to model the participants’ behavior. In psychometrics, item response theory (IRT) constitutes a set of methodologies that allow for the estimation of intangible individual characteristics (or latent features), such as intelligence, personality traits, emotional states, proficiency and risk taking [5].
In particular, we postulate that the probability that a high school participant correctly answers an item on the exam depends both on intrinsic characteristics of the item, such as its degree of difficulty, and on the participant’s proficiency in the subject the item refers to. Acting strategically, the participant may also either provide an incorrect response or leave the question blank. Leaving the question blank is strategically better because answering incorrectly is a loss. When facing difficult questions, we assume the participant makes a decision taking into account both the item’s intrinsic difficulty and a latent feature that we call “propensity”.
Leaving a question blank may also reflect a participant’s low proficiency regarding the item as well as the propensity to avoid the loss accruing from answering incorrectly. Our model aims to recover information regarding the roles the latent features of proficiency and propensity play in a decision.
Section 2 introduces a model of proficiency and propensity based on item response theory. Section 3 analyzes the data using the model and presents the results found. Section 4 concludes the study.
2. An IRT Model of Proficiency and Propensity
Consider a group of n participants who take part in an exam made up of I items. Let $U_{ij}$ be a dummy variable that takes on value 1 if individual j answers item i (where $i = 1, \ldots, I$ and $j = 1, \ldots, n$), or value 0 if individual j does not answer item i. In addition, let $X_{ij}$ be another dummy such that $X_{ij} = 1$ if item i is correctly answered by individual j, and $X_{ij} = 0$ if item i is incorrectly answered or left blank by individual j.
Figure 1 displays the possible paths for result $(U_{ij}, X_{ij})$. First, individual j decides whether or not to answer item i. If individual j decides to answer, his or her answer may end up correct or incorrect. If he or she decides not to answer, then $X_{ij} = 0$. Thus, if $U_{ij} = 0$, then $X_{ij} = 0$ with probability 1.
Table 1 shows the joint probability distribution of variables $U_{ij}$ and $X_{ij}$. Their conditional probabilities are:

$$P(X_{ij} = 1 \mid U_{ij} = 1) = \frac{P(U_{ij} = 1, X_{ij} = 1)}{P(U_{ij} = 1)} \tag{1}$$

$$P(U_{ij} = 1 \mid X_{ij} = 0) = \frac{P(U_{ij} = 1, X_{ij} = 0)}{P(X_{ij} = 0)} \tag{2}$$

$$P(U_{ij} = 0 \mid X_{ij} = 0) = 1 - P(U_{ij} = 1 \mid X_{ij} = 0). \tag{3}$$
The probabilities P in (1), (2) and (3) are obviously related to i and j, but subscripts have been omitted for notational convenience.
Setting $P_1 = P(U_{ij} = 1 \mid X_{ij} = 0)$ and $P_2 = P(X_{ij} = 1)$, the joint probability distribution can be written as

$$P(U_{ij} = u, X_{ij} = x) = P_2^{\,x}\left[ P_1^{\,u} (1 - P_1)^{1-u} (1 - P_2) \right]^{1-x}, \tag{4}$$

where $u, x \in \{0, 1\}$ and the cell $(u, x) = (0, 1)$ has probability zero.
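To make the factorization in Equation (4) concrete, the following sketch in R (the language the study’s scripts were written in) tabulates the three admissible cells of Table 1 from given values of $P_1$ and $P_2$; the probability values are made up for illustration only.

```r
# Joint pmf of (U, X) from Equation (4):
# P(U = u, X = x) = P2^x * (P1^u * (1 - P1)^(1 - u) * (1 - P2))^(1 - x)
joint_pmf <- function(u, x, p1, p2) {
  p2^x * (p1^u * (1 - p1)^(1 - u) * (1 - p2))^(1 - x)
}

p1 <- 0.6  # illustrative P(U = 1 | X = 0)
p2 <- 0.3  # illustrative P(X = 1)

# The three admissible cells of Table 1; (u, x) = (0, 1) has probability zero
joint_pmf(1, 1, p1, p2)  # answered and correct:   0.30
joint_pmf(1, 0, p1, p2)  # answered and incorrect: 0.42
joint_pmf(0, 0, p1, p2)  # left blank:             0.28
```

The three probabilities sum to 1, as required of the joint distribution in Table 1.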
The conditional probabilities in (2) and (3) capture the trade-off faced by individual j of either responding to item i incorrectly or leaving the item blank. These two possibilities refer to the event $X_{ij} = 0$. However, treating missing data as incorrect is the least desirable way to account for missing-not-at-random responses in large-scale surveys [6], because a participant tends to leave blank those items he or she considers difficult. To stay in control, the participant picks those items that match his or her proficiency. Incorrect answers and nonresponses have the same payoff, but treating nonresponses the same as incorrect answers biases proficiency estimates [6] [7].
To remedy this deficiency, we consider the approach initiated by Knott et al. [8] and Albanese and Knott [9], and followed by many others [10] - [16].
Figure 1. Possible paths for result $(U_{ij}, X_{ij})$.
Table 1. Joint distribution between $U_{ij}$ (j does respond = 1; does not = 0) and $X_{ij}$ (correct = 1; incorrect = 0).
We introduce in our model a bivariate latent feature $(\theta_{1j}, \theta_{2j})$, where $\theta_{1j}$ is what we called propensity in the previous section, and $\theta_{2j}$ is proficiency. Propensity precisely means that responses to the items convey information regarding the participants who are more prone to answer incorrectly rather than not to answer. Our modeling strategy allows us to incorporate nonresponses explicitly into the analysis. In particular, we consider the terms in Equation (4) to be described by two-parameter logistic equations [8] [9]:
$$P_1 = P(U_{ij} = 1 \mid X_{ij} = 0) = \frac{1}{1 + \exp[-a_{1i}(\theta_{1j} - b_{1i})]} \tag{5}$$

$$P_2 = P(X_{ij} = 1) = \frac{1}{1 + \exp[-a_{2i}(\theta_{2j} - b_{2i})]} \tag{6}$$
where $a_{1i}$ and $a_{2i}$ are parameters related to the power of item i to discriminate among participants, and $b_{1i}$ and $b_{2i}$ are difficulty parameters related to item i [17] [18]. Here, subscript 1 refers to propensity, while subscript 2 refers to proficiency. The latent feature $\theta_{2j}$ is the proficiency of participant j, and $\theta_{1j}$ is the propensity of participant j to answer incorrectly.
Equations (5) and (6) give a precise meaning to the latent features. Propensity is defined exactly by Equation (5), while proficiency is defined by Equation (6). Propensity is related to the conditional probability of an incorrect response against the nonresponse option. Thus, propensity refers to making a mistake by choosing the incorrect response rather than making a mistake by leaving an item blank. Of note, a risk is involved in choosing, and thus risk taking is implicitly embedded in propensity.
When an item is not answered correctly, a participant 1) may have provided an incorrect response or 2) may have left the item blank. A high propensity means the participant picks the former. Because propensity is defined by a probability conditional on the space of incorrect items, here the correct decision is not to answer.
Propensity and proficiency latent features are usually considered in models of “nonignorable nonresponses” [6] [7]. Here, we consider a two-dimensional IRT model to deal with such nonignorable nonresponses in tests with dichotomous items. While the propensity dimension provides information about omission behavior, the proficiency dimension is related to a candidate’s ability.
Considering Equations (1)-(3), the latent variables $\theta_{1j}$ and $\theta_{2j}$ refer to the logit functions:

$$\operatorname{logit} P(U_{ij} = 1 \mid X_{ij} = 0) = a_{1i}(\theta_{1j} - b_{1i}) \tag{7}$$

$$\operatorname{logit} P(X_{ij} = 1) = a_{2i}(\theta_{2j} - b_{2i}). \tag{8}$$
Substituting the one-dimensional logistic Equations (5) and (6) into (4) yields the bidimensional model:

$$P(U_{ij} = u, X_{ij} = x) = \left\{\frac{1}{1 + \exp[-a_{2i}(\theta_{2j} - b_{2i})]}\right\}^{x} \left\{\frac{\exp[-a_{1i}(\theta_{1j} - b_{1i})(1-u)]}{1 + \exp[-a_{1i}(\theta_{1j} - b_{1i})]} \cdot \frac{\exp[-a_{2i}(\theta_{2j} - b_{2i})]}{1 + \exp[-a_{2i}(\theta_{2j} - b_{2i})]}\right\}^{1-x}. \tag{9}$$
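As a cross-check of this factorization, a minimal R sketch that simulates response patterns from model (9) might look as follows; the item parameters, latent values and sample size are hypothetical.

```r
# Simulate (U, X) pairs from model (9): draw X = 1 with probability P2;
# if X = 0, draw U = 1 with probability P1 (a correct answer implies U = 1).
two_pl <- function(theta, a, b) 1 / (1 + exp(-a * (theta - b)))

simulate_item <- function(theta1, theta2, a1, b1, a2, b2) {
  p1 <- two_pl(theta1, a1, b1)  # Equation (5)
  p2 <- two_pl(theta2, a2, b2)  # Equation (6)
  x <- rbinom(length(p2), 1, p2)
  u <- ifelse(x == 1, 1, rbinom(length(p1), 1, p1))
  cbind(u = u, x = x)
}

set.seed(1)
# One hypothetical item answered by five participants
simulate_item(theta1 = rnorm(5), theta2 = rnorm(5),
              a1 = 2.0, b1 = -0.5, a2 = 1.2, b2 = 0.8)
```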
This IRT model of proficiency and propensity is noncompensatory [18], in that the low proficiency of participant j in answering an item correctly, $\theta_{2j}$, cannot be compensated by his or her propensity, $\theta_{1j}$. We estimate the item-related parameters ($a_{1i}$, $b_{1i}$, $a_{2i}$, $b_{2i}$) by maximum likelihood, whereas the latent features ($\theta_{1j}$, $\theta_{2j}$) are estimated by the expected a posteriori method [17] [18]. All the scripts were built using the R language (https://cran.r-project.org/).
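The original scripts are not reproduced in the paper; as a rough illustration of the expected a posteriori step, the sketch below computes EAP scores for a single participant on a quadrature grid, assuming independent standard normal priors for $\theta_{1j}$ and $\theta_{2j}$ (an assumption on our part, since the priors are not stated here).

```r
# EAP scores for one participant on a quadrature grid, assuming
# independent N(0, 1) priors (our assumption; priors are not stated here).
eap_scores <- function(u, x, a1, b1, a2, b2,
                       grid = seq(-4, 4, length.out = 61)) {
  two_pl <- function(theta, a, b) 1 / (1 + exp(-a * (theta - b)))
  # Log-likelihood of the response pattern at one (theta1, theta2) node, model (9)
  loglik <- function(t1, t2) {
    p1 <- two_pl(t1, a1, b1)
    p2 <- two_pl(t2, a2, b2)
    sum(x * log(p2) +
        (1 - x) * (u * log(p1) + (1 - u) * log(1 - p1) + log(1 - p2)))
  }
  nodes <- expand.grid(t1 = grid, t2 = grid)
  post <- apply(nodes, 1, function(z) exp(loglik(z[1], z[2]))) *
    dnorm(nodes$t1) * dnorm(nodes$t2)
  post <- post / sum(post)  # normalized posterior over the grid
  c(theta1 = sum(post * nodes$t1), theta2 = sum(post * nodes$t2))
}

# Hypothetical usage with three items (parameters made up)
u <- c(1, 1, 0); x <- c(1, 0, 0)
eap_scores(u, x, a1 = c(2, 1.5, 1.8), b1 = c(-0.5, 0, 0.3),
           a2 = c(1.2, 0.9, 1.1), b2 = c(0.8, 0.2, 1.0))
```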
In item response theory, the items are usually evaluated taking into account their adhesion to an adjusted model [19]. In particular, for the joint distribution in Table 1 of an item i, its chi-square statistic is given by

$$\chi_i^2 = n \sum_{(u,x) \in \{(1,1),\,(1,0),\,(0,0)\}} \frac{(f_{ux} - \hat{p}_{ux})^2}{\hat{p}_{ux}}, \tag{10}$$

where $\hat{p}_{11}$, $\hat{p}_{10}$ and $\hat{p}_{00}$ are the aggregates of the estimates of the probabilities in model (9), and $f_{11}$, $f_{10}$ and $f_{00}$ are the corresponding empirical frequencies, that is, the ratio between the number of occurrences and the total number of participants. Under the null hypothesis that the model fits the joint distribution in Table 1, the chi-square statistic (10) has 2 degrees of freedom, as it depends on two random variables.
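A direct R transcription of statistic (10) for a single item might read as follows; the observed fractions, model aggregates and the use of n as the scaling factor reflect our reading of Equation (10) rather than the authors’ exact code.

```r
# Chi-square distance of Equation (10) for one item: expected counts n * p
# versus observed counts n * f over the cells (1,1), (1,0), (0,0)
chisq_item <- function(f, p, n) n * sum((f - p)^2 / p)

# Illustrative inputs: observed fractions and model-implied aggregates
f <- c(f11 = 0.30, f10 = 0.45, f00 = 0.25)
p <- c(p11 = 0.28, p10 = 0.47, p00 = 0.25)
stat <- chisq_item(f, p, n = 10822)
pchisq(stat, df = 2, lower.tail = FALSE)  # p-value with 2 degrees of freedom
```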
In particular, to assess the similarity between the expected and observed fractions, $\hat{p}_{ux}$ and $f_{ux}$, in a set of I items, for either $(u, x) = (0, 0)$ (nonresponses) or $(u, x) = (1, 1)$ (correct responses), we consider the Pearson correlation measures

$$r_{ux} = \frac{\sum_{i=1}^{I} (f_{ux,i} - \bar{f}_{ux})(\hat{p}_{ux,i} - \bar{p}_{ux})}{\sqrt{\sum_{i=1}^{I} (f_{ux,i} - \bar{f}_{ux})^2} \sqrt{\sum_{i=1}^{I} (\hat{p}_{ux,i} - \bar{p}_{ux})^2}}, \tag{11}$$

where $f_{ux,i}$ and $\hat{p}_{ux,i}$ denote the values for item i, and $\bar{f}_{ux}$ and $\bar{p}_{ux}$ are their averages over the I items.
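In R, the correlation measure (11) reduces to a single call to cor() across the I items, as in this placeholder sketch:

```r
# Pearson correlation (11) between observed and expected fractions across
# items, here for the nonresponse cell (u, x) = (0, 0); inputs are placeholders
set.seed(3)
f00 <- runif(100, 0, 0.4)  # observed nonresponse fractions, one per item
p00 <- pmin(pmax(f00 + rnorm(100, 0, 0.03), 0), 1)  # mock model-implied fractions
cor(f00, p00)              # r_00 in Equation (11)
```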
Next, we analyze the data and present the results from model (9).
3. Results
Figure 2(a) shows a funnel-shaped dispersion between the total of unanswered items, $\sum_i (1 - U_{ij})$, and the total of correct responses, $\sum_i X_{ij}$, for a participant. As the total of unanswered items rises, the variability of the total of correct responses plummets. Figure 2(b) shows a triangle-shaped dispersion between the total of unanswered items and the total of incorrect responses, $\sum_i U_{ij}(1 - X_{ij})$. While the distributions of the totals of correct and incorrect responses are roughly sinusoid, the distribution of the total of unanswered items reveals a concentration of zeros (only 12.5 percent of participants responded to all the items). The variability of correct and incorrect responses for the participants who did not leave items blank is high, thus suggesting they are likely to present a larger propensity $\theta_{1j}$.
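Given the n-by-I indicator matrices U and X defined in Section 2, the per-participant totals plotted in Figure 2 can be obtained with row sums, as in this sketch (the matrices are simulated here because the raw data are not reproduced):

```r
# Per-participant totals used in Figure 2, from indicator matrices
# U (answered) and X (correct), simulated here for illustration
set.seed(2)
U <- matrix(rbinom(50 * 10, 1, 0.8), nrow = 50)      # 50 participants, 10 items
X <- U * matrix(rbinom(50 * 10, 1, 0.5), nrow = 50)  # correct only if answered
blank     <- rowSums(1 - U)        # total of unanswered items
correct   <- rowSums(X)            # total of correct responses
incorrect <- rowSums(U * (1 - X))  # total of incorrect responses
head(cbind(blank, correct, incorrect))
```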
Figure 3 shows the percentage of nonresponses for each item for the four groups of disciplines listed in Table 2. As can be seen in Figure 3, the disciplines in groups II and III of Table 2 show more nonresponses (p-value = 0.0005; Kruskal-Wallis test, d.f. = 3). For this reason, model (9) will take this fact into account.
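The group comparisons reported here and below correspond to a standard Kruskal-Wallis call in R; the data frame in this sketch is a stand-in for the per-item nonresponse percentages.

```r
# Kruskal-Wallis comparison of per-item nonresponse percentages across the
# four discipline groups (stand-in data; the paper reports p = 0.0005, d.f. = 3)
set.seed(4)
items <- data.frame(
  nonresp = runif(100, 5, 40),  # percent of nonresponses per item
  group   = factor(sample(c("I", "II", "III", "IV"), 100, replace = TRUE))
)
kruskal.test(nonresp ~ group, data = items)
```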
Regarding the percentage of incorrect responses relative to all incorrect responses for the groups, that is,

$$\frac{\sum_{i \in g} \sum_{j=1}^{n} U_{ij}(1 - X_{ij})}{\sum_{i=1}^{I} \sum_{j=1}^{n} U_{ij}(1 - X_{ij})} \times 100\ \text{percent}, \quad g \in \{\text{I, II, III, IV}\}, \tag{12}$$

Figure 4 reveals an absence of pattern (p-value = 0.16; Kruskal-Wallis test, d.f. = 3). Figure 5 shows the dispersion of the empirical fractions of nonresponses, $f_{00}$, and of correct responses, $f_{11}$.
Figure 2. (a) Dispersion between the total of unanswered items and the total of correct responses; (b) dispersion between the total of unanswered items and the total of incorrect responses. Their respective marginal distributions are also shown.

Table 2. Groups of disciplines in the exam.

Figure 3. Percentage of nonresponses, by group. Groups II and III show more nonresponses.

Figure 4. The percentage of incorrect responses relative to all incorrect responses for the groups does not have a pattern.

Figure 5. Dispersion of the empirical values of $f_{00}$ and $f_{11}$.
Nonresponses $f_{00}$ are expected to drop as correct responses $f_{11}$ increase, as seen, for instance, for items 70, 71, 72, 73, 83 and 84 (highlighted in Figure 5). However, for the other items highlighted, this did not occur, and incorrect responses were more common than nonresponses. For instance, item 35 presented only 5.1 percent of nonresponses and 76.3 percent of incorrect responses (and 18.6 percent of correct responses).
Tables 3-6 show the parameter estimates of the items from marginal maximum likelihood. They also show the observed percentage frequencies along with the expected percentage frequencies from our adjusted model. The $\chi^2$ statistics and their corresponding p-values are also shown.
Figure 6 shows the dispersion between the discriminating power parameter and the difficulty parameter regarding propensity, by group. Most responses to the items allow a reasonable discriminating power (large $a_{1i}$) and present a low degree of difficulty (low $b_{1i}$). This result suggests those with a lower propensity to respond incorrectly prefer not to respond (that is, they give nonresponses).
Figure 7 shows the dispersion between the discriminating power parameter and the difficulty parameter regarding proficiency, by group. Now, responses to the items allow less power to discriminate the most proficient participants, apart from items 20, 77, 78, 82, 83, 95, 96, 118 and 119, whose discriminating power $a_{2i}$ and difficulty $b_{2i}$ stand out.
To illustrate this, first consider items 83 and 84 of group III, whose responses are “correct”. Their parameter estimates $a_{1i}$, $b_{1i}$, $a_{2i}$ and $b_{2i}$ are reported in Table 5. Thus, as for propensity, responses to the items allow for good discriminating power and have positive difficulty parameters. This suggests the responses to the items convey information regarding the participants who are more prone to respond incorrectly rather than not to respond.
Table 7 compares the expected joint percentage distribution of $U_{ij}$ and $X_{ij}$ from our model (9) with its empirical joint distribution (in parentheses). There is a poor adjustment to model (9) for items 83 ($\chi^2$ = 19.15; p-value < 0.001) and 84 ($\chi^2$ = 63.90; p-value < 0.001), if taken in isolation. However, if considered together with the other items from Group III, items 83 and 84 do not deviate significantly from the expected lines in Figure 9 and Figure 10.
As another example, consider items 70-75 from group II, where item 70 is “incorrect” and the remaining are “correct”. Parameter estimates for these items are presented in the previous Table 4. As for proficiency, such items have moderate discriminating power $a_{2i}$ and positive difficulty $b_{2i}$. Regarding propensity, the items show high discriminating power $a_{1i}$, with difficulty parameters $b_{1i}$ given in Table 4.
Table 8 shows the expected joint distributions from model (9) and the empirical ones.
Figure 6. Dispersion between the discriminating power parameter and the difficulty parameter regarding propensity, by group.
Figure 7. Dispersion between the discriminating power parameter and the difficulty parameter regarding proficiency, by group.
Table 7. Joint percentage distribution of $U_{ij}$ (responded = 1; did not respond = 0) and $X_{ij}$ (correct = 1; incorrect = 0): expected from model (9) and empirical (in parentheses). Items 83 and 84.
Table 8. Joint percentage distribution of $U_{ij}$ (responded = 1; did not respond = 0) and $X_{ij}$ (correct = 1; incorrect = 0): expected from model (9) and empirical (in parentheses). Items 70-75.
Again, apart from item 70, the items did not appear to adhere to the model item by item ($\chi^2$ statistics with p-values < 0.002). However, both the fractions of observed correct responses, $f_{11}$, and of observed nonresponses, $f_{00}$, fall near their corresponding expected lines given by model (9) (Figure 9 and Figure 10).
Figure 8 summarizes the $\chi^2$ distances between the observed distributions and those expected from model (9) for the items, by group. Horizontal dashed lines set apart the 52 items for which the model is better adjusted (9 items from Group I; 12 from II; 10 from III; and 21 from IV). For all the 52 items beneath the lines, $\chi^2 < 9.21$, the critical value at the 1 percent significance level. However, perhaps apart from the Group I items in Figure 9, there are more than 52 items whose observed frequencies $f_{00}$ and $f_{11}$ are similar to the expected frequencies from the model, $\hat{p}_{00}$ and $\hat{p}_{11}$, with high correlation measures $r_{ux}$ (Figure 9 and Figure 10).

Figure 8. $\chi^2$ distances (with d.f. = 2) between the observed distributions and those expected from model (9) for the items, by group. The items whose distances are statistically null fall below the horizontal dashed lines (critical value of 9.21 at the significance level of 1 percent), and thus are well adjusted to model (9).
A slightly different picture emerges from the Group I nonresponses (Figure 9), where the fractions of nonresponses, $f_{00}$, fall above the expected ones, $\hat{p}_{00}$.
Figure 11 shows the joint distribution between $\theta_{1j}$ and $\theta_{2j}$, by group. It suggests the existence of at least two types of participants. The clusters of dots at the top refer to the participants who do not leave items blank. Overall, less proficient participants (low $\theta_{2j}$) are less likely to respond incorrectly rather than leave an item blank (low $\theta_{1j}$). However, after a proficiency threshold, propensity tends toward the modal region of the distributions.
Figure 9. Dispersion between the observed $f_{00}$ and the expected $\hat{p}_{00}$ frequencies of nonresponses, by group (corresponding Pearson correlation coefficients $r_{00}$ in parentheses).
Figure 12 and Figure 13 show the relationship between score, S, and the latent variables $\theta_{1j}$ and $\theta_{2j}$, by group. Score and propensity present a low negative linear correlation (Figure 12). Solid lines show conditional mean values $E(S \mid \theta_{1j})$ that are adjusted nonparametrically using the LOESS method. From a threshold on, participants with higher propensities score lower. However, before this threshold is reached, expected scores lie on a plateau around which dispersions are funnel shaped. Moreover, as expected, participants who are more proficient tend to lie above the solid line.
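The conditional-mean curves in Figure 12 and Figure 13 can be reproduced with R’s built-in loess() smoother, roughly as below; the score and propensity vectors are placeholders shaped to mimic the plateau-then-decline pattern just described.

```r
# LOESS estimate of the conditional mean E(S | theta1), as in Figure 12
# (placeholder data shaped as a plateau followed by a decline)
set.seed(5)
theta1 <- rnorm(500)
S <- 50 - 5 * pmax(theta1 - 1, 0) + rnorm(500, 0, 8)
fit <- loess(S ~ theta1)
ord <- order(theta1)
plot(theta1, S, col = "grey", xlab = "propensity", ylab = "score")
lines(theta1[ord], predict(fit)[ord], lwd = 2)  # solid conditional-mean line
```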
Figure 13 shows that scoring and proficiency are positively, although nonlinearly, correlated. Participants who are more proficient tend to score more, and at a higher intensity (slope) than those who score lower. As expected, participants with higher propensities tend to lie below the solid line. For a given level of proficiency, $\theta_{2j}$, participants with higher propensities tend to score lower.
Figure 10. Dispersion between the observed $f_{11}$ and the expected $\hat{p}_{11}$ frequencies of correct answers, by group (corresponding Pearson correlation coefficients $r_{11}$ in parentheses).
However, as $\theta_{2j}$ rises, the dispersion of S lessens, and this dampens the effect of $\theta_{1j}$. Yet, as $\theta_{2j}$ is reduced, $\theta_{1j}$ impacts S more, and the conditional mean $E(S \mid \theta_{2j})$ tends to flatten.
4. Conclusions
This work considers item response theory to model 10,822 Brazilian high school students’ behavior in a high-stakes exam that may enable them to enter a top university. We put forward a model based on item response theory that highlights the role of latent features that we call “proficiency” and “propensity”.
The key strategic decision of a participant is to either risk an incorrect response or leave the question blank. Leaving the question blank is strategically better, because responding incorrectly is a loss. A participant then decides by taking into account both intrinsic difficulty and the latent feature of propensity.
Leaving a question blank may also reflect the participant’s low proficiency regarding the item as well as the propensity to avoid the loss accruing from responding incorrectly.
Figure 11. Dispersion between the latent features $\theta_{1j}$ and $\theta_{2j}$, by group. Solid red lines show conditional mean values adjusted nonparametrically using the LOESS method, and the total of unanswered items is also indicated.
Our model aims to recover information regarding the roles the latent features (proficiency and propensity) play in a decision.
In the model we set up, propensity is defined exactly by Equation (5), while proficiency is defined by Equation (6). Propensity means the propensity to respond incorrectly rather than not to respond. And (low) proficiency in responding correctly cannot be compensated by the propensity to respond incorrectly.
We estimate by maximum likelihood (using the R language) the parameters of discriminating power and of difficulty related to the items.
Figure 12. Dispersion between score S and propensity $\theta_{1j}$, by group. For each group, the linear correlations between S and $\theta_{1j}$ are, respectively, −0.20, −0.03, −0.17 and −0.14. Solid lines show conditional mean values $E(S \mid \theta_{1j})$ that are adjusted nonparametrically using the LOESS method.
Figure 13. Dispersion between score S and proficiency $\theta_{2j}$, by group. For each group, the correlations between S and $\theta_{2j}$ are, respectively, 0.91, 0.67, 0.60 and 0.76. Solid lines show conditional mean values $E(S \mid \theta_{2j})$ that are adjusted nonparametrically using the LOESS method.
Proficiency and propensity are estimated by the expected a posteriori method.
Based on the chi-square distances, 52 items out of 100 proved to be a good fit to the model. For each group, the overall adhesion of the data to our adjusted model was evaluated by the Pearson correlation coefficient $r_{ux}$ in Equation (11). Both the correct-response and the nonresponse fractions showed a strong agreement with the adjusted model (Figure 9 and Figure 10), with high values of $r_{ux}$.
This suggests that, in a group of items, both the decision of responding or not and the decision of responding correctly or not can be described by a two-dimensional logistic model, even if there are imperfections in the item-by-item adjustment.
Refraining from responding is found to depend on both the characteristics of the items and the latent features of the participants. In particular, the least proficient participants prefer to leave an item blank rather than respond to it incorrectly.
Scoring on the exam and propensity present a low negative linear correlation. Scoring and proficiency, in turn, are positively, although nonlinearly, correlated. Thus, for a given level of proficiency, after a threshold is reached, students with higher propensities score lower.
Acknowledgements
We acknowledge financial support from Cebraspe, CNPq and Capes.