The Use of Item Response Theory in Survey Methodology: Application in Seat Belt Data

Problem: Several approaches to analyzing survey data have been proposed in the literature. One method that is not widely used in survey research methodology is item response theory (IRT). Because accurate prediction of behavior is based on observed data, the design model must overcome computational challenges while also addressing calibration and proficiency estimation; the IRT model offers these latter options. We review the model and apply it to observational survey data, then compare the findings with those of the more popular weighted logistic regression. Method: An IRT model is applied to data observed at 136 sites within the Commonwealth of Virginia over five years, collected under a two-stage stratified systematic probability-proportional-to-size sampling plan. Results: A relationship within the data is found and is confirmed using weighted logistic regression model selection. Practical Application: The IRT method may offer simplicity and better predictive fit under complex sampling methodology; the model provides useful tools for survey analysis.

The sampling methodology used to collect the data has a two-stage design, with primary sampling units (PSUs) drawn from strata over 15 counties and secondary sampling units (SSUs) drawn from 136 road segments within those counties, under National Highway Traffic Safety Administration (NHTSA) guidelines [1]. If sampling weights are ignored, then the model parameter estimates can be biased [2]. In fact, since the sample is collected under a two-stage stratified sampling design, standard underlying assumptions of parametric statistical models may be violated, and guidelines based on the statistical design cannot be ignored. [3], [4], and [5] have given suggestions for such complex methodologies, and other authors have applied them in practice. Our intent is to apply the seatbelt sampling methodology to predict seatbelt usage. [6], [7], and [8] have used such methodologies and concluded that females are more likely to wear seatbelts than males. The relationship between vehicle type and seatbelt use has been explored by [9], [10], and [11], who concluded that seatbelt use in pickup trucks is lower than in other passenger vehicles. [12] suggested that passenger and driver use are related. [13] asserts that seatbelt use is higher in those states within the United States that have primary seatbelt enforcement laws and actively enforce seatbelt use. Studies have also explored relationships between race, socio-economic status, age, rural/urban environments, law enforcement type (primary, secondary), the amount of fines, and the type of road traveled (primary, secondary, tertiary). [14] employed a multivariate approach using the aforementioned factors along with cultural variables to explain the differences in seatbelt use between states using self-reported information, direct observation, and crash reports. However, the validity of self-reported seatbelt use in surveys is questionable compared to observed seatbelt usage [15].
While the methodology is simple to describe, the challenge lies in the statistical analysis tool used to make predictions, especially in the presence of behavioral variables such as driver gender, vehicle type, traffic volume, road segment length, weather conditions, driver cellphone use, passenger presence, lane, and passenger seatbelt use. The goal is to extract meaningful information that can be translated into quantitative measures. [16] and [17] propose the addition of a score variable to address this measurement concern; those researchers incorporated latent traits of the data into a score function.
The manuscript presents a comparison of the popular logistic regression with the Item Response Theory (IRT) model, in particular its simple version, the Rasch model [18].
Moreover, ignoring weights may lead to imperfection in the sample (as a departure from the reference population) and serious bias in latent variable models [19]. To avoid that problem, we apply a weight function. [14] cautioned about the use of other factors to develop more effective countermeasures for increasing seatbelt use. We propose the weighted logistic and IRT models after variable selection and compare the findings. The manuscript is organized as follows. In Section 2, we present the background of the data; we then build the reference model in Section 3. In Section 4, the weighting scales are built into the models and the IRT model is presented. We end with a conclusion in Section 5. The weighting was added so that information from the whole population would be captured. If the selection mechanism is not informative, the parameter estimates will remain consistent regardless of the weights, and weights should be excluded from the model [20]. Moreover, if the strata sample sizes are large enough, the parameter estimates are unbiased. In sampling surveys, it is not always possible to determine whether the weights are informative. However, the observations should reflect the sampling weights to avoid biased sampling.

Overview of Data
The data collected include the following observed binary variables: driver seatbelt use (yes, no), driver gender (female, male), passenger present (yes, no), passenger seatbelt use (yes, no), and driver cellphone use (yes, no).

Unweighted Analysis and Results
Generalized linear models are usually considered in the investigation of such data. First, a classic linear model was suggested to obtain a general relationship between the response (driver seatbelt use) and the predictor variables. However, use of a linear model on binary responses is not recommended [21], since predicted values may fall outside the domain of the response variable. From this point forward, a classic model, also known as classical test theory (CTT), is considered. We consider first fitting a logistic model to the data.

Logistic Model
In this model, p = P(Y = 1) is the probability that the driver is wearing a seatbelt, and 1 − p = P(Y = 0) is the probability that the driver is not wearing a seatbelt. The initial model is:

logit(p) = ln[p/(1 − p)] = β_0 + β_v X_v + β_r X_r + β_g X_g + β_s X_s + β_l X_l + β_c X_c + β_w X_w + β_pp X_pp + β_ps X_ps,

where β_0 denotes the intercept of the model, X_v denotes vehicle type (car, truck, SUV, van, or minivan), X_r denotes road classification for VMT (low, average, high), X_g denotes driver gender (male/female), X_s denotes the road segment length in miles, X_l denotes the lane in which the vehicle was observed (right to left), X_c denotes driver cellphone use (yes/no), X_w denotes weather (clear, light rain, cloudy, foggy, or clear but wet), X_pp denotes passenger present (yes/no), and X_ps denotes passenger seatbelt use (yes/no). This notation is used consistently throughout this manuscript.

Analysis of the effects of weather on seatbelt use revealed inconsistent associations between seatbelt use and weather severity over the five years. Further, the selection process does not identify weather as significant for any combined data. Hence, weather was removed from the model and the analysis repeated. Analysis of the predictor variables reveals a high correlation (Spearman's correlation coefficient r_s = −0.94, p-value < 0.0001) between road segment length and road class, which indicates a confounding condition. Other correlations are less than 0.15 and do not indicate the presence of other confounding effects. As a result, road segment length was removed from the model and the analysis performed again. For the combined data, all remaining predictors are significant at p = 0.01, while passenger presence is removed due to a p-value > 0.15. The global null hypothesis test indicates significant evidence exists (p < 0.0001) to support the claim that the models are not explained solely by the intercept (i.e., the response is not a constant) for all four presented models, which is consistent with the Wald test results in Table 1.
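As an illustration of fitting a logistic model of this form, the following minimal sketch estimates the coefficients by iteratively reweighted least squares (Newton-Raphson) on synthetic data. The variable coding and coefficient values here are hypothetical and are not the paper's estimates.

```python
import numpy as np

def fit_logit(X, y, n_iter=25):
    """Fit logit(p) = X @ beta by iteratively reweighted least squares.

    X must include a leading column of ones for the intercept beta_0.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))          # fitted probabilities
        W = p * (1.0 - p)                            # diagonal IRLS weights
        # Newton step: beta += (X' W X)^{-1} X'(y - p)
        beta += np.linalg.solve((X * W[:, None]).T @ X, X.T @ (y - p))
    return beta

# Illustrative synthetic data: X_g = driver gender (1 = female),
# X_c = driver cellphone use (1 = yes)
rng = np.random.default_rng(0)
n = 5000
Xg = rng.integers(0, 2, n)
Xc = rng.integers(0, 2, n)
X = np.column_stack([np.ones(n), Xg, Xc])
true_beta = np.array([1.0, 0.8, -0.5])              # assumed, for illustration
y = (rng.random(n) < 1 / (1 + np.exp(-X @ true_beta))).astype(float)

beta_hat = fit_logit(X, y)
```

With 5000 observations the estimates land close to the generating coefficients; in practice one would use a survey-aware routine, but the estimating equations are the ones shown.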
Goodness of fit is assessed using Akaike Information Criterion (AIC) values [22], displayed in Table 3: smaller numbers indicate a better fit. For a least-squares fit, the AIC is defined as

AIC = N ln(SS_r/N) + 2p,

where p is the number of parameters in the model, SS_r is the residual sum of squares, and N is the number of observations in the dataset. The AIC values for the logistic regression performed on the significant variables identified during the selection process are in the tens of thousands. Since the intercept alone is not a sufficient explanation of the model, we use the values for the intercept and covariates. The AIC values obtained for the individual years are approximately 30% lower than those obtained by [14]; however, the value for the combined data is significantly higher. The significantly higher numbers for the combined data indicate a significant amount of variation in the model, or a less than optimal fit.
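The least-squares form of the AIC given above can be computed directly. A minimal sketch, with illustrative (not actual) sample values, shows how the penalty term 2p trades off against residual fit:

```python
import math

def aic_gaussian(n_params, ss_res, n_obs):
    """AIC for a least-squares fit: N * ln(SS_r / N) + 2p."""
    return n_obs * math.log(ss_res / n_obs) + 2 * n_params

# Hypothetical comparison: adding 5 parameters barely reduces SS_r,
# so the smaller model wins on AIC.
aic_small = aic_gaussian(n_params=4, ss_res=120.0, n_obs=500)
aic_large = aic_gaussian(n_params=9, ss_res=118.0, n_obs=500)
```

Here the modest drop in SS_r (120 to 118) does not pay for the extra parameters, so `aic_small < aic_large` and the parsimonious model is preferred.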

Variable Standardization and Reclassification
Since the vehicle types are listed in no particular order, vehicle type is reclassified to indicate the size of the vehicle, which correlates negatively with driver seatbelt use: in general, drivers of larger vehicles tend to wear seatbelts less often than drivers of smaller vehicles, as suggested in [9]. Preliminary analysis of the data appears to support this hypothesis, so smaller vehicle types are given a larger value to indicate that the driver is more likely to wear a seatbelt.
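A minimal sketch of this ordinal recoding, assuming the third-quartile coding reported later in the manuscript (1/3 = SUV/van/truck, 2/3 = minivan, 1 = car); the function name and dictionary are illustrative, not the authors' implementation:

```python
# Larger value = smaller vehicle = driver more likely to wear a seatbelt.
VEHICLE_SIZE_SCORE = {
    "truck": 1 / 3, "van": 1 / 3, "suv": 1 / 3,
    "minivan": 2 / 3, "mini-van": 2 / 3,
    "car": 1.0,
}

def recode_vehicle(vehicle_type: str) -> float:
    """Map an unordered vehicle-type label to its ordinal size score."""
    return VEHICLE_SIZE_SCORE[vehicle_type.lower()]

car_score = recode_vehicle("Car")
truck_score = recode_vehicle("Truck")
```

The ordering `car_score > truck_score` encodes the hypothesized direction of the vehicle-size effect.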

Model Fitting after Standardized and Reclassified Variables
The logistic selection process with p = 0.15 for entry and retention in the model is performed on the reclassified and standardized variables. The significant variables identified prior to standardization in Section 3.2 above remain significant (Table 6).
The model fit statistics are comparable to the previous analysis (Table 7). The global null hypothesis test indicates that the model is not sufficiently described solely by the intercept (Table 8). All variables selected are significant (p-value < 0.0001) for all datasets analyzed. In this analysis, it is reasonable to select the model fit by the combined 2012-2016 data:

logit(p) = β_0 + β_v X_v + β_r X_r + β_g X_g + β_l X_l + β_c X_c + β_pp X_pp.
The variable significance is displayed in Table 6, and the fit estimates are shown in Table 7. The AIC and SC numbers remain undesirably large (Table 8) and indicate that reclassification and standardization are not sufficient to improve the model fit. Therefore, we investigate the cause of the poor model fit.
In all the previous sections, the AIC, BIC, and log-likelihood have been used as measures of goodness of fit for the most parsimonious models. They turn out to be high, which is evidence of over-dispersion: there is more variability in the data than expected under the fitted model, indicating a poor fit. Since the sample size is large, the corrected AIC does not lead to better improvements. Variables have been selected for each dataset, and the selection process results in similar models. We will use these criteria for comparison when adding the weights to the models considered in the next section.

Weights
In all of the above analyses, the weights associated with the data were ignored.
However, driver seatbelt behavior is intricate and almost certainly involves non-collected data. Ignoring sample weights leads to inflated standard errors and biased estimates [2]. [3] provide guidelines for analyzing weighted and designed data, which reduce the bias that would result from oversampled strata.
The weights are based on stratum size and the length of the road segments. The inclusion of weights results in a significantly different model than the one selected in Section 3 above, as inferred by [5]. Additionally, the goodness-of-fit criteria are significantly reduced (improved). The sampling plan for the data in this manuscript was developed as a joint effort between two of the authors (N. Diawara and B.E. Porter) and NHTSA. Therefore, in order to correct for bias due to stratum size and length of road segment, we included the weight designed for this analysis in our model, in accordance with NHTSA requirements [1].
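The weighted fit solves the same estimating equations as the unweighted logistic model, with each observation's contribution multiplied by its sampling weight (a pseudo-likelihood approach). The sketch below illustrates this; the exact NHTSA weight formula is not reproduced here, so the weights are hypothetical, taken proportional to stratum size times segment length:

```python
import numpy as np

def fit_weighted_logit(X, y, w, n_iter=25):
    """Survey-weighted logistic fit: IRLS with each observation's score
    and information contribution scaled by its sampling weight w."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        Wdiag = w * p * (1.0 - p)
        beta += np.linalg.solve((X * Wdiag[:, None]).T @ X,
                                X.T @ (w * (y - p)))
    return beta

# Hypothetical weights: stratum size times road-segment length,
# normalized to sum to the sample size.
rng = np.random.default_rng(1)
n = 4000
stratum_size = rng.integers(50, 500, n).astype(float)
segment_len = rng.uniform(0.1, 2.0, n)
w = stratum_size * segment_len
w *= n / w.sum()

X = np.column_stack([np.ones(n), rng.integers(0, 2, n)])
true_beta = np.array([0.5, 0.7])                     # assumed, for illustration
y = (rng.random(n) < 1 / (1 + np.exp(-(X @ true_beta)))).astype(float)
beta_hat = fit_weighted_logit(X, y, w)
```

Because these synthetic weights are non-informative, the weighted estimates agree with the unweighted ones up to sampling noise; with a real informative design the two fits would diverge, which is the point of including the weights.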

Model Fitting: Weighted Logistic Regression
Prior to performing the analysis on the reclassified and standardized variables, the 75th percentiles for the weighted reclassified variables are determined. The weighted third-quartile values are the same as the unweighted values listed in Table 5.
The selection process using the weighted logistic regression model identifies the significant variables (Table 9). There appears to be an increase over time in the significance of cellphone use as a predictor of driver seatbelt use (p > 0.15 to p ≈ 0.05). The model is significant as indicated by the global null hypothesis test in Table 10.
There is a significant decrease in the AIC when the weights are added to the model, consistent with the observation of [24] that, in the context of behavioral ecology, a simple controlled model does not capture all the complexity of the data (Table 11).

Model Selection: Weighted Logistic Regression
The final model selected for the 2012-2016 aggregate data is

logit(p) = β_0 + β_v X_v + β_g X_g + β_c X_c + β_pp X_pp,

where β_0, β_v, β_g, β_c, and β_pp are the estimates calculated using the weights.
As expected, the combination of the data results in an improvement in the model fit. The IRT approach assigns to all drivers a latent trait that can be described by a score function. We applied such a model based on specified traits that reflect the dichotomy of the data, such as gender, and made comparisons. We then compared the efficiency and effectiveness of the overall indicators by computing goodness-of-fit statistics.

Model
Because the model requires consideration of several conditions, the Rasch model is considered, as it provides a tool to analyze characteristics even when they are latent. Such a model belongs to the class of IRT models in the framework proposed by [17]. Driving habits can be seen as a variable that depends on many factors.
Our primary focus is on seatbelt use and on indicators that give additional information to evaluate seatbelt use. We propose to extend the theory of logistic regression to include characteristics associated with driver seatbelt use, translated into the driver's condition as an associated score. In such a context, the Rasch model ([18] [25]) is an option in which we can include each driver's behavior regarding seatbelt use. One main concern is the associated measurement of the score, which is based on qualitative information that must be translated into a quantitative measure. Using ideas from [26], we develop a score function that captures the sensitive attributes and behaviors of drivers. As mentioned in [27], bias reduction is achieved through appropriate weight adjustments.
A score function is built using a linear combination of significant predictor variables. The proposed score attempts to capture the features of vehicle type driven, driver gender, passenger presence, and driver cellphone use. These features can alter the probability of seatbelt use, and they can be seen as sufficient statistics for the response (see [16]). In our case, based on the logistic analysis of driver seatbelt use, we propose a score function composed of driver gender, vehicle type, passenger presence, and handheld cellphone use, as follows:

S = X_g + X_v + X_pp + X_c,

where X_g = driver gender (male = 0 and female = 1), X_v = size of vehicle driven standardized by the 3rd quartile (1/3 = SUV/van/truck, 2/3 = minivan, and 1 = car), X_pp = passenger presence (present = 1 and not present = 0), and X_c = driver cellphone use (no = 0 and yes = 1).
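The score S can be computed directly from the coded variables. A minimal sketch (the function name is illustrative):

```python
def driver_score(gender: int, vehicle_size: float,
                 passenger: int, cellphone: int) -> float:
    """Score S = X_g + X_v + X_pp + X_c with the paper's coding:
    gender: male = 0, female = 1; vehicle_size: 1/3, 2/3, or 1;
    passenger: present = 1; cellphone: yes = 1."""
    return gender + vehicle_size + passenger + cellphone

# Example: female minivan driver with a passenger, not on the phone.
s = driver_score(gender=1, vehicle_size=2 / 3, passenger=1, cellphone=0)
```

For this example S = 1 + 2/3 + 1 + 0 = 8/3; the score ranges from 1/3 (male SUV/van/truck driver, alone, no phone) to 4 (female car driver with a passenger, using a phone).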
The final model is a weighted logistic regression of driver seatbelt use on the score: logit(p) = β_0 + β_1 S.

Results
The logistic regression analysis yields the parameter estimates and standard errors shown in Table 12.
The AIC values (Table 13) are comparable to the AIC values in the weighted logistic analysis in Section 4.2.1, indicating a satisfactory fit of the model. The model is significant as indicated by the global null hypothesis test given in Table 14. The odds ratio estimate and its confidence interval are provided in Table 15. Figure 2 shows the regression line and 95% confidence limits for the predicted probability of seatbelt use versus the weighted score function. The narrow confidence band and the linear upward trend also indicate a satisfactory fit of the model to the data. All of these results are consistent with the bias-reduction findings of [27], even in the nonresponse situation, and provide an improvement on their suggested approach. The present IRT model offers many more advantages than the classical test theory (CTT) methods developed in Section 3. The model is parsimonious and allows driver seatbelt behavior to be easily estimated from scaled psychometric item measures under a weighted design model.

Conclusions
Driver seatbelt use in the Commonwealth of Virginia may be satisfactorily described using driver gender, vehicle type, passenger presence, and cellphone use in a multivariate logistic model with weights designed specifically for the dataset. However, prediction of seatbelt behavior is more appropriate using item response theory. As such, we have built a score function considering driver gender, vehicle type driven, passenger presence, and cellphone usage by applying the IRT model with weights. Fitting a weighted model results in significant improvements in goodness-of-fit statistics, such as AIC values, by a factor of approximately 20.
We suggest that a weighted IRT model is more appropriate, and it may also potentially include other factors. Such a model could be used to develop safety programs and further applications of IRT models.