
Item response theory (IRT) is a modern test theory that has been used in various aspects of educational and psychological measurement. The fully Bayesian approach shows promise for estimating IRT models, but because it is computationally expensive, its use in practical applications is limited. It is hence important to seek ways to reduce the execution time. A suitable solution is the use of high performance computing. This study focuses on the fully Bayesian algorithm for a conventional IRT model so that it can be implemented on a high performance parallel machine. Empirical results suggest that this parallel version of the algorithm achieves a considerable speedup and thus reduces the execution time substantially.

Item response theory (IRT) provides measurement models that describe a probabilistic relationship between correct responses on a set of test items and a latent trait. With many advantages (see [

Parameter estimation provides the basis for the theoretical advantages of IRT and has been a major concern in the application of IRT models. Although the influence of items and persons on the responses is modeled by distinct sets of parameters, the simultaneous estimation of these parameters in IRT models results in statistical complexities in the estimation task, which have made the estimation procedure a primary focus of psychometric research over decades [11-14]. Recently, owing to the availability of high-performance computing technology, attention has turned to fully Bayesian estimation procedures, which offer a number of advantages over traditional methods (see e.g. [15,16]). Albert [

High performance computing (HPC) employs supercomputers and computer clusters to tackle problems requiring complex computations. HPC utilizes the concept of parallel computing to run programs in parallel and thereby reduce the execution time, part of which is communication time that depends on the size of the messages exchanged between computers. In parallel computing, many large-scale applications and algorithms use the Message Passing Interface (MPI) standard to achieve better performance. The MPI standard is an application programming interface (API) that abstracts the details of the underlying architecture and network. Some examples of applications that use MPI are crash simulation codes, weather simulations, and computational fluid dynamics codes [

In view of the above, parallel computing can potentially help reduce the time needed to implement MCMC for the 2PNO IRT model, and as the size of the data and/or the chain increases, the benefit of using parallel computing grows. However, parallel computing is known to excel at tasks that process discrete units of data that are not interdependent. Given the high data dependencies in a single Markov chain for IRT models, such as the dependency of each state of the chain on the previous state and the dependencies among the data within the same state, the implementation of parallel computing is not straightforward. The purpose of this study is hence to overcome this problem and develop a high performance Gibbs sampling algorithm for the 2PNO IRT model using parallel computing. This paper focuses on all-to-one and one-to-all broadcast operations. The aim is to achieve a high speedup while keeping the cost down, where the cost of solving a problem on a parallel system is defined as the product of the parallel runtime and the number of processing elements used.

The remainder of the paper is organized as follows. Section 2 reviews the 2PNO IRT model and the Gibbs sampling procedure developed by Albert [

The 2PNO IRT model provides a fundamental framework in modeling the person-item interaction by assuming one ability dimension. Suppose a test consists of k multiple-choice items, each measuring a single unified ability, θ. Let y = [y_{ij}]_{n×k} represent a matrix of n examinees' responses to k dichotomous items, so that y_{ij} = 1 if person i answers item j correctly and y_{ij} = 0 otherwise. The probability of person i obtaining a correct response for item j can be defined as

P(y_{ij} = 1 | θ_i, α_j, γ_j) = Φ(α_j θ_i − γ_j),   (1)

where α_j and γ_j denote item parameters, θ_i denotes the continuous person trait parameter, and Φ denotes the unit normal cdf.

The Gibbs sampler involves updating three sets of parameters in each iteration, namely, an augmented continuous variable Z_{ij} (which is positive if y_{ij} = 1 and negative if y_{ij} = 0), the person parameter θ_i, and the item parameters ξ_j = (α_j, γ_j)′, from their respective full conditional distributions, namely,

Z_{ij} | θ_i, ξ_j ~ N(α_j θ_i − γ_j, 1), truncated to Z_{ij} > 0 if y_{ij} = 1 and Z_{ij} ≤ 0 if y_{ij} = 0,   (2)

θ_i | Z, ξ ~ N( Σ_{j=1}^{k} α_j(Z_{ij} + γ_j) / (1 + Σ_{j=1}^{k} α_j²), 1 / (1 + Σ_{j=1}^{k} α_j²) ),   (3)

ξ_j | Z, θ ~ N( (x′x)^{−1} x′Z_j, (x′x)^{−1} ) I(α_j > 0),   (4)

where x = [θ, −1], assuming θ_i ~ N(0, 1), and Z_j denotes the jth column of Z (see [17,22]).

Hence, with starting values θ^(0) and ξ^(0), observations can be simulated from the Gibbs sampler by iteratively drawing from the respective full conditional distributions specified in Equations (2), (3) and (4). To go from (Z^(l), θ^(l), ξ^(l)) to (Z^(l+1), θ^(l+1), ξ^(l+1)), it takes three transition steps:

1) Draw Z^(l+1) ~ p(Z | θ^(l), ξ^(l));

2) Draw θ^(l+1) ~ p(θ | Z^(l+1), ξ^(l));

3) Draw ξ^(l+1) ~ p(ξ | Z^(l+1), θ^(l+1)).

This iterative procedure produces a sequence of (θ^(l), ξ^(l)), l = 0, ···, L. To reduce the effect of the starting values, early iterations in the Markov chain are treated as burn-in and discarded. Samples from the remaining iterations are then used to summarize the posterior densities of the item and ability parameters. The algorithm takes less than 13 minutes for a 2000-by-10 dichotomous (0-1) data matrix and 10,000 total iterations when implemented in Fortran using the Microsoft Powerstation 4.0 compiler and the IMSL Fortran numerical library [

The study was performed on the Maxwell Linux cluster, a cluster with 106 processing nodes. Maxwell uses the message passing model via the MPICH implementation of the MPI framework. One of the 106 nodes acted as the root node, while the rest acted as slave nodes. The root node was responsible for generating and partitioning the matrix y, transmitting the submatrices, updating and broadcasting θ, and recording the execution time, in addition to performing the same duties as the slave nodes.

Each node on the cluster has a dual-CPU quad-core Intel Xeon processor clocked at 2.3 GHz, 8 GB of RAM, 90 TB of storage, and a 64-bit Linux operating system. MPICH allows the user to choose how many nodes to use before executing a program, so that a different number of processing nodes may be used in each execution.

When decomposing a problem for parallel computation, the first method to consider is domain decomposition. In domain decomposition, the data associated with the problem are partitioned and a set of computations is assigned to each partition [

With this approach, the first processing node, P_{0}, receives a submatrix of y, y_{0}, of size n × g that corresponds to the elements of the y matrix from y_{0,0} to y_{n–1, g–1}, where g = k/P and P is the number of processing nodes. The second processing node, P_{1}, receives a submatrix of y, y_{1}, of size n × g that corresponds to the elements of the y matrix from y_{0, g} to y_{n–1, 2g–1}, and so forth. Consequently, each processing node updates the Gibbs samples as in the serial algorithm, but with a smaller input data set. That is, instead of operating on the whole input matrix y, each node operates on a part of it of size n × g.

Z, α, and γ are decomposed in the same columnwise fashion: P_{0} updates a block of Z, Z_{0}, from Z_{0,0} to Z_{n–1, g–1}; a block of α, α_{0}, from α_{0} to α_{g–1}; and a block of γ, γ_{0}, from γ_{0} to γ_{g–1}.

Since θ is of size n × 1 (a column vector), it is not decomposed. However, a problem arises with the update of θ. For simplicity, consider the update of the first element of θ, which requires the updated α, γ, and the first row of Z. Yet, any given processing node has only a part of α, γ, and the first row of Z. The solution is to assign one of the processing nodes (e.g., the root) to update θ and broadcast it to the rest of the units. The naïve approach to update θ would be to have all the units send their part of α, γ and Z to the root so that it has the complete Z, α and γ to update θ from Equation (3) and then broadcast θ to the rest of the nodes. A problem with this approach is that the data communicated are too large, which causes the parallel algorithm to take a longer execution time than the serial algorithm.

A better approach is one that minimizes the communication cost. This can be achieved by having every processing node p calculate the partial sums ψ_i^{(p)} = Σ_{j∈p} α_j(Z_{ij} + γ_j) and τ^{(p)} = Σ_{j∈p} α_j² over its own g columns and send ψ^{(p)} and τ^{(p)} to the root for it to update θ from

θ_i | Z, ξ ~ N( Σ_p ψ_i^{(p)} / (1 + Σ_p τ^{(p)}), 1 / (1 + Σ_p τ^{(p)}) ),

which is identical to Equation (3), with the sums over all k items assembled from the per-node partial sums.

This way, each processing node sends a vector of size n + 1 to the root and one message of size n is broadcast by the root. The total data transferred between all the nodes by this approach is

lP(n + 1) + lPn = lP(2n + 1).

As a comparison, the total data transferred between all the nodes by the naïve approach is

lPg(n + 2) + lPn = lP(gn + 2g + n),

which equals lP(2n + 2) when g = 1, lP(3n + 4) when g = 2, and so forth. When g > 1, the total data transferred using the naïve approach is considerably more than that of the proposed approach (n is usually on the order of thousands).

The proposed algorithm was implemented in ANSI C and MPI with utilization of the GNU Scientific Library (GSL) [

In order to investigate the benefits of the proposed parallel solution against its serial counterpart, four experiments were carried out in which sample size (n), test length (k), and number of iterations (l) varied as below:

n = 2000, k = 50, l = 10,000;
n = 5000, k = 50, l = 10,000;
n = 2000, k = 100, l = 10,000;
n = 2000, k = 50, l = 20,000.

In all these experiments, one (representing the serial algorithm) to nine processing nodes were used to implement the Gibbs sampler. Their performances were evaluated using four metrics in addition to the execution time. These metrics are the total overhead, relative speedup, relative efficiency, and cost:

The total overhead can be calculated as

T_o = P·T_P − T_S,

where P is the number of available processing nodes, T_S is the execution time of the fastest sequential algorithm, and T_P is the execution time of the parallel algorithm.

Relative speedup is the factor by which execution time is reduced on P processors, and it is defined as

S = T_S / T_P.

Efficiency describes how well the algorithm manages the computational resources. More specifically, it tells us how much time the processors spend executing important computations [

The cost of solving a problem on a parallel system is defined as the product of the parallel runtime and P. Consequently, cost reveals the sum of the individual processing nodes' runtimes.

Results from the four experiments are summarized in Figures 3-7. Note that the values plotted represent the average of ten runs. As expected, the execution time decreased as the number of processing nodes increased in all the experimented conditions (see

In terms of efficiency and cost, the algorithm performed better using two to five processing nodes (see Figures 4 and 5). When using up to seven nodes, the communication overhead (see

five or seven processing nodes were used (see

It is noted that the overhead increased as the number of processing nodes increased and reached its maximum with eight or nine processing nodes. This is because, in the parallel algorithm, the overhead of communication results from the nodes sending ψ and τ to the root and the root then broadcasting θ to the rest of the nodes in every iteration. Note that the total data transferred between all the nodes during execution is lP(2n + 1). Most of the idling occurs when the root waits to receive ψ and τ from all the slave nodes and when the slave nodes wait for the root node to calculate θ and broadcast it to them. The communication overhead increases more than the computation speedup once a certain number of processors is used (four to seven processors in the experiments performed). As a result, the speedup does not increase with increasing processor count, and consequently, the cost increases dramatically.

Furthermore, a close examination of

This study developed a high performance Gibbs sampling algorithm for the 2PNO IRT model with the purpose of achieving the lowest execution time possible using the available hardware (the Maxwell cluster). The algorithm was implemented using the ANSI C programming language and the Message Passing Interface. Experiments were performed to evaluate its performance with various data set sizes and iteration lengths. Results indicated that the parallel algorithm (for the given problem size) performed best, in terms of efficiency and cost, using two to five processing nodes. On the other hand, the algorithm had the smallest execution time when nine processing nodes were used.

The design of a parallel 2PNO IRT algorithm has proved to be justifiable. Given the high data dependencies in such problems, the solution initially appeared to be non-trivial. By using domain decomposition, we managed to avoid communication for the dependencies between states. Nevertheless, communication in every iteration of the Markov chain cannot be avoided because of the data dependencies within a state. By modifying the serial algorithm, we reduced the size of the data communicated in every iteration, making a speedup possible.

This study achieved parallelization through a columnwise decomposition and the use of all-to-one and one-to-all broadcast schemes. Further studies can be undertaken to increase the speedup and efficiency and to minimize the cost and total overhead. For example, the data may be decomposed differently, or an all-to-all broadcast scheme may be adopted to achieve a smaller communication overhead.

The pseudo code for updating the values of Z, ψ, τ, θ, α, and γ is shown below. First, Z is updated through the function update_Z. Then, update_PSI_TAU is called to update ψ and τ, and MPI_Gather is called to send ψ and τ to the root. The root receives ψ and τ, calls update_TH to update θ, and afterwards broadcasts θ by calling MPI_Bcast. Finally, α and γ are updated through a call to update_A_G. In order to reduce communication overhead, ψ and τ are sent in the same message. To achieve this, an array of size n + 1 is set up, where the first n entries consist of the elements of ψ and entry n + 1 consists of τ (the name of this array in the code is PSI_TAU_array).

// Start iteration:

for (m = 0; m < l; m++){

count++;

update_Z(Z, y, TH, A, G, r);

update_PSI_TAU(PSI_TAU_array, Z, A, G);

MPI_Gather (PSI_TAU_array, n+1, MPI_DOUBLE, PSI_TAU_rec, n+1, MPI_DOUBLE, ROOT, MPI_COMM_WORLD);

if (rank == ROOT){ double TAU_array[size]; int ind = 0;

// Retrieve PSI and TAU from PSI_TAU_rec:

for (j = 0; j < size; j++){

for (i = 0; i < n+1; i++){

if (i < n)

gsl_matrix_set(PSI_matrix, i, j, PSI_TAU_rec[ind++]);

else TAU_array[j] = PSI_TAU_rec[ind++];

}

}

update_TH (TH, THV, TAU_array, PSI_matrix, count, r);

// Transfer TH data into a buffer so that it can be broadcasted:

for (i = 0; i < n; i++){

TH_array[i] = gsl_vector_get(TH, i);

}

}

MPI_Bcast (TH_array, n, MPI_DOUBLE, ROOT, MPI_COMM_WORLD);

// Transfer TH received to a vector structure:

for (i = 0; i < n; i++){

gsl_vector_set (TH, i, TH_array[i]);

}

update_A_G(A, G, AV, GV, Z, TH, unif, count, r, p);

} // end iteration