Parallel Computing with a Bayesian Item Response Model

Item response theory (IRT) is a modern test theory that has been used in various aspects of educational and psychological measurement. The fully Bayesian approach shows promise for estimating IRT models, but it is computationally expensive, which limits its use in practical applications. It is hence important to seek ways to reduce the execution time. A suitable solution is the use of high performance computing. This study focuses on the fully Bayesian algorithm for a conventional IRT model so that it can be implemented on a high performance parallel machine. Empirical results suggest that the parallel version of the algorithm achieves a considerable speedup and thus reduces the execution time substantially.


Introduction
Item response theory (IRT) provides measurement models that describe a probabilistic relationship between correct responses on a set of test items and a latent trait. With many advantages (see [1]), it has been found useful in a wide variety of applications in education and psychology (e.g., [2][3][4]) as well as in other fields (e.g., [5][6][7][8][9][10]).
Parameter estimation offers the basis for the theoretical advantages of IRT and has been a major concern in the application of IRT models. While the influence of items and persons on the responses is modeled by distinct sets of parameters, simultaneous estimation of these parameters in IRT models results in statistical complexities, which have made the estimation procedure a primary focus of psychometric research over decades [11][12][13][14]. Recently, because of the availability of high-performance computing technology, attention has turned to fully Bayesian estimation procedures, which offer a number of advantages over traditional methods (see, e.g., [15,16]). Albert [17] applied Gibbs sampling [18], one of the most efficient Markov chain Monte Carlo (MCMC; [19,20]) algorithms, to the two-parameter normal ogive (2PNO) model [21]. Since a large number of iterations are needed for the Markov chain to reach convergence, the algorithm is computationally intensive and requires a considerable amount of execution time, especially with large datasets (see [22]). Hence, achieving a speedup, and thus reducing the execution time, will make it more practical for researchers and practitioners to implement IRT models using Gibbs sampling.
High performance computing (HPC) employs supercomputers and computer clusters to tackle problems with complex computations. HPC utilizes the concept of parallel computing to run programs in parallel and achieve a smaller execution time, part of which is communication time, which is affected by the size of the messages being communicated between computers. With parallel computing, many large-scale applications and algorithms utilize the Message Passing Interface (MPI) standard to achieve better performance. The MPI standard is an application programming interface (API) that abstracts the details of the underlying architecture and network. Some examples of applications that use MPI are crash simulation codes, weather simulation codes, and computational fluid dynamics codes [23], to name a few.
In view of the above, parallel computing can potentially help reduce the time for implementing MCMC with the 2PNO IRT model, and as the size of the data and/or the chain increases, the benefit of using parallel computing would increase. However, parallel computing is known to excel at tasks that rely on the processing of discrete units of data that are not interdependent. Given the high data dependencies in a single Markov chain for IRT models, such as the dependency of one state of the chain on the previous state and the dependencies among the data within the same state, the implementation of parallel computing is not straightforward. The purpose of this study is hence to overcome this problem and develop a high performance Gibbs sampling algorithm for the 2PNO IRT model using parallel computing. This paper focuses on all-to-one and one-to-all broadcast operations. The aim is to achieve a high speedup while keeping the cost down, where the cost of solving a problem on a parallel system is defined as the product of the parallel runtime and the number of processing elements used.
The remainder of the paper is organized as follows. Section 2 reviews the 2PNO IRT model and the Gibbs sampling procedure developed by Albert [17]. Section 3 illustrates the approach taken in this study to parallelize the serial algorithm. In Section 4, the performance of the developed parallel algorithm is investigated and further compared with that of the serial implementation. Finally, a few summary remarks are given in Section 5.

Model and the Gibbs Sampler
The probability of person i obtaining a correct response for item j can be defined as

P(y_ij = 1 | θ_i, α_j, γ_j) = Φ(α_j θ_i − γ_j),   (1)

where α_j and γ_j denote the item parameters, θ_i denotes the continuous person trait parameter, and Φ denotes the unit normal cdf.


The Gibbs sampler involves updating three sets of parameters in each iteration, namely, an augmented continuous variable Z_ij (which is positive if y_ij = 1 and negative if y_ij = 0), the person parameter θ_i, and the item parameter vector ξ_j = (α_j, γ_j)', whose full conditional distributions are

Z_ij | θ_i, ξ_j, y_ij ~ N(α_j θ_i − γ_j, 1), truncated at 0 from the left if y_ij = 1 and from the right if y_ij = 0,   (2)

θ_i | Z, ξ ~ N( Σ_j α_j(Z_ij + γ_j) / (1 + Σ_j α_j²), 1 / (1 + Σ_j α_j²) ),   (3)

ξ_j | Z, θ ~ N( (x'x)⁻¹ x'Z_j, (x'x)⁻¹ ), with α_j > 0,   (4)

where x = [θ, −1] and a noninformative prior p(ξ_j) ∝ 1 is assumed (see [17,22]).
Hence, with starting values θ⁽⁰⁾ and ξ⁽⁰⁾, Markov chains can be simulated from the Gibbs sampler by iteratively drawing from the full conditional distributions specified in Equations (2), (3) and (4). To go from (θ⁽ᵏ⁻¹⁾, ξ⁽ᵏ⁻¹⁾) to (θ⁽ᵏ⁾, ξ⁽ᵏ⁾), it takes three transition steps: first, draw Z⁽ᵏ⁾ given θ⁽ᵏ⁻¹⁾ and ξ⁽ᵏ⁻¹⁾; second, draw θ⁽ᵏ⁾ given Z⁽ᵏ⁾ and ξ⁽ᵏ⁻¹⁾; and third, draw ξ⁽ᵏ⁾ given Z⁽ᵏ⁾ and θ⁽ᵏ⁾. This iterative procedure produces a sequence of (θ⁽ᵏ⁾, ξ⁽ᵏ⁾), k = 1, ..., l. To reduce the effect of the starting values, early iterations in the Markov chain are treated as burn-in and discarded. Samples from the remaining iterations are then used to summarize the posterior densities of the item parameters ξ and ability parameters θ.
The algorithm takes less than 13 minutes for a 2000-by-10 dichotomous (0-1) data matrix and 10,000 total iterations when implemented in Fortran using the Microsoft Powerstation 4.0 compiler and the IMSL Fortran numerical library [22]. For a longer chain with 50,000 iterations, it takes about 60-90 minutes per execution. With every execution taking more than 12 minutes on a single computer, using this algorithm with large datasets is computationally expensive. This further limits the use of IRT models under the fully Bayesian framework in various applications.

Methodology
The study was performed using the Maxwell Linux cluster, a cluster with 106 processing nodes. Maxwell uses the message passing model via the MPICH implementation of the MPI standard. One of the 106 nodes acted as the root node, while the rest acted as slave nodes. The root node was responsible for generating and partitioning the matrix y, transmitting the submatrices, updating and broadcasting θ, and recording the execution time, in addition to the same duties as the slave nodes. Each node on the cluster has two quad-core Intel Xeon CPUs clocked at 2.3 GHz, 8 GB of RAM, 90 TB of storage, and a 64-bit Linux operating system. MPICH allows the user to choose how many nodes to use before the execution of a program, so that a different number of processing nodes may be used in each execution.

Parallelism with the Gibbs Sampler
When decomposing a problem for parallel computation, the first decomposition method considered is domain decomposition, in which the data associated with the problem are partitioned and a set of computations is assigned to each partition [24]. Domain decomposition is a great fit for the 2PNO IRT algorithm since the input and intermediate data can easily be partitioned, as illustrated in Figures 1 and 2.
With this approach, the first processing node, P_0, receives a submatrix, y_{P_0}, of size n × g that corresponds to the elements of the y matrix from y_{0,0} to y_{n−1,g−1}, where g = k/P and P is the number of processing nodes. The second processing node, P_1, receives a submatrix of y, y_{P_1}, of size n × g that corresponds to the elements of the y matrix from y_{0,g} to y_{n−1,2g−1}, and so forth. Consequently, each processing node updates the Gibbs samples as in the serial algorithm, but with a smaller input data set. That is, instead of operating on the whole input matrix y, each node operates on a part of it of size n × g.
Decompositions of Z, α, and γ are depicted in Figure 2, where we see that each processor updates a block of Z from Equation (2) and blocks of α and γ from Equation (4), where j = 1, ..., g. For instance, P_0 updates a block of Z, Z_{P_0}, from Z_{0,0} to Z_{n−1,g−1}, a block of α, α_{P_0}, from α_0 to α_{g−1}, and a block of γ, γ_{P_0}, from γ_0 to γ_{g−1}.
Since θ is of size n × 1 (a column vector), it is not decomposed. However, a problem arises with the update of θ. For simplicity, consider the update of the first element of θ, which requires the updated α, γ, and the first row of Z. Yet, any given processing node has only a part of α, γ, and the first row of Z. The solution is to assign one of the processing nodes (e.g., the root) to update θ and broadcast it to the rest of the units. The naïve approach would be to have all the units send their parts of Z, α, and γ to the root so that it has the complete Z, α, and γ to update θ from Equation (3), and then broadcast θ to the rest of the nodes. A problem with this approach is that the data communicated are too large, which causes the parallel algorithm to take a longer execution time than the serial algorithm.
A better approach is one that minimizes the communication cost. This can be achieved by having every node compute the partial sums ψ_i = Σ_j α_j(Z_ij + γ_j) and τ = Σ_j α_j² over its own g items and send ψ = (ψ_1, ..., ψ_n)' and τ to the root for it to update θ_i from

θ_i | Z, ξ ~ N( Σψ_i / (1 + Στ), 1 / (1 + Στ) ),

where each sum runs over the P partial results. This way, each processing node sends a vector of size n + 1 to the root, and one message of size n is broadcast by the root. The total data transferred between all the nodes over l iterations by this approach is lP(2n + 1). As a comparison, the total data transferred between all the nodes by the naïve approach is lP(g(n + 2) + n), which equals lP(2n + 2) when g = 1, lP(3n + 4) when g = 2, and so forth. When g > 1, the total data transferred using the naïve approach is considerably more than that of the proposed approach (n is usually on the order of thousands).

Implementation
The proposed algorithm was implemented in ANSI C and MPI with utilization of the GNU Scientific Library (GSL) [25]. To achieve the parallel computation as illustrated in the previous section, the MPI_Gather and MPI_Bcast routines were used for collective communications. See the Appendix for part of the source code of the parallel algorithm in updating the model parameters.

Performance Analyses
In order to investigate the benefits of the proposed parallel solution against its serial counterpart, four experiments were carried out in which the sample size (n), test length (k), and number of iterations (l) varied as below:
• n = 2000, k = 50, l = 10,000;
• n = 5000, k = 50, l = 10,000;
• n = 2000, k = 100, l = 10,000;
• n = 2000, k = 50, l = 20,000.
In all these experiments, one (representing the serial algorithm) to nine processing nodes were used to implement the Gibbs sampler. Their performances were evaluated using four metrics in addition to the execution time: the total overhead, relative speedup, relative efficiency, and cost.
• The total overhead can be calculated as T_o = P T_P − T_S, where P is the number of available processing nodes, T_S is the fastest sequential algorithm execution time, and T_P is the parallel algorithm execution time.
• Relative speedup is the factor by which execution time is reduced on P processors, and it is defined as S = T_S / T_P.
• Efficiency describes how well the algorithm manages the computational resources; more specifically, it tells us how much time the processors spend executing important computations [24]. Relative efficiency is defined as E = S / P = T_S / (P T_P).
• The cost of solving a problem on a parallel system is the product of the parallel runtime and P, that is, C = P T_P. Consequently, cost is a quantity that reveals the sum of the individual processing nodes' runtimes.

Results and Discussion
Results from the four experiments are summarized in Figures 3-7. Note that the values plotted represent the average of ten runs. As expected, the execution time decreased as the number of processing nodes increased in all the experimental conditions (see Figure 3). In terms of efficiency and cost, the algorithm performed better using two to five processing nodes (see Figures 4 and 5). When using up to seven nodes, the communication overhead (see Figure 6) is sufficiently low that it does not affect the overall speedup (see Figure 7). The algorithm had the smallest execution time when five or seven processing nodes were used (see Figure 3). When nine processing nodes were used, the communication overhead reached its highest, which caused a relatively higher total execution time and lower speedup.
It is noted that the overhead increased as the number of processing nodes increased, reaching its maximum with eight or nine processing nodes. This is because, in the parallel algorithm, the communication overhead results from the nodes sending ψ and τ to the root and the root then broadcasting θ to the rest of the nodes in every iteration. Note that the total data transferred between all the nodes during execution is lP(2n + 1). Most of the idling occurs when the root waits to receive ψ and τ from all the slave nodes and when the slave nodes wait for the root node to calculate θ and broadcast it to them. The communication overhead increases more than the computation speedup once more than a certain number of processors are used (four to seven processors in the experiments performed). As a result, the speedup does not increase with increasing processor count, and consequently, the cost increases dramatically.
Furthermore, a close examination of Figure 7 indicates that the experiments with input matrix sizes 2000 × 50, 5000 × 50, and 2000 × 50 (with number of iterations l = 20,000) follow identical paths. The common input value of these experiments is the number of items, k. The plot for the experiment with input matrix size 2000 × 100 shows that the algorithm maintains a higher speedup compared with the other experiments. Even though the experiment with input matrix size 2000 × 50 has a smaller input size, the experiment with input matrix 2000 × 100 maintains a higher speedup over all the processors. The same pattern is observed in Figure 4: the plot for the experiment with input matrix size 2000 × 100 shows that the algorithm maintains a higher efficiency compared with the other experiments, where k = 50. This is because the size of the messages communicated in every iteration, from the slave nodes to the root and from the root to the slave nodes, depends only on n. As k increases, the message size and communication overhead remain unaffected, so the processors spend more time performing computations and hence the efficiency and speedup increase.

Conclusions
This study developed a high performance Gibbs sampling algorithm for the 2PNO IRT model with the purpose of achieving the lowest execution time possible using the available hardware (the Maxwell cluster). The algorithm was implemented using the ANSI C programming language and the Message Passing Interface. Experiments were performed to evaluate its performance with various dataset sizes and iteration lengths. Results indicated that the parallel algorithm (for the given problem sizes) performed better, in terms of efficiency and cost, using two to five processing nodes, while the smallest execution time was achieved when five or seven processing nodes were used.
The design of a parallel 2PNO IRT algorithm has proved to be justifiable. Given the high data dependencies in such problems, the solution initially appeared to be non-trivial. By using domain decomposition, we managed to avoid communication for the state-to-state dependencies. Nevertheless, communication in every iteration of the Markov chain cannot be avoided because of the data dependencies within each state. By modifying the serial algorithm, the size of the data communicated in every iteration was reduced enough to make a speedup possible.
This study achieved parallelization through a column-wise decomposition and the use of all-to-one and one-to-all broadcast schemes. Further studies can be undertaken to increase the speedup and efficiency and to minimize the cost and total overhead. For example, the data may be decomposed differently, or an all-to-all broadcast scheme may be adopted, in order to achieve a smaller communication overhead.

Figure 1. The input y matrix mapped for five processing units.

Figure 2. The Z matrix and the α and γ vectors mapped for five processing units.

Figure 3. Execution time of the algorithm using one through nine processors in all the experiments.

Figure 4. Relative efficiency of the parallel algorithm over the serial algorithm in all the experiments.

Figure 5. Cost of the algorithm using one through nine processors in all the experiments.

Figure 6. Total overhead of the parallel algorithm over the serial algorithm in all the experiments.

Figure 7. Relative speedup of the parallel algorithm over the serial algorithm in all the experiments.