Performance Prediction Based on Statistics of Sparse Matrix-Vector Multiplication on GPUs *

As one of the most essential and important operations in linear algebra, the performance prediction of sparse matrix-vector multiplication (SpMV) on GPUs has got more and more attention in recent years. In 2012, Guo and Wang put forward a new idea to predict the performance of SpMV on GPUs. However, they didn’t consider the matrix structure completely, so the execution time predicted by their model tends to be inaccurate for general sparse matrix. To address this problem, we proposed two new similar models, which take into account the structure of the matrices and make the performance prediction model more accurate. In addition, we predict the execution time of SpMV for CSR-V, CSR-S, ELL and JAD sparse matrix storage formats by the new models on the CUDA platform. Our experimental results show that the accuracy of prediction by our models is 1.69 times better than Guo and Wang’s model on average for most general matrices.


Introduction
Sparse matrix-vector multiplication (SpMV) is an essential operation in solving linear systems and eigenvalue problems.For many iterative methods, the fraction of the execution time of SpMV may be more than 80% in the total time, so the study of its performance has attracted a lot of attention.Right now, the GPU has been from a graphics accelerator to a computing device with a broad spectrum of purposes, due to the characteristics of the multi-thread, high memory bandwidth.It can solve massively parallel problems and obtain very high performance.However, how to predict the execution time of SpMV on GPUs accurately is still a big challenge.
In order to improve the performance of SpMV on GPUs, Vazquez et al. [4] presented a new sparse storage format, termed ELLR-T, which is improved from ELL format, to achieve higher performance.Monakov et al. [5] put forward a sliced ELL format and used auto-tuning to find the optimal configuration for batter performance.Zheng and Gu [6] proposed bisection ELL (BiELL) and bisection JAD (BiJAD) format based on ELL and JAD format for optimizing SpMV on GPUs.Especially for irregular matrices, BiELL and BiJAD format can get greater load balance and higher performance by adjusting the elements storage location of matrix and make the zero-padding less.Meanwhile, they realized CG and GMRES method with BiELL and BiJAD format on GPU [7].Choi et al. [8] designed blocked ELL format and select proper parameters for better performance.Guo et al. [9] proposed an auto-tuning framework that can select the parameters of SpMV kernels to obtain the optimal performance on GPUs.
Besides studying how to improve the performance of SpMV on the GPUs, there are also many performance models focusing on performance prediction.Resios [10] proposed a parameterized analytical model to estimate execution time and identify potential bottlenecks in programs.Dinkins [11] put forward a model for predicting SpMV performance using memory bandwidth requirements and data locality.Werkhoven et al. [12] gave an analytical performance model that includes PCIe transfers and overlapping computation and communication to predict the execution time for CPU-GPU data transfers.Baghsorkhi et al. [13] presented a model to predict the performance of GPU applications based on the string and work flow graph.Guo et al. [9] showed a performance modeling and optimizing analysis to predict and optimize SpMV performance on GPUs.A simple analytical GPU model to predict the execution time of massively parallel programs was given by Hong et al. [14].Schaa et al. [15] presented a model to accurately estimate the execution time of GPU applications by varying the configurations.In a word, most of the performance prediction researches analyze and predict the execution time from the perspective of the machine itself, which focus on the physical properties and parameters of GPUs.There has few performance prediction models were established from the view of mathematics.
In this paper, we present two new improved models based on [16] and get better prediction accuracy.Our models mainly consist of two phases: generating parameters and fitting for prediction.In the first phase, some benchmark ma-trices will be generated according to a GPU's architecture features and four different sparse matrix storage formats, then SpMV with these benchmark matrices are implemented on the GPU to obtain the execution times.We will establish two parametric models according to the results of the benchmark matrices and predict the execution time of the SpMV kernels with a given target matrix on the GPU by our models in second phase.
The performance prediction models are essentially at the statistical point of view to predict the execution time of different SpMV kernels on GPUs.Firstly, the execution time of the benchmark matrices with different parameters is required, and then fitting the prediction functions according to the execution time of the benchmark matrices and two parameters.Finally, the estimated execution time of a target matrix will be got after putting two parameters into the prediction functions.
Compared with [16]'s model, the important difference of this paper is generating benchmark matrices.In [16], the number of non-zero elements per row ( NZ P ) of the benchmark matrices is set to a fixed value.While in many practical problems, such as in computer science or mathematics, there are many matrices are in irregular structure, which NZ P is different largely.So if NZ P of bench- mark matrices generated is fixed, the performance prediction model will go awry to some extent and then the accuracy of prediction will be decreased.Keeping this in mind that we generated two kinds benchmark matrices whose NZ P in the uniform distribution or normal distribution, respectively, and then establish its own performance prediction model.In the numerical experiments, we used our model to estimate the execution time of different SpMV kernels with 30 matrices, these matrices are from the University of Florida Sparse matrix collection [17].The experimental results show that the average prediction error of our models are two to three times lower than [16] at least, some matrix even tens of times lower.
The remainder of this paper is organized as follows: Section 2 gives some preliminaries and Section 3 shows the details of the performance prediction model.
Experimental results and analyses are reported in Section 4. Finally, some conclusions and future works are stated in Section 5.

Preliminaries
Firstly, we state in brief the GPU architecture and CUDA (Compute Unified Device Architecture) programming model.Traditionally, GPUs have been especially designed to handle the computation for computer graphics in real-time.
Today, they are increasingly being exploited as general-purpose attached processor to speed-up computations in image processing, physical simulations, data mining, linear algebra, etc. [3].It is suitable for processing massively parallel tasks, which have high density and simple branching logic.CUDA introduced by NVIDIA is similar in style to a single program multiple data (SIMD) software model [18].CUDA programs on the host (CPU) invoke a kernel which runs on the device (GPU).All threads within a block are executed concurrently on a ar-chitecture [14].In addition, when a multiprocessor is given one or more thread blocks to execute, it partitions them into groups of 32 parallel threads termed warp.
Four sparse matrix storage formats used in our model are described below.
The CSR is probably the most popular format for storing general sparse matrices [19].To parallelize the SpMV for a matrix in the CSR format, a simple scheme called CSR-S (CSR scalar) kernel [2] is to assign each thread by one row.The main drawback of this scheme is that the pattern of memory access is un-coalesced, so it shows not very efficient.A better approach, termed CSR-V (CSR vector) is proposed in [2] and modified in [3] to realize the memory access contiguously.[2] assign a warp (32 threads) to each row, while [3] assign each row a half-warp (16 threads).In this approach, since all threads within a warp or a half-warp access non-zero elements of one row, these accesses are more likely to belong to the same memory segment, so the chance of coalescing should be higher.In addition, CSR-V kernel in [3] will be used in our model.ELL format is efficient if the maximum number of non-zeros per row is not substantially different from the average.However, when the number of non-zeros varies largely between rows, excess padded zeros decrease the performance of SpMV.The JAD can be viewed as a generalization of ELL format which removes the assumption on the fixed-length rows [3].Compared to the ELL format, there are fewer zero-padding in the JAD format, so the performance of JAD will be better than ELL when implement SpMV on GPUs.Other details of these sparse matrix storage formats can be consulted in [19].

The Performance Prediction Model for SpMV
The work-flow of our model is similar to [16], which contains two phases.The main difference is the criteria for generating the benchmark matrices, which we described in follows.

Phase One: The Establishment of Fitting Function for Performance Prediction
Firstly, we give the definition of matrix strip.The strip of a matrix is a maximum sub-matrix that can be handled by a GPU with a full load of thread blocks within one iteration.Let SM N be the number of streaming multiprocessors for a NVIDIA GPU, HW N be the number of half-warps per multiprocessor and T N denote the number of threads per multiprocessor.Then, the size of strip for CSR-V, CSR-S, ELL and JAD format can be computed as follows: Secondly, we state the criteria for generating benchmark matrices.• The number of rows ( R ): where I is a positive integer, S is the size of strip defined for four formats (in different sub-index) as above.
• The number of non-zero elements per row ( NZ P ): In [16]'s model, each row has the same NZ P for the benchmark matrices.
However, NZ P will not be same exactly for a target matrix in practical situa- tions.So we combine with the characteristics of the matrix structure and let NZ P to meet a certain distribution.Two distributions will be adopted: normal and uniform.In addition, the mean of the distribution will be treated as NZ P .The benchmark matrices meet two kinds of distribution can be generated by Matlab.
In addition, we assume that then on-zero elements are in single precision (float).
• The number of columns ( C ): For the sake of simplicity, the benchmark matrices generated in our numerical experiments will be square.Obviously, it should be assumed that Thirdly, we set parameters of benchmark matrices.
In order to get more accurate fitting functions in our models, a series of benchmark matrices will be generated according to the above criteria.
Because of out of memory occurred in Matlab when P is too large, so we make 3072 R as the maximum value, but it does not affect the accuracy of the performance prediction.T is same as that of [16].
( ) ( ) ( ) ( ) where M × denotes a benchmark matrix of dimension R C × ; C V is a ran- dom vector of length C ; α and β are the number of executions and α β < .
φ is the execution time for each time the benchmark matrix be executed.For a target matrix with R N rows and NZ N non-zero elements, the number of strips I and the number of non-zero elements per row NZ P with four formats can be compute as follows: Let D be the set consisting of the number of non-zero elements in each row of the target matrix.Then NZ P is set to be mode of D for CSR-V matrix, while it is the maximum value of D for CSR-S, ELL, JAD matrices.

Phase Two: Prediction Based on the Performance Model
According to the statistics methods, we fit the performance function of SpMV for different storage formats, which based on three parameters of benchmark matrices: I , NZ P and B T .After the performance function obtained, we can estimate the execution time of SpMV for a target matrix T T by substituting two parameters I and NZ P of the target matrix into it.

CSR-V Format
After a large amount of experiments and fitting, we found that for CSR-V matrices, the relationship between B T and NZ P is different when NZ P is smaller or larger than the number of maximum threads per block (1024 for GeForce GTX 540 M).Therefore, the performance fitting function is obtained by the following method.

NZ T P
For the benchmark matrices with the same number of strips, we establish the relationship between NZ P and the execution time B T for SpMV.The fitting functions of two distributions for the number of strips 40 are shown in Figure 1.
As can be seen, the relationship between NZ P and B T is approximately linear, so we make ( ) • Establish the function ( )

E I
For the benchmark matrices with same NZ P , we establish the relationship between I and the execution time E (i.e.B T in the above) of the bench- mark matrices for SpMV.The fitting functions of two distributions for 64

NZ P =
and 2048 are shown in Figure 2. The relation between I and E is also approximately linear, so we make ( ) • Estimate the execution time of a target matrix For a target matrix, we need to calculate two parameters according to the Equation ( 6) and D : the number of non-zero elements per row 0 P and the number of strips 0 I , then derive ( ) 0 T P and ( ) 0 E I from above functions, respectively.In order to combat the effects of the difference about functions when the number of non-zero elements per row is smaller or larger than 1024, another execution time 0 t of any previously tested benchmark matrix whose NZ P is set to be the number of non-zero elements per row in ( ) E I .At this point, estimated execution time of the target matrix in CSR-V format is ( ) ( )  ).

CSR-S Format
• Establish the function ( )

NZ T P
For the benchmark matrices with the same I , we establish the relationship between NZ P and B T for SpMV.The fitting functions of two distributions for 5 I = are shown in Figure 3.As showed in the Figure, the relationship between NZ P and B T is approximately linear, so we make ( ) ( ) ( ) ( 1 I can be any arbitrary value within the range of I ).
• Establish the function ( )  1 P , which can be any arbitrary value within the range defined NZ P .The fitting functions of two distributions for 32 NZ P = are shown in Figure 5.We can see that the relationship between I and E is approximately linear, so we make ( ) ( ) ( ) , and the intercept function ( ) ( ) ( ) • Estimate the execution time of a target matrix Given a target matrix, we need to calculate two parameters according to the Equation ( 6) and D : the number of non-zero elements per row 0 P and the number of strips 0 I , then derive ( ) 0 f I and ( ) 0 g I from above functions, respectively.At this moment, estimated execution time of the target matrix in CSR-S format is ( ) ( ) ( ) After getting the performance function of CSR-S format, we find that the relationship between dependent variables B T and two variables NZ P , I is saddle surface in the functional image, that is to say, when I is fixed, the relationship between B T and NZ P is linearity and vice versa, which coincided with the 3D fitting image we get in Matlab, as shown in Figure 6, which the darker of the colors means the smaller of the values.

ELL and JAD Formats
Note that, the granularity of ELL and JAD format is the same as CSR-S format, which assigns one thread to each row to implement SpMV on GPUs.Therefore, fitting the performance function of ELL and JAD format is done in a similar way with CSR-S format, except the functional expressions.In addition, the 3D images obtained by the Matlab can be fitted with the performance functions and need not be repeated here.

Experimental Results and Analyses
The experiments are performed on NVIDIA GeForce GTX 540 M with 1 GB global memory, the operating system is a 64-bit Linux with CUDA 6.5 driver.
We evaluated our performance prediction model on 30 matrices with each sparse matrix storage format, respectively.These matrices are square real matrices from the University of Florida Sparse matrix collection [17].Moreover,  in order to compare with [16]'s model, we also implement the model in [16] whose NZ P is fixed in benchmark matrices.The estimated time and the perfor- mance difference rate of the 30 target matrices in four sparse matrix storage for-mats with three models are given, which is convenient to compare with each other.
We define the performance difference rate for different model as For CSR-V format, the performance difference rate of SpMV in three models of 30 matrices is shown in Figure 7.We can see that r D in [16]'s model is When implement SpMV on GPU with CSR-S format, the performance difference rate in three models of 30 matrices is shown in Figure 8.The average r D in [16]'s model, our uniform and normal model is 5.44%, 3.21% and 3.52%, respectively.The average factor for the prediction accuracy of uniform model higher than that of [16]'s model is 1.69, while for normal model, it is 1.54.
In addition, the prediction accuracy of [16]'s model is higher than that of normal model on 15 matrices (such as bcsstk16), while for uniform model the better number is 12 (such as bips98_1142).The better number for normal model vs. uniform model is 13 (such as bayer09) in 30 cases.
Figure 7.The performance difference rate of SpMV on CSR-V matrices.
The similar results for ELL matrices are given in Figure 9.It says the average r D of three models are 5.79%, 3.53% and 3.26%, respectively.The average

Conclusion and Future Work
Aiming at the better performance model of SpMV on GPU based on statistics and [16], we have presented two new models, which consider the structure of matrices.We predict four SpMV CUDA kernels: CSR-V, CSR-S, ELL and JAD.
The numerical result shows that the prediction accuracy of our models is higher than that of [16]'s model.In the future, we will extend our performance prediction model to other SpMV with different storage formats on different kinds of GPUs.In addition, we will propose a new performance model to predict the execution time of a class of iterative methods on heterogeneous parallel machines.
A benchmark matrix is only determined by R and NZ P .Since R S I = × , where S is fixed for a certain sparse matrix format, we just need change the value of I to get different benchmark matrices.Due to NZ P in the benchmark matrices fol- lows two kinds of distributions, so it is determined by the mean of each distribution based on the distribution density P .Then combine the value of I and P , we can obtain a benchmark matrix.• The number of strips ( I ):  CSR-V: Let 1, 2, 3, , 9,10,15, 20, 25, , 45, 50 I = In our experimental platform, the size of matrix strip for CSR-V format is fewer than other formats, in order to predict the performance accurately, we increase the value of I to 50. CSR-S, ELL, JAD: Let 1, 2, 3, , 9,10 I = • The distribution density ( P ): formula for calculating average the execution time of benchmark matrices B

Figure 2 .
Figure 2. The fitting functions of two distributions for

ForForFigure 3 .
Figure 3.The fitting functions of two distributions for 5 I = .

Figure 4 .
Figure 4.The fitting functions for I versus ( ) f I .

Figure 5 .
Figure 5.The fitting functions of two distributions for

Figure 6 .
Figure 6.The 3D fitting images of two distributions for B T .

Figure 8 .
Figure 8.The performance difference rate of SpMV on CSR-S matrices.

Figure 9 .
Figure 9.The performance difference rate of SpMV on ELL matrices.

Figure 10 give
Figure10give the results for JAD matrices.r D of three models are 5.97%, 3.94% and 4.28% on average, respectively.The average improved factor for normal and uniform model over[16]'s model are 1.46 and 1.35.The better number for [16]'s model vs. normal model is 9:21, while for uniform model it is 11:19, while for normal model vs. uniform model it is 16:14.The execution time of four SpMV kernels in three model on 30 matrices is shown in Figures 11-14.There is large difference in the execution time for all matrices in different storage formats.So we put the execution time into two figures: the shorter and the longer in (a) and (b), respectively.Almost all of the estimated time of the matrices in four different storage formats is greater than the actual measured time.The possible reasons are that we take the number of strips I by rounding up to an integer.

Figure 10 .
Figure 10.The performance difference rate of SpMV on JAD matrices.

Figure 11 .
Figure 11.The comparison of estimated and measured time on CSR-V matrices.

Figure 12 .
Figure 12.The comparison of estimated and measured time on CSR-S matrices.

Figure 13 .
Figure 13.The comparison of estimated and measured time on ELL matrices.

Figure 14 .
Figure 14.The comparison of estimated and measured time on JAD matrices.