
As one of the most essential operations in linear algebra, sparse matrix-vector multiplication (SpMV) on GPUs has received increasing attention in recent years, and so has the prediction of its performance. In 2012, Guo and Wang put forward a new idea to predict the performance of SpMV on GPUs. However, their model does not fully account for matrix structure, so the execution times it predicts tend to be inaccurate for general sparse matrices. To address this problem, we propose two new similar models that take the structure of the matrices into account and make the performance prediction more accurate. In addition, we use the new models to predict the execution time of SpMV for the CSR-V, CSR-S, ELL and JAD sparse matrix storage formats on the CUDA platform. Our experimental results show that the accuracy of our models is, on average, 1.69 times better than that of Guo and Wang's model for most general matrices.

Sparse matrix-vector multiplication (SpMV) is an essential operation in solving linear systems and eigenvalue problems. In many iterative methods, SpMV can account for more than 80% of the total execution time, so its performance has attracted a great deal of study. The GPU has evolved from a graphics accelerator into a computing device with a broad spectrum of purposes, thanks to its massive multi-threading and high memory bandwidth; it can solve massively parallel problems and attain very high performance. However, predicting the execution time of SpMV on GPUs accurately remains a major challenge.

In 2003, Bolz et al. [

In order to improve the performance of SpMV on GPUs, Vazquez et al. [

Besides studying how to improve the performance of SpMV on GPUs, many performance models focus on performance prediction. Resios [

In this paper, we present two new improved models based on [

The performance prediction models essentially take a statistical point of view to predict the execution time of different SpMV kernels on GPUs. First, the execution times of benchmark matrices with different parameters are measured; then prediction functions are fitted to these execution times and the two parameters. Finally, the estimated execution time of a target matrix is obtained by substituting its two parameters into the prediction functions.
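The two-phase workflow above can be sketched as follows. The quadratic fit, the parameter values and the timings are illustrative assumptions only; the paper derives format-specific functional forms from its own benchmark measurements.

```python
import numpy as np

def fit_prediction_function(params, times, degree=2):
    """Fit a polynomial mapping one benchmark parameter (e.g. P_NZ or I)
    to measured execution time. The polynomial degree is an assumption;
    the actual functional forms are determined by fitting per format."""
    return np.poly1d(np.polyfit(params, times, degree))

# Benchmark phase: hypothetical measured times (ms) for benchmark
# matrices that vary only in the number of non-zeros per row.
pnz_values = np.array([4, 8, 16, 32, 64, 128], dtype=float)
measured   = np.array([0.11, 0.13, 0.18, 0.29, 0.52, 0.98])

T = fit_prediction_function(pnz_values, measured)

# Prediction phase: substitute the target matrix's parameter into the
# fitted function to estimate its SpMV execution time.
estimated_time = T(48.0)
```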

Compared with [

The remainder of this paper is organized as follows: Section 2 gives some preliminaries and Section 3 shows the details of the performance prediction model. Experimental results and analyses are reported in Section 4. Finally, some conclusions and future works are stated in Section 5.

Firstly, we state briefly the GPU architecture and the CUDA (Compute Unified Device Architecture) programming model. Traditionally, GPUs were designed especially to handle real-time computation for computer graphics. Today, they are increasingly being exploited as general-purpose attached processors to speed up computations in image processing, physical simulations, data mining, linear algebra, etc. [

Four sparse matrix storage formats used in our model are described below. CSR is probably the most popular format for storing general sparse matrices [
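As a concrete illustration of the CSR layout, the following minimal sketch builds the three CSR arrays for a small dense matrix (the matrix values here are arbitrary examples):

```python
import numpy as np

# A small dense matrix used only to illustrate the CSR layout.
A = np.array([[5, 0, 0, 1],
              [0, 8, 0, 0],
              [0, 0, 3, 6]], dtype=float)

# CSR stores three arrays: the non-zero values, their column indices,
# and row pointers marking where each row starts in the other arrays.
values, col_idx, row_ptr = [], [], [0]
for row in A:
    for j, v in enumerate(row):
        if v != 0:
            values.append(v)
            col_idx.append(j)
    row_ptr.append(len(values))
```

The difference `row_ptr[i+1] - row_ptr[i]` gives the number of non-zeros in row `i`, which is the per-row count used later when choosing $P_{NZ}$ for a target matrix.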

The work-flow of our model is similar to [

Firstly, we give the definition of a matrix strip. A strip of a matrix is a maximal sub-matrix that can be handled by a GPU with a full load of thread blocks within one iteration. Let $N_{SM}$ be the number of streaming multiprocessors of an NVIDIA GPU, $N_{HW}$ the number of half-warps per multiprocessor, and $N_T$ the number of threads per multiprocessor. Then the strip sizes for the CSR-V, CSR-S, ELL and JAD formats can be computed as follows:

$S_{\text{CSR-V}} = N_{SM} \times N_{HW}$ (1)

$S_{\text{CSR-S}} = N_{SM} \times N_T$ (2)

$S_{\text{ELL}} = S_{\text{JAD}} = S_{\text{CSR-S}}$ (3)
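Equations (1)-(3) can be sketched directly in code. The device parameters passed in below are hypothetical illustrative values, not measured specifications of any particular GPU:

```python
def strip_sizes(n_sm, n_hw, n_t):
    """Strip sizes per storage format, following Equations (1)-(3):
    CSR-V works at half-warp-per-row granularity, the other formats
    at one thread per row."""
    s_csr_v = n_sm * n_hw          # Equation (1)
    s_csr_s = n_sm * n_t           # Equation (2)
    return {"CSR-V": s_csr_v, "CSR-S": s_csr_s,
            "ELL": s_csr_s, "JAD": s_csr_s}   # Equation (3)

# Hypothetical device parameters (illustrative only).
sizes = strip_sizes(n_sm=2, n_hw=96, n_t=1536)
```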

Secondly, we state the criteria for generating benchmark matrices.

・ The number of rows ($R$):

$R = S \times I$ (4)

where $I$ is a positive integer and $S$ is the strip size of one of the four formats (with the corresponding subscript) as defined above.

・ The number of non-zero elements per row ($P_{NZ}$):

In [

・ The number of columns ($C$):

For the sake of simplicity, the benchmark matrices generated in our numerical experiments will be square. Obviously, it should hold that $C > P_{NZ}$.

Thirdly, we set parameters of benchmark matrices.

In order to obtain more accurate fitting functions in our models, a series of benchmark matrices is generated according to the above criteria. A benchmark matrix is determined entirely by $R$ and $P_{NZ}$. Since $R = S \times I$, where $S$ is fixed for a given sparse matrix format, we only need to change the value of $I$ to obtain different benchmark matrices. $P_{NZ}$ in the benchmark matrices follows one of two kinds of distributions, so it is determined by the mean of each distribution via the distribution density $P$. Combining a value of $I$ with a value of $P$ thus yields one benchmark matrix.

・ The number of strips ($I$):

➢ CSR-V: Let $I = 1, 2, 3, \cdots, 9, 10, 15, 20, 25, \cdots, 45, 50$

On our experimental platform, the strip size of the CSR-V format is smaller than that of the other formats, so in order to predict the performance accurately we increase $I$ up to 50.

➢ CSR-S, ELL, JAD: Let $I = 1, 2, 3, \cdots, 9, 10$

・ The distribution density ($P$):

➢ CSR-V: Let $P = 4/R, 8/R, 16/R, \cdots, 512/R, 1024/R, 1536/R, 2048/R, 2560/R, 3072/R$

Because Matlab runs out of memory when $P$ is too large, we take $3072/R$ as the maximum value; this does not affect the accuracy of the performance prediction.

➢ CSR-S, ELL, JAD: Let $P = 4/R, 8/R, 16/R, \cdots, 512/R, 1024/R$
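Putting the criteria and parameter choices together, one benchmark matrix can be generated as in the sketch below. For simplicity it uses a fixed per-row count rather than sampling from the normal or uniform distribution, so the function and its parameters are illustrative assumptions:

```python
import numpy as np

def benchmark_matrix(S, I, p_nz, seed=0):
    """Generate a square benchmark matrix in CSR form with R = S * I rows
    and exactly p_nz non-zeros per row (a simplification: the per-row
    counts actually follow a distribution with mean p_nz)."""
    R = S * I                      # Equation (4); square, so C = R
    assert R > p_nz, "criterion C > P_NZ must hold"
    rng = np.random.default_rng(seed)
    # Choose p_nz distinct column positions for each of the R rows.
    col_idx = np.concatenate(
        [rng.choice(R, size=p_nz, replace=False) for _ in range(R)])
    values = rng.random(R * p_nz)
    row_ptr = np.arange(0, R * p_nz + 1, p_nz)
    return values, col_idx, row_ptr, R

values, col_idx, row_ptr, R = benchmark_matrix(S=192, I=2, p_nz=16)
```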

Finally, the formula for calculating the average execution time $T_B$ of the benchmark matrices is the same as that of [

$$T_B = \frac{\sum_{j=1}^{\beta} \phi\left( (M_{R \times C}) \times V_C \right) - \sum_{j=1}^{\alpha} \phi\left( (M_{R \times C}) \times V_C \right)}{\beta - \alpha}$$ (5)

where $M_{R \times C}$ denotes a benchmark matrix of dimension $R \times C$; $V_C$ is a random vector of length $C$; $\alpha$ and $\beta$ are numbers of executions with $\alpha < \beta$; and $\phi$ is the execution time of one execution of SpMV with the benchmark matrix. For a target matrix with $N_R$ rows and $N_{NZ}$ non-zero elements, the number of strips $I$ and the number of non-zero elements per row $P_{NZ}$ for the four formats can be computed as follows:

$I_{\text{CSR-V}} = \left\lceil \frac{N_R}{S_{\text{CSR-V}}} \right\rceil, \quad I_{\text{CSR-S}} = \left\lceil \frac{N_R}{S_{\text{CSR-S}}} \right\rceil, \quad I_{\text{ELL}} = \left\lceil \frac{N_R}{S_{\text{ELL}}} \right\rceil, \quad I_{\text{JAD}} = \left\lceil \frac{N_R}{S_{\text{JAD}}} \right\rceil$ (6)

Let $D$ be the multiset of per-row non-zero counts of the target matrix. Then $P_{NZ}$ is set to the mode of $D$ for a CSR-V matrix, and to the maximum value of $D$ for CSR-S, ELL and JAD matrices.
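Equations (5) and (6), together with the choice of $P_{NZ}$ from $D$, can be sketched as follows. The timings and the tiny CSR matrix are illustrative; the code reads Equation (5) as discarding the first $\alpha$ runs (warm-up) and averaging the remaining $\beta - \alpha$, which is one plausible reading of the formula:

```python
import math
import numpy as np

def average_benchmark_time(phi, alpha, beta):
    """Equation (5): drop the first alpha timings and average the
    remaining beta - alpha runs."""
    return (sum(phi[:beta]) - sum(phi[:alpha])) / (beta - alpha)

def target_parameters(row_ptr, S, fmt):
    """Equation (6) plus the choice of P_NZ for a target matrix given
    in CSR form: number of strips I and per-row non-zero count P_NZ."""
    D = np.diff(row_ptr)                      # non-zeros in each row
    N_R = len(D)
    I = math.ceil(N_R / S)                    # Equation (6)
    if fmt == "CSR-V":
        vals, counts = np.unique(D, return_counts=True)
        p_nz = int(vals[np.argmax(counts)])   # mode of D
    else:                                     # CSR-S, ELL, JAD
        p_nz = int(D.max())                   # maximum of D
    return I, p_nz

# Illustrative timings (ms) for beta = 6 runs; first alpha = 2 discarded.
T_B = average_benchmark_time([0.9, 0.6, 0.5, 0.5, 0.5, 0.5], alpha=2, beta=6)

# Tiny 4-row CSR matrix whose rows hold 2, 3, 2, 2 non-zeros.
I, p_nz = target_parameters([0, 2, 5, 7, 9], S=3, fmt="CSR-V")
```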

Using statistical methods, we fit the performance function of SpMV for each storage format based on three parameters of the benchmark matrices: $I$, $P_{NZ}$ and $T_B$. Once the performance function is obtained, we can estimate the execution time $T_T$ of SpMV for a target matrix by substituting its two parameters $I$ and $P_{NZ}$ into it.

After a large number of experiments and fits, we found that for CSR-V matrices the relationship between $T_B$ and $P_{NZ}$ differs depending on whether $P_{NZ}$ is smaller or larger than the maximum number of threads per block (1024 for the GeForce GTX 540M). Therefore, the performance fitting function is obtained by the following method.

・ Establish the function $T(P_{NZ})$

For the benchmark matrices with the same number of strips, we establish the relationship between $P_{NZ}$ and the execution time $T_B$ of SpMV. The fitting functions of the two distributions for 40 strips are shown in

・ Establish the function $E(I)$

For the benchmark matrices with the same $P_{NZ}$, we establish the relationship between $I$ and the execution time $E$ (i.e., $T_B$ above) of the benchmark matrices for SpMV. The fitting functions of the two distributions for $P_{NZ} = 64$ and $2048$ are shown in

・ Estimate the execution time of a target matrix

For a target matrix, we first compute two parameters from Equation (6) and $D$: the number of non-zero elements per row $P_0$ and the number of strips $I_0$, and then evaluate $T(P_0)$ and $E(I_0)$ from the above functions. To compensate for the difference between the fitted functions when the number of non-zero elements per row is smaller or larger than 1024, we also use the execution time $t_0$ of a previously tested benchmark matrix whose $P_{NZ}$ equals the value used in fitting $E(I)$. The estimated execution time of the target matrix in CSR-V format is then

$T_0 = \dfrac{T(P_0)}{t_0} \times E(I_0)$.
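The CSR-V estimation step can be sketched as below. The two lambda functions are toy stand-ins for the fitted functions $T(P_{NZ})$ and $E(I)$, not the paper's actual fits:

```python
def estimate_csr_v(T, E, t0, p0, i0):
    """CSR-V estimate: T_0 = T(P_0) / t_0 * E(I_0). The ratio
    T(P_0) / t_0 rescales E(I_0) from the reference P_NZ used when
    fitting E(I) to the target's non-zero density."""
    return T(p0) / t0 * E(i0)

# Toy stand-ins for the fitted functions (illustrative only).
T = lambda p: 0.01 * p + 0.2     # time vs. P_NZ at a fixed strip count
E = lambda i: 0.05 * i + 0.1     # time vs. strips at the reference P_NZ
t0 = T(64)                       # benchmark time at the reference P_NZ

estimated = estimate_csr_v(T, E, t0, p0=32, i0=8)
```

Note that when the target's $P_0$ equals the reference $P_{NZ}$, the ratio $T(P_0)/t_0$ is 1 and the estimate reduces to $E(I_0)$.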

・ Establish the function $T(P_{NZ})$

For the benchmark matrices with the same $I$, we establish the relationship between $P_{NZ}$ and $T_B$ for SpMV. The fitting functions of the two distributions for $I = 5$ are shown in

・ Establish the function $f(I)$

For sets of benchmark matrices with different numbers of strips, we establish the relationship between the number of strips $I$ and the slope coefficient $f(I)$ of the linear functions $T(P_{NZ})$. The fitting functions of the two distributions are shown in

・ Establish the function $E(I) = f(I) \times P_1 + g(I)$

Like the fitting function $T(P_{NZ})$, we establish the relationship between the number of strips $I$ and the execution time $E$ (i.e., $T_B$ above) of the benchmark matrices with the same number of non-zero elements per row $P_1$, which can be any value within the defined range of $P_{NZ}$. The fitting functions of the two distributions for $P_{NZ} = 32$ are shown in

・ Estimate the execution time of a target matrix

Given a target matrix, we first compute two parameters from Equation (6) and $D$: the number of non-zero elements per row $P_0$ and the number of strips $I_0$, and then evaluate $f(I_0)$ and $g(I_0)$ from the above functions. The estimated execution time of the target matrix in CSR-S format is then $T(P_0) = f(I_0) \times P_0 + g(I_0)$.

After obtaining the performance function of the CSR-S format, we find that the relationship between the dependent variable $T_B$ and the two variables $P_{NZ}$ and $I$ forms a saddle surface: when $I$ is fixed, $T_B$ depends linearly on $P_{NZ}$, and vice versa. This coincides with the 3D fitting image obtained in Matlab, as shown in
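The saddle-surface (bilinear) fitting procedure for CSR-S can be sketched end-to-end. The benchmark times here are synthesized from a known bilinear surface standing in for real measurements, so the recovered estimate should match the surface:

```python
import numpy as np

# Synthetic benchmark times generated from a known bilinear surface
# T_B = (a*I + b) * P_NZ + (c*I + d); coefficients are illustrative.
a, b, c, d = 0.002, 0.01, 0.03, 0.05
I_vals = np.arange(1, 11)
P_vals = np.array([4, 8, 16, 32, 64], dtype=float)

# Step 1: for each strip count, fit the linear T(P_NZ) = f(I)*P_NZ + g(I).
f_samples, g_samples = [], []
for i in I_vals:
    times = (a * i + b) * P_vals + (c * i + d)   # "measured" T_B values
    slope, intercept = np.polyfit(P_vals, times, 1)
    f_samples.append(slope)
    g_samples.append(intercept)

# Step 2: fit f(I) and g(I) as linear functions of the strip count.
f = np.poly1d(np.polyfit(I_vals, f_samples, 1))
g = np.poly1d(np.polyfit(I_vals, g_samples, 1))

# Step 3: estimate for a target matrix with I_0 strips and P_0
# non-zeros per row.
I0, P0 = 7, 48
estimated = f(I0) * P0 + g(I0)
```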

Note that the granularity of the ELL and JAD formats is the same as that of the CSR-S format, which assigns one thread to each row to implement SpMV on GPUs. Therefore, the performance functions of the ELL and JAD formats are fitted in a similar way to the CSR-S format, only with different functional expressions. The 3D images obtained with Matlab can likewise be fitted with the performance functions and need not be repeated here.

The experiments are performed on an NVIDIA GeForce GTX 540M with 1 GB of global memory; the operating system is 64-bit Linux with the CUDA 6.5 driver. We evaluated our performance prediction model on 30 matrices for each sparse matrix storage format. These matrices are square real matrices from the University of Florida Sparse Matrix Collection [

In order to compare with [

We define the performance difference rate for the different models as

$$D_r = \frac{|\text{estimated time} - \text{measured time}|}{\text{measured time}}$$ (7)
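Equation (7) is straightforward to compute; the times below are illustrative values:

```python
def difference_rate(estimated, measured):
    """Equation (7): relative gap between estimated and measured
    execution time (smaller is better)."""
    return abs(estimated - measured) / measured

# Illustrative times (ms): an estimate 0.08 ms above the measurement.
d_r = difference_rate(estimated=0.58, measured=0.50)
```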

For the CSR-V format, the performance difference rates of SpMV under the three models on the 30 matrices are shown in

Furthermore, there are three cases for the prediction accuracy of [

When implementing SpMV on the GPU with the CSR-S format, the performance difference rates under the three models on the 30 matrices are shown in

In addition, the prediction accuracy of [

Similar results for ELL matrices are given in

better number for the factor of normal and uniform model vs. [

The execution times of the four SpMV kernels under the three models on the 30 matrices are shown in Figures 11-14. Because the execution times differ greatly across matrices in the different storage formats, we split them into two figures: the shorter times in (a) and the longer times in (b). Almost all of the estimated times, in all four storage formats, are greater than the actual measured times. A possible reason is that we round the number of strips $I$ up to an integer.

Aiming at a better statistics-based performance model of SpMV on GPUs and building on [

In the future, we will extend our performance prediction model to other SpMV with different storage formats on different kinds of GPUs. In addition, we will propose a new performance model to predict the execution time of a class of iterative methods on heterogeneous parallel machines.

Wang, R.X., Gu, T.X. and Li, M. (2017) Performance Prediction Based on Statistics of Sparse Matrix-Vector Multiplication on GPUs. Journal of Computer and Communications, 5, 65-83. https://doi.org/10.4236/jcc.2017.56005