Sparse matrix-vector multiplication (SpMV) is unavoidable in almost all kinds of scientific computation, such as iterative methods for solving linear systems and eigenvalue problems. With the emergence and development of Graphics Processing Units (GPUs), highly efficient formats for SpMV need to be constructed. The performance of SpMV is mainly determined by the storage format of the sparse matrix. Based on the idea of the JAD format, this paper improves the ELLPACK-R format by reducing the waiting time between different threads in a warp; the speedup reaches about 1.5 in our experiments. Compared with other formats, such as CSR, ELL and BiELL, our format gives the best SpMV performance for over 70 percent of the test matrices. We propose a parameter-based method to analyze how matrix properties affect the performance of the different formats. In addition, a formula is constructed to count the amount of computation and the number of iterations.

Sparse matrix-vector multiplication (SpMV) is a key operation in many areas of computational science, such as iterative methods for solving linear systems ( A x = b ), image processing and simulation. Improving the performance of SpMV is therefore very important.

A GPU contains many stream processors, so many threads can process multiple groups of data simultaneously, which gives high computational power, very high memory bandwidth and a high degree of parallelism. A GPU has several types of memory, such as shared memory, texture memory, global memory and local memory. Their access speeds differ, so computation can be greatly accelerated if they are used appropriately. The GPU architecture and CUDA programming model can see in [^{5}), with non-zero elements components is low (e.g. ≤5%). To improve computational efficiency, it is important to find a suitable matrix storage format and calculation method.

There are many storage formats related to sparse matrix, such as CSR, ELL, HYB, BiELL and so on. In [

This paper presents an optimization of the ELLPACK-R format, which we call PELLR. The remainder of the paper is organized as follows. Section 2 reviews some existing sparse matrix storage formats. Related work and our new PELLR format are described in Section 3, and Section 4 presents some numerical results. The conclusions are stated in Section 5.

In this section, some basic sparse matrix storage formats are described. For a clearer understanding, let’s use a simple model. A sparse matrix A is represented by

Coordinate (COO) storage format is the most direct and simple scheme for a sparse matrix [

• Real array A [ ] contains the non-zero entries row by row in any order.

• Integer array J [ ] is made of the corresponding column indices for each non-zero entry in A [ ] .

• Integer array I [ ] is made of the corresponding row indices for each non-zero entry in A [ ] .

SpMV based on the COO format is not well suited to the GPU architecture when the matrix entries are stored in arbitrary order: the threads then read the data and write the result vector in a non-contiguous way. In addition, this format occupies more memory than the CSR format, which is introduced in the next subsection.
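As an illustration of the three COO arrays, the following Python sketch computes y = A x sequentially. It is a reference for the data layout only, not a GPU kernel; the function name `spmv_coo` and the 3 × 3 test matrix are assumptions made for this example, and indices are 0-based.

```python
def spmv_coo(n_rows, A, J, I, x):
    """Sequential y = A x for a matrix stored in COO form.

    A holds the non-zero values, J their column indices and I their
    row indices. The entries may appear in any order.
    """
    y = [0.0] * n_rows
    for val, col, row in zip(A, J, I):
        # Each entry scatters into y[row]; on a GPU these writes would
        # collide unless they were made atomic or segmented.
        y[row] += val * x[col]
    return y


# Hypothetical matrix [[1, 0, 2], [0, 3, 0], [4, 0, 5]], stored unordered.
A = [5.0, 1.0, 3.0, 4.0, 2.0]
J = [2, 0, 1, 0, 2]
I = [2, 0, 1, 2, 0]
y = spmv_coo(3, A, J, I, [1.0, 1.0, 1.0])
print(y)
```

The scattered update of `y[row]` is the non-contiguous write pattern mentioned above.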

The compressed sparse row (CSR) format is the most practical format to store sparse matrices [

• Real array A [ ] of size nnz contains the non-zero entries row by row.

• Integer array J [ ] of size nnz is made of the corresponding column indices for each non-zero entry in A [ ] .

• Integer array I [ ] is made of the start pointer of each row in A [ ] and J [ ] . The size of I [ ] is N + 1 , and I [ N + 1 ] = nnz + 1 . The number of non-zeros of the ith row can be expressed as I [ i + 1 ] − I [ i ] .

There are two basic ways to implement SpMV on GPUs based on the CSR format: CSR scalar (CSRS) and CSR vector (CSRV). CSRS assigns one thread to each row. Since the non-zero values and column indices are stored row by row in A [ ] and J [ ] , neighbouring threads access data in a non-contiguous way, which is why its performance on GPUs is poor.
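The CSR layout and the one-thread-per-row idea of CSRS can be sketched in sequential Python as follows. This is only an illustration under stated assumptions: the name `spmv_csr_scalar` and the test matrix are hypothetical, and the arrays use 0-based indexing, so the last pointer is I [ N ] = nnz rather than the 1-based I [ N + 1 ] = nnz + 1 used in the text.

```python
def spmv_csr_scalar(A, J, I, x):
    """Sequential y = A x for CSR storage, one outer iteration per row."""
    n_rows = len(I) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        # In a CSRS kernel, row i would be handled by a single GPU thread.
        acc = 0.0
        for k in range(I[i], I[i + 1]):
            acc += A[k] * x[J[k]]
        y[i] = acc
    return y


# Hypothetical matrix [[1, 0, 2], [0, 3, 0], [4, 0, 5]] in 0-based CSR.
A = [1.0, 2.0, 3.0, 4.0, 5.0]
J = [0, 2, 1, 0, 2]
I = [0, 2, 3, 5]
y = spmv_csr_scalar(A, J, I, [1.0, 1.0, 1.0])
row_lengths = [I[i + 1] - I[i] for i in range(len(I) - 1)]
print(y, row_lengths)
```

At inner step k, thread i reads `A[I[i] + k]`, so threads with consecutive row indices touch addresses that are a whole row apart, which is the non-contiguous access described above.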

The CSRV format is proposed in [

Ellpack format (ELL, in brief) is well suited to vector architectures [

• Real two-dimensional array A [ ] contains the non-zero entries.

• Integer two-dimensional array J [ ] is made of the column indices for each non-zero entry in A [ ] .

Each row with fewer than K non-zeros needs to be padded with zeros in the ELL format. For ease of calculation on GPUs, it is common to store a two-dimensional array as a one-dimensional array, column by column. Then A [ i + j × N ] represents the element in the ith row and the ( j + 1 )th column ( j = 0 , 1 , ⋯ , K − 1 ).

ELL can be considered as an approach to fit a sparse matrix into a regular data structure similar to a dense matrix. When the numbers of non-zeros in the rows are almost equal, few zeros need to be padded, which leads to high performance of SpMV on GPUs. On the other hand, when the number of non-zeros differs greatly between rows, more zeros need to be padded, which decreases the performance.
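The column-major indexing A [ i + j × N ] and the effect of padding can be made concrete with a small Python sketch. The helper names `to_ell` and `spmv_ell`, the choice of 0 as the column index of padded entries, and the test matrix are assumptions made for this illustration.

```python
def to_ell(dense):
    """Convert a dense list-of-lists matrix to column-major ELL arrays."""
    n = len(dense)
    rows = [[(v, c) for c, v in enumerate(r) if v != 0] for r in dense]
    K = max((len(r) for r in rows), default=0)  # widest row
    A = [0.0] * (n * K)   # padded values stay 0.0
    J = [0] * (n * K)     # padded column indices point at column 0
    for i, r in enumerate(rows):
        for j, (v, c) in enumerate(r):
            A[i + j * n] = float(v)
            J[i + j * n] = c
    return A, J, K


def spmv_ell(A, J, K, n, x):
    """Sequential y = A x; every row performs K steps, padding included."""
    y = [0.0] * n
    for i in range(n):
        for j in range(K):
            y[i] += A[i + j * n] * x[J[i + j * n]]
    return y


dense = [[1, 0, 2], [0, 3, 0], [4, 0, 5]]
A, J, K = to_ell(dense)
y = spmv_ell(A, J, K, len(dense), [1.0, 1.0, 1.0])
print(A, J, K, y)
```

For a fixed j, consecutive rows i read consecutive entries of `A`, which is what allows coalesced access on a GPU; the padded zero in the second row is still multiplied, which is the wasted work that grows with row-length imbalance.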

The ELLPACK-R format (ELLR, in brief) makes some changes and optimizations to the ELL format. It consists of three one-dimensional arrays, A [ ] , J [ ] and rl [ ] .

• A [ ] and J [ ] are the same as in the ELL format.

• Integer array rl [ ] contains the number of non-zeros of each row. The size of rl [ ] is N (i.e. the number of rows of the matrix).

These three arrays are represented in

1) The coalesced global memory access, thanks to the column-major ordering used to store the matrix elements [

2) Non-synchronized execution between different blocks of threads.

3) The reduction of the waiting time or unbalance between threads of one warp [

4) Homogeneous computing within the threads in the warps.
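To show how the array rl [ ] is used, the following sequential Python sketch runs the ELLR product on hand-written arrays. The function name `spmv_ellr` and the data are hypothetical; the arrays encode the matrix [[1, 0, 2], [0, 3, 0], [4, 0, 5]] in column-major order with 0-based indices.

```python
def spmv_ellr(A, J, rl, x):
    """Sequential y = A x for ELLPACK-R storage."""
    n = len(rl)
    y = [0.0] * n
    for i in range(n):
        # One GPU thread per row; the loop bound rl[i] skips the padding.
        acc = 0.0
        for j in range(rl[i]):
            acc += A[i + j * n] * x[J[i + j * n]]
        y[i] = acc
    return y


A = [1.0, 3.0, 4.0, 2.0, 0.0, 5.0]   # column-major, one padded zero
J = [0, 1, 0, 2, 0, 2]
rl = [2, 1, 2]                        # non-zeros per row
y = spmv_ellr(A, J, rl, [1.0, 1.0, 1.0])
print(y)
```

Row 1 stops after one step instead of two. On a GPU, however, the threads of a warp advance together, so a warp still needs as many steps as its longest row.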

BiELL format is a bisection ELL format [

• Real array A [ ] and integer array J [ ] store the non-zeros and the corresponding column indices column by column and warp by warp.

• Integer array I [ ] contains the starting pointers of the first element in each group.

• Integer array perm [ ] records the order of rows.

A simple example is given in

The main advantage of the BiELL format is that it balances the workload of the different threads in a warp and so reduces the waiting time. By using the bisection technique, the non-zero elements in a group are allocated equally to the different threads. This reduces both the number of zeros to be padded and the number of iterations.

The hybrid format (HYB) is a combination of the ELL and COO formats. The purpose of the HYB is to store the non-zeros of a given number per row in the ELL data structure and the remaining entries in the COO format [

The jagged diagonal (JAD) format [

of non-zeros of each row, then stores the non-zeros in jagged diagonals. It consists of four arrays, A [ ] , J [ ] , I [ ] and perm [ ] .

• Real array A [ ] and integer array J [ ] store the non-zeros and their corresponding column indices jagged diagonal by jagged diagonal.

• Integer array I [ ] contains the starting position of the first element of each jagged diagonal.

• Integer array perm [ ] records the order of rows.

JAD reduces the number of zeros to be padded, which leads to a better performance than the ELL format.
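A minimal Python sketch of the four JAD arrays is given here for illustration. It assumes that rows are sorted by decreasing number of non-zeros and that ties keep their original order; the helper names `to_jad` and `spmv_jad` and the 4 × 4 test matrix are hypothetical, and indices are 0-based.

```python
def to_jad(dense):
    """Build JAD arrays (A, J, I, perm) from a dense list-of-lists matrix."""
    rows = [[(v, c) for c, v in enumerate(r) if v != 0] for r in dense]
    # perm[k] is the original index of the k-th longest row (stable sort).
    perm = sorted(range(len(rows)), key=lambda i: -len(rows[i]))
    A, J, I = [], [], [0]
    longest = len(rows[perm[0]]) if rows else 0
    for d in range(longest):          # d-th jagged diagonal
        for i in perm:
            if len(rows[i]) > d:
                v, c = rows[i][d]
                A.append(float(v))
                J.append(c)
        I.append(len(A))              # start of the next diagonal
    return A, J, I, perm


def spmv_jad(A, J, I, perm, x):
    """Sequential y = A x, written back in the original row order."""
    y = [0.0] * len(perm)
    for d in range(len(I) - 1):
        for k in range(I[d], I[d + 1]):
            # Rows are sorted, so entry k of diagonal d belongs to the
            # (k - I[d])-th sorted row.
            y[perm[k - I[d]]] += A[k] * x[J[k]]
    return y


dense = [[1, 0, 0, 0], [2, 3, 0, 4], [0, 5, 0, 0], [6, 0, 7, 0]]
A, J, I, perm = to_jad(dense)
y = spmv_jad(A, J, I, perm, [1.0, 1.0, 1.0, 1.0])
print(A, J, I, perm, y)
```

No padded zeros are stored: the array `A` has exactly as many entries as the matrix has non-zeros, at the cost of the extra pointer array and the permutation.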

The bisection JAD (BiJAD) format is a bisection of JAD, which is an optimized and improved version of JAD on GPUs. BiELL sorts the rows within each warp, while BiJAD sorts all the rows. The BiJAD format may reduce the number of padded zeros compared with the BiELL format; however, when the results are permuted back to the original order, the memory access pattern may not be coalesced [

To optimize SpMV on GPUs, we propose a new format, the PELLR format. It is based on a row permutation of the ELLR format.

The PELLR format sorts the rows by the number of non-zeros of each row, then stores the non-zeros in the ELL format. It consists of four one-dimensional arrays, A [ ] , J [ ] , rl [ ] and perm [ ] .

• A [ ] , J [ ] and rl [ ] are the same as in the ELLR format.

• Integer array perm [ ] records the order of rows.

The size of rl [ ] is taken as N (i.e. the number of rows of the matrix) to simplify the theoretical analysis. In actual computation, the size of rl [ ] can be taken as $N_w$ (much smaller than N; see below), which reduces the memory requirement.

PELLR mainly optimizes the third characteristic of the ELLR format. For a sparse matrix of size N × M , the number of non-zeros may differ greatly from row to row. GPU computation proceeds warp by warp, so it frequently happens that a row with few non-zeros (e.g., <5) and a row with many non-zeros (e.g., >20) belong to the same warp. The threads with short rows must then wait for the longest row, which creates unnecessary extra workload. Our idea is to sort all the rows by their number of non-zeros, so that rows with many non-zeros are grouped together and rows with few non-zeros are grouped together. This reduces unnecessary calculations and yields an optimized storage format. An example is given in
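The construction described in this section can be sketched in sequential Python as follows. This is an illustration under stated assumptions, not the kernel used in the experiments: rows are sorted by decreasing number of non-zeros with a stable sort, the arrays are 0-based and column-major, and the names `to_pellr` and `spmv_pellr` and the test matrix are hypothetical.

```python
def to_pellr(dense):
    """Build PELLR arrays (A, J, rl, perm) from a dense matrix."""
    n = len(dense)
    rows = [[(v, c) for c, v in enumerate(r) if v != 0] for r in dense]
    # Sort row indices so that rows of similar length become neighbours.
    perm = sorted(range(n), key=lambda i: -len(rows[i]))
    rl = [len(rows[i]) for i in perm]       # lengths in sorted order
    K = max(rl, default=0)
    A = [0.0] * (n * K)
    J = [0] * (n * K)
    for i, src in enumerate(perm):          # i: sorted position
        for j, (v, c) in enumerate(rows[src]):
            A[i + j * n] = float(v)
            J[i + j * n] = c
    return A, J, rl, perm


def spmv_pellr(A, J, rl, perm, x):
    """Sequential y = A x; results are scattered back through perm."""
    n = len(rl)
    y = [0.0] * n
    for i in range(n):                      # one GPU thread per sorted row
        acc = 0.0
        for j in range(rl[i]):
            acc += A[i + j * n] * x[J[i + j * n]]
        y[perm[i]] = acc                    # restore the original order
    return y


dense = [[1, 0, 0, 0], [2, 3, 0, 4], [0, 5, 0, 0], [6, 0, 7, 0]]
A, J, rl, perm = to_pellr(dense)
y = spmv_pellr(A, J, rl, perm, [1.0, 1.0, 1.0, 1.0])
print(rl, perm, y)
```

After sorting, `rl` is non-increasing, so the rows that share a warp have similar lengths; the only extra cost in the product is the indirect write `y[perm[i]]`.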

We now analyze and compare ELLR and PELLR. To describe how the total amount of work is counted, we use the following notation:

• $nnz$: the total number of non-zeros.

• $N$: the number of rows.

• $rl$: an array holding the number of non-zeros of each row.

• $warp$: the number of threads in a warp, $warp = 32$.

• $N_w = \lfloor (N + 32 - 1) / warp \rfloor$: the number of warps for the matrix, where $\lfloor \cdot \rfloor$ denotes rounding down to an integer.

• $N_{iter}$: the total number of iterations.

• $N_p$: the number of computations.

• $b_i$: an array of size $warp$ containing the number of non-zero elements of each row in the $i$th warp. In the final warp, we take zero for rows that do not exist. We have the following relation

$$ b_i = rl[\, warp \times (i - 1) + 1, \cdots, warp \times i \,], \quad i = 1, \cdots, N_w. $$

Then we can deduce that the number of iterations and the work amount are:

$$ N_{iter} = \sum_{i=1}^{N_w} \max \{ b_i[1], \cdots, b_i[32] \} \tag{1} $$

$$ N_p = \sum_{i=1}^{N_w} \left( \sum_{j=1}^{warp} b_i[j] + \sum_{j=1}^{warp} ( b_i[j] - 1 ) \right) = \sum_{i=1}^{N_w} \sum_{j=1}^{warp} ( 2 b_i[j] - 1 ) \tag{2} $$

For a given matrix, these two expressions give the amount of computation and the number of iterations. For the ELLR and PELLR formats, Equation (1) and Equation (2) show that the total number of iterations changes, while the amount of computation does not. For the matrix given in

$$
\begin{aligned}
b_{1,ellr} &= [2, 3, 3, 4, 4, 4, 2, 4], & b_{1,pellr} &= [7, 4, 4, 4, 4, 4, 3, 3], \\
b_{2,ellr} &= [2, 3, 2, 3, 2, 3, 2, 2], & b_{2,pellr} &= [3, 3, 3, 3, 3, 3, 3, 3], \\
b_{3,ellr} &= [2, 2, 7, 3, 3, 3, 3, 3], & b_{3,pellr} &= [2, 2, 2, 2, 2, 2, 2, 2], \\
b_{4,ellr} &= [4, 3, 0, 0, 0, 0, 0, 0], & b_{4,pellr} &= [2, 2, 0, 0, 0, 0, 0, 0].
\end{aligned}
$$

So

$$ N_{iter,ellr} = 18, \quad N_{iter,pellr} = 14, \quad N_{p,ellr} = N_{p,pellr}. $$

We reduce the number of iterations by a new permutation of the rows. In this example, PELLR needs only 14 iterations, fewer than the 18 iterations needed by ELLR.
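Equation (1) and Equation (2) are straightforward to evaluate in code. The Python sketch below is an illustration with hypothetical helper names (`split_into_warps`, `n_iter`, `n_p`); it applies Equation (1) to the arrays listed above, whose groups have eight entries each, and then uses a small made-up length array with groups of two to show that sorting changes the iteration count but not the work amount.

```python
def split_into_warps(rl, warp):
    """Cut rl into groups of `warp` rows, padding the last group with 0."""
    padded = list(rl) + [0] * (-len(rl) % warp)
    return [padded[k:k + warp] for k in range(0, len(padded), warp)]


def n_iter(groups):
    """Equation (1): each group needs as many steps as its longest row."""
    return sum(max(b) for b in groups)


def n_p(groups):
    """Equation (2), evaluated literally as the sum of (2 b - 1)."""
    return sum(2 * v - 1 for b in groups for v in b)


# Arrays copied from the example in the text.
b_ellr = [[2, 3, 3, 4, 4, 4, 2, 4], [2, 3, 2, 3, 2, 3, 2, 2],
          [2, 2, 7, 3, 3, 3, 3, 3], [4, 3, 0, 0, 0, 0, 0, 0]]
b_pellr = [[7, 4, 4, 4, 4, 4, 3, 3], [3, 3, 3, 3, 3, 3, 3, 3],
           [2, 2, 2, 2, 2, 2, 2, 2], [2, 2, 0, 0, 0, 0, 0, 0]]
print(n_iter(b_ellr), n_iter(b_pellr))

# Hypothetical lengths, groups of two rows: sorting only regroups rows.
rl = [1, 5, 2, 4]
unsorted_groups = split_into_warps(rl, 2)
sorted_groups = split_into_warps(sorted(rl, reverse=True), 2)
print(n_iter(unsorted_groups), n_iter(sorted_groups))
print(n_p(unsorted_groups), n_p(sorted_groups))
```

Because Equation (2) is a sum over all rows, any permutation of the rows leaves it unchanged, whereas the per-group maximum in Equation (1) depends on which rows share a group.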

Our experiments are run on a personal computer equipped with NVIDIA Quadro P600; the operating system is a 64-bit Linux with CUDA 10.0 driver. The SDK and CUDA Toolkit, CUSPARSE [

All the test matrices in our experiments are real square matrices collected from Matrix Market and the University of Florida Sparse Matrix Collection. Information about the matrices is listed in

N: the number of rows of the matrix.

nnz: the number of non-zero elements of the matrix.

ave: the average number of non-zeros per row, ave = nnz / N.

σ: the standard deviation of the number of non-zero elements per row.

max − min: the difference between the maximum and minimum number of non-zero elements per row.
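These parameters can be computed directly from the array of row lengths. The following Python sketch is an illustration only: the function name `row_statistics` and the sample lengths are hypothetical, and the population standard deviation (`statistics.pstdev`) is an assumption, since the text does not state which variant of σ is used.

```python
import statistics


def row_statistics(rl):
    """Return (nnz, ave, sigma, spread) for a list of row lengths."""
    nnz = sum(rl)
    ave = nnz / len(rl)                 # ave = nnz / N
    sigma = statistics.pstdev(rl)       # population standard deviation
    spread = max(rl) - min(rl)          # max - min
    return nnz, ave, sigma, spread


# Hypothetical row lengths of a 4-row matrix.
nnz, ave, sigma, spread = row_statistics([1, 3, 1, 2])
print(nnz, ave, round(sigma, 3), spread)
```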

Performance in GFlops is calculated as 2 × nnz / T , where T is the wall time of SpMV calculated on the GPU. In order to improve the performance of SpMV, we used texture memory to store the vector x for the SpMV kernels. This memory is bound to the global memory and plays the role of a cache level within the memory hierarchy [

In

Matrix | N | nnz | ave | σ | max − min |
---|---|---|---|---|---|
rdb2048l | 2048 | 12,032 | 5.9 | 0.34 | 2 |
dw2048 | 2048 | 10,114 | 4.9 | 0.51 | 5 |
dw8192 | 8192 | 41,746 | 5.1 | 0.61 | 5 |
qh1484 | 1484 | 6110 | 4.1 | 1.60 | 11 |
mhd4800b | 4800 | 27,250 | 5.7 | 2.00 | 9 |
s3dkt3m2 | 90,449 | 1,921,955 | 21.2 | 2.39 | 38 |
s3dkq4m2 | 90,449 | 2,455,670 | 27.1 | 2.67 | 44 |
sherman3 | 5005 | 20,033 | 4.0 | 2.70 | 6 |
gemat12 | 4929 | 33,044 | 6.7 | 3.00 | 42 |
lnsp3937 | 3937 | 25,407 | 6.5 | 3.10 | 10 |
mhd3200a | 3200 | 68,026 | 21.0 | 5.80 | 32 |
utm5940 | 5940 | 83,842 | 14.0 | 6.30 | 29 |
bcsstk24 | 3562 | 159,910 | 45.0 | 11.00 | 42 |
msc23052 | 23,052 | 1,154,814 | 50.1 | 11.60 | 166 |
bcsstk36 | 23,052 | 1,143,140 | 49.6 | 12.20 | 170 |
e20r4000 | 4241 | 131,430 | 31.0 | 15.00 | 54 |
e40r5000 | 17,281 | 553,562 | 32.0 | 16.00 | 54 |
cavity25 | 4562 | 131,735 | 29.0 | 17.00 | 61 |
boneSo1 | 127,224 | 6,715,152 | 52.8 | 17.64 | 69 |
memplus | 17,758 | 126,150 | 7.1 | 22.00 | 572 |

for matrices with small σ ( < 3 ), the PELLR format has no obvious advantage over the ELLR format, as with the matrices dw2048, qh1484 and s3dkt3m2.

The matrices memplus and lnsp3937 are special. The structure of memplus is shown in

From a large number of experiments, we can make the general remark that the PELLR format is faster than ELLR for almost all matrices, and that when a matrix has σ > 10 , ave > 20 and max − min > 10 , the PELLR format is faster than the ELLR format by a factor of about 1.5.

We have also compared the performance of the PELLR format with that of the HYB format; the results are given in

Some matrices have special results due to their structures. msc23052 and bcsstk36 have large max − min (166 and 170, respectively), yet the ratio is only 1.2. memplus (see

In

to note matrices lnsp3937 and mhd4800b. Their σ values are 3.1 and 2.0, and their max − min values are 10 and 9, respectively. Since these are relatively small, the number of iterations per warp for the BiELL format is not reduced very much, and the cost of the conditional statement (in the SpMV kernel on the GPU) is not hidden, so PELLR has an obvious advantage in these cases. For matrix memplus, whose structure is shown in

In

1) Overall, the PELLR format is optimal in most cases. The JAD and ELLR formats also perform well, while the CSRV format is relatively poor.

2) The performance of CSRV and cuCSR is sensitive to ave. CSRV is poorer than cuCSR in most cases, but it performs well when ave is large (>16), as for boneSo1 and bcsstk24.

3) HYB is not as good as we expected for our test matrices. However, it performs well when ave and σ are larger, as for bcsstk36 and msc23052, and it gives the best performance for matrix memplus.

4) For almost all matrices, PELLR and JAD are the most outstanding formats, and one or the other is the best.

5) All the observations from the previous experiments are confirmed again.

We proposed a permutated ELLR format obtained by sorting the rows by the number of non-zeros (or the length) of each row. This preprocessing groups rows of almost equal length together, so the number of iterations is reduced and the performance of SpMV is improved; the speedup reaches about 1.5 in our experiments. Furthermore, we derived formulas for the number of iterations and the work amount, which can be used to evaluate the performance of SpMV. In our experimental results, the PELLR format performs best in most cases. A performance comparison of the different matrix formats is given, and some special cases are explained.

The PELLR format improves on the ELLR format in SpMV performance, but it also requires more storage. It stores two more arrays than the ELL format, rl [ ] and perm [ ] . In the future, we want to reduce the memory requirement of the PELLR format and to study how to choose an optimal storage format for a sparse matrix from the matrix parameters and Equation (1).

This work was supported in part by Science Challenge Project, No. TZ2016002 and NSF of China (61472462, 11671049, 11601033).

The authors declare no conflicts of interest regarding the publication of this paper.

Wang, Z.Q. and Gu, T.X. (2020) PELLR: A Permutated ELLPACK-R Format for SpMV on GPUs. Journal of Computer and Communications, 8, 44-58. https://doi.org/10.4236/jcc.2020.84004