PELLR: A Permutated ELLPACK-R Format for SpMV on GPUs

The sparse matrix vector multiplication (SpMV) is inevitable in almost all kinds of scientific computation, such as iterative methods for solving linear systems and eigenvalue problems. With the emergence and development of Graphics Processing Units (GPUs), high efficient formats for SpMV should be constructed. The performance of SpMV is mainly determinted by the storage format for sparse matrix. Based on the idea of JAD format, this paper improved the ELLPACK-R format, reduced the waiting time between different threads in a warp, and the speed up achieved about 1.5 in our experimental results. Compared with other formats, such as CSR, ELL, BiELL and so on, our format performance of SpMV is optimal over 70 percent of the test matrix. We proposed a method based on parameters to analyze the performance impact on different formats. In addition, a formula was constructed to count the computation and the number of iterations.

The calculation of SpMV based on COO format is not suitable for GPU structure when the matrix is stored in disorder. In this case, the multi-threads will access data and write vector in discontinuous way. On the other hand, this format will occupy more memory than that of CSR format, which will be introduced in next subsection.

CSR Format
The compressed sparse row (CSR) format is the most practical format to store sparse matrices [13]. It can also represent as three one dimension arrays. Let N and nnz be the number of rows and the total number of non-zeros of the sparse matrix, respectively. J , so all threads access data in discontinuous way. This is why its performance is poor on GPUs.
The CSRV format is proposed in [5] and modified in [7] to realize the memory access contiguously. Unlike CSRS, it uses a half-warp (generally, 16 threads) to compute each row. All of the threads in a half-warp access the memory contiguously, so the chance of coalescing will be higher. Each thread of a half-warp compute partial result and stored in share memory and it's need to reduce the partial result to sum up when all calculation finished [8].

ELL Format
Ellpack format (ELL, in brief) is well suited to vector architectures [5]. For a N M × matrix with a maximum of K non-zeros every row, the ELL format stores the non-zero elements in each row to the left side in a dense N K × array. The corresponding column indices are store in another two dimension array. represents element of the ith row and the ) column. ELL can be considered as an approach to fit a sparse matrix in a regular data structure similar to a dense matrix. When the numbers of non-zeros in each row are almost equal, the zeros need to be padded will be less, which leads to a high performance of the implementation of SpMV on GPUs. On the other hand, when difference of the number of non-zeros between rows is large, more zeros need to be padded, which will decrease the performance. is N (i.e. the number of rows of the matrix).

ELLR Format
These three arrays are represented in Figure 2. This format have some advantage, which can be seen in [6] for more detail. Some of its characters are listed in the following: 1) The coalesced global memory access, thanks to the column-major ordering used to store the matrix elements [6].
2) Non-synchronized execution between different blocks of threads.
3) The reduction of the waiting time or unbalance between threads of one warp [6]. Figure 3 give an example of ELLR, which assume a warp consists 8 threads. The darker areas are the non-zero elements that need to be computed. The lighter areas are the waiting times. The number of iterations needed in a warp equals to the largest number of non-zeros in rows (e.g., the first warp needs 4 iterations in Figure 3). It reduces many iterations compare with ELL format.  4) Homogeneous computing within the threads in the warps.

BiELL Format
BiELL format is a bisection ELL format [8]. Based on ELL format, it sort the rows in a warp according to the numbers of non-zeros in each row, then group the rows in a warp by bisection technique. The detailed process can be seen in [8]. records the order of rows. A simple example is given in Figure 4, which assume there is 4 threads in a warp [8].
The main advantage of the BiELL format is that it balances the workload of different threads in a warp, so reduces the waiting time. By using bisection technique, the non-zero elements in a group are equally allocated to different threads. This reduces the number of zeros to be padded and the number of iterations.

HYB Format
The hybrid format (HYB) is a combination of the ELL and COO formats. The purpose of the HYB is to store the non-zeros of a given number per row in the ELL data structure and the remaining entries in the COO format [5]. How to select the storage portion of the ELL is a difficult point in the HYB format. One way profitable to do this is to add the Kth column to ELL structure if at least one third of the matrix rows contain K (or more) non-zeros. The other way is to select the average number of non-zeros of each row to storage in ELL format.

JAD and BiJAD Format
The jagged diagonal (JAD) format [7] [14] sorts the rows based on the numbers JAD reduces the number of zeros to be padded, which leads to a better performance than the ELL format.
The bisection JAD (BiJAD) format is a bisection of JAD, which is an optimized and improved version of JAD on GPUs. BiELL sorts each row in a warp, while BiJAD sorts all the rows. The BiJAD format may decrease the padding zeros compared with BiELL format; however, when the results are permuted back to the origin order, the pattern memory accessed may not be coalescent [8].

Our New Format: PELLR
In order to optimizing the SpMV on GPUs, we propose a new format, PELLR format. It is based on the permutation of row for ELLR format. PELLR format sorts the rows based on the number of non-zeros of each row, then stored the non-zeros in ELL format. It consists of four one dimension ar- GPU calculation is based on a warp as a whole. It's going to happen frequently that a row consist of little non-zeros (e.g., <5), while another row consist many non-zeros (e.g., >20) belong to a warp. This creates extra unnecessary computational workload. Our idea is to sort the whole row according to the number of non-zeros in each row, so the rows with more non-zero elements would be arranged together, and the rows with fewer elements would be grouped together. This will reduce unnecessary calculations and obtain an optimized version of storage format. An example is given in Figure 5. Now, we give some analysis and compare of ELLR and PELLR. In order to better describe how to sum the total work amount, we use the following denotes:  nnz : the total number of non-zeros.
 N: the number of rows.
For a matrix, we can use these two expressions to obtain the amount of computation and the number of iterations. For the ELLR and PELLR formats, it can  (1) and Equation (2) that the total number of iterations has changed, but the amount of calculation has not changed. For the matrix given in Figure 5 iter ellr iter pellr p ellr p pellr We reduce the number of iterations by a new permutation of the rows. In this example, PELLR only needs 14 iterations, less than 18 iterations needed by ELLR.

Numerical Result
Our experiments are run on a personal computer equipped with NVIDIA Quadro P600; the operating system is a 64-bit Linux with CUDA 10.0 driver. The SDK and CUDA Toolkit, CUSPARSE [15], are used for programming. All the programs are based on CUDA-ITSOL [16] released by Li and Saad.
All the test matrices in our experiments are real square matrices collected from Matrix Market and the university of Florida sparse matrix collection. Information about the matrix is listed in Table 1. The parameters used in the table as follows: N: the matrix row size. nnz : the non-elements number for matrix. ave : the average of non-zeros per row, ave nnz N = . σ : the standard deviation of the number of non-zeros elements per row. max min − : is the difference between the maximum and minimum of non-zeros elements per row.
Performance in GFlops is calculated as 2 nnz T × , where T is the wall time of SpMV calculated on the GPU. In order to improve the performance of SpMV, we used texture memory to store the vector x for the SpMV kernels. This memory is bound to the global memory and plays the role of a cache level within the memory hierarchy [3]. In Figure 6, the experimental results of ELLR and PELLR formats are presented. The ordinate is the ratio of the performance of PELLR format against that of ELLR format. The abscissa is arranged based on σ of the matrices (and the axes in the following figures are arranged in the same way). It can be seen that the performance of PELLR format is better than that of ELLR format for all matrices. The benefits are significant for matrices with large parameters of ( ) 10 σ > , which means that the difference of the number of non-zero between rows is large, such as matrix bcsstk24, cavity25 and e40r5000. In this case, PELLR format is faster than ELLR format about 1.5 times. On the other hand, Journal of Computer and Communications for matrices with small ( ) 3 σ < , PELLR format has no obvious advantage to ELLR format, such as the matrix dw2048, qh1484 and s3dkt3m2. The matrices memplus and lnsp3937 are special. The structure of memplus is shown in Figure 7, which the dots in different colour represent non-zero elements. It has a σ of 22, but the ratio of performance is only has 1.1. Another fact for it is 7.1 avg = is small and max min 572 − = is really large, which means only few rows has many non-zeros entries in the matrix. The Figure 7 also proves this view of point. We can observe that there are many dots focus on the diagonals, the upper and the left sides of the matrix. So this matrix has only few rows in the front which have a lot of non-zero elements. So advantages of permutate are not distinct. The matrix lnsp3937 has 3.1 σ = and max min 10 − = , but the ratio of performance reaches 4.5. This is because the rows which have almost equal length are scattered, permutation groups the together and the number of iterations is reduced dramatically. From a large amount of experiments, we can make a general remark that PELLR format is faster than ELLR for almost all matrices, and when the matrix has the parameters of 10 σ > , 20 ave > and max min 10 − > , PELLR format is faster than ELLR format about a factor of 1.5.
We have compared the performance PELLR format with HYB format, the results are give in Figure 8. The program for HYB format we used is from CUSPARSE library [15]. We see again that the PELLR format has a significant advantage over HYB format. But in this case, the trend of ratio is different. With the increase of σ , the advantage of PELLR format trends to decrease. For matrices with small 10 σ < , the average factor of speedup is approximately 4. This is because HYB format use a factor of 1/3 (default in CUSPARSE) to separate the ELL and COO. What the defect of HYB in this case is comes from COO. While for large 10 σ > , the ratio is approximately 1.6, which shows that the effect of COO in this case is not so apparent. Some matrices have special results due to their structures. msc23052 and bcsstk36 have large max min − (166 and 170, respectively), the ratio is only 1.2. memplus (see Figure 7) has max min 572 − =, the ratio of 0.7 shows that PELLR is slower than HYB and the advantage of permutation is overwhelmed.
In Figure 9, we present the experimental results for PELLR and BiELL formats. The PELLR format still has advantages for most cases (17 vs. 3). It is worth   to note matrices lnsp3937 and mhd4800b. Their σ and max min − are 3.1 and 2, and 10 and 9, respectively. Since they are relatively small, the number of iterations per warp for BiELL format will not reduce very much, and the time of the judgment statement (in SpMV kernel on GPU) is not covered, so PELLR has obvious advantage in this cases. For matrix memplus, which structure is seen in Figure 7, the first warp of BiELL will reduce many iterations, so the ratio is 0.8172.
In Figure 10, PELLR and BiJAD are compared. Both formats sort all the rows in a whole. So the performances of two formats are nearest. The difference is BiJAD divides each warp into six groups to reduces the number of iterations, while PELLR format only relies on [ ] rl . So it is important that how many iterations are reduced in the warp in BiJAD SpMV kernel. If the number of iterations reduced is less, the effect of the judgment statement may be not covered and the benefits obtained may be less. Therefore, for these matrices, the experimental results depend on the characteristics of each matrix. For our 20 test matrices, PELLR format is advantageous in 15 cases (75%), while the highest ratio is only 1.3. Figure 11 shows a comparison of PELLR, HYB, BiELL, and BiJAD formats. We found that the PELLR, BiELL, and BiJAD formats are better than the HYB format in most cases, especially when parameter σ is relatively small (<10). For our test matrices, we find the performance of HYB format is superior to that of other formats only in the case of memplus. Figure 12 gives the comparison of more formats, which are CSRV, JAD, ELL, ELLR, PELLR, HYB and cuCSR. The cuCSR format is CSR format provided by CUSPARSE. What we can see from Figure 12 is listed as follows:   1) Overall, the PELLR format is optimal in most cases. And then the JAD and ELLR formats also turned out pretty well, the CSRV format is relatively poor.
2) The performance of CSRV and cuCSR is sensitive to the ave . In general, CSRV is poorer than cuCSR in most cases. CSRV will performance well when ave is large (>16), such as boneSo1 and bcsstk24.
3) HYB is not as good as we thought for our test matrices. But it can be found good performance when ave and σ are bigger, such as bcsstk36 and msc23052, and its performance is best for matrix memplus. 4) For almost matrices, PELLR and JAD are most outstanding, which in turn they are the best case. 5) All statements in the previous experiments is obtained again.

Conclusions
We proposed a permutated ELLR format by sorting the rows based on the number of non-zeros (or the length) of each row. This preprocessing makes the rows of almost equal length together. So the number of the iterations is reduced and the performance of SpMV can be improved, the speed up achieved about 1.5 in our experimental results. Furthermore, we deduced the formulation of the number of iterations and the work amount, which can be used to evaluate the performance of SpMV. In our experiments results, the performance of PELLR format is best in most cases. The performance comparison of different matrix formats is given, and some special cases are explained.
The PELLR format improves ELLR format in performance of SpMV, but it also adds increased storage memory. This format stores two more arrays than