The paper describes an efficient direct method to solve an equation Ax = b, where A is a sparse matrix, on the Intel® Xeon Phi™ coprocessor. The main challenge for such a system is how to engage all available threads (about 240) and how to reduce the OpenMP* synchronization overhead, which is very expensive for hundreds of threads. The method consists of decomposing A into a product of lower-triangular, diagonal, and upper-triangular matrices, followed by solves of the resulting three subsystems. The main idea is based on the hybrid parallel algorithm used in the Intel® Math Kernel Library Parallel Direct Sparse Solver for Clusters [1]. Our implementation exploits a static scheduling algorithm during the factorization step to reduce OpenMP synchronization overhead. To engage all available threads effectively, a three-level approach to parallelization is used. Furthermore, we demonstrate that our implementation can perform up to 100 times better on the factorization step and up to 65 times better in terms of overall performance on the 240 threads of the Intel® Xeon Phi™ coprocessor.

This paper describes a direct method for solving the equation Ax = b on Intel® Xeon Phi™ coprocessors. The objective is to utilize all available threads effectively. Additionally, it is very important to reduce the OpenMP* synchronization overhead, which grows with the number of threads. Our approach is based on the hybrid parallel algorithm used in the Intel® Math Kernel Library Parallel Direct Sparse Solver for Clusters [1].

Direct sparse solvers typically rely on three stages: reordering, factorization, and solve. The reordering step includes computing a fill-reducing permutation with METIS [...] and constructing the dependency tree used later for parallel factorization.

In this chapter, we optimize algorithms for the last two stages of the direct solver. For most of the matrices investigated, optimizing the second stage brings the most noticeable improvement in overall performance, so we focus our investigation on this part of the solver.

Consider a sparse matrix A. We begin by permuting the initial matrix to reduce the number of nonzero elements in the matrix L of the LDU decomposition. Then neighboring columns with identical nonzero structure are grouped into super-nodes [...].

Algorithm 1. Simple parallel factorization.

This approach shows good performance on a small number of threads, but as the thread count grows, performance degrades due to the multiple OpenMP synchronization points.

During the reordering step we compute a dependency tree which allows us to factor the matrix in parallel.

We can use the dependency tree to perform the factorization as described in Algorithm 2.

Because we store the initial matrix in super-nodal format, during the factorization step we operate on dense blocks. These dense blocks can be large, so using BLAS level 3 [...] routines to process them is essential for performance.

We now formulate a three-level approach to parallelization based on the three algorithms considered earlier. All OpenMP threads are divided into independent groups that are responsible for parallelization according to the dependency tree, and all synchronizations take place only inside each group of threads. This is the first level of parallelization. Then, inside each group, we apply Algorithm 1; this is the second level. Finally, within each group we unite two or four threads (depending on the maximum number of available threads) into a team that performs BLAS level 3 operations in parallel; this is the third level. This three-level approach allows us to utilize all threads effectively. In particular, this distribution of threads between the different groups reduces the OpenMP synchronization overhead that historically impacted performance negatively.

We now consider the last stage of the direct sparse solver, namely solving the systems with the lower-triangular L, diagonal D, and upper-triangular U matrices. Similar to the parallel factorization algorithm, we use two-level parallelization in this case. As before, we distribute the leaf nodes of the dependency tree between threads and compute the elements corresponding to these nodes. Then we use the respective threads from the child nodes to compute the unknowns that correspond to their parent node in the tree. As a result, more and more threads are involved in the computation of a node as we proceed closer to the top of the tree. This composes the second level of parallelization in our algorithm. To effectively utilize the threads collaboratively working on nodes up the tree, we apply a parallelization algorithm similar to the one used when we start computations for the whole tree. The scheme of the algorithm, on the example of the lower-triangular matrix L, is shown in the figure.

The idea described above comes from the left-looking 1D directed-acyclic-graph approach. First, we distribute the super-nodes of the tree among threads; then each thread starts to handle its own super-node.

Algorithm 2. LDU decomposition based on the dependency tree.

To prevent multiple synchronizations near the top of the tree, we apply a right-looking approach close to the root. Thus, to compute the unknowns corresponding to any node of the tree, we first update these values with the values already computed at the children level.

The platform used for the experiments in this paper is an Intel Xeon Phi coprocessor. The system is equipped with 16 GB GDDR5 memory and includes a 61-core coprocessor running at 1.23 GHz. In this work, we used 60 cores to test the solver, leaving the remaining core to run the operating system and other software.

Sparse matrices used in our performance evaluation are taken from The University of Florida Sparse Matrix Collection [...].

The test matrices used are: […] (N = … × 10^6, NNZ = 39 × 10^5); dielFilterV3real (N = 11 × 10^5, NNZ = 45 × 10^6); CoupCons3D (N = 4 × 10^5, NNZ = 22 × 10^6); and torso3 (N = 25 × 10^4, NNZ = 44 × 10^5). The chart shows that our implementation can achieve up to 100 times better performance during the factorization step using the full coprocessor, and up to 65 times improvement in overall performance. Currently, during the solution step we can effectively utilize about 60 cores (4 threads each); adding more threads does not yield any improvement.

In this paper, we presented an efficient implementation of Intel MKL PARDISO for the Intel Xeon Phi coprocessor. The implementation uses a three-level parallelization approach that allows for a more optimal utilization of all the available cores. The first level is based on the sparsity of the initial matrix and, as a result, on the sparsity of the factorized one. The second level is related to the dependency tree, which can be calculated at the reordering step using METIS [...].

The Intel Math Kernel Library now includes several new features, such as the Schur complement of a sparse matrix [...].

Alexander Kalinkin, Anton Anders, Roman Anders (2015). Intel® Math Kernel Library PARDISO* for Intel® Xeon Phi™ Manycore Coprocessor. Applied Mathematics, 6, 1276-1281. doi: 10.4236/am.2015.68121