^{1}

^{*}

^{1}

^{*}

This research covers the Intel^{?}^{ }Direct Sparse Solver for Clusters, the software that implements a direct method for solving the Ax = b equation with sparse symmetric matrix A on a cluster. This method, researched by Intel, is based on Cholesky decomposition and could be considered as extension of functionality PARDISO from Intel^{?}^{ }MKL. To achieve an efficient work balance on a large number of processes, the so-called “multifrontal” approach to Cholesky decomposition is implemented. This software implements parallelization that is based on nodes of the dependency tree and uses MPI, as well as parallelization inside a node of the tree that uses OpenMP directives. The article provides a high-level description of the algorithm to distribute the work between both computational nodes and cores within a single node, and between different computational nodes. A series of experiments shows that this implementation causes no growth of the computational time and decreases the amount of memory needed for the computations.

The paper describes a direct method based on Cholesky decomposition for solving the equation Ax = b with sparse symmetric matrix A. The positive-definite matrix A can be represented in terms of LL^{T }decomposition; in case of an indefinite matrix, the decomposition is LDL^{T}, where the diagonal matrix D can be amended with extra “penalty” for additional stability of the decomposition. To achieve an efficient work balance on a large number of processes, the so-called “multifrontal” approach to Cholesky decomposition is proposed for the original matrix.

The multifrontal approach was proposed in the papers [1,2] and successfully used in implemented algorithms [2-9]. The decomposition algorithm implementation consists of several stages. The initial matrix is subject to a reordering procedure [10-14] to represent it in the form of a dependency tree. The number of existing reordering algorithm is quite huge and the main aim of implementation such algorithms is to achieve good width of tree (to obtain better scalability) and increase fill-in of computed L matrix. We choose metis algorithm [^{T}. Then a factorization of the permuted matrix in the LL^{T} form takes place [4-5,7].

This work is devoted to Intel^{®} Direct Sparse Solver for Clusters package. This package implements parallelization based on nodes of the dependency tree using MPI as well as parallelization inside the node of tree using OpenMP directives. A variant of MPI-parallelization of Cholesky decomposition can be found in [3,7].

However, as it has been written before, such algorithms are based on representing initial matrix as “elimination-tree” representation (or assembly tree in terms of [^{®} MKL [

The paper is organized as follows: Section 2 provides a brief description of the reordering step and describes parallel algorithms to distribute the work between both computational nodes and cores within a single node. Section 3 briefly describes the algorithm distributing a tree node between different computational nodes and proposes our version of handle elements of factorized matrix by different processes with using OpenMP approach on each process. Section 4 demonstrates a series of experiments that shows dependence working time of implemented algorithm of number of processes/threads and dependence peak of memory size needed for a process during algorithm working of number of processes.

There are 2 general ways of solving system of linear equations—iterative and direct algorithms. The iterative algorithms can be more perspective but to implement they effectively one need to know information about matrix, like spectral analyses or differential equation which correspondent to system. Commonly this information can be achieved. On the other hand direct method could be applied for any matrix and spectral property of matrix can depend on obtained solution only. The Cholesky decomposition is the one of the direct methods.

In a general case, the algorithm of Cholesky decomposition can be presented in the following way:

L = A for j = 1,size_of_matrix

{for j = 1,i

{L(i,j) = L(i,j)-L(i,k)L(k,j), k = 1,j-1(1)

if (i==j) L(i,j) = sqrt(L(i,j)

if (i>j) L(i,j) = L(i,j)/L(j,j)

}

}

Here A is a symmetric, positive-define matrix and L is the resulted lower-triangular matrix. If the initial matrix has a lot of zero elements (such kind of a matrix is called sparse), this algorithm can be rewritten in a more efficient way called multifrontal approach.

Suppose we have a sparse symmetric matrix A

A reordered matrix is essentially more convenient for computations than the initial one since Cholesky decomposition can start simultaneously from several entry points (for the matrix from

To be precise, a non-zero pattern of the matrix L in Cholesky decomposition is calculated at the symbolic factorization step before the main factorization step (stage). At this stage, we know the structure of the original matrix A after the reordering step and can calculate the non-zero pattern of the matrix L. At the same stage, the original matrix A stored in the sparse format is appended with zeros so that its non-zero pattern matches completely that of the matrix L. Henceforth, we will not distinguish the non-zero patterns of the matrices A and L. Moreover, it should be mentioned here that elements of the matrix L in the rows 3 and 6 can be computed only after the respective elements in the rows 1, 2 and 4, 5 are computed. The elements in the 7th row can be computed at last. This allows us to construct the dependency tree [3,4,7,8]—a graph where each node corresponds to a single row of the matrix and each graph node can be computed only if its (nodes on which it depends) are computed. Deeper algorithm with pseudo code of distribution nodes of tree between processes can be finding in [

To compute elements of LL^{T} decomposition within a node Z of the dependency tree, it is necessary to compute the elements of LL^{T} decomposition in the nodes on which Z depends (its sub-trees). We call an update procedure to bring already computed elements to the node Z (actually, this procedure takes the elements from the subtrees of Z multiplied by themselves and subtracted from elements of Z, so the main problem here is in a different

non-zero pattern of Z and its sub-trees. The 4th line of the algorithm (1) completes this operation with different L(i,k) being placed into different nodes of the tree, the last step is to calculate LL^{T} decomposition of elements placed into node Z. Algorithm (2) describes an implementation of the above mentioned piece of the global decomposition in terms of a pseudo-code:

for current_level = 1, maximal_level

{Z = node number that will be updated by this process for nodes of the tree with level smaller than the current_level

{prepare an update of Z by multiplying elements of LL^{T} decomposition elements lying within the current node (2) by themselves;}

send own part of update of Z from each process to process which stores Z;

on process that stores Z compute LL^{T} elements of Z}

Consider in details an implementation of the algorithms called “compute elements of Z” and “prepare an update”.

Each tree node in turn can be represented as a sub-tree or as a square symmetric matrix. To use algorithm (2), one needs to implement LL^{T} decomposition of each node on a single computational process. To compute Cholesky decomposition of each node of the tree, a similar algorithm could be implemented on a single computational process with several threads. If the number of threads is equal to the number of nodes of the tree on the bottom level, then LL^{T} factors for each node from the very bottom level of the tree are calculated by a single thread, on the next level each node can be calculated by 2 threads, on the next one by 4 threads, etc. However, such an algorithm becomes inefficient for the top nodes of the initial tree. Instead, Olaf Schenk suggested some modification of the standard algorithm of Cholesky decomposition for sparse symmetric matrices. The standard LL^{T} decomposition for sparse symmetric matrices can be described as follows:

for row = 1, size_of_matrix

{for column = 1, row

{S = A[row, column]

for all non-zero elements in the row before the column,

S = S – A[row, element]*A[element, column];(3)

If column

else A[row, column] = S^{1/2};

}

}

In the paper [

for column = 1, supr

{if column is not computed - wait, otherwise do

{A[supr,*] = A[supr,*] – A[column, *]*A[column, column];}

}

A[supr, supr] = (A[supr, supr])^{1/2};(4)

// 1/2 means Cholesky decomposition of a dense matrix that can be computed with Lapack functions A[supr, supr + k, ∙∙∙, n] = A[supr, supr + k,∙ ∙∙, n] * inv (A[supr, supr)], where inv(B) mean inverse matrix B^{−}^{1};

In his paper, Olaf Schenk applies the algorithm to the entire matrix. In this paper, we apply it to the tree nodes only. Namely this provides an efficient Cholesky decomposition for a dependency tree node of the initial matrix.

The aforementioned Algorithm (4) describes an implementation of the procedure в “compute elements” in terms of the algorithm (2). The procedure “prepare an update” differs only in a sense that each process modifies zero columns with the structure similar to A[supr,*] rather than A[supr,*] itself. After each process computes its update performing simple summation, we obtain A[supr,*] from which the computed elements of LL^{T} decomposition have been already deduced.

Thus, we have described the main steps of algorithm (2). It has a number of drawbacks, however. First, the number of elements of the matrix L in each process differs significantly (for instance in the ^{T}-factors between processes more evenly, we propose an algorithm described in the next Section.

To avoid issues with uneven distribution of the elements of the matrix L, we propose a distribution method as in

_{1} supernodes belongs to the first process, the next group of m_{2} supernodes—to the second process, etc. Here m_{1} + m_{2} + ∙∙∙ + m_{n} = m. Note that the numbers m_{1}, m_{2}, ∙∙∙, m_{n} may vary that allows one to adjust them in order to provide better performance of the overall algorithm. Then Algorithm 2 is modified so that each process computes its part of the tree node Z. However, this idea does not provide a solution to the problem of keeping processes busy during the computations. Moreover, the problem becomes even bigger since parallel computation of the elements of the matrix L in a single tree node is virtually impossible—almost all supernodes in a single tree node are normally dependent on each other.

Note that each process can be executed by a modern computational node and, therefore, the node can consist of several dozens of individual computational threads.

For individual processes to send/receive the data and carry out the computations, we designate one thread in each process to be a “postman”. A “postman” is a thread responsible for data transfer between the processes. Let us consider how algorithm 3 changes.

As it was stated in the Section 2, the Algorithm “prepare an update” is a modification of the Algorithm 4. As was mentioned before, to calculate LL^{T} factors from node Z of the elimination tree we need to calculate all LL^{T} factors from its sub-trees and take them into account during computations of LL^{T} factors of node Z. In general case, however, the node of the tree Z and its sub-tree are stored on different computational processes, so we cannot do it straightforwardly (node Z and its sub-tree have different non-zero pattern, for example). To resolve the issue, the following algorithm is proposed: Each computational process i allocates matrix Z_{i} with the same non-zero pattern as the matrix corresponding to the node Z and fills it in with zero elements. Then, all elements from the subtrees stored on the process i are taken into account in the matrix Z_{i} as if Z_{i} is Z (4th line in the Algorithm 1). Further, the computed matrices Z_{i} are collected on the required process. Considering that we separated a postman thread, there is no need to compute an update first, and send it later, so computations and data transfers can be interleaved.

if thread is a postman thread

{Open a recipient to get updates for supr in A_loc

{if supernode supr is computed, send it to the respective process;

else wait until supernode is computed;

}

Else(5)

{create A_loc(all elements in Z)

for supr in A_loc

{ if column on the current process A_loc[supr,*] = A_loc[supr,*] – A[column, *] * A[column, column];

}

}

It is apparent that having decreased the number of the threads involved in the computations we increased the total time of “update” computations in each process. Nevertheless, the experiments with a big number of threads show that the computational time increases insignificantly. It is also important to note that despite of the increase of the computation time to do the necessary “update”, the transfer of the computed pieces between processes overlaps with these computations. Therefore, the total time spent in the new algorithm is less compared to the Algorithm 4.

It is much more interesting to consider a more complicated algorithm of computing the elements of the matrix L in a single node of the tree provided that this node is distributed among several processes. Let the elements of the node Z including the elements of the dependent sub-trees (children) that are not processed yet be distributed as it is shown in the

It can be clearly seen that with such a distribution only the process 0 can start the computations. Thus, the first supernode can only be computed within it. However, after the supernode computation is done, it can be used by the computational threads of the process 0 for other (dependent) supernodes, second it can be sent by the postman thread to other processes, which in turn can use it in their (dependent) supernodes (see the

With this example, it is apparent that if each process has more than 3 computational threads, some threads will simply lack a supernode to process, so making a postman out of one computational thread will have little effect on the overall efficiency if the thread count is big enough. Of course, one supernode can be processed with several threads, but this is not going to be considered in this paper.

All numerical experiments in this paper were carried on

the Infiniband^{*}-linked cluster consisting of 16 computational nodes; each node contains two Intel^{®} Xeon^{®} X5670 processors (12 cores in total) with 48Gb of RAM per node. The variable number of the computational threads within a node is created, i.e. only part of totally available threads is working within the nodes in the most cases.

For this experiment, we selected either 7-diagonal matrix resulted from the approximation of a Helmholtz equation on a uniform grid with a positive coefficient (specific Helmholtz coefficient value is not crucial here since it has no effect on the matrix structure and only the accuracy of the solution obtained depends on it), or the matrix generated from the oil-filtration problem. The number of degrees of freedom (NDOF) for the first matrix is about 398 K elements, for the second one is about 1.7 M, and the number of nonzero elements (NNZ) in each matrix is 15.7 M and 12 M, respectively.

The exact solution corresponds to unit vector – array with all elements equal to 1. Need to underline that performance of whole algorithm doesn’t depend on choosing rhs.

Figures 6-7 show acceleration of Intel^{®}Direct Sparse Solver for Clusters code on a different number of processes compared to the same program launched on 1 MPI process with 2 OpenMP threads on different matrices. The colored lines correspond to a different number of OpenMP threads used in the code. It can be seen from the Figures that the execution time reduces in all cases with the increase of the number of threads and processes. It can be readily seen that sometimes even super-linear acceleration takes place depending on the number of OpenMP threads that can be easily explained from the nature of the algorithm one thread is used to send& receive data and rather often it falls out of the computations. For example, in the case of 2 threads, one thread is the postman and the other is the computational one, in

the case of 4 threads, one thread is the postman and 3 others are computational ones.

Based on the Figures, it can be concluded that it is recommended to exploit the computational threads on the node to the maximum and it is not recommended to have several computational processes per one node. From the

We use the same matrices as before for the testing purposes. In the Figures 8-9, the maximal memory size needed for a computational node depending on the number of nodes is presented.

Memory size that Intel^{®} Direct Sparse Solver for Clusters needs for every computational node barely depends on the number of computational threads on it. Therefore, we present the data for the number of threads equal to 12 only. It is apparent that the memory size that every process needs demonstrates inverse dependence on the number of processes (for the second matrix, the memory required decreased 5 times for 16 processes vs. 1 process). If the computational cluster has insufficient memory per node, it is still possible to solve the system of linear equations using Intel^{®} Direct Sparse Solver for Clusters package increasing the number of nodes in the cluster. An example of this will be shown in the following paragraph.

In this paragraph, we chose a matrix of size 5.8 M with more than half a billion non-zero elements for the experiment. Thanks to Intel^{®} Direct Sparse Solver for Clusters memory scaling, we can solve this system on 8+ MPI processes. Note that almost 40 GB of memory is required per process in case of 8 MPI processes. For 16 processes, 29 GB of memory is only required (see

In Figures 10-11, the comparison of computational time with the configuration 8 MPI x 2 OpenMP is presented. It is clear that even for such a big matrix size and big number of MPI processes, Intel^{®} Direct Sparse Solver for Clusters shows good scalability both in terms of OpenMP threads and MPI processes.

Within the frameworks of multifrontal approach, we proposed an efficient algorithm implementing all stages of Cholesky decomposition inside the node of the dependency tree for all processes on a distributed memory machine. This approach is implemented in Intel^{® }Direct Sparse Solver for Clusters package, and numerical

experiments show good scaling in computational time— proportional to the number of computational nodes used and the number of threads within them. Such scaling can be achieved only by scaling all substeps of Cholesky decomposition. The main issue on this algorithm, factorization of node of tree by several processors, has been implemented via approach presented in this paper. Besides, this algorithm reduces the requirement for the memory size used by the algorithm on a single node when the number of processes grows. The experiments made the confirmation of it.

The authors would like to thank Sergey Gololobov for providing feedback that improved both style and content of the paper.