Parallelizing a Code for Counting and Computing Eigenvalues of Complex Tridiagonal Matrices and Roots of Complex Polynomials

A code developed recently by the authors, for counting and computing the eigenvalues of a complex tridiagonal matrix, as well as the roots of a complex polynomial, which lie in a given region of the complex plane, is modified to run in parallel on multi-core machines. A basic characteristic of this code (eventually pointing to its parallelization) is that it can proceed with: 1) partitioning the given region into an appropriate number of subregions; 2) counting eigenvalues in each subregion; and 3) computing (already counted) eigenvalues in each subregion. Consequently, theoretically speaking, the whole code in itself parallelizes ideally. We carry out several numerical experiments with random complex tridiagonal matrices, and random complex polynomials as well, in order to study the behaviour of the parallel code, especially the degree of declination from theoretical expectations.


Brief Description of the Method
A code for "counting and computing the eigenvalues of a complex tridiagonal matrix, as well as the roots of a complex polynomial, lying in a given region of the complex plane" (CCE) has been developed by the authors in [1].For convenience, we use here the symbols and abbreviations adopted in [1].
The motivation for studying this problem arises in several issues of physics (e.g.quantum field theories, oscillating neutron stars, gravitational waves, modern cosmological theories).
Given a complex tridiagonal matrix A of order n with characteristic polynomial   p  ( [1], Equations (1), ( 2)), and a compact region with closure     prescribed by the simple closed contour , C   C , the "argument principle" (see e.g.[2], Chapter 10, Section 10.3) implies that the number N of the roots lying in , counted with their multiplicities, is given by (cf.[1], Equation (3)) where it is assumed that is followed in the positive direction and C   p  has no roots on .Evaluating the contour integral (1) is equivalent to solving the complex IVP ( [1], Equation (4)) along .We assume for simplicity that is a rectangle with its "perimeter" defined by five vertices (the first vertex coincides with the last one, since this contour is closed; cf.[1], Equation (3)), , , ; x y x x y x x y y x y y x y without loss of generality, 0, 0.
x y x iy To solve the complex IVP defined by Equations ( 2) and ( 4), and thus find , we use the Fortran package dcrkf54.f95:a Runge-Kutta-Fehlberg code of fourth and fifth order modified for the purpose of solving complex IVPs, which are allowed to have high complexity in the definition of their ODEs, along contours prescribed as continuous chains of straightline segments; interested readers can find full details on dcrkf54.f95 in [3].This package contains the subroutine DCRKF54, used in this study with KIND = 10, i.e. with high precision, and with input parameters as given in [1] (Section 3.1).
To compute the roots of   Consequently, the sub-rectangle and its perimeter are written as As discussed in [1] (Sections 3.4, 4), instead of the elements of a tridiagonal matrix, CCE can readily accept the coefficients of a complex polynomial, eventually of very high degree, with several roots of high multiplicity, as well as with closely spaced roots.

Theoretical Expectations
The main program of the code, so-called CCEDRIVER, examines if partitioning the given rectangle is necessary and sets accordingly the logical variable PARTI-TION.In this case, CCEDRIVER proceeds with partitioning by calling an appropriate subroutine, socalled PARTITIONING.Furthermore, COUNTING is the subroutine which calculates by Equations ( 7) and (8) the perimeter ij of a sub-rectangle ij , calls repeatedly DCRKF54 to integrate along ij , and thus computes the number ij of the roots lying in ij (cf.Equation ( 6)); CNTALL is its driving subroutine which repeats the counting of the roots for all sub-rectangles.Next, ROOTFINDING is the subroutine which calls for a sub-rectangle ij the code that implements the cooperation of the double-precision and the multiple-precision NRs (so-called NR2; [1], Section 3.3), and thus computes the ij roots lying in ij  ; RTSALL is its driving subroutine which repeats rootfinding for all sub-rectangles.

NMR  
Apparently, CCEDRIVER proceeds first with the counting of the roots by calling CNTALL or COUNT-ING (dependent on the value assigned to PARTITION) and second with the rootfinding by calling RTSALL or ROOTFINDING.Consequently, the execution inside CCEDRIVER proceeds in serial.This has not to be the case, however, inside CNTALL and RTSALL.Since counting the roots and rootfinding are processes repeated independently for each sub-rectangle ij  , the corresponding subroutines can run in parallel on multi-core machines, with each available thread processing a part of the sub-rectangles.
2  Thus, theoretically, it is expected that the code can be efficiently parallelized inside the subroutines CNTALL and RTSALL with respect to the sub-rectangles to be processed.Such theoretical expectations are carefully tested here by carrying out numerical experiments with random complex tridiagonal matrices and high-degree polynomials as well.

OpenMP Parallelization
The Open Multi-Processing (OpenMP; http//openmp.org/) is an Application Program Interface (API), jointly defined by a group of major computer hardware and software vendors, supporting shared-memory parallel programming in C/C++ and Fortran, for platforms ranging from desktops to supercomputers.Several compilers from various vendors or open source communities implement the OpenMP API.Among them, the GNU Compiler Collection (GCC; http://gcc.gnu.org/)includes a Fortran 95 compiler, so-called "gfortran" (http://gcc.gnu.org/fortran/), licensed under the GNU General Public License (GPL; http://www.gnu.org/licenses/gpl.html).The official manual of gfortran can be found at http://gcc.gnu.org/onlinedocs/gcc-4.7.0/gfortran.pdf.The GCC 4.7.xreleases, including corresponding gfortran 4.7.xreleases, support the OpenMP API Version 3.1.The official manual of this version can be found at http://www.openmp.org/mp-documents/OpenMP3.1.pdf.This OpenMP version is used here by gfortran.
To enable the processing of the OpenMP directive sentinel !$OMP, gfortran is invoked with the "-fopenmp" option.If so, then all lines beginning with the sentinels !$OMP and !$ are processed by gfortran.
To The sentinel !$ followed by three dots, !$..., denotes code (omitted here) to be compiled only when OpenMP is invoked.The PARALLEL and END PARALLEL directives define a parallel construct.The structured block of code enclosed in a parallel construct is executed in parallel by the machine's available threads.The code placed directly in a parallel construct is its "lexical extend", while, in turn, the lexical extend plus all the code called by it is the "dynamic extend" of the parallel construct; for instance, COUNTING and all code called by it lie in the dynamic extend of the parallel construct shown above.Several data-sharing attribute clauses can be linked to a PARALLEL directive, like DEFAULT, PRIVATE, FIRSTPRIVATE, REDUCTION, etc. Three parenthesized dots, next to their names, denote clauses linked to such directives; for instance, we mostly declare DE-FAULT(SHARED), indicating that all variables in a parallel construct are to be shared among the available threads, except, eventually, for variables (or common blocks) named explicitly in a PRIVATE directive.
The DO and END DO directives define a loop worksharing construct, which distributes the loop computations over the available threads; so, each thread computes part of the required iterations.Several clauses are permitted to be linked to the DO directive, like SCHEDULE, PRIVATE, etc. Three parenthesized dots next to their names denote clauses linked to such directives.For instance, we write SCHEDULE (STATIC[,CHUNK]), in-dicating that all iterations are to be divided into pieces of size CHUNK and then to be statically assigned to the available threads (if CHUNK is not specified, then the iterations are evenly divided contiguously among the threads).On the other hand, the mostly used SCHED-ULE(DYNAMIC[,CHUNK]) indicates that all iterations are to be divided into pieces of size CHUNK (if not present, then by default CHUNK = 1), and then to be dynamically assigned to the available threads, so that when a thread finishes one chunk, it is dynamically assigned another.
To parallelize the subroutine RTSALL, we follow a similar procedure.Parallel programming details are not analysed in this paper.We have mentioned only OpenMP directives that have been used in our code; certain significant issues, mainly concerning data environment, should be carefully taken into account by interested readers.

Computational Environment and Software Used
Our computer comprises an Intel ® Core TM i7-950 processor with four physical cores.This processor possesses the Intel Hyper-Threading Technology, which delivers two processing threads per physical core.gfortran has been installed in this computer by the TDM-GCC "Compiler Suite for Windows" (http://tdm-gcc.tdragon.net/),which is free software distributed under the terms of the GPL.To calculate values of the Fortran intrinsic function SQRT, our computer needs in double precision (i.e.KIND = 8) and almost the same time in high precision (i.e.KIND = 10), and in the precision of 256 digits with the MPFUN90 System (available in http://crd-legacy.lbl.gov/~dhbailey/mpdist;licensed under the Berkeley Software Distribution License found in that site).In the numerical experiments of this study, we use an efficient algorithm for computing the coefficients of the characteristic polynomial of a general square matrix [6], abbreviated to CPC as in [1] (Section 3).Thus, in all computations involving polynomials, CCE makes use of the time-saving Horner's scheme.In [1] (Section 3.3), we have compared the eigenvalues computed by CCE with respective results of an efficient and very fast code, abbreviated to CXR ([1], Section 3), solving general complex polynomial equations ( [7]; this Fortran code, named cmplx_roots_sg.f90,can be found in http://www.astrouw.edu.pl/~jskowron/cmplx_roots_sg/; it is licensed under the GNU Lesser GPL, Version 2 or later, or under the Apache License, Version 2; details on these licences are given in the documents "LICENSE" and "NOTICE", found in the same site).The CPC algorithm, as well as the subroutine CMPLX_ROOTS_GEN with its subsidiaries CMPLX_LAGUERRE, CMPLX_LAGUERRE2 NEWTON, and SOLVE_Quadratic_EQ (these four contained in cmplx_roots_sg.f90),have been modified so that to work in MPFUN90's multiprecision.
The task of the present study is to compare execution times when CCE runs in serial and in parallel (the accuracy of CCE has been tested and verified in [1]).This task is accomplished by performing several numerical experiments with random complex tridiagonal matrices, and random complex high-degree polynomials as well.

Numerical Experiments and Comparisons
First, we perform numerical experiments with random complex tridiagonal matrices, constructed in the MPFUN 90's multiprecision environment, as described in [1] (Section 3.3).We study tridiagonal matrices of order n = 128, 256, 400, and 512.In this work, the region is assumed to be the "primary rectangle" , containing all n roots, eventually determined by the numerical results of the CXR code.In all cases examined, the primary rectangle is partitioned into sub-rectangles.Table 1 shows execution times for CPC, CXR, CN-TALL-Serial, RTSALL-Serial, CCE-Serial, CNTALL-Parallel(S), RTSALL-Parallel(S), CCE-Parallel(S), CN-TALL-Parallel(D), RTSALL-Parallel(D), and CCE-Parallel(D).In these names, the label "Serial" denotes execution times when CCE runs in serial; the label "Parallel(S)" denotes execution times when CCE runs in parallel under the loop-construct directive SCHEDULE(STA-TIC); and the label "Parallel(D)" denotes execution times when CCE runs in parallel under the loop-construct directive SCHEDULE(DYNAMIC).
To compare results of Table 1, called "present", with corresponding results of [1] (Table 2), called in turn "previous", we link: 1) the "present" group labeled "CN-TALL-Serial" with the "previous" column labeled "DC-RKF54()", and 2) the "present" group labeled "RT-SALL-Serial" with the "previous" column labeled "NR2 ()".We remark that the "present" times spent on counting the roots are greater than the "previous" ones for the cases n = 128, 256, and 400, since now  is partitioned into 16,384 sub-rectangles; while, for instance, in the "previous" case n = 128,  is partitioned into just 144 sub-rectangles.
On the other hand, the "present" time spent on counting the roots for the case n = 512 becomes less than the "previous" one.Apparently, counting the roots on much more sub-rectangles turns to be time-saving for large n, since the perimeters of the 16,384 sub-rectangles, along which the integral ( 1) is now computed, decrease dramatically in length; accordingly, the "present" times spent on such computations become less than the "previous" ones.Furthermore, working with much more subrectangles turns to be in favour of computing the roots, since, now, each sub-rectangle contains a significantly less number of roots to be computed, and feeding the Newton-Raphson code with random guesses ([1], Section 3.2) is proved to be highly efficient.Thus all "present" times spent on computing the roots are less than the "previous" ones.Likewise, the "present" total execution times become less than the "previous" ones for the cases .400 n  Next, the time ratio CCE-Serial/CCE-Parallel(D) seems to be significant for checking the degree that theoretical expectations are fulfilled in the implementation of OpenMP parallelization on multi-core machines.In particular, since this time ratio takes values larger than 4 and up to ~5 (with the exception of the case n = 128), it becomes apparent that there is an almost equal sharing of the computational volume among the 4 cores available in the processor of our computer.Apparently, the additional gain is due to the hyper-threading technology possessed by this processor.Comparing execution times for Parallel(S) and Parallel(D) processing, we find that the corresponding values deviate slightly each other, i.e. the corresponding time ratio deviates slightly from unity.It is worth remarking, however, that in all cases examined the (even small) gain is on the side of Parallel(D); in simple words, it seems better to leave the machine to decide how to share the computational work.Second, we accomplish numerical experiments with random complex polynomials of degree n = 400, 512, 768, 1024, and 1512.They are constructed in the MPFUN90's multiprecision environment as described in [1], with the clarification that the n successively derived random complex numbers 1 are assigned to the corresponding polynomial coefficients of the degree-n polynomial , , n a a    n p z , z   ; while, as usual, the    2, we arrive at conclusions similar to those above.We emphasize on the fact that CCE running in parallel is proved ~13 times faster than the CXR code in the case of a degree-400 complex polynomial, and increases gradually up to ~30 times faster than CXR in the case of a degree-1512 Copyright © 2013 SciRes.AM


Concerning the results of Table parallelize the subroutines CNTALL and RTSALL, we apply to them several OpenMP directives.Concerning CNTALL, for instance, we write the code SUBROUTINE CNTALL(...)