Graphical Processing Unit Based Time-Parallel Numerical Method for Ordinary Differential Equations

On-line transient stability analysis of a power grid is crucial in determining whether the power grid will traverse to a steady state stable operating point after a disturbance. The transient stability analysis involves computing the solutions of the algebraic equations modeling the grid network and the ordinary differential equations modeling the dynamics of the electrical components like synchronous generators, exciters, governors, etc., of the grid in near real-time. In this research, we investigate the use of time-parallel approach in particular the Parareal algorithm implementation on Graphical Processing Unit using Compute Unified Device Architecture to compute solutions of ordinary differential equations. The numerical solution accuracy and computation time of the Parareal algorithm executing on the GPU are demonstrated on the single machine infinite bus test system. Two types of dynamic model of the single synchronous generator namely the classical and detailed models are studied. The numerical solutions of the ordinary differential equations computed by the Parareal algorithm are compared to that computed using the modified Euler’s method demonstrating the accuracy of the Parareal algorithm executing on GPU. Simulations are performed with varying numerical integration time steps, and the suitability of Parareal algorithm in computing near real-time solutions of ordinary different equations is presented. A speedup of 25× and 31× is achieved with the Parareal algorithm for classical and detailed dynamic models of the synchronous generator respectively compared to the sequential modified Euler’s method. The weak scaling efficiency of the Parareal algorithm when required to solve a large number of ordinary differential equations at each time step due to the increase in sequential computations and associated memory transfer latency between the CPU and GPU is discussed.


Introduction
The time-domain simulation technique is widely used by the power industry to describe a power grid transient behavior accurately. The high level of accuracy achieved using the time-domain simulation technique is due to the use of detailed mathematical models of controls, nonlinearity, saturation, and protection systems. Power system stability studies or analysis typically involve computing the system response to a sequence of large disturbances, such as generator outage or network short circuit, followed by a switching operation as part of protective measures. The system response computation involves a direct simulation in the time-domain of duration varying between 1 s and 20 min., or more. The system response or stability at different stages of time-domain simulation is affected by different components of the grid dictated by the level of mathematical modeling of the individual components used.
The power system stability studies using the time-domain simulation technique are performed using two levels of mathematical models of the grid components, namely the short-term and long-term models. The short-term models represent the rapidly responding system electrical components of the power grid like generators, exciters, governors, turbines, etc., while the long-term models represent the slow-oscillatory system power balance (variable load). Using the short-term models to perform power system stability studies addressing the post-disturbance times of up to 5 -10 secs is classified as "Transient Stability" Analysis (TSA) whereas using the long-term models to stability studies are associated with frequency and voltage stability. The focus of the research work presented in this paper is on the Transient Stability Analysis of the power system.
The TSA involves computing the step-by-step solution of thousands of non-linear systems of coupled differential-algebraic equations (DAEs) representing the dynamic components and the network interconnect of the dynamic components of the power system. The TSA, in particular, is concerned with the simulation of faults and contingencies, which can produce instability of the power system. The focus is to simulate a number of possible contingencies in a short-time horizon to evaluate possible instability conditions and develop appropriate corrective actions. The preventive simulation and corresponding corrective action are repeated for tens or hundreds of cases, until a system or utility operator by an online evaluation of the power system state, detects unsafe operating conditions. These exhaustive and computationally intensive simulations are performed to provide an operator with appropriate corrective action that can be triggered when a contingency occurs in real-time. However, the online TSA, in particular, S. Lakshmiranganatha, S. S. Muknahallipatna DOI: 10.4236/jcc.2020.82004 Journal of Computer and Communications is a computationally challenging problem, requiring 10 -15 minutes to perform preventive simulation of a power system (depends on the size of the power system) for a set of fault conditions and outages [1]. In 1990s, a number of researchers explored the use of traditional time-domain simulation and innovative computer architectures like parallel/vector processing and distributed computing [2] [3] and [4] to achieve online TSA and demonstrated the limited amount of parallelism that could be exploited.
The parallelization techniques that can be applied to perform a transient stability analysis of a power system can be broken into spatial domain decomposition, numerical method, and temporal domain decomposition or time-parallel parallelism approaches. The spatial decomposition approach [5] [6] involves partitioning the system DAEs and distributing the computations over various processors. The parallelism across numerical method approach is to exploit the parallelism in the numerical scheme used to solve the DAEs. The approach is to use waveform relaxation, VDHN-Maclaurin numerical schemes [7] [8] instead of the traditional trapezoidal, Runge-Kutta, and Adam-Bashford methods. The temporal decomposition approach involves dividing the integration interval into blocks and solving the blocks in parallel.
The first two parallelization techniques have been researched and available in some commercial packages like PSS-E, PowerWorld, and OPAL-RT [9]. The research focus has been on using task level and distributed computing across multiple contingency analysis but not on speeding up the computations in a single case. The use of the time-parallel parallelism approach in other fields was first considered by Nievergelt [10], and he presented a method for parallelizing the numerical integration of an ordinary differential equation. In 1979, Alvarado proposed the use of time-parallel with trapezoidal algorithm [11] for transient problems. However, the implementation of the proposed approach could not achieve significant computational speedup due to the coupling between adjacent time steps resulting in sequential execution. Later, pipelining [12] and relaxation-based techniques [13] were proposed to implement time-parallel in single case. The speedup gain of the pipelined-in-time was limited due to sequential convergence, while the gain with the relaxation-based techniques was limited due to slow convergence. Many time-parallel approaches [14] are applicable to certain classes of problems only and dependent on specific computer architectures.
In this paper, we investigate the use of the time-parallel approach and in particular the Parareal algorithm (PRA) implementation on the Graphical Processor Unit (GPU) using the Compute Unified Device Architecture (CUDA) for solving ODEs representing the electrical components of the power system. The investigation in [10] [20], distribution of workload [21] [22], and application to solve a large class of traditional problems involving the solutions of non-linear parabolic equations [23], nonlinear differential equations in the financial world [24], equations of molecular dynamics [16], quantum chemistry [25], partial differential equations in optimal control, etc., to mention a few. Recently, the use of time-parallel algortihms to reduce the computational time of power system dynamics simulation has been investigated by a few researchers.
The implementation of PRA has been investigated in [26] for the detailed models of power systems for TSA. The authors have investigated different numerical integration methods for coarse solvers to analyze the stability and con- The research in [28] shows that embedding the spatial decomposition into PRA has better performance than using either one of them individually. Each system is spatially divided into two sub-areas and is solved in parallel. Each sub-area is solved using PRA to achieve time parallelism and reduction in execution time. The research demonstrated a ~33% improvement in the parallel fine steps execution time between the hybrid method proposed and only PRA. A new approach was proposed to perform TSA in [29], which is a spatially parallel hybrid approach combining the high-order Taylor series and the block bordered Furthermore, the research in all of the above investigations is on evaluating the suitability of PRA for transient stability analysis. In this paper, we investigate the performance of PRA to solve ODEs using heterogeneous computing architecture, namely the use of massive parallel cores on the graphical processing units (GPU) with compute unified device architecture (CUDA) [30]. The focus of the investigation is to develop a reliable implementation of PRA on CUDA architecture to solve ODEs in temporal decomposition to reduce computational time and which can be applied to achieve real-time or faster than real-time TSA with a large number of GPUs.
This paper is organized as follows: In section 2, the PRA with the Predictor Correction approach is discussed. Section 3 gives a brief overview of power system modeling, in particular, the classical and fourth-order models of synchronous generator. In section 4, we present the implementation of Parareal algorithm on NVIDIA GPUs, and in section 5, the simulation results and analysis are presented. Section 6 presents the conclusion and future work.

Parareal Algorithm
The origin of PRA can be traced back to spatial domain decomposition technique. The PRA involves dividing the entire simulation time T into small sub-intervals and solving these subintervals in parallel. Initial conditions are required to solve the small sub-intervals in parallel. The initial conditions are provided by a fast but less computationally expensive sequential numerical integrator. The small sub-intervals are then solved in parallel to get more accurate solution of an ODE.
Consider a general nonlinear ODE with given initial condition, ( ) 0 The entire simulation time t is decomposed into N sub-intervals as Two numerical operators, namely, coarse and fine propagators are defined in PRA. The coarse propagator denoted as G ΔT , operates using the initial condition ( ) is used to compute the approximate solution of Equation (1) with time step ΔT at time T n as shown in Figure 1. The approximate solution computed by the coarse propagator is denoted as  n U . The solution obtained using coarse propagator is less accurate but is computationally inexpensive. The coarse propagator is mathematically represented in Equation (2).
The fine propagator denoted as F δt , also will use the initial condition ( )   The solution computed from the fine propagator is denoted as  n U . The solution obtained from the fine propagator is more accurate compared to coarse propagator but it is computationally expensive. The fine propagator is mathe- The flowchart of the implementation of PRA is shown in Figure 3.
The flow chart consists of the three steps of the PRA which are discussed below: Step 1: The initial step of PRA is the computation of the initial conditions sequentially using the coarse propagator that is used for solving the sub-intervals in parallel. The initial coarse propagator generates a fast but less accurate initial conditions using Equation (4).
where, the superscript "0" denotes the initial iteration.
This step is performed to initialize the PRA iterations.
Step 2: The fine propagator is used to propagate the fine solution in parallel where, Step 3: Once the fine solutions are obtained using the coarse solutions as initial conditions, the PRA corrects the sequential coarse predictions using predictor-corrector method. Predictor-Corrector method is used to correct the solution difference obtained from coarse and fine propagators for the next iteration. The predictor-corrector scheme is described in Equation (6).
where,  Journal of Computer and Communications the kth iteration, the coarse value at time T k will get corrected to its respective fine solution. The steps 2 and 3 of the algorithm is iterated until the difference between the two successive coarse values meets the desired tolerance level shown in Equation (8).
The coarse solutions are generally less accurate and play an essential role in the convergence of the algorithm [14].

Ordinary Differential Equations Representing the Power System Dynamics
Power system dynamics are modeled as a set of Differential-Algebraic equations (DAE) of the form ( ) , , The set of differential Equation (9)

Classical Model of Synchronous Generators
The classical model primarily focuses on modeling the rotor angle and angular velocity of the generator when subjected to a disturbance. When the power system is subjected to a disturbance, the rotor of the synchronous generator will accelerate or decelerate with respect to rotating magnetic field which causes rela-

Detailed Model of Synchronous Generators
The detailed model of a synchronous generator addresses the direct and quadrature axis parameters of a synchronous generator taking into account the sa- where, d E′ and q E′ are the transient voltages along direct (d) and quadrature (q) axis respectively of the generator.  The electrical torque e T is given by the Equation (18) To compute e T using Equation (18), two algebraic equations involving the stator parameters of the generators have to be solved. Therefore, the set of differential equations modeling the dynamics of the power system with classical model of the synchronous generators will consist of two first order ordinary differential equations, and with the detailed model will consists of four first order ordinary differential equations along with three algebraic equations. The TSA of a power system due to a disturbance will involve solving a set of first order ODEs of each generator in a power system using a suitable numerical integration method. Since the generator ODEs are typically stiff the time step used in the numerical integration method has to be small to compute an accurate solution and not encounter numerical integration instability.
The number of ODEs modeling the dynamics of a generator depending on the level of modeling can vary from two to twenty seven when all of the control devices like exciter, governor, turbine, stabilizer, etc., are included. A typical power grid having in excess of thousands of generators, the TSA involves computing the numerical integration solution of in excess of ten thousands of ODEs to determine the stability of the grid, necessitating the use of Parareal algorithm executing on GPUs.

Numerical Method
The two state variables δ and ω variation in time due to a disturbance is determined by solving the Equations (12) and (13) in case of classical model and Equations (14) and (17)  The modified Euler's method is also known as the predictor-corrector method.
The modified Euler's method is a single-step method, which given the initial values for an interval (t n−1 , t n ), the approximate solution at t n is obtained in two steps: Step 1: Predictor In this step, the approximate solution p n y is computed using the explicit Euler's method with time step size h described by the Equation (19).

GPGPU Based Parareal Algorithm Implementation
General The pseudocode of the PRA implementation for GPGPU is shown in Figure   4. The first for loop implements the coarse propagator to compute the initial conditions in a sequential manner since it is an inexpensive computational task.

Test System
The performance of the PRA is demonstrated by studying the dynamics of a single machine infinite bus system (SMIB) shown in Figure 5.  Table 1 and Table 2, respectively.
In Figure 5, both of the transmission lines have a reactance of 0.3 pu while the transformer reactance is 0.2 pu.

Results and Performance Analysis
The PRA is implemented on a server having a Intel Xeon CPU E5-2670 @2.30 GHz, interfaced through the PCIe bus to with NVIDIA Quadro RTX 6000 GPU hosting 4608 computing cores with 24 GB GPU memory [37]. The C programming language version of CUDA is used in implementing the PRA for execution on GPUs.
TSA simulations using both classical and detailed generator models were performed using both the sequential algorithm and PRA. First, the variation of rotor angle of the generator with respect time due to a disturbance computed with  Next, the impact of the number of coarse propagators and fine propagators on speedup is analyzed.

Simulations Using the Classical Generator Model
Classical generator model has only two state variables rotor angle δ and rotor speed ω that result in two ODEs that need to be solved at every angle with sequential and PRA are shown. As expected, the rotor angle keeps increasing due to the system being unstable. The rotor angle computed using the PRA follows the angle from the sequential algorithm demonstrating the suitability of PRA even during the unstable system state. In Figure 7(b), the absolute error between the sequential and PRA solution is presented. In Figure 7(b), the magnitude of the error between the sequential and PRA is increasing while following the contour of the rotor angle variation. This is due to the numerical solutions computed by the coarse propagator are numerically instable (increasing numerically). Furthermore, these numerically instable values are used as the initial conditions for fine propagators resulting in an amplification of the numerical instability. The numerical instability affects negatively the performance of the predictor-corrector stage of the PRA resulting in an increasing error. This behavior is expected as the system is unstable.

Simulations Using the Detailed Generator Model
The In Figure 8(a), the rotor angle variations with sequential and PRA simulations are presented. The rotor angle solutions computed using the PRA is similar to the traditional sequential method. Also, it can be noticed that the rotor angle swing is damped and settles to a new system steady-state. In Figure 8 In Figure 9(a), the rotor angle variation with time when the fault clearing time is 1.3 secs is shown. The 1.3 secs are larger than the 0.77 secs critical clearing time and therefore the system is unstable. The rotor angle variations computed using the PRA follows that computed using the sequential algorithm. In Figure 9(b), the absolute error between PRA solution and sequential solution is shown. The absolute error has larger value due to the cascading of the error through the four ODEs.

Performance Analysis
The performance of PRA is analyzed using the execution time speedup achieved with respect to the traditional sequential algorithm. The speedup is given by Equation (21). Journal of Computer and Communications  Sequential v/s PRA for unstable system; (b) Absolute error between sequential and PRA for unstable system. In Figure 10, the variation of speedup with f N is shown. Since the speedup is increasing linearly, the parallel scalability of the PRA has strong scaling efficiency. The strong scaling efficiency is due to the f G t being significantly large compared to the sum of remaining four time components in Equation (22).
In Table 4, the execution times of both sequential algorithm and PRA computing the solutions of the ODEs of the detailed model along with the speedups are presented. The simulation time T was set to 26.1 secs to account for the long term dynamic stability with the detailed model of a generator. In Table 4, it can be seen that the PRA execution time is significantly small compared to the sequential computational time resulting in a speedup of 31×. It is important to emphasis that with the detailed model, the number of ODEs solved sequentially  can impact the performance due to significant amount of memory transfers between the host and device for systems with higher-order generator models. In future work, various methods will be explored to mitigate the memory transfers between the host and device, and the PRA algorithm will be tested for higher-order generator models for large power systems.