
Cyber-physical systems (CPS) represent a class of complex engineered systems where functionality and behavior emerge through the interaction between the computational and physical domains. Simulation provides design engineers with quick and accurate feedback on the behaviors generated by their designs. However, as systems become more complex, simulating their behaviors becomes computationally complex. Most modern simulation environments still execute on a single thread, which does not take advantage of the processing power available on modern multi-core CPUs. This paper investigates methods to partition and simulate differential equation-based models of cyber-physical systems using multiple threads on multi-core CPUs that can share data across threads. We describe model partitioning methods using fixed step and variable step numerical integration methods that consider the multi-layer cache structure of these CPUs to avoid simulation performance degradation due to cache conflicts. We study the effectiveness of each parallel simulation algorithm by calculating the relative speedup compared to a serial simulation applied to a series of large electric circuit models. We also develop a series of guidelines for maximizing performance when developing parallel simulation software intended for use on multi-core CPUs.

Cyber-physical systems represent a class of complex engineered systems whose functionality and behavior emerge through the interaction between the computational and physical domains; in addition, these interactions can occur locally within a system or be distributed in a networked environment [

Systems engineering methods play an important role in designing and analyzing CPSs. In the traditional approach to systems engineering, design has followed a discipline-by-discipline approach, with individual components being created in isolation to meet specified design criteria. After the individual, compartmentalized design phases, all of the components are brought together for integration testing to verify that the design criteria are met [

With the increasing prevalence and complexity of present-day CPSs, the traditional systems engineering approach is proving to be detrimental to the overall design process. Several research papers, such as [

Previously, the increasing complexity of systems and the corresponding increases in their computational complexity were matched by faster processor speeds that kept the simulation runtime within reasonable bounds. However, since the year 2005 processor clock speeds have largely leveled off (see

Modern CPUs have several layers of memory between the physical CPU registers and the main system memory, collectively called the cache (the cache is discussed in more detail in Section 3.1.2). The cache allows a CPU core to keep data that it is currently working on readily available in a layer of cache that provides very quick access, and allows data that is not needed to stay in a layer of cache farther away from the core [

The potential parallel processing power of modern multi-core CPUs, coupled with the potential performance pitfalls of the CPU cache, leads us to focus our research into parallel simulation algorithms that target a multi-core CPU and use the CPU cache as a performance asset instead of a liability.

The goal of this paper is to take steps toward facilitating the design process for cyber-physical systems by reducing the time it takes to simulate complex and large system models by developing parallel simulation algorithms for multi-core CPU architectures. A primary component of these algorithms is the incorporation of suitable memory management and the use of program constructs that take advantage of the CPU memory architecture. Our research contributions, therefore, focus on:

1) Developing classes of parallel simulation algorithms that appropriately use the multiple cores and the cache memory organization on a multi-core CPU, and

2) Running experimental studies that help us analyze the effectiveness of various multi-threading and memory management schemes for parallel simulations.

The contents of this paper are as follows. Section 2 describes our simulation methodology. Section 3 describes our shared memory parallel processing architecture. Section 4 presents our approach to parallel simulation of ordinary differential equations and the models that we use to evaluate our algorithms. Section 5 presents our parallel simulation algorithms and experiment results. Section 7 presents our conclusions regarding parallel simulation of ordinary differential equations.

This section provides definitions and derivations related to simulation (Section 2.1), describes our approach to simulation (Section 2.2), the numerical integration methods we use to perform our simulations (Section 2.3), and an overview of inline integration (Section 2.4).

The physical systems we will be working with are continuous dynamic systems, and are modeled using ordinary differential equations (ODE) or differential algebraic equations (DAE). ODE models are represented mathematically in the state equation form

where

equal to

All of the parameters have the same meaning as the previous equation, except they are in terms of the current simulation step n instead of continuous time. It is possible that the discrete time function,

One important point to note from Equation (1) is that the calculation of any value in

Solving the functions in

Definition 1 (Function Evaluation) A function evaluation is one evaluation of the vector function

There is one equation for each variable that is not a state variable, that is, one equation for each element in the sets

The goal of a simulation is to generate time trajectory data for all of the variables in the dynamic system model. The high-level process is shown in

variables are functions of the simulation step,

In modern simulation software the model compiler converts the declarative DAE model to explicit ODE form [

Definition 2 (Numerical Integration Method) The numerical integration method integrates the derivative of a variable at time t to determine the value of the corresponding state variable at time

Definition 3 (Step Size) The step size of a simulation is the distance between two consecutive time steps, and, depending on the solver, it may be kept fixed or dynamically adjusted during simulation. It is described by the variable h in Definition 2.

Definition 4 (Solver) The solver is a standalone piece of software that implements a specific numerical integration method, and, potentially, step size control, order control, and any other tasks necessary to accurately complete a simulation [

The solver uses a numerical integration method to integrate the derivatives to find the new values of the state variables at the next time step,

guarantee that the results are within the prescribed error tolerance. If they are not then the solver takes an appropriate action, usually halving the simulation step size, and tells the simulation code to re-evaluate the previous time step with the smaller step size. If the state variables are within the defined tolerance, the new state variables are passed to the computational model which then calculates the next values of the state variable derivatives. The solver can also take the step of making the simulation step size larger if a small step size is no longer needed. This process continues until the simulation stop time is reached.
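The accept/reject loop described above can be sketched in C++ as follows. This is a minimal illustration and not the solver used in this work: it estimates the error from an embedded Euler/Heun pair (a stand-in for a higher-order pair), and the function name, the result type, and the factor used to decide when to enlarge the step are assumptions made for the example.

```cpp
#include <cassert>
#include <cmath>
#include <functional>

// Result of an adaptive run (hypothetical helper type).
struct AdaptiveResult {
    double x;      // state at the stop time
    int accepted;  // number of accepted steps
    int rejected;  // number of rejected (recomputed) steps
};

// Integrate dx/dt = f(t, x) from t0 to t_end with step-size control.
// The error estimate is the difference between a 1st-order (Euler) and a
// 2nd-order (Heun) result that share their first stage.
inline AdaptiveResult integrate_adaptive(
    const std::function<double(double, double)>& f,
    double x0, double t0, double t_end, double h, double tol)
{
    assert(h > 0.0 && tol > 0.0 && t_end >= t0);
    AdaptiveResult r{x0, 0, 0};
    double t = t0;
    while (t < t_end) {
        if (t + h > t_end) h = t_end - t;  // do not step past the stop time
        const double k1 = f(t, r.x);
        const double k2 = f(t + h, r.x + h * k1);
        const double x_low = r.x + h * k1;                // Euler result
        const double x_high = r.x + 0.5 * h * (k1 + k2);  // Heun result
        const double err = std::fabs(x_high - x_low);
        if (err > tol) {  // reject: halve the step and redo this time step
            h *= 0.5;
            ++r.rejected;
            continue;
        }
        r.x = x_high;  // accept the higher-order value
        t += h;
        ++r.accepted;
        if (err < 0.1 * tol) h *= 2.0;  // small error: try a larger step (assumed rule)
    }
    return r;
}
```

For example, calling `integrate_adaptive` on dx/dt = -x with an initial step of 0.1 and a tolerance of 1e-4 rejects the first attempt and then settles on a smaller step.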

The two aspects of the simulation, shown in

The numerical integration methods that we use in this work are the Forward Euler (FE) and Runge-Kutta-Fehlberg 4-5 (RKF45) methods. The FE method is defined as [

The Runge-Kutta family of methods [

This method is really two separate RK methods that share their initial stages. One method is a 4^{th} order method that uses stages 1 through 5 and equation Final 1, and the second method is a 5^{th} order method that uses stages 1 through 6 and equation Final 2. This makes determining the error for the simulation trivial. The error for a time step is evaluated according to:

If Equation (5) is false then the simulation needs to choose a new step size and recalculate the time step. If Equation (5) is true then the time step is accepted by the solver and the 5^{th} order value of

Inline integration was first introduced in [

integration is used. By itself, inline integration will not necessarily yield a reduction in simulation time, but it will provide a means for parallelizing the integration method with the model, and open up other opportunities to reduce the simulation time.

Traditionally computing architectures have been described in terms of four qualitative categories according to Flynn’s taxonomy [

1) Single Instruction, Single Data (SISD): This architecture is a conventional sequential computer with a single processing element that has access to a single program data storage.

2) Multiple Instruction, Single Data (MISD): In this architecture there are multiple processing elements that have access to a single global data memory. Each processing element obtains the same data element from memory, and performs its own instruction on the data. This is a very restrictive architecture, and no commercial parallel computer of this type has been built [

3) Single Instruction, Multiple Data (SIMD): In this architecture, multiple processing elements execute the same instruction on their own data element. Applications with a high degree of parallelism, such as multimedia and computer graphics, can be efficiently computed using a SIMD architecture [

4) Multiple Instruction, Multiple Data (MIMD): In this architecture, there are multiple processing elements and each element accesses and performs operations on its own data. The data may be accessed from local memory on the processor, or from shared memory, i.e., memory shared by multiple processor units.

Multi-core CPUs, such as Intel’s Core line of processors [

At a conceptual level, the CPU can be divided into two parts: the computational processor, and the memory. Our primary focus will be on the processor memory, which is covered in Section 3.1.2, but we will first summarize the important aspects of the processor as it relates to the memory system.

Current high end chips have several processor cores [

Definition 5 (Thread) A single program execution stream that includes the program counter, the register state and the stack [

Definition 6 (Hardware Multithreading) Increasing utilization of a processor by switching to another thread when one thread is stalled due to causes such as waiting for data from memory, or a no operation instruction [

Definition 7 (Process) A higher level computation unit, above threads. A process includes one or more threads, the address space, and the operating system state. Changing between processes requires invoking the operating system [

CPU hardware that implements hardware multithreading is designed such that each hardware thread is presented to the Operating System (OS) as an individual CPU, which gives the OS double the number of cores to use for computational work. In our work, we treat these multithreaded processors as individual processors, just like the OS, so that a computer with 4 cores that supports 8 threads (2 threads per core) is treated as having 8 unique cores.
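As an illustration of this convention, the number of logical CPUs that the OS exposes can be queried from C++ as sketched below. The helper name and the fallback parameter are assumptions for the example; `std::thread::hardware_concurrency()` is allowed to return 0 when the value cannot be determined, so the sketch guards against that case.

```cpp
#include <cassert>
#include <thread>

// Number of logical CPUs reported by the runtime. On a processor with
// hardware multithreading this counts hardware threads, not physical cores.
// `fallback` (hypothetical parameter) is used when the value is unknown.
inline unsigned logical_cpu_count(unsigned fallback = 1) {
    assert(fallback > 0);
    const unsigned n = std::thread::hardware_concurrency();
    return n != 0 ? n : fallback;
}
```

On a 4-core processor with 2 hardware threads per core this would typically report 8.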

The cache memory hierarchy is the memory closest to each processor core. The cache closest to each core is called the L1 cache, and, in a 3 level hierarchy such as Intel's Core architecture [

In most cases, the cache is implemented as a hierarchy, where data cannot be in the L1 cache without it also being in L2 and L3 [

In order for processors to communicate or exchange data, the data needs to be in the layer(s) of cache that are shared across cores. Usually, as shown in

therefore the slowest layer of cache. As an example, since the threads can only directly access their L1 cache, to send data from Core 0 to Core 2, Core 0 will first have to write its data to its L1 cache. Then Core 2 will request the data through a load or get instruction. The memory controller on the chip will then move the data from the L1 cache on Core 0, to the L2 cache on that core, and then to the global shared L3 cache. After it is in the L3 cache, the data will be moved into the Core 2 L2 cache, and finally into the L1 cache on Core 2, where the data can finally be used.

The cache is organized into cache lines, or blocks.

Definition 8 (Cache Line) The minimum unit of information that can be present or not present in a cache [

Definition 9 (Cache Line Read) The process of pulling a cache line from a cache far away from a CPU core to a cache close to the CPU core.

In most modern desktop processors the cache line is 64 bytes. This means that moving 4 bytes of data (typically, the size of an integer in C/C++) to the L1 cache will move the entire 64 byte cache line that contains the requested 4 bytes of data. Multiple threads in different cores accessing data in the same cache line can lead to a problem known as cache line sharing.

Definition 10 (Cache Line Sharing) When two unrelated variables that are located in the same cache line are written to by threads on separate cores, the full line is exchanged between the two cores even though the cores are accessing different variables [

Cache line sharing can have a dramatic impact on performance, which can be demonstrated through a simple experiment (modified from [

This experiment was performed on a machine running Ubuntu 15.04 with an Intel i7-880 CPU clocked at 3.08 GHz and 8 GB of RAM. This CPU has 4 physical cores that each support 2 hardware threads, the same architecture as

in the computation time when using 1 through 4 threads. This is expected as the threads are working on independent cache lines and do not interfere with each other. The jump in computation time above 4 threads is likely due to hardware limitations of Intel’s hardware multithreading implementation (marketed as Hyperthreading by Intel [

As this example shows, sharing cache lines between threads can lead to significant performance problems, and should be avoided [

Definition 11 (Cache Aligned Data) Cache Aligned Data is data that has a memory address that is an integer multiple of the cache line size. That is, in most modern architectures, cache aligned data has memory address modulo 64 = 0.

In effect, a cache aligned data structure was used in the above example where each thread was assigned an array index in which to work that was on its own cache line. Programmers need to consider the processor cache, and avoid cache line sharing, when designing multithreaded applications to ensure program performance.
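An experiment of this kind can be sketched in C++ as follows. This is not the code used for the measurements above; the function name, the assumed 64-byte line size, and the use of a strided integer array are assumptions for the illustration. With a stride of 1, neighbouring threads write to elements in the same cache line; with a stride of one cache line, every thread writes to its own line.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <thread>
#include <vector>

constexpr std::size_t kCacheLineBytes = 64;  // assumed cache line size
constexpr std::size_t kIntsPerLine = kCacheLineBytes / sizeof(int);

// Each of `threads` workers increments its own array element `iters` times
// and the elapsed wall-clock time in seconds is returned.
// stride == 1 packs the elements into shared cache lines;
// stride == kIntsPerLine puts every element on a different cache line.
inline double time_increments(std::vector<int>& data, unsigned threads,
                              std::size_t stride, int iters) {
    assert(threads > 0 && data.size() > (threads - 1) * stride);
    const auto start = std::chrono::steady_clock::now();
    std::vector<std::thread> pool;
    for (unsigned i = 0; i < threads; ++i) {
        // volatile forces every load and store to be issued
        volatile int* slot = &data[i * stride];
        pool.emplace_back([slot, iters] {
            for (int k = 0; k < iters; ++k) *slot = *slot + 1;
        });
    }
    for (auto& t : pool) t.join();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

Comparing the returned times for the two strides at increasing thread counts gives a rough picture of the cost of cache line sharing on a particular machine; the magnitude of the effect depends on the hardware.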

This section presents our general approach to simulating our models (Section 4.1), a mathematical description of model partitioning (Section 4.2), a discussion on the overhead associated with running parallel simulations (Section 4.3), the models that we used to evaluate our algorithms (Sections 4.4 and 4.5), our experiment configuration (Section 4.6), and our base test case (Section 4.7).

The state equations can be integrated independently and in any order for each time step (Section 2.1). This gives us freedom, within a time step, to group the state equations in ways that achieve the best run time performance. At the end of a time step, the updated state variable values will need to be synchronized across the state equations of the system, to allow the different execution threads to acquire the updated state variable values before starting the calculations for the next time step.
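The end-of-step synchronization can be sketched with a small spinning barrier. The example below is an illustration under stated assumptions, not our implementation: the class `SpinBarrier`, the function `simulate_ring`, and the ring-coupled test system dx_i/dt = x_{i+1} - x_i are hypothetical, one thread is created per state variable for brevity, and cache line padding is omitted.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Minimal sense-reversing spin barrier (illustrative).
class SpinBarrier {
public:
    explicit SpinBarrier(unsigned n) : total_(n), waiting_(0), sense_(false) {
        assert(n > 0);
    }
    void arrive_and_wait() {
        const bool my_sense = !sense_.load();
        if (waiting_.fetch_add(1) + 1 == total_) {
            waiting_.store(0);      // reset the count for the next time step
            sense_.store(my_sense); // release the waiting threads
        } else {
            while (sense_.load() != my_sense) std::this_thread::yield();
        }
    }
private:
    const unsigned total_;
    std::atomic<unsigned> waiting_;
    std::atomic<bool> sense_;
};

// Forward Euler for dx_i/dt = x_{i+1} - x_i on a ring, one thread per state.
// Two buffers alternate as "current" and "next" values, so a single barrier
// per time step is enough to publish the updated states to all threads.
inline std::vector<double> simulate_ring(std::vector<double> x, double h, int steps) {
    assert(!x.empty() && steps >= 0);
    const unsigned n = static_cast<unsigned>(x.size());
    std::vector<double> buf[2] = {x, x};
    SpinBarrier barrier(n);
    std::vector<std::thread> pool;
    for (unsigned i = 0; i < n; ++i) {
        pool.emplace_back([&, i] {
            for (int s = 0; s < steps; ++s) {
                const std::vector<double>& cur = buf[s % 2];
                std::vector<double>& nxt = buf[(s + 1) % 2];
                nxt[i] = cur[i] + h * (cur[(i + 1) % n] - cur[i]);
                barrier.arrive_and_wait();  // all states updated before the next step
            }
        });
    }
    for (auto& t : pool) t.join();
    return buf[steps % 2];
}
```

Within a time step each thread reads only values from the previous step and writes only its own entry, which is what makes the order of the state equation evaluations irrelevant.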

We implemented two types of integrators for the parallel simulation algorithms: 1) a fixed-step Euler integrator, and 2) a variable-step Runge-Kutta integrator. We discuss our implementations for the two integration schemes next.

We used an Inline Forward Euler (IFE) integration method [

where h is the step size. Inlining the integration equation allows our simulation to calculate

We simplify the definition of this equation to be:

During simulation using the IFE method, for each time step the simulation calculates the values for

and then advances the simulation time:

where t is the simulation time and h is the time step. Then the simulation loops and repeats the process until the simulation reaches the simulation end time. In our fixed step simulations the step size of the simulation is passed as a parameter to the simulation at run-time. The simulation algorithm using IFE integration is described in Algorithm 1.
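The loop described above can be sketched for a small example. The following C++ code is an illustration and not generated simulation code: it assumes a series RLC circuit driven by a constant source V, with the inductor current i and the capacitor voltage v as state variables (L di/dt = V - R i - v, C dv/dt = i), and the names `RlcState` and `simulate_rlc_ife` are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// State of a series RLC circuit (hypothetical example model).
struct RlcState {
    double i;  // inductor current
    double v;  // capacitor voltage
};

// Inline Forward Euler: the integration formula x(n+1) = x(n) + h * f(x(n))
// is written directly into each state equation instead of calling a solver.
inline RlcState simulate_rlc_ife(double R, double L, double C, double V,
                                 double h, double t_end) {
    assert(h > 0.0 && L > 0.0 && C > 0.0 && t_end >= 0.0);
    RlcState s{0.0, 0.0};
    double t = 0.0;
    const long steps = std::lround(t_end / h);
    for (long n = 0; n < steps; ++n) {
        // Both updates read only values from the current step, so they can
        // be evaluated in any order, or in parallel.
        const double i_next = s.i + h * (V - R * s.i - s.v) / L;
        const double v_next = s.v + h * s.i / C;
        s = {i_next, v_next};
        t += h;  // advance the simulation time
    }
    return s;
}
```

With all parameters set to 1 and a step size of 0.001, the capacitor voltage settles at the source voltage and the current decays to zero.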

We used an explicit Runge-Kutta-Fehlberg 4,5 (RKF4,5) solver (described in Equation (4)) from the GNU Scientific Library (GSL) [

Partitioning any computational problem into independent components, such that each part can be executed on a separate processor, is the first step to solving the problem in parallel [

To facilitate the partitioning, it is important to note from Equation (1) that the calculation of any value in

allows us to divide the functions represented by

Therefore, we take the dynamic system model described in ODE form (Equation (1)), and divide the system into

Algorithm 1. Algorithm describing simulation using Inline Forward Euler integration.

Algorithm 2. Algorithm describing simulation using a variable step RKF4,5 solver.

approximately equal number of state variable calculations are assigned to each thread^{1}.

Differences in the cardinality of the sets account for remainders in integer division of

The value of

As discussed earlier, this partitioning approach allows us to complete the calculations within each time step in parallel, but it will require a synchronization phase at the end of each time step to guarantee correct results. The partitioning, and the memory structure that supports the chosen partition, will differ for each of our parallel simulation algorithms, and the specific differences are highlighted in Section 5, which describes the parallel simulation algorithms and the results of the experimental runs with those algorithms.
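One simple way to obtain partitions whose sizes differ by at most one is sketched below. It assumes contiguous index ranges, which is an assumption for the illustration; the assignment of specific equations to threads differs between our algorithms. The names `Range` and `partition_states` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Half-open index range [begin, end) of state variables owned by one thread.
struct Range {
    std::size_t begin;
    std::size_t end;
};

// Split n_states state variables into n_threads contiguous ranges whose
// sizes differ by at most one; the remainder of the integer division is
// spread over the first ranges.
inline std::vector<Range> partition_states(std::size_t n_states,
                                           std::size_t n_threads) {
    assert(n_threads > 0);
    std::vector<Range> parts;
    const std::size_t base = n_states / n_threads;
    const std::size_t extra = n_states % n_threads;
    std::size_t begin = 0;
    for (std::size_t t = 0; t < n_threads; ++t) {
        const std::size_t size = base + (t < extra ? 1 : 0);
        parts.push_back({begin, begin + size});
        begin += size;
    }
    return parts;
}
```

For 804 state variables and 8 threads, this gives four ranges of 101 variables followed by four ranges of 100 variables.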

In any parallel program the parallelization adds overhead that is not present in a serial execution of a program. The number of computations per time step of the simulation is the same for sequential and parallel algorithms. However, the overhead generated by parallelization can limit the effectiveness of the parallel implementation. This overhead typically takes the form of scheduling overhead and communication overhead. The created threads need to be assigned, by the operating system, to a CPU core during the thread runtime. It is the responsibility of the OS to ensure that all threads are given enough time on a CPU core to complete their work. There is also overhead due to communication between the created threads. This communication overhead does not occur in a single threaded program, but it is necessary for synchronization in a parallel program. The run time of a parallel program can be described at a high level as:

The challenge in designing parallel software is to minimize the overhead component of Equation (17). This will generally involve two components: 1) the overhead required to manage shared memory between the different execution threads, and 2) the amount of swapping that needs to be performed when there are more threads than available processor cores. These two components are not independent of each other, and we describe a number of algorithms that trade off these two parameters.

We use the Modelica modeling language [

The RLC circuit models were implemented to exploit Modelica’s hierarchical nature, which allows for efficient construction of models from individual components. The base component is shown in

resistor is a modified version of the MSL resistor model. The thermal components were removed from the resistor model because they contain Modelica if statements, which can lead to hybrid behavior. The terminals associated with the components are also from the MSL and were not altered. This set of base components was used to create a larger component with six connected base components as shown in

Algebraic loops are very common in modern complex models [

To create models with algebraic loops, we modified the models in

We used two sets of parameter values in our experiments. The first set of values set all parameters equal to 1, and the resulting electrical circuits have slow time constants. A parameter value of 1 for all electrical components of a circuit is not realistic for most applications, but it was useful for an initial evaluation of a parallel algorithm. We also used more realistic parameter values of 15 Ω for all resistors, 15 mH for all inductors, and 250 mF for all capacitors. The circuits using these parameters had much faster time constants.

We will measure the simulation run time using the wall-clock time of the simulation.

Definition 12 (Wall-Clock Time) The wall-clock time of a simulation is the time it takes from an external user’s perspective for the overall simulation to be completed, i.e. the time for a simulation to complete as measured by a clock on a wall.
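A minimal C++ sketch of such a measurement is given below. The helper names are hypothetical, and the speedup helper assumes that the relative speedup is the serial wall-clock time divided by the parallel wall-clock time, so that values greater than 1 indicate that the parallel run is faster.

```cpp
#include <cassert>
#include <chrono>

// Wall-clock time, in seconds, taken by one call of `run`.
// steady_clock is used because it is not affected by system clock changes.
template <typename F>
double wall_clock_seconds(F&& run) {
    const auto start = std::chrono::steady_clock::now();
    run();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}

// Relative speedup of a parallel run over a serial run (assumed definition:
// serial time divided by parallel time).
inline double relative_speedup(double t_serial, double t_parallel) {
    assert(t_serial > 0.0 && t_parallel > 0.0);
    return t_serial / t_parallel;
}
```

Timing the complete simulation call, rather than summing per-thread busy times, matches the external user's perspective in Definition 12.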

We use the wall-clock time so that we can study the reduction in simulation time using parallel algorithms from the user’s point of view. In a parallel program it is also possible to measure time as the aggregate of the busy time of each CPU, but this is not useful from an end user perspective. It is a better measure of the performance from a hardware use perspective. We measure the effectiveness of a parallel simulation as the relative speedup of the wall-clock time for a parallel simulation compared to a serial simulation. The equation to calculate the relative speedup is:

A value of greater than 1 implies that the parallel simulation provides a speedup, and a value of less than 1 indicates that the parallel code runs slower than the serial code. The simulation times,

The simulation programming language used for this study is C/C++, compiled and run on Linux. C/C++ simplifies the programming task and generates very efficient execution code. Further, C/C++ also has low-level memory management functions that allow cache-aligned data structures to be created. We target Linux as a simulation environment because it provides more control over the created threads, and generally faster execution than Windows.

Unless otherwise noted, all experiments were run on an Intel i7-880 desktop PC clocked to 3.08 GHz with 8 GB of memory running Ubuntu 15.04. The generated C++ code was compiled using g++ version 4.9.2 [

We focus on comparing our parallel simulation algorithms to the fastest and most basic serial simulation algorithm we were able to create. Comparing a parallel implementation to a fast serial implementation of the same problem was advocated by [

A parallel simulation algorithm that appropriately uses multiple cores and the CPU cache has conflicting goals, even though using multiple cores effectively and managing the cache both focus on the physical hardware of the CPU. An ideal program architecture from the standpoint of a multi-core CPU will involve a limited number of parallel computation threads, with very little sharing of data between the threads. An ideal program architecture from the standpoint of the CPU cache might be to divide the program into many very small pieces so that the data being worked on by each thread fits entirely into one cache line. These two ideal architectures are often in conflict with each other: one prefers large partitions and the other small partitions. Finding the right balance between the two is the key aspect of our research.

Partitioning the system of equations so that the cache can be used effectively to minimize the communication between the execution threads was a key factor that drove the design of the simulation algorithms presented in this section. We focus on 1) minimizing the communication between execution threads, and 2) utilizing the fastest communication methods for exchanges that have to take place. We leverage the shared-memory architecture in modern multi-core CPUs so that communication between threads can be handled in hardware as a part of the processor’s cache (parallel architectures and processor cache are covered in Section 3.1). However, we design our algorithms so that they share only a minimum amount of data, as sharing data through the cache across threads also causes computation delays, as we demonstrated in Section 3.1.2.

This section presents the set of parallel simulation algorithms we developed in a progression. The different simulation algorithms varied on how many threads were created, and how the variables in the sets

This section describes definitions related to the simulation threads (Section 5.1), a brief description of the initial algorithms we developed that did not produce good results (Section 5.2), and a detailed description of the algorithms that did produce results (Sections 5.3, 5.4 and 5.5).

The set of threads used in each experiment is:

where

where m is the number of CPUs available to the operating system (on a processor that implements hardware multithreading, the number of CPUs available to the OS is going to be double the number of cores on the processor, see Section 3.1). As an extreme we also developed an algorithm where

to identify the threads that were created by the main thread,

where M represents a block of memory, and X is the size of that memory. We will also use the symbols →, ←, and ↔ to describe if a thread writes to a memory block, reads from a memory block, or reads and writes to a memory block, respectively. Each of the threads in

Another factor that drove the implementation of our algorithms was to enable fast communication and simple synchronization between hardware threads, i.e.

variables. Simple spin locks are used at synchronization points to pause threads [

Our initial attempts to produce a speedup did not produce good results. This section summarizes our initial algorithms, and the lessons learned from them that were applied to our later algorithms.

Our first parallel algorithm creates a separate thread for each individual function in

Even for moderately sized models, this results in many more threads than cores available on a typical multi-core CPU (at the time of this writing, typically there are 4 to 8 cores available [

where M represents the block of memory,

We also tested a second version of this algorithm that set the thread affinity for each of the created threads, such that the threads were evenly distributed between the processor cores. The expectation with setting the thread affinity is that it would offload the work of dynamically scheduling the threads from the OS and fix the schedule, thus reducing the simulation time.

Definition 13 (Thread Affinity) The thread affinity identifies on which processor a thread is allowed to run.
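On Linux, the affinity of a thread can be set through the GNU extension `pthread_setaffinity_np`. The sketch below is an illustration only: the two helper names are hypothetical, and the code assumes a glibc-based Linux system where g++ makes the `_np` functions and the `CPU_*` macros available.

```cpp
#include <pthread.h>
#include <sched.h>
#include <cassert>

// Restrict the calling thread to a single logical CPU.
// Returns true when the affinity mask was applied.
inline bool pin_current_thread(int cpu) {
    assert(cpu >= 0 && cpu < CPU_SETSIZE);
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set) == 0;
}

// First logical CPU the calling thread is currently allowed to run on,
// or -1 when the affinity mask cannot be read.
inline int first_allowed_cpu() {
    cpu_set_t set;
    CPU_ZERO(&set);
    if (pthread_getaffinity_np(pthread_self(), sizeof(set), &set) != 0) return -1;
    for (int c = 0; c < CPU_SETSIZE; ++c) {
        if (CPU_ISSET(c, &set)) return c;
    }
    return -1;
}
```

A worker thread would call `pin_current_thread` once at start-up with the CPU index assigned to it, after which the OS scheduler keeps that thread on the chosen CPU.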

Our second parallel algorithm agglomerated our state variable calculations so that we could create fewer than

Algorithm Full Shared Memory Simple Agglomeration uses a simple agglomeration scheme where the equations in

Algorithm Full Shared Memory Smart Agglomeration uses a smart agglomeration scheme that groups the ODE equations such that the equations that have a large number of dependencies (those that use a large number of state variables to calculate a particular value of

This parallel simulation algorithm, Partial Partitioned Memory, is identical to the simple agglomeration approach presented above, except that it partitions

where each memory block

that contains the values of

None of our initial algorithms produced a speedup compared to the serial case. Our first algorithm produced speedups on the order of 10^{−3} in the best case, which means that the parallel algorithm was orders of magnitude slower than the serial algorithm. Our second algorithm produced speedups of 0.92 in the best case, which means that our parallel algorithm was almost as fast as the serial algorithm. Our third algorithm produced speedups of 0.49 in the best case, which means that our parallel algorithm was half as fast as a serial algorithm. We derived a number of lessons from these first parallel algorithms that we applied to our later algorithms. These lessons include:

1) Do not create more threads than there are processors,

2) Manage thread workload so that the computational work assigned to each thread is greater than the overhead associated with creating the thread,

3) Avoid cache line sharing between computational threads,

4) Evenly divide computational work between all threads, and do not reserve one thread for controlling the simulation and all other threads for computation, and

5) Reduce the communication and dependencies between threads.

The algorithms that did produce a speedup are discussed in detail in the following sections.

The fixed-step simulations that gave us the best performance used fully distributed memory. This means that each thread has its own cache aligned block of memory that it writes to which includes both

These two algorithms expand the roles of the threads in

program flow describing these two partitioned memory algorithms is described in

simulation time steps, on lines 3 and 4 of Algorithm 1, are performed in parallel. The threads in

The first full distributed memory approach, Full Partitioned Memory Minimum Sharing, created memory blocks:

Each memory block is created on its own cache line to avoid cache line sharing. Also, it is assigned a subset of variables from

needed in more than one thread. To calculate its values of

Aligning the data structures to a cache line boundary is accomplished through the alignas C++ keyword introduced in the C++11 standard [

a cache line. If each thread only writes to one memory block, and all memory blocks are aligned to separate cache lines, then there will be no cache line sharing between threads.
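The effect of the alignas specifier can be checked with a short sketch. The struct layout below is hypothetical (four state variables and their derivatives per thread) and the 64-byte line size is assumed; it is meant only to show how a per-thread block can be forced onto its own cache line and how the alignment can be verified.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kCacheLine = 64;  // assumed cache line size in bytes

// Hypothetical per-thread memory block. alignas makes every object of this
// type start on a cache line boundary and pads its size to a multiple of
// the line size, so two blocks never share a line.
struct alignas(kCacheLine) ThreadBlock {
    double x[4];   // state variables written by the owning thread
    double dx[4];  // derivatives written by the owning thread
};

static_assert(alignof(ThreadBlock) == kCacheLine, "block starts on a line boundary");
static_assert(sizeof(ThreadBlock) % kCacheLine == 0, "block fills whole lines");

// True when the address is a multiple of the cache line size.
inline bool is_cache_aligned(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % kCacheLine == 0;
}
```

An array of `ThreadBlock` objects, indexed by thread, then gives every thread a block that no other thread writes to.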

The second full distributed memory approach, Full Partitioned Memory Simple Agglomeration, has the same program flow as Full Partitioned Memory Minimum Sharing, shown in

The memory approach used in Simple Agglomeration is a little different from Minimum Sharing. Each thread in

The relationship between the threads and the different memory blocks is shown in

The relative speedups compared to the serial case are shown in ^{−2}.

From

We note that Minimum Sharing and Simple Agglomeration have very different wall-clock times when they are run using only one thread. This is an unexpected result because the memory differences between the two partitioning approaches should not come into play when only using one thread. The likely reason for the difference in single threaded simulation time is differences in implementation. Version 1 uses C-style structs for sharing data, and version 2 uses C-style arrays. Indexing into an array requires pointer arithmetic, which takes extra time that is not incurred when using structs.

We also note that both of the Full Partitioned Approaches perform better on the complex RLC model with

Total Threads | Slow Time Constants: Min. Sharing | Slow Time Constants: Simple Agglom. | Fast Time Constants: Min. Sharing | Fast Time Constants: Simple Agglom. |
---|---|---|---|---|
1 | 1.01 | 0.46 | 1.01 | 0.43 |
2 | 0.50 | 0.47 | 0.62 | 0.50 |
3 | 0.52 | 0.50 | 0.66 | 0.55 |
4 | 0.52 | 0.56 | 0.65 | 0.61 |
5 | 0.45 | 0.49 | 0.58 | 0.53 |
6 | 0.45 | 0.51 | 0.59 | 0.56 |
7 | 0.43 | 0.50 | 0.55 | 0.56 |
8 | 0.39 | 0.46 | 0.53 | 0.51 |

Total Threads | Slow Time Constants: Min. Sharing | Slow Time Constants: Simple Agglom. | Fast Time Constants: Min. Sharing | Fast Time Constants: Simple Agglom. |
---|---|---|---|---|
1 | 1.03 | 0.52 | 1.00 | 0.59 |
2 | 0.80 | 0.65 | 0.92 | 0.78 |
3 | 1.02 | 0.81 | 1.21 | 0.97 |
4 | 1.17 | 1.02 | 1.44 | 1.26 |
5 | 1.02 | 0.86 | 1.24 | 1.05 |
6 | 1.12 | 0.98 | 1.34 | 1.19 |
7 | 1.11 | 1.04 | 1.37 | 1.27 |
8 | 1.10 | 1.05 | 1.39 | 1.29 |

804 state variables using fast time constants than slow time constants. A possible explanation is due to the equations in

extra complexity, essentially just the presence of parameter values scaling the state variable values, means that the processor has more computational work to do for each equation. The ratio of cache line reads to computational work therefore goes down, and the cache line reads have less opportunity to dominate the run time. This explanation is tentative. However, because these methods use fixed step integration, the change in model behavior has no effect on the integration (the step size does not change in response to system dynamics), so the only real difference between the fast and slow parameter values is the presence of the parameter value terms in the integration equations.

In

We also see in

We performed a mean squared error (MSE) analysis on our simulation results for both Minimum Sharing and Simple Agglomeration. The MSE calculations compared the time trajectory data of a single threaded fixed step simulation with that of a parallel simulation using 8 threads. In all cases where the simulation step sizes were small enough to produce a stable simulation, the error was on the order of 10^{−16} or smaller, so we did not include the MSE results here.
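The comparison described above can be sketched as follows. This is a minimal illustration, assuming both trajectories are sampled at the same time points; the function name is hypothetical, and how the per-variable errors were aggregated in the actual experiments is not specified here.

```python
def mean_squared_error(reference, candidate):
    """Mean squared error between two equally sampled trajectories.

    Assumption: both lists hold one state variable sampled at the same
    time points (e.g. serial run vs. 8-thread run).
    """
    if len(reference) != len(candidate):
        raise ValueError("trajectories must have the same number of samples")
    squared = [(r - c) ** 2 for r, c in zip(reference, candidate)]
    return sum(squared) / len(squared)


# Toy data: the two runs differ only in the last sample.
serial_run = [1.0, 2.0, 3.0]
parallel_run = [1.0, 2.0, 5.0]
mse = mean_squared_error(serial_run, parallel_run)
```

An MSE near machine precision, as reported above, indicates that the two runs performed essentially the same arithmetic.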

A variable step simulation is likely to provide better simulation performance than a fixed step simulation. This algorithm tests a parallel version of a variable step solver, using a full partitioned memory approach, to determine whether further speedups can be found.

We use a Runge-Kutta-Fehlberg4,5 solver from the GNU Scientific Library [

the solver the user provides functions to calculate the system state variable derivatives,

Jacobian matrix. The user also provides the solver with an output interval detailing how frequently the user wants to receive updates on the state variable derivatives. Controlling the simulation time step is left up to the solver (see Section 4.1.2).
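The division of responsibilities between user and solver can be illustrated with a small sketch. It is not the GSL interface: the function name and signature are hypothetical, a simple Heun-Euler embedded pair stands in for RKF4,5, and no Jacobian is used. The user supplies only a derivative function and an output interval, while the solver chooses its own internal step size from an error estimate.

```python
import math


def integrate_adaptive(deriv, y0, t_end, output_interval, tol=1e-6):
    """Adaptive scalar integrator (Heun-Euler pair, a stand-in for RKF4,5).

    Returns (t, y) pairs at multiples of output_interval; the internal
    step size h is controlled by the solver, not by the caller.
    """
    t, y, h = 0.0, y0, output_interval
    outputs = [(t, y)]
    next_output = output_interval
    while t < t_end - 1e-12:
        h = min(h, next_output - t)        # never step past an output point
        k1 = deriv(t, y)
        k2 = deriv(t + h, y + h * k1)
        y_low = y + h * k1                 # first-order (Euler) estimate
        y_high = y + 0.5 * h * (k1 + k2)   # second-order (Heun) estimate
        err = abs(y_high - y_low)          # local error estimate
        if err <= tol:                     # accept the step
            t, y = t + h, y_high
            if t >= next_output - 1e-12:
                outputs.append((t, y))
                next_output += output_interval
        # Grow or shrink the step based on the error estimate.
        scale = 0.9 * math.sqrt(tol / err) if err > 0.0 else 2.0
        h *= min(2.0, max(0.2, scale))
    return outputs


# dy/dt = -y, y(0) = 1, reported every 0.1 s up to t = 1 s.
outputs = integrate_adaptive(lambda t, y: -y, 1.0, 1.0, 0.1)
```

The caller never sees the internal steps, only the values at the requested output interval.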

To create a partitioned variable step solver we divide the model into

where the

The

Each independent RKF4,5 solver is responsible for integrating its portion of state variable derivative values:

Due to the fact that the solvers are independent, the solvers will be forced to integrate their set of

Since this system is so simple, the Jacobian matrix of this system matches the system matrix above:

However, if this system is divided into two independent systems, as we do for the partitioned RKF4,5 parallel approach, the equations for the first system become:

and the equations for the second system become:

where the

the Jacobian matrix in Equation (33) is missing the partial derivatives with respect to

Taken together, the two factors of out-of-date data and an incomplete Jacobian matrix can lead to significant problems for this simulation approach. The only option we have to control this potential problem is to reduce the synchronization time interval, because a shorter synchronization interval, like a smaller step size, helps to reduce errors in the simulation. Producing accurate simulation data requires a balance: the synchronization interval must be small enough that the errors in the approximations are small, but not so small that there is no performance benefit.
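The effect of the synchronization interval can be seen in a small experiment. The sketch below makes several assumptions: the 2x2 linear system, its coefficients, and all function names are illustrative, classical fixed step RK4 replaces RKF4,5, and the partitions are run one after another rather than on separate threads. Each partition integrates only its own state variable and sees the other variable frozen at its last synchronized value, so its derivative function carries no information about the off-diagonal Jacobian terms.

```python
def rk4_step(f, x, h):
    """One classical RK4 step for an autonomous system x' = f(x)."""
    def shifted(u, a, v):
        return [ui + a * vi for ui, vi in zip(u, v)]
    k1 = f(x)
    k2 = f(shifted(x, h / 2, k1))
    k3 = f(shifted(x, h / 2, k2))
    k4 = f(shifted(x, h, k3))
    return [xi + h / 6 * (a + 2 * b + 2 * c + d)
            for xi, a, b, c, d in zip(x, k1, k2, k3, k4)]


def simulate_coupled(A, x0, t_end, h):
    """Reference: integrate the full system x' = A x as one unit."""
    f = lambda x: [sum(a * xj for a, xj in zip(row, x)) for row in A]
    x = list(x0)
    for _ in range(round(t_end / h)):
        x = rk4_step(f, x, h)
    return x


def simulate_partitioned(A, x0, t_end, h, sync_interval):
    """One partition per state variable; shared data refreshed at sync points."""
    n = len(x0)
    shared = list(x0)
    steps_per_sync = round(sync_interval / h)
    for _ in range(round(t_end / sync_interval)):
        snapshot = list(shared)            # data every partition starts from
        updated = []
        for i in range(n):                 # each pass stands in for one thread
            frozen = sum(A[i][j] * snapshot[j] for j in range(n) if j != i)
            f = lambda xi, i=i, frozen=frozen: [A[i][i] * xi[0] + frozen]
            xi = [snapshot[i]]
            for _ in range(steps_per_sync):
                xi = rk4_step(f, xi, h)
            updated.append(xi[0])
        shared = updated                   # synchronization point
    return shared


A = [[-1.0, 0.5], [0.5, -2.0]]             # illustrative coefficients
x0 = [1.0, 0.0]
reference = simulate_coupled(A, x0, t_end=1.0, h=0.001)


def max_error(sync_interval):
    result = simulate_partitioned(A, x0, 1.0, 0.001, sync_interval)
    return max(abs(r - p) for r, p in zip(reference, result))


error_coarse = max_error(0.1)
error_fine = max_error(0.01)
```

In this toy system, shortening the synchronization interval from 0.1 s to 0.01 s reduces the deviation from the coupled reference, at the cost of ten times as many synchronization points.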

The memory structure of the simulation is shown in

threads read and write to their local memory blocks at every function evaluation. The threads only update the shared memory block at the synchronization points, but can read from it at any time.

The program flow for the partitioned RKF4,5 simulation is somewhat more complex than the previous program flows and is shown in

This approach generally performed very well. The relative speedups for the models in

Total Threads | RLC 288 State Variables, Slow TC | RLC 288 State Variables, Fast TC | RLC 804 State Variables, Slow TC | RLC 804 State Variables, Fast TC
---|---|---|---|---
1 | 1.00 | 1.00 | 1.00 | 1.00
2 | 1.75 | 1.76 | 1.81 | 1.93
3 | 2.17 | 2.30 | 2.43 | 2.72
4 | 2.81 | 2.84 | 3.14 | 3.56
5 | 2.41 | 2.17 | 2.46 | 2.12
6 | 2.68 | 2.41 | 2.87 | 2.01
7 | 2.60 | 2.64 | 1.44 | 0.32
8 | 2.53 | 2.81 | 1.51 | 0.33

were calculated by comparing the simulation times to a single threaded, serial, variable step simulation.
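For reference, a minimal sketch of the relative speedup calculation, assuming the usual definition (serial wall clock time divided by parallel wall clock time). The function name and the timing values are illustrative only.

```python
def relative_speedup(serial_seconds, parallel_seconds):
    """Serial time divided by parallel time.

    Values above 1 mean the parallel run was faster than the serial run;
    values below 1 mean it was slower.
    """
    if parallel_seconds <= 0.0:
        raise ValueError("parallel time must be positive")
    return serial_seconds / parallel_seconds


# Hypothetical timings: a 10 s serial run finished in 2.5 s in parallel.
example_speedup = relative_speedup(10.0, 2.5)
```

Under this definition the single threaded rows of the tables equal 1.00 by construction when the serial run is itself the baseline.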

These results show that for the complex RLC model with 288 state variables the partitioned RKF4,5 method was able to match the serial simulation or provide a speedup when using 2 through 8 threads. The models with fast and slow time constants had similar performance, with the best speedup of approximately 2.8 coming when 4 threads were used.

For the complex RLC model with 804 state variables and slow time constants, the partitioned RKF4,5 method provided a speedup when using 2 through 8 threads. For the model with fast time constants, the algorithm provided a speedup when using 2 through 6 threads, but with 7 and 8 threads it ran slower than the serial simulation (relative speedups of 0.32 and 0.33).

We performed a Mean Squared Error (MSE) analysis for the RLC models, by comparing the simulation time trajectory data between a serial fixed step simulation and a parallel variable step simulation. When we calculated the MSE for a simulation using our chosen synchronization interval, the mean square error was very small, with the largest error on the order of 10^{−5} and most errors much smaller than that. Since the errors were so small, they are not included in this paper. Also, the small error values indicate that the process of parallelizing a simulation does not negatively affect the accuracy of the simulation. This shows that our concerns about breaking a model into independent partitions were not justified for this set of models.

Algebraic loops are very common in modern complex models and are typically solved using the Newton iteration method [

the algorithm. It is reported that nonlinear loops have a computational complexity of

approach to parallelize the simulation of a model that contains algebraic loops will focus on solving the algebraic loops in parallel, and then perform the numerical integration serially. We use KINSOL from the SUNDIALS solver library [

The models have a number of properties that allow us to simplify the simulation algorithms. The first property is that the variables calculated by the loops are from the set

all of the loops are independent of each other, and therefore can be solved in any order for each time step. In addition, each thread is assigned approximately the same number of loops to solve for each time step. Once all the loops are solved, the state variable derivatives are calculated and then integrated to end the simulation time step. The program flow is shown in

The memory management scheme employed is shown in

When developing this approach for parallelizing the solution of algebraic loops, we elected not to parallelize the variable step solver, because doing so would likely add much more computational work to the simulation and end up slowing it down instead of making it faster. The potential slowdown arises because solving the algebraic loops is embedded in the function

The relative speedups and standard deviations of the models with algebraic loops are shown in

Total Threads | Slow Time Constants: Fixed Step | Slow Time Constants: Variable Step | Fast Time Constants: Fixed Step | Fast Time Constants: Variable Step
---|---|---|---|---
1 | 1.00 | 1.00 | 1.00 | 1.00
2 | 1.59 | 1.81 | 1.85 | 1.76
3 | 2.06 | 2.33 | 2.50 | 2.36
4 | 2.32 | 3.06 | 3.32 | 3.20
5 | 1.99 | 2.63 | 2.69 | 2.49
6 | 1.99 | 2.43 | 2.95 | 2.62
7 | 1.70 | 2.46 | 3.32 | 3.04
8 | 1.67 | 2.40 | 3.31 | 2.97

Total Threads | Slow Time Constants: Fixed Step | Slow Time Constants: Variable Step | Fast Time Constants: Fixed Step | Fast Time Constants: Variable Step
---|---|---|---|---
1 | 1.00 | 1.00 | 1.00 | 1.00
2 | 1.46 | 1.72 | 1.92 | 1.88
3 | 1.95 | 2.42 | 2.52 | 2.11
4 | 2.38 | 3.14 | 3.33 | 2.88
5 | 1.83 | 2.56 | 2.72 | 2.42
6 | 2.14 | 2.93 | 2.94 | 2.53
7 | 2.10 | 2.91 | 3.46 | 2.67
8 | 2.11 | 3.11 | 3.49 | 2.43

formance, and every test, except for the single threaded experiments which served as a control, showed a speedup.

The variable step simulation performed slightly better than the fixed step simulation on the models with slow time constants. On these models the variable step simulation achieved a maximum speedup of slightly more than 3 when using 4 threads, while the fixed step simulation only achieved a speedup of about 2.3 when using 4 threads.

The fixed step and variable step simulations had nearly identical performance on the complex RLC model with 288 state variables and fast time constants: when using 4 threads, the fixed step simulation achieved a maximum speedup of 3.3 and the variable step simulation achieved a maximum speedup of 3.2. The results are likely similar because the processing of the algebraic loops dominates the simulation run time, so the choice between a fixed step and a variable step solver makes little difference.

The complex RLC model with 804 state variables and fast time constants provided interesting results because the fixed step simulation performed better than the variable step simulation. The fixed step simulation achieved a speedup of 3.5 when using 8 threads (it achieved a speedup of 3.33 when using 4 threads), while the variable step simulation achieved a maximum speedup of only 2.88 when using 4 threads. This is likely due to, again, the processing of the algebraic loops dominating the computation time, which allows the computational efficiency of the fixed step simulation to provide a speedup over the variable step simulation.

This section summarizes our parallel algorithms and experimental results, and discusses the conclusions we are able to draw from them. As a review,

^{−3}, which means our parallel implementation was significantly slower than the serial case. In this algorithm we were creating more threads than the CPU and the operating system were able to efficiently handle and most of the processing time of the simulation was spent on the overhead of switching between the threads instead of on advancing the simulation.

In Algorithm 2 we reduced the number of threads so that the maximum number of threads used for the simulation was equal to the number of CPU cores available on our processor. This method did not provide a speedup, but it was able to match the serial simulation time. The lack of speedup for Algorithm 2 was caused by

Algorithm | Memory Structure | Role of | Role of | Agglomeration
---|---|---|---|---
Type 1 | Full Shared | Calculate | Merge | None
Type 2 | Full Shared | Calculate | Merge | Simple and Smart
Type 3 | Partial Partitioned | Calculate | Merge | Simple
Type 4 | Full Partitioned | Calculate | Calculate | Simple and Minimum Sharing
Type 5 | Full Partitioned | Calculate | Calculate | Simple
Type 6 | Full Partitioned | Solve algebraic loops | Solve algebraic loops and integrate | Simple

Algorithm | Best Rel. Speedup | Threads | Model
---|---|---|---
Type 1 | | 8 | RLC 288 State Variables, Slow Time Constants
Type 2 | 0.61 | 8 | RLC 804 State Variables, Fast Time Constants
Type 3 | 0.41 | 4 | RLC 804 State Variables, Fast Time Constants
Type 4 | 1.44 | 4 | RLC 804 State Variables, Fast Time Constants
Type 5 | 3.56 | 4 | RLC 804 State Variables, Fast Time Constants
Type 6 | 3.49 | 8 | RLC 804 State Variables, Fast Time Constants

not considering the CPU cache in our experiments, and the different threads had to wait for an update to their cache lines instead of completing the simulation.

In Algorithm 3 we attempted to avoid the problems of cache line sharing by creating a memory structure that would prevent cache conflicts between the threads. Unfortunately this algorithm performed worse than Algorithm 2 due to our memory structure significantly increasing the amount of work

In Algorithm 4 and Algorithm 5 we further enhanced our memory structure so that the problems seen in Algorithm 3 were solved, and we reduced the workload of

Algorithm 6 parallelized only the algebraic loops; it did not parallelize the simulation as a whole, as the previous algorithms did. However, in developing Algorithm 6 we applied the lessons learned in Algorithms 1-5. These lessons include limiting the number of threads created, avoiding cache line sharing, and evenly distributing the processing across all threads. Algorithm 6 also provided a good speedup compared to a serial simulation; the largest speedup was 3.49 on the complex RLC model with 804 state variables and fast time constants. Further conclusions are discussed in the next section.
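Two of these lessons, a bounded thread count and an even split of independent loops, can be sketched as follows. The sketch makes several assumptions: each loop is reduced to a scalar equation solved by a hand-written Newton iteration (KINSOL is not used), all helper names are hypothetical, and Python threads are used only to show the structure, since the interpreter lock prevents them from giving a real speedup on this kind of work.

```python
from concurrent.futures import ThreadPoolExecutor


def newton_solve(residual, d_residual, guess, tol=1e-12, max_iter=50):
    """Scalar Newton iteration: repeat x <- x - r(x) / r'(x)."""
    x = guess
    for _ in range(max_iter):
        step = residual(x) / d_residual(x)
        x -= step
        if abs(step) < tol:
            return x
    raise RuntimeError("Newton iteration did not converge")


def split_evenly(items, n_parts):
    """Round-robin split; part sizes differ by at most one."""
    return [items[i::n_parts] for i in range(n_parts)]


def solve_loops_parallel(loops, n_threads):
    """Solve independent loops, one chunk of loops per worker thread."""
    chunks = split_evenly(list(enumerate(loops)), n_threads)
    results = [None] * len(loops)

    def solve_chunk(chunk):
        return [(idx, newton_solve(*loop)) for idx, loop in chunk]

    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        for solved in pool.map(solve_chunk, chunks):
            for idx, value in solved:
                results[idx] = value       # each index is written exactly once
    return results


# Five independent toy loops: find x such that x**2 - c = 0.
constants = (2.0, 3.0, 5.0, 7.0, 11.0)
loops = [(lambda x, c=c: x * x - c, lambda x: 2.0 * x, 1.0) for c in constants]
roots = solve_loops_parallel(loops, n_threads=4)
```

Because the loops are independent, the order in which the chunks finish does not matter; the results only have to be complete before the derivatives are evaluated and integrated.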

This research provided a number of interesting and practical conclusions about parallel simulation. These will be detailed in the following sub-sections.

The first conclusion that we are able to draw from these results is that there is a model size threshold below which it is difficult to benefit from parallelization. We are parallelizing within a time step, so the time taken per time step is the real barrier that we are trying to beat with parallelization. Our RLC model with 288 state variables had a time per time step of about 500 ns on our hardware when using the parameters that created slow time constants, and about 700 ns when using the parameters that created fast time constants. On this model we only saw a speedup when using variable step integration; the best relative speedup for the fixed step integration was 0.56 with the slow parameters and 0.61 with the fast parameters. For the large RLC model the time per time step for the serial simulation is about 2 µs for the slow parameters and about 3 µs for the fast parameters, and the full partitioned memory method was able to produce a small speedup of about 1.2 for the slow parameters and about 1.4 for the fast parameters. A better measurement is CPU clock cycles. On the CPU we used for our experiments, clocked at 3 GHz, 700 ns equates to approximately 2100 clock cycles, while 3 µs is approximately 9000 clock cycles. Since the large model produced a speedup and the small model did not, we estimate that the minimum model size above which parallelization is practical is just under 800 state variables, or about 9000 clock cycles for one time step; that is, the serial time of a time step needs to be on the order of microseconds.
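The cycle counts follow directly from the clock rate. A quick check, assuming the 3 GHz clock stated above (the helper names are illustrative):

```python
CLOCK_HZ = 3.0e9  # 3 GHz clock, as stated for the test CPU


def cycles_for(seconds, clock_hz=CLOCK_HZ):
    """Number of clock cycles that elapse in the given wall clock time."""
    return seconds * clock_hz


def seconds_for(cycles, clock_hz=CLOCK_HZ):
    """Wall clock time taken by the given number of clock cycles."""
    return cycles / clock_hz


small_model_cycles = round(cycles_for(700e-9))   # 700 ns per time step
threshold_seconds = seconds_for(9000)            # 9000 cycles per time step
```

At this clock rate, 9000 cycles corresponds to 3 microseconds, which is consistent with the requirement that a serial time step last on the order of microseconds.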

In Section 3.1.2 we presented an experiment that shows that cache line sharing can have a significant impact on a program’s run time (

Another conclusion that we can draw from Section 3.1.2 and from our experiment results in Section 5 is that minimizing cache line reads and avoiding cache line sharing is crucial to parallel program performance. These two factors prevented parallel simulation methods 1 and 2 from providing very good performance.
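One common way to avoid cache line sharing is to pad each thread's memory block so that it starts on its own cache line. The sketch below only computes the layout. It assumes 64-byte cache lines and 8-byte doubles, which is typical of current x86-64 CPUs but not stated above, and it assumes the underlying array itself starts on a cache line boundary. The function names are hypothetical.

```python
CACHE_LINE_BYTES = 64   # assumption: typical x86-64 cache line size
DOUBLE_BYTES = 8        # assumption: state variables stored as doubles


def padded_block_offsets(vars_per_thread,
                         line_bytes=CACHE_LINE_BYTES,
                         item_bytes=DOUBLE_BYTES):
    """Start index of each thread's block, rounded up to whole cache lines."""
    items_per_line = line_bytes // item_bytes
    offsets, cursor = [], 0
    for count in vars_per_thread:
        offsets.append(cursor)
        lines_needed = -(-count // items_per_line)   # ceiling division
        cursor += lines_needed * items_per_line
    return offsets


def cache_lines_touched(offset, count, items_per_line=8):
    """Set of cache line indices covered by one thread's block."""
    first = offset // items_per_line
    last = (offset + count - 1) // items_per_line
    return set(range(first, last + 1))


# Three threads owning 5, 11 and 3 state variables respectively.
counts = [5, 11, 3]
offsets = padded_block_offsets(counts)
lines = [cache_lines_touched(o, c) for o, c in zip(offsets, counts)]
```

With this layout no cache line holds data written by two different threads, at the cost of a few unused array slots per block.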

In the full shared memory approaches there was both cache line sharing between the threads of

the variables, and therefore the time to perform the cache line reads dominated the time to perform the merge. The benefit of assigning each thread in

Another factor in the good performance of these methods is that the memory block that each thread in

Another important aspect to designing parallel simulation algorithms is to ensure that the computational work of a time step is divided evenly across the computational threads. This was a second major problem with our early parallel simulation methods. These methods assigned the computational work of solving and integrating the equations in

Simulations that execute for a large amount of simulated time will likely require a large number of time steps. For our experiments, some models were run for 5 million time steps. At that number of iterations, small inefficiencies in the simulation code that would have been ignored or undetectable had they only been run once can become a source of significant lost time. In a program such as a simulation, where the same piece of code is repeated many times, every line of code needs to be closely examined to make sure it is as efficient as possible.

In this work, we presented a series of parallel simulation algorithms for ODEs that are specifically designed to accommodate the features of a modern multi-core CPU. Specifically, the algorithms consider the multiple CPU cores, the processor cache, and their interactions to derive maximum speedup. The final three of these algorithms, Full Partitioned Memory Fixed Step (Section 5.3), Full Partitioned Memory Variable Step (Section 5.4), and Algebraic Loop Simulation (Section 5.5), produced good speedup when compared to our serial test case. We also developed a series of experimental studies that allowed us to analyze the effectiveness of various multi-threading and memory management schemes for parallel simulations. Since these algorithms are based on an ODE representation of the model behavior, they can be applied equally across models that cover multiple physical domains.

In the process of systematically developing these algorithms we derived a set of recommendations that apply to parallel simulation (Section 7) on modern multi-core processors. These conclusions include recommendations on the size of the model to be parallelized, the number of threads to use for the simulation, the memory management scheme to use, how to divide the computational work between the threads, and software optimizations to implement.

An important limitation of this work is that the models to be simulated must be reduced to ODE form. This results in a loss of the time trajectory data of the algebraic variables in the system, but the trajectory data for the state variables is maintained. The time trajectory data of these algebraic variables are typically preserved in traditional Modelica simulation [

Another limitation is that our parallel algorithms are specifically designed for the parallel architecture of a multi-core CPU. If the algorithms are applied unaltered to a different parallel architecture, such as a SIMD architecture or a GPU, they will likely not perform as well.

A third limitation is that we only address a small subset of the Modelica language. More complicated Modelica models that include discrete mode transitions and conditional behavior can force the simulation software to re-derive the system ODE equations during the simulation. Our algorithms depend on a fixed set of ODE equations and do not allow those equations to change during run time. Therefore, we cannot support models that have those features without significant changes to our simulation architecture.

For future work, we would like to develop methods for formally solving the optimization problem of the model equations, expand our parallelization algorithms to support an additional numerical integration method, apply our algorithms to a General Purpose GPU, parallelize the integration of our algebraic loop simulations, and apply intelligent load balancing when assigning algebraic loops to threads.

Initial research supported under DARPA META contract FA8650-10-C-7082. This support is greatly appreciated. The authors also wish to thank Zsolt Lattmann at Vanderbilt University for his help creating simulation models.

Joshua D. Carl, Gautam Biswas (2016) An Approach to Parallel Simulation of Ordinary Differential Equations. Journal of Software Engineering and Applications, 09, 250-290. doi: 10.4236/jsea.2016.95019