Runtime Power Allocation Based on Multi-GPU utilization in GAMESS

To improve the power consumption of parallel applications at the runtime, modern processors provide frequency scaling and power limiting capabilities. In this work, a runtime strategy is proposed to maximize performance under a given power budget by distributing the available power according to the relative GPU utilization. Time series forecasting methods were used to develop workload prediction models that provide accurate prediction of GPU utilization during application execution. Experiments were performed on a multi-GPU computing platform DGX-1 equipped with eight NVIDIA V100 GPUs used for quantum chemistry calculations in the GAMESS package. For a limited power budget, the proposed strategy may deliver as much as hundred times better GAMESS performance than that obtained when the power is distributed equally among all the GPUs.


Introduction
Power and the subsequent energy consumption pose major challenges in the design of large scale systems. The power/energy constraints are due to many reasons with technical and economical costs being the primary. Therefore, power and energy consumption is a major obstacle to application scalability, availability, and affordability, and it is urgent to develop techniques that optimize energy consumption while maximizing performance. Hurdles to such optimization are in part due to 1) a great variability in modern high-performance application workloads and 2) complexity of modern hardware architectures. These two fac-Journal of Computer and Communications tors have to be accurately modeled to predict runtime performance under different power levels.
In authors' previous work [1], machine learning (ML) modeling was used to provide accurate performance models, which accounted for a multitude of hardware characteristic as they relate to the application runtime changes. For the accurate predictions, it is of utmost importance to provide accurate estimations for the said characteristics, which may serve as input features to the ML performance models executing dynamically. In [1], the authors considered history-window methods with averaging across a few past time intervals. In this work, time series forecasting methods are explored for feature value prediction. Given a real-world quantum chemistry application GAMESS [2], its feature value patterns are first explored as a function of time. Next, time series forecasting methods are applied to predict their future values during the runtime. To test the resulting solution in modern high-performance computing (HPC) scenarios, it was implemented as a foundation of a novel runtime strategy that allocates a given power budget among the multiple GPUs executing an application for maximum performance. The strategy operates in a manner transparent to the application and utilizes a timeslice based approach to set the power limits for the next timeslice. In a nutshell, the main contributions of this work are as follows: ○ Explored the feature value patterns with respect to time in GAMESS GPU executions. In particular, considered GPU utilization feature.
○ Selected and justified an appropriate time series forecasting method that may be applicable to a variety of HPC GPU calculations in GAMESS.
○ Proposed a novel runtime strategy to maximize GPU execution performance under a given power budget.
○ Deployed the chosen forecasting method in the runtime strategy and demonstrated its superior performance to standard power allocation approaches.
The structure of the rest of the paper is as follows. Section 2 provides literature review. Section overviews GPU calculations in GAMESS used in this work.
Section 4 provides methodology to model runtime behavior of the GAMESS GPU workload and studies applicability of different time series forecasting algorithms. Section 5 provides methodology and description for the runtime strategy to allocate available power. Section 6 shows experimental research results and Section 7 concludes the paper.

Review of Published Related Work
Power consumption is one of the principal design constraints for the modern exascale systems. To mitigate the resulting operating costs and prohibitive failure rates, researchers have been devising strategies to budget power on system components. In this section, a brief discussion of previous work in power capping and closely related work in system-level power and energy savings is provided.
The strategies to budget the power consumption for modern computing sys- and 2) hardware-enforced power bounds using such interfaces as Intel(R) RAPL [3]. The authors in [4] propose a multi-input multi-output (MIMO) power control algorithm to distribute a given power budget between the processor and memory domains to maximize the application performance. A machine learning technique to determine the sensitivity of application performance with respect to CPU and DRAM power capping was proposed in [5] and used to devise a strategy for power budgeting. A runtime system termed conductor was proposed in [6] which utilizes configuration space exploration and adaptive power balancing to maximize performance under a hardware-enforced power cap. A multilevel power distribution framework termed CLIP was proposed in [7], which estimates per node power budget using the workload characteristics and utilizes memory accesses to determine CPU and memory affinity.
Many strategies have been developed to improve GPU energy efficiency by utilizing DVFS, micro-architectural techniques, workload division and application specific information. Zamani et al. [8] propose a framework for matrix multiplication employing undervolting beyond minimum operating voltage to reduce energy consumption of GPUs. To guard against any rise in faults by such undervolting, a fault-tolerance algorithm was also proposed in [8]. Authors in [9] characterize and demonstrate the NP-hardness of optimal task scheduling problem on GPUs and propose a constant approximation algorithm assuming that GPU cores can scale with continuous frequencies.
Guerreiro et al. [10] develop microbenchmarks to devise GPU power models using an iterative approach and the performance counter parameters. The work in [11] proposes strategies to modify application and CUDA parameters, such as grid and thread dimensions, to improve GPU utilization and energy efficiency.
Kernel fusion, which combines two kernels into a single thread, is proposed in [12] to improve GPU utilization and reduce energy consumption. Greengpu [13] involves low level programming and memory management with custom pthreadbased kernel launches for the GPU to divide workload between CPU and GPU for synthetic benchmarks. The authors in [14] utilize software prefetching and DVFS to reduce GPU energy consumption.
The work in [15] proposes PowerCoord that dynamically controls the power of the PKG and GPUs to cap power consumption while seeking to maximize performance of the system in which multiple jobs are executing. The problem of job co-scheduling on an integrated system with both CPUs and GPUs under a power cap was studied in [16]. Factors, such as memory, power contention and job period, were observed to affect performance and considered in heuristic algorithms for performance enhancing co-schedules. Authors in [17] study the gaming workloads from the point of view of PID controllers, least mean squares (LMS) and autoregressive moving average (ARMA) to manage their power consumption. In [18], workload characterization is performed using time series for reducing CPU power consumption. The work described in this paper differs from the related work in that it considers not only GPU power allocation dynamically but also the distribution of power among multiple GPUs for maximum performance. Furthermore, a time series approach was used to model GPU streaming-multiprocessor (SM) utilization during the execution. [19] is one of the most representative freely available quantum chemistry applications used worldwide for ab initio electronic structure calculations. A wide range of quantum chemistry computations may be accomplished using GAMESS, ranging from basic Hartree-Fock and Density Functional Theory computations to high-accuracy multi-reference and coupled-cluster computations.

Overview of GAMESS Calculations on GPUs
The high performance multi-GPU capabilities in the GAMESS/LibCChem [20] suite of programs include a GPU-accelerated Fock build [21], a full implementation of the Self-Consistent-Field (SCF) method [22] including the one- The overarching scheme for the SCF program includes a coordinator/worker dynamic work balancing algorithm, the steps of which are as follows: Firstly, the basis set information is accepted, the shells and shell pairs are constructed. Secondly, the shell pairs are sorted and stored in the binned batch container [22], which ensures a load balanced distribution of the work among the processes.
Thirdly, the coordinator process evaluates the one-electron integrals on the GPU and transfers the Hamiltonian (H core ) matrix back to the CPU. After the oneelectron integrals have been computed, the process to evaluate and digest the two-electron integrals begins.
The coordinator process binds the workers to a GPU, binding one rank to one GPU exclusively. The batch-binned shell pair container is copied among the ranks. The coordinator process statically assigns the first batches of integrals to be evaluated to a GPU. Each batch contains a batch-size number of shell pairs, the value of which is set in the input file with the default of 2560 shell pairs. The batches are organized in a greedy fashion, having the most computationally expensive first. Journal of Computer and Communications In the GPU, each thread calculates a contracted integral and its results are digested into partial Fock matrices in order to avoid race conditions. After a worker process is done with its associated batch of work, it sends a signal to the coordinator process. The coordinator process then assigns the next available batch to that worker. This is repeated until all shell-pair batches are processed.
At this point, the partial Fock matrices are gathered in the host and the final Fock matrix is transferred back to the GPU for the DIIS algorithm to diagonalize the matrix and obtain the final Hartree-Fock energy of the respective iteration.
This process continues until self-consistency is achieved, which is controlled by a user-defined convergence threshold with the default of 10 −6 .

Methodology: Time Series Forecasting
Section 3 outlined the scheme of the SCF program executed on multiple GPUs that are dynamically load-balanced, from which it is inferred that all the participating GPUs have a similar pattern of their utilization. In particular, Figure 1 shows the streaming multiprocessor (SM) utilization for the eight V100 GPUs for a total of 300 seconds of an RHF calculation of a valinomycin molecule. Observe that utilization traces for all the GPUs are virtually indistinguishable in A set of data points ordered in time and equally spaced are considered time series and thus, corresponding analysis methods are applicable. In the rest of this section, the SM utilization will be treated as time series and a particular analysis   Figure 1, which also includes this very utilization trace).
method chosen. Notice that the SM utilization, shown in Figure 2, exhibits a repeating pattern in intervals of similar duration, so time-series forecasting techniques can be employed to predict future utilization values. Figure 3 outlines decision-flow stages as suggested in [26] to identify an appropriate method to predict a time series accurately. In particular, for the applicability of sophisticated forecasting models, it should be verified that the sequence of data considered is not a random walk, meaning that, in addition to its stationarity, the sequence points must be autocorrelated, thereby avoiding a random movement without any pattern. Note that, if the given time series is not stationary, differencing transformation can be applied to make it stationary, and that the augmented Dickey-Fuller (ADF) test can be used to determine if a time series is stationary by testing for the presence of a unit root. If a unit root is present, the time series is not stationary and a differencing transformation needs to be applied. This testing-differencing process is repeated. For the time series in Figure  2, the initial p-value output by ADF was determined to be 0.67, which is much higher than the threshold of 0.05 below which the series is considered stationary. Therefore, the first-order differencing transformation has been applied and the ADF p-value of 0.0012 was obtained indicating that this transformation made the time series stationary. Next, per Figure 3, one has to check if any autocorrelation lies between the lagged values of the time series. Figure 4 shows the autocorrelation plot with y-axis depicting the autocorrelation values and x-axis depicting the time lag. It can be observed from Figure 4 that there is a significant autocorrelation even when the lag value is as large as 20, and hence, the time series is not a random walk, and the lag q = 1, where the autocorrelation is the maximum, may be used as a moving average parameter in the forecasting model.   Since the autocorrelation values are not abruptly going down after a certain lag value, the given time series is not a moving average process. Following the steps in Figure 3, partial autocorrelation needs to be evaluated. Figure 5 shows the partial autocorrelation values of the time series. The values do not decrease abruptly after a certain lag, and the lag p = 1, where the partial autocorrelation is maximum, may be used as the order of the autoregressive process. Therefore, following Figure 3, it can be concluded that the given time series is an Autoregressive Moving Average (ARMA(p, q)) process. ARMA requires that the data is first transformed by differencing into a stationary series. Then this transformation must be reversed after the forecasting is done to obtain predictions on the    Figure 6. Comparison of the observed and predicted SM utilization values using the ARIMA(1, 1, 1) process and the history-window mechanism for the execution shown in Figure 2.
it is compared with the history-window prediction mechanism, in which the future value is predicted as a rolling mean of the past three values. Note that the window size of three has shown the best performance in authors' previous work on runtime power and performance modeling strategies [1] [27]. It can be observed from Figure 6 that the predicted values significantly match the observed whereas the history-window prediction fails to capture the pattern of the observed values. Specifically, an R-squared value of 0.917 was obtained for the model predictions in Figure 6, which demonstrates that the ARIMA process can predict the SM utilization time series with high accuracy.

Methodology: Runtime Strategy to Allocate Available Power
In Section 4 the ARIMA-based prediction of the SM utilization time series was developed for GAMESS GPU execution. Here, the selected ARIMA prediction is employed in a proposed novel runtime strategy to maximize parallel application performance executing on multiple GPUs under a given power budget. In particular, Figure 7 displays the algorithmic steps of this strategy. Initially, at Step 1, the given power budget P B is divided equally to all the available N GPUs by setting their individual power budgets i P to B P N . In Step 2, the application executes for the duration τ of a timeslice. Note that, similar to authors' previous work [1] [27], the timeslice duration was set at 250 ms, which has been found here as acceptable for the accuracy of predictions as evidenced in Section 6 describing research results. In Step 3, given the i P , the resulting SM utilization i U is rec- Then, the ARIMA model is employed (Step 4) in each GPU i to predict the utilization ˆi U for in the current timeslice r followed by collecting the predicted utilization values from all the N GPUs (Step 5).
In Step 6, the power budget of each GPU is set in proportion to its predicted utilization.
Step 7 executes the application for the duration τ in the timeslice r. Then, the strategy proceeds to consider the next timeslice r + 1.

Experimental Research Results
The experiments were performed on the DGX-1 compute node at Old Domi-

Algorithm: PB
Step 1. Set power budget Pi = N .
Step 4. Predict the SM utilization (Ji for timeslice r by the ARIMA model.
L i= l u, Step 7. Execute application for the duration T.

EndFor Journal of Computer and Communications
Four different GAMESS/LibCChem inputs were chosen to evaluate the efficacy of the proposed strategy. Table 1 provides their input names along with descriptions. In order to select a (reduced) power budget P B for the calculations, their total GPU unrestricted power consumption at the maximum GPU frequency was measured and is also shown in Table 1 (Column P max ). It can be observed, that the power consumption ranged from 1206 W to 1220 W. Therefore, same limits may be applied to all four calculations. In particular, three power budgets P B were chosen as follows: 1000 W, 900 W, and 800 W, which represent similar percentages of the unrestricted power P max for the calculations, about 81%, 73%, and 65% for the three budgets P B , respectively.
To evaluate performance of the four inputs under the proposed runtime strategy, a baseline strategy termed allhigh was considered. In the allhigh strategy, the power limits for all GPUs were raised to their maximum values. The proposed strategy was also compared with another power-limiting strategy, termed naïve, which allocates the given power budget among GPUs based on their respective thermal design power (TDP) limits. In particular, the TDP of the V100 GPU is 300 W [28]. There are eight identical V100 GPUs in the DGX-1 compute node, so under the naïve strategy, the power budget is divided equally among the eight V100 GPUs.
Quantitatively, the performance of a strategy is proportional to the inverse of the execution time T under that strategy. Therefore, as authors suggested previously in [27], the performance degradation s δ of a strategy s relative to the allhigh strategy is calculated as     GPUs in proportion to their utilization in the proposed strategy and from the high prediction accuracy of the underlying ARIMA process for the SM utilization. For the tight budgets of 900 W and 800 W (73% and 65% of the maximum required, respectively), the proposed strategy still performed 5.6 and 3.6 times better on average than the naïve one did so, respectively. Additionally, it is worth noting that, for lower power consumption (~1040 W) than that shown in Table   1 no performance degradation was observed for any input when the proposed strategy was employed.

Conclusions and Future Work
In this paper, a runtime strategy is designed and implemented to distribute available power among multiple GPUs in order to maximize performance of the quantum chemistry application GAMESS. The foundation of the strategy is the workload prediction model that considers utilization as time series and applies Autoregressive Integrated Moving Average (ARIMA) process dynamically to predict the GPU utilization in the next timeslice to be executed. Then, in each GPU, the available power is allocated in accordance with the predicted utilization as a fraction of the total utilization predicted across all the GPUs participating in the calculation. Experiments, performed on the DGX-1 compute node having eight V100 GPUs, and demonstrated that the proposed strategy provided a near maximum performance even with substantially limited power budget for four GAMESS calculations. Specifically, for the calculation of the glycine molecule, as much as a 18% reduction in the available power resulted in only a 0.4% of performance loss. It was also observed that, for the four inputs, the proposed strategy improved performance several-fold, ranging from 3.6 to 54.3 times on average, from an "equitable'' power distribution strategy based on the ratio of GPU thermal design power (TDP) limits. Using the proposed strategy, a lower overall unrestricted power amount was found for which none of the four inputs exhibited a performance loss and which was ~14% less than the average unrestricted power consumption observed initially.
Future work will focus on extending the proposed strategy to a distributed system with multiple nodes and multiple GPUs to facilitate its usage on exascale platforms. Journal of Computer and Communications tional Science Foundation under grant CNS-1828593. The authors appreciate helpful reviewer comments.