Runtime Energy Savings Based on Machine Learning Models for Multicore Applications

To improve the power consumption of parallel applications at the runtime, modern processors provide frequency scaling and power limiting capabilities. In this work, a runtime strategy is proposed to maximize energy savings under a given performance degradation. Machine learning techniques were utilized to develop performance models which would provide accurate performance prediction with change in operating core-uncore frequency. Experiments, performed on a node (28 cores) of a modern computing platform showed significant energy savings of as much as 26% with performance degradation of as low as 5% under the proposed strategy compared with the execution in the unlimited power case.


Introduction
Modern computing systems are being increasingly controlled by their power consumption ranging from node components to a full fledged data center. The power/energy constraints are due to manifold reasons with technical and economical costs being the primary. To reach the exascale, modern computers must still nearly double their performance while the further increase in power consumption becomes prohibitive. Therefore, power/energy consumption becomes a major obstacle to application scalability, availability, and affordability, and it is urgent to develop techniques that optimize energy consumption while maximizing performance.
On the other hand, such optimization is a difficult task due in large part to a 1) great variability in modern high-performance application workloads, and 2) complexity of modern hardware architectures. These two factors have to be accurately modeled to predict runtime performance under different power levels.
Existing analytical and heuristic models fall short of this task because they cannot account for the multitude of hardware characteristics as they relate to the application dynamic changes. Recently, machine learning (ML) has been proposed as an effective alternative to modeling application time-to-solution under different dynamic voltage and frequency scaling (DVFS) [1] and uncore frequency scaling (UFS) levels. Note that the uncore encompasses those processor functions that are not handled by the core, such as L3 cache and on-chip interconnect. In this work, machine learning models are first investigated for their use during the runtime performance modeling of a diverse set of application workloads exhibiting dynamically changing compute-and memory-intensities. The models are incorporated into a novel runtime strategy along with power and processor-frequency selections. The strategy aims to maximize energy savings under a user provided performance constraint. The strategy operates in a manner transparent to the application and utilizes a timeslice based approach to select appropriate frequencies for the next timeslice. In a nutshell, as main contributions, this work  Investigated and applied ML models for predicting performance during the runtime: Considered uncore frequency scaling as an independent variable.
Employed dimensionality reduction to obtain a set of predictor variables from performance events.
Collected data for prediction on scientific workloads with diverse memoryaccesses and computational patterns, including a large-scale quantum chemistry package GAMESS [2].
Evaluated three different ML algorithms as to their prediction accuracy vs time.
Investigated and tested two types of ML model construction, fully runtime and pretrained statically.
 Proposed a transparent runtime strategy that incorporates the developed ML performance models to maximize energy savings.  Compared the proposed runtime strategy with its counterpart developed in authors' prior work [3].
The rest of the paper is organized as follows. Section 2 provides the related work. Section 3 outlines a previously developed performance model, used here for comparisons. Section 4, first, discusses hardware performance-event data collection along with dimensionality reduction; then, evaluates several ML algorithms for their usage at the runtime and proposes two types of the ML model construction. Section 5 details the runtime strategy along with its implementation steps. Section 6 shows experimental results and comparisons with the prior approach. Section 7 concludes the paper.

Related Work
There have been many previous research efforts that propose to use machine learning strategies for power and energy savings in modern computing systems.
They target both single-and multicore systems using supervised and reinforcement learning for power management, temperature management, and performance maximization under a power constraint.
In [4], a dynamic power management (DPM) technique is proposed for an arbitrary number of sleep states that shuts down idle components based on clustering of idle periods. In [5], authors have proposed a strategy for an arbitrary number of sleep states that minimize the power consumption under a given performance constraint. Wang et al. [6] present an online hierarchical mechanism with application-level scheduling for an embedded system minimizes the total power consumption and finds an optimal point for power-delay relation for the connected devices. Albeit [4] [5] [6] operate on a single-core platform, they are relevant to the current work because they also deal with transparent strategies to manage dynamically power consumption of the processor.
Similarly to the current work, the work in [7] uses DVFS along with a learning based prediction. In particular, it proposes a supervised-learning based power management framework for minimizing energy consumption on a multicore chip equipped with DVFS on each core. A Bayesian classifier is employed for predicting performance of each core per incoming task by observing a set of input features. This predicted state is further used to find an optimal power management action in a pre-computed lookup table.
Bartolini et al. [8] propose a distributed thermal management technique utilizing model predictive control and self-calibration for minimizing energy consumption under performance and temperature constraints. Specifically, each core takes a value of the predicted cycles per instruction (CPI) of the running task as input and chooses the minimum frequency value while satisfying the performance constraints. The work in [8] considers the power management with thermal and performance constraints and utilizes analytical modeling by just using the CPI performance event. The results are demonstrated in simulation only rather than in real time.
In [9], a power management and task allocation framework based on Q-learning is proposed to attain a trade-off between performance and power consumption while simultaneously following the temperature constraints. A reinforcement-learning based strategy is proposed in [10] to prepare a scheduling policy where system invokes DVFS and any penalty is considered in decision making through the learning process. The work in [11] proposes an online thermal management strategy for dual goal of maximizing performance and reducing thermal cycles under a temperature constraint. The policies are directed by a reinforcement learning algorithm and apply a power management to optimize the desired goal.
Using constrained energy minimization, modeling with the CPI performance processor pipeline and to consider the uncore frequency scaling, which are important factors in gaining maximum energy savings for highly changeable workloads as shown in the current work on a multicore platform.

Overview of the Analytical Performance Model
In the authors' prior work [3], a frequency scaling runtime strategy that targeted both core and uncore power domains was proposed to save energy in parallel applications with a minimal performance loss. This strategy relied on the analytical performance and power models that were also developed by the authors.
In particular, performance modeling of the core-uncore domain was expressed in the following Equation  While the strategy based on this model delivered promising results of 15.3% in energy savings with 5.3% performance loss, its shortcomings were observed in applications with memory-intensive workloads, such as iterative linear system solvers, for which the the strategy saved less than 10% of energy. Such shortcomings may be explained by rather crude estimates of memory accesses per micro-instruction, which were modeled by only one parameter, last-level-cache (LLC) misses in Equation (1). Adding more performance events to model complex application behavior appeared not feasible in the analytical expression because the interdependence of multiple events cannot be determined for the broadly applicable model. The single heuristic used in the model, the experimentally determined OOO factor, already showed its narrow applicability scope since it had to be tuned beforehand for a given processor.

Journal of Computer and Communications
To overcome these deficiencies of the analytical model, machine learning approaches are considered in this paper in combination with analytical modeling of power and frequency levels. By definition, machine learning is suitable to consider numerous parameters-features, such as multiple performance events here, for training and producing models that work without expressing the parameter dependencies explicitly.

Machine Learning Model Construction
In this section, first the relevant performance data to train and test the model is selected using the reasoning based on the processor operation and the nature of workloads. Then three different ML algorithms are evaluated for the time vs accuracy trade-off along with the investigation of two possible modes to perform ML training stage: during the runtime and pretraining statically.

Performance-Event Data Collection
In general, the workload behavior of an application varies throughout its execution, exhibiting memory-or compute-intensive patterns on a fine-grained scale. Hence, runtime performance modeling is typically done in small time intervals, called here timeslices. For modeling energy consumption, timeslices of a fixed duration on the order of the frequency scaling overhead have proven to be a good choice in the authors' earlier work (see e.g., [12]), where it has been shown that the timeslices of 250 ms incur a low modeling overhead and are sustainable for large-scale applications, such as GAMESS quantum chemistry calculations, which are considered in the present work as well.

GAMESS Overview
GAMESS is one of the most representative freely available quantum chemistry applications used worldwide to do ab initio electronic structure calculations. A wide range of quantum chemistry computations may be accomplished using GAMESS, ranging from basic Hartree-Fock and Density Functional Theory computations to high-accuracy multi-reference and coupled-cluster computations.
The central task of quantum chemistry is to find an (approximate) solution of the Schrödinger equation for a given molecular system. An approximate (uncorrelated) solution is initially found using the Hartree-Fock (HF) method via an iterative self-consistent field (SCF) approach or restricted HF (RHF), and then improved by various electron-correlated methods, such as second-order Mø ller-Plesset perturbation theory (MP2). The SCF-HF and MP2 methods are implemented in two forms, namely direct and conventional, which differ in the handling of electron repulsion integrals (ERI, also known as 2-electron integrals).
Specifically, in the conventional mode all ERIs are calculated once at the beginning of the interactions and stored on disk for subsequent reuse whereas in the direct mode ERIs are recalculated for each iteration as necessary. The SCF-HF iterations and the subsequent MP2 correction find the energy of the molecular and msn-32 in the rest of the paper.
Additionally, several NAS parallel benchmarks (NPB) [13] were chosen to further increase the mix of compute-and memory-intensive workloads with common scientific irregular computation patterns. Both NPB and GAMESS were executed on the Xeon based platform and data on certain performance events-as detailed below-was collected for a 250 ms timeslice duration. The core and uncore frequency ranges on the Xeon platform are 1.

Selection of Performance-Counter Events
The hardware platform used in this work employs an Intel Xeon E5-2695 v3 processor that comprises 14 cores. Each processor core is equipped with multiple hardware performance counters, which provide runtime count for such events as micro-operations retired and L3 cache misses. Similar to the analytical model in Equation (1), the ML performance model considers the micro-operations retired as the measure of processor performance at different core-uncore frequencies to predict performance in a given timeslice.
The Xeon E5-2695 v3 processor has approximately 190 performance events 1 , which makes it intractable to read all of them and use in an ML model for the runtime performance prediction. Therefore, only certain most relevant and impactful, events must be selected. In particular, the following procedure was undertaken. All the events, along with their event codes, are scraped and those events that are a part of aggregation or are not related to processor performance have been omitted. For example, there are about twelve performance events that deal with resource stalls (e.g., RESOURCE_STALLS.LB, RESOURCE_STALLS. ROB, RESOURCE_STALLS.RS), eleven of which are aggregated in a single event RESOURCE_STALLS.ANY that is taken as the one dealing with stalls in the ML model. By continuing with agrregation in other event groups, the number of events was brought down to 45, thereby reducing the data-variable dimensionality by more than 75%.
A custom C language program had to be developed to periodically and selectively collect the performance-event data as follows. To read a specific event, the event code is first written to a given performance-event select register whose address starts from 0 × 186. Then the value of the event is read from the corresponding performance-event counter whose address starts from 0 × C1. Next, the Journal of Computer and Communications event values were monitored at the runtime. It was observed that many event values were at zero. Therefore, all such events were also dropped, which left only ten events further reducing the dimensionality drastically. Table 1 shows the resulting set of events considered in this work along with their ranges on the NPB and GAMESS inputs. Note that, for the input to ML models, the event values are to be normalized by the UOPS_RETIRED event value of the corresponding timeslice, so that the events are made invariant to core-uncore frequency changes in order to increase the accuracy of predictions at the runtime and to further reduce the number of events to nine.

Evaluation of Different ML Algorithms
Execution performance modeling may be treated as the multiple regression problem. Hence, three appropriate algorithms tackling this problem were chosen as follows: Linear Regression (LR), K-Nearest Neighbors (KNN), and Random Forest Regressor (RFR) [14]. The train-test validation was used for the evaluation on the collected data (Section 4.1). Due to a large variety of workloads in the dataset, the training data may be assumed to come from many different distributions, and thereby allowing the model to better generalize on a variety of test data. Furthermore, the stratified sampling [15] was employed to avoid random sampling bias in the dataset.
In the authors' previous works [3] [12], it was determined that the LLC miss count (see event #1 in Table 1) was considered a greatly important parameter for modeling because an LLC miss leads to a DRAM access during which there is an opportunity to reduce the processor frequency and, thereby, obtain energy savings with minimum performance penalty. As a consequence, the train and  Table 2 shows the ranges of LLC misses and the corresponding bin numbers into which the misses were sampled. Figure 1 presents the resultant training dataset distribution into bins based on the LLC misses incurred. Note that the relatively low and high values of LLC misses correspond to the compute-and memory-intensive application inputs, respectively. It has been observed that the test set exhibited a distribution similar to that shown in Figure 1.
Hence, the sampling categories in Table 2 were chosen for use in the ML algorithms considered here.
After dividing the entire data set into train (80%) and test set (20%), which are the commonly used ratios for evaluating machine learning models [14], the prediction accuracy of the three algorithms was determined as shown in Figure 2. It can be observed from Figure 2 that, while the KNN and RFR algorithms have much higher train accuracy than LR does so, they all have nearly the same test

accuracy. A large discrepancy between train and test accuracies of KNN and
RFR is intrinsic to their design (see, e.g., [14]), which is different from that of LR.   In this work, an important ML algorithm selection criterion is the time to use the obtained ML model dynamically. In particular, the said time has to be on the order of the timeslice duration considered here so that the overall application performance is not degraded. Figure 3 shows the time to predict a single data point, i.e., to apply the ML model only once to predict the runtime performance, spent by the three algorithms. It can be seen that this time is the lowest for LR. Additionally, LR is designed to accurately capture a relationship between the response and predictor variables, contrary to the other two ML algorithms that predict based on the boundary estimates. Therefore, LR is selected to model performance in the runtime strategy developed in this work (see Section 5).

Training Stage: Performed Statically or Dynamically
When considering dynamic usage of the ML models, the prediction stage has to be always done at the runtime to react to the actual input into the model, which is changing on the timeslice scale in this work. Now, the training stage may be performed either statically, before the execution, resulting in the pretrained model applied in each timeslice during the execution, or dynamically, such that both (re)training and prediction are done in each timeslice on the newly acquired data as a part of the runtime strategy. Obviously, the overhead from application of the former-termed here pretrained statically or trn-Static for short-affects the performance less during each timeslice. However, trn-Static may lead to less accurate predictions, thereby affecting both the overall energy savings and time-to-solution. Furthermore, the OOO overlap factor α (see Equation (1)) is still determined experimentally, as in the analytical modeling, when trn-Static is used since all the model training is done completely offline similar to analytical modeling. The latter-termed here fully runtime or all-Runtime for short-appears the most agile and incorporates α implicitly in the model at each timeslice. The application of the ML model with all-Runtime too, if care is not taken, may come at a price of lesser accuracy and higher performance loss accumulating for the entire execution. To train with all-Runtime efficiently, both the training time and the training sample size have to be considered. Figure 4 shows the time to train the model on the entire collected data set for the three algorithms. It can be observed from Figure 4 that LR has the lowest time among the three algorithms, and thus, corroborates its selection in the runtime strategy. Also, Figure 5 shows that LR exhibits a stable performance when the sample size    grows from 100 to 2000, which is another argument in support of the selection of LR.

Design and Implementation of the Runtime Strategy
This section outlines the combination of the proposed ML model with the ones for power and frequency level, which are similar to those used in authors' earlier work [3]. Then, it describes in detail the algorithmic steps implementing the strategy. The proposed strategy utilizes ML modeling to maximize energy sav- ings of a parallel application on a compute node. In each timeslice, for each core, the strategy gathers the relevant performance-counter information, which is used to model the performance and power for the next timeslice to be executed in order to determine the next optimal core and uncore frequencies for each core, followed by the application of the chosen frequency pair before commencing the next timeslice.
To better predict the next timeslice performance characteristics, a few past neighboring timeslices may weigh in their actual characteristics along with the current predicted values. In particular, following the work in [3], a history-window predictor has been built into the strategy such that some function g accepts a window of several neighboring values and outputs a prediction for the next timeslice. After each prediction, the window slides forward by one position.
The performance loss tolerated due to energy savings has to be bounded (typically, at no more than 10%). Hence, the potential performance loss must be calculated and kept within the upper bound for the core-uncore frequency prediction to be feasible. A frequency-pair subset F contains all such feasible frequency pairs. The following expression calculates the performance loss Note that the value of the micro-operations retired µτ , relates directly to application performance.

Power Modeling
To account for the instantaneous power consumption in the proposed runtime strategy and to select the core and uncore frequencies that minimize the system energy under a performance constraint, the Intel RAPL tool [17] is used, which provides instantaneous processor power consumption. The processor power consumption, denoted ( ) where 1 k and 2 k are constants. The values of 1 k and 2 k were determined through a regression analysis similar to the one proposed in [3] Then, the total power consumption ( ) , T P i j of core, uncore, and DRAM domains is as follows: where m P is the memory power consumption (determined at the runtime from RAPL), and static P is the static power consumption of the three domains, determined to be 40 Watts using RAPL.

Choosing the Optimal Frequency Levels
Finally, given the predictions for the next timeslice r of the performance and total power at all the available core-uncore frequency level pairs ( )  (2)). In other words, a total energy minimization problem may be solved as follows: where τ is the fixed timeslice duration, the term represents the next interval execution time, which is possibly larger than τ by factoring in the performance loss corresponding to the operation at a frequency from the feasible subset F. Figure 6 displays the steps of the algorithm underlying the proposed runtime strategy.

Runtime Energy-Saving Algorithm
Step 1 profiles the application for duration τ and obtains the relevant event values from the performance counters. Next, Step 2 predicts value of events for the next timeslice r to be executed by using the history-window algorithm with the window of size of three, in which averaging as the g function proved sufficient for the given workloads and timeslice duration.
Step 3 calls either fully runtime or pretrained statically ML model with the LR algorithm (in function useML_LR) to predict the micro-operations retired for all the core-uncore frequency pairs, returned as set M τ . Next (Step 4), a subset F M τ ∈ is determined consisting of all those coreuncore frequency pairs for which the predicted performance loss does not exceed the performance-loss constraint γ . The threshold value of γ is provided by the user while the actual resulting performance loss is measured using the Then, in Step 6, an appropriate operating frequency pair is chosen from solving the energy minimization problem as described in Section 5.2.

Experimental Results
The experiments were performed on a compute node having two Intel Xeon E5-2695 v3 14 core Haswell-EP processors with 32 GB (4 × 8 GB) of DDR4. The core and uncore frequency ranges are 1.2 -2.3 GHz and 1.0 -2.6 GHz, respectively. To measure the socket and DRAM power, Intel RAPL API was used. The user-defined performance-loss tolerance γ was taken as 10%, which is a typical value to allow for energy savings (see, e.g., [18]). Figure 7 shows the performance degradation of the proposed runtime strategy relative to the performance with both the core and uncore frequency levels staying at their maximum. The four NAS benchmarks are shown as "xx.yy.zz", where "xx", "yy", and "zz" denote benchmark name, class, and number of processes used, respectively. The four GAMESS MSN inputs are distinguished by their fragment sizes (11,16,22, and 32). The proposed runtime strategy is also compared with the one developed earlier in [3]-termed here eq-PerfMod-the performance model of which is outlined in Section 3.

Performance and Energy Savings
The EP benchmark is invariably CPU-intensive throughout the execution with its performance degrading in a linear manner with the reduction in the core frequency. Therefore, the all-Runtime and eq-PerfMod execute EP at the lowest  Static to select such a low core frequency for EP is that it has been pretrained on data, which came from a broad set distributions, while EP is showing an obvious compute-intensive penchant. Similarly, a poor prediction tendency of trn-Static may be observed for the memory-intensive inputs, such as the CG benchmark.
Although the value of the LLC misses for CG is much higher than that for EP, the high bandwidth DDR4 is able to significantly overlap computational work with memory accesses [19]. Hence, its memory intensity is not enough to warrant a significant reduction in the core frequency, which is correctly detected by both trn-Static and eq-PerfMod. For the memory-intensive MG, trn-Static reduces both core and uncore frequency to 1.5 GHz and experiences significant performance degradation of ~20%. The reason for trn-Static performing poorly is that it fails to interpolate and generalize on the data generated during the runtime. The ML model with trn-Static seems to average when generalizing on the unseen data. Therefore, trn-Static performs poorly for all the NAS benchmarks.
The ML model with all-Runtime, on the other hand, for a given workload, adds training on a single input of the currently executed application per timeslice and, thus, is able to generalize much better. Such an advantage of all-Runtime, is even more pronounced in the MSN inputs. For the four MSN inputs, only all-Runtime succeeds in maintaining the performance constraint by primarily executing he MSN inputs at 1.1 GHz uncore frequency and the highest core frequency. The dynamic changes in the workload parameters listed in Table 1 are even more pronounced for the MSN inputs than those are for the NAS benchmarks. Hence, the poor performance of the trn-Static and eq-PerfMod is observed, which is due to the same reasoning as for the NAS benchmarks. They both operate the MSN inputs with the core frequency between 1.4 and 1.6 GHz and the uncore frequency between 1.5 and 1.7 GHz. Although the two switch to similar core-uncore frequency ranges, the trn-Static tends to execute the MSN inputs at the lower end of the range of the core frequencies.
Overall, across all the eight inputs, the average performance loss incurred by all-Runtime, trn-Static, and eq-PerfMod was 5.1%, 21.8% and 10.9%, respectively.
The eq-PerfMod, despite depending on the heuristic performance analysis, is dynamic in nature, contrary to trn-Static, and is updated with the most current LLC misses in each timeslice, thereby adapting at the runtime and yielding better energy savings than those obtained by the trn-Static in all the tested applications. Figure 8 shows the energy savings corresponding to the performance losses in Figure 7.
Since both all-Runtime and eq-PerfMod operated the EP benchmark at the lowest uncore frequency, they were able to reduce energy consumption by more than 20%. The trn-Static on the other hand, provided only 13% of energy savings. A maximum of only 4.8% in energy savings was achieved for CG benchmark  since, overall, it is neither memory-nor compute-intensive due to the DDDR4 memory effects. For the MG and FT benchmarks, all-Runtime yields more energy savings than the other two strategy variants do and it saves 6.8% and 10.4% of energy, respectively. Note that, for CG, MG, and FT, the trn-Static and eq-PerfMod provide similar energy savings. However, trn-Static achieves them by aggressively applying frequency scaling, and thereby breaching the performance constraint while eq-PerfMod applies frequency scaling more carefully and yields a much smaller performance degradation. By comparing broadly, a minor trend is observed where trn-Static gradually starts to improve its prediction (cf. Figure 8) for the MSN inputs with the increase in the size of the MSN calculation. Larger MSN inputs afford more opportunities, in terms of the number of timeslices, for the trn-Static to pretrain on the stabilized workload parameters beyond the erratic initialization phase, which may skew smaller MSN calculations. Overall, smaller calculations lead to more sensitivity in the trn-Static due to fewer data points available to capture a given workload changes reliably. Therefore, longer execution traces are preferred for the proposed ML model, which is consistent with the idea where more data produces better performing models. The maximum energy savings (26%) among all the inputs are obtained by all-Runtime for a GAMESS MSN input (msn-11) because all-Runtime uniformly executes the MSN inputs at the reduced uncore frequency without tinkering with the core frequency. The other two variants do reduce the core frequency for the MSN inputs, consequently resulting in their much lower energy savings. Overall, across all the eight inputs, the average energy savings by all-Runtime, trn-Static, and eq-PerfMod were 17.7%, 7.2%, and 9.9%, respectively.  on the msn-11 input during the first 50 seconds of its execution. It can be observed that all-Runtime keeps the core frequency at ~2.1 GHz while trn-Static and eq-PerfMod prescribe around 1.6 GHz and 1.8 GHz, respectively, thereby severely degrading the application performance. The reason for their selecting a relatively low core frequency values is that they both rely on a static performance model (static heuristic components for eq-PerfMod), which is not being dynamically updated at each timeslice as this is done in all-Runtime .

Conclusions and Future Work
In this paper, a runtime strategy using both DVFS and uncore frequency scaling is proposed to maximize energy savings for a parallel application under a given performance constraint. Machine learning based performance modeling was developed such that it incurs low overhead during its runtime usage while delivering a good prediction accuracy. Strategy variants with a training stage performed statically or dynamically were analyzed and compared with authors' previously developed strategy.
Experiments on a 28-core Haswell-EP platform with the NAS-NPB benchmarks and GAMESS MSN inputs showed that the proposed strategy provided significant energy savings with minimal performance degradation. Specifically, for an MSN input, 26% energy savings was achieved with a small 5% performance loss. Overall, a clear win of the proposed fully runtime ML model was demonstrated by its highest average energy savings of 17.7% with the lowest average performance loss of 5.1%.
Future work will focus on developing runtime power-limiting strategies driven by machine learning modeling that will maximize performance under a given power budget. The proposed runtime strategy will be further extended to multinode multi-GPU scenario where performance modeling for GPUs will be explored. Ensemble modeling options would be explored to boost the prediction V. Sundriyal, M. Sosonkina