Maximizing Performance under a Power Constraint on Modern Multicore Systems

Energy efficiency and energy-proportional computing have become a central focus in modern supercomputers. These supercomputers should provide high throughput per unit of power to be sustainable in terms of operating cost and failure rates. In this paper, a power-bounded strategy is proposed that maximizes parallel application performance under a given power constraint. The strategy dynamically allocates power to core, uncore, and memory power domains within a node to maximize performance under a given power budget. Experiments on a 20-core Haswell-EP platform for a real-world parallel application GAMESS demonstrate that the proposed strategy delivers performance within 4% of the best possible performance for as much as 25% reduction in the minimum power budget required for maximum performance.


Introduction
Power consumption has become a major concern for modern and future supercomputers. The top-ranked petascale computing platforms in the world typically consume power on the order of several megawatts, as shown in the biannual TOP500 list, which may cost on the order of several million dollars. In the quest for exascale performance, the growth rate of power consumption must slow down while systems deliver more calculations per unit of power. This gives rise to power-bounded computing, in which the components of a computing system operate under a fixed power budget such that performance is maximized.
Previous generations of Intel processors used either a fixed uncore frequency or a common frequency for the core and uncore. The uncore comprises the functions of a processor that are not handled by the cores, such as the L3 cache, the on-chip interconnect, and the memory controllers. Starting with the Intel Haswell micro-architecture, the core and uncore frequency domains have been decoupled, so that the uncore frequency can be modified independently of the core frequency, which is typically changed by dynamic voltage and frequency scaling (DVFS). The uncore frequency has a significant impact on the on-die cache-line transfer rates as well as on the memory bandwidth. By default, the uncore frequency is set by the hardware; it can also be specified via the model-specific register (MSR) UNCORE_RATIO_LIMIT [1]. This technique is denoted uncore frequency scaling (UFS). The latest Intel CPUs thus work with at least two clock-speed domains: one for the core (or even individual cores) and one for the uncore.
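To make the UFS mechanism more concrete, the following minimal sketch decodes and encodes a raw UNCORE_RATIO_LIMIT value. The bit layout (bits 6:0 holding the maximum ratio, bits 14:8 the minimum ratio, each in multiples of a 100 MHz reference clock) and the helper names are assumptions made for illustration; they are not taken from this paper and should be checked against the processor documentation before use. The sketch only manipulates integers and does not access any hardware register.

```python
from typing import Tuple

# Assumed reference clock: each ratio step corresponds to 100 MHz.
BCLK_MHZ = 100
# Assumed field width: 7 bits per ratio.
RATIO_MASK = 0x7F


def decode_uncore_ratio_limit(raw: int) -> Tuple[int, int]:
    """Return (min_mhz, max_mhz) encoded in a raw register value.

    Assumed layout: bits 6:0 = maximum ratio, bits 14:8 = minimum ratio.
    """
    max_ratio = raw & RATIO_MASK
    min_ratio = (raw >> 8) & RATIO_MASK
    return min_ratio * BCLK_MHZ, max_ratio * BCLK_MHZ


def encode_uncore_ratio_limit(min_mhz: int, max_mhz: int) -> int:
    """Build a raw register value from frequency bounds given in MHz."""
    if not 0 < min_mhz <= max_mhz:
        raise ValueError("expected 0 < min_mhz <= max_mhz")
    min_ratio = min_mhz // BCLK_MHZ
    max_ratio = max_mhz // BCLK_MHZ
    if max_ratio > RATIO_MASK:
        raise ValueError("ratio does not fit into the assumed 7-bit field")
    return (min_ratio << 8) | max_ratio
```

Under these assumptions, pinning the uncore to a single frequency amounts to writing the same ratio into both fields, for example `encode_uncore_ratio_limit(1100, 1100)`.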
In the authors' previous work [2] [3], the efficacy of UFS was explored in terms of its energy-saving potential and a strategy was proposed, which employed both DVFS and UFS to maximize energy savings for parallel application execution under a performance constraint. Experiments showed that larger energy savings can be achieved when UFS and DVFS are used jointly. In addition, joint and simultaneous DVFS of the processor and DRAM was explored in [4], where novel power and performance models were proposed.
The Intel Running Average Power Limit (RAPL) interface [5] provides MSRs containing energy consumption estimates for up to four power planes, or domains, of a machine:
• PKG: the entire package,
• PP0: the cores,
• PP1: the uncore subsystem (available on client-type platforms only, mainly used for general-purpose applications),
• DRAM: main memory (available on server-type machines only).
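Because RAPL exposes cumulative energy counters rather than instantaneous power, the power of a domain is usually derived from two counter readings taken some interval apart. The sketch below shows this arithmetic under stated assumptions: the readings are in microjoules (as exposed, for instance, by the Linux powercap files `energy_uj` and `max_energy_range_uj`), and the counter wraps at most once between the two readings. The function name is hypothetical.

```python
def average_power_w(e_start_uj: int, e_end_uj: int,
                    interval_s: float, max_range_uj: int) -> float:
    """Average power (W) of one RAPL domain between two energy readings.

    Assumptions: readings are cumulative microjoules, and the counter
    wrapped around at most once during the interval.
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    delta_uj = e_end_uj - e_start_uj
    if delta_uj < 0:
        # The counter wrapped: add its range to recover the true difference.
        delta_uj += max_range_uj
    return delta_uj / 1e6 / interval_s
```

The sampling interval should be short enough that a second wrap-around cannot occur, since this simple correction cannot detect it.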
The authors' previous research [6] considered primarily PP0 and DRAM domains for budgeting power to solve the parallel application performance optimization problem in the quantum chemistry software GAMESS [7] [8]. The present paper adds the PP1 (uncore) domain, similarly to the work described in [2], to solve this problem and proposes a power-bounded runtime strategy, which maximizes the parallel application performance under a given power budget. In essence, the work presented here may be considered as a combination of [2] and [6] because it determines optimal values for both uncore and core frequencies with the goal to distribute a given power budget to hardware components such that the application performance is maximized. Note that, because the server platform used in this work does not provide a separate PP1 interface to limit uncore power, UFS is used to achieve uncore power shifting within a given power budget. In a nutshell, the contributions of this work include:
• Determining the priority of the power budget allocation to the three domains, namely, core PP0, uncore PP1 and memory DRAM.
• Devising novel performance and power models to correlate changes in uncore frequency to PKG power consumption.
• Proposing a runtime power-bounded strategy to maximize parallel application performance under a given power budget by carefully allocating power to PKG, DRAM and uncore domains.
• Maximizing performance of a quantum chemistry application GAMESS under power constraints.
The rest of the paper is organized as follows. Section 2 provides the related work. Section 3 studies power allocation priorities among power domains. Section 4 proposes performance and power models. Section 5 develops the runtime strategy to maximize performance under a given power budget for any parallel application. Section 6 shows experimental results while Section 7 provides conclusions.

Related Work
Power is one of the most prominent HPC challenges, forcing the objectives and approaches of HPC power management to evolve continuously. Therefore, extensive research has been conducted to measure, model, and budget power on computer components and systems. This section briefly discusses previous work in power capping and closely related work in system-level power and energy savings.
The two most commonly used techniques to limit the power consumption of a node are 1) DVFS or throttling of the processor and memory [9] [10] and 2) hardware-enforced power bounds through RAPL [5]. The authors in [11] propose a runtime system, termed Conductor, that dynamically distributes available power to different compute nodes and cores based on the available slack to improve performance. It also scales the processor frequency up or down to decrease execution time and to save energy indirectly through power clamping. Reference [12] explores coordinated power allocation among different components within a node and, based on the resulting observations, proposes an optimal power allocation strategy. The authors in [10] propose models that predict the performance of HPC computations under varying power caps for different components in a node. A cluster-level power allocation framework, termed CLIP, was proposed in [13]; it performs application characterization along with performance modeling to allocate the power budget to nodes and their components so as to maximize performance in a cluster.
The work in [14] discusses a hardware-level power capping strategy for limiting DRAM power consumption. A multi-level, hierarchical, variation-aware approach to power management is proposed in [15], which partitions the system power budget across jobs at the macro level and evaluates the power allocation based on application performance metrics at the micro level. Hardware overprovisioning is exploited in [16], which proposes a scheme for determining the optimal number of nodes while distributing power between the CPU and memory. The design of a power scheduler capable of enforcing power bounds by employing dynamic system-wide power reallocation was discussed in [17].
Most of the work discussed in this section focused primarily on redistributing power between the processor core (PP0) and memory (DRAM) domains, whereas the uncore (PP1) domain has largely been ignored. This paper considers the uncore domain and proposes a strategy that resolves the power allocation problem to maximize system throughput at runtime.

Power Allocation Priority
To allocate a given power budget appropriately among different RAPL domains, it is imperative to determine the order in which power should be distributed among them because insufficient allocation to a power domain may have drastic negative effects on the application performance. In Figure 1(b), a black vertical bar indicates that, for the (PKG, DRAM) allocation pair of (42, 58) W, marked with horizontal dashes where the bar crosses the corresponding power-limit lines, the PKG and DRAM power consumptions are observed as 42 W and 7 W, respectively, and the execution time is 36.9 seconds.
It can be observed from Figure 1 that an insufficient DRAM power allocation degrades the application performance drastically, whereas reducing the PKG power allocation essentially modifies the operating frequency of the processor cores. Therefore, given a specific power budget, the DRAM domain must have the highest priority of all three power domains when it comes to allocating the power budget. As for the PKG and uncore power domains, the power allocation between them may be decided by using a performance model proposed in [2].

Performance and Power Modeling
To distribute the power budget effectively so that the application performance is maximized, a fine-grained performance model is needed. A power model is also required to correlate the variation in core and uncore frequency with the resultant power consumption so that the power limits can be applied effectively. In this section, the two models are discussed.

Performance Model
A performance model proposed in a previous work [2] is used here. This model, given in Equation (1), correlates application performance, expressed in micro-operations retired, with particular core $f_c(i)$ and uncore $f_u(j)$ frequencies, which are expressed by their corresponding levels $i$ and $j$, numbered from the highest frequency to the lowest.

Power Model
The processor power consumption, denoted $P_T(i, j)$, can be expressed as a function of the core and uncore frequencies at levels $i$ and $j$, respectively, as given in Equation (2) [2]. In this model, $P_s$ stands for the processor static power consumption, which was measured as 12 W through RAPL. Since uncore (PP1) power limiting is not supported in Intel server processors, the power model in Equation (2) is required to relate the power consumption of the core and uncore domains to the corresponding levels of the core and uncore frequencies. The parameters $k_1$ and $k_2$ were determined by a regression analysis of the processor power obtained through the RAPL registers at different core and uncore frequencies for several benchmarks. The values of $k_1$ and $k_2$ were found to be 0.97 and 0.46, respectively, indicating that changes in the core frequency affect the processor power consumption more than changes in the uncore frequency do. Given a power budget for the three domains (PP0, PP1, and DRAM) on a server-type platform, power is shifted between the core and uncore domains by first modifying the uncore frequency and then adding the corresponding reduction in power to the power limit of the core domain to maximize performance. Equation (3) gives the resulting PKG power limit when the uncore frequency is scaled to level $j$. In this manner, the reduction in power obtained through UFS is transferred to the PKG power limit to increase the core frequency and, subsequently, to improve performance.
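The power shifting described above can be illustrated with a small arithmetic sketch. The formula used here (PKG limit = total budget minus measured DRAM power plus the power saved by lowering the uncore frequency) is an inference that is consistent with the EP figures quoted in the experimental section (a 60 W budget, about 3 W of DRAM power, and about 10 W saved through UFS yield a 67 W PKG limit). It is not a reproduction of Equation (3), and the function name is hypothetical.

```python
def pkg_power_limit_w(budget_w: float, dram_w: float,
                      uncore_saving_w: float) -> float:
    """Assumed PKG power limit after power shifting.

    DRAM is served first with its measured consumption; the power saved
    by lowering the uncore frequency is then added to the PKG limit.
    This is an illustrative reading, not the paper's Equation (3).
    """
    if min(budget_w, dram_w, uncore_saving_w) < 0:
        raise ValueError("power values must be non-negative")
    if dram_w > budget_w:
        raise ValueError("DRAM demand exceeds the total budget")
    return budget_w - dram_w + uncore_saving_w


# EP.C.20 under a 60 W budget, using the approximate values from the text.
print(pkg_power_limit_w(budget_w=60, dram_w=3, uncore_saving_w=10))
```

In this reading, the uncore saving acts as headroom: the cores may run at a higher frequency than the nominal remainder of the budget alone would allow.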

Runtime Power-Bounded Strategy
The proposed runtime strategy is based on the history-window predictor [4], which employs a window of the previous L values of a measured parameter and predicts its next value as some function g of these past L values. To implement this prediction mechanism, two registers of length L, denoted CPR and MPR, are maintained to record the values of $CPM_{exe}$ and MAPM, respectively. If a register is not filled, then the corresponding quantity is considered unchanged from the previous prediction. Figure 2 displays the algorithmic steps of the proposed runtime strategy. In Step 7, the optimal core-uncore frequency pair is determined, such that the predicted number of micro-operations retired is at its maximum. In Step 8, the total power consumed at the chosen frequency pair is determined using Equation (2). In Step 9, the power limit for DRAM is set to the measured DRAM power consumption, while the PKG power limit is set as in Equation (3), to be used in the next timeslice.
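A history-window predictor of this kind can be sketched in a few lines. The class below is a minimal illustration, not the authors' implementation: the class name, the choice of the arithmetic mean as the function g, and the initial prediction are all assumptions. It keeps the last L measurements and, as described above, leaves the prediction unchanged until the window is full.

```python
from collections import deque
from statistics import mean


class HistoryWindowPredictor:
    """Predict the next value of a metric from its last `length` samples."""

    def __init__(self, length, g=mean, initial=0.0):
        if length < 1:
            raise ValueError("window length must be at least 1")
        # A bounded deque drops the oldest sample automatically.
        self.window = deque(maxlen=length)
        self.g = g                  # assumed aggregation function (mean)
        self.prediction = initial   # assumed starting prediction

    def update(self, value):
        """Record a new measurement and return the current prediction."""
        self.window.append(value)
        if len(self.window) == self.window.maxlen:
            self.prediction = self.g(self.window)
        # If the window is not yet full, the previous prediction is kept.
        return self.prediction


# One predictor per recorded quantity, mirroring the CPR and MPR registers.
cpr = HistoryWindowPredictor(length=3, initial=1.0)
for sample in (2.0, 4.0, 6.0, 8.0):
    print(cpr.update(sample))
```

Passing a different callable as `g` (for example `max` or a weighted average) changes how strongly the prediction reacts to recent phases of the application.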

Experimental Results
The experiments were performed on a compute node, termed Gwent, with two Intel Xeon E5-2630 v3 10-core Haswell-EP processors and 32 GB (4 × 8 GB) of DDR4 memory. The core and uncore frequency ranges are 1.2-2.3 GHz and 0.8-2.9 GHz, respectively. To measure the node power and energy consumption, a Wattsup power meter is used with a sampling rate of 1 Hz.

GAMESS
GAMESS [7] [19] is one of the most representative freely available quantum chemistry applications used worldwide to do ab initio electronic structure calculations. A wide range of quantum chemistry computations may be accomplished using GAMESS, ranging from basic Hartree-Fock and Density Functional Theory computations to high-accuracy multi-reference and coupled-cluster computations.
The central task of quantum chemistry is to find an (approximate) solution of the Schrödinger equation for a given molecular system. An approximate (uncorrelated) solution is initially found using the Hartree-Fock (HF) method via an iterative self-consistent field (SCF) approach, and then improved by various electron-correlated methods, such as second-order Møller-Plesset perturbation theory (MP2). The SCF-HF and MP2 methods are implemented in two forms, namely direct and conventional, which differ in the handling of electron repulsion integrals (ERI, also known as 2-electron integrals). Specifically, in the conventional mode, all ERIs are calculated once at the beginning of the iterations and stored on disk for subsequent reuse, whereas in the direct mode, ERIs are recalculated for each iteration as necessary. The SCF-HF iterations and the subsequent MP2 correction find the energy of the molecular system, followed by evaluation of energy gradients.
Data Server Communication Model: The parallel model used in GAMESS was initially based on replicated-data message passing and later moved to MPI-1. Fletcher et al. [20] developed the Distributed Data Interface (DDI) in 1999, which has been the parallel communication interface for GAMESS ever since. DDI was later adapted to symmetric-multiprocessor (SMP) environments featuring shared-memory communications within a node [21] and was generalized in [22] to form groups out of the available nodes and schedule tasks to these groups. In essence, DDI implements a PGAS programming model by employing a data-server concept.
Specifically, two processes are usually created in each PE (processing element) to which GAMESS is mapped, such that one process does the computational tasks while the other, called the data server, just stores and services requests for the data associated with the distributed arrays. Depending on the configuration, the communications between the compute and data server processes occur either via TCP/IP or MPI. A data server responds to the data requests initiated by the corresponding compute process, for which it constantly waits. If this waiting is implemented with MPI, then the PE is polled continuously for the incoming message, thereby being always busy. Hence, it is preferred that a compute process and data server do not share a PE to avoid significant performance degradation. When executing on a 2N-processor machine, the compute C and data server D process ranks are assigned as follows:

Experiment Setup
The NAS parallel benchmarks (NPB) [18] and GAMESS were used to evaluate the efficacy of the proposed runtime strategy and to validate the modeling effort, as NPB provides a good mix of compute- and memory-intensive benchmarks to test the core, uncore and DRAM power limiting addressed in this work. The first GAMESS calculation was set up to perform the third-order Fragment Molecular Orbital (FMO3) [23] calculation, in the conventional mode, for a cluster of 64 water molecules at the Restricted Hartree-Fock RHF/6-31G level of theory. As such, it involves calculations of fragment monomers, dimers, and trimers. The system is partitioned into 64 fragments such that each fragment is a unique water monomer; this calculation is referred to as h2o-64 in the rest of the paper. The second GAMESS calculation also performs an FMO3 calculation, on 20 water molecules at the MP2/6-31G(d, p) level of theory. As such, each fragment N-mer (monomer, dimer, and trimer) is calculated sequentially using all compute elements allocated to the GAMESS executable. Three-body calculations at the RHF/6-31G(d, p) level of theory are also performed and are critical in order to capture the significant exchange and charge-transfer effects present in a cluster of water molecules. This calculation is referred to as wat-20 in the rest of the paper.
Table 1 lists the PKG and DRAM power consumptions, with a 100 W power budget, for the three NAS benchmarks EP, CG, and LU and the two GAMESS calculations executing at the highest core and uncore frequencies on Gwent. It can be observed from Table 1 that the compute-intensive benchmark EP tends to have lower DRAM power consumption, due to its low memory utilization, than the rest of the test cases, which are more memory-intensive [4]. For all the inputs, the total power consumption ranges from 80.2 to 88 W. Therefore, to stress-test the proposed runtime strategy, three power budgets of 70, 60, and 50 W were chosen because they are substantially lower than the power consumption needed to maintain maximum performance for these input benchmarks. Figure 3 shows the performance degradation for each input when the proposed runtime strategy is used to distribute the chosen power budgets of 70, 60, and 50 W.

Strategy-Guided Performance under a Power Budget
EP.C.20: For the highest power budget of 70 W, the strategy selects the highest core frequency and a low uncore frequency of 1.1 GHz, which results in a performance degradation of 1%. These frequencies were chosen by the strategy because the EP benchmark is substantially compute-intensive and any decrease in the core frequency may substantially degrade performance. Therefore, when only the uncore frequency is reduced, its equivalent additional available power is added to the PKG power budget, bringing it close to the 77 W needed for the maximum performance. When the total power budget is reduced to 60 W, the uncore frequency is reduced to its lowest value. This reduces the PKG power consumption by ~10 W and subsequently provides an opportunity to increase the allocated PKG power to 67 W, as obtained from Equation (3) and the measured PKG and DRAM power consumptions of ~57 W and ~3 W, respectively. However, this extra power allocation due to the uncore frequency downscaling is not enough to enforce the given power budget of 60 W without also reducing the core frequency from its highest value. Therefore, a performance degradation of 13% was observed for the reduced core frequency of 2.1 GHz. Similarly, the power budget of 50 W resulted in a performance degradation of 40% since the core frequency had to be reduced even further to accommodate the tight power constraints.
Table 1. PKG and DRAM power consumption (W) of NAS NPB benchmarks and GAMESS inputs to achieve the maximum performance with a 100 W power budget. In the NAS benchmark column names, the two-letter prefix denotes the benchmark name, "C" stands for class C, and the two-digit suffix states the number of processes used.
CG.C.16: When the power budget is 70 W, the uncore frequency is set to 2.1 GHz by the strategy, and the resultant performance degradation is 8%. Even though CG is a memory-intensive benchmark, as was determined from Equation (1) and [24], scaling the uncore frequency results in a smaller performance loss than reducing the PKG power limit and, thus, the core frequency. The 60 W and 50 W power budgets result in 13% and 21% performance losses, respectively.
LU.C.16: Its memory intensity lies between that of EP and CG. Therefore, the performance degradation under the three power budgets lies between the corresponding performance-loss values of the EP and CG benchmarks.
GAMESS calculations: The two GAMESS inputs h2o-64 and wat-20 are mostly compute-intensive (see Table 1) throughout their execution, and their compute processes are somewhat memory-intensive at certain execution phases as compared with the data servers. At the 70 W power budget, following the runtime strategy, the data servers are operated at the minimum uncore frequency and the compute processes operate at the 1.1 GHz uncore frequency throughout the execution. Consequently, the performance losses for h2o-64 and wat-20 are 1% and 2%, respectively, the majority of which is due to the overhead of the strategy itself. When the power budget is reduced further to 60 W, the uncore frequency for both the data servers and compute processes is scaled to its minimum value and the additional power availability allows the PKG limit to be set to 66 W, leading to a core frequency of 2 GHz. This results in 4% and 8% performance losses for h2o-64 and wat-20, respectively. The 50 W power budget pulls the PKG power allocation down to 56 W, which requires the core frequency to be reduced even more and results in an average performance loss of 32% for these GAMESS calculations.

Minimum Power Budget for GAMESS Calculations
The proposed strategy does not take into account the specifics and knowledge of the given application. Hence, its decisions may not be optimal for a particular application; this is the trade-off for using the strategy as a "black box" that maintains good performance under power-budget constraints for a variety of applications.
In order to find a minimum power budget that keeps the GAMESS performance at its maximum, knowledge of the relative performances of data servers and compute processes is needed. As explored in a previous work, data server performance is not affected at all by DVFS [25]. With this knowledge, the minimum power budget required on Gwent for the GAMESS calculations considered here to run without any performance degradation (and without using the proposed strategy) is 59 W. Under this power budget, the core frequency of the data servers is reduced to its lowest value of 1.2 GHz, and the PKG portions for the compute processes and data servers and the DRAM are allocated 36, 18, and 5 W, respectively.

Conclusions
In this paper, a runtime strategy that employs UFS to redistribute the power budget was proposed. The strategy may be used as a "black box" to maximize parallel application performance under a given power budget. Power and performance models were devised and deployed in a runtime strategy to dynamically apply power limiting to the PKG and DRAM power domains along with UFS in a user-transparent manner. Experiments on a 20-core Haswell-EP platform with the NAS parallel benchmarks and a real-world test case of two GAMESS calculations showed that the strategy provided near-maximum performance even with substantially limited power budgets. Specifically, for a GAMESS calculation, a 25% reduction in the power consumption resulted in only a 4% performance loss. It was also observed that, even for memory-intensive applications, the strategy chose to reduce the uncore frequency first under a power budget instead of reducing the PKG power limit (i.e., the frequency of the cores).
V. Sundriyal et al.
Future work will focus on testing the efficacy of the PKG power limiting on platforms with DDR3- and DDR4-based memory architectures and on accelerators, such as GPUs. Taking into account the application-architecture behavior, and thereby developing a "gray-box" strategy for runtime power allocation, will also be studied. While inter-process communications are explicitly targeted in the authors' previous works [26] [27] to obtain energy savings, the future plan also includes adapting and testing the proposed strategy on a distributed system.