FSM Based DFS Link for Network on Chip

Because low power consumption is the main design issue in a network on chip (NoC), researchers are concentrating on both algorithmic and architectural approaches. The conventional Dynamic Frequency Scaling (DFS) and History-based Dynamic Frequency Scaling (HDFS) algorithms are used to process energy-constrained data traffic. Although these conventional algorithms achieve higher energy efficiency, they degrade performance because of the additional latency between clock domains. In this paper, we present a variable power optimization interface for NoC that uses a Finite State Machine (FSM) approach to improve performance. The parameters are estimated using 45 nm TSMC CMOS technology. Compared with the DFS system, the evaluation results show that the FSM-DFS link achieves 81.55% dynamic power savings on the links of the on-chip network and 37.5% leakage power savings on the link. The proposed work is also evaluated for various performance parameters, and the simulation results are superior to those of conventional work.


Introduction
The design complexity of a Network-on-Chip (NoC) stems from the number of steps involved in the design process, time-to-market pressure and design cost. Previous NoC research has been dedicated to increasing the processing speed and analyzing system-level performance [1]. An NoC provides extremely high bandwidth by distributing the propagation delay across multiple switches, which may cause a power disturbance in the circuit [2]. This NoC architecture consists of floating-point cores and packet-switched routers at 4 GHz. 15-F04 has mesochronous clocking and various techniques. The 65 nm 100 M transistor is designed to [...] based observer; 2) frequency selection (Table 3); 3) FSM-DFS link performer; 4) Clock Distribution Network (CDN). In this paper, we present a control block that uses an FSM-based dynamic frequency scaling (FSM-DFS) method along with adaptive strategies to avoid process variations and reduce power consumption. The CDN is the clock-splitting mechanism that validates clock and data and forwards the clock to the router unit in the NoC. The dynamic power consumption in the CDN is reduced by the proposed adaptive clock gating scheme.
In this paper, we advocate the use of an FSM-based DFS link. A traffic estimator estimates the traffic rate, and the corresponding traffic ID is passed to the router unit. Next, an FSM-DFS algorithm is proposed and applied to the NoC link. Finally, power saving is achieved in the on-chip interconnection network. To the best of our knowledge, this is the first investigation of power reduction for on-chip interconnection networks based on the clock boosting mechanism. The approach is used to predict the impact of the DFS policy on system performance, to reduce complexity, and to improve overall performance under various traffic scenarios. This work proposes a distributed network that can operate individual routers at different frequency levels effectively under various traffic scenarios.
The rest of this paper is organized as follows. Related work is introduced in Section 2. The system model is introduced in Section 3. The problem constraints are discussed in Section 4. The proposed FSM-based DFS link is discussed in Section 5. The analytical model of the performance measures is discussed in Section 6. The experimental results are presented in Section 7. Finally, conclusions are drawn in Section 8.

Related Work
Low-power network-on-chip design has become a vital paradigm in CMOS technology. Because an NoC consumes a significant portion of the total chip power in multicore systems, low-power on-chip design offers a general approach to overall performance improvement. Recent studies of low-power NoC architectures [26]-[28] put this share at 10%-36%. The need for latency-aware and power-aware NoCs is therefore a serious issue in designing low-power multicore systems. To meet these requirements, designers have introduced several dynamic voltage and frequency scaling algorithms for application-aware and traffic-aware systems. Conventional scaling algorithms handle the power concern in a single way. Previous researchers have proposed DFS for general-purpose [29] [30] and multimedia applications [31]. However, these works focus on either processor or cache power reduction. More recently, frequency scaling algorithms for NoCs have been proposed to reduce the power dissipated by additional components such as the interconnect and the cores. Some past work offers similar methods for NoCs [27] [28] [32] by scaling the voltage/frequency of individual routers, links, or the whole network. However, these results still focus on the general-purpose domain, whereas traffic-aware systems for multimedia applications are an emerging research area. When the conventional designs are examined under heavy traffic, the major problem is performance degradation. Those works focus specifically on either power or latency, or at most on both of these NoC metrics. To provide better solutions in terms of end-to-end delay and other performance parameters, we introduce an FSM-DFS for NoC.

System Model
The key idea behind the NoC router boosting mechanism is the use of a frequency selection table. The functional diagram of the FSM-based DFS is shown in Figure 1. The FSM-DFS comprises an FSM-based observer, a frequency selection table and a router. With this method, FSM-DFS is intended to perform better than the conventional low-power algorithms.

FSM Based Observer
The FSM-based observer collects the traffic information from the router and provides the traffic ID to the frequency selection table. Frequency selection takes place with respect to the traffic ID: the corresponding frequency is selected via the frequency selection table and given to the router.
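The lookup performed by the frequency selection table can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in this work: the traffic IDs t1 to t4 are the ones named in the FSM-DFS Link section, while the frequency values and the names `FREQUENCY_TABLE` and `select_frequency` are hypothetical.

```python
# Hypothetical frequency selection table: traffic ID -> link frequency in GHz.
# The IDs t1..t4 follow the text; the frequency values are placeholders.
FREQUENCY_TABLE = {"t1": 1.0, "t2": 2.0, "t3": 4.0, "t4": 4.0}


def select_frequency(traffic_id, table=FREQUENCY_TABLE):
    """Return the frequency for a traffic ID, or None if the ID is unknown."""
    return table.get(traffic_id)
```

Returning `None` for an unknown ID leaves the decision about what to do next (for example, going back to the selection state) to the caller.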

System Performance Model
Dynamic frequency scaling and history-based dynamic frequency scaling are used to observe power consumption, latency and energy consumption. For various traffic benchmarks, the traffic information $tr$ occupies an $N_{tr}$-tile region, where the frequencies of the tiles are $f_1, f_2, \ldots, f_{N_{tr}}$. Due to the traffic information, the tile regions can overlap. The cache memory is assumed to be used by many functional modules in the core. The traffic estimator is assumed to map only a single tile region of the core. The average traffic information that reaches the traffic estimator is denoted by $T$. Thus, we have Equation (1), where $N$ is the total number of tiles in the NoC core. The performance of each core under various traffic benchmarks is observed and, in execution cycles, is modeled in terms of the frequencies of its region/tiles, as follows.
The execution time is measured in cycles and follows a new form of regression model [33]. It is modeled after Bishop et al. and expressed in terms of the frequencies of the tiles in the NoC core (see Equation (2)).

Using this execution time, we introduce a new model in cycles (see Equation (3)). This model is refined from Bishop et al. according to various traffic benchmarks. Because of the traffic, threshold regression may occur. To avoid this issue, the proposed model is introduced, which satisfies the regression error.
where $\beta_i$ is the regression coefficient with respect to the frequency $f_i$ of the region/tiles and $T$ is the total traffic information.

Two Levels of Dynamic Power Model
Assume first that the NoC cores work at the same voltage level. The dynamic power of an NoC core under various traffic benchmarks ($N_{tr}$) is then expressed by Equation (4), where $\beta_j$ is the switching activity, $CE_j$ is the effective capacitance and $Volt$ is the voltage. Assume next that the NoC cores work at a variable voltage level and variable frequency. The dynamic power of an NoC core under various traffic benchmarks ($N_{tr}$) is then expressed by Equation (5), where $K$ is a constant.
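The two power levels can be illustrated with a short sketch. It assumes the textbook CMOS switching-power relation (switching activity × effective capacitance × voltage squared × frequency, summed over the tiles of a region) for the fixed-voltage case, and assumes that the voltage scales in proportion to the frequency for the variable-voltage case, so that power grows with the cube of the frequency. Both forms are assumptions chosen to match the symbols defined above; the function names are hypothetical.

```python
def dynamic_power_fixed_voltage(tiles, volt):
    """Sum beta_j * CE_j * Volt^2 * f_j over the tiles of one region.

    Each tile is a (beta, ce, f) tuple: switching activity, effective
    capacitance in farads, and frequency in hertz. This is the textbook
    CMOS relation, assumed here for the fixed-voltage case.
    """
    return sum(beta * ce * volt ** 2 * f for beta, ce, f in tiles)


def dynamic_power_scaled_voltage(tiles, k):
    """Variable-voltage case, assuming Volt is proportional to f.

    Under that assumption the power of each tile grows with f^3, and the
    proportionality factor is folded into the constant k.
    """
    return k * sum(beta * ce * f ** 3 for beta, ce, f in tiles)
```

With a switching factor of 0.5, a supply of 1 V and a 1 GHz clock (the operating point quoted for the experiments) and an assumed effective capacitance of 1 pF, the first function gives 0.5 mW for a single tile.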

Problem Constraints
With the previous models, the energy planning problem aims to reduce the peak latency within the input power budget. With $T$ traffic scenarios, each occupying an $N_{tr}$-tile region, we have a sum over $tr = 1, \ldots, T$, where each $P_{tr}$ is the power budget for application $tr$ at a given time $t$, and $w_{tr}$ is the user-defined priority weight of the traffic scenario. To provide better solutions, the new objective function is optimized subject to the following constraints:
1) Traffic constraints: The distribution ratio between a given (source, destination) pair should be equal to 1 under average traffic (low to high) mode.
where $src$ is the source connected to the transmitter side of the router, $dest$ is the destination connected to the receiver section of the router, $num$ is the number of iterations, $L_{src,dest}$ is the link between $src$ and $dest$ connected to the NoC router and $C$ is the constraint.
2) Bandwidth constraints: The cumulative bandwidth used for a link should not surpass the link capacity.

$\text{Bandwidth} = pi \times \text{size} \times \text{frequency}$
where $pi$ is the packet injection rate.
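The bandwidth constraint can be checked with a few lines of code. The sketch below assumes that the bandwidth of a flow is the product of injection rate, packet size and frequency, and that the flows sharing a link are given as a list; the function names and the units (packets per cycle, bits per packet, hertz) are assumptions made for illustration.

```python
def flow_bandwidth(pi, size, frequency):
    """Bandwidth demand of one flow: injection rate x packet size x frequency."""
    return pi * size * frequency


def link_within_capacity(flows, capacity):
    """True if the cumulative bandwidth of all flows on a link fits its capacity.

    `flows` is an iterable of (pi, size, frequency) tuples and `capacity`
    is expressed in the same unit as the flow bandwidth (here bits/s).
    """
    total = sum(flow_bandwidth(pi, size, f) for pi, size, f in flows)
    return total <= capacity
```

For example, two flows with injection rates of 0.1 and 0.2 packets per cycle, 64-bit packets and a 1 GHz clock need about 19.2 Gbit/s in total, which fits a 32 Gbit/s link but not a 16 Gbit/s one.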
3) End-to-end delay constraint: To examine the results of various benchmarks, we define QoS requirements in terms of speed and end-to-end (ETE) delay for each class of service [34]. The ETE delay is measured in clock cycles of the link. To solve the energy budget problem under various traffic scenarios, an FSM-based model is formed to regulate these problems with respect to the traffic information. The corresponding frequency is allotted to the corresponding $N_{tr}$-tile region.

Proposed System
The FSM-DFS is a traffic-aware solution that improves both latency and power consumption. In this work, we model the procedure with four states, namely selection of processor and approximate frequency (as in HDFS), traffic observation, traffic ID departure, and desired frequency, using a Mealy machine model in the router.

FSM
In the FSM model, the output of the circuit is obtained from a set of states (i.e., every output is defined by a state). A state register holds the state of the machine, next-state logic decodes the next state, and an output register provides the output of the machine. The entire algorithm is explained in detail as one process with a reduced hardware system in the FSM.
E. Sakthivel et al.

FSM-DFS Link
The proposed FSM has a state diagram constructed for the Barnes benchmark with 16 particles, which is split into four terms, namely t1, t2, t3, and t4. The selection process selects the processor and the appropriate frequency in order to obtain the desired frequency. Traffic observation examines the traffic during processing. The traffic ID is then sent in order to place the desired frequency in the router. There are two input signals, clock and reset. On each positive clock edge, the machine continues to work. When reset is asserted, the machine returns to its initial state.

Design Methodology
The state diagram has the four states mentioned earlier. If reset is asserted initially, the machine is set to select the processor/frequency/traffic, which is the initial state of the process. Then, the user selects the traffic to distribute, choosing among t1, t2, t3, and t4. The processor verifies the selected traffic information. If the traffic is selected as the user requires, the traffic ID is sent. Finally, the exact frequency for the selected traffic is generated, and this frequency is placed in the router. If the traffic is not available in the processor, the control unit returns to the selection process after a reset. The complete methodology is explained in the flow diagram shown in Figure 2.
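The four-state behavior described above can be sketched as a small Mealy-style machine in software. This is only an illustrative model under assumptions: the state names, the `step` and `run` helpers and the frequency values in `FREQ_GHZ` are hypothetical, and the hardware realization in the router is not specified at this level of detail.

```python
from enum import Enum, auto


class State(Enum):
    SELECT = auto()    # selection of processor and approximate frequency
    OBSERVE = auto()   # traffic observation
    SEND_ID = auto()   # traffic ID departure
    SET_FREQ = auto()  # desired frequency placed in the router


# Hypothetical traffic-ID-to-frequency map (GHz); values are placeholders.
FREQ_GHZ = {"t1": 1.0, "t2": 2.0, "t3": 4.0, "t4": 4.0}


def step(state, traffic, reset=False):
    """One clock edge of a Mealy-style machine: returns (next_state, output).

    The output depends on both the current state and the inputs. It is None
    except when a traffic ID or a frequency is emitted.
    """
    if reset:
        return State.SELECT, None
    if state is State.SELECT:
        return State.OBSERVE, None
    if state is State.OBSERVE:
        # Unknown traffic sends the machine back to the selection state.
        if traffic in FREQ_GHZ:
            return State.SEND_ID, None
        return State.SELECT, None
    if state is State.SEND_ID:
        return State.SET_FREQ, traffic          # emit the traffic ID
    return State.SELECT, FREQ_GHZ.get(traffic)  # emit the frequency


def run(traffic, cycles=4):
    """Drive the machine from reset and collect the non-empty outputs."""
    state, _ = step(State.SELECT, traffic, reset=True)
    outputs = []
    for _ in range(cycles):
        state, out = step(state, traffic)
        if out is not None:
            outputs.append(out)
    return outputs
```

Calling `run("t2")` walks through all four states once and yields the traffic ID followed by its frequency, whereas an unknown ID such as `"t9"` never leaves the selection/observation loop and produces no output.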

Proposed FSM Model
The proposed FSM is based on a State Assignment Process (SAP) that targets a low-power and effective communication link for NoC. The proposed FSM model operates in two stages: (a) the traffic information ID assignment stage and (b) the frequency boost performing stage.
Traffic information ID assignment stage: In this stage, the FSM-based SAP assigns a traffic information ID to all possible pairs of states, which is an estimate of the similarity of the states to one another. This stage computes the traffic information ID, as represented in Algorithm 1.
To compute this ID, all the state sets are examined first. For state $num$, the edge traffic information (under various traffic modes) of the state sets is checked from $(1, num)$ to $(num-1, num)$.
Let the $num-1$ states be distributed among all the nodes in the router, so that no two nodes get the same sets and no conflicts arise. In a router, every node updates the traffic information independently.
Algorithm 1: The traffic information ID assignment stage.
Result: Computation of the traffic information in all states.
1) For x = 1 to Ns do
2) For y = 1 to num-1 do
3) Compute the traffic information weight ($s_{src}$, $s_{dest}$)
4) End
5) End
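The nested loops of Algorithm 1 can be written out as a short runnable sketch. It assumes that the inner bound `num` is the index of the state currently visited by the outer loop, so that the pairs (1, num) to (num-1, num) are covered, and it takes the weight computation as a caller-supplied function because the weight itself is not defined in the text. The names `assign_traffic_weights` and `weight_fn` are hypothetical.

```python
def assign_traffic_weights(num_states, weight_fn):
    """Compute a traffic information weight for every pair of states.

    For each state x, the pairs (1, x) .. (x-1, x) are visited, mirroring
    the two nested loops of Algorithm 1. `weight_fn(src, dest)` is supplied
    by the caller because the text does not define the weight itself.
    """
    weights = {}
    for x in range(1, num_states + 1):
        for y in range(1, x):
            weights[(y, x)] = weight_fn(y, x)
    return weights
```

For four states this visits six pairs, one for each unordered pair of distinct states.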
Frequency scaling performing stage: The proposed frequency scaling stage assigns a unique frequency pattern to each state of the FSM model. Each state is represented by simple counter and controller logic. Our proposed work implements the FSM model using the split and performer modules together with parallel operations.
Parallel operation has already been used in much research for low-power and high-speed operation. We took the basic information on parallel operation from Samman et al. [35]. A common configuration is preserved across the split module, the frequency scaling performer and the parallel operations. The principle of parallel operation applies simultaneously to the frequency scaling operations.
On the other hand, splitting and frequency scaling performed with respect to the traffic ID let the routers operate in parallel. We use the same default initial and stopping frequency boost in the router using the FSM model. At a higher traffic rate, variable-range frequency scaling is accepted with respect to the traffic threshold, and a history-based dynamic frequency scaling is introduced with respect to the traffic state, where the traffic ID sending and frequency scaling operations are performed with the router. At a lower traffic rate, low-range frequency scaling is accepted, and a dynamic frequency boost is introduced with respect to the traffic state. The general frequency boost performing algorithm is as follows:
1) Start with an initial and a stopping frequency boost process in the router of states.
2) For a given traffic input, select two states at the traffic threshold and assign the frequency boost process, or interchange the frequencies of the current state and the ideal state.
3) Compute the frequency change of each core.
4) The frequency scaling and traffic estimation processes are managed by the state of the FSM model.
5) Admit the interchange for a lower traffic condition. Allow the frequency boost process to be accepted even under a higher traffic condition in the router.
6) Repeat steps 2-5 until the traffic ID falls below zero. Then the lower traffic process is accepted and the corresponding frequency boost operations take place.
7) Stop if the traffic ID reaches zero.
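The threshold idea behind the boost stage (low-range scaling at low traffic, a wider range at high traffic) can be illustrated with a simplified selector. This sketch does not reproduce the seven steps above: it only shows a threshold-based choice among the 1×, 2× and 4× boosting levels reported in the results, and the utilization thresholds and the names `BOOST_LEVELS`, `MAX_BOOST` and `boost_factor` are assumptions.

```python
# Hypothetical thresholds on link utilization (0..1) paired with boost factors.
BOOST_LEVELS = ((0.25, 1), (0.60, 2))
MAX_BOOST = 4


def boost_factor(utilization):
    """Pick 1x, 2x or 4x boosting from the observed link utilization."""
    for threshold, factor in BOOST_LEVELS:
        if utilization < threshold:
            return factor
    return MAX_BOOST
```

Keeping the thresholds in a table rather than in the control flow makes it easy to retune them for a different traffic benchmark.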

Performance Measure-Analytical Model
We examined performance parameters such as delay, data rate, energy and static power consumption in a network-on-chip. For a better overview, the performance parameter model is summarized here.
1) To estimate the latency of a flow, it is necessary to evaluate the waiting time of packets at the routers.
2) Bandwidth estimation.
3) The power consumption and link power are calculated recursively for each communication path, starting from the receiver section.
4) Given the energy-delay product among the cores and the routing algorithm, the energy consumption of each node in the router is determined.
5) End-to-end delay and communication density are also modeled with respect to each communication path, starting from the receiver section.

Latency
The latency of a link is the sum of the latency to traverse the Frequency Boosting Mechanism (FBM) in the router and the link latency. The latency of the link is determined by the frequency at which the link is operated [36]. Let $router\_distance$ denote the distance in mm a signal can traverse in 1 [...]. This can be determined from the design's technology core. Finally, the latency of a link is expressed in terms of $F$ and $length_{source,destination}$, where $F$ denotes the frequency of the link, which depends on where the FBM is placed on the link, and $length_{source,destination}$ denotes the length of the link in mm.

Bandwidth
The bandwidth of a link is given by the product of the link width and frequency of operation of the link [

Link Power
Link power is estimated following the standard link power estimation used in a recent simulator [37] for an NoC router. This power model considers the cross-coupling effect for an N-wire interconnect, and the total power of an N-wire link per unit length is given by Equation (12) as the combination of switching, short-circuit, wire-bias and gate-leakage terms, where $N_w$ is the total number of wires in the link, $C_{se}$ and $C_{co}$ are the self capacitance of a wire and its coupling capacitance to neighboring nodes, respectively, $\alpha_{sa}$ is the switching activity on a wire and $\alpha_{co}$ is the switching activity with respect to the adjacent wires, $\tau$ is the short-circuit period, $V_{sv}$ is the supply voltage and $I_{sh}$, $I_{bi,w}$ and $I_{le,ga}$ are the short-circuit, wire-bias and gate-leakage currents.

Static Power Consumption
Static power is the power dissipated by a gate or a wire when it is idle or in an active state. The static power is mostly influenced by the structure of the architecture [37]. The static power dissipation can be expressed more precisely by the equation:

Energy Consumption
We assume the energy consumption of each NoC core $num$ ($E_{NoC-num}$) is available after task mapping. In wormhole routing, each input message is divided into several flits. For every input message, the head flit sets up the path for the body and tail flits [38]. The parameters and symbols are listed in Table 1.
The total energy consumption for processing a single packet in router $i$ is given by:

End-to-End Delay Formulation
The End-to-End flow delay [...]

[...] TSMC CMOS technology under a 1-GHz operating frequency, a supply voltage of 1 V and a switching factor of 0.5. The RTL description is synthesized to a gate-level netlist with Synopsys Design Compiler [40]. Power analysis is carried out using the Synopsys PrimeTime PX tool [41]. Benchmarks from the SPLASH-2 suite (Woo et al. 1995) are used to obtain the workload for the NoC interface system [41]. The experimental benchmark specification for this proposed work is as shown in [...]

Figure 3. Simulation results of the various link policies when the control period is eight cycles: 1) injected workload (Figure 3(a)); 2) link utilization estimation (Figure 3(b)); 3) DFS power consumption (Figure 3(c)); 4) HDFS power consumption (Figure 3(d)); 5) FSM-DFS power consumption (Figure 3(e)); 6) DFS latency (Figure 3(f)); 7) H-DFS latency (Figure 3(g)); 8) FSM-DFS latency (Figure 3(h)).
When the router transmits data at a specific traffic injection rate, the interface link dissipates static and dynamic power. The dynamic and leakage power of the conventional and proposed low-power links at different terminals, namely the traffic generator, the traffic estimator, the router, the input buffer, the output buffer, and the links, are estimated in 45 nm technology and plotted in Figure 4. Bandwidth sensitivity offers a 14.84% system/instruction throughput improvement.
Peak and average latency are observed. For a power-optimized interface link, a power-agile algorithm should offer very high throughput and low average latency for high-flit-rate data transmission. The FSM-DFS link characteristics for each boosting clock frequency are obtained by simulation and summarized in Table 3.
The 1× boosting router finishes the entire packet transmission in 24:34 ms, spending more time than the 2× and 4× boosting routers. The DFS method has the highest average and peak latency. In the comparison with FSM-DFS, the latency is 42.6 ns/flit for 1× boosting, while 2× and 4× boosting give 8.9 ns/flit and 8.1 ns/flit. Thus, the 4× boosting router is much better than the 1× boosting router in terms of latency.
The overall power consumed with 1× boosting is 1.85 mW in DFS, whereas the FSM-DFS system consumes 0.39 mW for the same 1× boosting. These experimental results demonstrate the feasibility of the clock-boosting router in the FSM-DFS link for a power-aware on-chip interconnection network on an NoC platform. Table 4 summarizes the experimental results of the DFS, FSM-DFS and history-based DFS policies as the control period varies from 8 to 128 cycles of the 1× clock. Under a varying control period, parameters such as average latency, peak latency, end time, dynamic power, leakage power and total power are measured, and the results are plotted in Figure 5. The FSM-DFS method is compared with the previous DFS (Seung Eun Lee et al. 2009) and H-DFS.
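As a quick arithmetic check on the overall power figures quoted above for 1× boosting (1.85 mW for DFS and 0.39 mW for FSM-DFS), the relative saving can be computed as shown below. The helper name `percent_saving` is hypothetical, and the result concerns total power at 1× boosting only; it is a different quantity from the 81.55% dynamic link power saving reported elsewhere in the paper.

```python
def percent_saving(baseline, proposed):
    """Relative saving of `proposed` over `baseline`, in percent."""
    return 100.0 * (baseline - proposed) / baseline


# Total power for 1x boosting as quoted in the text, in mW.
saving = percent_saving(1.85, 0.39)
print(f"{saving:.1f}%")  # prints "78.9%"
```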
The DFS has the highest average latency of 24.06 ms/flit for a control period of 8 cycles. Similarly, the FSM-DFS [...] Table 5. For modules such as the link, buffer and crossbar, this area comparison has already been reported for a mesh-based core and CCNoC in Volos et al. [28]. We examined the conventional strategy model and the proposed FSM-based NoC. The results show that the proposed system performs better than conventional work. The energy-delay product comparison with conventional NoCs is reported in Table 6. We also compared our new model (FSM-DFS-NoC) with conventional architectures such as Mesh, Homogeneous, Heterogeneous, CCNoC and HDFS-NoC, and the proposed work gives improved results. In addition, we examined the end-to-end delay for various flits using the conventional and proposed strategies, as tabulated in Table 7. Likewise, end-to-end delay and buffer size are compared with conventional work in Table 8.
The static power, overall dynamic power and energy of the three low-power interface links are estimated for the NoC and listed in Table 9. The overall simulation results show that the FSM-DFS interface attains 37.5% leakage power savings, 81.55% dynamic power savings and 61.8% energy savings in the NoC. Finally, the static power, overall dynamic power and energy under various benchmarks are observed and listed in Table 10.

Conclusions
Power optimization is achieved in the NoC by presenting the FSM-based DFS link at the algorithmic level. The proposed FSM-based DFS interface is compared with conventional low-power interfaces such as DFS and H-DFS. Performance metrics such as dynamic power, leakage power, average throughput, average latency, and average energy per useful flit are evaluated using 45-nm technology. The experimental results show that FSM-DFS is the most effective of the three power optimization interfaces for the NoC platform.
In this paper, we proposed an FSM-based DFS link to achieve low power in NoC. The traffic estimator estimates the traffic rate of the workload on the NoC. Based on the traffic, an appropriate working frequency is set on the link by the DFS policy. The implementation of the proposed FSM-DFS policy is discussed in detail. Experimental results show that the proposed policy attains 81.55% dynamic link power reduction, 37.5% leakage power reduction and 61.8% energy savings in the NoC. The proposed work is examined using various benchmarks, and all the simulation results of the FSM-based DFS link improve on the conventional work.