Information theoretic distinguishers for timing attacks with partial profiles: Solving the empty bin issue



I. INTRODUCTION
The field of cryptography is currently very sensitive as it deals with data protection and safety. Thus, in order to assess the security of cryptographic devices, it is crucial to know and test their weaknesses. For example, the Advanced Encryption Standard (AES) [1] is renowned as trustworthy from a mathematical point of view: there is currently no realistic way to cryptanalyze AES-128. However, it is possible to break the 128-bit secret key byte by byte using side-channel analysis (SCA). SCA exploits the physical fact that the secret key leaks some information out of the device boundary through various "side-channels" such as power consumption or timing (the number of clock cycles needed to perform a given operation). These leakages, correctly analyzed by SCA, yield the secret key of the device.
A good side-channel attack needs a good leakage model. Timing, for example, can be modeled easily when the implementation is unbalanced: several successful attacks [2], [3], [4], [5] exploit a timing leakage in the conditional extra-reductions of Montgomery modular multiplications. Some conditional operations can also happen in AES, e.g. in field operations, as discussed for instance in [6, Alg. 1]. Even when the code is balanced, which is a recommended secure coding practice, some residual timing unbalance can result from the hardware that executes the code. Indeed, processors implement speed optimization mechanisms such as memory caching and out-of-order execution. As a consequence, it is not possible to predict with certainty how timing leaks information. The attacker is then led to make predictions about the way the device leaks.
In this paper, we consider side-channel attacks that are performed in two phases: 1) a profiling phase where the attacker accumulates leakage from a device with a known secret key; 2) an attacking phase where the attacker accumulates leakage from the device with an unknown secret key.
This type of attack is known as a template attack [7]. It has been shown [7] to be very efficient under three conditions: (a) leakage samples are independent and identically distributed (i.i.d.); (b) the noise is additive white Gaussian; and (c) the secret key leaks byte by byte, which enables a divide-and-conquer approach. For some side-channels, such as power or electromagnetic radiations, condition (b) is met in practice. However, for timing attacks, the noise cannot be Gaussian since timing is discrete. Moreover, the noise source is non-additive in this case, since it arises from complex replacement policies in caches and processor-specific on-the-fly instruction reordering.
The first proposed profiled timing attack is the seminal timing attack of Kocher [8]. The same methodology can be used on AES, as noted by Bernstein in 2005 [9]. Further works used the same method [10], [11], [12]. To the best of our knowledge, all these works profile moments, such as the average timing under a given plaintext and key. However, it is known [7] that the best attacks should use maximum likelihood.
In this paper, as illustrated in Tab. I, we focus on a profiling where the distribution is characterized and used as such, rather than being reduced to its moments. The attacker computes distributions using histogram methods. These distributions are then used to recover the correct secret key.

Profiling method        Reference articles
Moments                 [9], [10], [11], [12]
Distributions           Our paper (caution about empty bins)

The discrete nature of timing leakage leads to an empty bin issue, which appears when a data value in the attacking phase has never been seen during the profiling phase. Based on profiling only, this data value should have a zero probability, which can be devastating for the attack. One known workaround is to use kernel density methods [13] to estimate probabilities, since the smoothing can be such that no empty bins remain. This method can however be seen as a post-processing of the existing information, and therefore alters the data. In addition, it has a very large computational complexity, and its performance highly depends on ad-hoc choices of several parameters such as the kernel type and bandwidth. In our paper, we have decided to keep the information as it comes, as we focus on information theoretic distinguishers.

a) Contributions: In this paper, we show that even when the abovementioned requirements (a), (b), and (c) are not met, timing attacks with incomplete profiling can be carried out successfully by adapting the maximum likelihood distinguisher while keeping the histogram method for probability estimation. We build six different distinguishers, all of which address the empty bin issue. For some of them, new histograms are built such that the empty bin issue totally disappears. Furthermore, we compare these distinguishers and show which one is the best in each specific context. We underline that, in practice, for a moderate profiling with 256 000 offline measurements, the soft drop and the combined offline-online profiling approaches are clearly the two best strategies: the AES key is typically extracted with only about 2 000 online measurements, i.e., a complete break in about 0.2 s. Finally, we provide some theoretical results proving how optimal some of the distinguishers can be.
b) Organization: The paper is organized as follows. Section II provides notations and the mathematical tools needed to understand distinguishers. Section III introduces new distinguishers that are suitable in the context of empty bins. Section IV provides simulations for these distinguishers, and Section V investigates real attacks on an ARM processor. Interestingly, all proposed distinguishers work, albeit with very noticeably different performances.
In Section VI, some interpolations of the obtained results in the presence of external measurement noise are derived. Section VII concludes.

A. Notations and Assumptions
We consider a side-channel attack with a profiling stage and use the following notations:
• during the profiling phase, a vector t̃ of q̃ text bytes is sent and the profiler gathers a vector x̃ of measurements;
• during the attacking phase, a vector t̂ of q̂ text bytes is sent and the attacker gathers a vector x̂ of leakage measurements, also customarily known as traces;
• we use the simplified notations t, q and x when discussing either profiling data or attacking data;
• the probability of a vector x with i.i.d. components x_i is denoted by P(x) = ∏_i P(x_i);
• we define the following sets: 1) X̃, T̃, X̂ and T̂ are the sets of possible values of the components of x̃, t̃, x̂ and t̂, respectively; 2) X = X̃ ∪ X̂ and T = T̃ ∪ T̂; 3) K is the set of all possible values of the key k;
• k and t are made of n bits (in particular, they are "bytes" when n = 8).
All sample components of one vector are i.i.d. and belong to some discrete set. Typically, X is a finite subset of N and T is equal to {0, 1}^n.
In the profiling stage, the secret key k* is known and variable. In the attacking phase, the secret key k* is unknown but fixed. Further, we assume that x_i depends only on t_i and k* for all i = 1, 2, . . ., q, in the form

x_i = ψ(t_i ⊕ k*),

where ⊕ is the XOR (exclusive or) operator and ψ is an unknown function which may involve noise, masking and other hidden parameters. Furthermore, in this paper, we use the notation n_{x,t} to denote the number of occurrences of the pair (x, t) among the samples.

Definition 1 (Probabilities). We define three different types of probabilities: P, P̃ and P̂. P is the actual (real) underlying probability distribution, but it is generally not available and has to be estimated by either P̃ or P̂.
• P̃ is computed using the profiling data: P̃(x, t) = ñ_{x,t} / q̃;
• P̂ is computed using the attacking data: P̂(x, t) = n̂_{x,t} / q̂.
In practice, as the secret key enters the leakage function through a XOR (Equation (1)), we shall often consider P(x, t ⊕ k).
For a fair comparison between distinguishers, Standaert et al. [14] have put forward the success rate as a measure of efficiency of a given distinguisher.
Definition 2 (Success Rate). The success rate SR is the probability, averaged over all possible keys, of obtaining the correct key:

SR = P(k̂ = k*),

where k̂ is the key guess obtained by the distinguisher during the attack.
It has been proven [15, Theorem 1, equation (3)] that for equiprobable keys the optimal distinguisher maximizes the likelihood:

D_optimal(x̂, t̂) = arg max_{k∈K} P(x̂ | t̂ ⊕ k).    (9)

In equation (9), we use the "arg max" operator, which is defined as follows: for a function f : K → R, arg max_{k∈K} f(k) is the value of k at which f attains its maximum. In real life, however, the attacker does not know the leakage model perfectly, and thus P(x̂ | t̂ ⊕ k) is not available. In order to get an estimate of P, we use the profiling data to build P̃ as defined in Equation (4). This is the classical template attack. The distinguisher becomes

D_template(x̂, t̂) = arg max_{k∈K} P̃(x̂ | t̂ ⊕ k).

This distinguisher is no longer optimal as it does not use the real distribution P. However, if profiling tends to exhaustiveness, P̃ and P will be very close since, by the law of large numbers, P̃ converges to P as q̃ → ∞. Moreover, we notice that non-optimality is not the only issue with template attacks in the context of discrete leakage. The attacker also faces the problem that the attack is ill-formed. In practice, it is convenient to use the logarithm arg max_{k∈K} log P̃(x̂ | t̂ ⊕ k). Notice that the base of the logarithm is arbitrary, as all key hypotheses scale alike when switching bases. In fact, since the samples are i.i.d., the likelihood factors into a product over the samples and the logarithm turns this product into a sum. Therefore, the attacker computes

arg max_{k∈K} Σ_{i=1}^{q̂} log P̃(x̂_i | t̂_i ⊕ k),    (14)

where the logarithm is used to transform products into sums for a more reliable computation. However, we would like to avoid empty bins, for which the logarithm in (14) would not be well defined.
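The template scoring of Equation (14) can be sketched as follows. This is a minimal Python illustration with our own variable names (not from the paper); the histograms are indexed by t ⊕ k, and a single empty bin drives a key's score to minus infinity, which is exactly the problem addressed in the next subsection.

```python
import numpy as np

def profile_histograms(x_prof, t_prof, k_star, n_timing_bins, n_text=256):
    """Estimate P~(x | t XOR k*) as normalized histograms from profiling data."""
    counts = np.zeros((n_text, n_timing_bins))
    for x, t in zip(x_prof, t_prof):
        counts[t ^ k_star, x] += 1
    sums = counts.sum(axis=1, keepdims=True)
    sums[sums == 0] = 1  # avoid 0/0 for text values never seen in profiling
    return counts / sums

def log_likelihood_rank(x_att, t_att, P):
    """argmax_k sum_i log P~(x_i | t_i XOR k); -inf as soon as an empty bin is hit."""
    scores = np.zeros(P.shape[0])
    for k in range(P.shape[0]):
        p = P[np.bitwise_xor(t_att, k), x_att]
        with np.errstate(divide="ignore"):
            scores[k] = np.log(p).sum()
    return int(np.argmax(scores))
```

With noiseless, exhaustive profiling this recovers the key; with partial profiling the empty bins make the plain version fail, which motivates the distinguishers of Section III.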

B. About Empty Bins
The empty bin issue appears when there exist i ∈ {1, . . ., q̂} and k ∈ K such that P(x̂_i | t̂_i ⊕ k) > 0 while P̃(x̂_i | t̂_i ⊕ k) = 0. This may even happen for the correct key hypothesis, leading to a wrong key guess during the attack. Figures 1 and 2 show what empty bins can look like after a profiling phase. We notice that some parts of the histograms are left blank; some of them are indicated by arrows and marked as "holes" in the figures. These timing values x are possible "empty bins". When such a hole is hit during the attack, meaning that the attacker gets a trace whose value corresponds to a hole, we call this an empty bin. Notice that no additional "binning" is needed, as would be the case for continuous distributions. The figures also show that the noise is not Gaussian, as can be observed from the shape of the distribution. The shortcoming of empty bins can be seen when evaluating the likelihood. The attacker encounters a zero probability, which makes the product vanish for the probability of a given key guess, even if many traces are used. As we wrote earlier, an empty bin may appear even for the correct key guess in template attacks, leading to a null success rate if not taken into account and treated properly. As an example, the number of empty bins for the correct key guess in the practical experiments of Section V is around 500 for a poor learning phase ("poor" in that the amount of training data is limited) and around 50 for a good learning phase. This multiplication by zero is not inherent to the attack; it is rather a profiling artifact. In fact, with more profiling traces, the empty bin would likely be populated. Thus, the empty bin issue is a mere side-effect of insufficient profiling, which results in an attack failure if it is encountered in the computation of the likelihood of the correct key.

III. DISTINGUISHERS WHICH TOLERATE EMPTY BINS

A. Building Distributions or Models
Before presenting the novel distinguishers in Subsection III-B, we need to define yet another type of distribution, known as the Dirichlet a posteriori in a Bayesian approach.
The Dirichlet A Posteriori: In order to avoid zero probabilities, we use a method based on Dirichlet prior calculations [16, Section 1]. This method leads to a new distribution denoted by P_α, where α > 0 is a user-defined parameter whose value (typically α = 1) will be discussed next.
Let X be the set of possible values for x and T be the set of possible values for t. For any x and t, we write p_{x,t} = P(x, t) for their joint probability and set p = (p_{x,t})_{x,t}. Prior to obtaining any trace, p_{x,t} is completely unknown, and we adopt a Bayesian approach to estimate it.
1) We consider the following a priori: without further information, we suppose that p follows a Dirichlet (prior) distribution with parameters α_{x,t} > 0 for all x, t. To simplify, we may choose α_{x,t} = α constant for all x, t. The probability density function of the Dirichlet distribution is proportional to

∏_{x,t} p_{x,t}^{α_{x,t} − 1},

where the normalization factor involves the Gamma function Γ, defined for x > 0 as Γ(x) = ∫_0^∞ u^{x−1} e^{−u} du; namely, the normalization factor is ∏_{x,t} Γ(α_{x,t}) / Γ(Σ_{x,t} α_{x,t}). Notice that the prior distribution is uniform when α_{x,t} = α = 1 for all x, t.
2) Then suppose we know x̃, x̂, t̃ and t̂. We can now compute the a posteriori probability of p. By Bayes' rule, and since the components x_i and t_i are i.i.d., the likelihood of the data is a product of the p_{x,t} raised to the observed counts. Applying Bayes' rule again, and introducing the new normalization constant of the resulting distribution, we finally obtain a Dirichlet distribution with updated parameters α_{x,t} + n_{x,t}, which is known as the Dirichlet a posteriori.
3) The normalization integral can easily be expressed in terms of the Gamma function and simplifies accordingly.
This new distribution leads to the posterior-mean estimate

P_α(x, t) = (ñ_{x,t} + α_{x,t}) / (q̃ + Σ_{x,t} α_{x,t}).

It is important to notice that for all (x, t) ∈ X × T, one has P_α(x, t) > 0. In other words, P_α has no empty bin issue. 4) From P_α(x, t) we can calculate the resulting conditional probability

P_α(x | t) = (ñ_{x,t} + α_{x,t}) / (ñ_t + α_t),

where ñ_t = Σ_x ñ_{x,t} and α_t = Σ_x α_{x,t}.

The Learned MIA Model: When q̃ is small, the model cannot be profiled accurately, and P̃ is a bad approximation of P. However, the profiled values x̃ and t̃ can still be useful, yet they require a more robust distinguisher.
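Computationally, the Dirichlet a posteriori amounts to additive smoothing of the histogram: add α to every bin count before normalizing. A minimal sketch (our notation, not the paper's code):

```python
import numpy as np

def dirichlet_posterior(counts, alpha=1.0):
    """Posterior-mean estimate P_alpha(x | t): add alpha to every bin count.

    counts[t, x] = number of profiling traces with text t and timing x.
    With alpha > 0, no conditional probability is ever zero."""
    smoothed = counts + alpha
    return smoothed / smoothed.sum(axis=1, keepdims=True)
```

For α = 1 this is the uniform-prior (Laplace) rule discussed in Remark 1; α = 1/2 corresponds to the a priori half-count interpretation.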
Distinguishers that compute models using profiling have already been proposed. For example, [17] computes a correlation on moments. However, correlation analysis may be sensitive to model errors [18]. Mutual Information Analysis (MIA) yields a distinguisher that can be robust when models are not perfectly known [18, Section 4], but it requires at least a vague estimation of the leakage model.
Since our function ψ is unknown, we can create a first-order model ψ̃ from the profiled data by applying a Step function to the average profiled timing for each text value. The Step function ensures the non-injectivity of the model. The simplest way to define Step is

Step(x) = ⌊d · x⌋ / d,

where d > 0: the greater d, the smaller the step size. This parameter d has to be small enough in order to make the model non-injective [19, Sec. 4.1]. In our case, we choose d = 1 for all our experiments. With such a model, it is possible to compute a MIA, which successfully distinguishes the correct key.
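The learned model above can be sketched as follows. This is our reading of the construction (function and variable names are ours): the per-text mean timing is computed from the profiling data and then quantized by Step.

```python
import numpy as np

def step(x, d=1.0):
    """Step function floor(d*x)/d: larger d gives a finer (smaller) step."""
    return np.floor(d * x) / d

def learned_model(x_prof, t_prof, n_text=256, d=1.0):
    """First-order model psi~: mean profiled timing per text value, quantized
    by Step so that the model stays non-injective, as MIA requires."""
    psi = np.zeros(n_text)
    for t in range(n_text):
        sel = (t_prof == t)
        if sel.any():
            psi[t] = x_prof[sel].mean()
    return step(psi, d)
```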

B. Robust distinguishers
In this subsection, we present six distinguishers that tackle null probabilities. Some of these solutions seem quite obvious, while others are deduced from the notions presented in the preceding Subsection III-A.
Hard Drop Distinguisher: The first naive method consists in removing all the traces which, for any key guess, have a zero probability.
Definition 3 (Hard Drop Distinguisher). The hard drop distinguisher is defined as follows:

D_hard(x̂, t̂) = arg max_{k∈K} Σ_{i∈I} log P̃(x̂_i | t̂_i ⊕ k),

where the set I is defined as

I = {i : P̃(x̂_i | t̂_i ⊕ k) > 0 for all k ∈ K}.

Recall that P̃, defined in Equation (4), is an empirical histogram estimated on the profiled data x̃ (along with the corresponding texts t̃).
The Hard Drop Distinguisher, as the name indicates, drops some data. In very noisy cases, it may even drop most of the data.
Soft Drop Distinguisher: The second possibility is to drop values only for some keys. However, this has to be done carefully, because dropping a factor in a product implicitly amounts to assigning it a probability of one. For this reason, instead of removing the trace, we replace the zero probability by a constant γ which is smaller than the smallest positive probability.

Definition 4 (Soft Drop Distinguisher). We define the Soft Drop Distinguisher as

D_soft(x̂, t̂) = arg max_{k∈K} Σ_i log max(P̃(x̂_i | t̂_i ⊕ k), γ).

This means that we penalize data with zero probability. The smaller γ, the harder the penalty.
The choice of the parameter γ is thus important in order to get a fair result for the distinguisher. If we choose γ ≥ 1/q̃, the penalty may be greater than the smallest strictly positive empirical probability, which would mean that the penalty is less severe than some licit probabilities. On the other hand, choosing γ smaller than 1/q̃ means a very strong penalty. In this case, the limit when γ → 0 is a distinguisher for which only the number of empty bins really matters. This leads to the Empty Bin Distinguisher presented next in Definition 8.
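The Soft Drop scoring can be sketched as follows (a minimal illustration in our notation; P is the profiled histogram indexed by t ⊕ k):

```python
import numpy as np

def soft_drop_scores(x_att, t_att, P, gamma):
    """Soft Drop: replace zero profiled probabilities by the penalty gamma
    before taking logarithms, so every key guess keeps a finite score."""
    n_keys = P.shape[0]
    scores = np.zeros(n_keys)
    for k in range(n_keys):
        p = P[np.bitwise_xor(t_att, k), x_att]
        scores[k] = np.log(np.where(p > 0, p, gamma)).sum()
    return scores  # key guess = argmax(scores)
```

Following the discussion above, a natural default is gamma = 1/q̃, the reciprocal of the number of profiling traces.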
The Dirichlet Prior Distinguisher: The Dirichlet Prior Distinguisher uses the Dirichlet a posteriori distributions presented in Subsection III-A.
Definition 5 (The Dirichlet Distinguisher). We define the Dirichlet Distinguisher as

D_α(x̂, t̂) = arg max_{k∈K} Σ_i log P_α(x̂_i | t̂_i ⊕ k).

Remark 1. As can be seen from the construction of the Dirichlet a posteriori, the Dirichlet distinguisher is α-dependent. It is important to evaluate the influence of α on the success rate. In practice, α = 1 seems a natural choice since the corresponding prior is uniform, which minimizes the impact of the a priori. In contrast, another value of α, like 1/2, can be interpreted as an a priori bin count. We may also consider scenarios where α ≈ 0, so as to have the least possible impact on the modified values of the histogram.
Offline-Online Profiling: The Dirichlet Prior Distinguisher is parametrized by α. As discussed in Remark 1, we can choose any α so long as it is strictly positive (the Dirichlet distribution would not be defined for α = 0). However, it is interesting to study its asymptotic behavior as α vanishes: in the limit, the estimate pools the offline (profiling) counts with the online (attacking) counts. This limit distribution is denoted by P_0(x|t) and resembles a profiling stage that would start offline and continue online.
Definition 6 (Offline-Online Profiling). The Offline-Online Profiled (OOP) distinguisher is defined as

D_OOP(x̂, t̂) = arg max_{k∈K} Σ_i log P_0(x̂_i | t̂_i ⊕ k).

The OOP distinguisher is simpler to use than the Dirichlet Prior Distinguisher since α is no longer in use. Of course, it also solves the empty bin issue, since for all (x, t) ∈ X × T, one has P_0(x, t) > 0.
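Under the pooled-counts reading of P_0, the OOP distinguisher can be sketched as follows (our notation; this is an illustration of the pooling idea, not the paper's code). Every observed attack value has a positive pooled count, so no empty bin can occur:

```python
import numpy as np

def oop_scores(x_att, t_att, counts_prof, n_keys):
    """Offline-Online Profiling: for each key guess, pool the offline profiling
    histogram with the histogram of the attack data itself, then score the
    attack data under the pooled estimate."""
    scores = np.zeros(n_keys)
    for k in range(n_keys):
        pooled = counts_prof.astype(float).copy()
        y = np.bitwise_xor(t_att, k)
        np.add.at(pooled, (y, x_att), 1.0)          # add the online counts
        sums = pooled.sum(axis=1, keepdims=True)
        sums[sums == 0] = 1                         # rows never touched
        P = pooled / sums
        scores[k] = np.log(P[y, x_att]).sum()
    return scores  # key guess = argmax(scores)
```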
Learned MIA Distinguisher: The Learned MIA Distinguisher is constructed with the profiled model function ψ̃ presented in Eqn. (20) of Subsection III-A.

Definition 7 (The Learned MIA Distinguisher).
The Learned MIA Distinguisher is defined as

D_MIA(x̂, t̂) = arg max_{k∈K} Î(x̂; ψ̃(t̂ ⊕ k)),

where Î is the empirical mutual information [20].
Empty Bin Distinguisher: The Empty Bin Distinguisher is yet another intuitive solution, based on the idea that instead of avoiding null probabilities, we may take only these into account. The key guess with the least number of null probabilities "should" be the correct key.

Definition 8. The Empty Bin Distinguisher is defined as

D_EB(x̂, t̂) = arg min_{k∈K} #{i : P̃(x̂_i | t̂_i ⊕ k) = 0}.

The Empty Bin Distinguisher assumes that missing data contain more information than actual (measured) data. More precisely, a drop should normally not happen unless the guessed key is wrong; hence, the key guess with the fewest drops should be the correct key. Obviously, this distinguisher is no longer effective if no drop occurs for at least two key guesses.
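Counting empty bins per key guess is straightforward; a minimal sketch (our notation, with P the profiled histogram indexed by t ⊕ k):

```python
import numpy as np

def empty_bin_scores(x_att, t_att, P):
    """Empty Bin Distinguisher: count, for each key guess, how many attack
    traces fall into a profiled bin of probability zero; the guess with the
    fewest such drops is retained (argmin)."""
    n_keys = P.shape[0]
    drops = np.zeros(n_keys, dtype=int)
    for k in range(n_keys):
        p = P[np.bitwise_xor(t_att, k), x_att]
        drops[k] = int((p == 0).sum())
    return drops  # key guess = argmin(drops)
```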
a) Further Remarks: All these distinguishers use a profiling phase. Before comparing them, we would like to discuss a priori their respective efficiencies. As the Hard Drop Distinguisher discards some data, we may expect it to achieve the lowest success rate for a given number of traces. The OOP Distinguisher takes into account two types of data: profiling and attacking data. Therefore, it should be more efficient than the other distinguishers. Lastly, we built the Learned MIA Distinguisher in order to withstand model errors, such as inaccurate profiling. We therefore expect Learned MIA to work better when little data is available during the profiling stage.

IV. SIMULATED RESULTS
In this section, we present the results obtained on a simulated model. With these results, we can give a comparison of the proposed distinguishers.

A. Presentation of the Simulated Model
The simulated model is built as follows:

x_i = H_w(SubBytes(t_i ⊕ k*)) + u_i,    (28)

where u_i is a discrete uniformly distributed noise u_i ∼ U(−σ, σ), SubBytes is the AES substitution box function, and H_w is the Hamming weight of a byte. This very simple leakage model is used to compare the distinguishers in the case where the attacker has no information about the model.
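The simulated leakage can be generated as follows. Note that the simulations use n = 4 bits, so a 4-bit substitution box stands in for the 8-bit AES SubBytes here; the table below is an illustrative fixed bijection of our own, not the AES table.

```python
import numpy as np

# Illustrative 4-bit S-box (any fixed bijection works for the simulation)
SBOX4 = [0x6, 0xB, 0x5, 0x4, 0x2, 0xE, 0x7, 0xA,
         0x9, 0xD, 0xF, 0xC, 0x3, 0x1, 0x0, 0x8]

def simulate_traces(q, k_star, sigma, rng):
    """x_i = Hw(SubBytes(t_i XOR k*)) + u_i, u_i uniform on [-sigma, sigma]."""
    t = rng.integers(0, 16, q)
    hw = np.array([bin(SBOX4[v]).count("1") for v in np.bitwise_xor(t, k_star)])
    u = rng.integers(-sigma, sigma + 1, q)
    return t, hw + u
```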
Remark 2 (Optimal Distinguisher). The optimal distinguisher (9) can easily be calculated if the model is perfectly known, as

D_optimal(x̂, t̂) = arg max_{k∈K} ∏_i δ_σ(x̂_i − H_w(SubBytes(t̂_i ⊕ k))),

where δ_σ is defined by δ_σ(x) = 1 if |x| ≤ σ and 0 otherwise. In Figures 3, 4 and 5, we include the optimal distinguisher for reference, to show how far the other curves are from the fundamental limit of performance.
By construction, the leakage simulation (28) generates some traces with zero probability, but notice that there is no i such that P(x_i | t_i, k) = 0 for the correct key guess. This academic example is useful to compare the distinguishers defined in Section III.

B. Attack Results
We computed the success rates (8) of the various attacks for σ = 24, n = 4 bits, and the number of attack traces ranging from small to high values; the results are shown in Figures 3, 4 and 5.

Fig. 5: SR for q̃ = 4 000 and σ = 24 on synthetic measurements
The only difference between Figures 3, 4 and 5 is that we have increased the amount of data in the profiling stage. When profiling is poor (Figure 3), the best distinguisher is the Offline-Online Profiling distinguisher, while the Learned MIA Distinguisher is not as good as expected. When q̃ = 1 600 (Figure 4), all distinguishers improve. Finally, when profiling is good (q̃ = 4 000, Figure 5), the best distinguisher is the Empty Bin Distinguisher, followed by the Soft Drop Distinguisher and the Offline-Online Profiling.
Remark 3. In this very special case, we can show that the Empty Bin Distinguisher can accurately approximate the Optimal Distinguisher. Indeed, the actual probability P(x | t) is constant whenever x lies in the appropriate interval, and the Empty Bin Distinguisher exploits exactly the support of this distribution, due to the leakage model. Therefore, we can predict that at least q̃ = (2σ + 1) · |Y| · 1/min_y P(y) = 3 920 profiling traces are needed to make sure that the Empty Bin Distinguisher becomes as efficient as the Optimal Distinguisher. As profiling consists in random draws with replacement, the Empty Bin distinguisher is found to be very close to the Optimal distinguisher with q̃ = 4 000 profiling traces.

V. RESULTS ON REAL DEVICES
We have chosen to carry out a timing attack on an STM32F4 discovery board [21]. One interesting aspect is that we do not make any assumption on the model. In real life, the leakage model happens to be much more complex than the one employed in simulations (e.g., Equation (28)). As will be seen, in practice empty bins appear even for the correct key guess and even for a "good" profiling phase. This observation differs from the ideal case of the simulations carried out in the preceding Section IV.

A. The ARM processor
We used an STM32F4 discovery board by STMicroelectronics. It contains an STM32F407VGT6 microcontroller, which has an ARM Cortex-M4 MCU with 1 MB flash memory for instructions and data, and 192 KB of Random Access Memory (RAM). The RAM is divided into three sections: one of 16 KB, another of 112 KB, and the last one consisting of 64 KB of Core Coupled Memory (CCM). The CCM has zero wait states and is often used to store critical data such as data from the operating system. Since the RAM is divided into three regions, users are unable to use the 192 KB of RAM as a continuous memory block.
STM32F4 microcontrollers contain a proprietary prefetch module (Adaptive Real-Time memory accelerator, or ART accelerator). The ART accelerator contains an instruction cache with 64 lines and a data cache with 8 lines. The line size of both the instruction cache and the data cache is 128 bits. The precise details of the ART accelerator (cache replacement policy and cache associativity) are not documented, as the module is an intellectual property of STMicroelectronics. The STM32F407VGT6 microcontroller has neither a CPU cycle counter nor a performance register to measure cycle-accurate time. However, the Data Watchpoint and Trace (DWT) unit has a cycle-accurate 32-bit counter (the DWT_CYCCNT register), which can be used for measuring the duration of critical operations. When the processor runs at 168 MHz, the DWT_CYCCNT register overflows every 25.5 seconds, providing a large enough time window for an adversary to measure the encryption/decryption time without timer overflow. In practice, we collected timing data repeatedly within the ARM, and then dumped them as large data buffers sporadically. This modus operandi allowed us to reach about 10 000 measurements per second.

B. Weaknesses -Non Constant AES Time
We use OpenSSL (version 1.0.2) AES as the cryptographic library, where the SubBytes function is implemented with large 1 KB T-boxes (see [22, Sec. 5.2.1, page 18]). Interestingly, the OpenSSL code (copied in Appendix A) does not contain any conditional statement, hence it could be considered constant-time by a code review. However, once programmed on the STM32F4 processor, one notices that the execution duration depends on the inputs. The AES timing acquisition is illustrated in Figure 6. Before each encryption, we reset the DWT_CYCCNT register. This yields the exact timing of the AES execution (which is about 2 600 clock cycles on average; recall Figures 1 and 2). In a real attack, an attacker would measure a noisy timing using an external "chronometer". Our setup thus models the best case for an attacker and hence bounds the security of the analyzed implementation. In particular, we underline that our measurement methodology is fully non-invasive: the timing measurement is performed in parallel to the AES computation, thereby keeping the victim circuit running at full speed, without interference. We observe a huge time difference when the data cache is turned off/on. When the DC is turned off, there is no timing leakage, as AES is constant-time. Yet, when the DC is turned on, AES is not constant-time. This non-constant AES timing leads to the following conclusions:
• This is a weakness for the security of the processor, as two different plaintexts lead to two different clock-cycle counts for computing AES.
• Following Figure 7, enabling the Instruction Cache or not does not seem to modify the behaviour of the leakages.
• The data presented in Figure 7 are obtained using a fixed key while varying one byte of the plaintext.
Figure 7 tells us that the caches should be disabled to reduce the timing leakage. However, we emphasize that such a decision has a strongly negative impact on AES performance: with the DC off, the overall AES execution time is about 27% longer.
Therefore, in a realistic context, we shall assume that both DC and IC are enabled, which we will do in the sequel (see Sec. VI for some indications of how well attacks perform when the caches are disabled).

C. Characterizing the leakages for Data Cache On
As seen earlier, when the Data Cache is enabled, the AES computation is not constant-time. This can be attributed to the T-boxes called during the computation. Indeed, calling a value from a table also stores it in the Data Cache. If this value is called again within the next eight calls, the load will be faster. In Appendix A, we have copied the OpenSSL source code for the AES encryption with a 128-bit key. In this code, we notice that there are 160 calls to the T-boxes.
In order to find a model of the leakage, we inferred the cache policy of the STM32F4 ARM microcontroller from a thorough study of its timing response to some adaptively constructed requests. We discovered that it is actually a FIFO (First-In First-Out) cache. If one requests a particular table lookup that is within the last eight cache line fetches, then the access is a hit (otherwise, it is a miss).
In case of a hit, the access time is 5 or 6 clock cycles faster than for a miss. To show this behaviour, we have carried out a very simple experiment; Figure 8 shows the resulting histogram of the clock cycles. The negative numbers on the x axis are due to the fact that we have set 0 at the maximum value of the clock cycles, which is the value obtained when no hit occurs at all. We notice that when a hit occurs, the timing is faster by 5 or 6 clock cycles. Figure 8 has to be compared with the timing of a full AES encryption in order to see whether this model is relevant. Therefore, we have plotted in Figure 9 the histogram for a full AES encryption. Once more, the 0 on the x axis is set at the maximum.
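The inferred FIFO policy can be modeled with a few lines of code. This is a sketch of our reading of the policy (an 8-line FIFO where a line is inserted only on a miss and the oldest line is evicted):

```python
from collections import deque

def fifo_cache_hits(accesses, n_lines=8):
    """Count cache hits for a sequence of accessed lines under FIFO replacement.

    An access hits iff the line is among the last n_lines distinct lines
    brought into the cache; unlike LRU, a hit does not refresh the line."""
    fifo = deque(maxlen=n_lines)
    hits = 0
    for line in accesses:
        if line in fifo:
            hits += 1          # hit: about 5-6 clock cycles faster
        else:
            fifo.append(line)  # miss: line fetched, oldest line evicted
    return hits
```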

Fig. 9: Distribution of the clock cycles for a full AES encryption
Very interestingly, we can observe in this figure high density levels corresponding to the hits: 1) One hit at -5 and -6; 2) Two hits at -10 and -11; 3) Three hits at -15 and -16.
Below -16 clock cycles, the hits are lost into the noise.
The comparison of these two figures shows that the FIFO model for table hits is correct, but does not explain all the timing leakage due to the cache policy of the processor.

D. Attack Results
As already noticed above, the leakage model is mostly unknown. We only suppose that the text byte is mixed with the key through a XOR operation. As a consequence, the optimal distinguisher (giving the limit of performance) is not known. The SNR of the leakage is Var(E(x|t))/E(Var(x|t)) = 0.4. In Figure 10, we notice that Learned MIA is the best distinguisher in the case of poor profiling. The Hard Drop Distinguisher does not succeed at all, since it drops about 90% of the data.

Fig. 11: SR for q̃ = 256 000 on real-world measurements

Figure 11 presents the success rate for a better profiling stage. We notice the following interesting improvements:
• The Learned MIA distinguisher is only slightly better than in Figure 10. To reach an 80% success rate, 1 100 traces are needed, compared to 1 250 traces previously.
• The Soft Drop and Offline-Online distinguishers are the best distinguishers in this scenario, with a small advantage for the Soft Drop distinguisher.
• The Hard Drop distinguisher remains unsuccessful.
We note that the Soft Drop Distinguisher has been run with the γ parameter defined in Equation (23) set to γ = 1/q̃.
Figure 12 is the continuation of Figure 11 with many more traces in the profiling stage.

Fig. 12: SR for q̃ = 2 560 000 on real-world measurements

The resulting profiling is very good, and one may consider that the approximation of P is tight. In this case, the Soft Drop and OOP Distinguishers are both very successful, which seems natural given that P̃ has converged to the actual probability P. For this attack, we recall that the timing of 10 000 traces can be acquired in one second. Therefore, the attack succeeds in about 0.2 second using the Soft Drop or OOP distinguishers.
As a conclusion to this study on the STM32F4 discovery board, we have learned the following comparisons between the proposed distinguishers:
• when the profiling stage is poor, the best distinguisher is the Learned MIA Distinguisher;
• when there is enough data in the profiling stage, the best distinguisher is the Soft Drop Distinguisher, closely followed by the OOP Distinguisher;
• the Empty Bin Distinguisher converges to the optimal success rate, but is not as efficient as it was in Section IV; this can be explained by the fact that we skip a lot of data in the computation;
• the Hard Drop Distinguisher is the slowest to converge to a 100% success rate.
Remark 4. When comparing Figures 11 and 12, we notice that the Empty Bin distinguisher does not improve as the number of profiling traces increases. An explanation is that there are no more empty bins to be filled between these two situations; beyond this point, only a more precise estimation of the probabilities would make the difference.
Remark 5. As discussed in Definition 4, the value of γ is important. We have run the same experiment as in Figure 11 with γ = 1/(q̃ × 10^10). The results we obtained are presented in Figure 13. When comparing this figure with Figure 11, we notice that the performance of the Soft Drop Distinguisher has dropped and is now much closer to that of the Empty Bin Distinguisher, as we had forecast.

E. Nature of Empty Bins
Defined in Section II-B, empty bins can appear under two circumstances. The first possibility is insufficient profiling: some rare occurrences are not encountered for lack of training measurements.

Fig. 13: SR for q = 256 000 with γ = 1/(q × 10^10)

The second possibility is what we call Structural Empty Bins. They are present whatever the profiling under a fixed key, and do not depend on the number of traces q in the profiling stage. In order to determine the cause of empty bins, we have plotted the number of empty bins for a given key as a function of the number of traces in the profiling stage.
Fig. 14: Empirical number of empty bins

Figure 14 presents this study, obtained with the STMicroelectronics Discovery board. We considered up to q = 1 280 000 profiling traces and, denoting by x_q the number of empty bins after q traces, plotted the values x ∈ {min_q x_q, …, max_q x_q} such that ∃q, x_q = x.
We can see that the number of empty bins decreases but never reaches 0. At the beginning, the high number of empty bins is due to both poor profiling and structural empty bins. With a good profiling, only the structural empty bins remain.
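This diagnostic is easy to reproduce: count, for growing prefixes of the profiling set, how many (text byte, timing) bins have never been observed. A minimal sketch (our own illustration; the function and argument names are assumptions, not the paper's code):

```python
import numpy as np

def empty_bins_vs_q(timings, texts, checkpoints, t_range=256):
    """For each profiling size q in `checkpoints`, count the (text, timing) bins
    that are still empty after the first q traces. Structural empty bins are the
    ones that persist however large q grows."""
    timing_values = np.unique(timings)  # the observable timing values
    counts = []
    for q in checkpoints:
        # distinct (text byte, timing) pairs seen in the first q traces
        seen = set(zip(texts[:q].tolist(), timings[:q].tolist()))
        total_bins = t_range * len(timing_values)
        counts.append(total_bins - len(seen))
    return counts
```

If the curve plateaus at a nonzero value, the remaining empty bins are structural rather than a profiling artifact, which is exactly the behavior reported in Figure 14.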

F. Study on the Mean-Square Error
An interesting point noticed in Figures 10, 11, and 12 is that the Learned MIA distinguisher works better than the Soft Drop Distinguisher for a poor learning phase (i.e., q = 25 600). However, with a better learning phase (i.e., q = 256 000 and q = 2 560 000), the Soft Drop Distinguisher has a much better success rate. In order to understand why the Learned MIA Distinguisher does not improve that much with a better learning phase, we have computed the Mean-Square Error (MSE) of these two distinguishers for the three learning phases (i.e., q ∈ {25 600, 256 000, 2 560 000}).

Definition 9 (MSE, Bias and Variance). Let us consider a random variable X and its expectation θ = E[X]. An estimator of the random variable is noted X̂. The MSE is defined as follows:

  MSE(X̂) = E[(X̂ − θ)²].

The bias of the estimator is the expectation of the difference between the estimator and the mean of the random variable:

  Bias(X̂) = E[X̂ − θ] = E[X̂] − θ.

At last, the variance of the estimator is:

  Var(X̂) = E[(X̂ − E[X̂])²].

From these definitions, we have the following relation between MSE, bias and variance:

  MSE(X̂) = Bias(X̂)² + Var(X̂).

The Mean-Square Error is computed using the following method:
1) For the secret key k*, we calculate the value of the distinguisher, i.e., the value of P̂(x | t ⊕ k*) for the Soft Drop and Î(x; φ(t ⊕ k*)) for the Learned MIA. We compute this value for different numbers of traces q. This gives an estimation of the normalized distinguisher for the correct key.
2) The most accurate estimation is obtained for the highest value of q. Therefore, taking the average over a large number of experiments at this highest value of q gives a good estimation of the expectation of the estimator.
3) Then we calculate, for every value of q, the bias and the variance of the estimator, and the average MSE is obtained using the formula MSE = Bias² + Variance.
We have plotted in Figures 15 and 16 the average MSE for the two distinguishers. In order to be more readable, we have plotted the logarithm of the MSE. Furthermore, we have chosen to plot the MSE of each distinguisher separately, as the two distinguishers are not directly comparable.
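The three-step procedure above can be sketched as follows (our own illustration; the data layout is an assumption — repeated estimator values per profiling size q, with the reference expectation taken at the largest q):

```python
import numpy as np

def average_mse(estimates_per_q, reference):
    """Decompose MSE = Bias^2 + Variance for repeated estimator values.

    estimates_per_q: dict mapping q -> array of distinguisher values over experiments
    reference: best estimate of the true expectation (average at the largest q)
    """
    out = {}
    for q, vals in estimates_per_q.items():
        vals = np.asarray(vals, dtype=float)
        bias = vals.mean() - reference        # Bias = E[X̂] − θ
        var = vals.var()                      # Var = E[(X̂ − E[X̂])²]
        out[q] = {"bias": bias, "variance": var, "mse": bias**2 + var}
    return out
```

The decomposition makes the diagnosis of Figures 17 and 18 mechanical: a flat bias term with shrinking variance (Soft Drop) versus a dominant, q-independent bias term (Learned MIA).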
The MSE of the Learned MIA Distinguisher stays almost constant as the learning phase improves, whereas the MSE of the Soft Drop Distinguisher becomes much smaller. This means that a better learning phase gives a much better estimator of the distinguisher.
To understand this MSE more deeply, we separate bias and variance for these two distinguishers. The results are shown in Figure 17 for the Learned MIA Distinguisher and Figure 18 for the Soft Drop Distinguisher.
We notice the following aspects:
• For the Soft Drop Distinguisher, the bias is almost equal to zero; the MSE essentially reduces to the variance.
• For the Learned MIA Distinguisher, it is mainly the opposite: the biggest part of the MSE is the bias.
To conclude on the MSE, the Soft Drop Distinguisher improves because its estimator has a much smaller variance with a better learning phase. Meanwhile, the Learned MIA Distinguisher does not improve because it is a biased estimator, and a better learning phase does not reduce this bias.

VI. SUCCESS RATE IN PRESENCE OF EXTERNAL NOISE
The measurement setups used in simulation (Section IV) and on real-world traces (Section V) are ideal. Indeed, the only noise considered is algorithmic, i.e., it consists in the timing variations which arise from the parts of the algorithm not under study. In this section, we analyse the effect of noise external to the monitored cryptographic algorithm. Subsection VI-A discusses in general terms the effect of noise addition, and Subsection VI-B details quantitatively how distribution-based distinguishers cope efficiently with noise (while moment-based distinguishers fail to resist it).

A. Effect of Measurement Noise
In practice, however, timing measurements contain a noisy part. Let us give three examples:
1) measuring the timing difference between request and response from the AES, over a network of unknown latency;
2) using a side-channel signal (such as the power or the electromagnetic field) to observe the AES computation: the beginning and the end of an AES are easy to identify, as they consist in sixteen consecutive operations (namely the sixteen XORs making up the AddRoundKey operation). As these patterns have a remarkable signature, they can be extracted with great accuracy by a mere cross-correlation. Still, the AES itself might not execute in constant time, hence some alignment issues;
3) using a cache attack, which would disclose when the program flow entered and exited the AES function. However, cache access timing is non-deterministic.
Let us denote the variance of the added noise as σ². Now, it is known that for any additive distinguisher (which is the case of our distinguishers), the number of traces needed to recover the secret for a given success rate is inversely proportional to the signal-to-noise ratio (see, e.g., [23, Corollary 2]).
As a direct consequence, we can predict the complexity of the attacks when the IC and DC are disabled. It can be seen in Figure 7 that the timing variation is divided by about three (from ≈ 20 to ≈ 8) when the DC is disabled. Therefore, the number of traces required to recover the key is multiplied by about three.
In addition, we can approximate the number of traces required to extract the key in the presence of external noise of standard deviation σ. In our case study of OpenSSL AES on ARM, the algorithmic noise has a standard deviation of about 20 clock cycles (see Figures 1 and 2).
So, if the external noise has standard deviation σ < 20, the impact is small. But when σ/20 > 1, the influence of the external noise becomes preponderant. As the algorithmic noise and the external noise are independent, their variances add, and the number of traces required to extract the key actually grows linearly with the total noise variance 20² + σ² as soon as σ/20 ≫ 1.
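This back-of-the-envelope scaling can be written down explicitly, under the stated assumption that the trace count is inversely proportional to the SNR (so proportional to the total noise variance for a fixed signal). The function and its baseline argument are our own illustration:

```python
def required_traces(n0, sigma_ext, sigma_alg=20.0):
    """Scale a baseline trace count n0 (measured with algorithmic noise only,
    std sigma_alg) to account for independent external noise of std sigma_ext.
    Assumes n ∝ 1/SNR, i.e. n ∝ total noise variance for a fixed signal."""
    return n0 * (sigma_alg**2 + sigma_ext**2) / sigma_alg**2
```

For sigma_ext well below 20 the correction factor stays close to 1, which matches the observation that small external noise has little impact.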

B. Comparison with Existing Methods in the Presence of Noise
In this subsection, we compare our distribution-based method with existing methods (the moment-based methods mentioned in Tab. I). In particular, we focus on the representative Bernstein correlation [9] with a learned model (the timing expectation for each value of the target AES byte), which we refer to as "CPA". This "CPA" between timing measurements and the learned average timing per byte of the key does not suffer from the empty bin issue. We start with a comparison under little external noise. In this case, we have plotted in Figure 19 the success rate for both the Soft Drop distinguisher and the CPA. The x axis represents the number of traces for the profiling phase, while the y axis is the number of traces needed during the attack to reach an 80% success rate. We notice that the CPA performs better than the Soft Drop method for any profiling (even when learning with several million traces). This can be due to a bias between the profiled distribution and the attack distribution.
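The "CPA" baseline described above is straightforward to express: profile the mean timing for each value of the targeted byte, then rank key guesses by the Pearson correlation between measured timings and the predicted means. A minimal sketch, assuming a learned `profile_means` table indexed by the intermediate byte t ⊕ k (our own illustration, not the code of [9]):

```python
import numpy as np

def cpa_distinguisher(profile_means, timings, texts):
    """Rank key byte guesses by Pearson correlation between measured timings
    and the profiled mean timing of the predicted byte t XOR k."""
    scores = np.empty(256)
    for k in range(256):
        model = profile_means[texts ^ k]  # learned average timing per class
        scores[k] = np.corrcoef(model, timings)[0, 1]
    return int(np.argmax(scores))
```

Because it compares against a single moment (the class mean), this distinguisher is cheap and immune to empty bins, but — as the next subsection shows — fragile when the noise is strongly non-Gaussian.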
However, in a practical case, we encounter noisy timing leakages. In order to compare our methods with the existing methods (such as CPA) in the presence of external noise, we plotted Figure 20. In this figure, we took a good profiling phase (q = 3 × 10^6), i.e., profiling is performed on sufficiently many traces. This figure is obtained for a noisy timing, that is, the nominal time to compute AES (as in Subsection II-B) where the noise follows the law:

  0 added with probability 50%,
  T added (T ∈ ℕ, a number of clock periods) with probability 50%.   (32)

This models the interruption of the CPU by a peripheral when the AES runs baremetal, or a descheduling of the AES process during one time slot on systems with an operating system (OS). Indeed, such events, when they occur, add a long period of time (often as long as or even longer than the duration of the AES) to the encryption time, so that the interruption can be served or the OS can re-schedule the AES process. We notice that, in such a case, it is more interesting to use one of our methods rather than previously existing methods such as CPA. Indeed, distribution-based profiling is more accurate than CPA estimation with noisy signals. For instance, the results of Hassan Aly and Mohammed ElGayyar [24] show that 2^22 encryptions are required for a key extraction on more recent processors (Pentium Dual-Core and Core 2 Duo), which is significantly more than what Bernstein used in his original attack [25]. The authors of that paper remark incidentally that the best method is not to correlate with the mean of each class, but with the minimum value in each class. This confirms that full distributions are better suited for distinguishing than simply the average per class, and justifies that our study focuses on distribution-based distinguishers (more robust to the binary noise situations encountered while measuring durations) rather than moment-based distinguishers (recall Tab. I).
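The noise law (32) is easy to simulate, which also makes its effect on moment-based methods intuitive: each class mean is shifted by roughly p·T, drowning the few-cycle signal, while the per-class histogram keeps a clean mode at the noiseless timing. A hedged sketch (our own illustration):

```python
import numpy as np

def add_scheduling_noise(timings, T=50, p=0.5, rng=None):
    """Model (32): with probability p, a delay of T clock periods is added
    (interrupt service or OS descheduling); otherwise the timing is unchanged."""
    if rng is None:
        rng = np.random.default_rng()
    delayed = rng.random(timings.shape) < p
    return timings + T * delayed
```

With T = 50 and p = 0.5, the standard deviation of the added noise is 25 cycles, already larger than the ≈ 20-cycle algorithmic noise of the ARM case study.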

VII. CONCLUSION AND PERSPECTIVES
We have derived several information-theoretic distinguishers as possible solutions to the empty bin issue. Some of them, like the Dirichlet Prior and the Offline-Online distinguishers, required the computation of novel distributions. We have shown in particular that empty bins, previously believed to be an annoyance and dropped accordingly, can turn out to be valuable assets for the attacker, as long as they are treated carefully. Throughout the paper, real timing data are used, making the results very practical.
We have also compared the various distinguishers under two frameworks: a simulated test with synthetic leakage and real-world timing attacks. In both cases, we noticed that the outcome of the attacks depends on the quality of the profiling stage. A good profiling improves the results, in which case the best distinguisher appears to be the Soft Drop Distinguisher. A poor profiling makes the traditional distinguishers break down; more sophisticated solutions like the Offline-Online Profiling and Learned MIA distinguishers are very useful in this case. A possible direction for further investigation is to use more powerful statistical tools in order to extract a more precise model for the Learned MIA Distinguisher.
An interesting aspect of the studied timing attack is that one does not have to make any assumption on the leakage model. In addition, the main advantage of the new distinguishers is that the empty bin issue is completely solved. We also introduced distinguishers which can jointly exploit offline and online side-channel measurements. As an interesting perspective, our approach could advantageously be analyzed using the "perceived information" metric recently introduced by Standaert et al. [26, Eqn. (1)].
Another perspective would be to compare our information-theoretic attacks with attacks based on machine learning techniques. Surprisingly, and contrary to results reported in other papers, our preliminary results show that SCA based on support vector machines [27] performs poorly, even when profiling with very few traces (q small), which may be due to the univariate nature of the leakage.
An interesting observation is that writing cryptographic code robust to timing attacks is challenging. While the OpenSSL code for AES has no obvious flaw (such as unbalanced branches which depend on sensitive data), the timing of AES is data-dependent, due to microarchitectural features of the studied ARM core. There seem to exist two classes of solutions against timing attacks. The first aims at randomizing the execution timing, as studied for instance in [6]; such an implementation can still be attacked with high-order distinguishers, albeit with more traces than without any protection. The second would attempt to balance the timing, yet this requires some hardware support, such as the CCM feature of the STM32F4 processors.

Fig. 6: Measuring elapsed time for AES encryption. Time deviations for different configurations of the Instruction Cache (IC) and Data Cache (DC) are shown in Figure 7.

Fig. 20: Success rate for soft drop versus CPA, for small noise and for noise of standard deviation T = 50 (recall Equation (32)).

• We generate a table of length 256;
• we generate 16 random values between 0x00 and 0xff;
• we access the 16 elements of the table corresponding to the 16 values generated previously;
• we measure the time to access these 16 elements of the table.
We have plotted in Figure