In any side-channel attack, it is desirable to exploit all the available leakage data to compute the distinguisher's values. The profiling phase is essential to obtain an accurate leakage model, yet it may not be exhaustive. As a result, information theoretic distinguishers may encounter previously unseen data, a phenomenon yielding empty bins. A strict application of the maximum likelihood method yields a distinguisher that is not even sound. Ignoring empty bins reestablishes soundness, but seriously limits the performance in terms of success rate. The purpose of this paper is to remedy this situation. We propose six different techniques to improve the performance of information theoretic distinguishers. We study them thoroughly by applying them to timing attacks, both with synthetic and real leakages. Namely, we compare them in terms of success rate, and show that their performance depends on the amount of profiling and can be explained by a bias-variance analysis. The result of our work is that there exist use-cases, especially when measurements are noisy, where our novel information theoretic distinguishers (typically the soft-drop distinguisher) perform best compared to known side-channel distinguishers, despite the empty bin situation.

The field of cryptography is currently very sensitive as it deals with data protection and safety. Thus, in order to assess the security of cryptographic devices, it is crucial to know and test their weaknesses. For example, the Advanced Encryption Standard (AES) [

A good side-channel attack needs a good leakage model. Timing, for example, can be modeled easily when the implementation is unbalanced: Several successful attacks [

Even when the code is balanced—a recommended secure coding practice—some residual unbalances in timing can result from the hardware which executes the code. Indeed, processors implement speed optimization mechanisms such as memory caching and out-of-order execution. As a consequence, it is not possible to predict with certainty how timing leaks information. The attacker is then led to make predictions about the way the device leaks.

In this paper, we consider side-channel attacks that are performed in two phases:

1) a profiling phase where the attacker accumulates leakage from a device with a known secret key;

2) an attacking phase where the attacker accumulates leakage from the device with an unknown secret key.

This type of attack is known as a template attack [

The first proposed profiled timing attack is the seminal timing attack of Kocher [^{1}.

In this paper, as illustrated in

The discrete nature of timing leakage leads to an empty bin issue which appears when a data value in the attacking phase has never been seen during the profiling phase. Based on profiling only, this data should have a zero probability, which can be devastating for the attack. One known workaround is to use kernel distribution methods [

1) Contributions: In this paper, we show that even when none of the abovementioned requirements (1), (2), and (3) is met, timing attacks with incomplete profiling can be carried out successfully by adapting the maximum likelihood distinguisher and keeping the histogram method for probability estimation. We build six different distinguishers, each of which addresses the empty bin issue. For some of them, new histograms are built, such that the empty bin issue totally disappears. Furthermore, we compare these distinguishers and show which one is the best in each specific context. We underline that, in practice, for a moderate profiling with 256,000 offline measurements, the soft drop and the combined offline-online profiling approaches are clearly the two best strategies: the AES key is typically extracted with only about 2000 online measurements, i.e., a complete break in about 0.2 ms. Finally, we provide some theoretical results proving how optimal some of the distinguishers can be.

| Profiling method | Reference articles |
|---|---|
| Moments | [ |
| Distributions | Our paper (caution about empty bins) |

2) Organization: The paper is organized according to the following structure. Section 2 provides mathematical tools to understand distinguishers and notations. Section 3 introduces new distinguishers that are suitable in the context of empty bins. Section 4 provides simulations for these distinguishers and Section 5 investigates real attacks on an ARM processor. Interestingly, all proposed distinguishers work, albeit with very noticeably different performances. In Section 6, some interpolations of the obtained results in the presence of external measurement noise are derived. Section 7 concludes.

We consider a side-channel attack with a profiling stage and use the following notations:

• During the profiling phase, a vector $\hat{t}$ of $\hat{q}$ text bytes is sent and the profiler gathers a vector $\hat{x}$ of leakage measurements;

• During the attacking phase, a vector $\tilde{t}$ of $\tilde{q}$ text bytes is sent and the attacker gathers a vector $\tilde{x}$ of leakage measurements, also customarily known as traces;

• We use the simplified notations $t$, $q$ and $x$ when discussing either profiling data or attacking data;

• The probability of a vector $x$ with i.i.d. components $x_i$ is denoted by $\mathbb{P}(x) = \prod_i \mathbb{P}(x_i)$;

• We define the following sets:

1) $\hat{\mathcal{X}}$, $\hat{\mathcal{T}}$, $\tilde{\mathcal{X}}$ and $\tilde{\mathcal{T}}$ are the sets of possible values of the components of $\hat{x}$, $\hat{t}$, $\tilde{x}$ and $\tilde{t}$, respectively;

2) $\mathcal{X} = \hat{\mathcal{X}} \cup \tilde{\mathcal{X}}$ and $\mathcal{T} = \hat{\mathcal{T}} \cup \tilde{\mathcal{T}}$;

3) $\mathcal{K}$ is the set of all possible values for the key $k$.

• $k$ and $t$ are made of $n$ bits (in particular, they are "bytes" when $n = 8$).

Here all sample components of one vector are i.i.d. and belong to some discrete set. Typically, $\mathcal{X}$ is a finite subset of $\mathbb{N}$ and $\mathcal{T}$ is equal to $\{0,1\}^n$.

In the profiling stage, the secret key $\hat{k}^*$ is known and variable. In the attacking phase, the secret key $\tilde{k}^*$ is unknown but fixed. Further, we assume that $x_i$ depends only on $t_i$ and $k^*$ for all $i = 1, 2, \ldots, q$, in the form:

$$x_i = \psi(t_i \oplus k^*) \qquad (i = 1, 2, \ldots, q) \tag{1}$$

where ⊕ is the XOR (exclusive or) operator and ψ is an unknown function which may contain noise, masking and other hidden parameters^{2}.

Furthermore, in this paper, we use the notation $n_{x,t}$ to denote the number of occurrences of $(x, t)$. Thus we can write

$$\hat{n}_{x,t} = \sum_{i=1}^{\hat{q}} \mathbb{1}_{\hat{x}_i = x,\, \hat{t}_i = t}, \qquad \hat{n}_x = \sum_{i=1}^{\hat{q}} \mathbb{1}_{\hat{x}_i = x}, \tag{2}$$

$$\tilde{n}_{x,t} = \sum_{i=1}^{\tilde{q}} \mathbb{1}_{\tilde{x}_i = x,\, \tilde{t}_i = t}, \qquad \tilde{n}_x = \sum_{i=1}^{\tilde{q}} \mathbb{1}_{\tilde{x}_i = x}, \tag{3}$$

where $\mathbb{1}_A = 1$ if $A$ is true and $0$ otherwise.

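Definition 1 (Probabilities). We define three^{3} different types of probabilities $\mathbb{P}$, $\hat{\mathbb{P}}$ and $\tilde{\mathbb{P}}$. $\mathbb{P}$ is the actual (real) underlying probability distribution, but it is generally not available and has to be estimated by either $\hat{\mathbb{P}}$ or $\tilde{\mathbb{P}}$.

• $\hat{\mathbb{P}}$ is computed using the profiling data:

$$\hat{\mathbb{P}}(x, t) = \frac{1}{\hat{q}} \sum_{i=1}^{\hat{q}} \mathbb{1}_{\hat{x}_i = x,\, \hat{t}_i = t} = \frac{\hat{n}_{x,t}}{\hat{q}}, \tag{4}$$

$$\hat{\mathbb{P}}(x) = \frac{1}{\hat{q}} \sum_{i=1}^{\hat{q}} \mathbb{1}_{\hat{x}_i = x} = \frac{\hat{n}_x}{\hat{q}}. \tag{5}$$

• $\tilde{\mathbb{P}}$ is computed using the attacking data:

$$\tilde{\mathbb{P}}(x, t) = \frac{1}{\tilde{q}} \sum_{i=1}^{\tilde{q}} \mathbb{1}_{\tilde{x}_i = x,\, \tilde{t}_i = t} = \frac{\tilde{n}_{x,t}}{\tilde{q}}, \tag{6}$$

$$\tilde{\mathbb{P}}(x) = \frac{1}{\tilde{q}} \sum_{i=1}^{\tilde{q}} \mathbb{1}_{\tilde{x}_i = x} = \frac{\tilde{n}_x}{\tilde{q}}. \tag{7}$$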

In practice, as the secret key leaks through the function via a XOR (Equation (1)), we shall often consider ℙ ( x , t ⊕ k ) .
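The histogram estimates of Definition 1 amount to counting pairs and dividing by the number of samples. The following is a minimal sketch of that computation; the function name `empirical_joint` and the toy timing values are illustrative assumptions, not taken from the paper.

```python
from collections import Counter

def empirical_joint(xs, ts):
    """Estimate P(x, t) as n_{x,t} / q from paired samples (Eq. (4) or (6))."""
    q = len(xs)
    counts = Counter(zip(xs, ts))  # n_{x,t}: occurrences of each pair
    return {pair: n / q for pair, n in counts.items()}

# Hypothetical profiling data: timings x and text values t.
x_hat = [10, 11, 10, 12, 11, 10]
t_hat = [0, 0, 1, 1, 0, 1]
p_hat = empirical_joint(x_hat, t_hat)

print(p_hat[(10, 1)])            # n_{10,1} / q = 2 / 6
print(p_hat.get((12, 0), 0.0))   # pair never observed: estimated probability 0
```

A pair that was never observed simply has no entry in the dictionary, which is exactly the situation that later gives rise to empty bins.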

For a fair comparison between distinguishers, Standaert et al. [

Definition 2 (Success Rate). The success rate SR is the probability, averaged over all possible keys, of obtaining the correct key:

$$\mathrm{SR} = \frac{1}{2^n} \sum_{k^* = 0}^{2^n - 1} \mathbb{P}_{k^*}(\tilde{k} = k^*), \tag{8}$$

where $\tilde{k}$ is the key guess obtained by the distinguisher during the attack.

It has been proven ( [

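$$\mathcal{D}_{\text{Optimal}}(\tilde{x}, \tilde{t}) = \arg\max_{k \in \mathcal{K}} \mathbb{P}(\tilde{x} \mid \tilde{t} \oplus k). \tag{9}$$

In Equation (9), we use the "arg max" operator, which is defined as follows: for a function $f : \mathcal{K} \to \mathbb{R}$,

$$\arg\max_{k \in \mathcal{K}} f(k) = \{ k \in \mathcal{K} \text{ such that } \forall k' \in \mathcal{K},\ f(k) \ge f(k') \}.$$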

In real life, however, the attacker does not know the leakage model perfectly and thus $\mathbb{P}(\tilde{x} \mid \tilde{t} \oplus k)$ is not available. In order to get an estimation of $\mathbb{P}$, we use the profiling data to build $\hat{\mathbb{P}}$ defined in Equation (4). This is the classical template attack. The distinguisher becomes

$$\mathcal{D}_{\text{Template}}(\tilde{x}, \tilde{t}) = \arg\max_{k \in \mathcal{K}} \hat{\mathbb{P}}(\tilde{x} \mid \tilde{t} \oplus k). \tag{10}$$

This distinguisher is no longer optimal as it does not use the real distribution $\mathbb{P}$. However, if profiling tends to exhaustivity, $\hat{\mathbb{P}}$ and $\mathbb{P}$ will be very close since, by the law of large numbers,

$$\forall x, t \qquad \hat{\mathbb{P}}(x, t) \xrightarrow[\hat{q} \to \infty]{} \mathbb{P}(x, t). \tag{11}$$

Moreover, we notice that non-optimality is not the only issue with template attacks in the context of discrete leakage. The attacker also faces the problem that the attack is ill-formed. In practice, it is convenient to use the logarithm $\arg\max_{k \in \mathcal{K}} \log \hat{\mathbb{P}}(\tilde{x} \mid \tilde{t} \oplus k)$. Notice that the base of the logarithm is arbitrary, as all key hypotheses scale alike when switching bases. In fact, since the samples are i.i.d., we have

$$\mathbb{P}(\tilde{x} \mid \tilde{t} \oplus k) = \prod_{i=1}^{\tilde{q}} \mathbb{P}(\tilde{x}_i \mid \tilde{t}_i \oplus k) \tag{12}$$

and

$$\hat{\mathbb{P}}(\tilde{x} \mid \tilde{t} \oplus k) = \prod_{i=1}^{\tilde{q}} \hat{\mathbb{P}}(\tilde{x}_i \mid \tilde{t}_i \oplus k). \tag{13}$$

Therefore, the attacker computes

$$\mathcal{D}_{\text{Template}}(\tilde{x}, \tilde{t}) = \arg\max_{k \in \mathcal{K}} \sum_{i=1}^{\tilde{q}} \log \hat{\mathbb{P}}(\tilde{x}_i \mid \tilde{t}_i \oplus k), \tag{14}$$

where the logarithm is used to transform products into sums for a more reliable computation. However, we would like to avoid empty bins for which $\hat{\mathbb{P}}(\tilde{x}_i \mid \tilde{t}_i \oplus k) = 0$; otherwise, Equation (14) would not be well defined.

The empty bin issue appears when there exist $i \in \{1, \ldots, \tilde{q}\}$ and $k \in \mathcal{K}$ such that $\tilde{\mathbb{P}}(\tilde{x}_i \mid \tilde{t}_i \oplus k) > 0$ and $\hat{\mathbb{P}}(\tilde{x}_i \mid \tilde{t}_i \oplus k) = 0$. This may even happen for the correct key hypothesis, leading to a wrong key guess during the attack.
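To make the problem concrete, here is a small sketch of the template log-likelihood of Equation (14) on toy data. The helper names, the one-bit texts and the timing values are all hypothetical; the profiled histogram is indexed by the class $y = t \oplus k$.

```python
import math
from collections import Counter

def conditional_hist(xs, ys):
    """Estimate P(x | y) from profiling pairs as n_{x,y} / n_y."""
    joint = Counter(zip(xs, ys))
    marg = Counter(ys)
    return {(x, y): n / marg[y] for (x, y), n in joint.items()}

def template_scores(p_hat, x_att, t_att, keys):
    """Log-likelihood of Eq. (14) for each key guess; -inf on an empty bin."""
    scores = {}
    for k in keys:
        s = 0.0
        for x, t in zip(x_att, t_att):
            p = p_hat.get((x, t ^ k), 0.0)
            s += math.log(p) if p > 0 else -math.inf
        scores[k] = s
    return scores

# Hypothetical profiling data, grouped by class y = t XOR (known key).
p_hat = conditional_hist([10, 10, 11, 20, 21, 21], [0, 0, 0, 1, 1, 1])

# Attack traces produced under key 1; the timing 12 was never profiled.
scores = template_scores(p_hat, x_att=[21, 10, 12], t_att=[0, 1, 1], keys=[0, 1])
print(scores)  # every key guess, including the correct one, scores -inf
```

In this toy run a single unseen timing value is enough to give the correct key the same score as the wrong one, so the arg max carries no information.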

^{4}. We notice that some parts of the histograms are left blank; some of them are indicated by arrows labeled as "holes" in the figures. These timing values x are possible "empty bins". When such a hole is hit during the attack, meaning that the attacker gets a trace whose value corresponds to a hole, we call this an empty bin. Notice that no additional "binning" is needed, unlike in the case of continuous distributions. The figures also show that the noise is not Gaussian, as can be observed from the shape of the distribution.

The shortcoming of empty bins can be seen when evaluating the likelihood. The attacker encounters a zero probability, which makes the product vanish for the probability of a given key guess, even if many traces are used. As we wrote earlier, the empty bin may appear even for the correct key guess in template attacks, leading to a null success rate if not taken into account and not well treated. As an example, the number of empty bins for the practical example presented in Section 5 for the correct key guess is around 500 for a poor learning phase ("poor" in that the amount of training data is limited) and around 50 for a good learning phase. This multiplication by zero is not inherent to the attack; it is rather a profiling artifact. In fact, with more profiling traces, the empty bin would likely be populated. Thus, the empty bin issue is a mere side-effect of insufficient profiling, which results in an attack failure if it is encountered in the computation of the likelihood of the correct key.

Before presenting the novel distinguishers in Subsection 3.2, we need to define yet another type of distribution known as a Dirichlet a posteriori in a Bayesian approach.

The Dirichlet A Posteriori: In order to avoid zero probabilities, we use a method based on Dirichlet Prior calculations ( [

Let $\mathcal{X}$ be the set of possible values for $x$ and $\mathcal{T}$ be the set of possible values for $t$. For any $(x, t)$, we denote by $p_{x,t} = \mathbb{P}(x, t)$ their joint probability and set $p = (p_{x,t})_{x,t}$. Prior to obtaining any trace, $p_{x,t}$ is completely unknown and we consider a Bayesian approach to estimate $p_{x,t}$.

1) We consider the following a priori: without further information, we suppose that for all $(x, t)$,

$$\bar{\mathbb{P}}_\alpha(x, t) = \frac{\alpha_{x,t}}{\sum_{x',t'} \alpha_{x',t'}},$$

where $\alpha_{x,t} > 0$ is an a priori parameter. To simplify, we may choose $\alpha_{x,t} = \alpha$ constant for all $(x, t)$. Let us suppose that $p$ follows a Dirichlet (prior) distribution, whose probability density function is

$$f(p) = \frac{\Gamma\left(\sum_{x,t} \alpha_{x,t}\right)}{\prod_{x,t} \Gamma(\alpha_{x,t})} \prod_{x,t} p_{x,t}^{\alpha_{x,t} - 1}, \tag{15}$$

where $\Gamma$ is the Gamma function defined for $x > 0$ as

$$\Gamma(x) = \int_0^{+\infty} t^{x-1} e^{-t} \, \mathrm{d}t. \tag{16}$$

The Dirichlet distribution can also be written as

$$f(p) = N_\alpha \prod_{x,t} p_{x,t}^{\alpha_{x,t} - 1}, \tag{17}$$

where $N_\alpha = \dfrac{\Gamma\left(\sum_{x,t} \alpha_{x,t}\right)}{\prod_{x,t} \Gamma(\alpha_{x,t})}$ is a normalization factor. Notice that the prior distribution is uniform when $\alpha_{x,t} = \alpha = 1$ for all $(x, t)$.

2) Then suppose we know $\hat{x}$, $\tilde{x}$, $\hat{t}$ and $\tilde{t}$. We can now compute the a posteriori probability

$$\mathbb{P}(x, t \mid \hat{x}, \tilde{x}, \hat{t}, \tilde{t}) = \int f(p, x, t \mid \hat{x}, \tilde{x}, \hat{t}, \tilde{t}) \, \mathrm{d}p.$$

By Bayes' rule,

$$f(p, x, t \mid \hat{x}, \tilde{x}, \hat{t}, \tilde{t}) = \mathbb{P}(x, t \mid p, \hat{x}, \tilde{x}, \hat{t}, \tilde{t}) \, f(p \mid \hat{x}, \tilde{x}, \hat{t}, \tilde{t}).$$

As components $x_i$ and $t_i$ are i.i.d., we can write

$$f(p, x, t \mid \hat{x}, \tilde{x}, \hat{t}, \tilde{t}) = \mathbb{P}(x, t \mid p) \cdot f(p \mid \hat{x}, \tilde{x}, \hat{t}, \tilde{t}) = p_{x,t} \cdot f(p \mid \hat{x}, \tilde{x}, \hat{t}, \tilde{t}).$$

Again by Bayes' rule,

$$\begin{aligned}
f(p \mid \hat{x}, \tilde{x}, \hat{t}, \tilde{t})
&= \frac{\mathbb{P}(\hat{x}, \tilde{x}, \tilde{t}, \hat{t} \mid p) \, f(p)}{\mathbb{P}(\hat{x}, \tilde{x}, \tilde{t}, \hat{t})} \\
&= \frac{\prod_{(x',t') \in \mathcal{X} \times \mathcal{T}} p_{x',t'}^{\hat{n}_{x',t'} + \tilde{n}_{x',t'}}}{\mathbb{P}(\hat{x}, \tilde{x}, \tilde{t}, \hat{t})} \, f(p) \\
&= \frac{N_\alpha}{\mathbb{P}(\hat{x}, \tilde{x}, \tilde{t}, \hat{t})} \prod_{(x',t') \in \mathcal{X} \times \mathcal{T}} p_{x',t'}^{\hat{n}_{x',t'} + \tilde{n}_{x',t'} + \alpha_{x',t'} - 1}.
\end{aligned}$$

We recognize another Dirichlet distribution with parameters $\hat{n}_{x',t'} + \tilde{n}_{x',t'} + \alpha_{x',t'}$. Let

$$N'_\alpha = \frac{\Gamma\left(\sum_{x',t'} \left(\hat{n}_{x',t'} + \tilde{n}_{x',t'} + \alpha_{x',t'}\right)\right)}{\prod_{x',t'} \Gamma\left(\hat{n}_{x',t'} + \tilde{n}_{x',t'} + \alpha_{x',t'}\right)}$$

be the new normalization constant for this distribution. We finally obtain

$$f(p, x, t \mid \hat{x}, \tilde{x}, \hat{t}, \tilde{t}) = p_{x,t} \cdot N'_\alpha \prod_{(x',t') \in \mathcal{X} \times \mathcal{T}} p_{x',t'}^{\hat{n}_{x',t'} + \tilde{n}_{x',t'} + \alpha_{x',t'} - 1}.$$

Therefore,

$$\mathbb{P}(x, t \mid \hat{x}, \tilde{x}, \hat{t}, \tilde{t}) = \int p_{x,t} \cdot N'_\alpha \prod_{(x',t') \in \mathcal{X} \times \mathcal{T}} p_{x',t'}^{\hat{n}_{x',t'} + \tilde{n}_{x',t'} + \alpha_{x',t'} - 1} \, \mathrm{d}p,$$

which is known as the Dirichlet a posteriori.

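3) The integral can be easily expressed in terms of the Gamma function:

$$\mathbb{P}(x, t \mid \hat{x}, \tilde{x}, \hat{t}, \tilde{t}) = \frac{\Gamma\left(\sum_{x',t'} \left(\alpha_{x',t'} + \hat{n}_{x',t'} + \tilde{n}_{x',t'}\right)\right)}{\prod_{x',t'} \Gamma\left(\alpha_{x',t'} + \hat{n}_{x',t'} + \tilde{n}_{x',t'}\right)} \times \frac{\prod_{x',t'} \Gamma\left(\alpha_{x',t'} + \hat{n}_{x',t'} + \tilde{n}_{x',t'} + \delta_{(x',t'),(x,t)}\right)}{\Gamma\left(\sum_{x',t'} \left(\alpha_{x',t'} + \hat{n}_{x',t'} + \tilde{n}_{x',t'} + \delta_{(x',t'),(x,t)}\right)\right)},$$

where $\delta_{(x',t'),(x,t)} = 1$ if $(x', t') = (x, t)$ and $0$ otherwise. Using $\Gamma(z + 1) = z \, \Gamma(z)$, this simplifies to

$$\mathbb{P}(x, t \mid \hat{x}, \tilde{x}, \hat{t}, \tilde{t}) = \frac{\hat{n}_{x,t} + \tilde{n}_{x,t} + \alpha_{x,t}}{\hat{q} + \tilde{q} + \sum_{x',t'} \alpha_{x',t'}}.$$

This new distribution will now be denoted by

$$\bar{\mathbb{P}}_\alpha(x, t) = \mathbb{P}(x, t \mid \hat{x}, \tilde{x}, \hat{t}, \tilde{t}) = \frac{\hat{n}_{x,t} + \tilde{n}_{x,t} + \alpha_{x,t}}{\hat{q} + \tilde{q} + \sum_{x',t'} \alpha_{x',t'}}. \tag{18}$$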

It is important to notice that for all ( x , t ) ∈ X × T , one has ℙ ¯ α ( x , t ) > 0 . In other words, ℙ ¯ α has no empty bin issue.

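4) With $\bar{\mathbb{P}}_\alpha(x, t)$ we can calculate

$$\bar{\mathbb{P}}_\alpha(t) = \sum_x \bar{\mathbb{P}}_\alpha(x, t) = \sum_x \frac{\hat{n}_{x,t} + \tilde{n}_{x,t} + \alpha_{x,t}}{\hat{q} + \tilde{q} + \sum_{x',t'} \alpha_{x',t'}} = \frac{\hat{n}_t + \tilde{n}_t + \sum_x \alpha_{x,t}}{\hat{q} + \tilde{q} + \sum_{x',t'} \alpha_{x',t'}} = \frac{\hat{n}_t + \tilde{n}_t + \alpha_t}{\hat{q} + \tilde{q} + \sum_{t'} \alpha_{t'}},$$

where $\alpha_t = \sum_x \alpha_{x,t}$. The resulting conditional probability^{5} is

$$\bar{\mathbb{P}}_\alpha(x \mid t) = \frac{\bar{\mathbb{P}}_\alpha(x, t)}{\bar{\mathbb{P}}_\alpha(t)} = \frac{\hat{n}_{x,t} + \tilde{n}_{x,t} + \alpha_{x,t}}{\hat{n}_t + \tilde{n}_t + \alpha_t}. \tag{19}$$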
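Equation (19) is an additive smoothing of the pooled counts. The sketch below evaluates it for a constant $\alpha$; the function name, the argument layout (lists of `(x, t)` pairs) and the toy values are assumptions made for illustration only.

```python
from collections import Counter

def dirichlet_conditional(prof_pairs, att_pairs, x_values, alpha=1.0):
    """Return p(x, t) implementing Eq. (19) with alpha_{x,t} = alpha for all pairs."""
    n = Counter(prof_pairs) + Counter(att_pairs)       # n_hat_{x,t} + n_tilde_{x,t}
    n_t = (Counter(t for _, t in prof_pairs)
           + Counter(t for _, t in att_pairs))         # n_hat_t + n_tilde_t
    alpha_t = alpha * len(x_values)                    # alpha_t = sum over x of alpha_{x,t}

    def p(x, t):
        return (n[(x, t)] + alpha) / (n_t[t] + alpha_t)

    return p

# Hypothetical data: three profiling pairs and one attack pair for t = 0.
prof = [(10, 0), (10, 0), (11, 0)]
att = [(12, 0)]
p_bar = dirichlet_conditional(prof, att, x_values=[10, 11, 12, 13], alpha=1.0)

print(p_bar(10, 0))  # (2 + 1) / (4 + 4) = 0.375
print(p_bar(13, 0))  # never observed, yet strictly positive: 1 / 8 = 0.125
```

Because every count is shifted by $\alpha > 0$, no value of $x$ in the chosen support can receive a zero probability, and the values still sum to one over that support.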

The Learned MIA Model: When $\hat{q}$ is small, the model cannot be profiled accurately, and $\hat{\mathbb{P}}$ is a bad approximation of $\mathbb{P}$. However, the profiled values $\hat{x}$ and $\hat{t}$ can still be useful, yet they require a more robust distinguisher.

Distinguishers that compute models using profiling have already been proposed. For example, [

Since our function $\psi$ is unknown, we can create a first-order model $\hat{\psi}$ with the profiled data as

$$\hat{\psi}(t \oplus \hat{k}^*) = \mathrm{Step}\left( \frac{1}{n_t} \sum_{i \text{ s.t. } \hat{t}_i = t} \hat{x}_i \right) \qquad (\forall t \in \mathcal{T}). \tag{20}$$

The Step function is a function that ensures the non-injectivity of the model. The simplest way to define Step is the following:

$$\mathrm{Step}(x) = \frac{\lfloor d \cdot x \rfloor}{d} \qquad (x \in \mathbb{R})$$

where d > 0 —the greater d, the smaller the step size. This parameter d has to be small enough in order to make the model non-injective ( [

In this subsection, we present six distinguishers that tackle null probabilities. Some of these solutions seem quite obvious while others are deduced from the notions presented in the preceding Subsection 3.1.

① Hard Drop Distinguisher: The first naive method consists in removing all the traces which, for any key guess, have a zero probability.

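Definition 3 (Hard Drop Distinguisher). The hard drop distinguisher is defined as follows:

$$\mathcal{D}_{\text{Hard}}(\tilde{x}, \tilde{t}) = \arg\max_{k \in \mathcal{K}} \sum_{i \in I} \log \hat{\mathbb{P}}(\tilde{x}_i \mid \tilde{t}_i \oplus k), \tag{21}$$

where the set $I$ is defined as

$$I = \left\{ i \in \{1, \ldots, \tilde{q}\} \;\middle|\; \forall k \in \mathcal{K},\ \hat{\mathbb{P}}(\tilde{x}_i \mid \tilde{t}_i \oplus k) > 0 \right\}. \tag{22}$$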

Recall that ℙ ^ , defined in Equation (4), is an empirical histogram estimated on profiled data x ^ (along with corresponding texts t ^ ).

The Hard Drop Distinguisher, as the name indicates, drops some data. In very noisy cases, it may even drop most of the data.
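A minimal sketch of Equations (21) and (22) follows. The profiled histogram `p_hat` (indexed by timing and class $t \oplus k$), the one-bit key space and the trace values are hypothetical toy data chosen to show how much information the drop can discard.

```python
import math

def hard_drop(p_hat, x_att, t_att, keys):
    """Eq. (21)-(22): keep only traces with nonzero probability under every key guess."""
    kept = [(x, t) for x, t in zip(x_att, t_att)
            if all(p_hat.get((x, t ^ k), 0.0) > 0 for k in keys)]
    scores = {k: sum(math.log(p_hat[(x, t ^ k)]) for x, t in kept) for k in keys}
    return max(scores, key=scores.get), len(kept)

# Hypothetical profiled conditional histogram P_hat(x | y), with y = t XOR k.
p_hat = {(10, 0): 0.7, (11, 0): 0.3, (11, 1): 0.4, (12, 1): 0.6}

# Four attack traces generated under key 0.
guess, n_kept = hard_drop(p_hat, x_att=[10, 11, 12, 11], t_att=[0, 0, 1, 0], keys=[0, 1])
print(guess, n_kept)  # 1 2
```

In this toy case the two traces that are impossible under the wrong key are exactly the ones that get dropped, so only two ambiguous traces remain and the wrong key wins.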

② Soft Drop Distinguisher: The second possibility is to drop values only for some keys. However, it has to be done carefully because dropping a value in a product implicitly implies a probability value of one. For this reason, instead of removing the trace, we replace the zero probability by a constant which is smaller than the smallest probability.

Definition 4 (Soft Drop Distinguisher). We define the Soft Drop Distinguisher as

$$\mathcal{D}_{\text{Soft}}(\tilde{x}, \tilde{t}) = \arg\max_{k \in \mathcal{K}} \sum_{i \text{ s.t. } \hat{\mathbb{P}}(\tilde{x}_i \mid \tilde{t}_i \oplus k) > 0} \log \hat{\mathbb{P}}(\tilde{x}_i \mid \tilde{t}_i \oplus k) + \sum_{i \text{ s.t. } \hat{\mathbb{P}}(\tilde{x}_i \mid \tilde{t}_i \oplus k) = 0} \log \gamma, \tag{23}$$

where $\gamma \in \mathbb{R}_+^*$ is a constant such that $\gamma \le \hat{\mathbb{P}}(\tilde{x}_i \mid \tilde{t}_i \oplus k)$ for all $(i, k) \in \{1, \ldots, \tilde{q}\} \times \mathcal{K}$ for which this probability is nonzero. This means that we penalize data with zero probability. The smaller $\gamma$, the harder the penalty.

The choice of parameter $\gamma$ is thus important in order to get a fair result for the distinguisher. If we choose $\gamma \ge 1/\hat{q}$, the penalty may be greater than the smallest strictly positive probability. This case would mean that the penalty is less important than some licit probabilities. On the other hand, choosing $\gamma$ smaller than $1/\hat{q}$ means a very strong penalty. In this case, the limit when $\gamma \to 0$ is a distinguisher for which only the number of empty bins really matters. This leads to the Empty Bin Distinguisher presented next in Definition 8.
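The following sketch implements Equation (23) on toy data. The histogram `p_hat` (indexed by timing and class $t \oplus k$), the value `gamma = 0.01` and the traces are hypothetical and only serve to show the mechanics of the penalty.

```python
import math

def soft_drop(p_hat, x_att, t_att, keys, gamma):
    """Eq. (23): empty bins contribute log(gamma) instead of log(0)."""
    scores = {}
    for k in keys:
        s = 0.0
        for x, t in zip(x_att, t_att):
            p = p_hat.get((x, t ^ k), 0.0)
            s += math.log(p) if p > 0 else math.log(gamma)
        scores[k] = s
    return max(scores, key=scores.get), scores

# Hypothetical profiled conditional histogram P_hat(x | y), with y = t XOR k.
p_hat = {(10, 0): 0.7, (11, 0): 0.3, (11, 1): 0.4, (12, 1): 0.6}

# Four attack traces generated under key 0; gamma is below every nonzero bin.
guess, scores = soft_drop(p_hat, [10, 11, 12, 11], [0, 0, 1, 0], keys=[0, 1], gamma=0.01)
print(guess)   # 0
print(scores)  # key 1 is penalized twice by log(gamma)
```

Every trace stays in the sum, so the two traces that are impossible under the wrong key now count against it instead of being discarded.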

③ The Dirichlet Prior Distinguisher: The Dirichlet Prior Distinguisher uses the Dirichlet a posteriori distributions presented in Subsection 3.1.

Definition 5 (The Dirichlet Distinguisher). We define the Dirichlet Distinguisher as:

$$\mathcal{D}_{\text{Dirichlet}}(\tilde{x}, \tilde{t}) = \arg\max_{k \in \mathcal{K}} \bar{\mathbb{P}}_\alpha(\tilde{x} \mid \tilde{t} \oplus k). \tag{24}$$

Remark 1. As can be seen in the construction of the Dirichlet a posteriori, the Dirichlet distinguisher is α-dependent. It is important to evaluate the influence of α on the success rate. In practice, α = 1 seems a natural choice since the corresponding prior is uniform, which minimizes the impact of the a priori. In contrast, another value of α like 1/2 can be interpreted as an a priori bin count. We may also consider scenarios where α ≈ 0 to have the least possible impact on the modified values of the histogram.

④ Offline-Online Profiling: The Dirichlet Prior Distinguisher is set by α . As we discussed in Remark 1, we can choose any α so long as it is strictly positive (the Dirichlet distribution would not be defined if α = 0 ). However, it is interesting to study its asymptotical behavior as α vanishes:

$$\lim_{\alpha \to 0} \bar{\mathbb{P}}_\alpha(x \mid t) = \frac{\hat{n}_{x,t} + \tilde{n}_{x,t}}{\hat{n}_t + \tilde{n}_t}.$$

This distribution can be denoted as $\bar{\mathbb{P}}_0(x \mid t)$ and resembles a profiling stage that would start offline and continue online.

Definition 6 (Offline-Online Profiling). The Offline-Online Profiled (OOP) distinguisher is defined as:

$$\mathcal{D}_{\text{OOP}}(\tilde{x}, \tilde{t}) = \arg\max_{k \in \mathcal{K}} \bar{\mathbb{P}}_0(\tilde{x} \mid \tilde{t} \oplus k). \tag{25}$$

The OOP distinguisher seems simpler than the Dirichlet prior distinguisher since $\alpha$ is no longer in use. Of course, it also solves the empty bin issue since for all $(x, t) \in \mathcal{X} \times \mathcal{T}$, one has $\bar{\mathbb{P}}_0(x, t) > 0$.

⑤ Learned MIA Distinguisher: The Learned MIA Distinguisher is constructed with the profiled model function ψ ^ presented in Equation (20) of Subsection 3.1.

Definition 7 (The Learned MIA Distinguisher). The Learned MIA Distinguisher is defined as:

$$\mathcal{D}_{\text{MIA\_Learned}} = \arg\max_{k \in \mathcal{K}} \tilde{I}\left(\tilde{x};\, \hat{\psi}(\tilde{t} \oplus k)\right), \tag{26}$$

where I ˜ is the empirical mutual information [

⑥ Empty Bin Distinguisher: The empty bin Distinguisher is yet another intuitive solution based on the idea that instead of avoiding null probabilities, we may take only these into account. It is the key guess with the least number of null probabilities that “should” be the correct key.

Definition 8. The Empty Bin Distinguisher is defined as:

$$\mathcal{D}_{\text{Empty\_Bin}}(\tilde{x}, \tilde{t}) = \arg\min_{k \in \mathcal{K}} \sum_{i=1}^{\tilde{q}} \mathbb{1}_{\hat{\mathbb{P}}(\tilde{x}_i \mid \tilde{t}_i \oplus k) = 0}. \tag{27}$$

The Empty Bin Distinguisher assumes that missing data contain more information than actual (measured) data. More precisely, a drop should normally not happen unless the guessed key is wrong; hence, the key guess with the fewest drops should be the correct key. Obviously, this distinguisher is not effective anymore if no drop occurs for at least two key guesses.
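As a sketch of this idea, the code below counts, for each key guess, the attack traces that fall into an empty bin and selects the guess with the fewest of them, as described in the text. The histogram `p_hat` (indexed by timing and class $t \oplus k$) and the traces are hypothetical toy data.

```python
def empty_bin(p_hat, x_att, t_att, keys):
    """Count empty-bin hits per key guess and return the guess with the fewest."""
    counts = {k: sum(1 for x, t in zip(x_att, t_att)
                     if p_hat.get((x, t ^ k), 0.0) == 0)
              for k in keys}
    return min(counts, key=counts.get), counts

# Hypothetical profiled conditional histogram P_hat(x | y), with y = t XOR k.
p_hat = {(10, 0): 0.7, (11, 0): 0.3, (11, 1): 0.4, (12, 1): 0.6}

# Four attack traces generated under key 0.
guess, counts = empty_bin(p_hat, [10, 11, 12, 11], [0, 0, 1, 0], keys=[0, 1])
print(guess, counts)  # 0 {0: 0, 1: 2}
```

Note that the probability values themselves are ignored; only whether a bin was populated during profiling matters.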

Further Remarks: All these distinguishers use a profiling phase. Before comparing them, we discuss a priori their respective efficiency. As the Hard Drop Distinguisher discards some data, we may suppose that it will have the lowest success rate for a given number of traces. The OOP Distinguisher takes into account two types of data: profiling and attacking data. Therefore, it should be more efficient than the other distinguishers. Lastly, we build the Learned MIA Distinguisher in order to be robust to model errors, such as inaccurate profiling. In that case, we suppose that Learned MIA should work better when few data are available during the profiling stage.

In this section, we present the results obtained on a simulated model. With these results, we can give a comparison of the proposed distinguishers.

The simulated model is built as follows:

$$x_i = H_w\left(\mathrm{SubBytes}(t_i \oplus k^*)\right) + u_i = \phi(t_i \oplus k^*) + u_i = y_i(k^*) + u_i, \tag{28}$$

where $u_i$ is a discrete uniformly distributed noise $u_i \sim \mathcal{U}(-\sigma, \sigma)$, SubBytes is the AES substitution box function, and $H_w$ is the Hamming weight of a byte.

This very simple leakage is used to compare distinguishers in the case the attacker has no information about the model.
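The simulated leakage of Equation (28) can be sketched as follows. Since the experiments below use $n = 4$ bits, this sketch substitutes a 4-bit bijective S-box (`SBOX4`, a hypothetical stand-in for SubBytes); the function names and the seed are likewise illustrative assumptions.

```python
import random

# Hypothetical 4-bit bijective S-box used as a stand-in for SubBytes.
SBOX4 = [0xC, 0x5, 0x6, 0xB, 0x9, 0x0, 0xA, 0xD,
         0x3, 0xE, 0xF, 0x8, 0x4, 0x7, 0x1, 0x2]

def hamming_weight(v):
    """Number of bits set in v."""
    return bin(v).count("1")

def simulate(q, key, sigma, rng):
    """Draw q pairs (t_i, x_i) with x_i = Hw(S(t_i ^ key)) + u_i, u_i uniform on {-sigma..sigma}."""
    ts = [rng.randrange(16) for _ in range(q)]
    xs = [hamming_weight(SBOX4[t ^ key]) + rng.randint(-sigma, sigma) for t in ts]
    return ts, xs

rng = random.Random(0)  # fixed seed so the sketch is reproducible
ts, xs = simulate(1000, key=0x7, sigma=24, rng=rng)
print(len(xs), min(xs), max(xs))
```

Profiling data can be drawn the same way with a known key, and attack data with the key to be recovered.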

Remark 2 (Optimal Distinguisher). The optimal distinguisher (9) can be easily calculated if the model is perfectly known, as

$$\mathcal{D}_{\text{Optimal}}(\tilde{x}, \tilde{t}) = \arg\max_{k \in \mathcal{K}} \prod_{i=1}^{\tilde{q}} \delta_\sigma\left(\tilde{x}_i - H_w\left(\mathrm{SubBytes}(\tilde{t}_i \oplus k)\right)\right), \tag{29}$$

where $\delta_\sigma$ is defined such that $\delta_\sigma(x) = 1$ if $|x| \le \sigma$ and $0$ otherwise. In Figures 3-5, we include the optimal distinguisher for reference, to show how far the other curves are from the fundamental limit of performance.

By construction, the leakage simulation (28) generates some traces with zero probability, but notice that there is no i such that ℙ ( x i | t i , k ) = 0 for the correct key guess. This academic example is useful to compare the distinguishers defined in Section 3.

We computed the success rates (8) of the various attacks (namely attacks ①, ②, ④, ⑤ and ⑥—attack ③ being less efficient than its limit ④) for σ = 24 , n = 4 bits, and q ^ ranging from small to high values.

The only difference between Figures 3-5, is that we have increased the number of data during the profiling stage. When profiling is bad (

Remark 3. In this very special case, we can show that the Empty Bin Distinguisher can accurately approximate the Optimal Distinguisher. Indeed, the actual probability is such that for all $(x, t) \in \mathcal{X} \times \mathcal{T}$,

$$\mathbb{P}(x \mid y(k)) = \begin{cases} \dfrac{1}{2\sigma + 1} & \text{if } -\sigma \le x - \phi(t \oplus k) \le \sigma, \\ 0 & \text{otherwise,} \end{cases} \tag{30}$$

which is constant if $x$ is in the appropriate interval. For the Empty Bin Distinguisher,

$$\hat{\mathbb{P}}(x \mid y(k)) > 0 \implies \mathbb{P}(x \mid y(k)) = \frac{1}{2\sigma + 1}$$

due to the leakage model. Therefore, we can predict that at least

$$\hat{q} = (2\sigma + 1) \, |\mathcal{Y}| \, \frac{1}{\min_y \mathbb{P}(y)} = 3920$$

profiling traces are needed to make sure that the Empty Bin Distinguisher becomes as efficient as the Optimal Distinguisher. As profiling consists in random draws with replacement, the $\mathcal{D}_{\text{Empty\_Bin}}$ distinguisher is found very close to the $\mathcal{D}_{\text{Optimal}}$ distinguisher with $\hat{q} = 4000$ profiling traces.
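The value 3920 can be checked numerically with the parameters used in this section ($\sigma = 24$, $n = 4$). The sketch assumes that the substitution box is bijective and the texts are uniform, so that the Hamming weight $y$ follows a binomial distribution; this assumption is ours and is not stated explicitly in the text.

```python
from math import comb

sigma, n = 24, 4
noise_values = 2 * sigma + 1                      # 49 equiprobable noise values
p_y = [comb(n, w) / 2**n for w in range(n + 1)]   # P(y) for y = 0..n (assumed binomial)
bound = noise_values * len(p_y) / min(p_y)        # (2*sigma + 1) * |Y| / min P(y)
print(round(bound))  # 3920
```

Here $|\mathcal{Y}| = 5$ and $\min_y \mathbb{P}(y) = 1/16$, so the bound is $49 \times 5 \times 16 = 3920$.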

We have chosen to carry out a timing attack on an STM32F4 discovery board [

We used an STM32F4 discovery board by STMicroelectronics^{6}. It contains an STM32F407VGT6 microcontroller, which has an ARM Cortex-M4 MCU with 1 MB of flash memory for instructions and data, and 192 KB of Random Access Memory (RAM). The RAM is divided into three sections: one of 16 KB, another one of 112 KB, and the last one consisting of 64 KB of Core Coupled Memory (CCM). The CCM has a zero flash wait state and is often used to store critical data such as data from the operating system. Since the RAM is divided into three regions, users are unable to make use of the 192 KB of RAM as a continuous memory block.

STM32F4 microcontrollers contain a proprietary prefetch module (Adaptive Real-Time memory accelerator, or ART accelerator). The ART accelerator contains an instruction cache which has 64 lines and a data cache which contains 8 lines. The line size of both the instruction cache and the data cache is 128 bits. The precise details about the ART accelerator (cache replacement policy and cache associativity) are not documented, as the module is the intellectual property of STMicroelectronics.

The STM32F407VGT6 microcontroller has neither a CPU cycle counter nor a performance register to measure cycle-accurate time. However, the Data Watchpoint and Trace (DWT) unit has a cycle-accurate 32-bit counter (DWT_CYCCNT register), which can be used for measuring the duration of critical operations. When the processor runs at 168 MHz, the DWT_CYCCNT register overflows every 25.5 seconds, which gives an adversary a large enough time window to measure the encryption/decryption time without the counter overflowing. In practice, we collected timing data repeatedly within the ARM, and then dumped it sporadically as large data buffers. This modus operandi allowed us to reach about 10,000 measurements per second.
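As a quick check of the stated overflow period, a 32-bit counter incremented at 168 MHz wraps around after $2^{32}$ cycles. The variable names below are illustrative.

```python
COUNTER_BITS = 32
CLOCK_HZ = 168_000_000  # 168 MHz core clock, as stated above

overflow_seconds = 2**COUNTER_BITS / CLOCK_HZ
print(f"{overflow_seconds:.2f} s")  # about 25.57 s
```

This is consistent with the figure of roughly 25.5 seconds quoted in the text.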

We use OpenSSL (version 1.0.2) AES as the cryptographic library, where the SubBytes function is implemented with large 1 KB T-boxes (see [

to the AES computation, thereby keeping the victim circuit run at full speed, without interference.

Time deviations for different configurations of Instruction Cache (IC) and Data Cache (DC) are shown in

• This is a weakness for the security of the processor, as two different plaintexts can lead to two different numbers of clock cycles to compute AES.

• Following

• Data presented

Therefore, in a realistic context, we shall assume that both DC and IC are enabled, which we will do in the sequel (see next Section 6 for some indications of how well attacks perform when caches are disabled).

As seen earlier, when the Data Cache is enabled, the AES computation is not constant-time. This can be due to the T-boxes called during the computation. Indeed, looking up a value in a table also stores it in the Data Cache. If this value is requested again within the next eight calls, the load will be faster. In Appendix A, we have copied the OpenSSL source code for the AES encryption with a 128-bit key. In this code, we notice that there are 160 calls to the T-boxes.

In order to find a model of the leakage, we inferred the cache policy of STM32F4 ARM microcontrollers based on a thorough study of their timing response to some adaptively constructed requests. We discovered that it is actually a FIFO (First-In, First-Out) cache. If one requests a particular table lookup that was made within the last eight cache accesses, then the access is a hit (if not, it is a miss).

In case of a hit, the access is 5 or 6 clock cycles faster than a miss. To show this behaviour, we have done a very simple experiment:

• We generate a table of length 256;

• We generate 16 random values between 0x00 and 0xff;

• We call 16 elements of the table corresponding to the 16 values generated previously;

• We measure the time to call these 16 elements of the table.
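The experiment above can be mimicked in software under the FIFO model. The sketch below counts how many of the 16 lookups would be hits in an eight-entry FIFO cache. It makes the simplifying assumption that each table index occupies its own cache line, and `count_hits` is a hypothetical helper, not code from the paper.

```python
import random
from collections import deque

def count_hits(indices, lines=8):
    """Count hits for a FIFO cache holding `lines` entries."""
    cache = deque(maxlen=lines)   # when full, appending evicts the oldest entry
    hits = 0
    for idx in indices:
        if idx in cache:
            hits += 1             # hit: the FIFO order is left unchanged
        else:
            cache.append(idx)     # miss: load the entry into the cache
    return hits

rng = random.Random(1)  # fixed seed so the sketch is reproducible
lookups = [rng.randrange(256) for _ in range(16)]
print(count_hits(lookups))  # number of hits predicted by the FIFO model
```

Under this model, the predicted timing would be the all-miss time minus 5 or 6 clock cycles per hit.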

The results are plotted^{7}. We notice that when a hit occurs, the access is faster by 5 or 6 clock cycles. For two hits, there are three possible gains: 10, 11 or 12 clock cycles.

Very interestingly, we can observe in this figure high density levels corresponding to the hits:

1) One hit at −5 and −6;

2) Two hits at −10 and −11;

3) Three hits at −15 and −16.

Below −16 clock cycles, the hits are lost into the noise.

The comparison of these two figures shows that the FIFO model for table hits is correct, but does not explain all the timing leakage due to the cache policy of the processor.

As already noticed above, the leakage model is mostly unknown. We only suppose that the text byte is mixed with the key through a XOR operation. As a consequence, the optimal distinguisher (giving the limit of performance) is not known. The SNR of the leakage is $\mathrm{Var}(\mathbb{E}(x \mid t)) / \mathbb{E}(\mathrm{Var}(x \mid t)) = 0.4$.
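The SNR defined above can be estimated from measured pairs $(x_i, t_i)$ by grouping the traces per text value. The following sketch uses population variances and weights each group by its empirical frequency; the function name `snr` and the toy data are illustrative assumptions.

```python
from collections import defaultdict
from statistics import mean, pvariance

def snr(xs, ts):
    """Estimate Var(E(x|t)) / E(Var(x|t)) from paired samples."""
    groups = defaultdict(list)
    for x, t in zip(xs, ts):
        groups[t].append(x)
    q = len(xs)
    weights = {t: len(g) / q for t, g in groups.items()}   # empirical P(t)
    cond_mean = {t: mean(g) for t, g in groups.items()}    # E(x | t)
    cond_var = {t: pvariance(g) for t, g in groups.items()}  # Var(x | t)
    total_mean = sum(weights[t] * cond_mean[t] for t in groups)
    signal = sum(weights[t] * (cond_mean[t] - total_mean) ** 2 for t in groups)
    noise = sum(weights[t] * cond_var[t] for t in groups)
    return signal / noise

# Toy data: conditional means 1 and 5, conditional variance 1 in both groups.
print(snr([0, 2, 4, 6], [0, 0, 1, 1]))  # 4.0
```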

In

• The Learned MIA distinguisher is only slightly better than in

• The Soft Drop and Offline-Online distinguishers are the best distinguishers in this scenario, with a small advantage for the Soft Drop distinguisher.

• The Hard Drop distinguisher remains unsuccessful.

We notice that the Soft Drop Distinguisher has been established using the γ parameter defined in Equation (23) such that γ = 1 / q ˜ .

As a conclusion to this study on the STM32F4 discovery board, we have learned the following comparisons between the proposed distinguishers:

• When the profiling stage is poor, the best distinguisher is the Learned MIA Distinguisher;

• When there is enough data in the profiling stage, the best distinguisher is the Soft Drop Distinguisher, closely followed by the OOP Distinguisher;

• The Empty Bin Distinguisher converges to the optimal success rate, but is not as efficient as previously in Section 4. This can be explained by the fact that we skip a lot of data in the computation;

• The Hard Drop Distinguisher is the slowest to converge to 100% success rate.

Remark 4. When comparing

Remark 5. As discussed in Definition 4, the value of γ is important. We have run the same experiment as in

obtained, are presented in

Defined in Subsection 2.2, Empty Bins can appear under two circumstances. The first possibility is insufficient profiling: some rare occurrences are not encountered for lack of training measurements. The second possibility is what we call Structural Empty Bins. They are present whatever the amount of profiling under a fixed key and do not depend on the number of traces $\hat{q}$ in the profiling stage. In order to explain the origin of Empty Bins, we have plotted the number of empty bins for a given key as a function of the number of traces in the profiling stage.

$$\left| \left\{ x \in \left\{ \min_{q=1}^{\hat{q}} \hat{x}_q, \ldots, \max_{q=1}^{\hat{q}} \hat{x}_q \right\} \text{ such that } \exists q,\ \hat{x}_q = x \right\} \right|.$$

We can see that the number of empty bins decreases, but never reaches 0. At the beginning, the high number of empty bins is due to both poor profiling and structural empty bins. With a good profiling, we only keep the structural empty bins.

An interesting point noticed in Figures 10-12 is that the Learned MIA distinguisher is working better than the Soft Drop Distinguisher for a poor learning phase (i.e., q ^ = 25600 ). However, with a better learning phase (i.e., q ^ = 256000 and q ^ = 2560000 ), the Soft Drop Distinguisher has a much better success rate. In order to understand why the Learned MIA Distinguisher does not improve that much with a better learning phase, we have computed the Mean-Square Error of these two distinguishers for the three learning phases (i.e., q ^ ∈ { 25600,256000,2560000 } ).

Definition 9 (MSE, Bias and Variance). Let us consider a random variable $X$ and its expectation $\theta = E[X]$. An estimator of the random variable is noted $\hat{X}$. The MSE is defined as follows:

$$\mathrm{MSE} = E\left[ (\hat{X} - \theta)^2 \right].$$

The bias of the estimator is the expectation of the difference between the estimator and the mean of the random variable:

$$\mathrm{Bias} = E\left[ \hat{X} - \theta \right].$$

At last, the variance of the estimator is:

$$\mathrm{Variance} = E\left[ \hat{X}^2 \right] - E\left[ \hat{X} \right]^2 .$$

From these definitions, we have the following relation between MSE, bias and variance:

$$\mathrm{MSE} = \mathrm{Bias}^2 + \mathrm{Variance} \tag{31}$$
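For completeness, relation (31) can be checked with a short expansion. In this sketch, $\hat{X}$ denotes the estimator and $\theta$ the (deterministic) target value of Definition 9:

$$\begin{aligned}
\mathrm{MSE} &= E\left[ \left( (\hat{X} - E[\hat{X}]) + (E[\hat{X}] - \theta) \right)^2 \right] \\
&= E\left[ (\hat{X} - E[\hat{X}])^2 \right] + 2\,(E[\hat{X}] - \theta)\, E\left[ \hat{X} - E[\hat{X}] \right] + (E[\hat{X}] - \theta)^2 \\
&= \mathrm{Variance} + \mathrm{Bias}^2 ,
\end{aligned}$$

since $E[\hat{X} - E[\hat{X}]] = 0$ and $E[\hat{X}] - \theta = E[\hat{X} - \theta] = \mathrm{Bias}$.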

The Mean-Square Error (MSE) is computed using the following method:

1) For the secret key $k^*$, we calculate the value of the distinguisher, i.e., the value of $\hat{\mathbb{P}}(\tilde{x} \mid \tilde{t} \oplus k^*)$ for the Soft Drop and $I(\tilde{x}; \hat{\phi}(\tilde{t} \oplus k^*))$ for the Learned MIA. We compute this value for different numbers of traces $\tilde{q}$. This gives an estimation of the normalized distinguisher for the correct key.

2) The most accurate estimation is obtained for the highest value of $\tilde{q}$. Therefore, taking the average over a large number of experiments for this highest value of $\tilde{q}$ gives a good estimation of the expectation of the estimator.

3) Then we calculate, for every value of $\tilde{q}$, the bias and the variance of the estimator, and the average MSE is obtained using the formula: $\mathrm{MSE} = \mathrm{Bias}^2 + \mathrm{Variance}$.
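The three steps above can be sketched as follows. This is only an illustration under stated assumptions: `toy_distinguisher` is a hypothetical stand-in (a sample mean of Gaussian draws) for the normalized distinguisher value of the correct key, and the function names, trace counts and number of repetitions are not taken from the paper.

```python
import random
import statistics

def bias_variance_mse(estimates, reference):
    """Return (bias, variance, MSE) of repeated estimates w.r.t. a reference value."""
    mean_est = statistics.fmean(estimates)
    bias = mean_est - reference
    variance = statistics.pvariance(estimates, mu=mean_est)
    return bias, variance, bias ** 2 + variance

def toy_distinguisher(q, rng):
    """Hypothetical stand-in: distinguisher value estimated from q attack traces."""
    return statistics.fmean(rng.gauss(1.0, 2.0) for _ in range(q))

def mse_curve(q_values, n_experiments, rng):
    # Step 1: repeat the estimation for every number of traces.
    runs = {q: [toy_distinguisher(q, rng) for _ in range(n_experiments)]
            for q in q_values}
    # Step 2: the average at the largest q serves as the reference expectation.
    reference = statistics.fmean(runs[max(q_values)])
    # Step 3: bias, variance and MSE = Bias^2 + Variance for every q.
    return {q: bias_variance_mse(runs[q], reference) for q in q_values}

curve = mse_curve([10, 100, 1000], n_experiments=200, rng=random.Random(0))
for q, (bias, variance, mse) in sorted(curve.items()):
    print(q, bias, variance, mse)
```

By construction the bias at the largest number of traces is zero, because the reference is the average of that very set of experiments; a systematic offset that does not depend on the number of traces is therefore not visible with this choice of reference.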

We have plotted in

distinguishers. In order to be more relevant, we have plotted the logarithm of the MSE. Furthermore, we have chosen to plot the MSE separately as the distinguishers are not comparable.

The MSE of the Learned MIA Distinguisher stays almost constant as the learning phase improves, whereas the MSE of the Soft Drop Distinguisher becomes much smaller. This means that, for the Soft Drop Distinguisher, a better learning phase gives a much better estimator of the distinguisher.

To understand more deeply this MSE, we separate bias and variance for these two distinguishers. The results are computed in

We notice the following aspects:

• For the Soft Drop Distinguisher, the bias is almost equal to zero. In fact, the MSE is the variance.

• For the Learned MIA Distinguisher, it is mainly the opposite: the biggest part of the MSE is the bias.

To conclude with the MSE, the Soft Drop Distinguisher improves because the estimator has a much smaller variance with a better learning phase. Meanwhile, the Learned MIA Distinguisher does not improve because it is a biased estimator and a better learning phase does not reduce this bias.

The measurement setup used in simulation (Section 4) and on real-world traces (Section 5) is ideal. Indeed, the only noise considered is the so-called algorithmic noise, i.e., the timing variation arising from the parts of the algorithm not under study. In this section, we analyze the effect of noise external to the monitored cryptographic algorithm. Subsection 6.1 discusses in general terms the effect of noise addition, and Subsection 6.2 details quantitatively how distribution-based distinguishers cope efficiently with noise (while moment-based distinguishers fail to resist noise).

However, in practice, timing measurements contain a noisy part. Let us give three examples:

1) Measurement of the time difference between a request and the response from the AES (over a network of unknown latency);

2) Use of a side-channel signal (such as the power or the electromagnetic field) to observe the AES computation; the beginning and the end of an AES are easy to identify, as they consist of sixteen consecutive operations (namely the sixteen XORs making up the AddRoundKey operations). As these patterns have a remarkable signature, they can be extracted with great accuracy thanks to a mere cross-correlation. Still, the AES itself might not be executed in constant time, hence some alignment issues;

3) Use of a cache attack, which would disclose that the program flow entered and exited the AES function. However, the timing of cache accesses is non-deterministic.

Let us denote the variance of the added noise as $\sigma^2$.

Now, it is known that, for any additive distinguisher (which is the case of our distinguishers), the number of traces needed to recover the secret for a given success rate is proportional to the inverse of the signal-to-noise ratio (see e.g. Corollary 2 of [

As a direct consequence, we can predict the complexity of the attacks when IC and DC are disabled. It can be seen in

In addition, we can approximate the required number of traces to extract the key in presence of external noise of standard deviation σ . In our case-study of OpenSSL AES on ARM, the algorithmic noise has standard deviation about 20 clock cycles (see

So, if the external noise has standard deviation σ < 20 , the impact is small. But when σ / 20 > 1 , the influence of the external noise becomes preponderant. As the algorithmic noise and the external noise are independent, the number of traces required to extract the key will actually grow linearly with σ as soon as σ / 20 ≫ 1 .
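The two regimes can be made concrete with a small sketch. It relies only on the fact that the variances of independent noises add; the 20-cycle algorithmic standard deviation is the figure quoted above, while the function name and the sample values of σ are illustrative assumptions.

```python
import math

ALGORITHMIC_STD = 20.0  # clock cycles, figure quoted in the text

def total_noise_std(external_std, algorithmic_std=ALGORITHMIC_STD):
    """Standard deviation of the sum of two independent noises (variances add)."""
    return math.hypot(algorithmic_std, external_std)

for sigma in (0, 5, 20, 100, 1000):
    total = total_noise_std(sigma)
    # Ratio to the algorithmic-only case: close to 1 for small sigma,
    # close to sigma / 20 when the external noise dominates.
    print(sigma, total, total / ALGORITHMIC_STD)
```

For σ well below 20 the total stays close to the algorithmic noise alone, whereas for σ much larger than 20 it is essentially σ.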

In this subsection, we aim at comparing our distribution-based method with the existing methods (moment-based method mentioned in

represents the number of traces for the profiling phase while the y axis is the number of traces needed during the attack to reach an 80% success rate. We notice that the CPA performs better than the soft drop method, for any profiling (even when learning with several million traces). This can be due to a bias between the profiled distribution and the attack distribution.

However, in a practical case, we encounter noisy timing leakages. In order to compare our methods with the existing methods (such as CPA) in the presence of external noise, we plotted

$$\begin{cases} 0 \text{ added time} & \text{with probability } 50\%, \\ T \text{ added } (T \in \mathbb{N}, \text{ a number of clock periods}) & \text{with probability } 50\%. \end{cases} \tag{32}$$

This models the interruption of the CPU by a peripheral when AES runs bare metal, or a descheduling of the AES process during one time slot on systems with an operating system (OS). Indeed, when such events occur, they add a long period of time (often as long as, or even longer than, the duration of the AES) to the encryption time, so that the interruption can be served, or so that the OS re-schedules the AES process. We notice that, in such a case, it is more advantageous to use one of our methods rather than previously existing methods such as CPA. Indeed, distribution-based profiling is more accurate than CPA estimation with noisy signals. For instance, the results from Hassan Aly and Mohammed ElGayyar [^{22} encryptions are required for a key extraction on more recent processors (Pentium Dual-Core and Core 2 Duo), which is significantly more than that used by Bernstein CPA

in his original attack [
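The external noise model of Equation (32) can be simulated with a few lines. This is a minimal sketch: the clean timings (Gaussian, mean 1000 cycles, standard deviation 20) and the value T = 2000 are hypothetical choices made only for illustration, and `add_interrupt_noise` is not a function from the paper.

```python
import random
import statistics

def add_interrupt_noise(timings, T, rng, p=0.5):
    """Add T clock periods to each timing with probability p, and nothing otherwise."""
    return [t + (T if rng.random() < p else 0) for t in timings]

rng = random.Random(0)
# Hypothetical clean timings: only algorithmic noise around 1000 cycles.
clean = [round(rng.gauss(1000, 20)) for _ in range(10000)]
noisy = add_interrupt_noise(clean, T=2000, rng=rng)

# The added term has variance p * (1 - p) * T**2, which dominates
# the spread of the clean timings when T is large.
print("clean std:", statistics.pstdev(clean))
print("noisy std:", statistics.pstdev(noisy))
```

Under this model the noisy histogram is a mixture of two copies of the clean histogram separated by T, while its variance is inflated by p(1 − p)T². This is one way to picture why a distribution-based method may be less affected than a moment-based one.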

We have derived several “information-theoretic” distinguishers as possible solutions to the empty bin issue. Some of them, like the Dirichlet Prior and the Offline-Online distinguishers, required the computation of novel distributions. We have shown in particular that the empty bins, previously believed to be an annoyance and dropped accordingly, can turn out to be valuable assets for the attacker as long as they are treated carefully. Throughout the paper, real timing data are used, making the results very practical.

We have also compared the various distinguishers under two frameworks: a simulated test with synthetic leakage and real-world timing attacks. In both cases, we noticed that the outcome of the attacks depends on the quality of the profiling stage. Good profiling improves the results, in which case the best distinguisher seems to be the Soft Drop Distinguisher. Poor profiling makes the traditional distinguishers break down; more sophisticated solutions like the Offline-Online Profiling and Learned MIA distinguishers are very useful in this case. A possible way to investigate this aspect further is to use more powerful statistical tools in order to extract the most precise model for the Learned MIA Distinguisher.

The interesting aspect of the studied timing attack is that one does not have to make any assumption on the leakage model. In addition to this, the main advantage of the new distinguishers is that the empty bin issue is completely solved. We also introduced distinguishers which can jointly exploit offline and online side-channel measurements. As an interesting perspective, our approach could advantageously be analyzed using the “perceived information” metric recently introduced by Standaert et al. in ( [

Another perspective would be to compare our information-theoretic attacks with attacks based on machine learning techniques. Surprisingly and contrary to results reported in other papers, our preliminary results show that SCA based on support vector machines [

An interesting observation is that writing cryptographic code robust to timing attacks is challenging. While the OpenSSL code for AES has no obvious flaw (such as unbalanced branches which depend on sensitive data), the timing of AES is data-dependent, due to microarchitectural features of the studied ARM core. There seem to exist two classes of solutions against timing attacks: The first aims at randomizing the execution timing, as studied for instance in [

Part of this work has been funded by “Archi-Sec” (Micro-Architectural Security) 2019-2023 Project, within ANR AAP Générique 2019, and by BRAINE Project from European Union’s Horizon2020/ECSEL research and innovation program, under grant agreement N˚876967.

This paper is an extended version of a paper accepted at HASP workshop and presented at Seoul, Korea, on June 18, 2016, under the title “Template Attacks with Partial Profiles and Dirichlet Priors: Application to Timing Attacks”.

The authors declare no conflicts of interest regarding the publication of this paper.

De Chérisey, E., Guilley, S., Rioul, O. and Jayasinghe, D. (2021) Information Theoretic Distinguishers for Timing Attacks with Partial Profiles: Solving the Empty Bin Issue. Journal of Information Security, 12, 1-33. https://doi.org/10.4236/jis.2021.121001

We have copied here the OpenSSL C code for the encryption function. We notice that this is straight-line code, and that it uses lookup tables (the T-boxes), which may cause the non-constant execution time.