
The Advanced Encryption Standard (AES) is used extensively in many systems to provide security. However, many of its implementations are susceptible to Side Channel Attacks (SCA). To address this susceptibility, researchers have proposed a number of countermeasures, of which masking is one of the most commonly used. In this paper, we investigate the execution-time overhead of a first-order masking scheme for the AES encryption operation only. To do so, we first implemented the sequential AES algorithm on a single processor and then used OpenMP directives to reduce the overhead of the countermeasure. The sequential algorithm was then compared with its parallel implementations using 2, 4, 6, and 8 processors. We show that a parallel implementation of AES offers better performance while remaining flexible enough for cryptographic algorithms. Compared to the sequential unmasked AES implementation, the best first-order masking scheme using data parallelism achieves a speedup of around 5x when 8 threads are used.

AES (Advanced Encryption Standard), also referred to as the Rijndael algorithm, is a symmetric-key cryptographic algorithm. In 1998, the Rijndael algorithm was developed by two Belgian cryptographers, Joan Daemen and Vincent Rijmen. The cipher was later selected by the National Institute of Standards and Technology (NIST) [

AES has been widely employed in numerous embedded devices with the objective of enhancing security [

The rest of the paper is organized as follows. Section 2 describes the AES algorithm in detail. Section 3 defines and discusses side channel attacks. Section 4 presents masking as a countermeasure against side channel attacks. Our proposed parallelization schemes are explained in depth in Section 5. The experimental results for different numbers of threads are presented in Section 6. Finally, Section 7 concludes the paper.

In this section, the phases of the AES algorithm are explained as follows:

1) Key Expansion:

The round keys are derived from the encryption key using Rijndael’s key schedule. Note that AES requires a distinct 128-bit round key block for every round, plus one additional round key.

2) Initial Round:

At the beginning, an Add Round Key (pre-round transformation) is applied, with each byte of the state being combined with a block of the round key using bitwise XOR.

3) Rounds:

The number of rounds depends on the key length (10, 12, or 14 rounds for 128-, 192-, and 256-bit keys, respectively), and in every round the following transformations are applied:

a) Sub Bytes: a non-linear substitution phase in which every byte is replaced with an alternative one according to a lookup table.

b) Shift Rows: a transposition phase in which the last three rows of the state are shifted cyclically by a specific number of positions. Shift Rows moves the second row 1 byte to the left, the third row 2 bytes to the left, and the fourth row 3 bytes to the left.

c) Mix Columns: in this phase, the 4 bytes of every column of the state are combined by an invertible linear transformation. The Mix Columns function takes 4 bytes as input and outputs 4 bytes, with each input byte affecting all four output bytes. Together with Shift Rows, Mix Columns provides diffusion in the cipher. In this operation, every column is multiplied by a fixed matrix:

[ 2 3 1 1 ]
[ 1 2 3 1 ]
[ 1 1 2 3 ]
[ 3 1 1 2 ]

Matrix multiplication is composed of addition and multiplication of the entries, and in this case the multiplication is defined as follows: multiplication by 1 means no change, multiplication by 2 means a shift to the left by one bit, while multiplication by 3 means a shift to the left followed by an XOR with the initial unshifted value. After shifting, a conditional XOR with 0x1B must be performed if the shifted value exceeds 0xFF. These cases are special cases of the general multiplication in the Galois Field GF(2^8), in which addition is an XOR operation. More generally, every column is treated as a polynomial over GF(2^8) and is multiplied modulo x^4 + 1 by the fixed polynomial c(x) = 0x03·x^3 + x^2 + x + 0x02. The resulting coefficients are shown as the hexadecimal equivalent of the binary representation of bit polynomials from GF(2)[x]. The Mix Columns phase can also be viewed as multiplication by the specific Maximum Distance Separable (MDS) matrix shown above in the finite field GF(2^8). This procedure is described further in the literature on Rijndael’s MixColumns.

d) Add Round Key:

In this phase, a subkey is combined with the state. For each round, a subkey is derived from the main key using Rijndael’s key schedule, with every subkey being the same size as the state. The subkey is applied by combining each byte of the state with the corresponding byte of the subkey using bitwise XOR.

4) Final Round (no MixColumns)

a) Sub-Bytes.

b) Shift Rows.

c) Add Round Key.

In this section, the main idea of Side Channel Attacks (SCA) is explained.

SCA comprises attacks that are based on side channel information [

Simple Power Analysis (SPA) is a side-channel attack that involves visual analysis of graphs of the current drawn by a device over a given period of time. Differences in power usage occur as the device executes different operations. For instance, different instructions performed by a microprocessor have different power-consumption profiles. Correspondingly, in a power trace from a smart-card execution of an AES encryption, the ten rounds can be seen clearly, as shown in

Timing attacks are based on measuring the time a device requires to perform operations. This information can reveal data related to the secret keys. For instance, by carefully measuring the time needed to perform private-key operations, an attacker may recover fixed Diffie-Hellman exponents, factor Rivest-Shamir-Adleman (RSA) keys, and break other cryptosystems. If a device is vulnerable, the attack is typically simple and often requires only known ciphertext.

Similarly, the squaring and multiplication operations in RSA implementations can often be distinguished, enabling an attacker to recover the secret key. Even in cases where the magnitude of the variations in power usage is small, ordinary digital oscilloscopes can easily reveal the data-induced differences. Frequency filters and averaging functions are regularly used to filter out high-frequency noise.

Differential Power Analysis (DPA) is a side-channel attack that statistically analyzes power-consumption measurements from a cryptosystem [

In this section, we discuss the masking scheme applied in this work to AES with a 128-bit block size and a 128-bit key (10 rounds). Random values “X” are generated at the beginning of the algorithm and XORed with the plaintext “A”. Subsequently, the masked value (A ⊕ X) is XORed with the cipher key “K”. The AES operations modify the random (mask) values during the AES transformations; hence, the mask values are corrected after each AES operation. In our implementation, during the execution of the 10 rounds, the mask values are updated in the Sub-Byte and Mix-Column steps to provide more security. New mask values are also generated for each new plaintext block. AddRoundKey is XORed with the plaintext to form the input of each round. However, Mix-Column is omitted and the mask values are removed in the last round (round 10) to form the ciphertext, as illustrated in

· Rectangular shape: Represents the input and output state after each operation.

· A: Represents the plain text.

· X: Represents the mask value.

· K: Represents the Key.

· X_{i2}^{1}: Represents the linear transformation of the mask (“X_{1}”) after the Sub-Byte step; simultaneously, the mask values are updated a second time with the new mask value (“X_{i2}”).

· X_{i2}^{2}: Represents the linear transformation of the mask after the Shift-Row step (“X_{2}”); here the mask is not updated.

· X_{i3}^{3}: Represents the linear transformation of the mask after the Mix-Column step (“X_{3}”); simultaneously, the mask values are updated a third time with the new mask value (“X_{i3}”).

· Symbols “B”, “S”, “T”, and “AK”: Denote the output state after the Sub-Byte, Shift-Row, Mix-Column, and AddRoundKey steps, respectively.

· E: Represents the cipher text.

S-Box implementation:

Here, we justify why we chose the Sub-Byte function as the best candidate for parallelism. The Sub-Byte step in the basic AES algorithm replaces each byte in the state matrix with its corresponding look-up value. However, first-order masking requires an additional 16 bytes for the random values, as shown in

The AES S-box is defined by a multiplicative inverse x^{-1} followed by an affine transformation. To obtain the inverse, each byte is raised to the power 254, since x^{254}·x = x^{255} = 1 in the Galois Field and hence x^{254} = x^{-1}. Once the inverse has been obtained, the affine transformation is applied by multiplying the multiplicative inverse by a fixed matrix and then adding 0x63, as follows:

[ b7 ]   [ 1 1 1 1 1 0 0 0 ]   [ a7 ]   [ 0 ]
[ b6 ]   [ 0 1 1 1 1 1 0 0 ]   [ a6 ]   [ 1 ]
[ b5 ]   [ 0 0 1 1 1 1 1 0 ]   [ a5 ]   [ 1 ]
[ b4 ] = [ 0 0 0 1 1 1 1 1 ] × [ a4 ] ⊕ [ 0 ]
[ b3 ]   [ 1 0 0 0 1 1 1 1 ]   [ a3 ]   [ 0 ]
[ b2 ]   [ 1 1 0 0 0 1 1 1 ]   [ a2 ]   [ 0 ]
[ b1 ]   [ 1 1 1 0 0 0 1 1 ]   [ a1 ]   [ 1 ]
[ b0 ]   [ 1 1 1 1 0 0 0 1 ]   [ a0 ]   [ 1 ]

The most difficult part of designing a masking scheme is the Sub-Byte step, because it is a non-linear transformation. On the other hand, the multiplicative inverse is the only part that has to be modified. Thus, we used the proposed method (a secure inverse algorithm) to generate the multiplicative inverse of a given number in GF(2^8). It is important to note that the mask values are updated during this computation. There is therefore a large amount of computation in the Sub-Byte step, making it a good candidate for parallelization.

In this section, we present the two methods of parallelism used in this work. The first method divides the data so that blocks run independently, and the second applies parallelism specifically within the Sub-Byte function. The detailed explanation is presented below:

In data parallelism, the data is divided into parts that are assigned to different threads for execution. Every thread executes the same procedure or function, but on a different data set. The method is quite effective when there is a large amount of data to process. AES can employ data parallelism in the following way: in cryptography, the best parallelizable mode of operation for running the entire encryption algorithm with a block cipher is the Electronic Code Book (ECB) mode. The complete plaintext is divided into blocks of a fixed length that can be processed individually. Every block of plaintext is encrypted with the same key as a unit and transformed into a ciphertext block. Each thread takes 16 bytes at a time and executes the AES encryption independently. When the threads finish, a barrier ensures that the writes to the ciphertext follow the order of the plaintext.

Here everything is executed sequentially except the Sub-Byte step. As mentioned above, the Sub-Byte stage takes the most time, since it must generate the Sub-Byte value. In the Sub-Byte operation, instead of a single thread computing the multiplicative inverse and the affine transformation alone for each byte in the 4 × 4 state matrix, we assign different threads to different bytes. Since each byte can be computed independently, with no dependencies, a parallel paradigm can be applied at this stage. Because the state is a two-dimensional array, we used a nested for-loop to reach each byte, and we parallelized the nested for-loop with OpenMP to increase performance.

The AES algorithm has been successfully parallelized using OpenMP API directives and compiled with GCC on Linux. The system used has the following specifications: an Intel Core i7-3770 CPU at 3.40 GHz, 8 GB of RAM, and a 6 MB cache. We implemented the sequential version of the code and then worked on the parallelization. We carried out the experiments with different numbers of threads on file sizes varying from 500 KB to 9 MB in steps of 500 KB.

The last experiment was applied to an 18 MB file with 2, 4, 6, and 8 threads in order to observe the behavior on large data sets. However, our machine has limited resources, supporting up to 8 threads. Executing the sequential algorithm on the 18 MB file took 616.12 seconds.

| Number of Processors | Execution Time (s) | Speed-up |
|---|---|---|
| 2 | 388.6366678 | 1.585417227 |
| 4 | 247.4803907 | 2.489697332 |
| 6 | 249.1675771 | 2.472838865 |
| 8 | 208.2105133 | 2.959270685 |

| Number of Processors | Execution Time (s) | Speed-up |
|---|---|---|
| 2 | 319.1093511 | 1.930846796 |
| 4 | 171.0243877 | 3.60270998 |
| 6 | 158.656005 | 3.883567271 |
| 8 | 120.9055063 | 5.096139019 |

The best speedup reaches 5.09x when 8 threads are used. However, considering the number of threads used and their utilization, 2 threads give the best per-thread utilization: in data parallelism, the speedup reaches 1.9x with 2 threads, close to the ideal 2x, whereas task parallelism reaches 1.5x.

As can be seen from

This paper has explained in detail the parallelization of the AES algorithm. AES handles 128 bits (one block) of input at a time. Each block goes through several operations, such as Sub-Byte, Shift-Row, and Mix-Column, and each operation depends on the previous one. The AES algorithm has been parallelized using two methods, namely data and Sub-Byte parallelization. Each 128-bit block of the plaintext to be encrypted is independent. Thus, the first phase applies the single-instruction-multiple-data (SIMD) approach, since each block is independent of the others. Since we use a masking technique, we observed that the Sub-Byte operation dominates the cost overhead. We therefore decided to parallelize the Sub-Byte operation, because it requires a multiplicative inverse and an affine transformation for each byte, which consumes considerable computation time. OpenMP directives were employed to parallelize the code and address the overhead caused by masking, with up to 8 threads used to run multiple experiments. Finally, since our work focused mainly on the Sub-Byte function and we had a limited number of threads, in future work we will consider using a Field-Programmable Gate Array (FPGA) to exploit the hardware by performing many computations in parallel, from which we expect improvements in both speed and security.

The author declares no conflicts of interest regarding the publication of this paper.

Alhumyani, H. (2019) AES Overhead Mitigation Using OpenMP. Journal of Computer and Communications, 7, 206-218. https://doi.org/10.4236/jcc.2019.77017