Simultaneous Hashing of Multiple Messages



Introduction
The performance of hash functions is important in various situations and platforms. One example is a server workload: authenticated encryption in SSL/TLS sessions, where hash functions are used for authentication, in HMAC mode. This is one reason why the performance of SHA-256 on modern x86_64 architectures was defined as a baseline for the SHA3 competition [1].
Traditionally, the performance of hash functions is measured by hashing a single message (of some length) on a target platform. For example, consider the 2nd Generation Intel® Core™ Processors. The OpenSSL (1.0.1) implementation hashes a single buffer (of length 8 KB) at 17.55 Cycles per Byte (C/B hereafter). Recently, [2] improved the performance of SHA-256 with an algorithm that parallelizes the message schedule, and the use of SIMD architectures, moving the performance baseline to 11.47 C/B (code version from April 2012 is available from [3], and will be updated soon) on the modern processors, when hashing from the cache.
In this paper, we investigate the possibility of accelerating SHA-256 for some scenarios, and are interested in optimizing the following computation: hashing a number (k) of independent messages, to produce k different digests. We investigate the advantage of SIMD architectures for these parallelizable computations. Such workloads appear, for example, during the boot process of an operating system, where it checks the integrity of its components (see [4] for example). This involves computing multiple hashes, and comparing them to expected values. Another situation that involves hashing of multiple independent messages is data de-duplication, where large amounts of data are scanned (typically in chunks of fixed sizes) in order to identify duplicates [5]. In these two scenarios, the data typically reside on the hard disk, but hashing multiple independent messages could also emerge in situations where the data is in the cache/memory.
A SIMD based implementation of hash algorithms was first proposed (in 2004) and described in detail by Aciiçmez [6]. He studied the computations of SHA-1, SHA-256 and SHA-512, and his investigation was carried out on Intel® Pentium™ 4, using SSE2 instructions. Two approaches for gaining performance were attempted: 1) Using SIMD instructions to parallelize some of the computations of the message schedule of these hash algorithms, when hashing a single message (see also later works (on SHA-1) along these lines, in [7, 8]); 2) Using SIMD instructions to parallelize hash computations of several independent messages. Aciiçmez reports that he could not improve the performance of hashing a single buffer, using the SIMD instructions (while this could not be done on the Pentium 4, we speculate that it would be possible on more recent architectures). However, he reports speedup by a factor of 1.71x for simultaneous hashing of four buffers, with SHA-256 (speedup by a factor of 2.3x for SHA-512 is also reported, but it is less interesting in our context, because the comparison baseline was a (slow) 32-bit implementation).
In this paper we expand the study conducted by Aciiçmez, by demonstrating the performance of Simultaneous Hashing of multiple independent messages, on contemporary processors. We detail a method for a "Simultaneous Update" that facilitates hashing of independent messages of arbitrary sizes. To account for different usages, we investigate the performance of hashing multiple messages (of variable sizes) from different levels of the cache hierarchy, from system memory, and from the hard drive.

Preliminaries and Notations
The detailed definition of SHA-256 can be found in the FIPS 180-2 publication [9]. Schematically, the computational flow of SHA-256 can be viewed as follows: "Init" (setting the initial values), a sequence of "Update" steps (each compressing a 64-byte block of the message, and updating the digest value), and a "Finalize" step (which takes care of the message padding). The padding requires either one or two calls to the Update function, depending on the message's length (see more details in [2]). For SHA-256, the performance is almost linearly proportional to the number (N) of Update function calls. For a message of length bytes, the value of N is:

$$
N = \begin{cases}
\lfloor \mathrm{length}/64 \rfloor + 2, & \text{if } \mathrm{length} \bmod 64 \ge 56 \\
\lfloor \mathrm{length}/64 \rfloor + 1, & \text{else}
\end{cases}
$$

For sufficiently long messages, we can approximate N ~ floor(length/64). For example, this approximation for a 4 KB message gives floor(length/64) = 64, while actual hashing of a 4 KB message requires 65 Update function calls (i.e., a ~1.5% deviation).
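The count of Update calls can be checked with a few lines of Python. This is a minimal sketch rather than code from the paper: the helper name update_calls is hypothetical, and the threshold of 56 is taken from the SHA-256 padding rule (one marker byte plus an 8-byte length field must fit in the last 64-byte block).

```python
BLOCK_BYTES = 64  # bytes compressed by one SHA-256 Update call


def update_calls(length: int) -> int:
    """Number of Update calls needed to hash a message of `length` bytes."""
    full_blocks, tail = divmod(length, BLOCK_BYTES)
    # Padding needs 1 marker byte plus 8 length bytes. A tail of 56 bytes or
    # more leaves no room for them, so the padding spills into a second block.
    return full_blocks + (2 if tail >= 56 else 1)


if __name__ == "__main__":
    n = update_calls(4096)
    approx = 4096 // BLOCK_BYTES
    print(n, approx, f"{(n - approx) / n:.1%}")
```

For a 4 KB message the helper returns 65, one more than the floor(length/64) approximation.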

Simultaneous Hashing (S-HASH) of Multiple Messages
SIMD architectures [10] are designed to execute, in parallel, the same operations on several independent chunks of data (called "elements"). Modern architectures have variants of SIMD instructions that operate on elements of sizes 1, 2, 4, or 8 bytes. By the nature of the algorithms, SHA-256 (and SHA-1) requires operations on 4-byte elements, while SHA-512 requires operations on 8-byte elements.
Figure 1 describes the Simultaneous Hashing algorithm (S-HASH) that hashes k messages and generates k digests, with some hash function. Suppose that the implemented hash function operates on t-bit "words" (elements), and that the architecture has s-bit SIMD registers. Then, the number of words that fit into a SIMD register is m = s/t, which we assume to be an integer. We also assume that k > m. Algorithm 1 starts with the Initialize step for the first m buffers. Then, it invokes the "Simultaneous Update" function (for the specific hash function) every time there are m blocks ready for processing. This is repeated until the shortest buffer (from the m processed buffers) is fully consumed. At this point, a padding block is fed to the Simultaneous Update function, to "Finalize" (that buffer). If the hash is already finalized, a block from a new buffer is fed (after the proper "Init").

Algorithm 1: Simultaneous Hashing (S-HASH)
Input:
Buffers - a list with pointers to k buffers to be hashed.
Lengths - a list with the lengths (in bytes) of the k buffers.
Hashes - a list with pointers to store the k generated hash values.
Notations:
The number of t-bit "words" (elements) that fit in a register is m (for SHA-256, t = 32, and with AVX, m = 128/32 = 4). It is assumed that k > m. The number of bytes hashed by one "Update" operation is denoted by p.
Output: k hash values of the k buffers, stored at the memory locations pointed to by Hashes.
Flow: If unfinished buffers still remain, finish hashing serially.
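The scheduling described above can be modelled without any SIMD code. The following Python sketch is an illustration under stated assumptions, not the authors' implementation: it tracks only how many 64-byte blocks (data plus padding) each lane still owes, counts the Simultaneous Update calls, and counts the Update calls left for the serial tail. The names simulate_s_hash and update_calls are hypothetical, and the sketch accepts k >= m.

```python
from collections import deque

BLOCK_BYTES = 64  # p in the text: bytes hashed by one SHA-256 Update


def update_calls(length):
    """Update calls (data blocks plus padding) for a buffer of `length` bytes."""
    return length // BLOCK_BYTES + (2 if length % BLOCK_BYTES >= 56 else 1)


def simulate_s_hash(lengths, m=4):
    """Return (simultaneous_updates, serial_updates) for the given buffers.

    Models the scheduling only: each lane holds the number of blocks
    (including padding) still owed by the buffer assigned to it.
    """
    if len(lengths) < m:
        raise ValueError("the sketch assumes at least m buffers")
    pending = deque(update_calls(n) for n in lengths)
    lanes = [pending.popleft() for _ in range(m)]  # "Init" the first m buffers
    simultaneous = 0
    while True:
        # A lane whose buffer is finalized takes the next waiting buffer.
        for i in range(m):
            if lanes[i] == 0 and pending:
                lanes[i] = pending.popleft()
        if 0 in lanes:
            break  # fewer than m unfinished buffers remain
        simultaneous += 1  # one Simultaneous Update consumes a block per lane
        lanes = [remaining - 1 for remaining in lanes]
    serial = sum(lanes)  # unfinished buffers are completed serially
    return simultaneous, serial


if __name__ == "__main__":
    print(simulate_s_hash([4096] * 8))
    print(simulate_s_hash([64, 64, 64, 640, 64]))
```

In this model, eight equal 4 KB buffers keep all four lanes busy until the end, whereas a mix of short and long buffers leaves a serial tail once the short ones run out.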
The near-future AVX2 architecture [11] has integer instructions that operate on 256-bit registers. This allows for doubling the number of independent messages that can be hashed in parallel and would lead to, for example, 8-buffers SHA-256 S-HASH or 4-buffers SHA-512 S-HASH.

Results
This section describes the 4-buffers SHA-256 S-HASH results.

The System's Characteristics
The system that was used for generating the reported measurements had the following characteristics:
• An Intel® Core™ i5-2500 processor (2nd Generation Intel® Core™ Processor; sometimes referred to as Architecture Codename "Sandy Bridge").
• 8 GB RAM (DDR3 1600, 2 channels).
• A RAID0 array of two Intel® SSD 320 drives, each one of 80 GB, with a combined throughput of 400 MB/sec (indicated by "hdparm -t" [12]).
• Fedora 16 OS.
All the runs were carried out on a system where the Intel® Turbo Boost Technology, the Intel® Hyper-Threading Technology, and the Enhanced Intel SpeedStep® Technology were disabled.
All of the performance numbers reported here were obtained on the same system, run on the same processor, and under the same conditions. In particular, we point out that all of the reported hash computations include the overhead of the proper padding, as required by the SHA-256 definition [9].
The tested codes were written in assembly language, so their performance is compiler agnostic. The impact of the operating system is relevant only for hashing files from the hard disk, because some system calls (to access files/directories) are involved. However, we suggest that experiments with other operating systems would show the same performance traits that we report here.

Simultaneous Hashing of Multiple 4 KB Buffers, from Different Cache Levels and Main Memory
For profiling the performance of the 4-buffers SHA-256 S-HASH, we wrote a new implementation which processes four buffers in parallel. In order to estimate the advantage of the parallelization, we compare the resulting performance to serial implementations that hash the same amount of data.
To measure the performance of hashing data that resides in different cache levels, or in memory, we note that the processor has [13]: 1) a First Level Data Cache of 32 KB (per core); 2) a Second Level Cache of 256 KB (per core); 3) a Last Level Cache of 6 MB (shared among all the cores). Therefore,
• For data that resides in the First Level Cache, we hashed a total of 16 KB of data, split to 4 chunks of 4 KB each.
• For data that resides in the Second Level Cache, we hashed a total of 256 KB of data, split to 64 chunks of 4 KB each.
• For data that resides in the Last Level Cache, we hashed a total of 2 MB of data, split to 512 chunks of 4 KB each.
• For data that resides in the main memory, we hashed a total of 32 MB of data, split to 8192 chunks of 4 KB each.
Prior to the actual measurements, we ran the hash, in a loop, 500 times, in order to make sure that our data resides in the desired cache level (or memory).
For comparison, we used the OpenSSL (version 1.0.1) SHA-256 (serial) [14] implementation, and the faster implementation, based on the n-SMS method [2] (a version from April 2012 can be retrieved from [3]; an update will be posted soon).
The results, illustrated in Figure 2, show that hashing from all three cache levels can be performed at roughly the same performance, and there is only some small performance degradation when the data is hashed from the main memory. The 4-buffers SHA-256 S-HASH method is 3.42x faster than OpenSSL (1.0.1), and 2.24x faster than the n-SMS method.

Simultaneous Hashing of Files from the Hard-Drive
The following results account for the performance of hashing from the disk. The numbers were obtained using the following methodology.
Copyright © 2012 SciRes. JIS
For the experiments, we prepared two directories with different combinations of files. The first directory (DIVERSE hereafter) contained 350 files occupying 79 MB (82,833,132 bytes) in total. The file sizes range from 3 bytes to 7.18 MB (7,533,568 bytes), with an average size of 0.22 MB (236,666 bytes). The detailed size distribution of the files is provided in Table 1 in the Appendix. The second directory (UNIFORM hereafter) contained 8 (large) files of equal size, each one of 17.76 MB (18,623,835 bytes). For each directory, we prepared, in advance, the list of its files.
To measure the performance of hashing from the hard drive, we flushed the OS "pagecache", "dentries" and "inodes" caches before the measurements were taken (using the Linux directive echo 3 > /proc/sys/vm/drop_caches [15]).
We measured the following operations: scanning the list (in the prescribed order), opening the files in the list, reading the size of each file, mapping the files to memory, calculating the SHA-256 values, and storing them in the appropriate locations.
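The measured sequence of operations can be illustrated by a serial reference loop. The sketch below is an assumption-laden illustration in Python, not the assembly code that was benchmarked: it uses the standard hashlib and mmap modules, the function name hash_file_list is hypothetical, and it hashes one file at a time rather than four in parallel.

```python
import hashlib
import mmap
import os


def hash_file_list(paths):
    """Hash every file in `paths` (in the given order) and return hex digests."""
    digests = []
    for path in paths:                            # scan the list in order
        with open(path, "rb") as f:               # open the file
            size = os.fstat(f.fileno()).st_size   # read its size
            if size == 0:
                # An empty file cannot be memory-mapped; hash the empty message.
                digests.append(hashlib.sha256(b"").hexdigest())
                continue
            # Map the whole file read-only and hash the mapped bytes.
            with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as view:
                digests.append(hashlib.sha256(view).hexdigest())
    return digests
```

An S-HASH variant would keep several such mappings open at once and feed one block from each to the Simultaneous Update function.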
Figure 3, top panel, provides the performance for the DIVERSE directory in C/B (which is a frequency-agnostic metric). The performance is shown for several processor frequencies, to demonstrate how the hard drive's throughput limits the overall observed performance. The figure shows that at the native processor speed (3.3 GHz), the S-HASH method outperforms the OpenSSL (1.0.1) implementation by a factor of 1.73x. When the processor is down-clocked to 1.6 GHz, all three implementations improve their C/B count, but the S-HASH improves by a larger margin, becoming 2.16x faster than OpenSSL. The bottom panel of Figure 3 shows the same performance, measured in MB/sec. It is interesting to observe that although the frequency of the processor is reduced by a factor of two, from 3.3 GHz to 1.6 GHz, the S-HASH throughput is reduced only by a factor of 1.28x.
Figure 4 illustrates the performance for the UNIFORM directory. In this scenario, the performance of OpenSSL and of the n-SMS method is not limited by the hard drive: reducing the frequency does not improve their speed in C/B. On the other hand, the faster 4-buffers SHA-256 S-HASH implementation is affected by the hard drives. It improves (in C/B) when the frequency is reduced, although not as much as it does in the DIVERSE test. The figure shows that the 4-buffers S-HASH is 2.86x faster than OpenSSL when the processor is clocked at 1.6 GHz, and 2.26x faster at the native processor frequency.
In general, all implementations improve when the hashed files are large. The reasons are that the overheads for opening files are reduced, and the reads from the hard drive are sequential. In addition, the S-HASH is faster when the processed files have equal lengths (UNIFORM directory). This happens because the computations for all the four buffers terminate concurrently, allowing four new buffers to be scheduled together. By contrast, in the DIVERSE directory, when a certain buffer is consumed, operations on the remaining buffers are stopped until a new buffer is scheduled.

Conclusions
We illustrated the general S-HASH approach, and demonstrated the advantage of a 4-buffers SHA-256 S-HASH, running on the AVX architecture. The speedups we observe depend on the location of the data, but are significant in all cases. When hashing equal length messages from any of the three levels of the processor's cache, or from main memory, the 4-buffers SHA-256 S-HASH performs at ~5.2 C/B. This is ~2.24x faster than the best known serial hashing implementation. When hashing data from the hard disk, the CPU performance is not the (only) limiting factor, because the disk's read performance becomes a bottleneck. Here, the 4-buffers S-HASH method executes at effectively 8.65 C/B at the native processor speed, 3.3 GHz. This performance is 2.26x faster than OpenSSL (1.0.1) and 1.67x faster than the n-SMS method [2] under the same conditions (19.55 C/B and 14.45 C/B, respectively).
We mentioned above two scenarios that require hashing of multiple messages, and can enjoy an S-HASH implementation: an OS check of the integrity of its components (during boot time), and data de-duplication. In addition, SSL/TLS servers that need to support multiple connections could also take advantage of an S-HASH implementation, if their software is set to process data from multiple connections in parallel. We suggest that the potential performance gain might be worth the hassle of tweaking the software to accommodate such parallelization.
Since the 4-buffers S-HASH operates on 4 buffers in parallel, one might wonder why it does not achieve the theoretical four-fold speedup factor, compared to the alternative implementation. We mention here two of the reasons: 1) The 2nd Generation Core™ Processors have an efficient ALU that can process data at a faster rate than the SIMD unit. This closes some of the theoretical four-fold gap that AVX can offer; 2) The SHA-256 algorithm has a significant amount of rotations. Compared to a single ALU instruction (ROR), the S-HASH method needs to implement rotation by a flow of two (SIMD) shifts, followed by a (SIMD) xor.
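The rotation cost mentioned in the second reason can be illustrated by emulating four 32-bit lanes in Python. This is only a sketch of the shift/shift/xor pattern: the helper names are hypothetical, plain lists stand in for a SIMD register, and no actual AVX instructions are issued.

```python
MASK32 = 0xFFFFFFFF


def rotr32(x, n):
    """Reference 32-bit right rotation (what a single ROR instruction does)."""
    return ((x >> n) | (x << (32 - n))) & MASK32


def simd_rotr32(lanes, n):
    """Rotate every 32-bit lane right by n using two shifts and one xor."""
    right = [x >> n for x in lanes]                   # lane-wise logical shift right
    left = [(x << (32 - n)) & MASK32 for x in lanes]  # lane-wise shift left
    # The two shifted values have no overlapping bits, so xor acts like or.
    return [r ^ l for r, l in zip(right, left)]


if __name__ == "__main__":
    state = [0x6A09E667, 0xBB67AE85, 0x3C6EF372, 0xA54FF53A]
    print([hex(v) for v in simd_rotr32(state, 13)])
```

Each lane-wise rotation therefore costs three operations where the scalar code spends one, which eats into the four-fold gain from processing four lanes at once.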
Hashing from a hard drive introduces a different consideration. The RAID array (of two Solid State Drives) that we used in our experiments had a throughput of 400 MB/sec. At 3.3 GHz, this throughput is equivalent to processing at the rate of 7.15 C/B. This explains the results that we obtained: while the processor can hash data at 5.18 C/B with the 4-buffers S-HASH method if the data is read from the cache (or memory), this performance cannot be reached when the data is fetched from the disk. This is why we get only 8.65 C/B (for the UNIFORM case), but as already noted, this is still significantly faster than the serial alternative. When the processor is clocked at 1.6 GHz, the same disk throughput becomes equivalent to processing at the rate of 3.81 C/B. Thus, on the under-clocked system, we were able to hash at 6.73 C/B, which is closer (only 1.31x slower) to the processor's hashing capability (5.18 C/B). The remaining gap between the system-wise performance and the maximal processing capability can be attributed to OS overheads, and to the fact that access to the data stored on the disk is non-sequential (it is distributed between four areas).
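The conversion between disk throughput and an equivalent C/B rate is a single division. The sketch below is illustrative only: the function names are hypothetical, and it assumes that "MB" means 2**20 bytes, which is an interpretation rather than something the text states.

```python
MIB = 2 ** 20  # assumption: one "MB" in the text is read as 2**20 bytes


def disk_bound_cycles_per_byte(freq_hz, throughput_bytes_per_sec):
    """Cycles that elapse per byte delivered by the disk at a given clock."""
    return freq_hz / throughput_bytes_per_sec


def throughput_mb_per_sec(freq_hz, cycles_per_byte):
    """Hashing throughput (in MB/sec) for a measured C/B at a given clock."""
    return freq_hz / cycles_per_byte / MIB


if __name__ == "__main__":
    disk = 400 * MIB
    for ghz in (1.6, 3.3):
        print(ghz, round(disk_bound_cycles_per_byte(ghz * 1e9, disk), 2))
    print(round(throughput_mb_per_sec(3.3e9, 8.65)))
```

Under this reading, 1.6 GHz gives about 3.81 C/B, in line with the text. The same formula at 3.3 GHz gives about 7.87 C/B rather than 7.15 C/B, so the figure quoted for the native frequency may rest on a different clock or throughput value; either way it stays below the measured 8.65 C/B.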
The soon to be released Haswell architecture [11] will support AVX2 with integer instructions that operate on 256-bit registers. With this architecture, we could upgrade our method to implement 8-buffers S-HASH efficiently, in theory doubling the performance of the 4-buffers S-HASH. However, for hashing data from the disk, we note that the SSD drives are not expected to double their throughput (at least in this time frame), so we should expect less than a twofold speedup.
Note that we intentionally did not study an S-HASH implementation of SHA-512. The reason is that SHA-512 operates on 64-bit "words", and therefore, the current AVX architecture can support only a 2-buffers SHA-512 S-HASH. This makes the S-HASH method less attractive because 1) The SHA-512 ALU implementations are already fast with the n-SMS method (8.72 C/B); 2) While each SHA-512 Update compresses 128 bytes of the message and a SHA-256 Update compresses only 64 bytes, SHA-512 involves 1.25x more rounds in the processing than SHA-256 (80 rounds versus 64). We therefore speculate that SHA-512 S-HASH implementations would become useful only on the AVX2 architectures (doing a 4-buffers S-HASH), but will be slower than 8-buffers SHA-256 S-HASH on that architecture.
We conclude this study by stating that our results show that for some usages, SHA-256 is significantly faster than commonly perceived.
Finally, we add a few related remarks on the five SHA3 finalists [1]. Skein and Keccak use 64-bit words, and the remark we made on SHA-512 holds similarly. JH, Blake and Grostl already use SIMD instructions in their better performing implementations. Therefore, applying the S-HASH method to these algorithms would create a delicate tradeoff between the S-HASH and the benefits of their current use of the SIMD instructions. Such optimization would be an interesting study to carry out.

Figure 2. SHA-256 hashing from different cache levels and memory, Intel® Core™ i5-2500 (Architecture Codename Sandy Bridge). The performance of the 4-buffers SHA-256 S-HASH is compared to the (standard) serial hashing with the OpenSSL 1.0.1 implementation, and to the n-SMS method (see explanation in the text).

Figure 3. Hashing the files in the directory DIVERSE (see explanation in the text). Measurements are taken on the Core i5-2500, operating at different CPU frequencies. Panel a shows the performance in Cycles per Byte. Panel b shows the performance in MB/sec.

Figure 4. Hashing the files in the directory UNIFORM (see explanation in the text). Measurements are taken on the Core i5-2500 operating at different CPU frequencies. Panel a shows the performance in Cycles per Byte. Panel b shows the performance in MB/sec.