Parallelized Hashing via j -Lanes and j -Pointers Tree Modes, with Applications to SHA-256

j -lanes tree hashing is a tree mode that splits an input message into j slices, computes j independent digests of each slice, and outputs the hash value of their concatenation. j -pointers tree hashing is a similar tree mode that receives, as input, j pointers to j messages (or slices of a single mes-sage), computes their digests and outputs the hash value of their concatenation. Such modes ex-pose parallelization opportunities in a hashing process that is otherwise serial by nature. As a re-sult, they have a performance advantage on modern processor architectures. This paper provides precise specifications for these hashing modes, proposes appropriate IVs, and demonstrates their performance on the latest processors. Our hope is that it would be useful for standardization of these modes.


Introduction
This paper expands upon the j-lanes tree hashing mode which was proposed in [1]. It provides specifications, enhancements, and an updated performance analysis. The purpose is to suggest such modes for standardization. Although the specification is general, we focus on j-lanes tree hashing with SHA-256 [2] as the underlying hash function.
The future AVX512f instructions set [3] [5] supports 512-bit registers, ready for operating on 16 lanes. It also adds a few useful instructions that would increase the parallelized hashing performance: rotation (vprold) and ternary-logic operation (vpternlogd). The (vpternlogd) instruction allows software to use a single instruction for implementing logical functions such as Majority and Choose, which SHA-256 (and other hash algorithms) use. Rotation (vprold) can perform the SHA-256 rotations faster than the vpsrld + vpslld + vpxor combination.

Preliminaries
Hereafter, we focus on hash functions (HASH) that use the Merkle-Damgård construction (SHA-256, SHA-512, SHA-1 are particular examples). Other constructions can be handled similarly. Suppose that HASH produces a digest of d bits, from an input message M whose length is length (M). The hashing process starts from an initial state, of size i bits, called an Initialization Vector (denoted HashIV). The message is first padded with a fixed string plus the encoded length of the message. The resulting (padded) message is then viewed and processed as the concatenation M||padding = m 0 ||m 1 ||…||m k−1 of k consecutive fixed size blocks m 0 m 1 ...m k−1 .
The output digest is computed by an iterative invocation of a compression function compress (H, BLOCK). The inputs to the compression function are a chaining variable (H) of i bits, and a block (BLOCK) of b bits. Its output is an i-bit value that can be used as the input to the next iteration. The output digest (of HASH) is f(H k−1 ). We call an invocation of the compression function an "Update" (because it updates the chaining variable).

The j-Lanes Tree Hash
The j-lanes tree hash is defined in the context of the underlying hash function HASH, and j (j ≥ 2) is a parameter. We are interested here in j = 4, 8, 16. The input to the j-lanes hash function is a message M whose length is N bits.
When k ≥ 1, the words m j , j = 0, 1, …, k − 2 (if k − 2 < 0, there are no words in the count) consist of Q bits each. If N is not divisible by Q, then the last word m k−1 is incomplete, and consists of only (N mod Q) bits.
We then split the original message M into the j disjoint sub-messages (buffers) Buff 0 , Buff 1 , …, Buff j−1 as follows: , then one or more buffers Buff i will be a NULL buffer. If N = 0 all the buffers are defined to be NULL, and will be hashed as the empty message (i.e. only the padding pattern is hashed in that case).
After the message is split into j disjoint buffers, as described above, the underlying hash function, HASH, is independently applied to each buffer as follows: The final stage of the process is called the wrapping stage. It hashes a message with a fixed size of j × D bytes. The number of updates required is ⌈(j×D+P)/B⌉ that are likely to be serial updates.

Remark 2:
The API for a j-lanes hash for a fixed j would be the same as for the underlying hash, i.e. for SHA-256, the j-lanes implementation could have the following API: SHA256_j_lanes (uint8_t* hash, uint8_t* msg, size_tlen). Similarly to the serial hashing, the j-lanes hashing can process the message incrementally (e.g., when the messages is streamed). Since the parallelized compression operates (in parallel) on consecutive blocks of j × B bytes, it needs to receive only the "next j × B bytes" in order to compute an Update.

The j-Pointers Tree Hash
An alternative way to define j "slices" of the message M, is to provide j pointers to j disjoint buffers Buff 0 , ..., Buff j−1 , of M, together with k values for the length of each buffer. In this case, it is also required that Ʃ i length (Buff i ) = length (M).

The Difference between j-Pointers Tree Hash and j-Lanes Tree Hash
The j-pointers and the j-lanes tree modes are essentially the same construction, and the difference is in how the message is viewed (logically) as j slices. The j-lanes tree mode has a performance advantage when implemented on SIMD architectures because it supports natural sequential loads into the SIMD registers: each word is naturally placed in the correct lane (see Figure 1). The j-pointers tree mode expects the data to be loaded from j locations. It is more suitable for implementations on multi-processor platforms, and for hashing multiple independent messages into a single digest (e.g., hashing a complete file-system while keeping a single digest). Of course, a j-pointers tree can also be used on a SIMD architecture, but in that case it requires "transposing" the data in order to place the words in the correct position in the registers. This (small) overhead is saved by using the j-lanes tree mode.

Counting the Number of Updates
The performance of a standard (serial) hash function is closely proportional to the number of Updates (U) that the computations involve, namely (1) In Equation (1), each Update consumes B additional bytes of the (padded) message, and the number of bytes in the padded message is at least L + P (with no more than a single block added by the padding).
For the j-lanes hash (with the underlying function HASH), the number of serially computed Updates can be approximated by (2) Note that some of the j-lanes Updates are carried out in parallel, compressing min(S, j) blocks per one Update call. Equation (2) accounts for parallelizing at most min(S, j) block compressions, thus contributing the term ⌈L/(min(j,S) × B)⌉, plus one Update for the padding block. A padding block is counted for each lane, although, depending on the length of the message, some Updates are redundant. The wrapping step cannot be parallelized (in general) and adds ⌈(j × D + P)/B⌉ serial Updates to the count.

Performance
This section shows the measured performance of j-lanes SHA-256, for j = 4, 8, 16, and compares it to the performance of the serial implementation of SHA-256. The results are shown in Figure 7 and Figure 8.
Clearly, the j-lanes SHA-256 has a significant performance advantage over the serial SHA-256, for messages that are at least a few kilobytes long. The choice of j affects the hashing efficiency: for a given architecture, j-lanes SHA-256 with j > S is slower than j-lanes SHA-256 with the optimal choice of j = S, due to the longer wrapping step. However, the differences become almost negligible for long messages.

Conclusions
This paper showed the advantages of a j-lanes hashing method on modern processors, and provided information on how it can be easily defined and standardized.
The choice of j is a point that needs discussion. If a standard supports different j values, then the optimal choice can be selected per platform. This, however, could add an interoperability burden, and we can imagine that a single value of j would be preferable. In this context, we point out that Figure 2 and Figure 3 (theoretical approximations) are consistent with Figure 7 and Figure 8 for j = 4 and j = 8 (actual measurements). Therefore, Figure 4 can be viewed as a good indication for what can be expected when using j = 16 on the future architectures that would introduce the AVX512f architecture (supporting S = 16). Furthermore, j = 16 allows better parallelization on multicore platforms. Consequently, our conclusion is that if only one value of j is to be specified by a standard, then the choice of j = 16 would be the most advantageous.