Optimized Homomorphic Scheme on Map Reduce for Data Privacy Preserving

Security assurance has been a paramount issue for cloud services over the last decade. Therefore MapReduce, a programming framework for processing and generating huge data collections, should be optimized and securely implemented. But conventional operations cannot be applied to ciphertexts, so there is a pressing need to enable particular sorts of computation over encrypted data and, additionally, to optimize data processing at the Map stage. Schemes such as DGHV and Gen 10 have been presented to address the data privacy issue. However, the private encryption key (DGHV) or the key's parameters (Gen 10) are sent to an untrusted cloud server, which compromises the information security assurance. Therefore, in this paper we propose an optimized homomorphic scheme (Op_FHE_SHCR) which speeds up ciphertext (c_R) retrieval and addresses metadata dynamics and authentication through our secure Anonymiser agent. Additionally, for the efficiency of our proposed scheme regarding computation cost and security analysis, we utilize a scalar homomorphic approach instead of applying a blinding probabilistic polynomial-time computation, which is computationally expensive. In doing so, we apply an optimized ternary search tries (TST) algorithm in our metadata repository, which utilizes a Merkle hash tree structure to manage metadata authentication and dynamics.


Introduction
The rapid development in outsourcing data processing and storage to distributed computing frameworks, in addition to the mining of complex and huge data collections, has extended the availability of useful data to various organizations of modern society in an exponential way. But data privacy protection is a principal issue in huge dataset management in the cloud environment, as the dataset owner no longer has physical control of his dataset, according to the Cloud Security Alliance (CSA) [1]. For instance, a remote body-sensed data monitoring system contains delicate data, like patient identity and address, whose disclosure can lead to harmful consequences. Therefore MapReduce, a computing framework that processes and creates substantial distributed information, must be strongly secured to preserve data privacy in the cloud. Security assurance issues related to MapReduce and the cloud have started to draw serious consideration. Puttaswamy et al. [2] presented a set of tools called Silverline that can isolate all functionally encryptable information from other cloud application information to guarantee data protection. Likewise, Zhang et al. [3] proposed a privacy leakage upper-bound constraint based solution that deals with the data privacy preserving issue by encrypting only part of the data available in the cloud. Roy et al. [4] proposed a framework named Airavat which enforces mandatory access control via a differential privacy technique. Blass et al. [5] proposed a data privacy model named PRISM for the MapReduce structure in the cloud to perform parallel word search on encrypted data collections. Ko et al. [6] proposed the HybrEx MapReduce model that processes very sensitive and private information in a private cloud, while other data can be securely processed in the public cloud. Zhang et al. [7] proposed a system called Sedic which partitions MapReduce computing tasks according to the security labels of the data; the computation that involves no sensitive data is then assigned to a public cloud.
However, conventional data encryption, considered the essential technique to address the security issue in the cloud, cannot be directly implemented in the MapReduce framework. Therefore a new cryptosystem called homomorphic encryption, which can process encrypted data, was first introduced in [8] to find an effective solution to this challenge. Fully homomorphic encryption (FHE) can compute an arbitrary function on encrypted data without using the secret key. However, FHE raised two major challenges regarding its implementation and computation cost [9] [10]. First, the number of available FHE schemes in the literature is very limited. Second, the efficiency of fully homomorphic encryption is the biggest challenge to address. The efficiency of FHE has therefore been the key problem since its invention, and it hinders a myriad of potential applications such as private cloud computing in practice. In particular, the key size in an FHE scheme is big. An FHE scheme based on Learning with Errors (LWE) includes not only a public key and a private key but also some evaluation keys. For an L-leveled FHE scheme, there are L evaluation keys.
Each evaluation key is a high-dimensional matrix; such matrices not only need a lot of space to store but also affect the efficiency of computation. In 2009, an updated construction by Gentry [11] appeared to be an effective alternative for addressing the data privacy protection issue, which the information security community considers a paramount topic. Thereby, schemes like DGHV [12] and Gen 10 [13] were introduced to securely compute data through homomorphic approaches in the MapReduce environment. Unfortunately those solutions suffer some basic shortcomings as far as security is concerned [12] [13]. The DGHV scheme [12] uses many of the tools from Gentry's construction, but it does not require ideal lattices. Moreover, its authors prove that the somewhat homomorphic component of Gentry's ideal lattice-based scheme [13] can be replaced with a very simple somewhat homomorphic scheme over the integers. Their model is therefore conceptually simpler, but the private key must be transferred to the cloud server, which is very insecure. Gentry proposed a homomorphic encryption scheme, Gen 10 [13], applicable in a cloud environment and conceptually simple. In this scheme the encryption function is homomorphic with respect to addition, subtraction and multiplication. The relationship between c and m is that m is the residue of c with respect to the modulus p, that is, m = c mod p. To retrieve the plaintext, the Gen 10 scheme does not expose the private key as [12] does.
It uses a random private parameter q instead, which seems more secure. Yet since the parameter q is sent to the server, the plaintext m may leak out by computing c mod q, so this scheme suffers from a security weakness too. The goal of this paper is to construct an efficient FHE scheme with a better key size. Thereby, in this work we introduce our solution by presenting an optimized scalar homomorphic scheme (Op_FHE_SHCR) that first addresses the above-mentioned data privacy protection shortcomings through efficient operations over ciphertexts without compromising the cryptosystem like the existing models [12] and [13]. Furthermore, we demonstrate how fast our proposed solution retrieves ciphertexts at the Reduce stage in an optimized and secure way. The rest of this work is organized as follows: we present the programming and cryptographic primitives in Section 2; a short discussion of related work is given in Section 3; Section 4 gives details of the concrete implementation and security analysis of our proposed solution; finally, we conclude this work in Section 5.
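The leak can be illustrated with a toy Gen 10-style example. The parameter sizes below are illustrative assumptions for demonstration only, not the scheme's actual parameters:

```python
import random

# Toy Gen 10-style encryption: c = m + p*q, where p is the secret key.
p = 10007                            # secret key (toy value, not secure)
m = 42                               # plaintext, with m < p and m < q
q = random.randrange(10**6, 10**7)   # random parameter later sent to the server
c = m + p * q                        # ciphertext

print(c % p)   # legitimate decryption with the secret p -> 42
print(c % q)   # but anyone who holds q recovers m as well -> 42
```

Because p*q vanishes modulo q, any server holding q recovers m directly whenever m < q, which is exactly the weakness described above.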

Map Reduce
MapReduce is defined in [14] as a computation framework to process and generate large datasets. In this programming environment, users specify a Map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a Reduce function that merges all intermediate values associated with the same intermediate key (Figure 1).
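As a minimal illustration of this programming model, a word-count sketch (our own simplification, not the paper's implementation):

```python
from collections import defaultdict

def map_fn(_, line):
    # Map: emit an intermediate (word, 1) pair for each word in the line
    for w in line.split():
        yield (w, 1)

def reduce_fn(key, values):
    # Reduce: merge all intermediate values for the same key
    return (key, sum(values))

def run_mapreduce(records, map_fn, reduce_fn):
    groups = defaultdict(list)
    for k, v in records:
        for ik, iv in map_fn(k, v):
            groups[ik].append(iv)       # Shuffle: group by intermediate key
    return dict(reduce_fn(k, vs) for k, vs in groups.items())

print(run_mapreduce([(0, "a b a"), (1, "b")], map_fn, reduce_fn))
# -> {'a': 2, 'b': 2}
```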
A homomorphism between two algebras A and B over a field (or ring) K is a map F: A → B such that, for all k in K and x, y in A:
• F(kx) = kF(x);
• F(x + y) = F(x) + F(y);
• F(xy) = F(x)F(y).
If F is bijective, then F is said to be an isomorphism between A and B.
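A concrete example of such a map is reduction modulo n, here F: Z → Z_5:

```python
# F(x) = x mod 5 is a ring homomorphism from Z onto Z_5:
# it preserves both the additive and the multiplicative structure.
def F(x):
    return x % 5

x, y = 17, 23
assert F(x + y) == (F(x) + F(y)) % 5   # F(x + y) = F(x) + F(y) in Z_5
assert F(x * y) == (F(x) * F(y)) % 5   # F(xy) = F(x)F(y) in Z_5
```

Note that this F is not bijective, so it is a homomorphism but not an isomorphism.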

♦ Homomorphic Encryption
As ever-increasing amounts of data are outsourced into distributed storage, frequently unencrypted, considerable trust is required in the cloud providers.
The CSA ranks data breach as the top issue in cloud security [1]. Encrypting the data with conventional encryption addresses this issue, but then the end user cannot work on the encrypted data and must download it and perform the decryption locally. Therefore, there is a need to allow the public cloud server to perform computations on behalf of the end users and return just the encrypted result. The development of homomorphic encryption is thus a very impressive advance, greatly amplifying the extent of computation which can be applied to encrypted data. The enthusiasm in the research community is justified by the various real-world applications of this theme, such as medical applications, consumer privacy in advertising, data mining, and financial privacy.
Homomorphic encryption allows complex mathematical operations to be performed on encrypted data without compromising the encryption. It is expected to play an important part in cloud computing, allowing companies to process and store encrypted data in a public cloud and take advantage of the cloud provider's analytic services. It was first designed in 1978 by Rivest et al. [8] and upgraded by the research community. Craig Gentry [11] theoretically showed the possibility of implementing this kind of encryption scheme [13]. In the same way, the researcher Jaydip Sen represents homomorphic encryption clearly as a quadruple in [15]. However, homomorphic ciphers typically do not, in and of themselves, provide verifiable computing, and some variants are not semantically secure. Furthermore, poor performance is the big disadvantage of this scheme. Ciphertexts are much larger than the plaintexts, so communication requirements typically go up, and computations on these large ciphertexts are typically slower than performing the computation on the plaintext itself. Because of this, in the outsourcing computation model we typically see a requirement that encrypting inputs and decrypting outputs should be faster than performing the computation itself.
Therefore there is a strong need to optimize the data processing in order to efficiently reduce the computation and communication costs.

Related Work
Security protection issues in the MapReduce framework have started to draw intensive consideration. Data confidentiality protection issues have been widely examined, and productive progress has been accomplished by security practitioners. We briefly review a few existing models for security protection in the MapReduce framework.

Xu Chen and Qiming Huang Scheme in [16]
The authors in [16] present a data privacy protection scheme for MapReduce utilizing homomorphic encryption. It is a modified MapReduce model that guarantees data secrecy while processing the data in encrypted form. They pick two large prime numbers A and B and compute P = A * B; A is the private key, and B should likewise remain confidential. To apply homomorphic encryption in this scheme, the authors make a few modifications to the ciphertexts, introducing a random positive number R, so that the Reduce function can find the identical keys and then group them.
Then, the authors compare (R * C) rather than C to retrieve similar keys. It is thus obvious that their proposed solution needs an additional computation (R * C) at the Reduce phase, which is costly in terms of computation, in order to obtain a probabilistic homomorphic cryptosystem. Recall that such a probabilistic homomorphic cryptosystem is exceptionally expensive; therefore their model is inefficient.

FHE_SHCR Scheme in [17]
As discussed in [17], the related work [16] requires additional computation at the Reduce stage, while the DGHV [12] and Gen 10 [13] schemes send, respectively, their private key and sensitive security parameters to an unreliable public cloud server (compromising the cryptosystem). Additionally, the above-mentioned models do not address the security and efficiency issues of ciphertext retrieval. Therefore, the authors in [17] present their contribution, FHE_SHCR, which is based on the schemes [13] [18] and [19], to address the privacy shortcomings of the models [12] [13] and [16]. The main objective of this model is to securely retrieve ciphertexts at the Reduce stage and enhance the retrieval algorithm's accuracy without revealing any information about the content of intermediate searchable ciphertexts, fixing the security shortcomings in [12] and [13]. The FHE_SHCR scheme is an efficient candidate for homomorphic encryption to preserve data privacy in the cloud through strong hybrid encryption [17].
Our contribution: note that the improvement in this paper is mainly the optimization of the input file decomposition (Map phase) and of the ciphertext retrieval algorithm (Reduce phase), by addressing the metadata dynamics and authentication path through a logical Merkle tree repository structure (optimized space-time cost).

The Optimized FHE_SHCR (Op_FHE_SHCR)
As clearly established by the research community, homomorphic encryption can carry out some operations over encrypted data effectively, but it is a very expensive scheme in terms of computation and communication costs [8] [11] [13]. Therefore, we introduce a new logical agent, the Anonymiser (with its three components: Decomposition Table, Query Processing, and Metadata Repository), at the user side under the control of the master program. In doing so, the user program can efficiently send to the master program the optimal decomposition (splitting: key/value) of a given input file before encrypting the data (Shuffle). Thereby, we use our optimized ternary search tries (TST) [20] in a logical Merkle tree structure to optimally address metadata authentication and dynamics through the Metadata Repository component. The architecture of the proposed solution is depicted in Figure 2.
Thus, through successful experiments (see the section below), our optimized algorithm (Op_FHE_SHCR) performs three times faster than the original FHE_SHCR scheme [17] and effectively addresses the metadata dynamics and authentication issues through a secure and efficient metadata repository (optimized ternary search tries to address the time-space constraints). Furthermore, we speed up the ciphertext retrieval algorithm (in accuracy and efficiency) at the Reduce phase by using the optimal Lagrange multiplier (µ*) to derive the optimum number N/e (see the section below). Note that the implementation of this Anonymiser as our Trusted Front End Database Management (TFE) agent enhances the security and speeds up the data processing of our proposed solution, and represents the key point of this extended work.

Optimized Metadata Authentication and Dynamics
As mentioned in the previous section, our proposed scheme further addresses the metadata authentication and dynamics issue for strong data privacy protection.
Therefore, we introduce a logical agent, the Anonymiser, in the master control program, as depicted in Figure 3. The data matching and authentication start from the root node in a top-down manner, and the dynamics process can be described as follows.
File uploading: suppose that a data owner wants to process a file F, identified by a_i (a leaf node), with a public cloud server whose attributes satisfy an access structure. Then, for each p_i, a symmetric key k_{p_i} is generated. Finally, the private cloud sends the corresponding trapdoor t(w) to the public cloud as well.
To enhance searching efficiency, a symbol-based tree is utilized to build an index stored in the private cloud (metadata repository). More precisely, the output of the one-way function f is divided into l parts, and a set consisting of all the possible values of each part is predefined (an example of such a tree is shown in Figure 3). Initially, the index based on the symbol-based tree has only a root node (denoted as node 0) which consists of ∅ (the empty set). The search process in a symbol-based tree is a depth-first search. The tree can be updated and searched as follows.
Update: assume the data owner wants to outsource a file F, identified by a_i, with keyword set W; the public cloud receives the corresponding index entries. The public cloud starts with the root node of the tree: it scans all the children of the root node and checks whether there exists some child node whose symbol equals a_{il}. This action is performed in a top-down manner. In general, for the subsequent symbols of t(w), the public cloud performs similar actions. One exception is that if matching fails (i.e. the current node has no child which matches the symbol), the search for t(w) is aborted. Otherwise, the corresponding identifier a_i is retrieved from the leaf node.
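The update and top-down search procedures can be sketched as a plain symbol tree. The names and structure here are our own simplification of the index described above:

```python
class Node:
    def __init__(self):
        self.children = {}   # symbol -> child Node
        self.ids = set()     # file identifiers stored at the leaf

def update(root, symbols, file_id):
    # Insert a symbol sequence top-down, creating child nodes on demand,
    # and record the file identifier at the final (leaf) node.
    node = root
    for s in symbols:
        node = node.children.setdefault(s, Node())
    node.ids.add(file_id)

def search(root, symbols):
    # Depth-first match; abort as soon as a symbol fails to match.
    node = root
    for s in symbols:
        if s not in node.children:
            return set()
        node = node.children[s]
    return node.ids

root = Node()
update(root, "abc", "file1")
print(search(root, "abc"))   # matching path -> {'file1'}
print(search(root, "abx"))   # failed match, search aborted -> set()
```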
To address the Merkle tree traversal problem, our scheme uses some tools from the efficient algorithm in [22] to overcome the space-time issue. Furthermore, to optimize the time-space constraints of the Merkle hash tree traversal process, we designed an optimized ternary search tries (TST) [20] algorithm, built on a sorting method that blends quicksort and radix sort. It is competitive with the best known C sort codes, and faster than traditional hashing and other commonly used search methods, as shown in Figure 4. The TST is space efficient, but its cost increases with the number of strings (N).
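The authentication-path computation that the traversal must perform can be sketched as follows. This is a minimal sketch of a Merkle tree with sibling-hash (AAI) extraction, not the optimized algorithm of [22]; it assumes a power-of-two number of leaves:

```python
import hashlib

def H(data):
    return hashlib.sha256(data).digest()

def build_tree(leaves):
    # Build the tree bottom-up as a list of levels, leaves first.
    levels = [[H(x) for x in leaves]]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([H(prev[i] + prev[i + 1]) for i in range(0, len(prev), 2)])
    return levels

def auth_path(levels, idx):
    # Collect the sibling hashes (the AAI) needed to rebuild the root
    # from the leaf at position idx.
    path = []
    for level in levels[:-1]:
        path.append(level[idx ^ 1])
        idx //= 2
    return path

def verify(leaf, idx, path, root):
    # Recompute the root from the leaf and its authentication path.
    h = H(leaf)
    for sib in path:
        h = H(sib + h) if idx % 2 else H(h + sib)
        idx //= 2
    return h == root

leaves = [b'a', b'b', b'c', b'd']
levels = build_tree(leaves)
root = levels[-1][0]
print(verify(b'c', 2, auth_path(levels, 2), root))   # -> True
```

Without the sibling hashes on the path, a holder of a single leaf hash cannot reconstruct the root, which is the property the metadata repository relies on.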
Therefore the traversal problem is how to efficiently calculate the authentication path for all leaves, one after another, from the first leaf up to the last leaf, for a minimum amount of space-time cost. Hence, it implies analyzing an optimal distribution of singleton attributes (a_i) to enhance the efficiency of the proposed solution; that is, finding the optimal number of strings or attributes (N) with which to populate the tree. In this work we use the Karush-Kuhn-Tucker (KKT) conditions of constrained optimization [24] to solve this issue in the section below. Practically, we design our solution using some mathematical tools from the scheme in [25] to find the minimum number of singleton quasi-identifiers that gives the optimal security level for the proposed traversal algorithm's efficiency.

(Figure 4. String symbol table implementation cost summary, from [23].)

The probability for the i-th element to be a singleton in the universal decomposition table, when each element selects one of the (n) choices (entries), is (1 − 1/n)^(N−1). Let the variable x_i be the indicator of whether the i-th element is a singleton; its expectation is E[x_i] = (1 − 1/n)^(N−1). Let X = Σ x_i be the variable that counts the singletons; its expectation is E[X] = N(1 − 1/n)^(N−1). We aim to find the smallest number of singletons that populates the Merkle tree in the metadata repository efficiently.
It implies minimizing E[X]. We therefore obtain the optimized number of singletons by rewriting the above distribution as a constrained optimization problem [24] and finding the dual solution of the primal problem, which is fast and reduces the space-time costs. KKT setup: primal variables x_i; Lagrange multipliers λ, µ. Consider an optimization problem of the form: minimize f(x) subject to g_i(x) ≤ 0 and h_j(x) = 0. Using the Lagrange multipliers and the duality theorem, the solution of the primal problem (P) is obtained from the dual function q(µ); since q(µ) is a smooth function, its gradient equals zero at the optimal point (x*). The stationarity and complementary slackness conditions then force x_i* = 1 for the singleton indicators, since any other assignment contradicts the KKT conditions.
Finally, the optimal number of singleton quasi-identifiers for a decomposition table of (n) entries with a maximum total number of distinct values (N) is N/e. Using this optimum number (N/e) to populate the ternary search tree, we improve the performance of the Merkle tree traversal algorithm by addressing the space-time cost issue. Thereby, the overall ciphertext retrieval time at the Reduce phase for our optimized algorithm Op_FHE_SHCR is almost three times less than that of the existing FHE_SHCR (refer to Figure 5).
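The N/e figure can be checked numerically. The Monte Carlo sketch below follows our reading of the derivation, with n = N elements placed uniformly into n entries, and compares the empirical singleton count against the closed form N(1 − 1/n)^(N−1) ≈ N/e:

```python
import random

def expected_singletons(n, N, trials=3000):
    # Monte Carlo estimate of E[X]: the expected number of entries that
    # hold exactly one element after N uniform random placements.
    total = 0
    for _ in range(trials):
        counts = [0] * n
        for _ in range(N):
            counts[random.randrange(n)] += 1
        total += sum(1 for c in counts if c == 1)
    return total / trials

n = N = 100
estimate = expected_singletons(n, N)
closed_form = N * (1 - 1 / n) ** (N - 1)   # ~ N/e when n = N
```

For n = N = 100 the closed form gives about 36.97, close to N/e ≈ 36.79, and the simulation agrees within sampling error.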

Op_FHE_SHCR Efficiency and Implementation Analysis
Regardless of the advances in integrating wireless sensor networks (WSN) and control systems into the cloud, there are still enormous challenges in terms of security assurance over outsourced data processing and storage [26]. For experimental purposes, we therefore work on a trained sensed dataset from a cancer pattern monitoring project. Our main goal in this paper is to securely optimize the Map phase (input file decomposition) and the ciphertext retrieval process (Reduce phase). Thereby, we implement an optimized scalar homomorphic MapReduce scheme, Op_FHE_SHCR, which contains four algorithms: KeyGen(k_e), Encrypt(1^σ, k_e, m), Decrypt(c) and Retrieval(c). Note that these four algorithms are quite similar to those in [21], with the modified mapping and hashing algorithm below (Algorithm 1).

Pseudo code:
This algorithm initializes the selected feature subset (splitting the input file into subsets), denoted S_F, with the empty set. A candidate feature subset, denoted C_F, is produced by adding a feature, denoted f_d, to the current selection to generate a new candidate. The new feature in the chosen candidate is then added to the selected feature subset. Thus, the algorithm iteratively adds one feature (or a fixed number of features, if the floating strategy is used) to grow the selected feature subset until the threshold is met. It should be pointed out that the main difference between the proposed algorithm and the existing ones in the literature is that our algorithm produces highly correlated data subsets based on the hashing index value. Therefore the ciphertext retrieval process at the Reduce stage is more efficient in terms of speed.
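A greedy sketch of this forward-selection loop; the scoring function and the size threshold k are placeholders we assume, not the paper's actual criteria:

```python
def forward_select(features, score, k):
    # Greedy forward selection: start from the empty subset S_F and
    # repeatedly add the single feature f_d whose candidate subset
    # C_F = S_F + [f_d] maximizes the score, until k features are chosen.
    S_F = []
    while len(S_F) < k:
        remaining = [f for f in features if f not in S_F]
        if not remaining:
            break
        f_d = max(remaining, key=lambda f: score(S_F + [f]))
        S_F.append(f_d)
    return S_F

# Toy usage: with the subset sum as the score, the two largest values win.
print(forward_select([1, 2, 3, 4], sum, 2))   # -> [4, 3]
```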
The design of our Op_FHE_SHCR cryptosystem is done using the HElib-master-2015.03 library in the Dev C++ IDE. We utilize the WDBC test training dataset from a cancer management project. Our security algorithm is implemented in four steps using the Gentry cryptosystem [13] [18] and [19].
The efficiency of the candidate solution is supported by experimental results comparing the existing blinding fully homomorphic FHE_DFI_LM algorithm, the previous FHE_SHCR, and our new optimized Op_FHE_SHCR algorithm. Recall that the improvement in this paper is mainly the optimization of the ciphertext retrieval time and of the metadata dynamics and authentication path in the logical Merkle tree repository (optimized space-time cost).
Table 1 and Figure 5 show the average performance of our proposed solution (Op_FHE_SHCR) in comparison to the related works (FHE_DFI_LM and FHE_SHCR).
Recall that the experimental requirements are to optimize the outsourced data processing at the Map stage and prevent intermediate data disclosure at the Reduce phase, in order to reinforce data privacy in the MapReduce framework. Our Op_FHE_SHCR processes the data at the Map stage (setup phase; refer to Table 1 and Figure 5) almost three times faster (5932 ms) than FHE_DFI_LM (13,078 ms) and two times faster than our previous work FHE_SHCR (11,684 ms).
This result is obtained by an optimized selection of map workers using the optimal number N/e for a given decomposition table of (n) entries at the splitting step.
Thereby, for a given dataset N, our algorithm calculates in advance the exact optimal number of subsets (feature selection) and map workers to speed up the splitting and data allocation process at the Map stage. Furthermore, each element (feature) of a subset is selected by an efficient feature selection algorithm (refer to Algorithm 1 and Figure 3). An attacker holding a hash value h(a_i) needs, in order to reconstruct (A), some additional values called Auxiliary Authentication Information (AAI), which are kept secret by the metadata repository administrator under the supervision of the Anonymiser query system. Therefore it is very hard for the public cloud server or an outside attacker to reconstruct the input files from the decomposition table (A).
To summarize the security analysis: by implementing a secure front-end database management agent (the Anonymiser) on top of the FHE_SHCR security mechanism [17], data privacy protection has been greatly reinforced in our proposed solution (optimized FHE_SHCR).

Conclusion
In this paper, the requirements are to optimize the outsourced data processing at the Map stage and prevent intermediate data disclosure at the Reduce phase in order to reinforce data privacy in the MapReduce framework. Therefore, we implement a secure front-end database management agent, the Anonymiser, with its three components (Decomposition Table, Query Processing, and Metadata Repository) to enhance the data security mechanism of our proposed solution. The cryptographic tool is a scalar homomorphic encryption that performs certain kinds of computation over encrypted data in a more secure and optimized design. The experimental results show that the optimized cryptosystem Op_FHE_SHCR is an efficient candidate for reducing communication and computation costs. Practically, it takes as input an optimized decomposition table (for map workers) and improves the speed and accuracy of the ciphertext retrieval process (for reduce workers) in the MapReduce environment. Furthermore, we address the metadata dynamics and the time-space cost constraints of the Merkle tree structure traversal in our metadata repository by applying an optimized ternary search tries (TST) algorithm.
Thereby the master program assigns a particular input file decomposition to the Anonymiser, whose three components are the Decomposition Table, Query Processing, and the Metadata Repository. Their functions can be briefly described as follows:
♦ Decomposition Table: responsible for defining the exact set of attributes (Α) for particular input files in the optimal number.
♦ Query Processing: filters the candidate map workers' query requests generated by the master program to produce anonymous query-based requests on data location for processing.
♦ Metadata Repository: keeps the data decomposition done by the Decomposition Table and forwards it to the Query Processing unit to generate new anonymous query requests. For the efficiency of the proposed scheme, we use a Merkle hash tree structure to deal with metadata authentication and dynamics [21].