An Improvement of Data Cleaning Method for Grain Big Data Processing Using Task Merging

Data quality strongly influences the application of grain big data, so data cleaning is a necessary and important task. In the MapReduce framework, parallel techniques are often used to execute data cleaning with high scalability, but, for lack of effective design, the cleaning process contains a large amount of redundant computation, which lowers performance. In this research, we found that during data cleaning some tasks are often carried out multiple times on the same input files, or require the same operation results. To address this problem, we propose a new optimization technique based on task merging. By merging simple or redundant computations on the same input files, the number of MapReduce rounds can be reduced greatly. Experiments show that this significantly reduces the overall system runtime, which confirms that the data cleaning process is optimized. In this paper, we optimized several data cleaning modules, namely entity identification, inconsistent data restoration, and missing value filling. Experimental results show that the proposed method increases the efficiency of grain big data cleaning.

cleaning system [13]. Yan et al. proposed an iterative data cleaning method based on time sequence analysis, because power device status information can be treated as a multivariate time sequence of each state [18]. Gueta et al. applied user-level data cleaning to biodiversity databases, and presented a new framework to quantify the effect of data cleaning on SDMs [19]. Xu et al. proposed an incorrect data detection method based on an improved local outlier factor (LOF), and verified its effectiveness using simulated vibration data generated by a defective rolling element bearing [20]. All of the above scholars studied big data cleaning methods in specific fields, but general methods for big data cleaning are still lacking. Nowadays, almost all data analysis tasks can be completed using MapReduce, but as data volumes explode, a large amount of redundancy is also generated. In China, owing to the lack of top-level design and related standards, the grain informatization process initiated by the government has produced a large amount of dirty data, which has had a negative impact on macro decision-making. A better method is needed to clean grain big data specifically. In this paper, we propose a new optimized method for grain big data, based on task merging, to reduce these redundancies.
A traditional big data cleaning system is shown in Figure 1. It runs on the Hadoop platform and handles different kinds of data with different levels of quality in a flexible way. The system consists of several modules, each of which deals with one type of data quality problem. The interaction module provides an input interface for the files or data that need to be cleaned. The display module gives a comparison between dirty data and cleaned data. The entity identification and true value discovery modules are used to reduce redundancy. The inconsistency detection module is used to recover data, and the data filling module detects missing data and fills it. Users may select the modules required for the data quality problem they encounter.
The above data cleaning system often requires multiple MapReduce operations. A complicated problem has to be divided into many simple tasks, and each task is executed via one round of MapReduce. In the majority of cases, this division is excessive, which leads to superfluous MapReduce operations. In this paper, we merged some tasks to optimize the MapReduce operations of the above big data cleaning system. Specifically, the proposed optimization merges redundant computations or simple computations on the same files to reduce the number of MapReduce rounds. As a result, the system running time is greatly reduced and system performance is clearly improved.

Traditional Missing Data Filling
Automatic data collection produces a large amount of missing data. Normally, a Naive Bayes Classifier (NBC) is applied to resolve missing data. In the traditional data cleaning system, the missing data filling module contains roughly three parts: a parameter estimation module, a linking module, and a filling module [21]. The main task of the parameter estimation module is to compute the probability of each attribute value, so that the value with the highest probability can be taken as the filling value. In particular, when the sample space is large enough, the probability is approximated by the frequency with which the attribute value appears. The linking module associates each attribute value with its probability. Its input is the output of the parameter estimation module together with the original data to be filled, and its output is a file that contains the relationship between each missing value and its probability. The filling module is executed as one round of MapReduce. First, a linking computation is executed between the output of the linking module and the original input data, using the offset as the key. Its Map stage is similar to that of the linking module, and its Reduce stage uses the Bayes formula to select the value with the maximum probability as the filling value. In this paper, we only consider the filling of discrete missing data.
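The filling rule described above can be illustrated with a minimal single-machine sketch. It is not the system's MapReduce implementation: the function names `fit_counts` and `fill_value` and the row layout are assumptions made for illustration, probabilities are approximated by frequencies as the text describes, and no smoothing is applied.

```python
from collections import Counter, defaultdict


def fit_counts(complete_rows, target):
    """Count frequencies on tuples that contain no missing data (hypothetical helper)."""
    class_counts = Counter()            # frequency of each value of the target attribute
    cond_counts = defaultdict(Counter)  # (attribute index, target value) -> value frequencies
    for row in complete_rows:
        c = row[target]
        class_counts[c] += 1
        for i, v in enumerate(row):
            if i != target:
                cond_counts[(i, c)][v] += 1
    return class_counts, cond_counts


def fill_value(row, target, class_counts, cond_counts):
    """Return the candidate value with the highest naive Bayes score for one tuple."""
    total = sum(class_counts.values())
    best, best_score = None, -1.0
    for c, n_c in class_counts.items():
        score = n_c / total                         # prior, approximated by frequency
        for i, v in enumerate(row):
            if i != target:
                score *= cond_counts[(i, c)][v] / n_c  # P(attribute value | candidate)
        if score > best_score:
            best, best_score = c, score
    return best
```

In the system described here these counts are produced by MapReduce rounds rather than in memory; the sketch only shows which quantities the parameter estimation and filling modules need.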

Analysis and Optimization for the Missing Data Filling Module
We first analyzed the data stream and the relationships between modules in the above data filling framework. The whole filling process needs two kinds of conditional probabilities (CP): 1) the parameter estimation module uses the tuples in its input data that do not contain missing data to compute a conditional probability (denoted CP1) of an attribute value that needs to be filled; 2) another conditional probability (denoted CP2), which is a group of special values, is used in the process of filling values. The relationship between the two CPs is determined by the values of the dependency attributes. The group of special values is associated with the tuples that need to be filled in the original data via the offsets of those tuples. Therefore, the linking module is required between the parameter estimation module and the filling module.
After carefully observing the data stream in the above system, we found that in parameter estimation both the Map input and the Map output contain the offsets of the tuples, but the Reduce output contains only the attribute value and CP2. An additional linking computation is therefore needed to link CP2 with the offsets of the tuples to be filled.
To address this, we propose an optimized scheme that merges the tasks of the parameter estimation module and the linking module. First, in the parameter estimation module, we associate the output conditional probability with the offsets of the tuples that contain missing data. The algorithm is given as Algorithm 1.
In the above code segment, value is the attribute value of each tuple that contains a missing value, offset is the offset of the original record within the original file, and probable_value.txt is the text file that contains all possible values for the missing data. The Reduce process is shown in Figure 2. For example, Table 1 is a dataset that contains missing data, where the missing data may be V1 or V2. The first two tuples do not contain missing data, so they are split only according to their attributes. The third tuple contains a missing value, so it is split according to each probable value.
The outputs of the Map stage are shown in Table 2. In the Reduce stage, the prefixes of all input data are checked first. Then, if the input data contain missing values, the conditional probabilities are calculated; otherwise, the input data are entered into likelihood, which is used to determine whether a conditional probability should be output. Finally, the conditional probabilities whose attribute values are in likelihood are selected as the output, and the results are shown in Table 3.
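The core idea of the merge, carrying the offsets of incomplete tuples through the same round that estimates the probabilities, can be sketched as follows. This is a simplified, in-memory illustration and not Algorithm 1: the key here is the full combination of dependency attribute values rather than the per-attribute, per-candidate split described above, and the tags "C" and "M" and the function names are assumptions.

```python
from collections import Counter, defaultdict


def map_phase(records, target):
    """Emit (dependency values, tagged payload) for every (offset, row) record."""
    for offset, row in records:
        deps = tuple(v for i, v in enumerate(row) if i != target)
        if row[target] is None:
            yield deps, ("M", offset)       # incomplete tuple: keep its offset
        else:
            yield deps, ("C", row[target])  # complete tuple: contributes to the counts


def reduce_phase(pairs):
    """Group by key, estimate frequencies, and attach them to the waiting offsets."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    for values in groups.values():
        counts = Counter(v for tag, v in values if tag == "C")
        total = sum(counts.values())
        probs = {c: n / total for c, n in counts.items()} if total else {}
        for tag, payload in values:
            if tag == "M":
                yield payload, probs        # (offset, conditional probabilities)
```

Because each output record already carries the offset of a tuple to be filled, no separate linking round is needed before the filling module runs.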
In the parameter estimation module, the algorithm complexity on Map stage is where m is the number of tuples that don't contain missing values, N is the number of tuples that contain missing values, and X is the number of probable missing values. Normally, due to N m

Optimization of Entity Identification
Entity identification recognizes the forms in which an entity appears. For the same entity, data from different sources produce different representations, and errors may even be introduced during data storage or transformation. Although there has been much research on entity identification in the MapReduce framework [22] [23], this work basically addresses the identification of anonymous entities, and little of it handles the identification of homonymous entities faultlessly [24] [25]. In this paper, we try to solve the two problems simultaneously.
We found that both the basic cluster module and the entity identification module use the preprocessing result repeatedly, M times in total (M is the number of attributes in each tuple), and that the subsequent entity identification module also works on a single attribute at a time. If the preprocessing module and the entity identification module are considered as a whole, the input files have to be scanned many times and only part of the input data is used in each scan, which results in a low data utilization rate.
In addition, the system requires extra resources for each task allocation. In view of this, we need a scheme that can process all attributes of each tuple in one MapReduce round.
For this purpose, we propose the following optimization for the entity identification module: first, use the basic cluster module to process all attributes at the same time, and then produce an attribute index table for all attribute values.
Thus, the separate preprocessing procedures can be merged together. The detailed scheme is described as follows.

Basic Cluster Module
This module no longer outputs only the ith attribute value, but outputs all attribute values.
Considering that the entity IDs in the attribute index table come from different attributes, we added a prefix to the keys, turning "attribute value" into "attribute sequence * attribute value". Because MapReduce groups records according to keys, the entities with the same attribute values will be placed in the same attribute table. In the Reduce stage, we added the attribute sequence to the entities as the prefix of the entity IDs. The optimized basic cluster algorithm is given as Algorithm 2.
In the above algorithm, the entity ID is the code of each tuple. We set a unique entity ID for each tuple (one line of data) in the preprocessing stage. In addition, the property ID is the sequence code of a property in the tuple.
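The key-prefixing scheme can be sketched in a few lines. This is an in-memory illustration under stated assumptions, not Algorithm 2 itself: the function names and the exact string layout of the prefixed keys and entity IDs are hypothetical.

```python
from collections import defaultdict


def cluster_map(entity_id, row):
    """Emit one record per attribute, keyed by 'property ID * attribute value'."""
    for prop_id, value in enumerate(row):
        yield f"{prop_id}*{value}", entity_id


def cluster_reduce(pairs):
    """Build the attribute index table; entity IDs are prefixed with the property ID."""
    index = defaultdict(list)
    for key, entity_id in pairs:
        prop_id = key.split("*", 1)[0]
        index[key].append(f"{prop_id}*{entity_id}")
    return dict(index)
```

All attributes of a tuple are emitted in a single pass over the input, which is what allows the M separate preprocessing rounds to be merged into one.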
The following is an illustration of an optimized algorithm flow. Table 4 shows the data that need to be recognized. In Map stage, we split all tuples according to their properties and then output results, as shown in

Entity Identification Module
Because the above improvement to the entity cluster module guarantees that entities with the same attribute values belong to the same attribute index table, the algorithm of the entity identification module does not need to be changed, and its time complexity therefore remains the same. Because we only optimized the operations for part of the data on Hadoop, we regarded the entity partition module as constant, and its time complexity did not change. Before optimization, the number of MapReduce rounds is 1 + M (1 + 4) = 5M + 1; after optimization, it becomes 1 + 1 + 4 = 6, so the speed-up ratio reaches (5M + 1)/6. Normally M > 1, so the speed-up ratio is greater than 1, and the speed-up effect becomes more evident as M increases. At the same time, the number of IO passes is also reduced from 5M + 1 to 6, which reduces the time the system spends on IO. In addition, because there are fewer MapReduce rounds, the time for task scheduling and the resources used are also reduced.
In general, the scheme proposed in this paper should, in theory, provide a clear optimization effect.
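The round counts above can be checked with a small calculation. The function names are chosen for illustration only; the formulas are the ones stated in the text, and the ratio counts MapReduce rounds alone.

```python
def rounds_before(m):
    """MapReduce rounds before optimization: 1 + M(1 + 4) = 5M + 1."""
    return 1 + m * (1 + 4)


def rounds_after():
    """MapReduce rounds after optimization: 1 + 1 + 4 = 6, independent of M."""
    return 1 + 1 + 4


def theoretical_speedup(m):
    """Ratio of rounds before to rounds after for M attributes."""
    return rounds_before(m) / rounds_after()
```

For M = 3 this gives 16/6, or about 2.7, and for M = 1 the ratio is exactly 1, so the merge only pays off when tuples have more than one attribute.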

Reparation Optimization to Inconsistent Data
In real applications and databases, there is a large amount of inconsistent data for various reasons. In the proposed system, we defined integrity constraints according to the conditional functional dependency principle from the theory of data dependency, and used these integrity constraints to repair inconsistent data. The purpose of this paper is to improve the performance of the reparation module for inconsistent data. Whether the repair process is correct depends on the conditional functional dependencies. Detailed explanations can be found in references [5] [6].

Optimization for Inconsistent Data Reparation
The main steps of the inconsistent data reparation module are as follows [26]: (1) the data files and the CFD (conditional functional dependency) files are input to the system for preprocessing, transformed into a proper format, and checked for subsequent processing; (2) the outputs of preprocessing are checked and repaired to obtain the primary reparation results; (3) the primary reparation results are checked to determine whether new inconsistencies have been introduced. If so, step (1) is executed again; otherwise, execution continues with step (4). To avoid an endless loop, an upper limit on the number of repetitions is necessary; (4) the reparation results are processed and the data are converted back to the original format so that other systems can use them normally.
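The bounded repair loop in steps (2) and (3) can be sketched as follows. This is a minimal illustration: `repair_once` and `find_violations` are hypothetical callables standing in for the repair and check stages, and the default limit of 10 rounds is an arbitrary assumption.

```python
def repair_until_consistent(data, repair_once, find_violations, max_rounds=10):
    """Repeat repair and check until no violations remain or the limit is reached.

    Returns (data, rounds_used, consistent).
    """
    for round_no in range(1, max_rounds + 1):
        data = repair_once(data)
        if not find_violations(data):
            return data, round_no, True
    # Upper limit reached: stop to avoid an endless loop.
    return data, max_rounds, False
```

Returning the consistency flag lets the caller distinguish a clean result from one that merely ran out of rounds.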

Analysis and Optimization for Inconsistent Data Reparation
A main shortcoming of the inconsistent data reparation module is that it divides a task that could be finished in a single round of MapReduce into several tasks, which reduces system performance. Therefore, we merged multiple tasks into one task, without changing the algorithm complexity.

1) Preprocessing Module
The function of the preprocessing module is only to set up an index for the input data; it does not involve data decomposition or merging. A single map function is therefore enough to realize this process, and the algorithm is given as Algorithm 3.
Obviously, the algorithm complexity of this module is O(m).
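A map-only indexing pass of this kind might look like the sketch below. It is not Algorithm 3: the use of the byte offset as the index and the "|" field delimiter (as found in TPC-H .tbl files) are assumptions made for illustration.

```python
def preprocess_map(lines, delimiter="|"):
    """Attach an index (byte offset of the line) to every input record.

    Each line is touched once, so the cost is linear in the number of tuples.
    """
    offset = 0
    for line in lines:
        yield offset, line.rstrip("\n").split(delimiter)
        offset += len(line.encode("utf-8"))
```

No grouping by key is required, which is why the Reduce stage can be dropped for this module.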

2) Detection and Reparation Module
Detection and reparation of constant violations can be finished in only one round of MapReduce. However, the Map stage emits N copies of each tuple (N is the number of constant violations). Although N is not large and has almost no influence on the algorithm complexity of the Reduce stage, it still enlarges the amount of intermediate data N times, which places a heavy load on communication. We found that constant violations can be repaired while their suggested values are being calculated, so it is not necessary to separate the detection process from the reparation process. Therefore, we propose an optimized scheme that completes constant violation detection and reparation using one map function.
After constant violation reparation is finished via one map function, the data are passed directly to the variable reparation module. Here, offset is the index of the tuple, and fixFlag is the reparation flag that indicates whether reparation is needed: "0" indicates that reparation is needed because of a violation, and "1" indicates that it is not. cfdseq is the serial number of the CFD that the tuple violates, ptseq is the serial number of the tuple in the mode table, propseq is the serial number of the attribute that holds the inconsistent data, and output_value is the repaired value of the attribute.
The computational complexity of Algorithm 4 is O(m), and the flow chart of the algorithm is shown in Figure 3.
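Detecting and repairing constant violations in one map function can be sketched as below. This is an illustration under assumptions, not Algorithm 4: the representation of a CFD as a list of (lhs, rhs) pattern tuples is hypothetical, and the field names only mirror those described in the text.

```python
def repair_constant_violations(offset, row, cfds):
    """Check one tuple against constant patterns and repair it in the same pass.

    cfds: list of CFDs; each CFD is a list of pattern tuples (lhs, rhs), where
    lhs is {propseq: constant} and rhs is (propseq, constant). Assumed layout.
    Returns (offset, fix_flag, fixes, repaired_row).
    """
    row = list(row)
    fixes = []
    for cfdseq, patterns in enumerate(cfds):
        for ptseq, (lhs, (propseq, expected)) in enumerate(patterns):
            matches_lhs = all(row[i] == v for i, v in lhs.items())
            if matches_lhs and row[propseq] != expected:
                row[propseq] = expected  # repair while the suggested value is known
                fixes.append((cfdseq, ptseq, propseq, expected))
    fix_flag = "0" if fixes else "1"     # "0": a violation was found, "1": none
    return offset, fix_flag, fixes, row
```

Each tuple is processed once and emitted once, so the intermediate data are not multiplied by the number of violations.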
In terms of time complexity, the optimization did not change the computational complexity of any module or of any MapReduce round within a module. In terms of MapReduce rounds and IO passes, the number of MapReduce rounds of the system changed from 1 + 1 + 2 + 1 + 1 + 1 = 7 before optimization to 1 + 2 + 1 + 1 = 5 after optimization. From the perspective of MapReduce rounds alone, the speed-up ratio of the system is 7/5 = 1.4. In addition, the optimization turns the MapReduce round of the preprocessing module into a map-only job, which further reduces the running time of the system. With fewer MapReduce rounds, the number of IO passes is reduced correspondingly, which also lightens the IO burden of the system.

Experimental Results
The computer cluster used for the experiments was composed of ten nodes, including one job-tracker (name node) and nine task-trackers (data nodes). The hardware configuration of each node was as follows: Intel i7 7700K processor, 4.2 GHz main frequency, 8 GB memory, and 1 TB hard disk. The whole system was developed in Java in the Eclipse environment, and ran on the Hadoop 3.0 platform on CentOS 7.5.

Experiment for Entity Identification Optimization
To examine how the optimization effectiveness depends on dataset scale, we used a real dataset from the National Warehouse Grain Condition Monitoring Project (WGCM), a grain storage information repository in China. We selected three attributes, including warehouse temperature, grain temperature, and warehouse humidity, and five data scales, including 11. Figure 4 shows the running time on datasets of different scales for Naïve, BlockSplit, and PairRange, as used in Ref. [22], and for the task-merging method proposed in this paper.
As the dataset scale increases, the run-times of both the unoptimized system and the optimized system increase, while the ratio of the run-time of the unoptimized system to that of the optimized system stays at about 2.3, because only three attributes were used for each record in this experiment. Based on the analysis of the optimization effect in section 2.2, the theoretical ratio is (5 * 3 + 1)/6 ≈ 2.7, which is broadly in accordance with the experimental results. Because entity identification based on BlockSplit and PairRange is more complicated than the method based on task merging, their run-times are longer than that of the method proposed in this paper. In summary, this experiment illustrates the good scalability of the optimization scheme.

1) Influence of Parallelization Level on Optimization Effectiveness
To examine the influence of the number of Reduce tasks in the cluster on the optimization results, we used the real WGCM dataset as experimental data. We selected three attributes of the experimental data, namely warehouse temperature, grain temperature, and warehouse humidity, to construct an experimental dataset with 100,000 records. We set the weights of the attributes to 0.9, 0.1 and 0.1 respectively, and set the number of Reduce tasks to 2, 4, 6, 8 and 10 in turn. As shown in Figure 5, the optimization effectiveness of the proposed method is obvious at all parallelization levels. Figure 5 also shows that the system running time increases as the degree of parallelization increases. The main reason for this is that the experimental data are too small to benefit from additional parallelism. Nevertheless, the system still reaches a speed-up ratio of 2.3 at the different parallelization levels, which indicates that the optimization result is in line with the forecast.

2) Influence of Feature Number on Optimization Effectiveness
In this experiment, we studied the influence of the number of features of the input data. When dealing with records of the same scale, the optimization effectiveness improved steadily as the number of features increased. From Figure 6, we can see that the optimization results are worst, even worse than the unoptimized system, when only one feature is processed, but they improve as the number of features increases. The reason is that the optimized scheme generates more intermediate data than the non-optimized scheme and is more complicated. Because the purpose of the optimized scheme proposed in this paper is to make full use of the input data when dealing with multiple attributes, its advantage over the non-optimized scheme grows as the number of attributes increases.

Optimization Experiments for Inconsistent Data Reparation
We used a real dataset from the Zhengzhou grain trade market in China, named ZZGR, and an artificial dataset generated with the TPC-H benchmark of the Transaction Processing Performance Council, to verify the operation of the system in a real environment. We carried out a speed-up ratio experiment on the real dataset, and scalability and parallelism experiments on the artificial dataset.

1) Speed-up Ratio Experiment
In this experiment, six tuples without missing values were selected from the ZZGR dataset, and errors were intentionally injected into them to violate several constraints. The experimental conditions and results are shown in Table 6. The results show a clear speed-up on the real dataset. The optimization scheme achieved a measured speed-up ratio of 1.3, slightly below the theoretical speed-up ratio of 1.4 (see Algorithm 4).

2) Scalability Experiment
The aim of this experiment was to verify that the optimization is equally effective on datasets of different scales. The experimental dataset is composed of six attributes from lineitem.tbl, which was generated by TPC-H, and the CFDs are composed of one cfd including three lps. The experimental results shown in Figure 7 indicate that the optimization effectiveness improves as the dataset scale increases.
Thus our optimization scheme scales easily. From Figure 7, we can see that the running time of the non-optimized system increases as the dataset grows, and the running time of the optimized system also increases, but with a lower slope. Correspondingly, the speed-up ratio of the optimized system improved from 1.6 to 2.2. Before optimization, all modules are overloaded, whereas the optimized system reduces the burden on all modules except the data inconsistency detection and reparation modules. In addition, Figure 7 shows that, as the dataset grows, the non-optimized system reaches full load earlier than the optimized system.

3) Parallelism Experiment
To verify the influence of parallelism on optimization effectiveness, this experiment used six attributes from lineitem.tbl, which was generated by TPC-H, to form the dataset. The CFDs were composed of one cfd including three tp samples, as shown in Figure 8. From Figure 8, we can see that the system speed-up ratio reached 2.3 at a low degree of parallelism with 2 reduces, and then decreased as the degree of parallelism increased. In addition, the running time of the non-optimized system becomes shorter as the degree of parallelism increases, whereas the running time of the optimized system remains unchanged. The reason is that the non-optimized system has weak processing capacity, which can only be improved by raising the degree of parallelism, so its running time becomes shorter as the degree of parallelism increases. The optimized system, by contrast, has a large processing capacity and is always underloaded when processing the same amount of data, so increasing the degree of parallelism brings no obvious advantage.

Optimization Experiment for Missing Value Filling
In this experiment, the data come from the real ZZGR dataset and from the artificial dataset generated with TPC-H. To verify the operation of the optimized system in a real environment, we used the ZZGR dataset to verify the impact of the missing rate on the optimization result, and the artificial dataset to verify scalability and parallelism.

1) Impact of Miss Rate on Optimization Effectiveness
We also studied the impact of various miss rates on the optimization results. The data for the experiments were generated by emptying a certain proportion of the values in the dataset. In this experiment, we selected eight discrete features, with the missing feature taking six values. The experimental results are shown in Figure 9. Under the miss rates shown in Figure 9, the speed-up ratio stabilizes at roughly 1.5, which matches the theoretical value of 3/2 for this module.
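Test data with a given miss rate can be produced along the following lines. This is only a sketch of the general procedure: the function name, the use of None as the empty marker, and the fixed random seed are assumptions, and the actual generation procedure used in the experiments is not specified here.

```python
import random


def empty_column(rows, column, miss_rate, seed=0):
    """Return a copy of rows in which a fraction miss_rate of one column is emptied."""
    rng = random.Random(seed)                     # fixed seed for repeatable experiments
    n_missing = round(len(rows) * miss_rate)
    chosen = set(rng.sample(range(len(rows)), n_missing))
    return [
        tuple(None if (i in chosen and j == column) else v for j, v in enumerate(row))
        for i, row in enumerate(rows)
    ]
```

Emptying an exact count rather than sampling each row independently keeps the realized miss rate equal to the nominal one, which makes runs at different rates easier to compare.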

2) Verification Test for Scalability
In this experiment, we selected six features from lineitem.tbl, which was generated by TPC-H, to build the dataset. The experimental results are shown in Figure 10. From Figure 10, we can see that the system running time increases as the dataset grows, but the speed-up ratio stays around 1.5, which coincides with the theoretical value for this module.

3) Verification Test for Parallelism
This experiment was designed to test the optimization effectiveness of the system under different degrees of parallelism. In this experiment, we used TPC-H to produce a data table named lineitem.tbl, including six attributes and 1,000,000 tuples. We randomly emptied 5% of the data in the first column of the table and recorded the optimization results under different degrees of parallelism.
The experimental results are shown in Figure 11.
On the provided dataset, the operating efficiency of neither the non-optimized system nor the optimized system improves as the degree of parallelism increases. This is because, for a dataset of a given scale, there is a fixed most appropriate number of Reduce tasks, and merely increasing the degree of parallelism brings more task allocation overhead to the system. In any case,