Approximate Continuous Aggregation via Time Window Based Compression and Sampling in WSNs

Abstract

In many applications continuous aggregation of sensed data is usually required. The existing aggregation schemes usually compute every aggregation result in a continuous aggregation either by a complete aggregation procedure or by partial data update at each epoch. To further reduce the energy cost, we propose a sampling-based approach with time window based linear regression for approximate continuous aggregation. We analyze the approximation error of the aggregation results and discuss the determinations of parameters in our approach. Simulation results verify the effectiveness of our approach.


Introduction
Wireless sensor networks (WSNs) offer a powerful and efficient approach for monitoring and collecting information in a physical environment. To extract summary information about the monitored environment, aggregations of sensed data, such as sum and average, are queries of common interest to users. Therefore, many algorithms and protocols for aggregate query processing in WSNs have been proposed [1][2][3][4][5][6][7][8].
Existing works address two types of aggregate queries: exact and approximate. An exact aggregate query requires all the sensed data to be involved in the aggregation computation to obtain the exact aggregation results [1,2]. However, exact aggregate query processing often incurs great energy consumption and is also very sensitive to packet loss and node failure during data aggregation. Since approximate aggregation results are often enough to reflect the state of the environment, approximate aggregate query processing has been studied to save energy and achieve robustness against link and node failures [3][4][5][6][7][8]. In research on approximate aggregate query processing in WSNs, sampling is widely used as a powerful and energy-efficient technique to obtain statistical information about the environment. A number of sampling-based schemes have been proposed for approximate query processing in WSNs [8][9][10].
In applications of WSNs such as monitoring air pollution and water quality, users are often interested in understanding how the environment changes over time and in observing the data trend in a time window. In such cases, continuous aggregation of sensed data is usually required. In a continuous aggregation, the query aggregation period is divided into epochs and one aggregate answer is provided at each epoch. The existing aggregation schemes usually compute every aggregation result in a continuous aggregation either by a complete aggregation procedure [1,2,3,4,7] or by partial data update [8] at each epoch. However, users who are interested in the time-evolving characteristics of aggregation results are more concerned about the data trend than about each individual accurate aggregation result. Moreover, the communication cost of the existing schemes could be substantial, especially for a continuous query with a short epoch and a long period. Motivated by these observations, we propose a sampling-based approach with time window based compression for approximate continuous aggregation.
Our approach leverages a batch-based design to compute the aggregation results of a whole period at one time. While giving a series of good approximate aggregation results that provide accurate data trend information, it achieves greater energy savings than the existing approaches by avoiding the individual computation cost of every epoch. Our approach combines data compression and sampling techniques. A small portion of sensor nodes transmit to the base station (BS) a compact description of their sensor readings during a time window. The BS computes approximate aggregation results for every epoch in this time window. In this paper, linear regression modeling is adopted by sensor nodes to compress their sensor data in a time window. We analyze the approximation error of the aggregation results and discuss the determination of parameters in our approach.
The rest of the paper is organized as follows. We present our approach and the approximation error analysis in Section 2. We discuss the determination of parameters in our approach in Section 3. Simulation results are presented in Section 4. Finally, we conclude this paper in Section 5.

System Model and Time Window Based Framework
We assume a multi-hop sensor network with $N$ sensor nodes. The BS knows $N$. All the sensor nodes and the base station are loosely time synchronized. Each node has the same communication radius $R_c$. We assume a continuous querying environment for sensor networks. For a continuous aggregation query, the base station initially disseminates a query into the network, consisting of the epoch duration, the lifetime of the query evaluation, and a sampling ratio.
During the period of a continuous aggregation query, aggregation computation is conducted at time intervals. Each time interval consists of $l$ successive epochs. The BS computes the aggregation results of all epochs in a time interval at one time. Such a time interval is referred to as a time window. In the network, the aggregation computation involves sampling sensor nodes that participate in answering the aggregation query, and collecting a compressed representation of sensor readings within a time window from each sampled node.
After receiving the query from the BS, each sensor node $u$ generates a random number $rn_u$ in the range [0, 1). If $rn_u$ is less than the sampling ratio, node $u$ becomes a sampled node. Let $\{s_1, \ldots, s_m\}$ ($m$ is the sample size) be the set of sampled nodes. At the end of a time window, each sampled node sends the BS a compressed representation of its sensor readings in that window.
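The node-side sampling step can be illustrated with a short simulation. This is only a sketch under stated assumptions: the selection rule (a node joins the sample when its random number is below the sampling ratio) is assumed, and the function name `select_sampled_nodes`, the `seed` parameter, and the network size are hypothetical and introduced here for illustration.

```python
import random


def select_sampled_nodes(node_ids, sampling_ratio, seed=None):
    """Return the node ids whose random draw in [0, 1) is below sampling_ratio.

    In the network each node would draw its own number locally; a single
    generator is used here only to simulate that behaviour in one process.
    """
    rng = random.Random(seed)
    sampled = []
    for u in node_ids:
        rn_u = rng.random()  # uniform random number in [0, 1)
        if rn_u < sampling_ratio:  # assumed selection rule
            sampled.append(u)
    return sampled


# Hypothetical network of N = 1000 nodes with a sampling ratio of 0.1.
nodes = list(range(1000))
sample = select_sampled_nodes(nodes, sampling_ratio=0.1, seed=7)
# The sample size m is random; its expected value is N * sampling_ratio.
print(len(sample), "of", len(nodes), "nodes sampled")
```

Because every node decides independently, no coordination messages are needed, at the price of a sample size that varies around its expected value.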

Modeling Sensor Data with Error Constraint
In our framework, a sample is not a single sensor reading but a compressed representation of the sensor readings, which enables a sensor node to transmit its sensor readings in a time window with less communication cost. It can be built by either lossy or lossless compression methods.
Considering the inherent redundancy of sensor data and the fundamental limit of lossless compression in information theory, we use a data modeling approach, linear regression, to achieve a lossy compression of sensor readings. Linear regression has been widely used to characterize data in sensor networks and answer aggregation queries [11][12][13]. On this basis, lossless compression methods can always be applied for any possible further size reduction. Nevertheless, we note that our framework does not depend on any particular compression method. However, data compression with linear regression modeling introduces errors in the reconstructed data. Therefore, we put error constraints on the modeling process in our approach. If a sampled node finds that the variance of the error incurred by modeling exceeds some threshold $\sigma_T^2$, referred to as the error constraint, it transmits its original data. Otherwise, it transmits the model parameters, including the error variance.

Linear Regression Model
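Regarding the sensor readings as a function of the sequence number from 1 to $l$, a linear regression model [14] for these sensor readings is built in the following form:

$$R = X\beta + \varepsilon \qquad (1)$$

where $R = (r_1, \ldots, r_l)^T$ is the vector of sensor readings in the time window, $X$ is the $l \times (p+1)$ design matrix determined by $l$ and $p$, $\beta = (\beta_0, \ldots, \beta_p)^T$ are regression coefficients, and $\varepsilon$ is a random error vector. Besides, the time window size $l$ is larger than $p+1$. According to Gauss-Markov conditions [14], we also have $E(\varepsilon) = 0$ and $\mathrm{Cov}(\varepsilon) = \sigma^2 I$. By the least square estimate, the estimation of regression coefficients, denoted by $\hat{\beta}$, can be computed by solving the following matrix equation, using, for example, Gaussian elimination:

$$A\hat{\beta} = X^T R \qquad (2)$$

where $A = X^T X$. Once $l$ and $p$ are determined, the matrices $X$ and $A$ do not change with $R$, so they need to be computed only once for an aggregation query.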
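The fitting step on a sampled node can be sketched as follows. The sketch assumes a polynomial design matrix whose row for epoch $k$ is $(1, k, \ldots, k^p)$, which is one natural reading of "a function of the sequence number" but is not confirmed by the text. The function names `design_matrix`, `solve_linear_system` and `fit_window` are hypothetical, and the example readings are made up.

```python
def design_matrix(l, p):
    """Assumed design matrix: row k is (1, k, k^2, ..., k^p) for k = 1..l."""
    return [[float(k) ** j for j in range(p + 1)] for k in range(1, l + 1)]


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular system")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x


def fit_window(readings, p):
    """Least-squares coefficients for one time window of l readings."""
    l = len(readings)
    if l <= p + 1:
        raise ValueError("window size l must be larger than p + 1")
    x = design_matrix(l, p)
    # A = X^T X and the right-hand side X^T R of the normal equations.
    a = [[sum(x[k][i] * x[k][j] for k in range(l)) for j in range(p + 1)]
         for i in range(p + 1)]
    xtr = [sum(x[k][i] * readings[k] for k in range(l)) for i in range(p + 1)]
    return solve_linear_system(a, xtr)


# Made-up window of l = 10 readings that lie exactly on a line.
readings = [20.0 + 0.5 * k for k in range(1, 11)]
beta = fit_window(readings, p=1)
print(beta)  # close to [20.0, 0.5]
```

For simplicity the sketch rebuilds $X$ and $A$ on every call; since both depend only on $l$ and $p$, a node would compute them once per query and reuse them in every window.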

Error Variance and Data Reconstruction
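Besides computing regression coefficients $\hat{\beta}$, each sampled node also needs to estimate the variance of the errors, denoted by $\sigma^2$, to decide whether to transmit original data or regression coefficients. Under Gauss-Markov conditions [14], an unbiased estimator of the error variance $\sigma^2$ can be computed by

$$\hat{\sigma}^2 = \frac{(R - X\hat{\beta})^T (R - X\hat{\beta})}{l - p - 1} \qquad (3)$$

Given an error constraint $\sigma_T^2$, if $\hat{\sigma}^2 \le \sigma_T^2$, the sampled node transmits $\hat{\beta}$ and $\hat{\sigma}^2$ to the base station. Otherwise, it transmits its $l$ original sensor readings. With the regression coefficients $\hat{\beta}$ received from a sampled node, the BS can reconstruct its sensor readings $\hat{R} = (\hat{r}_1, \ldots, \hat{r}_l)^T$ by

$$\hat{R} = X\hat{\beta} \qquad (4)$$

where $X$ can be pre-computed by the BS with $l$ and $p$.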
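The transmit-or-fall-back decision and the reconstruction at the BS can be sketched as below. The sketch assumes a polynomial model in the sequence number, takes the coefficients as given, and uses a hypothetical dictionary layout for a sample; the names `predict`, `error_variance`, `build_sample` and `reconstruct` are introduced only for illustration.

```python
def predict(coeffs, k):
    """Model value at sequence number k, assuming a polynomial in k."""
    return sum(b * float(k) ** j for j, b in enumerate(coeffs))


def error_variance(readings, coeffs):
    """Residual sum of squares divided by l - p - 1 (p + 1 coefficients)."""
    dof = len(readings) - len(coeffs)
    if dof <= 0:
        raise ValueError("window size l must be larger than p + 1")
    rss = sum((r - predict(coeffs, k)) ** 2
              for k, r in enumerate(readings, start=1))
    return rss / dof


def build_sample(readings, coeffs, variance_threshold):
    """Choose what a sampled node sends for one time window."""
    var = error_variance(readings, coeffs)
    if var <= variance_threshold:
        # Model fits well enough: send p + 1 coefficients plus the variance.
        return {"kind": "model", "coeffs": list(coeffs), "variance": var}
    # Error constraint violated: fall back to the l original readings.
    return {"kind": "raw", "readings": list(readings)}


def reconstruct(sample, l):
    """BS side: recover l readings from either kind of sample."""
    if sample["kind"] == "raw":
        return list(sample["readings"])
    return [predict(sample["coeffs"], k) for k in range(1, l + 1)]


# Made-up window and hand-picked coefficients (a real node would use the
# least-squares estimate, for which the variance estimator is unbiased).
window = [1.0, 2.1, 2.9, 4.0, 5.1]
coeffs = [0.0, 1.0]
print(build_sample(window, coeffs, variance_threshold=0.05)["kind"])
print(build_sample(window, coeffs, variance_threshold=0.001)["kind"])
```

With a model sample the node sends $p + 2$ values instead of $l$, which is where the communication saving comes from.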
In the rest of this paper, we regard both the original readings and the regression coefficients as model parameters and do not distinguish them. A sample transmitted by a sampled node M represents the original sensor readings.

Aggregation Estimation
At the end of each time window, the BS waits for the arrival of all samples for some time $t_w$. The waiting time $t_w$ should be larger than the maximum time needed for message delivery from the sampled nodes to the BS.
After reconstructing the sensor readings $\{\hat{r}_{i,k} \mid 1 \le i \le m\}$ of each epoch $k$ by Formula (4), the BS computes the approximate aggregation result of epoch $k$ as $F(\hat{r}_{1,k}, \ldots, \hat{r}_{m,k})$, where $F$ is the estimator function of aggregation results. We now discuss how to estimate the results of Average and Sum aggregation queries respectively.

Average. Average aggregation is estimated by

$$\bar{\hat{R}}_k = \frac{1}{m}\sum_{i=1}^{m} \hat{r}_{i,k}$$

Sum. Sum aggregation is estimated by

$$N\bar{\hat{R}}_k = \frac{N}{m}\sum_{i=1}^{m} \hat{r}_{i,k}$$
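The per-epoch estimation at the BS can be sketched as follows, assuming the usual sampling estimators: the sample mean of the reconstructed readings for Average, and $N$ times that mean for Sum. The function names and the example numbers are hypothetical.

```python
def estimate_average(reconstructed, k):
    """Sample mean of the reconstructed readings of epoch k (1-based).

    `reconstructed` holds one list of l readings per sampled node.
    """
    m = len(reconstructed)
    return sum(node[k - 1] for node in reconstructed) / m


def estimate_sum(reconstructed, k, n_total):
    """Scale the sample mean by the network size N known to the BS."""
    return n_total * estimate_average(reconstructed, k)


def estimate_window(reconstructed, n_total):
    """Average and Sum estimates for every epoch of one time window."""
    l = len(reconstructed[0])
    return [(estimate_average(reconstructed, k),
             estimate_sum(reconstructed, k, n_total))
            for k in range(1, l + 1)]


# Made-up example: m = 3 sampled nodes, l = 2 epochs, N = 30 nodes.
recon = [[20.0, 21.0], [22.0, 23.0], [24.0, 25.0]]
for k, (avg, total) in enumerate(estimate_window(recon, n_total=30), start=1):
    print("epoch", k, "average", avg, "sum", total)
```

All $l$ results of a window are produced in one pass over the received samples, which reflects the batch-based design of the approach.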

Approximation Error Analysis
Each reconstructed reading $\hat{r}_{i,k}$ can be rewritten as $\hat{r}_{i,k} = r_{i,k} - \hat{\varepsilon}_{i,k}$, where $r_{i,k}$ is the original data of epoch $k$ and $\hat{\varepsilon}_{i,k}$ is the residual in the linear regression model (1) of node $s_i$.
Then, the approximation error of the Average estimate consists of a sampling estimation error and a modeling estimation error [...]. By Formula (8), we have [...]. According to the linear regression theory, under Gauss-Markov conditions, the residual $\hat{\varepsilon}_{i,k}$ follows a normal distribution $N(0, \sigma_i^2(1 - p_{kk}))$, where $\sigma_i^2$ is the error variance in the linear model at node $s_i$ and $p_{kk}$ is the $k$-th element on the principal diagonal of the matrix $X(X^T X)^{-1}X^T$. Since $\bar{R}_k$ is the mean of original data samples, according to the general results in the sampling theory [15], we have the following results [...], where [...] is the (unbiased) sample variance and is an unbiased estimator of the population MSE (Mean Square Error). By the above discussions, we have the following results.

Lemma 1. Under Gauss-Markov conditions, [...]. Since the samples $r_{i,k}$ and $r_{j,k}$ are assumed to be independent random variables in the sampling theory, $r_{i,k}$ and $\hat{\varepsilon}_{j,k}$ are independent and we have [...]. If $i = j$, according to the linear regression theory, we [...].

Proof. It can be easily shown that [...] the probability that they are both reconstructed, due to the corresponding nodes ($i$, $j$) being sampled, is $m(m-1)/(N(N-1))$. Then, with Lemma 1, we have [...] cannot be obtained since sampling all nodes is prohibitive in our approach. Thus, we use an upper bound of [...].

Proof. Define the events $A$, $B$ and $C$ respectively as [...]. When inequalities $B$ and $C$ are satisfied, $A$ must hold. Because sampling and modeling errors are independent random variables, $B$ and $C$ are independent events. Then, we have [...], we can easily derive the following results from the above analysis of average: [...]. Here Formulas (13) and (14) give the approximation error of Average (Sum) aggregation with the probability guarantee [...].

Parameter Determination
From Formulas (13) and (14) we can see that, for a given probability guarantee, the approximation error depends on the error constraint $\sigma_T^2$ and the sample size $m$. In this section we discuss how to select their values so as to meet the error bound desired by the user.


As shown in Formula (3), $\hat{\sigma}_i^2$ indicates the average error for the data reconstructed in a time window. Thus, $\sigma_T^2$ specifies the maximum degree of the average error that the user can tolerate for the reconstructed data. A larger $\sigma_T^2$ would allow larger errors in the reconstructed data and may enlarge the approximation error. On the other hand, a larger $\sigma_T^2$ gives the sampled nodes more chances to transmit their model parameters instead of their original data and further reduces the communication cost. Thus, a trade-off exists between communication cost and approximation error.
Here we provide one possible solution to determine $\sigma_T^2$.
During the first time window of aggregation, all sampled nodes transmit their original data to the BS. The BS fits the specified model to these data and computes the modeling errors for all sampled nodes. A histogram is computed to count the number of error values falling into each bin, which reflects the quality of data modeling for the sensor network. According to this frequency distribution, the user can select a value of $\sigma_T^2$ as large as possible while ensuring an acceptable approximation error. Finally, the BS broadcasts $\sigma_T^2$ to the sensor network and each sensor node works on the new error variance constraint. This procedure could be conducted reactively when a substantial number of sampled nodes start to continuously transmit their original data, which indicates a change in the nature of the data in the sensor network.
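The histogram step at the BS can be sketched as below. The code only prepares the information a user would inspect; it does not choose $\sigma_T^2$ itself. The function names, the bin width, the candidate thresholds and the variance values are all hypothetical.

```python
def variance_histogram(variances, bin_width):
    """Count the error variances falling into each bin [b*w, (b+1)*w)."""
    counts = {}
    for v in variances:
        b = int(v // bin_width)  # index of the bin that contains v
        counts[b] = counts.get(b, 0) + 1
    return dict(sorted(counts.items()))


def fraction_exceeding(variances, threshold):
    """Share of sampled nodes that would fall back to original data."""
    return sum(1 for v in variances if v > threshold) / len(variances)


# Made-up modeling error variances computed by the BS for the sampled
# nodes after the first time window.
variances = [0.01, 0.03, 0.04, 0.08, 0.12, 0.31]
print(variance_histogram(variances, bin_width=0.05))
for candidate in (0.05, 0.1, 0.2):
    print(candidate, fraction_exceeding(variances, candidate))
```

A larger candidate threshold lowers the share of nodes sending original data, which is the communication side of the trade-off described above; the approximation error side still has to be checked against Formulas (13) and (14).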
In our experiments on a real data set, we show that linear regression characterizes the sensor data well and incurs few original data transmissions even with a small error variance constraint.

Sampling Ratio 
From Formulas (13) and (14), a larger sample size $m$ enables a smaller approximation error.
It is easily shown that the bound on the approximation error can be relaxed without changing the inequality relationship with the probability guarantee in Formulas (13) and (14) [...]. We consider the least sample size to satisfy [...] for each epoch $k$ in the time window. We can obtain an estimation of the required sample size, denoted by $m_r$, for all epochs in the time window by inserting Formula (15) into Formula (16). The sampling ratio is set to be not less than $m_r/N$.

Time Window Size l
When all sampled nodes transmit their original data, the approximation error includes only the sampling estimation error and no modeling estimation error. Thus, the aggregation computation with original data needs a smaller sample size than that with the data compressed by modeling to achieve the same approximation error. Let $m_o$ be the sample size needed to obtain the aggregation with the desired approximation error by collecting the original data, then [...]. On the other hand, we have [...]. As above, we also have [...]. According to the above discussion on sampling ratio, we have [...]. Without data modeling compression, the aggregation requires [...].

Simulation Evaluation
To measure the performance of our aggregation approach, we run simulations on a real data set. To show the performance of the linear regression model for describing sensor data, we investigate the distribution of the error variance and its impact on data transmission for all sensor nodes.
Figure 1 shows temperature readings (in degrees Celsius) of 52 sensor nodes in 2000 successive epochs, which are used for our simulation. Figure 2 shows the error variance of the linear regression model in every time window for all sensor nodes. For all time windows, all the sensor nodes have error variances less than 0.2.

Conclusions
In this paper we propose a sampling-based approach with time window based linear regression for approximate continuous aggregation. The approximation error of the aggregation results is analyzed. The determination of parameters in our approach is also discussed. Simulation results on a real data set verify the effectiveness of our approach.

Here $z$ denotes the corresponding upper-tail point on the standard normal distribution.
Without data modeling compression, the aggregation requires $l\,m_o$ original data transmissions for a time window to achieve the desired approximation error. With the data modeling compression, our scheme requires $(p+2)m$ data transmissions to achieve the same approximation error. To achieve energy savings, we should have $(p+2)m < l\,m_o$.
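The energy-saving condition can be checked numerically with a small sketch. It assumes that collecting original data costs $l$ readings from each of $m_o$ sampled nodes, and that every sampled node in the compressed scheme sends $p + 2$ values ($p + 1$ coefficients plus the error variance). Fall-backs to original data and multi-hop forwarding costs are ignored. The function names and numbers are hypothetical.

```python
def model_transmissions(m, p):
    """Values sent per window with compression: p + 2 per sampled node."""
    return (p + 2) * m


def raw_transmissions(m_o, l):
    """Values sent per window without compression: l per sampled node."""
    return l * m_o


def min_window_size(m, m_o, p):
    """Smallest integer l with (p + 2) * m < l * m_o and l > p + 1."""
    smallest = (p + 2) * m // m_o + 1
    return max(smallest, p + 2)  # regression also needs l > p + 1


# Made-up sample sizes: compression needs m = 30 nodes, raw data m_o = 20.
m, m_o, p = 30, 20, 1
l_min = min_window_size(m, m_o, p)
print("minimum window size:", l_min)
print(model_transmissions(m, p), "<", raw_transmissions(m_o, l_min))
```

Since the compressed cost does not grow with $l$ while the raw cost does, any window longer than this minimum increases the saving, as long as the linear model still satisfies the error constraint over the longer window.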

Figure 3 shows, under different choices of the error constraint $\sigma_T^2$, the number of sensor nodes that have a larger modeling error variance than $\sigma_T^2$ in each time window. We can conclude that most sensor nodes in most time windows satisfy the variance constraint. When $\sigma_T^2 = 0.07$, less than 10% of sensor nodes exceed $\sigma_T^2$; when $\sigma_T^2 = 0.1$, the number decreases to 2%. Our experiment indicates that only a small portion of sampled nodes will transmit their original data.

Figure 1. Temperature readings (in degrees Celsius) of 52 sensor nodes in 2000 successive epochs (excluding two nodes with incomplete data and one node with abnormal data).

Figure 2. The error variances of linear regression model in all sensor nodes for each time window.

Figure 3. The number of sensor nodes with error variance larger than $\sigma_T^2$.

Figure 4. The difference between two average aggregation results respectively estimated by the approaches with and without data compression in every epoch.