A Short-Term Traffic Flow Forecasting Method Based on a Three-Layer K-Nearest Neighbor Non-Parametric Regression Algorithm

Short-term traffic flow forecasting is one of the core technologies for realizing traffic flow guidance. In this article, in view of the repetitive nature of traffic flow variations, a short-term traffic flow forecasting method based on a three-layer K-nearest neighbor non-parametric regression algorithm is proposed. Specifically, two screening layers based on shape similarity are introduced into the K-nearest neighbor non-parametric regression method, and the forecasting results are output using weighted averaging on the reciprocal values of the shape similarity distances together with the most-similar-point distance adjustment method. According to the experimental results, the proposed algorithm improves the predictive ability of the traditional K-nearest neighbor non-parametric regression method and greatly enhances the accuracy and real-time performance of short-term traffic flow forecasting.


Introduction
With economic and social development, cities have been expanding constantly and urban transport problems are becoming increasingly serious. Intelligent transportation systems are generally recognized as an important means of relieving traffic jams. As the related techniques in each field of intelligent transportation advance, both travelers and administrators urgently want to acquire dynamic traffic flow conditions in real time, and real-time dynamic traffic assignment has become a key technology in intelligent transportation systems. To achieve a favorable traffic assignment, when deciding the control variables at the moment $t$ we should predict the traffic flow information at the next decision moment ($t + 1$) and even at several future moments. In general, short-term traffic flow forecasting refers to the case in which the time span between $t$ and $t + 1$ does not exceed 15 minutes (or is even shorter than 5 minutes).
Currently, short-term traffic flow forecasting models are mainly constructed based on parametric regression methods [1] [2], such as the history average model, time series model, Kalman filtering model, wavelet theory, and neural network model. Non-parametric regression is another kind of forecasting model. Unlike parametric regression, non-parametric regression sets no strict limits on the data and describes the system based on sufficient historical data. Moreover, with non-parametric regression the relationship between input and output is determined only by the existing data, and time-consuming adjustments are not required when new data are generated. Internationally, research on non-parametric regression for short-term traffic flow prediction is well advanced. In 1987, Yakowitz first proposed using the K-nearest neighbor method for time series prediction. In 1991, Davis and Nihan applied non-parametric regression to traffic prediction. They pointed out that the K-nearest neighbor method is suitable for traffic prediction because traffic data itself exhibits nonlinear characteristics.
The K-nearest neighbor non-parametric regression method has proved to be a reliable method for short-term traffic flow forecasting [3] [4], and it can favorably reflect the non-linearity, time dependence, and uncertainty of traffic flow. Non-parametric regression requires no a priori knowledge, only sufficient historical data: it searches the historical data for the nearest neighbors similar to the current point and uses these "neighbors" to predict the traffic flow at the next moment. The non-parametric regression algorithm considers that the intrinsic links between all factors of the system are contained in the historical data. Therefore, the method obtains information directly from the historical data instead of using the historical data to establish an approximate model.
However, a road traffic system is a nonlinear system characterized by time dependence and complexity, and it exhibits a distinctive feature, namely high uncertainty. As a result, a forecasting model based on single-layer K-nearest neighbor non-parametric regression shows low stability when predicting complex traffic flows. On the other hand, some combined short-term traffic flow forecasting methods involve complex algorithms and heavy computational burdens; in addition, the requirements on forecasting accuracy and real-time performance often cannot be satisfied simultaneously.
Therefore, this article proposes a short-term traffic flow forecasting method based on a three-layer K-nearest neighbor model. In view of the fact that traffic flow variations are repeatable, two layers with a shape-similarity screening function were introduced into the K-nearest neighbor non-parametric regression method, in which the shape similarity between the current point and the data in the historical database was measured by the similarity deviation and the correlation coefficient, respectively. The hit rates in the screening results of the two shape-similarity measurement methods were calculated and their respective similarities were ranked. Furthermore, the forecasting results were output based on weighted averaging on the reciprocal values of the shape similarity distances between the traffic flows at each nearest neighbor and the next moment of the current point.

The Improved Short-Term Traffic Flow Forecasting Method
The short-term traffic flow forecasting method based on three-layer K-nearest neighbor non-parametric regression includes the following steps, as shown in Figure 1: 1) collect statistics of the traffic flows within a fixed time interval and construct the historical sample database; 2) evaluate the shape similarities between the current point and the points in the historical database using the similarity deviation and the correlation coefficient; 3) give a comprehensive evaluation of the points screened through the first layer based on the calculated hit rates and shape similarity distances, and conduct the screening in the second layer; 4) assess the matching distances between the current point and the points screened through the second layer according to the calculated Euclidean distance, and output the forecasting results based on the weighted average of the reciprocal values of the shape similarity distances between the traffic flows at each nearest neighbor and the next moment of the current point.

The First-Layer Screening
Euclidean distance can only reflect the closeness between the current point and a point in the historical database [5]; it cannot directly reflect their similarity. Shape similarity can directly reflect the variation and development rules of the traffic flow. Similar variations in traffic flow parameters reflect a similar physical evolution of the traffic flow, and similar variation rules can produce similar results.
In this article, the points in the historical database were first matched and screened using the shape-similarity-based K-nearest neighbor non-parametric regression method. The algorithm is described in detail below.
The traffic flow time series was used as the traffic flow state vector V, which can be written as:

$$V(t) = [v(t-l+1),\; v(t-l+2),\; \ldots,\; v(t)]$$

in which $V(t)$ denotes the traffic flow state vector of the current road at the moment $t$; $v(t-l+1), v(t-l+2), \ldots, v(t)$ denote the traffic flows of the current road at the moments $t-l+1, t-l+2, \ldots, t$, respectively; and $l$ denotes the dimension of the state vector.
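The construction of state vectors can be illustrated with a minimal sketch. It assumes the flow counts are stored in a 0-indexed Python list; the helper names `state_vector` and `build_history` and the sample values are hypothetical and do not come from the original study.

```python
from typing import List, Tuple


def state_vector(series: List[float], t: int, l: int) -> List[float]:
    """Return V(t) = [v(t-l+1), ..., v(t)] for a 0-indexed flow series."""
    if l < 1 or t - l + 1 < 0 or t >= len(series):
        raise ValueError("t and l must select l samples inside the series")
    return series[t - l + 1 : t + 1]


def build_history(series: List[float], l: int) -> List[Tuple[List[float], float]]:
    """Pair every historical state vector V(t) with the following flow v(t+1)."""
    return [
        (state_vector(series, t, l), series[t + 1])
        for t in range(l - 1, len(series) - 1)
    ]


# Illustrative flow counts per interval (made-up values).
flows = [120.0, 135.0, 150.0, 160.0, 155.0, 140.0]
history = build_history(flows, l=3)
```

Storing each historical vector together with the flow that followed it means a later neighbor search can read the candidate prediction directly from the matched record.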
In the present study, the shape similarity between the current point and each point in the historical database was first evaluated using the similarity deviation R. The points in the historical database were screened based on the calculated R values, and the set of retained points was denoted as A. R is calculated from the component differences $d_i$ and their overall mean $E$, in which R denotes the similarity deviation between the current point and a point in the historical database, $l$ denotes the dimension of the traffic flow state vector, $E$ denotes the overall mean difference of all the components between the current point and the historical point, and $d_i$ denotes the difference of the $(l + 1 - i)$th vector component between the current point and the historical point, with $i$ ranging from 1 to $l$. The calculated R values between the current point and the points in the historical database were then sorted from smallest to largest, and the n points with the smallest values were selected. The set of points after the similarity-deviation-based screening was denoted as A.

The shape similarity between the current point and a point in the historical database was also assessed by the correlation coefficient R′, which can be calculated by:

$$R' = \frac{\sum_{i=1}^{l} \bigl(v(t-i+1) - \bar{v}(t)\bigr)\bigl(v_h(t-i+1) - \bar{v}_h(t)\bigr)}{\sqrt{\sum_{i=1}^{l} \bigl(v(t-i+1) - \bar{v}(t)\bigr)^2 \, \sum_{i=1}^{l} \bigl(v_h(t-i+1) - \bar{v}_h(t)\bigr)^2}}$$

in which R′ denotes the correlation coefficient between the current point and a point in the historical database, $l$ denotes the dimension of the traffic flow state vector, and $v_h(t-i+1)$ denotes a component of the historical traffic flow state vector $V_h(t)$. $\bar{v}(t)$ represents the average value of all components of the current traffic flow state vector $V(t)$ and is obtained by Equation (5):

$$\bar{v}(t) = \frac{1}{l} \sum_{i=1}^{l} v(t-i+1) \tag{5}$$

$\bar{v}_h(t)$ represents the average value of all components of the historical traffic flow state vector $V_h(t)$ and is obtained by Equation (6):

$$\bar{v}_h(t) = \frac{1}{l} \sum_{i=1}^{l} v_h(t-i+1) \tag{6}$$

Then the calculated R′ values between the current point and the points in the historical database were sorted from largest to smallest, and the n points with the largest values were selected. The set of points after the correlation-coefficient-based screening was denoted as A′. The first layer of matching and screening is completed once the sets A and A′ have been obtained.
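The first-layer screening can be sketched as follows. The correlation coefficient is the usual Pearson form. The exact expression for the similarity deviation is an assumption here: it is taken as the mean absolute deviation of the component differences from their mean, $R = \frac{1}{l}\sum_{i}|d_i - E|$, which may differ from the form used in the study. All function names and sample vectors are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def similarity_deviation(cur: Sequence[float], hist: Sequence[float]) -> float:
    """Shape distance between two state vectors (smaller means more similar).

    ASSUMPTION: R = (1/l) * sum(|d_i - E|), where d_i are the component
    differences and E is their mean. The source's exact formula may differ.
    """
    d = [c - h for c, h in zip(cur, hist)]
    e = sum(d) / len(d)
    return sum(abs(x - e) for x in d) / len(d)


def correlation(cur: Sequence[float], hist: Sequence[float]) -> float:
    """Pearson correlation coefficient; 0.0 if either vector is constant."""
    mean_c = sum(cur) / len(cur)
    mean_h = sum(hist) / len(hist)
    num = sum((c - mean_c) * (h - mean_h) for c, h in zip(cur, hist))
    den = math.sqrt(
        sum((c - mean_c) ** 2 for c in cur) * sum((h - mean_h) ** 2 for h in hist)
    )
    return num / den if den > 0 else 0.0


def first_layer(
    cur: Sequence[float], candidates: Sequence[Sequence[float]], n: int
) -> Tuple[List[int], List[int]]:
    """Return (A, A_prime): candidate indices ranked best-first by each measure."""
    idx = range(len(candidates))
    a = sorted(idx, key=lambda i: similarity_deviation(cur, candidates[i]))[:n]
    a_prime = sorted(
        idx, key=lambda i: correlation(cur, candidates[i]), reverse=True
    )[:n]
    return a, a_prime


# Illustrative vectors (made-up values).
current = [100.0, 120.0, 110.0]
database = [
    [200.0, 220.0, 210.0],  # same shape, different level
    [100.0, 100.0, 100.0],  # flat
    [110.0, 100.0, 120.0],  # opposite trend
    [90.0, 115.0, 100.0],   # similar shape
]
set_a, set_a_prime = first_layer(current, database, n=2)
```

Under this assumed definition, a historical vector that differs from the current one only by a constant offset has a similarity deviation of zero, which matches the stated aim of measuring shape rather than closeness.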

The Second-Layer Screening
Subsequently, the points in A and A′ were evaluated comprehensively based on the calculated hit rates and shape similarities, and then the set of points after the second layer of matching and screening was acquired and denoted as B.
As shown in Figure 2, the comprehensive evaluation of the sets A and A′ based on the calculated hit rates and shape similarities proceeds step by step: the first steps traverse the points with an index i starting from i = 1, and the later steps traverse the remaining points with an index j starting from j = 1, testing set membership at each step. If a point appears in both set A and set A′, it can be considered to have a high shape similarity to the current point; if a point appears only in set A or only in set A′, it can be considered to have a high shape similarity only to a certain degree. Accordingly, the coincident points in set A and set A′ were first identified to form the set C; then, equal numbers of points not included in set C were selected successively from set A and set A′ to form the set D. In total, n points were included in set C and set D, which were thus the results of the second layer of matching and screening.
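The second-layer combination can be sketched as below. The sketch assumes A and A′ are ranked lists of candidate indices of equal length with no duplicates, and that the points for D are drawn alternately from the two lists, starting with A; the alternation order is an assumption, and the function name `second_layer` is hypothetical.

```python
from typing import List


def second_layer(a: List[int], a_prime: List[int], n: int) -> List[int]:
    """Combine two ranked index lists into the set B of at most n indices."""
    in_a_prime = set(a_prime)
    # C: points hit by both similarity measures, kept in A's ranking order.
    c = [p for p in a if p in in_a_prime]
    in_c = set(c)
    rest_a = [p for p in a if p not in in_c]
    rest_a_prime = [p for p in a_prime if p not in in_c]

    # D: alternate between the two remainders so each contributes
    # about the same number of points (ASSUMPTION: A is consulted first).
    d: List[int] = []
    for from_a, from_a_prime in zip(rest_a, rest_a_prime):
        for p in (from_a, from_a_prime):
            if len(c) + len(d) < n:
                d.append(p)
    return c + d  # B is the union of C and D


# Illustrative rankings (made-up indices).
set_b = second_layer([4, 7, 1, 9], [7, 2, 4, 5], n=4)
```

Points appearing in both rankings are kept unconditionally, so a higher hit rate between the two measures leaves fewer slots to be filled by points supported by only one measure.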

The Third-Layer Screening
The third-layer matching of the points was conducted using the improved K-nearest neighbor non-parametric regression method, so that the traffic flow at the next moment could be predicted.
The similarity between the current point and any point in set B was evaluated by calculating the Euclidean distance between them, which can be calculated by [6] [7]:

$$d = \sqrt{\sum_{i=1}^{l} \bigl(v(t-i+1) - v_h(t-i+1)\bigr)^2}$$

in which $d$ denotes the matching distance between the current point and any point in set B, $v(t-i+1)$ and $v_h(t-i+1)$ denote the corresponding components of the current and historical traffic flow state vectors, and $l$ denotes the dimension of the traffic flow state vector.
Then the points in set B were sorted according to the matching distances with the current point in the order of smallest to largest, and k points with nearest matching distances were selected.
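The third-layer selection can be sketched as follows, assuming set B is given as a list of indices into the historical database. The names `euclidean` and `third_layer` and the sample vectors are hypothetical.

```python
import math
from typing import List, Sequence


def euclidean(cur: Sequence[float], hist: Sequence[float]) -> float:
    """Matching distance d between two state vectors of equal length."""
    return math.sqrt(sum((c - h) ** 2 for c, h in zip(cur, hist)))


def third_layer(
    cur: Sequence[float],
    candidates: Sequence[Sequence[float]],
    b: List[int],
    k: int,
) -> List[int]:
    """Return the k indices from B with the smallest matching distance."""
    return sorted(b, key=lambda i: euclidean(cur, candidates[i]))[:k]


# Illustrative vectors (made-up values).
current = [100.0, 120.0, 110.0]
database = [
    [200.0, 220.0, 210.0],
    [100.0, 100.0, 100.0],
    [110.0, 100.0, 120.0],
    [90.0, 115.0, 100.0],
]
nearest = third_layer(current, database, b=[0, 1, 3], k=2)
```

Here the candidate at index 0 has the same shape as the current vector but lies far away in level, so it passes the shape-based layers and is dropped only by the Euclidean ranking.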
The forecasting function was then constructed using weighted averaging on the reciprocals of the shape similarity distances together with the most-similar-point distance adjustment method. In the forecasting function, $\hat{v}(t+1)$ denotes the traffic flow at the next moment predicted by the three-layer K-nearest neighbor non-parametric regression method, $k$ denotes the number of points in set B selected as nearest to the current point, R′ denotes the correlation coefficient between the current point and a nearest neighbor, R denotes the similarity deviation between the current point and a nearest neighbor, $b_j$ denotes the overall average difference of all the components between the current point and the $j$th nearest neighbor, and $l$ denotes the dimension of the traffic flow state vector.
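One possible reading of this forecasting step is sketched below. It is not the formula of the study: the sketch assumes that each neighbor's next-moment flow is shifted by $b_j$ and weighted by the reciprocal of an assumed mean-absolute-deviation shape distance, and it omits the correlation coefficient R′. The function name `forecast`, the parameter `eps`, and the sample values are hypothetical.

```python
from typing import List, Sequence, Tuple


def forecast(
    cur: Sequence[float],
    neighbors: List[Tuple[Sequence[float], float]],
    eps: float = 1e-9,
) -> float:
    """Predict v(t+1) from (state_vector, next_flow) pairs of the k neighbors.

    ASSUMPTIONS: weight = 1 / (shape distance + eps), where the shape distance
    is the mean absolute deviation of the component differences; each neighbor's
    next flow is shifted by b_j, the mean component difference.
    """
    num = 0.0
    den = 0.0
    for vec, next_flow in neighbors:
        diffs = [c - h for c, h in zip(cur, vec)]
        b = sum(diffs) / len(diffs)                       # level offset b_j
        r = sum(abs(x - b) for x in diffs) / len(diffs)   # shape distance
        w = 1.0 / (r + eps)                               # eps avoids division by zero
        num += w * (next_flow + b)
        den += w
    return num / den


# Illustrative neighbors (made-up values).
current = [100.0, 120.0, 110.0]
prediction = forecast(
    current,
    [([90.0, 115.0, 100.0], 95.0), ([100.0, 100.0, 100.0], 105.0)],
)
```

Shifting each neighbor's outcome by its level offset lets a neighbor with the right shape but a different traffic volume still contribute a usable estimate, while the reciprocal weighting gives the best-shaped neighbors the most influence.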

Conclusions
Short-term traffic flow forecasting is an important part of an intelligent traffic forecasting system. The short-term traffic flow forecasting results can be input directly to the advanced traffic information system and the traffic management system. The forecasting results can provide travelers with real-time and effective information and help them select better routes and acquire route guidance, so as to shorten travel time and relieve traffic jams [8]-[10].
In order to improve the forecasting accuracy of short-term traffic flow, this article modified the traditional K-nearest neighbor non-parametric regression method and proposed a short-term traffic flow forecasting method based on three layers of screening. The experimental results indicate that the proposed algorithm can further enhance the accuracy of traffic flow forecasting.
With the development of computer technology, data sizes are increasing significantly, and enhancing the algorithm's accuracy more effectively is becoming particularly important. The algorithm should be improved continually so that it meets the real-time and accuracy requirements of traffic, and so that short-term traffic flow forecasting can finally be widely applied in traffic guidance.

Figure 1. Flow chart of the improved method.

Figure 2. Flow chart of the second screening.
The union of set C and set D was acquired and denoted as B, i.e., $B = C \cup D$. B is the set of points after the second layer of matching and screening.