Traffic Congestion and Duration Prediction Model Based on Regression Analysis and Survival Analysis

As traffic congestion becomes more and more serious, how to accurately predict
the duration of traffic congestion has attracted wide attention. In this
article, we build two models to better predict traffic congestion time. First,
we collect the data we need and screen and integrate them through preliminary
cleaning, processing, deletion of missing data, and combined calculation of the
data according to indicators. Then, the multiple linear regression method is
used to construct the traffic congestion prediction model from the existing
data, and the actual situation of traffic congestion is obtained. Next, the
non-parametric Kaplan-Meier method of survival analysis is used to obtain the
survival function of traffic congestion duration, and the traffic congestion
duration model is constructed. The models are solved with software such as
MATLAB, Stata and SPSS, and the congestion prediction is obtained. The fitting
degree between the predicted and actual values of the model is above 0.96,
which supports the conclusion that the road traffic congestion degree model and
the congestion duration model can identify the characteristics of congestion
distribution and duration. Finally, the paper objectively evaluates the
advantages and disadvantages of the models and considers how they can be
promoted and applied. We hope that this model can contribute to research on the
prediction of traffic congestion time.

With rising living standards, more and more families can afford a small car. The number of cars in the city is increasing day by day, but urban roads have not been given corresponding supporting measures, so they are overwhelmed and congested. Traffic accidents also happen every day, threatening urban traffic safety. To reduce the impact of these problems on urban traffic, navigation software is particularly important.
GPS is a radio navigation and positioning system based on 24 global positioning satellites that provides three-dimensional position, three-dimensional velocity and other information around the world. Its positioning principle is as follows: the user receives the signals transmitted by the satellites, obtains the distance between each satellite and the user together with the clock correction and the atmospheric correction, and determines the user's position through data processing. The positioning accuracy of civilian GPS can now reach 10 m or better. The special capabilities of GPS have long attracted the attention of the automotive industry. When the United States announced the opening of part of the GPS system after the Gulf War, the automobile industry immediately seized the opportunity, invested in the development of car navigation systems, and quickly put them into use.
In navigation software, travel time estimation is an important function for drivers. Existing navigation software often obtains real-time GPS data from taxis or other vehicles on which the software is installed in order to determine current road conditions. However, under severe traffic congestion vehicles move slowly, and the GPS estimate of vehicle speed becomes very inaccurate. As a result, the navigation software predicts the vehicle's travel time inaccurately, which affects the customer's use of the software and even the operation of traffic. This paper models and discusses this issue and establishes a more accurate model to predict traffic congestion time.

Reasons for the Research
First of all, when the navigation system predicts driving time inaccurately, the owner gets a wrong impression of the time required and the owner's schedule is thrown off. This is not conducive to personal life, results in a poor user experience, and causes a loss of trust in navigation software. That in turn reduces the use of navigation software and affects its development.
Secondly, when the navigation system predicts the travel time of a route inaccurately, drivers lack knowledge of road conditions along the way, and congestion on already congested road sections becomes worse. This greatly reduces the benefit that navigation systems bring to traffic, is not conducive to improving urban traffic, and affects social order.

Summary of Research
Literature [1]: Wang Yuying et al. studied the factors affecting traffic congestion and its evaluation model, and constructed the Beijing congestion indicator system from five aspects: the ratio of average travel time to free-flow travel time, the ratio of the travel time that ensures 95% on-time arrival to free-flow time, delay, congestion time, and the number of congested road sections.
Literature [2]: To address the unbalanced temporal demand for public transportation, Chen Jian et al. used a bi-level programming model to construct a time-differentiated bus pricing model. The study pointed out that raising fares during peak hours and lowering them during off-peak hours can help ease traffic pressure. All in all, the existing literature has its deficiencies and needs to be improved.

Model Assumption
First, assume that road surface conditions are uniform and the road surface is in good condition.
Second, assume that the roads in the city are in good condition, with no traffic control and no road occupation.
Third, assume that the influence of surrounding pedestrian and bicycle traffic on road traffic can be neglected.
Fourth, assume that drivers strictly abide by the traffic rules while driving, with no violations such as running red lights.

1) Average driving speed
The average driving speed is the mean driving speed of all motor vehicles over the same time and the same distance, in km/h. The calculation formula is as follows:
$$v = \frac{1}{n}\sum_{i=1}^{n} v_i$$
where $n$ is the number of vehicles passing every hour and $v_i$ is the speed of the $i$-th car.

2) Average driving time
The average driving time is the time needed to cover the driving distance at the average driving speed:
$$t = \frac{L}{v}$$
where $t$ is the average driving time, h, and $L$ is the driving distance, km.

3) Average loss time
The average loss time is the time lost by a vehicle due to external factors (such as bad weather or traffic accidents) within a given unit of travel. This indicator reflects how smoothly traffic flows and how much queuing occurs. The calculation formula is as follows:
$$T_l = \frac{L}{v_a} - \frac{L}{v_f}$$
where $T_l$ is the average loss time, h; $v_a$ is the actual speed of the vehicle, km/h; and $v_f$ is the speed of the motor vehicle in the non-congested state, km/h.
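The three indicators above can be illustrated with a short Python sketch. It assumes that the average speed is the arithmetic mean of the individual vehicle speeds, that driving time is distance divided by speed, and that loss time is the difference between the actual and the free-flow driving time. The function names and sample numbers are hypothetical and are not taken from the paper's data.

```python
from statistics import mean


def average_speed(speeds_kmh):
    """Arithmetic mean of the individual vehicle speeds, in km/h."""
    return mean(speeds_kmh)


def average_travel_time(distance_km, speed_kmh):
    """Hours needed to cover distance_km at speed_kmh."""
    return distance_km / speed_kmh


def average_loss_time(distance_km, actual_kmh, free_flow_kmh):
    """Extra hours spent compared with driving at the free-flow speed."""
    return distance_km / actual_kmh - distance_km / free_flow_kmh


# Hypothetical observations for a 6 km road section
speeds = [20.0, 30.0, 40.0]             # vehicle speeds, km/h
v = average_speed(speeds)               # average driving speed
t = average_travel_time(6.0, v)         # average driving time
loss = average_loss_time(6.0, v, 60.0)  # free-flow speed assumed to be 60 km/h

print(v, t, loss)
```

With these made-up values the section is driven at 30 km/h on average, so the trip takes twice as long as it would at the assumed free-flow speed.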

4) Traffic density
Traffic density, also known as traffic flow density, refers to the number of vehicles in a lane per unit length of road at a given moment. This indicator reflects how densely vehicles occupy a road. Its calculation formula is as follows:
$$TD = \frac{N}{L}$$
where $TD$ is the traffic density, vehicles/km; $N$ is the number of vehicles in the lane at a given moment; and $L$ is the driving distance, km.

5) Traffic volume
Traffic volume refers to the number of vehicles passing through a certain road per unit of time. Its calculation formula is as follows:
$$TV = \frac{N_p}{T_{obs}}$$
where $TV$ is the traffic volume, vehicles/h; $N_p$ is the number of vehicles passing the road; and $T_{obs}$ is the observation time, h.
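As a small sketch of the two count-based indicators, the following Python functions compute traffic density and traffic volume from raw counts. The function names, the input validation, and the sample numbers are illustrative assumptions rather than part of the paper.

```python
def traffic_density(vehicles_in_lane, length_km):
    """Vehicles per kilometre of lane at one moment (vehicles/km)."""
    if length_km <= 0:
        raise ValueError("length_km must be positive")
    return vehicles_in_lane / length_km


def traffic_volume(vehicles_passed, observation_hours):
    """Vehicles passing the road per hour of observation (vehicles/h)."""
    if observation_hours <= 0:
        raise ValueError("observation_hours must be positive")
    return vehicles_passed / observation_hours


# Hypothetical counts: 45 vehicles on a 1.5 km lane, 300 vehicles in 15 minutes
td = traffic_density(45, 1.5)
tv = traffic_volume(300, 0.25)

print(td, tv)
```

Density is a snapshot in space, while volume is a count over time, which is why the two functions take different denominators.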

6) Morning and evening peak
Because of commuting, the most congested times of day are concentrated around the journey to work and the journey home. This paper assumes that the morning peak hours are from 7:30 to 9:30 and the evening peak hours are from 16:30 to 18:30.

7) The number of weeks
In this paper, the number of weeks, that is, the day of the week, is divided into two types: weekdays and weekends. Because travel purposes differ, traffic congestion shows different distribution characteristics on weekdays and weekends.
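The time attributes above can be turned into indicator (dummy) variables as in the following sketch. The peak windows are the ones stated in the text. Treating each window as half-open, applying the peak flags on weekends as well, and the name `time_dummies` are assumptions made only for this illustration.

```python
from datetime import datetime, time

MORNING_PEAK = (time(7, 30), time(9, 30))
EVENING_PEAK = (time(16, 30), time(18, 30))


def time_dummies(ts):
    """Return (morning_peak, evening_peak, weekend) as 0/1 flags for a timestamp."""
    clock = ts.time()
    morning = int(MORNING_PEAK[0] <= clock < MORNING_PEAK[1])
    evening = int(EVENING_PEAK[0] <= clock < EVENING_PEAK[1])
    # datetime.weekday() returns 0 for Monday, so 5 and 6 are Saturday and Sunday
    weekend = int(ts.weekday() >= 5)
    return morning, evening, weekend


monday_morning = time_dummies(datetime(2008, 2, 4, 8, 0))
sunday_evening = time_dummies(datetime(2008, 2, 3, 17, 0))
tuesday_noon = time_dummies(datetime(2008, 2, 5, 12, 0))

print(monday_morning, sunday_evening, tuesday_noon)
```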

8) Traffic congestion index
The traffic congestion index measures traffic congestion in an area as the ratio of excess travel time to the original travel time. The index takes a value between 0 and 10. The larger the value, the more congested the road traffic; the smaller the value, the smoother the traffic, as shown in Table 1 and Table 2.

Table 1. Traffic congestion index rating chart.

| Index | Level | Description |
|---|---|---|
| [0, 2) | Smooth | There is basically no congestion. You can drive according to the road speed limit. |
| [2, 4) | Basically smooth | A small amount of congestion. A trip takes 0.2 to 0.5 times longer than in smooth traffic. |
| [4, 6) | Mild congestion | Some roads are congested. A trip takes 0.5 to 0.8 times longer than in smooth traffic. |
| [6, 8) | Moderate congestion | A large number of roads are congested. A trip takes 0.8 to 1.1 times longer than in smooth traffic. |
| [8, 10) | Severe congestion | Most of the city's roads are congested. A trip takes more than 1.1 times longer than in smooth traffic. |

For data collection and processing, we used various channels to obtain GPS trajectory data for 10,357 vehicles from February 2 to February 8, 2008. The data set contains about 15 million points, and the total trajectory distance reaches 9 million kilometers. We first performed preliminary cleaning, processing, and deletion of missing values. We then used the relevant software to plot the trajectory data on a map to determine the spatial range of the data set, as shown in Figure 1. The data within the processed range were combined and calculated according to the traffic congestion index, time attributes, and related traffic congestion indicators.
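The cleaning step can be sketched roughly as follows. The paper does not describe its record layout or tools, so this example assumes each GPS record is a tuple `(vehicle_id, timestamp, longitude, latitude)`, removes records with missing fields and exact duplicates, and derives segment speeds from consecutive points using the haversine great-circle distance. All names and sample coordinates are hypothetical.

```python
import math
from datetime import datetime

EARTH_RADIUS_KM = 6371.0


def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def clean(records):
    """Drop records with missing fields or exact duplicates, then sort by vehicle and time."""
    seen = set()
    kept = []
    for rec in records:
        if any(field is None for field in rec) or rec in seen:
            continue
        seen.add(rec)
        kept.append(rec)
    return sorted(kept, key=lambda r: (r[0], r[1]))


def segment_speeds(records):
    """Yield (vehicle_id, start_time, speed_kmh) for consecutive points of one vehicle."""
    for prev, cur in zip(records, records[1:]):
        if prev[0] != cur[0]:
            continue  # do not connect points of different vehicles
        hours = (cur[1] - prev[1]).total_seconds() / 3600
        if hours <= 0:
            continue  # skip points that share a timestamp
        yield prev[0], prev[1], haversine_km(prev[2], prev[3], cur[2], cur[3]) / hours


# Hypothetical raw records: one out of order, one duplicate, one with a missing value
raw = [
    (1, datetime(2008, 2, 2, 8, 6), 116.40, 39.95),
    (1, datetime(2008, 2, 2, 8, 0), 116.40, 39.90),
    (1, datetime(2008, 2, 2, 8, 0), 116.40, 39.90),
    (2, datetime(2008, 2, 2, 8, 0), None, 39.90),
]
speeds = list(segment_speeds(clean(raw)))
print(speeds)
```

Speeds derived this way inherit the GPS position error, which is one reason the low-speed estimates discussed in the introduction are unreliable.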

Analysis of the Problem
This problem requires us to build a multiple logarithmic linear regression traffic congestion prediction model and a traffic congestion duration model, using the former to improve the accuracy and timeliness of traffic congestion prediction and the latter to estimate traffic congestion time. We solve the problem in two steps.

Model Preparation
The traffic congestion index has many influencing factors, including average driving speed, average driving time, average loss time, traffic volume, traffic density, morning and evening peak, and the number of weeks. There are many ways to calculate the traffic congestion index. In the United States, traffic delay time is mainly used to calculate a severity index of traffic congestion. Considering the actual situation and existing data in China, the algorithm adopted in this paper calculates the traffic congestion index from road speed, as shown in Figure 3.

The Foundation of Model
The traffic congestion index is a conceptual index set by some cities to comprehensively reflect how unimpeded or congested the road network is according to road traffic conditions; it is equivalent to digitizing the congestion situation. The traffic congestion index in this paper is calculated from section speed. Let the estimate of $y$ be $\hat{y}$; then the residual between the observed value and the estimated value is $e = y - \hat{y}$.
2) Survival analysis methods
The methods of survival analysis mainly include the parametric method, the semi-parametric method and the non-parametric method. When the distribution type is unknown, the non-parametric method has a high computational efficiency, as shown in Table 4. The basic concepts of survival analysis are described in Table 3.

Table 3. Basic concepts.

| Basic concept | Description |
|---|---|
| Survival time | Generally, the time elapsed from a certain starting event to an end event. The unit of measurement can be year, month, day, hour, etc.; it is usually denoted by the symbol t. |
| Censored data | Data for which the exact survival time of a study individual could not be obtained for some reason, for example an individual who died in an accident partway through the study, or who was still alive when the observation period ended. |
| Survival probability | The probability that an individual who is alive at the beginning of a unit of time is still alive at its end. |
| Survival function | Also called the survival curve. It describes the probability distribution of the failure time of the object and is a monotone non-increasing function. |
| Hazard function | Also called the risk function. It describes the instantaneous risk of death at a given moment for an individual who has survived up to that moment. |

3) Congestion duration model based on survival analysis
Based on the survival analysis method described above and the actual research situation of this paper, traffic congestion is described as follows:
a) The survival time of a traffic jam is the duration from the occurrence of the traffic jam to its end.
b) Censored traffic jam data. Traffic jam duration data are censored when a jam begins before the start of the study period, when it continues after the end of the study period, or when some factor prevents complete data from being recorded accurately.
c) The traffic jam survival function s(t) is the probability that congestion persists from the beginning of the traffic jam to time t, also known as the cumulative survival function:
$$s(t) = P(T > t) = 1 - F(t) = \int_t^{\infty} f(x)\,dx$$
where $F(t)$ represents the distribution function; $P$ stands for probability; $T$ represents the duration of the traffic jam; and $f(x)$ is the probability density of $T$ evaluated at time $x$. When the survival probability is low, the survival curve s(t) is steep, while when the survival probability is high, the survival curve s(t) is flat.
d) The traffic congestion risk function h(t) is the instantaneous rate at which congestion that has lasted until time t ends within a very short interval $\Delta t$, also known as the conditional failure rate:
$$h(t) = \lim_{\Delta t \to 0} \frac{P(t \le T < t + \Delta t \mid T \ge t)}{\Delta t} = \frac{f(t)}{s(t)}$$
The cumulative risk function is the integral of the risk function, $H(t) = \int_0^t h(x)\,dx$. The higher its curve lies, the higher the probability that the traffic congestion ends.
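The relationships among the density, survival, risk and cumulative risk functions can be checked numerically. The sketch below assumes, purely for illustration, that congestion duration follows an exponential distribution with a mean of 60 minutes. The paper does not assume this distribution, and the function names are hypothetical.

```python
import math


def survival(t, rate):
    """s(t) = P(T > t) for an exponential duration with the given rate."""
    return math.exp(-rate * t)


def density(t, rate):
    """f(t), the probability density of the duration."""
    return rate * math.exp(-rate * t)


def hazard(t, rate):
    """h(t) = f(t) / s(t)."""
    return density(t, rate) / survival(t, rate)


def cumulative_hazard(t, rate, steps=1000):
    """H(t): trapezoidal integral of the hazard from 0 to t."""
    width = t / steps
    total = 0.0
    for i in range(steps):
        left, right = i * width, (i + 1) * width
        total += 0.5 * (hazard(left, rate) + hazard(right, rate)) * width
    return total


rate = 1 / 60                     # assumed mean duration of 60 minutes
h30 = hazard(30, rate)            # constant for the exponential case
H30 = cumulative_hazard(30, rate)

# The cumulative hazard links back to the survival function: s(t) = exp(-H(t))
print(h30, H30, math.exp(-H30), survival(30, rate))
```

For the exponential case the risk function is constant, so the cumulative risk grows linearly; an empirical congestion data set would generally show a time-varying risk instead.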

Model Solving
Multiple logarithmic linear regression traffic prediction model solution
The above variables are assumed to be log-linearly related to the traffic congestion index, and the multiple logarithmic linear regression traffic congestion prediction model is established on that basis. In the model, $v$ represents the average driving speed; $t$ represents the average driving time; $T_l$ represents the average loss time; $TD$ represents traffic density; $TV$ represents traffic volume; $D_1$ represents the morning peak; and $D_2$ represents the evening peak.

This paper divides the data set into a 60% training set and a 40% test set. Using Stata and the training set data, the multiple logarithmic linear regression traffic congestion prediction model was fitted, and the estimated regression coefficient vector b was obtained, which gives the preliminary multiple logarithmic linear regression traffic congestion prediction model.

According to the test results, as shown in Table 4 and Table 5, the residual mean square of the model was only 0.0525, $R^2$ was 0.9688, the fitting degree was 0.969, and the model p < 0.0001, so the regression model is meaningful. A t test was then applied to the regression coefficients (Table 6). The P values of the coefficients for average driving speed, average loss time, traffic density and the evening peak indicator were all less than 0.01, so these variables have a significant effect on the explained variable, the traffic congestion index, with average driving speed having the largest effect. The P value for average driving time was greater than 0.50, so average driving time can be regarded as having no significant effect on the traffic congestion index. Comparing the morning and evening peaks, the morning peak has little influence within the studied road sections, whereas the evening peak does affect congestion, as shown in Tables 5-7. The model was further analyzed, the insignificant variables were removed, and the regression model was fitted again.

Regression diagnosis and model analysis were carried out for the new regression model. According to the results of the significance test and the goodness-of-fit test, the model fits well. Therefore, the refitted model is taken as the city's multiple logarithmic linear regression traffic congestion prediction model.
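The fitting procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' Stata workflow: it uses only the Python standard library, synthetic made-up data, three illustrative regressors (speed, density and the evening peak indicator), and takes the dependent variable to be the logarithm of the congestion index. The names `solve`, `fit_ols`, `design_row` and `true_b` are hypothetical.

```python
import math
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_ols(rows, targets):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, targets)) for i in range(k)]
    return solve(xtx, xty)


def design_row(speed, density, evening_peak):
    """Intercept, ln(speed), ln(density) and the evening peak dummy."""
    return [1.0, math.log(speed), math.log(density), float(evening_peak)]


# Synthetic observations generated from known, made-up coefficients
rng = random.Random(0)
true_b = [3.0, -0.8, 0.3, 0.2]
rows, targets = [], []
for _ in range(200):
    row = design_row(rng.uniform(10, 60), rng.uniform(5, 80), rng.random() < 0.25)
    rows.append(row)
    # target is the log of the congestion index plus a small random error
    targets.append(sum(b * x for b, x in zip(true_b, row)) + rng.gauss(0, 0.05))

split = int(0.6 * len(rows))  # 60% training set, 40% test set
b_hat = fit_ols(rows[:split], targets[:split])
print(b_hat)
```

Because the synthetic data are generated from `true_b`, the fitted coefficients should land close to those values, which gives a quick sanity check of the solver before it is applied to real observations.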

Model Prediction
Through the above tests, the traffic congestion model was obtained. The model was then applied to the 40% test set to obtain predicted values, which were compared with the actual values of the traffic congestion index; the results are shown in Figure 4. For the test set, $R^2$ = 0.9734 and the fitting degree was 0.9727.
In order to further verify the prediction accuracy of the model, the fitting degree of the training set sample and of all samples was also analyzed. The fitting effect of the training set sample is shown in the figure, with $R^2$ = 0.9688 and a fitting degree of 0.969. The fitting effect of all samples is shown in Figure 5 and Figure 6, with $R^2$ = 0.968 and a fitting degree of 0.968. It can be seen from the figures that the predicted values obtained by the traffic congestion prediction model in this paper fit the actual values well. Therefore, it is feasible and effective to use the multiple logarithmic linear regression model proposed in this paper to predict the traffic congestion index.
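The goodness-of-fit measures reported above can be computed as in the following sketch. It assumes the standard definitions of the coefficient of determination and the residual mean square; the paper does not define its "fitting degree" explicitly, so that quantity is not reproduced here. The sample values are made up.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


def residual_mean_square(actual, predicted, n_params):
    """Residual sum of squares divided by the residual degrees of freedom."""
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return ss_res / (len(actual) - n_params)


# Hypothetical congestion index values and model predictions
actual = [2.0, 4.0, 6.0, 8.0]
predicted = [2.5, 3.5, 6.5, 7.5]

r2 = r_squared(actual, predicted)
rms = residual_mean_square(actual, predicted, n_params=2)
print(r2, rms)
```

Computing the same measures separately on the training set, the test set and all samples, as the paper does, helps reveal overfitting: a test-set value far below the training-set value would be a warning sign.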
Congestion duration prediction based on survival analysis
Because the distribution function of traffic congestion duration is unknown, this paper uses the non-parametric Kaplan-Meier method to estimate the survival function of traffic congestion duration. Its principle is as follows. Suppose there are n congestion duration samples whose durations take k distinct values $t_1 < t_2 < \dots < t_k$. The estimate of the survival function s(t) of traffic congestion duration is then
$$\hat{s}(t) = \prod_{t_i \le t} \frac{n_i - d_i}{n_i}$$
where $n_i$ is the number of samples still under observation just before time $t_i$, that is, the number of samples whose traffic congestion still persists; $d_i$ is the number of congestion events that end at time $t_i$; and $\hat{s}(t)$ is the estimated probability of survival at time t.
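A minimal Kaplan-Meier estimator can be written in a few lines of Python. The sketch below follows the standard product-limit definition and treats an observation that is censored at a given time as still at risk at that time. The function name and the sample durations are hypothetical, not drawn from the paper's data.

```python
from collections import Counter


def kaplan_meier(durations, ended):
    """Return [(t_i, s(t_i))] at each distinct time where a congestion event ended.

    durations: observed lengths in minutes.
    ended: True if the congestion was seen to end, False if the record is censored.
    """
    ends = Counter(t for t, e in zip(durations, ended) if e)
    leaving = Counter(durations)  # every sample leaves the risk set at its own time
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(leaving):
        d = ends.get(t, 0)
        if d:
            survival *= (at_risk - d) / at_risk
            curve.append((t, survival))
        at_risk -= leaving[t]
    return curve


# Hypothetical congestion durations; two records are censored
durations = [20, 35, 35, 50, 60, 90]
ended = [True, True, False, True, False, True]

curve = kaplan_meier(durations, ended)
for t, s in curve:
    print(t, round(s, 4))
```

Censored records never reduce the survival estimate directly, but they shrink the risk set for later times, which is how the estimator uses incomplete observations without discarding them.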
There is an obvious difference between weekday traffic and weekend traffic in the section studied in this paper. Weekday traffic congestion is significantly more serious and lasts longer than weekend congestion. At weekends the morning peak comes later than on weekdays, and the travel peak is near noon. As can be seen from Figure 9, the survival function for weekdays lies above the survival function for weekends, indicating that traffic congestion is more frequent on weekdays than at weekends and lasts longer. At weekends, 87.5% of traffic jams lasted less than 250 minutes, which means that a weekend jam is likely to end within that time and unlikely to last beyond it. On working days, 87% of traffic jams lasted less than 300 minutes, which likewise means that a weekday jam is likely to end within that time and unlikely to last beyond it.