Spatial Regression Analysis of Pedestrian Crashes Based on Point-of-Interest Data

Pedestrian safety has recently been regarded as one of the most serious issues in traffic safety research. This study analyzes the spatial correlation between the frequency of pedestrian crashes and various predictor variables based on open-source point-of-interest (POI) data, which can provide specific land use features and user characteristics. Spatial regression models were developed at the Traffic Analysis Zone (TAZ) level using 10,333 pedestrian crash records within the Fifth Ring of Beijing in 2015. Several spatial econometrics approaches were used to examine the spatial autocorrelation in crash count per TAZ, and the spatial heterogeneity was investigated with a geographically weighted regression model. The results showed that the spatial error model performed better than the other two spatial models and a traditional ordinary least squares model. Specifically, bus stops, hospitals, pharmacies, restaurants, and office buildings had positive impacts on pedestrian crashes, while hotels were negatively associated with the occurrence of pedestrian crashes. In addition, the results showed significant signs of localization effects for different POIs. Based on these findings, recommendations and countermeasures can be proposed to improve traffic safety for pedestrians.


Introduction
Traffic safety has recently been regarded as one of the greatest issues in urban management worldwide. Due to the influencing factors from different aspects,
How to cite this paper: Chen, Y.Y., Ma
In order to capture the contributing factors for pedestrian crashes, crash locations are usually aggregated into different spatial units [3], such as segments, intersections, mid-blocks, corridors and zones. Many studies have analyzed the safety of pedestrians based on zone-level data and examined many related features. Spatial analysis allows spatial distributions and trends to be identified over a larger area, which can help establish long-term planning schemes to improve pedestrian safety. Since crash occurrences are not independent across space, pedestrian crash risk may vary significantly among urban areas. It has been shown that spatial autocorrelation and spatial heterogeneity are two critical properties of crash data when developing statistical models for macro-level safety analysis. Fortunately, important improvements in analytic methods have facilitated safety research on pedestrian crashes.
A geographic information system (GIS) is a powerful platform that supports many spatial regression models. GeoDa software, adopted in many recent studies, can be used to establish Bayesian models of spatial correlation. The R language offers strong reproducibility for spatial data analysis and includes many spatial packages for different purposes.
Pedestrian crash occurrences are correlated with various kinds of attributes, such as land use, vehicle kilometers traveled, road features, traffic volume and socio-demographic characteristics. However, the accuracy and reliability of these data can hardly be assured. Availability is another concern: the relevant authorities do not open these sources to the public. In contrast, point-of-interest (POI) data from anywhere in the world can be collected with the help of web scraping and other open sources. Although POI data may not include the traditional information used in traffic accident analysis, they represent specific land use factors with precise locations, which are expected to be highly related to pedestrian crashes at both the macro and micro levels. Additionally, POI data may be useful in practice, for instance as an aid to transportation planning.
This study has two main goals: one is to examine whether spatial autocorrelation and spatial heterogeneity exist in pedestrian crashes within the urban area of Beijing; the other is to identify factors contributing to the number of pedestrian

Literature Review
Many approaches have been proposed to identify the contributing factors for pedestrian crashes. Some researchers used statistical analysis to model accident frequency [4] [5] [6]. Other studies focused on accident injury severity [7] [8] [9] [10] [11]. In addition, many researchers have consistently considered the spatial and temporal attributes of accident data [12]-[17]. At the macro scale, accident distribution characteristics can be depicted by spatial clustering methods. With time series models, the trend and seasonality of accident occurrence at a given place can be identified.
An accident's location is always an important factor, which can be used for identifying black spots. In many existing studies, kernel density estimation (KDE) is a very common method for identifying clusters of traffic crashes.  [29]. They found that the performance of the spatial lag model (SLM) and the spatial error model (SEM) was very close, and that bank density and hospital density had significantly positive impacts on road accidents.
There is also a group of studies using conditional autoregressive (CAR) models to analyze crashes from different perspectives. Kaplan [37]. The results showed that the geographically weighted negative binomial regression (GWNBR) model was more appropriate for capturing the spatial heterogeneity of accident occurrences than the geographically weighted Poisson regression (GWPR) model. In the study of Bao et al., the relationship between Twitter-based human activity variables and crash counts in urban areas was the main focus [30]. They found that human activity had a significant effect on crash frequency in their analysis.
Because crashes involving pedestrians tend to be severe, a good number of studies have applied spatial modelling to this crash type. Siddiqui et al. used a Bayesian Poisson-lognormal model to examine the impact of different variables on pedestrian crash frequency while accounting for spatial correlation [12]. They found that roadway characteristics, demographics, and socio-economic and neighborhood-related variables were statistically significant. The results also indicated that models of pedestrian crashes should account for spatial correlation in spatially aggregated data. In the study of Cai et al., spatial spillover effects were considered in dual-state models [38]. They used zero-inflated negative binomial and hurdle negative binomial models to analyze pedestrian crash frequency for TAZs. The model results emphasized the impact of traffic, roadway and socio-demographic characteristics, as well as neighboring TAZs, on the occurrence of pedestrian crashes. Conditional autoregressive models have also been adopted in the macro-level spatial analysis of pedestrian crashes. To investigate the association between explanatory variables and the number of pedestrian crashes, Wang et al. developed a Bayesian CAR model with seven different spatial weight features to characterize the spatial dependence [16]. Guo et al. established Bayesian Poisson-lognormal (PLN) models with a conditional autoregressive (CAR) prior to examine the influence of multiple factors on pedestrian crash occurrences [39]. The model results showed that greater global integration was positively related to a higher frequency of pedestrian crashes, and that the irregular road network was much safer than the grid pattern.  65. The number of pedestrian crashes per TAZ varies considerably across space, which is analyzed in later sections.

Data
As mentioned above, many studies have combined accident data with traditional traffic flow data and land use data to perform the spatial safety analysis.
However, owing to the unavailability of these data at the TAZ level, this study focused mainly on POI data. The POI data were obtained from the Google Application Programming Interface (API) through web scraping. After data tidying and categorizing, eleven kinds of POI were chosen: bus stops, parking lots, hospitals, pharmacies, schools, hotels, supermarkets, banks, restaurants, parks and office buildings. In effect, POI data are a special kind of land use data with concrete location attributes, which can be used to reflect the relationship between pedestrian crashes and user characteristics. Table 1 shows the descriptive statistics of these POI data. Notably, the mean values of crashes and POIs vary significantly, which indicates that the data are very unevenly distributed across space.
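As a rough illustration of how POI points can be aggregated to zones, the sketch below assigns each POI to a zone with a ray-casting point-in-polygon test and then counts POIs by category. The zone polygons, POI coordinates, zone areas and the function names (`point_in_polygon`, `count_pois_by_zone`, `poi_density`) are all hypothetical; the processing pipeline actually used in the study is not described in the text.

```python
from collections import Counter


def point_in_polygon(x, y, polygon):
    """Ray casting: count how many polygon edges a ray going right from (x, y) crosses."""
    inside = False
    n = len(polygon)
    for k in range(n):
        x1, y1 = polygon[k]
        x2, y2 = polygon[(k + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def count_pois_by_zone(pois, zones):
    """Return {zone_id: Counter({category: count})} for POIs given as (x, y, category)."""
    counts = {zone_id: Counter() for zone_id in zones}
    for x, y, category in pois:
        for zone_id, polygon in zones.items():
            if point_in_polygon(x, y, polygon):
                counts[zone_id][category] += 1
                break  # assume zones do not overlap
    return counts


def poi_density(counts, areas_km2):
    """Convert counts to POIs per square kilometre for each zone."""
    return {
        zone_id: {cat: n / areas_km2[zone_id] for cat, n in cat_counts.items()}
        for zone_id, cat_counts in counts.items()
    }


# Hypothetical zones (unit squares) and POIs, for illustration only.
ZONES = {
    "A": [(0, 0), (1, 0), (1, 1), (0, 1)],
    "B": [(1, 0), (2, 0), (2, 1), (1, 1)],
}
AREAS_KM2 = {"A": 1.0, "B": 0.5}
POIS = [
    (0.5, 0.5, "bus_stop"),
    (0.2, 0.8, "hospital"),
    (1.5, 0.5, "bus_stop"),
    (1.7, 0.2, "bus_stop"),
    (5.0, 5.0, "hotel"),  # falls outside every zone and is ignored
]

COUNTS = count_pois_by_zone(POIS, ZONES)
DENSITY = poi_density(COUNTS, AREAS_KM2)
```

Points that lie exactly on a shared zone boundary are not handled specially here; a production workflow would normally rely on a GIS library for that.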

Methodology
In this study, pedestrian crash data were analyzed from two aspects: spatial autocorrelation and spatial heterogeneity. According to Tobler's first law of geography, near things are more related than distant things [40]. In other words, the locations of traffic accidents are probably spatially autocorrelated, especially across different areas within a city. A group of spatial econometrics approaches, built on traditional regression models, can be used to take spatial autocorrelation into consideration. Spatial heterogeneity is the variation in the relationship between variables due to variation in geographical position. This

Spatial Autoregression
In spatial analysis of traffic crashes, spatial dependence occurs when accidents in neighboring areas are correlated with each other. When this phenomenon exists, pedestrian crash data should not be analyzed directly with regular regression models such as ordinary least squares (OLS). OLS is a type of linear regression model for estimating unknown parameters, which can be expressed in vector form as
$$y = X\beta + \varepsilon \tag{1}$$
where y is the dependent variable of crash count, X represents the explanatory variables and β is a vector of coefficients of the explanatory variables. ε is an error term, which is assumed to follow a normal distribution.
In a time-series context, the OLS estimator remains consistent even when a lagged dependent variable is present, as long as the error term does not show serial correlation. While the estimator may be biased in small samples, it can still be used for asymptotic inference. In a spatial context, this rule does not hold, irrespective of the properties of the error term. In spatial analysis of traffic crashes, pure OLS can be used to identify the variables that are significant for the crash count of each TAZ without considering any spatial relations between areas. However, OLS alone cannot reflect spatial autocorrelation in the crash data. To deal with this issue, many transportation departments have used OLS with a spatial weight matrix to model the number of crashes or the crash rate at a large spatial scale over long time periods. More specifically, several spatial econometrics approaches can be used to analyze spatial autocorrelation in crash data on the basis of the OLS results [42]. The spatial lag model (SLM) can be given by
$$y = \rho W y + X\beta + \varepsilon \tag{2}$$
where y is the dependent variable, an n × 1 vector of pedestrian crashes in one year. W is an n × n spatial weight matrix representing spatial relations between spatial units. ρ is a spatial autoregressive coefficient. X is an n × k matrix of explanatory variables, and β is a k × 1 vector of parameters reflecting the impact of the explanatory variables on y. ε is an n × 1 vector of unobserved error terms that are independent and identically distributed.
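The claim that OLS breaks down in the spatial lag setting can be made concrete by solving the SLM for y. The steps below are a sketch under two assumptions that the text does not state explicitly: that the matrix I − ρW is invertible (which holds, for example, when W is row-standardized and |ρ| < 1), and that the errors have common variance σ² (a symbol introduced here).

```latex
\begin{aligned}
y &= \rho W y + X\beta + \varepsilon \\
(I - \rho W)\,y &= X\beta + \varepsilon \\
y &= (I - \rho W)^{-1} X\beta + (I - \rho W)^{-1}\varepsilon \\
Wy &= W(I - \rho W)^{-1} X\beta + W(I - \rho W)^{-1}\varepsilon \\
\mathrm{E}\!\left[(Wy)^{T}\varepsilon\right] &= \sigma^{2}\,\mathrm{tr}\!\left(W(I - \rho W)^{-1}\right)
\end{aligned}
```

The regressor Wy therefore contains the error term itself. Unless the trace in the last line is zero, which is generally not the case when ρ ≠ 0, Wy is correlated with ε even if ε is independent and identically distributed, so regressing y on Wy and X by OLS is not consistent. This is why spatial lag models are usually estimated by maximum likelihood or instrumental variables instead.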
Use of the spatial error model (SEM) may be motivated by omitted variable bias.
SEM is a regression model with spatial autocorrelation in the residuals, defined by
$$y = X\beta + \mu \tag{3}$$
$$\mu = \lambda W \mu + \varepsilon \tag{4}$$
where y is the dependent variable, X represents the explanatory variables, and β is a vector of coefficients of X. W is a known spatial contiguity matrix. The parameter λ is a coefficient on the autocorrelated residuals μ, and ε is an error term.
The spatial Durbin model (SDM) adds spatial lags of both the dependent variable and the explanatory variables to a traditional linear model, which can be expressed by
$$y = \rho W y + X\beta_1 + W X\beta_2 + \varepsilon \tag{5}$$
where y is an n × 1 vector of the dependent variable, X is the corresponding n × k matrix of observed explanatory variables, and $\beta_1$ is a k × 1 vector of parameters associated with X. W is an n × n spatial contiguity matrix, and ρ is the coefficient of the spatial lag of the dependent variable. The matrix product WX is the spatial lag of the explanatory variables, $\beta_2$ is a k × 1 vector of associated parameters, and ε is an error term.
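It may help to see how the three spatial models relate to one another. The short derivation below is a standard result rather than something taken from the text: it rewrites the SEM in the same form as the SDM, assuming I − λW is invertible.

```latex
\begin{aligned}
\text{SEM:}\quad & y = X\beta + \mu, \qquad \mu = \lambda W\mu + \varepsilon \\
& (I - \lambda W)(y - X\beta) = \varepsilon \\
& y = \lambda W y + X\beta - \lambda W X\beta + \varepsilon \\
\text{SDM:}\quad & y = \rho W y + X\beta_1 + W X\beta_2 + \varepsilon
\end{aligned}
```

Comparing the last two lines, the SDM reduces to the SEM under the restriction $\beta_2 = -\rho\beta_1$ (with ρ = λ and $\beta_1 = \beta$), and to the SLM under the restriction $\beta_2 = 0$. Because both are restricted versions of the SDM, an SDM fitted by maximum likelihood to the same data and the same W cannot have a lower maximized log likelihood than either of them, which is one reason a criterion that penalizes extra parameters, such as AIC, is useful when comparing the three.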
In the SLM, the number of pedestrian crashes in a specific area is subject to spill-over effects from the numbers in adjacent regions. The spill-over effect enters through the spatial weight matrix W in Equation (2). Similarly, the SEM assumes that the error in one region depends on the errors in neighboring regions through W in Equation (4). In this study, the queen's case [43] is used to define the spatial weight matrix. Under the queen's case, regions sharing a common edge or a common vertex are considered contiguous, and the corresponding element of the spatial weight matrix, $W_{ij}$, is 1; otherwise it is 0. The spdep package in R was used for this analysis.
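The queen's case rule can be sketched on a regular grid, where each cell stands in for a TAZ. This is only an illustration in plain Python, not the spdep workflow used in the study: real TAZs are irregular polygons, the helper names are hypothetical, row standardization is a common convention that the text does not mention, and Moran's I is included as a standard autocorrelation statistic that the text does not name.

```python
def queen_weights(n_rows, n_cols):
    """Binary queen contiguity matrix for a grid: 1 if cells share an edge or a vertex."""
    n = n_rows * n_cols
    W = [[0] * n for _ in range(n)]
    for r in range(n_rows):
        for c in range(n_cols):
            i = r * n_cols + c
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    if dr == 0 and dc == 0:
                        continue  # a cell is not its own neighbor
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < n_rows and 0 <= cc < n_cols:
                        W[i][rr * n_cols + cc] = 1
    return W


def row_standardize(W):
    """Scale each row to sum to 1 (assumed convention, not stated in the text)."""
    out = []
    for row in W:
        s = sum(row)
        out.append([w / s if s else 0.0 for w in row])
    return out


def spatial_lag(W, y):
    """Compute Wy, the weighted combination of neighboring values for each zone."""
    return [sum(w * yj for w, yj in zip(row, y)) for row in W]


def morans_i(W, y):
    """Global Moran's I: (n / S0) * sum_ij w_ij z_i z_j / sum_i z_i^2."""
    n = len(y)
    mean = sum(y) / n
    z = [v - mean for v in y]
    s0 = sum(sum(row) for row in W)
    num = sum(W[i][j] * z[i] * z[j] for i in range(n) for j in range(n))
    den = sum(v * v for v in z)
    return (n / s0) * num / den


# Hypothetical crash counts on a 3 x 3 grid, increasing from the top row down.
crashes = [0, 0, 0, 1, 1, 1, 2, 2, 2]
W = queen_weights(3, 3)
lagged = spatial_lag(row_standardize(W), crashes)
moran = morans_i(W, crashes)
```

With a row-standardized W, each entry of `lagged` is the average crash count of the neighboring cells, which is the quantity that enters the SLM as Wy. A positive Moran's I suggests that similar values cluster in space.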

Spatial Heterogeneity
A key assumption made in the models thus far is that the structure of the model remains constant over the study area (no local variations in the parameter estimates). However, spatial heterogeneity may exist across the spatial distribution of traffic crashes. To account for this, a geographically weighted regression (GWR) model can be used to examine potential spatial heterogeneity in the parameter estimates. GWR permits the parameter estimates to vary locally, similar to parameter drift in a time series model. GWR rewrites the linear model in a slightly different form, which can be expressed as
$$y_i = X_i \beta_i + \varepsilon_i \tag{6}$$
where i is the TAZ at which the local parameters are to be estimated and $X_i$ is the row of X for TAZ i. Here, the coefficient vector $\beta_i$ is allowed to differ between TAZs. The parameters are solved using a weighting scheme, which can be defined by
$$\hat{\beta}_i = (X^{T} W_i X)^{-1} X^{T} W_i y \tag{7}$$
where $W_i$, the weight matrix, is denoted as
$$W_i = \mathrm{diag}(w_{i1}, w_{i2}, \ldots, w_{in}) \tag{8}$$
where the weight $w_{ij}$ allocated to observation j at TAZ i is calculated with a Gaussian function in this study, which in a commonly used form can be expressed as
$$w_{ij} = \exp\left(-\frac{d_{ij}^{2}}{h^{2}}\right) \tag{9}$$
where $d_{ij}$ is the distance between the location of observation i and location j, and the parameter h is the bandwidth.  [46]. The best-fitting model can be selected by the lowest corrected Akaike information criterion (AICc).
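The local estimation idea behind GWR can be sketched with one explanatory variable plus an intercept, so that the weighted least squares solution has a closed form and no linear algebra library is needed. This is a minimal sketch, not the implementation used in the study: the Gaussian kernel form exp(−d²/h²), the fixed bandwidth, the grid of zone centroids and the function names are all assumptions made for illustration.

```python
import math


def gaussian_weight(distance, bandwidth):
    """Gaussian kernel weight; the exact functional form is an assumption."""
    return math.exp(-(distance ** 2) / (bandwidth ** 2))


def gwr_local_fit(coords, x, y, i, bandwidth):
    """Weighted least squares fit of y = b0 + b1 * x, with weights centred on location i."""
    xi, yi = coords[i]
    w = [gaussian_weight(math.hypot(cx - xi, cy - yi), bandwidth)
         for cx, cy in coords]
    # Entries of X^T W X and X^T W y for the design matrix X = [1, x].
    sw = sum(w)
    swx = sum(wj * xj for wj, xj in zip(w, x))
    swy = sum(wj * yj for wj, yj in zip(w, y))
    swxx = sum(wj * xj * xj for wj, xj in zip(w, x))
    swxy = sum(wj * xj * yj for wj, xj, yj in zip(w, x, y))
    det = sw * swxx - swx * swx
    if det == 0:
        raise ValueError("x has no weighted variation around this location")
    slope = (sw * swxy - swx * swy) / det
    intercept = (swy - slope * swx) / sw
    return intercept, slope


def gwr_fit_all(coords, x, y, bandwidth):
    """Return one (intercept, slope) pair per location."""
    return [gwr_local_fit(coords, x, y, i, bandwidth)
            for i in range(len(coords))]


# Hypothetical data: 16 zone centroids on a 4 x 4 grid, one POI density variable,
# and a true slope that grows from west to east.
coords = [(float(c), float(r)) for r in range(4) for c in range(4)]
poi_density = [float((3 * k) % 7 + 1) for k in range(16)]
crashes = [1.0 + (1.0 + 0.5 * cx) * d for (cx, _), d in zip(coords, poi_density)]
local_estimates = gwr_fit_all(coords, poi_density, crashes, bandwidth=1.5)
```

Each location receives its own coefficient pair, so mapping the local slopes shows where a POI type matters more. In practice the bandwidth would be chosen by a criterion such as AICc or cross-validation rather than fixed by hand.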

Spatial Econometrics Analysis Results
First, an OLS model was established to examine whether each type of POI is significant for the number of pedestrian crashes in each TAZ. The results are shown in Table 2. Without considering spatial effects, the number of pedestrian crashes in each TAZ is associated with the density of bus stops, hospitals, pharmacies, hotels, restaurants and office buildings, while parking lots, schools, supermarkets, banks and parks are not significant for crash count in the model.
The POIs that did not reach significance in the OLS model were removed from further analysis. The model performance results for SLM, SEM and SDM are shown in Table 3. The OLS model was also re-estimated with the six remaining variables so that it could be compared with the models that account for spatial autocorrelation. Almost all remaining variables are statistically significant at the 10% level or better in the four models. The coefficients of these variables are given in the [...]
[Table note: p-value 0.000. * represents the significance level of 10%, ** represents the significance level of 5%, and *** represents the significance level of 1%.]
[...] clustered in SLM, SEM and SDM respectively. Log likelihood and the Akaike information criterion (AIC) are chosen to compare model performance. A higher log likelihood and a lower AIC indicate a better model fit. According to Table 3, the three spatial regression models fit better than OLS in terms of both log likelihood and AIC.
However, the three spatial regression models perform very similarly. Considering only log likelihood, SDM yields a slightly better fit, while SEM yields the best fit according to AIC.
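The two criteria can disagree because AIC penalizes the number of estimated parameters: AIC = 2k − 2 ln L, where k is the parameter count and ln L is the maximized log likelihood. The sketch below ranks four models by AIC. The log likelihoods and parameter counts are invented placeholders chosen only to show the mechanism; they are not the values in Table 3, and the function names are hypothetical.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln L (lower is better)."""
    return 2 * n_params - 2 * log_likelihood


def rank_models(fits):
    """Sort model names from best to worst AIC; fits maps name -> (log_likelihood, n_params)."""
    return sorted(fits, key=lambda name: aic(*fits[name]))


# Placeholder values for illustration only, NOT results from the study.
example_fits = {
    "OLS": (-1000.0, 8),
    "SLM": (-980.0, 9),
    "SEM": (-978.0, 9),
    "SDM": (-975.0, 15),
}

ranking = rank_models(example_fits)
best_by_loglik = max(example_fits, key=lambda name: example_fits[name][0])
```

In this made-up example the SDM has the highest log likelihood, but its extra spatially lagged coefficients cost more in the penalty term than they gain in fit, so a model with fewer parameters ranks first on AIC.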
As the results in Table 3 show, bus stop density, hospital density, pharmacy density, restaurant density and office building density are positively associated with the number of pedestrian crashes, while hotel density is negatively correlated with pedestrian crashes. It is not hard to understand that a pedestrian crash is more likely to occur in areas with more bus stops.
Since people usually walk to or from bus stops, pedestrians are more commonly involved in crashes around bus stops. In addition, because buses enter and exit bus stops, the complexity of traffic flow may be another contributor to risky scenarios involving pedestrians. Thus, more efficient facilities for pedestrian access to bus stops are recommended, and warning signs for drivers near bus stops are also necessary.
One explanation for higher occurrence of pedestrian crashes in areas having more
Journal of Data Analysis and Information Processing
Table 3. Results for spatial models based on selected POI data.

Geographically Weighted Regression (GWR) Results
The results of the GWR model for pedestrian crashes are presented in Table 4.  Considering the different effects of the independent variables on each TAZ, some localized strategies could be proposed to improve traffic safety for pedestrians. For instance, TAZs in the northeast of Beijing should create a safer environment for pedestrians taking buses. Northern areas may need more traffic management around hospitals to better protect pedestrians. These recommendations need to be adjusted in light of practical experience, and they are worth taking into account in urban management plans.

Conclusions
This study mainly used several spatial regression models to estimate the correla-  Figure   3).
Eleven kinds of POI were tested in the OLS model, and only six of them were significant for the pedestrian crash count of TAZs: bus stops, hospitals, hotels, pharmacies, restaurants and office buildings. All of these POIs except hotels were found to be positively correlated with the number of pedestrian crashes in each TAZ. With spatial dependence taken into consideration, the results of the SLM, SEM and SDM showed that these six POIs remain correlated with the occurrence of pedestrian crashes, with only slight variation in the significance level. The GWR model revealed that the effect of each POI type generally differs between TAZs. For example, the occurrence of pedestrian crashes in the northern urban areas of Beijing is more affected by hospitals than in the south, while bus stops have a stronger effect in the south-eastern regions than in the north-western regions.
Practical implications can be drawn for the relevant transport departments based on the analysis in this study. The spatial scale used here is the TAZ level, which is also adopted in much transportation planning practice. In addition, the POI data proved to be adaptable and effective for the spatial analysis of pedestrian crashes. Although only eleven categories of POI data were considered in this study, the number can be increased through stronger web scraping techniques and various map application programming interfaces (APIs). It is meaningful to conduct spatial regression analysis on the correlation between different POIs and pedestrian crashes. However, for many planning agencies, these approaches to traffic safety analysis are still at an early stage. It requires more theoretical innovations,