Demand Prediction of Ride-Hailing Pick-Up Location Using Ensemble Learning Methods

Ride-hailing and carpooling platforms have become a popular way to move around cities. They are based on the principle of matching riders with drivers, with Uber, Lyft and Didi holding the largest market shares. The challenge remains to match rider demand with driver supply optimally, reducing the congestion and emissions associated with vehicle clustering and deadheading, which ultimately lead to surge pricing, where providers raise the price of a trip in order to attract drivers into such zones. Many riders see this sudden spike in rates as a disincentive to use the service. In this paper, data mining techniques are applied to develop an ensemble learning model based on historical data from the City of Chicago Transport provider’s dataset. The objective is to develop a dynamic model capable of predicting the rider drop-off location from pick-up location data, and subsequently of using drop-off location data to predict pick-up points for effective driver deployment under multiple scenarios of privacy and information. Results show that neural network algorithms perform best in generalizing pick-up and drop-off points when given only starting point information. The ensemble learning methods AdaBoost and Random Forest are able to predict both drop-off and pick-up points with an MAE of one (1) community area when given only the rider pick-up point and Census Tract information, and in reverse to predict potential pick-up points using the drop-off point as the new starting point.


Introduction
In recent years, ride-hailing and carpooling platforms have become an increasingly popular and convenient way of moving around most modern cities, matching riders with drivers, with Uber, Lyft and Didi being the biggest providers in the industry. In light of increased environmental awareness and concerns about minimizing carbon footprints, ridesharing and carpooling have become increasingly important.
Carpooling has numerous societal and individual benefits, including but not limited to reduced greenhouse gas emissions and cost savings from shared travel costs for public agencies and employers [1].
In their paper, [2] present salient points for understanding the key aspects of existing ridesharing systems, and go on to design a framework that identifies challenges in the use of ridesharing, thus fostering the development of mechanisms to overcome them and promote widespread use.
Emerging studies [3] demonstrate that psychological factors such as monetary and time benefits are becoming more dominant in decisions to use ride-hailing and carpooling services. In relation to rider satisfaction, [4] found that surge pricing does not bias Uber towards riders of higher income, but that homophilous matching, that is, matching riders to drivers of a similar age, resulted in higher ratings; the authors went on to use these insights to predict driver and/or rider retention. Examining ridesharing platforms, [5] concluded that these platforms will do more good than harm going forward, and that relatively little is known about their efficiency and equity, although this is likely to change with growing research interest. Using online reviews by drivers of the popular ride-hailing companies Uber and Lyft, [6] demonstrated a preference for Uber over Lyft. In addition, the analysis shows increased competition to attract more drivers, who counted job flexibility and meeting new people as the main advantages. In contrast, insufficient compensation, poor job security, poor rider behavior and poor customer service were cited as impeding factors.

Problem Statement
The Braess Paradox [7] [8] is a network phenomenon in which the addition of extra capacity reduces overall network performance over time, with lack of cooperation among users being the ultimate culprit for network breakdown. Congestion and vehicle clustering: over the years, the number of vehicles engaged in ride-hailing has increased dramatically, surpassing taxis in many cities [9]. A report from the Union of Concerned Scientists (2020) shows that ride-hailing trips are responsible for 69 percent more emissions than the trips they displace, with a significant share of travel being deadheading (dead mileage). Deadheading is the period between a drop-off and the next pick-up and is associated with increased costs [10]. Surge pricing, i.e. adjusting prices upwards to address an acute driver shortage, is viewed as a disincentive by many riders, leading to lost revenue.
Solving the problem D. Carson-Bell et al.
In order to combat the problems above, it is necessary to develop a sound driver deployment strategy. Collective Intelligence (COIN) [11] was first suggested as a way of solving the Braess paradox. It involves all network users acting centrally for the benefit of all. [12] observed that strategic repositioning is key to maximizing driver earnings, as against surge chasing, which increases dead mileage. The first step is to be able to predict demand and deploy vehicles accordingly. [13] conclude that centralized fleet coordination offers substantial benefits towards sustainable growth and market share.

Research Purpose and Objective
The objective is to develop a city-wide prediction algorithm capable of predicting trip pick-up and drop-off points, as well as potential pick-up locations after each drop-off, based on historical data and using data mining techniques.
Case study: City of Chicago, Illinois.

Related Works
The growth of demand for ride-hailing services has disrupted urban transportation and is changing the way in which people travel. Modern ride-hailing services require efficient recommendation systems in order to improve both rider and driver experience. In response, many researchers have conducted experiments to predict ride-hailing demand in order to deploy ride-hailing vehicles more effectively.
In attempting to maximize the number of pick-ups whilst minimizing waiting time for taxi services, [14] developed a ride-hailing recommendation system that works in three phases. The model starts by estimating future customer demand in different clusters within the area of interest. This is followed by taxi-to-region matching according to preset rules and conditions, including driver preference, and concludes with the design of an optimized geo-routing algorithm to help drivers minimize dead mileage. The main problem with this approach is the instability of driver preference, which changes frequently and makes the approach difficult to deploy in real-world situations.
Dead mileage comprises a significant share of the total travel covered by drivers in the ride-hailing industry, in terms of both miles travelled and number of trips. Accurate demand prediction can greatly improve vehicle utilization whilst reducing waiting time. Customers mainly desire shorter waiting times, whilst drivers aim to minimize deadheading and idle time after trips. This part of the industry is another area of strong research interest. [15] is one of the first to study this emerging field, developing a model which predicts the gap between rider demand and driver supply within a given time period and specific geographic area using Point of Interest (POI), traffic and weather data as well as data from car-sharing orders. A data sampling technique is used to determine patterns and generalizations which can be applied in real-case scenarios, forming the basis for future work. This concept of finding the supply and demand gap is important as it allows drivers to be deployed so as to improve the level of service. Time-based demand prediction is another research area fast gaining ground. It is based on the premise of predicting ride-hailing vehicle demand in the next hour.

Operational Research Mobility Optimization
The vast majority of human interaction takes place in one of two areas: home or work. In order to further understand the mobility patterns of users of ridesharing services across home and work locations, as well as social ties between users, [16] developed an algorithm for matching users with similar mobility patterns under constraints and concluded that social distance decreased by as much as 31% when users shared rides with others. These findings indicate the importance of studying mobility patterns and the benefits which can be derived from optimizing ride-hailing services at an operational level. Using a more flexible yet extensible mobility model representing ride-sharing users' movements and habits, [17] deploy a Variable-Order Markov Model (VOMM) underpinned by a Prediction by Partial Matching (PPM) algorithm for next-location prediction, with a prediction accuracy ranging from 60% to 81%. A major limitation of the PPM algorithm lies in its compression process, which tends to limit performance over time. Comparing the use of privately owned vehicles with two Autonomous Mobility on-Demand (AMoD) systems simulated on a real transport network based on the current situation, under different scenarios, [18] found that deploying an AMoD system resulted in a major decrease in both the number of vehicles required to meet transport needs (43% in AMoD1 and 88% in AMoD2) and the street parking space required (58% in AMoD1 and 83% in AMoD2). [19] also cite effective road utilization as another advantage of designing the matching algorithm.
Autonomous Mobility on-Demand vehicles are viewed by many as the future of transport. However, their effectiveness hinges largely on the ability to coordinate their movement and predict demand as accurately as possible using the vast quantity of data available, a direction this paper seeks to pursue further.
In an attempt to manage the surge of homeward-bound travellers during holiday seasons, [20] proposed a large-scale ridesharing system called Country-Roads ® using an online greedy matching algorithm to match drivers and passengers, recording a success rate of 23.2%. Online greedy matching algorithms have a comparatively low performance threshold when applied to complex systems such as ride-hailing services, as experienced by the authors. This is largely due to the rigidity of the process, which makes it less than ideal for location prediction. Based on the concept of space-time windows, [21] develop a unique approach based on Lagrangian relaxation, and conclude that the adoption of flexible pickup and delivery will reduce system-wide cost whilst improving service quality.
This hypothesis, although found to be true, defeats the purpose of ride-hailing services. Flexible pickup and delivery have not been widely accepted even within the carpooling sphere, as centralized pick-up locations are yet to gain wide acceptance.

Linear Programming & Statistical Methods
In implementing optimization solutions based on linear programming, [22] deploy a Tabu-based meta-heuristic algorithm with the aim of solving the mixed integer linear program (MILP) under differing scenarios. The algorithm is observed to have a higher computational accuracy than the control, and the introduction of meet points to the ridesharing system reduces total travel time by 2.7% to 3.8% for scaled tests. Because meet points have not been widely accepted within the ride-hailing and carpooling industry, especially given Covid-19 social distancing protocols, the benefits of reduced travel time and the associated reduction in travel costs cannot be fully quantified. This demonstrates the need to improve location prediction as a lasting solution.
From the domain of probability and statistics, [23] collected data on taxi trips in New York, Singapore, San Francisco and Vienna, computed shareability curves for each city, and then, through natural rescaling, collapsed them into a universal curve which is used to predict the potential of ridesharing in any given city based on a few qualities and parameters. The statistical methods employed give a general overview of the growth potential of ride-hailing services in any given city. This serves city planning purposes but fails to examine rider-driver interaction.
Examining the relationship between the frequency and probability of ridesharing usage and the frequency of public transit usage, [24] develop a zero-inflated negative binomial regression model.

Research Framework and Design
To reduce the number of vehicles, alleviate traffic jams and curb pollution in transporting people to office hubs in Poland, [25] collected a representative sample of the population and used spatial data mining techniques to develop a set of parameters for a multi-agent system. Using DeepPool ®, a distributed model-free system based on deep Q-network (DQN) techniques, [26] develop an algorithm able to learn the optimal dispatch policy through interaction with the environment, incorporating travel demand statistics and a dataset of taxi trips in New York to dispatch vehicles and anticipate future demand. Deploying a convolutional neural network (CNN) based on deep learning for multi-step ride-hailing demand prediction using trip request data in Chengdu, [27] showcase faster training and prediction with CNN models than with Long Short-Term Memory (LSTM) models.

Data
In conducting this research, a large-scale dataset of rideshare and taxi trips in Chicago spanning 2018/2019 is collected. Table 1 lists the elements of each observation. The data is then processed and cleaned. As a first step, a comprehensive understanding of the individual features within the dataset is required, as well as knowledge of the trip distribution across the city, from origin (O) to destination (D). Numerous studies have demonstrated the importance of regional partitioning in location prediction. Research and experiments by [28] demonstrated that regional partitioning led to better forecasting and demand prediction for geospatial data. This is followed by scenario development. Figure 1 shows a color-coded layout of the City of Chicago, detailing its community areas as well as census tracts.
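As an illustration of the origin-destination (O-D) view of trip distribution, the following minimal sketch counts trips per (pick-up, drop-off) community-area pair. The field names `pickup_area` and `dropoff_area` and the sample records are hypothetical and are not taken from the dataset schema.

```python
from collections import Counter

def od_counts(trips):
    """Count trips for each (origin, destination) community-area pair.

    `trips` is an iterable of dicts; the keys used here are illustrative
    assumptions, not the dataset's actual column names.
    """
    counts = Counter()
    for trip in trips:
        origin = trip.get("pickup_area")
        dest = trip.get("dropoff_area")
        # Skip records with a missing endpoint, as part of basic cleaning.
        if origin is None or dest is None:
            continue
        counts[(origin, dest)] += 1
    return counts

# Hypothetical sample records.
sample_trips = [
    {"pickup_area": 8, "dropoff_area": 32},
    {"pickup_area": 8, "dropoff_area": 32},
    {"pickup_area": 32, "dropoff_area": 8},
    {"pickup_area": 8, "dropoff_area": None},
]
od = od_counts(sample_trips)
```

A table of this kind gives a first impression of which community-area pairs dominate the trip distribution before any model is fitted.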

Multidimensional Scenario Formulation
Scenario performance analysis allows for measuring performance under varied rider privacy limitations.

Scenario 1
Location prediction with no additional information, i.e. drop-off community area (destination) prediction with only pick-up (origin) data, and vice versa. This accommodates riders with strict privacy concerns about releasing information, and measures the ability to predict trip start and end points under rider privacy restrictions.

Scenario 2
Location prediction with partial information, that is, drop-off community area (destination) prediction with pick-up data and Census Tract (destination zone) information, and vice versa. It is based on the idea of being able to predict trip start and end points under rider uncertainty.
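The two scenarios can be read as two choices of model inputs for the same target. The sketch below encodes one possible reading of them; the column names are hypothetical placeholders, and the pairing of census tract with each direction is an assumption rather than a detail confirmed by the text.

```python
# Column names are hypothetical placeholders for the dataset's fields.
SCENARIOS = {
    1: {  # no additional information: one endpoint only
        "forward": (["pickup_community_area"], "dropoff_community_area"),
        "reverse": (["dropoff_community_area"], "pickup_community_area"),
    },
    2: {  # partial information: endpoint plus census tract of the target zone
        "forward": (["pickup_community_area", "dropoff_census_tract"],
                    "dropoff_community_area"),
        "reverse": (["dropoff_community_area", "pickup_census_tract"],
                    "pickup_community_area"),
    },
}

def split_features(record, scenario, direction="forward"):
    """Return (input values, target value) for one trip record."""
    inputs, target = SCENARIOS[scenario][direction]
    return [record[name] for name in inputs], record[target]

# Hypothetical trip record.
record = {
    "pickup_community_area": 8,
    "dropoff_community_area": 32,
    "pickup_census_tract": "17031081500",
    "dropoff_census_tract": "17031320100",
}
features, target = split_features(record, scenario=2)
```

Keeping the scenario definitions in one table makes it straightforward to train and score the same model under each privacy setting.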
Steps and Methodological Process

Feature Rank Using RReliefF
The RReliefF algorithm estimates the quality of an attribute according to how well it discriminates between instances that are near each other. An instance R is randomly selected, then its K nearest instances are selected. For each attribute A, the difference between the value of A for R and the value of A for each of the K instances is compared with the difference between their class (target) values. This process is repeated and ultimately yields a weight for each attribute ranging between −1 and 1.
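The weighting procedure can be sketched as follows. This is a simplified illustration, not the implementation used in the study: it assumes numeric attributes and a numeric target, min-max scales both, finds neighbours by Manhattan distance in attribute space, and weights all K neighbours equally instead of by rank. The function name and parameters are illustrative.

```python
import random

def rrelieff(X, y, m=100, k=10, seed=0):
    """Simplified RReliefF weights (equal neighbour weighting is an assumption)."""
    rng = random.Random(seed)
    n, n_attr = len(X), len(X[0])

    def scale(values):
        lo, hi = min(values), max(values)
        span = (hi - lo) or 1.0
        return [(v - lo) / span for v in values]

    cols = [scale([row[a] for row in X]) for a in range(n_attr)]
    Xs = [[cols[a][i] for a in range(n_attr)] for i in range(n)]
    ys = scale(y)

    n_dc = 0.0               # accumulated difference in target value
    n_da = [0.0] * n_attr    # accumulated difference in each attribute
    n_dcda = [0.0] * n_attr  # accumulated joint difference
    for _ in range(m):
        i = rng.randrange(n)  # randomly selected instance R
        others = [j for j in range(n) if j != i]
        others.sort(key=lambda j: sum(abs(u - v) for u, v in zip(Xs[i], Xs[j])))
        for j in others[:k]:  # the K nearest instances
            d_y = abs(ys[i] - ys[j])
            n_dc += d_y / k
            for a in range(n_attr):
                d_a = abs(Xs[i][a] - Xs[j][a])
                n_da[a] += d_a / k
                n_dcda[a] += d_y * d_a / k
    # Weight = P(diff A | diff target) - P(diff A | same target), estimated.
    return [n_dcda[a] / n_dc - (n_da[a] - n_dcda[a]) / (m - n_dc)
            for a in range(n_attr)]

# Synthetic check: the target depends on attribute 0 only.
data_rng = random.Random(1)
X = [[data_rng.random(), data_rng.random()] for _ in range(200)]
y = [5.0 * row[0] for row in X]
weights = rrelieff(X, y)
```

On this synthetic data the informative attribute receives the larger weight, and every weight stays within the −1 to 1 range described above. The sketch does not guard against a constant target, which would make the first denominator zero.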

Cross Validation Model Evaluation and Scoring
The Leave-P-Out Cross Validation (CV) approach holds out "p" data points as the validation set, with the remaining n − p points being used for training. This process is repeated for all possible combinations of p points, with the error averaged over all trials in order to determine overall effectiveness.
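A minimal sketch of Leave-P-Out splitting follows, with p points held out for validation in each split. The mean-only baseline stands in for a real regressor purely to keep the example self-contained; the function names are illustrative.

```python
from itertools import combinations

def leave_p_out(n, p):
    """Yield (train_indices, validation_indices) for every way of holding out p of n points."""
    indices = range(n)
    for held_out in combinations(indices, p):
        held = set(held_out)
        train = [i for i in indices if i not in held]
        yield train, list(held_out)

def lpo_mae(y, p):
    """Average validation MAE of a mean-only baseline over all leave-p-out splits."""
    errors = []
    for train, val in leave_p_out(len(y), p):
        prediction = sum(y[i] for i in train) / len(train)
        errors.append(sum(abs(y[i] - prediction) for i in val) / len(val))
    return sum(errors) / len(errors)

splits = list(leave_p_out(5, 2))  # C(5, 2) = 10 splits
```

The number of splits is the binomial coefficient C(n, p), which grows very quickly with n, so exhaustive Leave-P-Out is typically only practical for small p or small samples.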
To measure the degree of error of the developed models, error metrics are used to judge model quality and compare the different regression models. The Mean Absolute Error (MAE), Mean Squared Error (MSE), and R-Squared (R²) are used for evaluation.
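The three metrics can be sketched in a few lines using their standard definitions. The variable names `sse_m` and `sse_r` denote the sums of squared errors about the mean line and about the regression line respectively; the sample values are illustrative only.

```python
def mae(y_true, y_pred):
    """Mean Absolute Error: average magnitude of the prediction errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mse(y_true, y_pred):
    """Mean Squared Error: average of the squared prediction errors."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def r_squared(y_true, y_pred):
    """R-Squared: 1 minus the ratio of regression-line SSE to mean-line SSE."""
    mean = sum(y_true) / len(y_true)
    sse_m = sum((t - mean) ** 2 for t in y_true)                 # about the mean line
    sse_r = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))    # about the regression line
    return 1.0 - sse_r / sse_m

# Illustrative values, e.g. community-area numbers.
y_true = [1, 2, 3, 4]
y_pred = [1, 2, 3, 5]
scores = (mae(y_true, y_pred), mse(y_true, y_pred), r_squared(y_true, y_pred))
```

Because the targets here are area identifiers, an MAE of 1 corresponds to being off by one community area on average.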
R² = 1 − SSER / SSEM, where SSEM is the sum of squared errors about the mean line and SSER is the sum of squared errors about the regression line.

Predictive Modelling using Ensemble Learning
Generally, ensemble learning is the term used to describe meta-algorithms that make predictions based on inputs from different models. By combining multiple individual models, the ensemble model tends to have lower bias and variance and to overfit less, culminating in improved predictions.
AdaBoost and Random Forest are among the most commonly used ensemble methods.
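To make the idea of combining models concrete, the sketch below shows bootstrap aggregation (bagging), the averaging step that Random Forest builds on. It is a deliberately reduced illustration: each base model is a one-split regression stump on a single numeric feature, and the per-split feature subsampling of a full Random Forest and the reweighting used by AdaBoost are omitted. All names and data are illustrative.

```python
import random

def fit_stump(xs, ys):
    """Fit a one-split regression stump; returns (threshold, left_mean, right_mean)."""
    best = None
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        if not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    if best is None:  # all x identical: fall back to predicting the mean
        mean = sum(ys) / len(ys)
        return (float("inf"), mean, mean)
    return best[1:]

def predict_stump(stump, x):
    t, lm, rm = stump
    return lm if x <= t else rm

def fit_bagged(xs, ys, n_models=25, seed=0):
    """Train each stump on a bootstrap resample of the data."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return models

def predict_bagged(models, x):
    """Ensemble prediction: the average of the base-model predictions."""
    return sum(predict_stump(m, x) for m in models) / len(models)

# Synthetic step-shaped data.
xs = list(range(20))
ys = [0.0 if x < 10 else 10.0 for x in xs]
models = fit_bagged(xs, ys)
```

Averaging over resampled models is what reduces variance; boosting methods such as AdaBoost instead fit models sequentially, giving more weight to previously mispredicted points.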

Principal Component Analysis (PCA)
To analyze the weights of the individual features within the data sample, PCA is performed, measuring the share of variance covered by each principal component. The cumulative variance explained rises as further attributes are included. Figure 3 describes the results obtained from the PCA. Results show that certain attributes within the dataset are able to explain 55.9% of the recorded variance, with 5 attributes able to explain 73.7% of the variance, and so on. This aids in selecting the most important data attributes, which in turn improves the models' prediction accuracy. The analysis reveals 9 attributes to be the optimal number of features to incorporate in building the models.
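The selection logic behind a cumulative explained-variance plot can be sketched as below. The eigenvalues are made-up numbers for illustration and do not reproduce the figures reported here; the function name and threshold are likewise assumptions.

```python
def components_for_variance(eigenvalues, threshold):
    """Smallest number of components whose cumulative explained variance reaches `threshold`."""
    total = sum(eigenvalues)
    running = 0.0
    # Components are taken in order of decreasing variance.
    for count, value in enumerate(sorted(eigenvalues, reverse=True), start=1):
        running += value
        if running / total >= threshold:
            return count
    return len(eigenvalues)

# Hypothetical eigenvalues of a covariance matrix.
eigenvalues = [4.0, 3.0, 2.0, 1.0]
needed = components_for_variance(eigenvalues, threshold=0.85)
```

Here the cumulative shares are 40%, 70%, 90% and 100%, so three components are needed to reach the 85% threshold.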
Feature Scoring and Rank
After PCA, the features within the dataset are ranked according to their influence on the prediction output. Figure 4 details the weight associated with each attribute used in designing the model, with some attributes being more critical to predictive performance than others.
RReliefF is used to rank and measure individual features by level of importance, as shown in Figure 4.

Re-Evaluation and Model Calibration
Scenario 1
Predicting drop-off community area (destination) with only pick-up (origin) data. Model results show that linear regression models are able to predict potential drop-off areas within a radius of 13 blocks (community areas). This is in the absence of any information other than the pick-up point (origin).
The results are shown in Figure 5.
Implications for the ride-hailing industry include:
1) Improved vehicle utilization and time efficiency.
2) Reduced dead mileage and idle time after trips.
3) Improved rider and driver experience.

Conclusions
In conclusion, results from the research indicate that predictive modelling and analytics can be used to improve driver positioning and deployment by predicting surge zones before they occur, irrespective of rider privacy settings.
The implications of these results for the transport industry include:
• Reduced incidence of surge pricing and increased rider satisfaction.
• Reduced transport costs.
• Easier parking, particularly in high-demand (downtown) areas.
• From a social and environmental point of view, fewer wasted miles would translate into lower emissions overall.