Big Data Analytics of Taxi Operations in New York City

New York City (NYC) is a global financial center, and its transportation system has been studied from many angles. Since 2009, the NYC Taxi and Limousine Commission has made information on NYC taxi operations public, offering an opportunity for detailed analysis. This research project therefore investigates taxi operations in New York City using big data analysis. The paper discusses the correlation between taxi operations and different types of weather, including precipitation, snow depth, and snowfall. It also evaluates the distribution of taxi trips across NTA areas using Geopandas and presents trip density on an NYC map.


Introduction
As a global financial center, New York City is frequently studied by researchers, and its transportation has become an increasingly important topic. The large amount of transportation data released by the NYC Taxi and Limousine Commission makes more sophisticated analysis possible. Using big data analysis to study taxi operations in New York City, this paper explores statistics on taxi payment types, the daily and monthly trends of taxi operations, their long-term trend, and the impact of weather. To process the data, econometric methods are used to obtain comprehensive results.  [3]. With graphs and diagrams, Schneider concluded that people in Queens prefer to use Uber more than those in Manhattan do.
In this paper, the impact of different types of weather, including precipitation, snow depth, and snowfall, is evaluated to identify factors that affect taxi operations. In addition, the distribution of taxi trips across NTA areas is studied using Geopandas, and trip density is shown on a plotted NYC map.

Data and Methods
This research uses data on taxi operations from 2009 to 2015 from the NYC Taxi and Limousine Commission, with a focus on the most recent data, from 2015 [4]. Weather data is taken from observations at Central Park in NYC. Geometric information is obtained from a map of NYC.
The analysis is written in Python, and the code cells are run in a Jupyter notebook. Numpy, Pandas, Geopandas, and Matplotlib are used to process the data. Specifically, Numpy and Pandas are used to analyze array and data frame data, Geopandas is used for geometry data, and Matplotlib is used to plot graphs. Linear regression and linear algebra are later used to determine the functional relation between the selected variables.

Data Analysis and Description
First, basic information on taxi operations in January 2015 is studied. Several columns relevant to the topic, such as pickup time, are selected, and the raw data is read into the Jupyter notebook using "read_csv". The data is grouped by day, since daily trips are the main target. The result, plotted in Figure 1, shows the total pickups for each day in January and reveals fluctuations in daily trips, especially on 27 January, when the number of trips drops sharply by around 150 thousand before rising to 500 thousand on 30 January. Figure 1 shows the total number of daily trips on the y-axis and the days in January on the x-axis.
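The column selection and daily grouping described above can be sketched as follows. This is a minimal illustration, not the notebook code used in the study: the column name "tpep_pickup_datetime" is assumed from the published trip-record files, the helper name daily_pickups is hypothetical, and the three sample rows are made up so that the block runs without the real 2 GB file.

```python
import io

import pandas as pd


def daily_pickups(csv_source, time_col="tpep_pickup_datetime"):
    """Count trips per calendar day from a trip-record CSV.

    `time_col` is an assumed column name; adjust it to the actual file.
    """
    # Read only the pickup-time column to keep memory use low.
    trips = pd.read_csv(csv_source, usecols=[time_col], parse_dates=[time_col])
    # Group by calendar date and count the rows (one row per trip).
    return trips.groupby(trips[time_col].dt.date).size()


# Made-up sample rows standing in for the real January 2015 file.
sample = io.StringIO(
    "tpep_pickup_datetime,passenger_count\n"
    "2015-01-26 08:15:00,1\n"
    "2015-01-26 17:40:00,2\n"
    "2015-01-27 09:05:00,1\n"
)
daily_counts = daily_pickups(sample)
```

With the real file, passing its path instead of the in-memory sample and calling daily_counts.plot() would give a chart of the same kind as Figure 1.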
Next, the average number of trips per hour is computed: the January trips are grouped by hour of the day, and each count is divided by 31, the number of days in January. This gives the average number of trips for each hour in January. As shown in Figure 2, which plots the number of trips against the hour of the day, the number of taxi trips is lowest at 5 a.m. It then rises over the day and decreases gradually again from 8 p.m. The number of trips increases sharply from 6 a.m. to 10 a.m., probably because people begin to leave for work or need to move across the city. The number of taxi trips peaks at around 5 p.m., when people leave their workplaces and go home.
American Journal of Operations Research
Following the analysis of taxi operations in January 2015, data for the whole year is studied. Taxi trip data from February to December 2015 and the weather data are each read into the Jupyter notebook. After the daily trip counts are selected and combined into a data frame, linear regression is applied to show the relationship between snow depth and the number of daily trips in 2015. Figure 4 plots daily trips (y-axis) against snow depth (x-axis) for 2015 and shows that daily trips increase slightly as snow depth rises. This result, which is opposite to that of the analysis in January 2015, signals that the previous statistics do not represent the situation of the whole year. The reason could be that January is the coldest month of winter and differs greatly from other seasons. Based on the overall trend in 2015, more people appear to need taxi rides when snow depth goes up.
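The regression step can be sketched as below. The paper does not state which routine was used, so this example assumes an ordinary least-squares line fitted with numpy.polyfit. The column names "trips" and "snow_depth", the helper name fit_line, and all values are hypothetical and chosen only to make the block runnable; they are not results from the study.

```python
import numpy as np
import pandas as pd


def fit_line(x, y):
    """Fit y = slope * x + intercept by ordinary least squares."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, deg=1)
    return slope, intercept


dates = pd.to_datetime(["2015-02-01", "2015-02-02", "2015-02-03", "2015-02-04"])

# Made-up daily trip counts and snow depths (illustrative values only).
daily = pd.DataFrame({"date": dates, "trips": [410, 400, 395, 380]})
weather = pd.DataFrame({"date": dates, "snow_depth": [0.0, 1.0, 2.0, 4.0]})

# Combine the two sources into one data frame keyed on the date.
merged = daily.merge(weather, on="date", how="inner")

# The sign of the slope indicates whether trips rise or fall with snow depth.
slope, intercept = fit_line(merged["snow_depth"], merged["trips"])
```

The same helper applies unchanged to snowfall or precipitation by swapping the weather column passed as x.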
Using the same method, linear regression is applied to test the relationship between snowfall and the number of daily trips in 2015. Figure 5, which plots daily trips (y-axis) against snowfall (x-axis) for 2015, shows that the number of trips falls slightly as snowfall increases, which is opposite to the result of the regression between snow depth and daily trips. A possible reason for the decrease is that people prefer to stay at home instead of going out on snowy days. The difference between the two results is probably because snow stays on the ground for a long time in the city and therefore has less impact on people's lives than snowfall.
Another linear regression is done to test the relationship between daily trips and precipitation in 2015. In Figure 6, the x-axis represents precipitation, while the y-axis represents the number of daily trips in 2015. As illustrated in the graph, the number of daily trips decreases gently as precipitation increases. The reason is probably that people choose to stay at home instead of going out on rainy days unless necessary.
In Figure 7, the x-axis represents the 12 months of the year and the y-axis represents the monthly number of trips. The graph shows the general trend of monthly trips during 2015.
In a similar manner, it is found that the trend of average hourly trips throughout the year is almost the same as that in January. Figure 8 plots the average number of trips against the hour of the day.
In Figure 8, the y-axis represents the average number of trips in 2015, while the x-axis represents the 24 hours of the day.
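The hourly average can be sketched as follows. This is an illustration under stated assumptions rather than the study's code: the helper name average_hourly_trips is hypothetical, the timestamps are made up, and the divisor is the number of distinct days found in the data, which matches dividing by 31 for a complete January file.

```python
import pandas as pd


def average_hourly_trips(pickup_times):
    """Average number of trips for each hour of the day (0 to 23)."""
    times = pd.to_datetime(pd.Series(pickup_times))
    # Number of distinct calendar days covered by the data.
    n_days = times.dt.normalize().nunique()
    # Count trips in each hour of the day, pooled over all days.
    per_hour = times.groupby(times.dt.hour).size()
    # Hours with no trips get a count of zero before averaging.
    return per_hour.reindex(range(24), fill_value=0) / n_days


# Made-up pickup times covering two days.
hourly = average_hourly_trips([
    "2015-01-01 05:10:00",
    "2015-01-01 17:00:00",
    "2015-01-02 17:30:00",
    "2015-01-02 17:45:00",
])
```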
Taxi payment data for 2015 is also broken down by day of the week to analyze how payments vary across the week. Figure 9 shows the number of daily trips by day of the week. According to the graph, the number of trips peaks on Friday and is lowest on Sunday, possibly because people spend their weekends at home. Graphically, the weekday trend in payments is similar to the trend in trip counts in 2015; however, the highest taxi payments occur on Thursdays, while the largest number of trips occurs on Fridays.
In Figure 9, the x-axis represents the seven days of the week, and the y-axis represents the number of daily trips.
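The weekday breakdown of trips and payments can be sketched as follows. The column names "tpep_pickup_datetime" and "total_amount" are assumed from the published trip-record files, the helper name weekday_summary is hypothetical, and the sample rows are made up for illustration.

```python
import pandas as pd

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def weekday_summary(trips, time_col="tpep_pickup_datetime",
                    fare_col="total_amount"):
    """Trip count and total payment for each day of the week."""
    day_names = trips[time_col].dt.day_name()
    summary = trips.groupby(day_names).agg(
        trips=(fare_col, "count"),   # number of trips with a recorded payment
        payment=(fare_col, "sum"),   # total payment on that weekday
    )
    # Put the rows in calendar order; weekdays without data become zero.
    return summary.reindex(WEEKDAYS, fill_value=0)


# Made-up sample: 1 January 2015 was a Thursday, 2 January a Friday.
sample = pd.DataFrame({
    "tpep_pickup_datetime": pd.to_datetime([
        "2015-01-01 09:00:00",
        "2015-01-02 10:00:00",
        "2015-01-02 18:30:00",
    ]),
    "total_amount": [30.0, 10.0, 5.0],
})
by_weekday = weekday_summary(sample)
```

In this made-up sample, Thursday has the highest total payment while Friday has the most trips, which shows how the two measures can peak on different days.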
The location of taxi trips is then studied. A map of NYC is read into the Jupyter notebook and converted to Geopandas format so that Python can analyze it. To find the number of trips in each NTA area, a function is written to determine which NTA area each trip belongs to. The result is shown in Figure 10, which illustrates the number of taxi trips in the 195 NTA areas. The density of each area is obtained by dividing the number of trips in the area by its size. The graph is drawn on the NYC map, and the darkness of the color indicates the density of taxi trips. Figure 10 shows the trip density and distribution from 8 a.m. to 10 a.m. in January 2015 in NYC. According to the graph, Manhattan and the airport are the busiest areas during rush hours. This suggests that taxi companies could allocate more taxis to these areas to meet passenger demand.
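One way to assign trips to areas and compute density is a spatial join, sketched below. The paper only says that a function selects the NTA area for each trip, so the use of geopandas.sjoin here is an assumption, as is the "predicate" keyword (available in recent Geopandas releases; older ones use "op"). The two square polygons, the column name "nta_name", and the pickup coordinates are hypothetical stand-ins for the real NTA map and trip records, given in arbitrary planar units.

```python
import geopandas as gpd
import pandas as pd
from shapely.geometry import Polygon

# Hypothetical stand-in for the NTA map: two rectangles in planar units.
ntas = gpd.GeoDataFrame(
    {"nta_name": ["A", "B"]},
    geometry=[
        Polygon([(0, 0), (2, 0), (2, 2), (0, 2)]),  # size 4
        Polygon([(2, 0), (3, 0), (3, 1), (2, 1)]),  # size 1
    ],
)

# Made-up pickup coordinates; the last point lies outside both areas.
pickups = pd.DataFrame({
    "x": [0.5, 1.5, 1.0, 2.5, 5.0],
    "y": [0.5, 1.5, 1.0, 0.5, 5.0],
})
points = gpd.GeoDataFrame(
    pickups, geometry=gpd.points_from_xy(pickups["x"], pickups["y"])
)

# Attach to each pickup the area that contains it; unmatched points drop out.
joined = gpd.sjoin(points, ntas, how="inner", predicate="within")

# Trips per area, then density = trips divided by the size of the area.
trip_counts = joined.groupby("nta_name").size()
ntas["trips"] = ntas["nta_name"].map(trip_counts).fillna(0).astype(int)
ntas["density"] = ntas["trips"] / ntas.geometry.area
```

With Matplotlib installed, ntas.plot(column="density", legend=True) would shade each area by density, in the manner of Figure 10. With real longitude and latitude data, the geometries would need to be projected to a planar coordinate system before the area is computed.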

Conclusions
This paper analyzes basic information on taxi trips in New York City in 2015. The trends in daily trips and average hourly trips in January are studied first. The impact of weather on the number of trips is then discussed, using linear regression to find the relationship between snow depth, snowfall, precipitation, and the number of trips. The impacts of snowfall and snow depth are then compared. Furthermore, the monthly trend, weekday trips, hourly average trips, and weekday payments are examined based on the 2015 data. In addition, the distribution of taxi trips across NTA areas is discussed, and their density is shown on an NYC map. Finally, the trips from 8 a.m. to 10 a.m. in Figure 10 show that certain areas have a greater need for taxis. It is hence meaningful for taxi companies to study the trend and redistribute taxis in the city to meet passenger demand.
One limitation of this research is that processing such a large volume of data (2 GB for one month) takes a long time. Running a single cell on a year's data takes an afternoon, which greatly restricts the amount of data that can be used in the research process. Another limitation is the lack of visualization. The results are mostly shown as static graphs rather than animations, which would allow readers to understand them more comprehensively and directly.

Conflicts of Interest
The author declares no conflicts of interest regarding the publication of this paper.