Prediction of Accident Severity Using Artificial Neural Network: A Comparison of Analytical Capabilities between Python and R

Large amount of data has been generated by Organizations. Different Analytical Tools are being used to handle such kind of data by Data Scientists. There are many tools available for Data processing, Visualisations, Predictive Analytics and so on. It is important to select a suitable Analytic Tool or Programming Language to carry out the tasks. In this research, two of the most commonly used Programming Languages have been compared and con-trasted which are Python and R. To carry out the experiment two data sets have been collected from Kaggle and combined into a single Dataset. This study visualizes the data to generate some useful insights and prepare data for training on Artificial Neural Network by using Python and R language. The scope of this paper is to compare the analytical capabilities of Python and R. An Artificial Neural Network with Multilayer Perceptron has been imple-mented to predict the severity of accidents. Furthermore, the results have been used to compare and tried to point out which programming language is better for data visualization, data processing, Predictive Analytics, etc.


Introduction
The government of UK collects and publishes information regarding traffic accidents across the country usually every year. This information contains aspects such as the severity of accidents, number of casualties, road conditions and so on, making the data sets quite interesting and comprehensive for analysis and cidents are most likely to happen. Officials could use such information to direct safety efforts at the stretches of road where the impacts could be the biggest. In the context of this research, the UK government could use the information generated from a prediction system with a Neural Network predicting the accident severity and use this information to enhance the laws to build safer roads for the future.

The Chosen Analytics Technique
ANN is selected because ANN is an adaptive system that can change its structure based on external or internal information flowing through the Network [5].
From several studies, it is found that Neural Network proved to be a better predictor of accident severity. The choice is further justified from a research work of [6] where the researchers used ANN and DT to find the injury severity resulting from traffic accidents. Actual data from the National Automotive Sampling System (NASS)-General Estimates System (GES) were used. They used three Machine Learning models including ANN, DT and a Hybrid ANN, the results showed that the Hybrid model was the best and ANN also performed better than DT. Besides, most of the similar research work involved the use of ANN as found from the works of Abdelwahab (2001) [7], Zeng (2014) [8], Delen (2016) [9] and so on. From these works, it could be concluded that the choice of the Analytic Technique is appropriate and it could be used for strong research.

Programming Languages
In a recent worldwide survey, it was found that 83% of 24,000 data professionals used Python. Data Scientists consider Python because it is a general-purpose and dynamic programming language. In comparison to R, Python is found to work better than R with iterations less than 1000 and some consider it better than R for data manipulation tasks. It contains ideal packages for Machine Learning tasks and is inherently an Object-Oriented Programming Language [10]. In contrast, millions of analysts and data researchers use R Programming Language to handle their most difficult issues in the fields running from Computer Science to extensive marketing. R has become an important tool for organisations driven by Data Analytics. Some examples of companies using R include; Google, Facebook and Linkedin. R is considered as the best mechanism for Statistics, data analysis and Machine Learning. It is more than a statistical package as it allows for the creation of objects, functions and packages [11].

Project Aim and Objectives
This project aims to compare the analytical capabilities of Python and R. To reach the aim, two large data sets will be merged to form a single data set. Then data analysis will be performed on the data set to generate useful insights from the data. An Artificial Neural Network with Multilayer Perceptron will be implemented to predict the severity of accidents and finally, the results will be used to compare and contrast between the two chosen Programming Languages.

Artificial Neural Network
Machine Learning is an evolving branch of a set of Algorithms designed to copy the intelligence level of human beings by learning from their surroundings. These Algorithms are regarded as the key components of this era of Big Data. Techniques based on Machine Learning have been applied successfully in various fields ranging from computer vision, spacecraft engineering, entertainment and so on [12]. Machine Learning is required for tasks that are quite complicated for humans to code directly. Therefore the large sources of datasets or Big Data are provided to the Machine Learning Algorithm. It could then explore the data and complete tasks which it is programmed to perform [13].
Machine Learning is of three types which are supervised, unsupervised and reinforcement learning. In Supervised Learning, the Algorithms are provided with data that contains labelled examples. Then Algorithms are trained on these datasets. After the completion of the training process, these Algorithms could predict the outcomes when given new unseen data. Applications of Supervised Learning Algorithms include face recognition, Spam Classification and predicting Advertisement Popularity. In Unsupervised Learning Algorithms are provided with unlabelled data and such algorithms analyse the data to group them in such a way that a human being could make sense of this newly organised data. Applications of Unsupervised Algorithms are Recommendation Systems, gene-rating Buying Habits and so on. On the other hand, in Reinforcement Learning an agent learns from an environment and is provided feedback depending on how correct the agent has learned from the data. Applications of Reinforcement Learning are Video Games, Industrial Simulations and so on [14].

How Artificial Neural Network Works
Inspired by the capabilities of the human brain for its incredible processing capabilities due to interconnected neurons. Artificial Neural Networks (ANN) are designed by processing units known as Perceptrons. These Perceptrons consists of one layer and they can solve linearly separable problems. However to solve non-linear problems Multilayer-Perceptrons are used which contains usually three layers which are; an input layer, one or more hidden layers and an output layer [15].
The building block of an Artificial Neural Network is a neuron. These are simple computational units consisting of weighed input signals that produce an output signal by an activation function. Each neuron has a bias which can be thought of as an input that always has the value 1 or 0 and it must also be weighed. Weights are often initialised to small random values, ranging between 0 to 0.3, however, more complex initialisation schemes could be used. The weighted inputs are added and passed through an activation function also known as the transfer function. The activation function is a mapping of the summed input to the output of the neuron. There are different types of activation functions used for specific purposes [16].
From the research work of Abdelwahab [7], the concept of Multilayer-Perceptron is illustrated in Figure 1. Figure 1 consists of one input layer, one hidden layer and an output layer. In the figure displayed the input layer has K nodes and a bias node denoted as (Node 0). The hidden layer has J nodes and a bias node (Node 0). The output layer is denoted as I nodes with no bias node. The connections in the entire network are of feedforward type. This means that the connections are allowed from a layer of a certain index to layers of a higher index. It is also seen from the Figure that no connections are permitted from a layer of a certain index to a layer of lower index (i.e., feedback connections). In addition, no connections are allowed among the nodes belonging to the same layer. The MLP network can operate in the training and testing phase. In the training phase, given a set of training data . The Backpropagation Algorithm is used to train the MLP. A simple representation of the Algorithm is illustrated in Figure 2.
From the figure it could be considered that output of MLP is equal to ( )   The aim is to change the weight vector W so that the function shown above is minimised. By reducing the error function the actual output is taken closer to the desired output where it is assumed that Equation (1) can be differentiated, renowned optimization techniques could be applied to perform the minimisation task. One such technique is the gradient descent technique that changes W by an amount that is proportional to the negative gradient given by: where: The equations that describe the change of weights are given as follow: = error term associated with output node i due to presentation of input pattern p, = error term associated with hidden node j due to presentation of input pattern p, • ( ) j y p = output of hidden node j due to presentation of input pattern p, and

Advantages and Limitations of Artificial Neural Network
Neural Networks require limited formal statistical training to be developed. And it could be developed by newcomers given that they are provided with appropriate datasets and software. Datasets on which it is possible to apply other techniques such as Logistic Regression, Artificial Neural Network could also be applied to the same dataset. Also, Neural Networks could be trained using inputs and out- Algorithms which could be used to generate better results [17].
On the other hand, there are some drawbacks of Artificial Neural Networks.
Firstly, Artificial Neural Networks are dependent on hardware. These networks need processors with parallel processing capabilities. As a result, a mid to high-end Central Processing Unit is desirable for implementing such a technique. Another problem of a Neural Network is that the behaviour of such a network is unexplained. When an Artificial Neural Network generates a probing result, no indication or reasons are generated resulting in a lack of trust in implementing such a technique. It is also quite difficult to determine the ideal structure of a network.
There are no set of rules which could be used to find the best structure. When building a Neural Network experience is required, for novices a painstaking process of trial and error is inevitable. Appropriate data preprocessing is required as Neural Networks are sensitive to feature scaling. This means the training and testing data of feature matrices have to be transferred to a range of values before training the Neural Network. In addition, the network is reduced to a certain error value when training finishes. The produced value does not provide optimum results [18].

Use of Artificial Neural Network in Industries
The selected Advanced Analytics Technique has a broad range of applications in real-world business problems. Currently, it has been successful in a number of industries. Neural Networks are ideal for identifying patterns or trends in data. Hence, it is suited for prediction or forecasting requirements that include: Sales forecasting, Prediction Control in industries, Customer Research and so on [19].
Neural Networks have been applied with success in Banking and Finance to solve problems like derivative security pricing and hedging, future price and exchange rate forecasting and stock performance. Currently, Neural Networks are being used as the underlying technology for decision making. It is an ongoing research area in Medicine and it is believed that Neural Networks will be used in a wide-scale in biomedical systems in the near future. Nowadays, the research is mostly conducted modelling body parts of human beings and recognising diseases from different types of scans [20].
Neural Networks are used in research to simulate the human cardiovascular systems which are used to compare real-time physiological measurements taken from patients. If such a process is carried out on a timely basis then potential harmful medical conditions could be detected at an early stage thus making the process of curing the disease much easier. ANNs are currently used experimentally to build electronic noses as they have the potential to be used in several applications in telemedicine. As the sense of smell is important for surgeons, collection and generation of patients sense of smell could be made possible by Neural Networks. Neural Networks are also used in Credit Evaluation as well. A company named HNC has developed several applications using Neural Network one of them is a Credit Scoring System used to increase the profitability of the current system by 27% [19].
Neural Networks could be used in Marketing where the tasks involve market segmentation where the market is divided into different groups of customers with different consumer behaviours. In these cases, a Neural Network could be trained to perform this task. Furthermore, Neural Networks are also used in Retail and Sales. As they can consider different variables at the same time such as market demand for products, customer's income, product price and so on. Supermarkets could use a Neural Network to predict future sales [20]. Journal of Data Analysis and Information Processing

Designed ANN for the Experiment
The designed ANN architecture consists of three layers; an input layer, a hidden layer and an output layer as shown in Figure 3.
The input layer consists of the number of neurons that are equal to the number of input features of the dataset where "I" denotes the input layer neuron with "I_1" the first input feature where "n" is the last neuron. "H" denotes the first input neuron of the hidden layer and the same illustration is used to denote the input neurons. The output layer contains three neurons as the Network will predict three different outcomes. The neurons are shown as "O" with an underscore followed by the number of the neuron.
The neurons in the input and the hidden layers will be activated using the ReLu Function. The ReLu function is expressed as shown in Equation (5). From the shown equation it is seen that this function gives an output of "x" given that ReLu function is also useful in cases where a Network has weights randomly initialised and about 50% of Network results in 0 activation due to the features of ReLu. It means that the lower number of Neurons would be activated resulting in a lighter Network [21]. Equation (3) shows the ReLu function.
The Neurons of the output layer will be activated using the Softmax function.

Design of ANN for the Experiment
After creating and saving the dataset, the selected Programming Languages will then be used to find descriptive statistics about the dataset, data visualisation and training of Advanced Analytic Technique. Then the data will be preprocessed for implementing the Artificial Neural Network.
For Data Analysis and processing in Python, Numpy and Pandas will be used as Numpy is a package used for Data Analysis and Pandas library provides high-level data structures and methods to ease the data analysis process. The Matplotlib library is chosen as it provides quality plotting functionalities [24]. In addition, the Seaborn library is will be used which is built on top of Matplotlib providing attractive plotting options of statistical graphs [25]. For the Data Analysis and Visualisation part in R, "gglot2" library is chosen as it is over 10 years old and used by many users and organisations to generate millions of plots. It is a library that has methods to generate high-quality graphs in R [26]. To efficiently manipulate the used dataset the "dplyr" library is used as it provides a collection of tools for efficient data processing. It increases efficiency and it provides methods such as filter, group_by and so on [27].
For conversion of Categorical Features into numbers, suitable functionalities of Python and R have been used. For Python Pandas library is used whereas R in-built functions such as "data.matrix" and "as.numeric" functions have been used. As it is found from the research that one of ANNs drawbacks is that it requires scaled features as input data. To overcome this issue the most commonly used scaler called Standard Scaler is used. This method assumes that the data is normally distributed within each feature and will scale them such that the distribution gets centred around 0 with a standard deviation of 1 [28]. The formula is shown in Equation (7) where x is the input value, µ is the mean of column values and σ is Standard Deviation for a particular column [29].
The dataset will be divided into a training set and a test set with training set being 80% and the rest for the test set. In Python, the data splitting and scaling has been performed using the scikit-learn library. The library provides the implementations of all the well-known Machine Learning Algorithms and steps needed for data processing and performance evaluation of Algorithms [30]. This library is used to split, process and compute the confusion matrix of the implemented Advanced Analytics Technique. In R external libraries will not be required to split the data as R has built-in matrix handling options. For scaling the data in R the "scale" method is used which implements the task of standard scaler [31]. For implementing the Neural Network the Keras library will be used.
Keras is a high-level Neural Network API written in Python capable of running on top of other libraries such as TensorFlow, Theano or CNTK and it focuses on fast experimentation [32]. The same library is also used in R as the package provides an interface to Keras from within R [33].

Performance Evaluation Methods
To evaluate the performance of the trained ANN, the confusion matrix will be used which is a performance measuring tool for classification problems. It is a  Table 1.
From the metric, the prediction accuracy will be calculated for the ANN trained using both Programming Languages as shown in the Equation (8).

TP TN Accuracy TP FP FN TN
As mentioned previously there are many libraries required in Python to perform different tasks such as data pre-processing and Machine Learning. To ensure that all these libraries are of the same version and in a fixed directory, the Anaconda Software will be used which is free software [36]. Installation of Anaconda provides all the required libraries for Data Science and even some IDE are also provided in the distribution. For the IDE the Scientific Python Development Environment (Spyder in short) is used for Python. This IDE is provided for free with Anaconda distribution and it is a strong environment for Scientific computing which is also written in the Python Language designed for scientists and engineers. It provides all the standard functionalities of IDEs such as debugging, interactive interface and so on. But it has unique features for Linux. It has standard features such as syntax highlighting, code completion and so on. For Data Science it has data viewer, it integrates the tools used in R with a single environment, provides stunning Data Visualisation capabilities and so on [38]. The code of Python and R are provided in Appendix G-K and the Computer Configuration used for the experiment is provided in Appendix L in the Appendices Section.

Potential Contribution of ANN in the Chosen Context
With the use of Artificial Neural Network, it could be possible for the authorities working for the Government of the UK to get information on the type of accident that could occur in future. The trained Neural Network could be used as part of a software implementation where users could be able to input certain values such as day of the week, time, road condition, age of drivers, type of vehicles and so on. Then the software system could be able to generate future outcomes. Based on the results produced, the Government could enforce or update existing laws such as setting up new speed limits and so on.

Framework for the Evaluation of Python and R
1) Cost: Cost is an important factor for any Organisation wanting to develop software products for deployment. It would be ideal for companies to use Analytical Tools that are less expensive or free so that they could save monthly or annual costs for paying subscription fees.
2) Ease of Use: Having a user-friendly interface and writing fewer lines of codes could be beneficial. It would increase productivity and will allow developers to produce a system or analyze data much faster.

3) Available Libraries: A variety of libraries is a must for any Programming
Languages. Having access to suitable libraries would enable analysts, scientists and developers perform different tasks according to business needs.

4) Visualisation:
In order to make data meaningful for high-level authorities such as CEOs and Managers and so on the proper visualisation of data, is crucial. The visuals must be clear enough for the Management of a corporation to reach a business decision.

5) Community Support: It is not always possible for Business Analysts, Data
Scientists and Engineers to perform their tasks instantly. They could make coding errors or inexperienced employees could require to learn new concepts. In such cases, it is ideal to have good community support for Analytics Tools to resolve such issues.
6) Language Unity: Programming Language versions are constantly being updated. It is often found that major changes are seen in a change in method names for instance. A language that does not change the names of class or methods could be useful as users do not have to go through the hassle of going through new documentations.

Evaluation of Results
In this section, the results generated from Python and R will be evaluated based on the Analytical Framework developed in the previous section.

Cost
In this experiment two most commonly used Programming Languages for Data Science have been used which are Python and R. It has been already mentioned that both are free. For both languages, suitable IDEs were downloaded which were free editions and there are Professional versions available. Organizations could use either one of these Analytical Tools when Cost is a consideration.

Ease of Use
From a personal perspective, In terms of ease of use, it is found that the Syntax of both the Languages was quite simple which often seemed as writing English sentences. However, it could also be stated that users who are familiar with at least one other Programming Language such as C, C++ or Java will find it easy to learn Python and R but for beginner Python's. The syntax would be much easier and the claim is further justified by a post of [39].

Available Libraries
The libraries used for Python are Numpy and Pandas for Data Handling, scikit-learn for data pre-processing and displaying the results of the Confusion Matrix and Matplotlib and Seaborn for Data Visualisation. Whereas, data handling libraries are not required in R and for visualisation, the ggplot2 library was used along with deplyr for additional features such as filtering, grouping data and so on. For implementing ANN Keras was used for both Languages. It could be concluded that both Languages have adequate libraries. This could is further confirmed by a report published by [40] in TechRepublic where the author mentioned that it is a tie between Python and R when it comes to available libraries.

Statistical Correctness
It is safe to say that R is better as fewer lines of codes are required to generate useful statistics as shown in Figure A1 and Figure A2. Although the results generated in Python looks worse compared to R however, the results could be improved by using different IDEs rather than Spyder and it could be matched with R's results but it would take stronger logic and more lines of code making R the better choice for generating statistics. Also, it could be further assured from a post of [41] where it is found that fewer lines of code are required in R to generate Statistics.

Visualisations
To represent the data different visualisation techniques have been used in this research. The main goal is to implement a Machine Learning model which in this case is an Artificial Neural Network. The task of this model is to predict the Accident Severity depending on certain features such as Year, Weather Conditions, Road Conditions and so on. Therefore, the visualisations have been generated by showing the distribution of Accident Severity which are of three types; "Slight", "Serious" and "Fatal". Firstly the Pie Charts of the distribution of Accident Types in the data set and accidents distribution in different areas are shown in Figure A3 and Figure A4. Then Bar Charts have been generated shown in Figure A5 and Figure A6. Finally Line graphs are shown in Figure A7 and Figure A8 in Appendices Section produced by Python and R. From the results produced it is illustrated that results of Python represent better graphics than R. This is due to the seaborn library of Python.

Machine Learning
The target variable Accident Severity has three values. During model training the target values of Severity classes "Fatal", "Severe" and "Slight" were converted to "0", "1" and "2". From the result of the Confusion Matrix produced in Python as shown in Figure A11 the total number of observation is 340,828 among which correct predictions for "Fatal", "Serious" and "Slight" classes are; 1767 and 292,220 making a total number of correct predictions equal to 292,988. Hence giving an Accuracy of 85.96% whereas in R shown the total correct observation is 292,944 and the total number of predictions are 340,890 thus the Accuracy is 85.94%. The Confusion Matrix produced by R is shown in Figure A13 of Ap-pendices.

Community Support
Considering R, it has large community support that comes from mailing-lists, documentation contributed by users, active Stackoverflow members and so on.
On the other hand, Python also has strong community support found from the mentioned list of supporting sources that are available in R [41].

Execution Time and Memory Consumption
To carry out the tests two separate Script files were created using the Languages. One script was created to perform Data Visualisation and on the other script code for ANN implementation is written. The results show that for Data Visualisation R is faster whereas Python is faster for training the ANN. In contrast, R has a lower memory consumption compared to Python when comes to both Data Visualisation and training the Machine Learning Algorithm. The results are shown in Figure A9, Figure A10, Figure A12, and Figure A14 in the Appendices Section.

Conclusions and Future Recommendations
In this research, a comparison has been made between Python and R which are the most commonly used Programming Languages used by Data Scientists. For this research, the context has been set to predict the Severity of Accidents from the data published by the Government of the UK using ANN. Two large sources of Big Data have been used and combined into a single Dataset. After carrying out the tests it could be concluded that Python was marginally better at Data Visualisation and was more accurate at predicting the outcomes with faster execution time. On the other hand, it was easy to generate Descriptive Statistics in R and overall memory consumption of R was lower than Python. In terms of Costs and Community Support both were on par, however, when it comes to ease of use and Machine Learning tasks Python shines as beginners should find Python easy to learn and due to the fantastic Library in Python called scikit-learn data processing, implementing algorithms and assessing model performance are much easier in Python than R. Hence, from this research, Python is slightly ahead in most of the tasks performed in this research project.
The results of Data Visualisation show that in this dataset there are values which are "Data Missing or Out of Range", for a more conclusive research, a larger dataset with added features such as "Name of Road", "Street Number", "Vehicle Model" and more is required with appropriate values in the columns. For an extension to the current research, instead of using a Machine Learning model, it could be more reasonable to use a Deep Learning model instead to test the Analytical Capabilities of both Python and R. without their help it would not have been possible to complete this work. First I would like to thank Mr. Liton Hossain for selling and setting up the entire Computer System used for this research work. Then I would like to thank Thanasis for publishing the datasets in Kaggle and Britta Bettendorf who published her work in Kaggle from which I have borrowed the concepts of combining the Datasets, the Government of the United Kingdom for making the data publicly available. Finally, I would like to thank Mrs. Tanzila Islam for her guidance