Analysis of the Resolution of Crime Using Predictive Modeling

There has been evidence of crime in the US since colonization. In this article, we analyze the crime statistics of San Francisco and the resolution of crimes recorded from January to September 2018. We define resolution of crime as the target variable and study its relationship with the other variables. We build several classification models to predict resolution of crime using several data mining techniques and suggest the best model for predicting resolution.


Introduction
On a daily basis, residents of all kinds in the United States are affected by crime. Crime rates vary over time; they reached their peak between the 1970s and early 1980s. According to the FBI [1], there are two types of crime in the USA, namely violent crime and property crime. Crimes such as murder, manslaughter, and rape are classed as violent crime, whereas crimes such as burglary, larceny, and vehicle theft are classed as property crime.
In order to implement law and order effectively, one must analyze the crime statistics and keep the number of unsolved crimes as low as possible.
In this article, we analyze the crime statistics of San Francisco and the resolution (resolved or not resolved) of crimes recorded from January to September 2018. We define resolution of crime as the target variable and study its relationship with the other variables. We build several predictive models to predict "Resolution of crime" using several machine learning techniques and suggest the best model (or models).
Several authors have defined machine learning in their own way. One common definition is that machine learning is the technology used to develop computer algorithms with the ability to imitate human intelligence. It draws on ideas from different fields such as Computer Science, Information Theory, Statistics and Probability, Artificial Intelligence, Psychology, Control Theory and Philosophy [2] [3] [4].
Choosing which type of model to apply to a machine learning task in order to make a precise prediction is a challenging question. Every model has some merits and demerits [5], and it can be difficult to compare the relative merits of the models. In this paper, five different supervised classification algorithms are implemented: Logistic Regression (LR), Classification Tree (CART), Linear Discriminant Analysis (LDA), Quadratic Discriminant Analysis (QDA), and K-Nearest Neighbor (KNN). We use these five classification models to predict the resolution of crime. Finally, the performance of the algorithms is compared to select the best model.
In section 2, we describe the data and its preprocessing. The classification algorithms are discussed in section 3. In section 4, we compare the models and select the best model based on performance. In section 5, we summarize the main findings and conclude the paper.

Data Source
In this study, we use the publicly available dataset obtained from the San Francisco Police Department Incident Reports from January to September 2018, which has information on 111,531 official crimes. This project started in October 2018; therefore, the only data available was from January to September 2018. Every entry in the dataset contains information about a crime. The dataset contains 26 variables and 111,531 observations. Detailed information about the dataset, including variable names, types, and levels, is available in [6].

Data Cleaning
With a large dataset, learning from the data is not useful unless unwanted features are removed, since irrelevant and redundant features add nothing positive or new to the target concept [7]. Before applying machine learning algorithms to our dataset, we went through a series of preprocessing steps. The variable Datetime is rejected since it gives the same information as Incident Day of the week and Incident Time. Report Datetime, Report Type Code, and Report Type Description are rejected since we care about when the crime was committed, not when it was reported. Point provides the same information as Latitude and Longitude, so it is rejected. The variables Analysis Neighborhood and Police District give the same information. Analysis Neighborhood has missing values whereas Police District does not, so we keep Police District and reject Analysis Neighborhood. The variables Incident category, Incident Subcategory, and Incident Description give the same information, so we keep Incident category as an input variable and reject the other two.
• Imputing missing values
Missing data is a common problem in data mining. Rates of less than 1% missing data are generally considered trivial, and 1%-5% are manageable. However, 5%-10% requires sophisticated methods to handle, and more than 15% may severely impact any kind of interpretation [8]. The variables CNN (the unique identifier of the intersection for reference back to other related basemap datasets), Latitude, Longitude, and Supervisor District have 5575 missing values. Approximately 5% of the data are missing in our dataset, so it is not reasonable to ignore the missing data and delete those observations. Several methods for imputing missing data, together with their merits and demerits, are discussed in [9]. The missing values in our dataset include both numeric and categorical variables, so a reliable way to impute them is K-nearest neighbors (KNN). The KNN algorithm is useful for any kind of missing data because it fills in a missing value using its closest k neighbors in the multi-dimensional space. We imputed the missing values using the KNN method explained in [10] with k = 10.
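The idea behind KNN imputation can be illustrated with a short Python sketch. It is only an illustration under stated assumptions, not the R procedure of [10] used in this study: the function name knn_impute, the toy rows, and the choice of Euclidean distance over two hypothetical numeric features a and b are all assumptions. A numeric gap is filled with the mean of the k nearest complete rows and a categorical gap with their most common value.

```python
import math
from collections import Counter

def knn_impute(rows, target, features, k=10):
    """Fill missing (None) values of `target` from the k nearest complete rows.

    Distance is Euclidean over the numeric `features` (an assumption).
    Numeric targets get the neighbors' mean; categorical targets get the
    neighbors' most common value.
    """
    complete = [r for r in rows if r[target] is not None]
    imputed = []
    for row in rows:
        if row[target] is not None:
            imputed.append(dict(row))
            continue
        # Rank the complete rows by distance to the row with the gap.
        neighbors = sorted(
            complete,
            key=lambda c: math.dist([row[f] for f in features],
                                    [c[f] for f in features]),
        )[:k]
        values = [n[target] for n in neighbors]
        if all(isinstance(v, (int, float)) for v in values):
            fill = sum(values) / len(values)
        else:
            fill = Counter(values).most_common(1)[0][0]
        imputed.append({**row, target: fill})
    return imputed

# Hypothetical toy rows; the last one is missing both values.
rows = [
    {"a": 0.0, "b": 0.0, "lat": 1.0, "district": "North"},
    {"a": 0.1, "b": 0.0, "lat": 3.0, "district": "North"},
    {"a": 5.0, "b": 5.0, "lat": 10.0, "district": "South"},
    {"a": 0.05, "b": 0.0, "lat": None, "district": None},
]
filled_lat = knn_impute(rows, "lat", ["a", "b"], k=2)
filled_district = knn_impute(rows, "district", ["a", "b"], k=2)
```

With k = 2 the two nearest complete rows are the first two, so the numeric gap is filled with their mean and the categorical gap with their shared label.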

• Data transformation
The variable Filed Online is either TRUE or blank in the original data, so it is converted to TRUE/FALSE to represent whether a report was filed online or not.
The variable Incident category is a categorical variable with 39 categories, which makes it difficult to interpret. A more meaningful approach is to collapse the categories into fewer, larger groups: Assault, Burglary, Larceny theft, Non-criminal, and Others.
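The two recodings described above can be sketched in Python as follows. This is an illustrative sketch, not the code used in this study. The helper names and the exact spellings of the raw category values are assumptions, and the mapping assumes that every category outside the four named groups falls into Others.

```python
# Categories that keep their own group. The exact spellings are assumptions
# about how the values appear in the raw data.
KEPT_CATEGORIES = {"Assault", "Burglary", "Larceny Theft", "Non-Criminal"}

def collapse_category(category):
    """Map a raw incident category to one of the five larger groups."""
    return category if category in KEPT_CATEGORIES else "Others"

def filed_online_flag(value):
    """Convert the raw Filed Online field (TRUE or blank) to a boolean."""
    return str(value).strip().upper() == "TRUE"

# Hypothetical raw values for illustration.
raw_categories = ["Assault", "Robbery", "Larceny Theft", "Arson"]
collapsed = [collapse_category(c) for c in raw_categories]
flags = [filed_online_flag(v) for v in ["TRUE", "", None]]
```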
We used the case_when function in the dplyr package of R to change the levels of the variables Filed Online and Incident category.
Feature scaling is used to bring all of the features to the same magnitude [12]. To scale the features of the dataset, standardization was used. The formula used to calculate the standardization is as follows:
$$z = \frac{x - \min(x)}{\max(x) - \min(x)}$$
where z, min(x), and max(x) are the standardized input, the minimum, and the maximum values of the feature, respectively.
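A minimal Python sketch of this rescaling, which maps each feature onto the interval [0, 1] using its minimum and maximum, is given below. The function name min_max_scale and the handling of a constant feature (all values mapped to 0) are assumptions made for the illustration.

```python
def min_max_scale(values):
    """Rescale a list of numbers to [0, 1] using its minimum and maximum."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature carries no spread; map it to 0 (an assumption).
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical feature values for illustration.
scaled = min_max_scale([2, 4, 6])
```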

Data Partition
In this part of the preprocessing stage, the data is split into two parts, training and testing data, in the ratio 3:1. We used the sample command of R to select 75% of the entire dataset at random. This random sample is taken as the training data, and the remaining 25% is used as the test data. The main purpose of splitting the data is to detect overfitting: a machine learning algorithm may perform exceptionally well on the training dataset but badly on the testing dataset.
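The 3:1 random split can be sketched in Python as follows. This stands in for the R sample command mentioned above and is not the authors' code. The function name train_test_split and the seed value are assumptions chosen for the illustration.

```python
import random

def train_test_split(rows, train_fraction=0.75, seed=2018):
    """Randomly assign train_fraction of the rows to training, the rest to test."""
    rng = random.Random(seed)  # fixed seed (hypothetical) for reproducibility
    indices = list(range(len(rows)))
    train_idx = set(rng.sample(indices, int(len(rows) * train_fraction)))
    train = [rows[i] for i in indices if i in train_idx]
    test = [rows[i] for i in indices if i not in train_idx]
    return train, test

# Hypothetical stand-in for the dataset: 100 row identifiers.
rows = list(range(100))
train, test = train_test_split(rows)
```

Every row lands in exactly one of the two parts, so the test data is never seen during training.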

Machine Learning Algorithms
Various machine learning algorithms are available for solving classification problems, such as Logistic Regression, Neural Network, and Support Vector Machine. However, our research is limited to the following machine learning algorithms.

Logistic Regression
The Logistic Regression (LR) model is used for predicting binary outcomes. It is a statistical model that in its basic form uses a sigmoid function to model a binary response variable, taking on values 1 and 0 with probability π and 1 − π respectively. A logistic regression model is given below as:
$$\log\left(\frac{\pi}{1-\pi}\right) = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p$$
where x_1, ..., x_p are the input variables and β_0, ..., β_p are the coefficients. LR is one of the most popular and common methods and has long been used to solve classification problems, especially when the response variable is binary. Due to its simplicity and convenience, LR is the first method that comes to the mind of most statisticians. We have fitted the logistic regression model using

Linear Discriminant Analysis
Fisher Linear Discriminant Analysis (also called Linear Discriminant Analysis (LDA)) is a method used in statistics, pattern recognition and machine learning to find a linear combination of features which characterizes or separates two or more classes of objects or events. The resulting combination may be used as a linear classifier, or, more commonly, for dimensionality reduction before later classification [14]. Though their motivations differ, Logistic Regression and LDA are closely connected. The only difference between the two models is the way their parameters are estimated. In Logistic Regression, the parameters are estimated using maximum likelihood, whereas in LDA, the parameters are computed using the estimated mean and variance from the normal distribution. In LDA, we assume that the variables follow a Gaussian distribution with a common covariance matrix. If this assumption is met, LDA outperforms Logistic Regression. Conversely, Logistic Regression outperforms LDA if this assumption is not met. We fit the LDA model using the R command lda of the MASS package, following the procedure explained in [5].

Quadratic Discriminant Analysis
Quadratic Discriminant Analysis (QDA) is a supervised machine learning method in which a classifier with a quadratic decision boundary is used to differentiate the classes. QDA serves as a compromise between the linear LDA and Logistic Regression approaches and the nonparametric KNN method. QDA is more flexible than LDA and Logistic Regression because its decision boundary is quadratic, but it is less flexible than KNN. A QDA model is fitted using the R command qda of the MASS package, following the procedure explained in [5].
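The difference between LDA and QDA can be made concrete with a one-feature Python sketch. It is an illustration under simplifying assumptions, not the lda and qda functions of MASS used in this study. There is a single numeric feature, the data are hypothetical, and all function names are assumptions. Both classifiers use Gaussian discriminant scores. LDA shares one pooled variance across classes, which gives a linear boundary. QDA estimates a separate variance per class, which gives a quadratic boundary.

```python
import math
from statistics import mean, variance

def fit_gaussian_classes(xs, ys):
    """Estimate per-class prior, mean, and variance, plus the pooled variance."""
    classes = sorted(set(ys))
    n = len(xs)
    stats = {}
    pooled_ss = 0.0
    for c in classes:
        xc = [x for x, y in zip(xs, ys) if y == c]
        mu = mean(xc)
        stats[c] = {"prior": len(xc) / n, "mean": mu, "var": variance(xc)}
        pooled_ss += sum((x - mu) ** 2 for x in xc)
    pooled_var = pooled_ss / (n - len(classes))
    return stats, pooled_var

def lda_predict(x, stats, pooled_var):
    """Pick the class with the largest linear discriminant score."""
    def score(s):
        return (x * s["mean"] / pooled_var
                - s["mean"] ** 2 / (2 * pooled_var)
                + math.log(s["prior"]))
    return max(stats, key=lambda c: score(stats[c]))

def qda_predict(x, stats):
    """Pick the class with the largest quadratic discriminant score."""
    def score(s):
        return (-(x - s["mean"]) ** 2 / (2 * s["var"])
                - 0.5 * math.log(s["var"])
                + math.log(s["prior"]))
    return max(stats, key=lambda c: score(stats[c]))

# Hypothetical data: class 0 is tightly clustered, class 1 is widely spread.
xs = [-0.5, 0.0, 0.5, -4.0, 1.0, 6.0]
ys = [0, 0, 0, 1, 1, 1]
stats, pooled_var = fit_gaussian_classes(xs, ys)
lda_far = lda_predict(-5.0, stats, pooled_var)
qda_far = qda_predict(-5.0, stats)
```

For the far-away point x = -5, the linear rule assigns class 0 because the point lies on the class 0 side of its single threshold. The quadratic rule assigns class 1 because only the widely spread class plausibly produces a value that far from both means.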

Classification Tree
Classification trees are a powerful alternative to more traditional approaches of land cover classification. Trees provide a hierarchical and nonlinear classification method and are suited to handling non-parametric training data as well as categorical or missing data. By revealing the predictive hierarchical structure of the independent variables, the tree allows for great flexibility in data analysis and interpretation [15]. A classification tree is simple and useful for interpretation. It is a statistical model used to predict a qualitative response. In this model, we predict that each observation belongs to the most commonly occurring class of training observations in the region to which it belongs. A classification tree with the best value of the complexity parameter is fitted using the R package rpart, following the procedure explained in [16].

K-Nearest Neighbor
To fit a KNN model, no assumption is needed. In fact, it is completely nonparametric. KNN can outperform other classification models if their assumptions are not met. We fit the KNN model using the R package class, following the procedure explained in [10].
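The nonparametric nature of KNN can be seen in a minimal Python sketch: no parameters are estimated, and a new observation is simply assigned the majority label among its k nearest training points. This is an illustration, not the R implementation used in this study. The function name knn_predict, the use of Euclidean distance, and the toy data are assumptions.

```python
import math
from collections import Counter

def knn_predict(train_x, train_y, query, k=3):
    """Return the majority label among the k training points nearest to query."""
    nearest = sorted(
        zip(train_x, train_y),
        key=lambda pair: math.dist(pair[0], query),
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical two-feature training data.
train_x = [(0, 0), (0, 1), (1, 0), (5, 5), (6, 5)]
train_y = ["unresolved", "unresolved", "unresolved", "resolved", "resolved"]
near_cluster = knn_predict(train_x, train_y, (5, 6), k=3)
near_origin = knn_predict(train_x, train_y, (0.2, 0.2), k=3)
```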

Model Comparisons
To determine which model performs better, the models were trained on the training dataset and applied to the test dataset to obtain the following metrics: Sensitivity, Specificity, and Accuracy. We compute the confusion matrix for each model as shown in Table 2.
The proportion of the actual resolved cases that are correctly predicted as resolved is called sensitivity. It is also called the true positive rate (TPR) and is given in Equation (4).
The proportion of the actual unresolved cases that are correctly predicted as unresolved is called specificity. It is also called the true negative rate (TNR) and is given in Equation (5).
The proportion of all cases that are predicted correctly is called the accuracy and is defined by Equation (6).

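$$\text{Sensitivity} = \frac{TP}{TP + FN} \qquad (4)$$
$$\text{Specificity} = \frac{TN}{TN + FP} \qquad (5)$$
$$\text{Accuracy} = \frac{TP + TN}{TP + FN + TN + FP} \qquad (6)$$
where TP, FN, TN, and FP are the numbers of true positives, false negatives, true negatives, and false positives in the confusion matrix.
The model with the higher sensitivity, specificity, and accuracy is considered the better model. Table 3 summarizes these statistics. The specificity of all models is reasonable: all models attained at least 88%. The accuracy of the Classification tree is 0.7864, which is the highest, so the Classification tree is considered the best model.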
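The three statistics can be computed directly from the four cells of a confusion matrix, as in the following Python sketch. The function name and the counts are hypothetical and chosen only for illustration. They are not the values of Table 2.

```python
def classification_metrics(tp, fn, tn, fp):
    """Compute sensitivity, specificity, and accuracy from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),   # resolved cases correctly predicted
        "specificity": tn / (tn + fp),   # unresolved cases correctly predicted
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
    }

# Hypothetical counts, with "resolved" taken as the positive class.
metrics = classification_metrics(tp=40, fn=60, tn=360, fp=40)
```

With these hypothetical counts the accuracy is high even though fewer than half of the resolved cases are detected, which is why the three statistics are reported together.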

Results
We compared different classification machine learning algorithms for predicting the resolution of crime using the publicly available dataset obtained from the San Francisco Police Department Incident Reports from January to September 2018. The Classification tree, followed by Logistic Regression, outperforms the other three models: Linear Discriminant Analysis, Quadratic Discriminant Analysis, and K-Nearest Neighbor. One possible cause is that KNN performs poorly whenever the class distribution of Resolution is skewed [17]. When one very large class dominates, it also tends to dominate the neighbor votes, so new observations tend to be voted into the more popular class. Figure 1 shows that the number of unsolved cases is almost four and a half times the number of solved cases. As a result, KNN is unsuitable for this dataset.
It is worth noting that for Linear Discriminant Analysis and Quadratic Discriminant Analysis, the sensitivity is very low, less than 20%. This is likely because the dataset fails to meet the Gaussian requirement. It can be seen from