There has been evidence of crime in the US since colonization. In this article, we analyze crime statistics for San Francisco and the resolution of crimes recorded from January to September 2018. We define the resolution of a crime as the target variable and study its relationship with the other variables. We build several classification models to predict the resolution of crime using several data mining techniques and suggest the best model for predicting resolution.

Every day, residents of the United States from all walks of life are affected by crime. Crime rates vary over time and reached their peak between the 1970s and the early 1980s. According to the FBI [

To enforce law and order effectively, one must analyze crime statistics and keep the number of unsolved crimes as low as possible. In this article, we analyze crime statistics for San Francisco and the resolution (resolved or not resolved) of crimes recorded from January to September 2018. We define the resolution of a crime as the target variable and study its relationship with the other variables. We build several predictive models for “Resolution of crime” using several machine learning techniques and suggest the best model (or models).

Several authors have defined machine learning in their own way. A common definition is that machine learning is the technology used to develop computer algorithms that can imitate human intelligence. It draws on ideas from different fields such as Computer Science, Information Theory, Statistics and Probability, Artificial Intelligence, Psychology, Control Theory and Philosophy [

Choosing which type of model to apply to a machine learning task in order to make precise predictions is a challenging question. Every model has some merits and demerits [

In Section 2, we describe the data and its preprocessing. The classification algorithms are discussed in Section 3. In Section 4, we compare the models and select the best one based on performance. In Section 5, we summarize the main findings and conclude the article.

In this study, we use the publicly available San Francisco Police Department Incident Reports dataset for January to September 2018, which contains information on 111,531 officially recorded crimes. The project started in October 2018, so only the data from January to September 2018 were available. Every entry in the dataset describes one crime. The dataset contains 26 variables and 111,531 observations. Detailed information on the dataset, including variable name, type, and level, is available in [

With a large dataset, learning from the data is not useful unless unwanted features are removed, since irrelevant and redundant features add nothing positive or new to the target concept [

· Dropping irrelevant features

A feature that has an almost negligible effect on the response variable is called an irrelevant feature. A common example is a serial number. In data mining, there are many feature selection methods, such as the “Filter Method”, which automatically drop irrelevant features. In general, feature selection methods are used when the number of features is huge. However, since our dataset has only 26 features, it is not difficult to identify the irrelevant features and omit them from further processing. The variables Incident Code, Incident Number, Incident ID, Row ID, Report Type Code, and CAD (Computer Aided Dispatch) Number are irrelevant identifiers, so they are omitted.

· Dropping redundant features

The variable Datetime is rejected since it gives the same information as Incident Day of Week and Incident Time. Report Datetime, Report Type Code, and Report Type Description are rejected since we care about when the crime was committed, not when it was reported. Point provides the same information as Latitude and Longitude, so it is rejected. The variables Analysis Neighborhood and Police District give the same information. Analysis Neighborhood has missing values whereas Police District does not, so we keep Police District and reject Analysis Neighborhood. The variables Incident Category, Incident Subcategory, and Incident Description give the same information, so we keep Incident Category as an input variable and reject the other two.

· Imputing missing values

Missing data is a common problem in data mining. Missing rates below 1% are generally considered trivial, and rates of 1% - 5% are manageable. However, rates of 5% - 10% require sophisticated methods to handle, and rates above 15% may severely impact any kind of interpretation [

· Data transformation

The variable Filed Online is either TRUE or blank in the original data, so it is converted to TRUE/FALSE to represent whether a report was filed online or not.

The variable Incident Category is a characteristic variable with 39 subcategories, which is too many to interpret. A more meaningful approach is to collapse the categories into fewer, larger groups: Assault, Burglary, Larceny theft, Non-criminal, and Others.

We used the case_when function in the dplyr package of R to recode the levels of the variables Filed Online and Incident Category.
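The recoding described above can be illustrated with a minimal Python sketch. It is not the authors' R code: the raw category spellings in `KEPT_GROUPS` and the function names are assumptions made for illustration, and only the five target groups and the TRUE/blank rule come from the text.

```python
# Illustrative recoding of Incident Category and Filed Online.
# ASSUMPTION: the raw category spellings below are hypothetical; the paper
# only names the five target groups, not the raw labels behind them.
KEPT_GROUPS = {
    "assault": "Assault",
    "burglary": "Burglary",
    "larceny theft": "Larceny theft",
    "non-criminal": "Non-criminal",
}


def collapse_category(raw):
    """Map a raw incident category to one of the five groups."""
    return KEPT_GROUPS.get(raw.strip().lower(), "Others")


def filed_online_flag(raw):
    """The raw field is either 'TRUE' or blank; blank becomes False."""
    return raw.strip().upper() == "TRUE"


print(collapse_category("Larceny Theft"))        # kept as its own group
print(collapse_category("Motor Vehicle Theft"))  # falls into Others
print(filed_online_flag(""))                     # blank means not filed online
```

The lookup with a default plays the role of the final catch-all branch in a case_when call: any category that is not explicitly kept falls into Others.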

The variable Incident Date was a categorical variable in the standard US date format (MM/DD/YYYY), covering incidents from January 1 to September 24. To make the analysis fruitful and feasible, we extracted the incident month from the incident date and replaced Incident Date with Incident Month, which has 9 categories from January to September, using the case_when function mentioned above. Similarly, Incident Time was a categorical variable in the time format HH:MM. It is decomposed into four categories: Morning, Afternoon, Evening, and Overnight. The decomposition is: 6 am - noon as Morning, noon - 6 pm as Afternoon, 6 pm - 10 pm as Evening, and midnight - 6 am as Overnight.
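A minimal Python sketch of this binning is given below; it is an illustration, not the authors' R code. The function names are hypothetical. The stated bins leave 10 pm to midnight unassigned, so the sketch assumes, purely for illustration, that those hours are counted as Overnight.

```python
from datetime import datetime

MONTHS = ["January", "February", "March", "April", "May", "June",
          "July", "August", "September", "October", "November", "December"]


def incident_month(date_text):
    """Return the month name of a date written as MM/DD/YYYY."""
    return MONTHS[datetime.strptime(date_text.strip(), "%m/%d/%Y").month - 1]


def time_of_day(time_text):
    """Bin an HH:MM time into the four categories used in the text."""
    hour = int(time_text.split(":")[0])
    if 6 <= hour < 12:
        return "Morning"
    if 12 <= hour < 18:
        return "Afternoon"
    if 18 <= hour < 22:
        return "Evening"
    # ASSUMPTION: 22:00-23:59 is not assigned in the text; it is grouped
    # with midnight-06:00 here so that every time receives a category.
    return "Overnight"


print(incident_month("09/24/2018"), time_of_day("14:30"))
```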

The variable Resolution is a categorical variable with 6 classes: Open or Active, Cite or Arrest Adult, Cite or Arrest Juvenile, Exceptional Adult, Exceptional Juvenile, and Unfounded. We define the classes Cite or Arrest Adult, Cite or Arrest Juvenile, Exceptional Adult, and Exceptional Juvenile as Resolved, and the other two classes, Open or Active and Unfounded, as Unresolved, so that Resolution becomes binary with 1 for resolved and 0 for unresolved. We take this as the target variable. A brief summary of the cleaned data with role, type, and level is given in
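The mapping from the six Resolution classes to the binary target can be written as a short Python sketch. The function name is hypothetical, and raising an error for unknown labels is a design choice made here, not something the paper describes.

```python
# Binary target: 1 = resolved, 0 = unresolved, following the class grouping
# given in the text. Labels are compared case-insensitively.
RESOLVED = {
    "cite or arrest adult",
    "cite or arrest juvenile",
    "exceptional adult",
    "exceptional juvenile",
}
UNRESOLVED = {"open or active", "unfounded"}


def resolution_target(label):
    """Return 1 for resolved classes and 0 for unresolved classes."""
    key = label.strip().lower()
    if key in RESOLVED:
        return 1
    if key in UNRESOLVED:
        return 0
    # Failing loudly avoids silently treating a new class as unresolved.
    raise ValueError(f"unexpected Resolution class: {label!r}")


print([resolution_target(c) for c in ("Open or Active", "Exceptional Adult")])
```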

· Encoding Categorical Feature

Feature engineering is a crucial part of machine learning. Since the implemented algorithms can only read numerical values, it is extremely important that the categorical features are transformed into numerical values. Many statistical learning algorithms, such as LDA and QDA, require a numerical feature matrix as input. When categorical variables are present in the data, feature engineering is needed to encode the different categories into a suitable feature vector [
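One common way to turn a nominal variable into numbers is one-hot (dummy) encoding. The paper does not state which encoding was used, so the Python sketch below is an assumption shown only to make the idea concrete; the function name is hypothetical.

```python
def one_hot(values):
    """Encode a list of category labels as 0/1 indicator columns.

    Returns the sorted list of levels and one indicator row per value.
    """
    levels = sorted(set(values))
    rows = [[1 if value == level else 0 for level in levels] for value in values]
    return levels, rows


levels, rows = one_hot(["Morning", "Evening", "Morning", "Overnight"])
print(levels)
for row in rows:
    print(row)
```

For models with an intercept, such as logistic regression, one indicator column per variable is usually dropped to avoid perfect collinearity.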

Variable Name | Variable Role | Variable Type | Variable Level |
---|---|---|---|
CNN | Input | Numeric | Interval |
Latitude | Input | Numeric | Interval |
Longitude | Input | Numeric | Interval |
Incident Month | Input | Characteristic | Nominal |
Incident Time | Input | Characteristic | Nominal |
Incident Day of Week | Input | Characteristic | Nominal |
Incident Category | Input | Characteristic | Nominal |
Police District | Input | Characteristic | Nominal |
Supervisor District | Input | Numeric | Interval |
Filed Online | Input | Characteristic | Binary |
Resolution | Target | Characteristic | Binary |

· Feature Scaling

Many machine learning algorithms, for example KNN, use the Euclidean distance between two data points, so features measured on very different ranges are a problem: the features with the largest ranges dominate the distance. Feature scaling is therefore used to suppress this effect and bring all of the features to the same magnitude [

To scale the features of the dataset, min-max scaling (referred to here as standardization) is used. The formula is as follows:

$$z = \frac{x - \min(x)}{\max(x) - \min(x)} \tag{1}$$

where $z$, $\min(x)$, and $\max(x)$ are the scaled input, the minimum, and the maximum values of the feature, respectively.
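Equation (1) can be sketched in a few lines of Python. The function name and the handling of a constant feature are assumptions made for this illustration.

```python
def min_max_scale(values):
    """Rescale a numeric feature to the interval [0, 1] as in Equation (1)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # ASSUMPTION: a constant feature carries no information; map it to 0
        # rather than dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


print(min_max_scale([2, 4, 6]))  # the minimum maps to 0 and the maximum to 1
```

When a separate test set is used, the minimum and maximum are normally computed on the training data and reused for the test data, so that no information from the test set leaks into preprocessing.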

In this part of the preprocessing stage, the data are split into training and testing sets in the ratio 3:1. We used the sample command of R to select a random 75% of the entire dataset as training data. The remaining 25% of the data is used as test data. The main purpose of splitting the data is to detect overfitting: a machine learning algorithm may perform exceptionally well on the training dataset but badly on the testing dataset.
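A random 75/25 split can be sketched in Python as follows. This mirrors the idea of drawing row indices with R's sample command but is not the authors' code; the function name and the seed value are hypothetical.

```python
import random


def train_test_split(rows, train_fraction=0.75, seed=2018):
    """Randomly assign rows to a training set and a test set."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    n_train = int(len(rows) * train_fraction)
    train_idx = set(rng.sample(range(len(rows)), n_train))
    train = [row for i, row in enumerate(rows) if i in train_idx]
    test = [row for i, row in enumerate(rows) if i not in train_idx]
    return train, test


train, test = train_test_split(list(range(100)))
print(len(train), len(test))
```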

Various machine learning algorithms are available for solving classification problems, such as Logistic Regression, Neural Network, and Support Vector Machine. However, our research is limited to the following machine learning algorithms.

The Logistic Regression (LR) model is used for predicting binary outcomes. It is a statistical model that in its basic form uses a sigmoid function to model a binary response variable, taking on values 1 and 0 with probability π and 1 − π respectively. A logistic regression model is given below as:

$$\operatorname{logit}(\Pr(Y = 1)) = \beta_0 + \sum_{j=1}^{p} X_j \beta_j \tag{2}$$

where,

$$\operatorname{logit}(\Pr(Y = 1)) = \ln\left(\frac{\Pr(Y = 1)}{1 - \Pr(Y = 1)}\right) \tag{3}$$
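The relationship between Equations (2) and (3) can be checked numerically with a small Python sketch. The coefficient values are hypothetical and are not estimates from the paper; the sigmoid function is the inverse of the logit, which is how a fitted linear predictor is turned back into a probability.

```python
import math


def logit(p):
    """Log-odds of a probability p, as in Equation (3)."""
    return math.log(p / (1 - p))


def sigmoid(eta):
    """Inverse of the logit: maps a linear predictor to a probability."""
    return 1 / (1 + math.exp(-eta))


def predict_probability(x, beta0, beta):
    """Pr(Y = 1) implied by Equation (2) for one observation x."""
    eta = beta0 + sum(xj * bj for xj, bj in zip(x, beta))
    return sigmoid(eta)


# Hypothetical coefficients, for illustration only.
print(predict_probability([1.0, 0.0], beta0=-1.0, beta=[0.8, 0.3]))
```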

LR is one of the most popular and common methods and has long been used to solve classification problems, especially when the response variable is binary. Because of its simplicity and convenience, LR is the first method that comes to mind for most statisticians. We fitted the logistic regression model using the glm function of R as explained in [

Fisher Linear Discriminant Analysis (also called Linear Discriminant Analysis (LDA)) is a method used in statistics, pattern recognition and machine learning to find a linear combination of features which characterizes or separates two or more classes of objects or events. The resulting combination may be used as a linear classifier, or, more commonly, for dimensionality reduction before later classification [

Though their motivations differ, Logistic Regression and Linear Discriminant Analysis (LDA) are closely connected. The only difference between the two models is the way their parameters are estimated. In Logistic Regression, the parameters are estimated using maximum likelihood, whereas in LDA the parameters are computed using the estimated mean and variance from the normal distribution. In LDA, we assume that the variables follow a Gaussian distribution with a common covariance matrix. If this assumption is met, LDA outperforms Logistic Regression. Conversely, Logistic Regression outperforms LDA if this assumption is not met. We fit the LDA model using the lda function of the R package MASS, similar to the procedure explained in [

Quadratic Discriminant Analysis (QDA) is a supervised machine learning method in which a classifier with a quadratic decision boundary is used to separate the classes. QDA serves as a compromise between the LDA and Logistic Regression approaches and the nonparametric KNN method. QDA is more flexible than LDA and Logistic Regression because its decision boundary is quadratic, but it is less flexible than KNN. We fit the QDA model using the qda function of the R package MASS, similar to the procedure explained in [
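The difference between LDA and QDA can be seen in a one-feature Python sketch. It is a simplified illustration under stated assumptions, not the MASS implementation: it uses a single feature, maximum-likelihood variance estimates, and toy data invented for the example. With a shared variance the rule is linear in x (LDA); with class-specific variances it is quadratic (QDA).

```python
import math
from statistics import mean, pvariance


def fit_gaussian_classes(xs, ys):
    """Estimate (mean, variance, prior) of one feature for each class."""
    params = {}
    for label in set(ys):
        group = [x for x, y in zip(xs, ys) if y == label]
        params[label] = (mean(group), pvariance(group), len(group) / len(xs))
    return params


def pooled_variance(params):
    """Prior-weighted average of the class variances (shared by LDA)."""
    return sum(var * prior for _, var, prior in params.values())


def discriminant(x, mu, var, prior):
    """Gaussian log-density plus log-prior, dropping constant terms."""
    return -((x - mu) ** 2) / (2 * var) - 0.5 * math.log(var) + math.log(prior)


def classify(x, params, shared_variance=None):
    """QDA if shared_variance is None, LDA if a pooled variance is given."""
    scores = {}
    for label, (mu, var, prior) in params.items():
        used = var if shared_variance is None else shared_variance
        scores[label] = discriminant(x, mu, used, prior)
    return max(scores, key=scores.get)


# Toy data (hypothetical): class 0 is tight around 0, class 1 is spread out.
xs = [-1.0, 0.0, 1.0, 2.0, 6.0, 10.0]
ys = [0, 0, 0, 1, 1, 1]
params = fit_gaussian_classes(xs, ys)
pooled = pooled_variance(params)
print(classify(-8.0, params, pooled), classify(-8.0, params))
```

At x = -8 the two rules disagree: LDA assigns the class with the nearer mean, while QDA prefers the class with the larger variance because a point that far out is very unlikely under the tight class.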

Classification trees are a powerful alternative to more traditional classification approaches. Trees provide a hierarchical and nonlinear classification method and are suited to handling non-parametric training data as well as categorical or missing data. By revealing the predictive hierarchical structure of the independent variables, the tree allows for great flexibility in data analysis and interpretation [
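The hierarchical structure of a tree comes from repeatedly choosing the split that best separates the classes. The Python sketch below finds the best threshold on one numeric feature using Gini impurity. The paper does not state which splitting criterion was used, so Gini impurity is an assumption here, and the function names and toy data are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2 * p * (1 - p)


def best_threshold(xs, ys):
    """Return (threshold, weighted impurity) of the best split x <= threshold."""
    best = (None, float("inf"))
    for t in sorted(set(xs))[:-1]:  # the largest value cannot split the data
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if score < best[1]:
            best = (t, score)
    return best


print(best_threshold([1, 2, 3, 4], [0, 0, 1, 1]))
```

A full tree applies this search to every feature, splits on the best one, and recurses on each side until a stopping rule is met.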

The KNN model takes a completely different approach from the other classification models. Fitting a KNN model requires no distributional assumption; it is completely nonparametric. KNN can outperform the other classification models when their assumptions are not met. We fit the KNN model using the R package class, similar to the procedure explained in [
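The idea behind KNN fits in a few lines of Python: a query point receives the majority class of its k nearest training points under Euclidean distance. This is a sketch rather than the class package implementation; the value k = 3 and the toy data are assumptions, since the paper does not report the k that was used.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Predict the majority label among the k nearest training points."""
    distances = sorted(
        (math.dist(x, query), y) for x, y in zip(train_x, train_y)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Toy data (hypothetical): two well-separated clusters.
train_x = [(0, 0), (0, 1), (1, 0), (5, 5), (6, 5), (5, 6)]
train_y = [0, 0, 0, 1, 1, 1]
print(knn_predict(train_x, train_y, (0.5, 0.5)))
print(knn_predict(train_x, train_y, (5.5, 5.5)))
```

Because the prediction depends directly on distances, this is the model for which the feature scaling step matters most.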

To determine which model performs better, the models were trained on the training dataset and applied to the test dataset to obtain the following metrics: Sensitivity, Specificity, and Accuracy. We compute the confusion matrix for each model as shown in

The proportion of the actual resolved cases that are correctly predicted as resolved is called sensitivity. It is also called the true positive rate (TPR) and is given in Equation (4).

$$\text{Sensitivity} = \text{True positive rate (TPR)} = \frac{\text{True positive (TP)}}{\text{True positive (TP)} + \text{False negative (FN)}} \tag{4}$$

The proportion of the actual unresolved cases that are correctly predicted as unresolved is called specificity. It is also called the true negative rate (TNR), which equals one minus the false positive rate (FPR), and is given in Equation (5).

$$\text{Specificity} = \text{True negative rate (TNR)} = \frac{\text{True negative (TN)}}{\text{True negative (TN)} + \text{False positive (FP)}} \tag{5}$$

The proportion of the cases that are predicted correctly is called the accuracy and is defined by Equation (6).

$$\text{Accuracy} = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{FN} + \text{TN} + \text{FP}} \tag{6}$$

A model with higher sensitivity, specificity, and accuracy is considered the better model.
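Equations (4) to (6) translate directly into code. The Python sketch below computes the three metrics from the four cells of a confusion matrix; the counts are hypothetical and are not taken from the paper's results.

```python
def classification_metrics(tp, fp, fn, tn):
    """Sensitivity, specificity, and accuracy from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),           # Equation (4)
        "specificity": tn / (tn + fp),           # Equation (5)
        "accuracy": (tp + tn) / (tp + fp + fn + tn),  # Equation (6)
    }


# Hypothetical counts: 100 actual resolved cases, 200 actual unresolved cases.
metrics = classification_metrics(tp=30, fp=20, fn=70, tn=180)
print(metrics)
```

The example shows how accuracy can look acceptable while sensitivity is low when unresolved cases are the majority, which is why all three metrics are compared.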

 | Actual Resolved | Actual Unresolved |
---|---|---|
Predicted Resolved | TP | FP |
Predicted Unresolved | FN | TN |

Model method | Sensitivity | Specificity | Accuracy |
---|---|---|---|
Logistic Regression | 0.1712 | 0.9585 | 0.7685 |
Classification tree | 0.6119 | 0.8112 | 0.7864 |
LDA | 0.003715 | 0.9974 | 0.7576 |
QDA | 0.1851 | 0.9476 | 0.7635 |
KNN | 0.4187 | 0.8819 | 0.7701 |

The sensitivities of the models LR, LDA, and QDA are below 19%, which is very low, so they cannot be considered good models: fewer than 19% of the actually resolved cases are correctly predicted as resolved. On the flip side, the sensitivity of the Classification tree is 0.6119, which is the highest among the models.

The specificity of all models is reasonable: every model attains at least 81%. The accuracy of the Classification tree is 0.7864, which is the highest, so the Classification tree is considered the better model.
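The comparison can be reproduced from the reported figures with a short Python sketch. The numbers are copied from the results table above; the dictionary layout and function name are illustrative choices.

```python
# (sensitivity, specificity, accuracy) as reported in the results table.
RESULTS = {
    "Logistic Regression": (0.1712, 0.9585, 0.7685),
    "Classification tree": (0.6119, 0.8112, 0.7864),
    "LDA": (0.003715, 0.9974, 0.7576),
    "QDA": (0.1851, 0.9476, 0.7635),
    "KNN": (0.4187, 0.8819, 0.7701),
}
METRICS = ("sensitivity", "specificity", "accuracy")


def best_model(metric):
    """Name of the model with the highest value of the given metric."""
    idx = METRICS.index(metric)
    return max(RESULTS, key=lambda name: RESULTS[name][idx])


for metric in METRICS:
    print(metric, "->", best_model(metric))
print("lowest specificity:", min(r[1] for r in RESULTS.values()))
```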

We compared different classification machine learning algorithms for predicting the resolution of crime using the publicly available San Francisco Police Department Incident Reports dataset for January to September 2018. The Classification tree, followed by Logistic Regression, outperforms the other three models: Linear Discriminant Analysis, Quadratic Discriminant Analysis, and K-nearest neighbors.

A possible cause is that KNN performs poorly whenever the class distribution of the Resolution variable is skewed [

It is worth noting that for Linear Discriminant Analysis and Quadratic Discriminant Analysis the sensitivity is very low, less than 20%. This is likely because the dataset fails to meet the Gaussian requirement. As can be seen from Figures 2-4, several variables do not follow a Gaussian distribution. The feature Longitude is skewed to the left as shown in

outlier as shown in

The authors declare no conflicts of interest regarding the publication of this paper.

Dahal, K.R., Dahal, J.N., Goward, K.R. and Abayami, O. (2020) Analysis of the Resolution of Crime Using Predictive Modeling. Open Journal of Statistics, 10, 600-610. https://doi.org/10.4236/ojs.2020.103036