To date, not many studies have been conducted on criminal prediction. In this study, the criminal data related to city S is divided into a training data set and a validation data set at a 1:1 ratio in light of the personal tag data and the travel and accommodation data of criminals and ordinary people in city S. Firstly, the FP-growth algorithm is adopted to calculate association rules between the criminals and the ordinary people in their travel and hotel accommodation data, in order to discover criminal suspects based on association rules. Secondly, the DBSCAN algorithm is employed for clustering of the tag data of the criminals and the ordinary people, followed by similarity calculation, in order to discover criminal suspects based on tag clustering. Lastly, intersection operation is performed on the above two sets of criminal suspects, and the resulting intersection is verified against the criminal validation set for elimination of criminals who appear in the intersection so as to obtain final criminal suspects. Results show that a set of 648 criminal suspects is retrieved based on the association rules calculated by the FP-growth algorithm, while a set of 973 criminal suspects is retrieved based on DBSCAN clustering and cosine similarity of the personal tags; the number of criminal suspects is narrowed down to 567 after the intersection operation of the two sets, and 419 of the 567 criminal suspects are further verified to be criminals using the validation set, thereby leaving the other 148 to be the final criminal suspects and giving a prediction accuracy of 73.9%. The data mining method of criminal suspects based on association rules and tag clustering in this study has been successfully applied to the police system of city S, and the experiment proves the effectiveness of this method in detecting criminal suspects.

Nowadays, the crime situation is becoming increasingly serious across the globe, with more crime types and a higher number of criminals, posing a threat to human lives and property as well as to social stability. Public security authorities are increasingly tasked with maintaining public security and fighting crime under ever-growing requirements on law enforcement. Given the constantly generated crime data, data analysts need to reveal hidden patterns in the data, analyze implicit relationships within the data, predict the occurrence of crimes and discover potential criminals, so as to improve the law enforcement efficiency of public security authorities and prevent crimes from occurring.

Association rule mining (ARM) [

In summary, association rules have been extensively applied to crime mining in relevant studies, the effectiveness of ARM in different fields of crime mining has been investigated in depth, and a large number of improved ARM algorithms have been proposed. These studies are mainly focused on using crime data for association rules mining, but in practice more frequently encountered is business data, which is not necessarily related to crime but is easier to obtain, thereby making it crucial to extract crime information from a large volume of ordinary business data. Moreover, existing studies are largely focused on crime case analysis and crime pattern mining instead of the mining and prediction of potential criminals, and most of the studies are focused on macro-level factors that affect the occurrence of crimes while not considering micro-level characteristics of criminals, while the micro-level characteristics are the internal factors that determine whether an individual would commit a crime. This study combines ARM algorithm and clustering algorithm to deal with crime and related data. This method is different from previous crime mining methods. It can directly find criminal suspects instead of criminal hotspots or others, and it makes full use of ordinary business data, which is easier to access and handle. In this study, ARM is performed on city S-related travel data and accommodation data of criminals and ordinary people, and meanwhile, tag clustering is performed on criminals and ordinary people, in order to discover as criminal suspects, the ordinary people who not only frequently travel with criminals but also have highly similar tags to them. Given that criminal acts are often carried out in the form of gangs, the people who are discerned to travel with the criminals and have a high similarity to the criminals in tags can be considered as potential key individuals. 
Monitoring criminal suspects can greatly reduce the incidence of infractions and crimes and improve public safety. In Section 2, we briefly describe the data used in the study and the preprocessing of the data. In Section 3, we provide the details of our business process modeling methodology. In Section 4, we describe the research results and analysis in detail, and test and compare the algorithm. In Section 5, we summarize the conclusions of this work.

City S-related travel and accommodation data of criminals and ordinary people in 2016 as well as their personal tag data at that time are collected. The travel and accommodation data consist of shuttle ticketing data and hotel accommodation data (

The personal attributes in

Clustering analysis and similarity calculation are performed on vector inputs. Tag vectorization is conducted according to the publicly accessible corpus of Chinese word vectors developed by Beijing Normal University and Renmin University of China―a corpus providing pre-trained 300-dimensional (300 d) character vectors. Each tag vector is calculated according to the following formula:

$$V = \frac{1}{m} \sum_{i=1}^{m} v_i$$

where $V$ is a calculated tag vector, with $m$ representing the number of Chinese characters in the tag and $v_i = (v_{i1}, v_{i2}, \cdots, v_{i300})$ the character vector of the tag's $i$-th character. Each individual has 12 tags, and therefore the tag matrix of each individual is 12 × 300 in size.
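The averaging step can be illustrated with a minimal sketch. The function name `tag_vector`, the lookup table `char_vectors`, and the 3-dimensional toy vectors are assumptions made for illustration; the study itself uses pre-trained 300-dimensional character vectors.

```python
def tag_vector(tag, char_vectors):
    """Average the character vectors of one tag: V = (1/m) * sum(v_i)."""
    vectors = [char_vectors[ch] for ch in tag]  # one vector per character
    m = len(vectors)
    dim = len(vectors[0])
    return [sum(v[k] for v in vectors) / m for k in range(dim)]


# Hypothetical 3-dimensional character vectors (the real corpus is 300-dimensional).
char_vectors = {
    "男": [1.0, 0.0, 2.0],
    "性": [3.0, 2.0, 0.0],
}

vector = tag_vector("男性", char_vectors)
print(vector)  # [2.0, 1.0, 1.0]
```

Stacking the 12 tag vectors of one individual row by row would then give the 12 × 300 tag matrix described above.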

This paper first uses the FP-growth algorithm to mine, from travel data and hotel accommodation data, the ordinary people who frequently travel with criminals, and treats them as criminal suspects. It then uses the DBSCAN algorithm to cluster the tag data of criminals and ordinary people into several tag clusters, calculates the similarity between the criminals and the ordinary people within each cluster, and takes the ordinary people most similar to the criminals as criminal suspects. Finally, the criminal suspects based on association rules and the criminal suspects based on tag clustering are intersected to obtain the final criminal suspects, and the criminal test data is used to check whether the calculated criminal suspects include actual criminals.

Data sheet | Attribute field |
---|---|
Shuttle | Route code, departure time, starting station, arrival station, passenger ID number |
Hotel | Check-in ID number, hotel code, check-in time, hotel name |

Tag | Value | Meaning |
---|---|---|
ID number | 6cbe2819c3xxxxxxxx | |
Gender | M | Male |
Gender | F | Female |
Age | Minor | <18 years old |
Age | Youth | 18 - 40 years old |
Age | Middle aged | 41 - 60 years old |
Age | Elderly | >61 years old |
Marital status | Married | first marriage, remarriage, and remarriage |
Marital status | Unmarried | unmarried, divorced, widowed |
Employment status | Employed | Currently employed |
Employment status | Unemployed | Currently unemployed |
Income | Low | Below the minimum wage |
Income | Middle | Between the minimum wage and the per capita wage |
Income | High | Higher than the per capita wage |
Educational level | Primary school | Primary school education |
Educational level | Junior high school | Junior high school education |
Educational level | Senior high school | Senior high school education |
Educational level | University | University education |
Single-parent family | Yes | Raised in a single-parent family |
Single-parent family | No | Raised in a non-single-parent family |
Offspring | Yes | Having offspring |
Offspring | No | Having no offspring |
House property | Tenant | Having no house property |
House property | Property owner | Having house property |
Foster family | Yes | Fostered |
Foster family | No | Non-fostered |
Household registration | Urban | Urban household registration |
Household registration | Rural | Rural household registration |
Migrant | Yes | Migrant |
Migrant | No | Non-migrant |

The research method flow is shown in

Association rule mining is a procedure which is meant to find frequent patterns, correlations, associations, or causal structures from data sets found in various kinds of databases such as relational databases, transactional databases, and other forms of data repositories. Association rules are created by thoroughly analyzing data and looking for frequent if/then patterns. Then, depending on the following two parameters, the important relationships are observed:

1) Support: an indication of how frequently the itemset appears in the database. It is defined as the fraction of the records in the database that contain X∪Y. For example, a support of 0.1% means that only 0.1% of the transactions contain that itemset.

$$\mathrm{Support}(XY) = \frac{\text{Support count of } XY}{\text{Total number of transactions in } D}$$

2) Confidence: the ratio of the number of transactions that contain X∪Y to the number of transactions that contain X.

It is a measure of the strength of an association rule. For example, a confidence of 80% for the association rule X ⇒ Y means that 80% of the transactions that contain X also contain Y.

$$\mathrm{Confidence}(X \mid Y) = \frac{\mathrm{Support}(XY)}{\mathrm{Support}(X)}$$
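The two measures can be checked on a toy transaction database. The sketch below is only an illustration; the function names `support` and `confidence` and the five toy transactions are assumptions, not data from the study.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, x, y):
    """Support of X and Y together, divided by the support of X (rule X => Y)."""
    return support(transactions, set(x) | set(y)) / support(transactions, x)


# Hypothetical database D with five transactions.
D = [{"A", "B"}, {"A", "B"}, {"A", "C"}, {"B", "C"}, {"A", "B", "C"}]

print(support(D, {"A", "B"}))       # 3 of 5 transactions contain both A and B
print(confidence(D, {"A"}, {"B"}))  # about 0.75: 3 of the 4 transactions with A also contain B
```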

The FP-Growth Algorithm, proposed by Han in [

The FP-Growth Algorithm is an alternative way to find frequent itemsets without candidate generation, thus improving performance. To do so, it uses a divide-and-conquer strategy. The core of this method is a special data structure named the frequent-pattern tree (FP-tree), which retains the itemset association information.

In simple words, this algorithm works as follows: first it compresses the input database creating an FP-tree instance to represent frequent items. After this first step it divides the compressed database into a set of conditional databases, each one associated with one frequent pattern. Finally, each such database is mined separately. Using this strategy, the FP-Growth reduces the search costs looking for short patterns recursively and then concatenating them in the long frequent patterns, offering good selectivity.
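The steps above (compress the database into an FP-tree, split it into conditional databases, mine each one recursively) can be sketched in a few dozen lines. This is a simplified teaching sketch, not the Spark implementation used in the study; the names `Node`, `build_tree`, `fp_growth` and `frequent_itemsets` and the toy baskets are assumptions, and the threshold is an absolute count rather than a fraction.

```python
from collections import defaultdict


class Node:
    """One FP-tree node: an item, its count, and links to parent and children."""

    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}


def build_tree(weighted, min_count):
    """Build an FP-tree from (items, weight) pairs; return (header table, item counts)."""
    counts = defaultdict(int)
    for items, weight in weighted:
        for item in set(items):
            counts[item] += weight
    counts = {i: c for i, c in counts.items() if c >= min_count}
    root, header = Node(None, None), defaultdict(list)
    for items, weight in weighted:
        # Keep frequent items only, most frequent first (ties broken by name).
        ordered = sorted((i for i in set(items) if i in counts),
                         key=lambda i: (-counts[i], i))
        node = root
        for item in ordered:
            if item not in node.children:
                node.children[item] = Node(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
            node.count += weight
    return header, counts


def fp_growth(weighted, min_count, suffix=()):
    """Return {itemset tuple: count} for all frequent itemsets ending in `suffix`."""
    header, counts = build_tree(weighted, min_count)
    found = {}
    for item, nodes in header.items():
        found[tuple(sorted(suffix + (item,)))] = counts[item]
        # Conditional database: the prefix path of every node holding `item`.
        conditional = []
        for node in nodes:
            path, parent = [], node.parent
            while parent.item is not None:
                path.append(parent.item)
                parent = parent.parent
            if path:
                conditional.append((path, node.count))
        found.update(fp_growth(conditional, min_count, suffix + (item,)))
    return found


def frequent_itemsets(baskets, min_count):
    """Mine frequent itemsets from plain baskets using an absolute count threshold."""
    return fp_growth([(list(b), 1) for b in baskets], min_count)


baskets = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["b", "c"], ["a", "b", "c"]]
itemsets = frequent_itemsets(baskets, min_count=3)
print(itemsets[("a", "b")])  # 3
```

With a threshold of 3, the three single items and the three pairs are frequent, while the triple (a, b, c) appears only twice and is dropped.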

Based on travel data and hotel accommodation data, this study treats each shuttle trip or each day of accommodation at a hotel as a shopping basket, where an item is the ID of a person and the itemset is all the people on that shuttle or in that hotel. Given the minimum support and confidence, the FP-growth algorithm is used to calculate the association rules between criminals and ordinary people, that is, to mine the ordinary people who frequently travel or stay with criminals as criminal suspects.
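The basket analogy can be made concrete with a small grouping step. The sketch below assumes records arrive as dictionaries; the field names (`route_code`, `departure_time`, `passenger_id`) and the helper `build_baskets` are hypothetical, chosen to mirror the attribute fields of the shuttle data sheet.

```python
from collections import defaultdict


def build_baskets(records, key_fields, id_field):
    """Group person IDs into baskets keyed by the trip (or hotel-day) they share."""
    baskets = defaultdict(set)
    for record in records:
        key = tuple(record[field] for field in key_fields)
        baskets[key].add(record[id_field])
    return dict(baskets)


# Hypothetical shuttle ticketing records.
tickets = [
    {"route_code": "R1", "departure_time": "2016-03-16 14:50", "passenger_id": "p1"},
    {"route_code": "R1", "departure_time": "2016-03-16 14:50", "passenger_id": "p2"},
    {"route_code": "R1", "departure_time": "2016-03-17 14:50", "passenger_id": "p1"},
    {"route_code": "R2", "departure_time": "2016-03-16 14:50", "passenger_id": "p3"},
]

trip_baskets = build_baskets(tickets, ("route_code", "departure_time"), "passenger_id")
print(len(trip_baskets))  # 3 distinct trips
```

Hotel data could be grouped the same way by using the hotel code and the check-in date as the key, which is an assumption about how a "daily hotel accommodation" basket is formed.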

As one of the most cited density-based clustering algorithms, DBSCAN is likely the best known density-based clustering algorithm in the scientific community today. The central idea behind DBSCAN and its extensions and revisions is the notion that points are assigned to the same cluster if they are density-reachable from each other. To understand this concept, we will go through the most important definitions used in DBSCAN and related algorithms. The definitions and the presented pseudo code follow the original by Ester et al., but are adapted to provide a more consistent presentation with the other algorithms discussed in the paper. Clustering starts with a dataset $D$ containing a set of points $p \in D$. Density-based algorithms need to obtain a density estimate over the data space. DBSCAN estimates the density around a point using the concept of the ϵ-neighborhood.

Definition 1. ϵ-Neighborhood. The ϵ-neighborhood, $N_\epsilon(p)$, of a data point $p$ is the set of points within a specified radius ϵ around $p$.

$$N_\epsilon(p) = \{ q \mid d(p, q) < \epsilon \}$$

where $d$ is some distance measure and $\epsilon \in \mathbb{R}^{+}$. Note that the point $p$ is always in its own ϵ-neighborhood, i.e., $p \in N_\epsilon(p)$ always holds. Following this definition, the size of the neighborhood $|N_\epsilon(p)|$ can be seen as a simple unnormalized kernel density estimate around $p$ using a uniform kernel and a bandwidth of ϵ. DBSCAN uses $N_\epsilon(p)$ and a threshold called minPts to detect dense regions and to classify the points in a data set into core, border, or noise points.

Definition 2. Point classes. A point $p \in D$ is classified as

・ a core point if $N_\epsilon(p)$ has high density, i.e., $|N_\epsilon(p)| \geq \mathit{minPts}$, where $\mathit{minPts} \in \mathbb{Z}^{+}$ is a user-specified density threshold,

・ a border point if $p$ is not a core point, but it is in the neighborhood of a core point $q \in D$, i.e., $p \in N_\epsilon(q)$, or

・ a noise point, otherwise.

Definition 3. Directly density-reachable. A point $q \in D$ is directly density-reachable from a point $p \in D$ with respect to ϵ and minPts if, and only if,

1) $|N_\epsilon(p)| \geq \mathit{minPts}$, and

2) $q \in N_\epsilon(p)$.

That is, $p$ is a core point and $q$ is in its ϵ-neighborhood.

Definition 4. Density-reachable. A point $p$ is density-reachable from $q$ if there exists in $D$ an ordered sequence of points $(p_1, p_2, \cdots, p_n)$ with $q = p_1$ and $p = p_n$ such that $p_{i+1}$ is directly density-reachable from $p_i$ $\forall i \in \{1, 2, \cdots, n-1\}$.

Definition 5. Density-connected. A point $p \in D$ is density-connected to a point $q \in D$ if there is a point $o \in D$ such that both $p$ and $q$ are density-reachable from $o$.

Definition 6. Cluster. A cluster $C$ is a non-empty subset of $D$ satisfying the following conditions:

1) Maximality: If $p \in C$ and $q$ is density-reachable from $p$, then $q \in C$; and

2) Connectivity: $\forall p, q \in C$, $p$ is density-connected to $q$.

The DBSCAN algorithm identifies all such clusters by finding all core points and expanding each to all density-reachable points. The algorithm begins with an arbitrary point p and retrieves its ϵ-neighborhood. If it is a core point then it will start a new cluster that is expanded by assigning all points in its neighborhood to the cluster. If an additional core point is found in the neighborhood, then the search is expanded to include also all points in its neighborhood. If no more core points are found in the expanded neighborhood, then the cluster is complete and the remaining points are searched to see if another core point can be found to start a new cluster. After processing all points, points which were not assigned to a cluster are considered noise.
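The expansion procedure described above can be sketched as follows. This is a minimal, quadratic-time illustration rather than the implementation used in the study; the names `region_query` and `dbscan`, the toy points, and the use of Euclidean distance (`math.dist`) in the example are assumptions. Any distance function, including a cosine-based one, can be passed as `dist`.

```python
import math

NOISE = -1


def region_query(points, i, eps, dist):
    """Indices of all points in the eps-neighborhood of points[i] (strict <, as in Definition 1)."""
    return [j for j in range(len(points)) if dist(points[i], points[j]) < eps]


def dbscan(points, eps, min_pts, dist=math.dist):
    """Return one label per point: a cluster index (0, 1, ...) or NOISE."""
    labels = [None] * len(points)  # None means "not visited yet"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps, dist)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        cluster += 1  # points[i] is a core point: start a new cluster
        labels[i] = cluster
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = region_query(points, j, eps, dist)
            if len(reachable) >= min_pts:
                queue.extend(reachable)  # another core point: keep expanding
    return labels


# Hypothetical 2-D points: two dense groups and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)  # [0, 0, 0, 1, 1, 1, -1]
```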

This study uses cosine similarity to calculate the distance. Refer to the existing study [

$$\cos(\theta) = \frac{\sum_{i=1}^{n} x_i \, y_i}{\sqrt{\sum_{i=1}^{n} x_i^{2}} \times \sqrt{\sum_{i=1}^{n} y_i^{2}}}$$
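A direct transcription of the cosine formula is shown below, together with a small helper that flags ordinary people whose vectors are close to a criminal's. The names `cosine_similarity` and `flag_similar`, the 2-dimensional toy vectors, and the choice to take the maximum similarity over all criminals are assumptions; the 0.8 threshold is the one this study applies to tag similarity.

```python
import math


def cosine_similarity(x, y):
    """cos(theta) = sum(x_i * y_i) / (sqrt(sum(x_i^2)) * sqrt(sum(y_i^2)))."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def flag_similar(ordinary, criminals, threshold=0.8):
    """IDs of ordinary people whose best similarity to any criminal exceeds the threshold."""
    flagged = []
    for person_id, vector in ordinary.items():
        best = max(cosine_similarity(vector, c) for c in criminals.values())
        if best > threshold:
            flagged.append(person_id)
    return flagged


# Hypothetical tag vectors inside one cluster.
criminals = {"c1": [1.0, 0.0]}
ordinary = {"p1": [2.0, 0.0], "p2": [0.0, 3.0], "p3": [1.0, 1.0]}

print(flag_similar(ordinary, criminals))  # ['p1']
```

Here `p3` has a similarity of about 0.71 to `c1`, so it stays below the 0.8 threshold.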

We use the machine learning library Spark MLlib in the distributed computing framework Spark to perform association rule calculation and cluster analysis. A total of five physical nodes are deployed for fast and efficient computing. Compared to stand-alone computing, the time efficiency of Spark distributed computing is increased by 300%; compared to Hadoop MapReduce distributed clusters, time efficiency is increased by 150%.

Association rule computation is based on a given minimum support (min_sup) and a given minimum confidence (min_conf). A change of min_sup and min_conf will lead to a different result. The settings of min_sup and min_conf are allowed to vary according to actual needs. In this study, their settings are as follows:

$$\text{min\_sup}(XY) = \frac{20}{\text{Total number of transactions in } D}$$

$$\text{min\_conf}(X \mid Y) = 0.8$$

1) Travel association rules

For ARM of travel data, each route (including route code, departure time, start station, arrival station) is treated as a transaction ID, and all passengers on the route are treated as transaction items. Computation is performed on shuttle-related data according to a given min_sup and min_conf, and the results are expressed in the form of “route passenger X => passenger Y”, which represents that when the requirements of minimum support and minimum confidence are met, passengers X and Y are considered to have taken the same route within a certain time window. However, it is unclear whether the above rules contain criminals, and therefore filtering operation is performed on all association rules to find the association rules that contain criminals. Based on each association rule that contains criminals, such as “X => Y” wherein X is assumed to be a criminal, it is possible to find all the routes that both X and Y have taken together. The ordinary individual Y who often takes the same routes as the criminal X can be considered as a criminal suspect.
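The filtering step can be sketched as below. The rule format (a list of `(x, y, confidence)` triples with single-person antecedent and consequent), the helper names `min_support` and `suspects_from_rules`, and the toy IDs are assumptions made for illustration; the thresholds (an absolute count of 20 and a confidence of 0.8) are the settings given above.

```python
def min_support(min_count, n_transactions):
    """Relative minimum support for an absolute co-occurrence count."""
    return min_count / n_transactions


def suspects_from_rules(rules, criminals, min_conf=0.8):
    """Keep rules X => Y where X is a known criminal and Y is not; map Y to its best confidence."""
    suspects = {}
    for x, y, conf in rules:
        if conf >= min_conf and x in criminals and y not in criminals:
            suspects[y] = max(conf, suspects.get(y, 0.0))
    return suspects


# Hypothetical mined rules and known criminals.
rules = [
    ("c1", "p1", 0.9),   # criminal => ordinary person: kept
    ("p2", "p3", 1.0),   # no criminal involved: dropped
    ("c1", "c2", 0.95),  # both are known criminals: dropped
    ("c2", "p4", 0.7),   # below the minimum confidence: dropped
]
known_criminals = {"c1", "c2"}

print(min_support(20, 4000))                        # 0.005
print(suspects_from_rules(rules, known_criminals))  # {'p1': 0.9}
```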

The calculation results, as listed in

Criminal | Criminal suspect | Confidence | Route information | Number of the same routes |
---|---|---|---|---|
6eb9b8197cxxxxxxxx | 6eb9b8a9f8xxxxxxxx | 1.0 | {652901xxxxxxxx, 2016-3-16 14:50:00, aks station, awt station}…{…} | 45 |
6eb9bg199bxxxxxxxx | 4ab8b81987xxxxxxxx | 0.9 | {654025xxxxxxxx, 2016-4-26 8:45:00, dsz station, xy station}…{…} | 78 |
6deabf198dxxxxxxxx | 6deab8199cxxxxxxxx | 0.8 | {654221xxxxxxxx, 2016-4-23 10:30:00, em station, tgg station}…{…} | 34 |
65b8b9a98gxxxxxxxx | 6eb8b9a990xxxxxxxx | 0.9 | {654025xxxxxxxx, 2016-9-18 16:20:00, aks station, em station}…{…} | 29 |
…… | …… | …… | …… | …… |
650a0ba97fxxxxxxxx | 6bb8b7a992xxxxxxxx | 0.8 | {652829xxxxxxxx, 2016-7-26 18:10:00, xy station, zq station}…{…} | 40 |

Note: A total of 433 association rules.

there is one route named 653124xxxxxxxx, which started at 14:50:00 on March 16, 2016 from the aks station to the awt station. As shown in the above table, different association rules have different levels of support and confidence. For example, the support and confidence in the first association rule are 45 and 1.0, respectively, indicating that the criminal 6eb9b8197cxxxxxxxx and the criminal suspect 6eb9b8a9f8xxxxxxxx have traveled together with each other for all of the 45 routes. The second association rule shows that nearly 70 of the 78 routes that the criminal 6eb9bg199bxxxxxxxx has taken involve the criminal suspect 4ab8b81987xxxxxxxx. It is impossible yet to know which of the criminal suspects 6eb9b8a9f8xxxxxxxx and 4ab8b81987xxxxxxxx has a higher likelihood of being an actual criminal. In the third association rule, the number of the same routes and the confidence are relatively smaller compared to the first two rules, indicative of a smaller likelihood of the criminal suspect 6deab8199c xxxxxxxx being an actual criminal.

2) Hotel accommodation association rules

For each hotel, all persons who check in for accommodation on a given day are treated as one transaction. Computation is performed on the accommodation data according to the given min_sup and min_conf, and the results are expressed in the form of “hotel code person X => person Y”, indicating that persons X and Y are considered to have stayed at this hotel together when the min_sup and min_conf requirements are met. By processing the hotel accommodation association rules in a similar way to the travel association rules, it is possible to find the criminal suspect Y who often stays in the same hotels as the criminal X.

Criminal | Criminal suspect | Confidence | Hotel information | Times of staying in the same hotels |
---|---|---|---|---|
6edab6197fxxxxxxxx | 4aaabc1980xxxxxxxx | 0.9 | {2016-05-01 jy inn}, {2016-05-03 xh hotel}…{…} | 48 |
5acd29199bxxxxxxxx | 1a0aa1198exxxxxxxx | 0.8 | {2016-05-15 fk hotel}, {2016-06-15 sd hotel}…{…} | 23 |
6abe23198axxxxxxxx | 6ab5bc1970xxxxxxxx | 1 | {2016-06-17 yh inn}, {2016-06-29 sy hotel}…{…} | 35 |
654abc199dxxxxxxxx | 3a0aa0198exxxxxxxx | 0.8 | {2016-06-28 dt inn}, {2016-07-24 xf hotel}…{…} | 47 |
…… | …… | …… | …… | …… |
65280ac966xxxxxxxx | 5ab5cda979xxxxxxxx | 1 | {2016-10-01 xh inn}, {2016-11-03 cf hotel}…{…} | 24 |

Note: A total of 323 association rules.

date; in addition, the times of staying in the same hotels refers to the times of the criminal and the criminal suspect staying in the same hotels during the studied time window. There is a difference in support and confidence between different accommodation association rules, as is the case with the travel association rules, which indicates that different ordinary people have different likelihoods of being a criminal suspect. For example, the probability of an ordinary individual being a criminal suspect is higher in the first association rule than in the second association rule, as the former rule has higher support and confidence than the latter; the third association rule and the last association rule in

Duplicates may exist between the criminal suspects discovered from shuttle ticketing data and those discovered from hotel accommodation data; in other words, criminal suspects based on the association rules of shuttle ticketing may also appear in the association rules of hotel accommodation. Therefore, it is necessary to combine the two sets of results and eliminate the repeated criminal suspects. By doing so, a total of 648 criminal suspects is finally obtained.

The results of DBSCAN clustering are shown in

Based on the tag clusters of criminals and ordinary people, the cosine similarity between ordinary people and criminals in each cluster is calculated, and ordinary people with cosine similarity greater than 0.8 are escalated as criminal suspects. The calculation results, as listed in

Cluster code | Criminals and ordinary people | Number of the people |
---|---|---|
Cluster_1 | 6edab6197fxxxxxxxx, 4aaabc1980xxxxxxxx, 6eb9b8a9f8xxxxxxxx, …… | 2039 |
Cluster_2 | 5acd29199bxxxxxxxx, 1a0aa1198exxxxxxxx, 4ab8b81987xxxxxxxx, …… | 2345 |
Cluster_3 | 65b8bca9960xxxxxxxx, 65cab11989xxxxxxxx, 6e900aa983xxxxxxxx, …… | 2376 |
...... | ...... | ... |
Cluster_66 | 65cab61984xxxxxxxx, 6eb8bca994xxxxxxxx, 6528011996xxxxxxxx, …… | 2439 |
Cluster_67 | 6fb9b8a97cxxxxxxxx, 65bcbd197dxxxxxxxx, 1a0ab1198fxxxxxxxx, …… | 2657 |

Cluster code | Criminal suspects | Number |
---|---|---|
Cluster_1 | 4b0bb21980xxxxxxxx, 411abca980xxxxxxxx, 6fb9b8a968xxxxxxxx, …… | 23 |
Cluster_2 | bc08b7a969xxxxxxxx, ba11aa985xxxxxxxx, 4ab8b8a987xxxxxxxx, …… | 27 |
Cluster_3 | 65b8bca9960xxxxxxxx, 65cab11989xxxxxxxx, 41bcb1a976xxxxxxxx, …… | 15 |
...... | ...... | ... |
Cluster_66 | 51aab9a987xxxxxxxx, 65bcb4a974xxxxxxxx, 110aaa198fxxxxxxxx, …… | 17 |
Cluster_67 | 65bcbd197dxxxxxxxx, 1a0ab1198fxxxxxxxx, 51b9b719fexxxxxxxx, …… | 13 |

Note: A total of 973 criminal suspects.

criminal suspects―which is attributed to that each cluster is different than another in terms of both the total number of people and the number of criminals, so a cluster that is higher in both of the two numbers is also likely to have a higher number of criminal suspects identified by similarity calculation. In addition, DBSCAN clustering may result in a different compactness in a different cluster, and as a result the number of predicted criminal suspects also varies.

Moreover, the distribution of the 973 criminal suspects by their personal tags as shown in

offspring. With respect to the ownership of house property, nearly 80% of the criminal suspects do not have their own house property. The distribution by household registration and migrant status indicates that 67% of the potential key individuals have rural household registration and 60% are migrants. In summary, criminal suspects mainly have the tags of being a male, being middle-aged, being unmarried, being unemployed, having low and middle income, having received elementary school and junior high school education, being raised in a single-parent family, having no offspring, being a tenant, being raised in a foster family, having rural household registration, and being a migrant. These tags are in line with the actual situation of criminals. For example, in the comparison of a male individual versus a female individual, an employed individual versus an unemployed individual, and a low-income individual versus a high-income individual, the former is more prone than the latter to commit crimes. In summary, criminal suspects discovered by means of clustering analysis and tag similarity are in line with the tag-based distribution.

After ARM and DBSCAN clustering, the two sets of criminal suspects found by the two methods are intersected. The number of criminal suspects appearing in both sets is 567. These 567 criminal suspects are then checked against the validation set of criminals; that is, the suspects are traversed in order to find the individuals who also appear in the validation data and are therefore actual criminals, totaling 419. With the actual criminals eliminated, the number of final criminal suspects is 148. On this basis the accuracy of the algorithm is 419/567 ≈ 73.9%, but the actual accuracy needs to be tested in actual law enforcement activities. We can only conclude that the validation results here indicate the effectiveness of the above method for finding criminal suspects. Moreover, most studies predict crime hotspots and the number of crime cases, but do not predict criminal suspects. Therefore, it is difficult to compare this method with other crime prediction methods.
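The intersection and validation steps reduce to a few set operations. The sketch below uses hypothetical IDs and a hypothetical helper `final_suspects`; only the last lines reuse the counts reported in this study (567 intersected suspects, 419 of them confirmed).

```python
def final_suspects(rule_suspects, tag_suspects, validation_criminals):
    """Intersect the two suspect sets, then split off those confirmed by the validation set."""
    both = rule_suspects & tag_suspects
    confirmed = both & validation_criminals
    remaining = both - validation_criminals
    hit_rate = len(confirmed) / len(both) if both else 0.0
    return remaining, confirmed, hit_rate


# Hypothetical toy sets of person IDs.
rule_suspects = {"p1", "p2", "p3", "p4"}
tag_suspects = {"p2", "p3", "p4", "p5"}
validation_criminals = {"p2", "p3", "p9"}

remaining, confirmed, hit_rate = final_suspects(
    rule_suspects, tag_suspects, validation_criminals
)
print(sorted(remaining), sorted(confirmed))  # ['p4'] ['p2', 'p3']

# The counts reported in the study: 419 confirmed out of 567, leaving 148.
print(567 - 419, round(419 / 567 * 100, 1))  # 148 73.9
```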

In this study, the FP-growth algorithm is used to perform ARM on travel data and hotel accommodation data, and the DBSCAN algorithm is used to achieve tag clustering of criminals and ordinary people, and finally the above results are verified. The results show that:

1) The FP-growth association rule mining algorithm shows that different association rules have different support and confidence, that is, different ordinary people have different possibilities of being criminal suspects.

2) By using the FP-growth association rule algorithm, 648 criminal suspects are found, while 973 are found through DBSCAN clustering of personnel tags; the number of criminal suspects in the intersection of the above two sets of criminal suspects is 567, and when verified against the validation data, the 567 criminal suspects are found to contain 419 actual criminals, thereby leaving 148 as the final criminal suspects with a prediction accuracy of 73.9%.

3) Criminal suspects mainly have the tags of being male, being middle-aged, being unmarried, being unemployed, having low and middle income, having received elementary school and junior high school education, being raised in a single-parent family, having no offspring, being a tenant, being raised in a foster family, having rural household registration, and being a migrant, and such tag-based distribution agrees with the situations of actual key individuals to some extent.

4) The validation results show that the method is effective in discovering criminal suspects, and it has great generalizability; that is, it can be used for data mining of train passenger information, passenger exit/entry records, Internet-café user information, as well as other travel spending information.

The authors declare no conflicts of interest regarding the publication of this paper.

Cheng, B., Li, W.H. and Tong, H.X. (2019) Prediction of Criminal Suspects Based on Association Rules and Tag Clustering. Journal of Software Engineering and Applications, 12, 35-50. https://doi.org/10.4236/jsea.2019.123003