Computer Model for Evaluating Multi-Target Tracking Algorithms

Public benchmark datasets have been widely used to evaluate multi-target tracking algorithms. Ideally, the benchmark datasets should include the video scenes of all scenarios that need to be tested. However, a limited amount of the currently available benchmark datasets does not comprehensively cover all necessary test scenarios. This limits the evaluation of multitarget tracking algorithms with various test scenarios. This paper introduced a computer simulation model that generates benchmark datasets for evaluating multi-target tracking algorithms with the complexity of multitarget tracking scenarios directly controlled by simulation inputs such as target birth and death rates, target movement, the rates of target merges and splits, target appearances, and image noise types and levels. The simulation model generated a simulated video and also provides the ground-truth target tracking for the simulated video, so the evaluation of multitarget tracking algorithms can be easily performed without any manual video annotation process. We demonstrated the use of the proposed simulation model for evaluating tracking-by-detection algorithms and filtering-based tracking algorithms.


Introduction
A multi-target tracking problem is to estimate the trajectory of multiple targets as they move and interact with each other in a sequence of images, estimating targets' locations and velocities [1] [2].A multi-target tracking problem was driven primarily by aerospace and defense applications, such as radar, sonar, guidance, and navigation [1] [3] [4].With the advancement in high performance the behavior of targets with a linear or non-linear state space model and estimates the system states (typically target locations) using Kalman filtering [23] [24] [25], particle filtering [26] [27] [28] or other MCMC samplers [29] [30] [31] [32].
The performances of the existing approaches vary depending on the complexities and types of the video scenes which they applied to, and there is no single method that universally works best for all test videos.It is often very difficult to choose a good tracking algorithm among many algorithms that can handle the video complexities existing in an application of interest.The best way of evaluating and choosing a multitarget tracking algorithm is to use test video datasets collected directly from the application of interest.However, the evaluation with the test video is often very time-consuming because mostly no ground-truth is available for the test video datasets, for which users may need to spend significant time on manually tracking and annotating targets in the test video.The manual annotation is subject to many human operators' errors.Using public benchmark datasets coming with ground-truth, such as PETS [21] [33] [34], ETH dataset [35] [36], and Technical University of Darmstadt (TUD) dataset [37] [38] [39], might be an alternative, but the benchmark datasets sometimes do not give any good guidance on the evaluation, because the types of targets and the complexities of the tracking events in the public benchmark datasets are not comparable to those existing in the application of interest.In addition, the ground-truth of the public benchmark datasets was manually annotated by human operators using video annotation tools, and the quality and information in the ground-truth vary significantly [40].
In this paper we propose a simulation model that generates benchmark data for multitarget tracking with a fully controlled setting of target appearances, motion, target split and merge, target birth and death, and noise level in video scenes.Each generated benchmark data include a simulated video, and the groud truth target tracking, including individual target locations and the time traces of their merge and split events.We believe that using the benchmark dataset simulated with the complexity comparable to the application of interest would lead to a more accurate evaluation of the existing multitarget tracking algorithms and thus a better choice of the algorithm for the application.Open Journal of Modelling and Simulation The remaining of this paper is organized as follows.In Section 2, we review the public benchmark datasets popularly used for multitarget tracking, and the performance metrics that have been used for evaluating multitarget tracking algorithms.In Section 3, we describe our simulation model.In Sections 4 and 5, we present an example of applying the simulator for evaluating a multitarget tracking algorithm.

Limitation of Public Benchmark Datasets
There are public benchmark datasets in computer vision for testing and evaluating multitarget tracking algorithms.This section will briefly review some of the popular benchmark datasets as listed below: • Performance Evaluation of Tracking and Surveillance (PETS) [21] [33] [34]; • Technical University of Darmstadt (TUD) [37] [38] [39].
• ETH dataset: -BIWI Walking Pedestrian dataset [35] [36]; -Pedestrian Mobile Scene Analysis [41]  The splits and merges among objects often occur due to overlapping pedestrian images from a single camera view.
The third dataset is maintained by the Department of Electrical and Computer Engineering at ETH.There are two popular data in the ETH dataset.The first one is a video of walking pedestrians.The video is taken from the top of a hotel; therefore, it has a bird-eye-view.Similar to the dataset in the TUD, this dataset is taken with a single camera.Due to the limited view angle of the single camera, the appearance and disappearance of targets occur almost every frame in the da-Open Journal of Modelling and Simulation ta.The merging and splitting event is also differentiable in this data, and they appear less than the TUD dataset (about one per four frames).The second data from the ETH is the video of walking pedestrians recorded from the frontal viewpoint instead of the bird-eye perspective.This data is mainly used for detection purpose.However, it can be used for testing multi-target tracking algorithms as well.Similar to the first data, this data is also taken by a single camera; the birth and death of targets occur every frame, and the merging and splitting events do not occur at all.Comparing to both the PETS and TUD datasets, the ETH datasets have simpler split and merge patterns among targets.If the multi-target tracking algorithm focuses on handling the birth and death of targets only, the ETH dataset will be an ideal choice for testing the algorithm.
The last dataset is the Caviar dataset.It is maintained by the computer vision research group at University of Edinburgh.Comparing to previous datasets, this dataset gives a fewer number of targets, one person or two people per scene.
Target merges, splits, births and deaths sequentially occurs.The Caviar datasets have lower level of difficulty than the aforementioned datasets, because there are only two targets per frame.With a known number of targets (i.e., two people), and sequential events, the Caviar dataset is a good choice to test multi-target tracking algorithms at the first time before testing with PETS, TUD, or ETH datasets.
Although users can test and evaluate multitarget tracking algorithms with these datasets, they are unable to modify these datasets in order to produce the similar complexity of video scenarios that exist in the multitarget tracking application of their interest.The simulation model proposed in this paper can overcome this challenge.Our simulation model allows users to control the target's appearance, number of targets, and target's motion.In addition, users can generate different events, such as merging, splitting, birth and death as well as control the frequency of these events.

Discrete Event Simulator for Generating Multitarget Tracking Benchmark
As depicted in Figure 1, the simulator for generating benchmark data is a discrete event simulator which is described by a system state t S at time t and its state transition events with discrete time steps S is a collection of the state vectors for each target i, • ( ) x y : a centroid coordinate of target i, • i B : a vector of a finite number of ( ) , x y -coordinates that represents the outline of target i,  For time t, a synthetic image t I is generated from the system state t S with different imaging conditions; see details in Section 3.7.The system state t S is recorded and used to generate the groudtruth of the simulated image sequence.
At time 0, the system state variables are initialized with 0 n specified by a simulator user, 0 G as a 0 0 n n × matrix of zeros, and the target-level state vectors initialized randomly by the following q functions, ( ) The details on the q functions are described in Section 3.1 and Section 3.2.
The system state is changed from t S to 1 t+ S at time 1 t + by generating a series of the following events sequentially.1) Consider a Split event for each of the merge events occurred at time t.When a split condition is satisfied for a merge case occurred at time t, split the merged target into target i and target j, which resets ( ) ( ) The split condition will be described in Section 3.5.
2) For each target i, Motion f and Rotation g make changes on , , More details on f and g are described in Section 3.3.
3) Merge of target i and target j that exist at time t occurs when the merging condition described in Section 3.4 is satisfied, and it sets ( ) ( ) 4) Birth sets + with the number of birth events t b , which increases the number of system state variables by t n .In the discrete event simulation, a birth process has been modeled by a Poisson arrival process with α as the mean birth rate per unit time interval, where the inter-birth time in Open Journal of Modelling and Simulation between two consecutive birth events follows an exponential distribution with mean 1 α [46].The rate parameter can be changing in time, which is denoted by t α .Following the approach, we sample t b from ( ) where t α is the expected number of birth events occurred at time t and it can be specified by simulator users to change the level of birth events.The state vector for each of the t b born targets is initialized by the q functions, ( )

Appearance Function qB
The appearance function q B is a random process that generates the image coordinates on the outline of a target.By default, our simulator generates the circular outline of a target with a random radius r,

Initial Scene Function qx,y and qθ
At time 0, the initial scene function determines the initial location and orientation of a target by ~, , .0 where Σ is a two-by-two positive definite covariance matrix which determines the overall movement direction.
The rotation function g determines the random rotation angle of target i at time x y and , 1 i t θ + are sampled, the outline of target i at time where 1 is a column vector of ones which has the same size as the row size of i B .

Merging Condition
The merge event occurs at time t in between target i and j if they spatially overlaps after the motion and the rotation applied.Since the target i's outline would be , sin cos the overlap of targets i and j would occur only if is the Hausdorff distance between A and B .If the condition holds, the merge matrix t G is updated by setting its (i, j) and (j, i) elements set to one.Once they merge, the simulator applies the same motion for the two targets so that they can move together before they split.

Split Condition
Each merged case is considered for being re-split into individual targets.The Open Journal of Modelling and Simulation probability of split is represented by a binomial distribution of the probability of split p.

Determination of the Number of Birth Events bt
The number of birth events at time t is determined by ( ) where t α is the average number of event occurrings.The initial state of each new born target is randomly sampled from q B , q θ and , x y q .

Image and Noise Generation Function
For time t, a synthetic image t I is generated from the system state t S .It is a m n × grayscale image where all targets outlined by x y and orientation , i t θ are colored black (i.e.image intensity = 0), and the remaining background is colored white (i.e.image intensity = 255).We add the following noise components on the synthetic images to simulate different signal-to-noise ratios: • Gaussian random noise: for each image pixel at ( )

Performance Evaluation with the Simulator
The synthetic images generated by the simulator are used as inputs to a multi-target tracking algorithm to be tested, and the state vectors { } are used as a ground-truth to compute the performance metrics of the algorithm.
If the algorithm being tested is a tracking-by-detection algorithm [14] [15] [22], target detections are first obtained for all image frames either by obtaining target boundaries with true system state t S or by choosing and running an existing target detection algorithm on simulated images.A tracking-by-detection algorithm associates the detections to target identities via one-to-one, one-to-many or many-to-one associations, which will output the data association t J for every time frame t, which is a 2 t n × matrix with the detection identifier in the first column and the corresponding target identifier in the second column for each row.The output association can be compared with the groud-truth association generated using t S ; in particular, each target location ( ) , , , i t i t x y and the merge Open Journal of Modelling and Simulation state matrix t G are used.The popular performance metric for this type of the algorithm is the accuracy of one-to-one, one-to-many (split) and many-to-one (merge) data associations in terms of the false positive (FPR) and false negative rates (FNR), which is defined by the following equations [52], where FP is the number of false positives, FN is the number of false negatives, TN is the number of negatives, and TP is the number of true positives.
The false positive and false negative rates can be obtained by comparing MM are the number of false negatives, false positives, and missed matches respectively for t, and t g is the total number of ground-truth targets at time t.

Demonstration
In this section, we demonstrate the use of our simulator in generating benchmark data and evaluating multitarget tracking algorithms with the benchmark data.

Simulate Benchmark Datasets with Our Simulator
For the demonstration purpose, we simulated a benchmark dataset with the following input parameters, total number of image frames: T; initial number of targets: n 0 ; target appearance: circle of a radius with mean mean r and standard deviation sig r ; spatial distance in between targets at the first frame: d; image size: 2 2 m n × ; temporal change in target locations: x σ along x-direction and y σ along y-direction; average number of birth events per time: t α ; image noise: white noise with noise variance 2 σ  and non-uniform illumination with ( ) , f x y ; probability of split: p.
, 1 x y f x y x y − − , and 0.8 p = .

Testing and Evaluating Tracking-by-Detection Algorithms
We demonstrate the use of our simulator to evaluate the multiway data association [22] and the linear programming approach [21].For the evaluation, we designed the experiment by a 2 3 factorial design of three variables, the initial distance in between two closest targets (d), target movement speed ( x σ and y σ ), and target birth rate ( t α ), as seen in Table 1.We fixed the other simulation inputs to and 0.8 p = .For each design, we performed ten simulation runs with the same design variables for replicated experiments to reduce any random effects on the evaluation outcomes.For each simulation run, we recorded the numbers of merging events, splitting events, target births, and target deaths occurred during the simulation run.Table 2 shows the average numbers for ten replicated experiments for each design.As observed in Table 2, the complexity of the simulated tracking scenarios change as we vary simulation inputs.For example, when we decrease the distance d and/or increase the temporal variation of target location, x σ and y σ , more merging and splitting events occur.With the increased birth rate t α , more birth events occur.
For each replication, we input the images generated by our simulator ten   simulations to each of the multiway data association [22], and the linear programming approach [21].The outcomes of the two methods were compared with the ground-truth given by our simulator.The popular performance metric for a tracking-by-detection algorithm is the accuracy of one-to-one, one-to-many (split) and many-to-one (merge) data associations in terms of the false positive (FPR) and false negative rates (FNR) as we summarized in Section 4. The comparison results were summarized by the average false positive (FPR) and false negative rates (FNR) over ten simulation runs for each design.Table 3 and Table 4 show comparative results.The multiway data association [22] performed better than the linear programming approach [21] in handling birth events for all experiments.In terms of the capability of handling death events, the multiway data association [22] had a better performance than the linear programming approach [21] with the exception of Design 4 and 6.The multiway data association [22] detected better 2-1 merging and 1-2 splitting events than the linear programming approach [21] for most of the designs except Design 1.The linear programming approach failed to track 3-1 merging and 1-3 splitting events, while the multiway data association [22] was able to track 3-1 merging and1-3 Table 3.The average false positive rate (FPR) and false negative rate (FNR) for the multiway data association [22] and the linear programming approach [21]  splitting evets.Overall, the multiway data association [22] performed better than the linear programming approach [21] in detecting all merging and splitting events.

Evaluating Filtering-Based Algorithms
In this section, we use our simulator to evaluate a simple Kalman filtering-based algorithm [1] [54] [55] to see how it performs with different levels of image noises and different numbers of targets.We performed a 2 2 factorial design of two factors 0 n and σ  as described in Table 5, while controlling the other factors to 10 T = , For each design, we performed ten simulation runs with the same factor levels.For each experiment run, the Kalman-filtering-based algorithm first read in simulated images, detects each target in the scene and finally estimates each target's position with the Kalman filter.We computed the MOTP and MOTA metrics in order to evaluate the accuracy of the estimation.Table 6 summarizes the average MOTP and MOTA metrics over ten simulation runs for each design, where high values implies higher accuracy for both of the metrics [53].The performance degradation of the algorithm was significant with the increased level of noises, but the effect of the number of targets on the performance was not significant.

Conclusion
We presented a novel simulation model that simulates video frames containing the movements and interactions of multiple targets with fully controlled complexities of target motion, target appearance, image noise levels, target birth and death rates, and target merge and split rates.The simulator also generates the groundtruth locations of the targets for the simulated video frames, which makes the simulation model an ideal tool for generating benchmark datasets to [42] [43];• Caviar[33] [44][45].The first dataset has been provided as a part of the paper competition for the International Workshop on Performance Evaluation of Tracking and Surveillance.Every year different test video scenarios are provided, which are mostly for video surveillance.For example, PETS 2000 and PETS 2001 datasets are designed for tracking outdoor people and vehicles.PETS-ECCV 2004 has a number of video clips recorded for the CAVIAR project, including people walking alone, meeting with others, window shopping, fighting and passing out, and leaving a package in a public place.The PETS 2006 dataset is the surveillance data of public spaces for detecting left luggage events.The PETS 2007 dataset considers both volume crime (theft) and a threat scenario (unattended luggage).The datasets for PETS 2009, PETS 2010 and PETS 2012 consider crowd image analysis and include crowd count and density estimation, tracking of individual(s) within a crowd and detection of separate flows and specific crowd events.Many of the datasets consists of three or four video scenarios categorized by the levels of object tracking difficulties.The second dataset is maintained by Technical University of Darmstadt (TUD) in Germany.Video frames in TUD dataset are taken by a single camera, which include three video sequences of multiple pedestrians at different places.

Figure 1 .
Figure 1.Discrete event simulation model for generating benchmark datasets.
a is a vector of equally spaced numbers in [ ] 0, 2π .When users want to customize target boundaries, users can give a set of N function q B randomly chooses one of the boundaries for each target with the chosen boundary index sampled from ( ) Unif 1, N , where ( ) Unif 1, N is the uniform distribution over integer numbers in 1, N.

Figure 3
Figure3shows some of the generated images under Design 1 and Design 2.

Motion Function f and Rotation g
[49]motion function f determines each target's location at time step t in the simulation.Users have two choices among the Brownian motion[47][48][49],

Table 1 .
Design of experiments for evaluating tracking-by-detection algorithms.

Table 2 .
Average number of target split, merge, birth and death events occurred per an experimental run.

Table 4 .
[21] ten replicated simulations for Designs 1 through 4. The average false positive rate (FPR) and false negative rate (FNR) for the multiway data association[22]and the linear programming approach[21]over ten replicated simulations for Designs 5 through 8.