Computer Model for Evaluating Multi-Target Tracking Algorithms
1. Introduction
Multi-target tracking is the problem of estimating the trajectories of multiple targets, including their locations and velocities, as they move and interact with each other in a sequence of images [1] [2]. The problem was driven primarily by aerospace and defense applications, such as radar, sonar, guidance, and navigation [1] [3] [4]. With the advancement of high-performance computing and the availability of inexpensive sensors and cameras, multi-target tracking has become popular and has grown into an established discipline [5] [6] [7]. Multi-target tracking algorithms currently find many applications in computer vision [2], oceanography [8] [9], robotics [10] [11], and remote sensing [12].
Many multi-target tracking approaches have been extensively studied. A popular approach is a tracking-by-detection method that uses an existing target detection algorithm to detect targets in images and solves an optimization problem for associating and tracing the detected targets over a time horizon [13] - [22]. Another popular approach is a filtering-based approach, which explicitly models the behavior of targets with a linear or non-linear state space model and estimates the system states (typically target locations) using Kalman filtering [23] [24] [25], particle filtering [26] [27] [28] or other MCMC samplers [29] [30] [31] [32].
The performance of the existing approaches varies with the complexity and type of the video scenes to which they are applied, and no single method works best universally across all test videos. It is often difficult to choose, among the many available algorithms, one that can handle the video complexities present in an application of interest. The best way to evaluate and choose a multi-target tracking algorithm is to use test video datasets collected directly from the application of interest. However, this evaluation is often very time-consuming because ground truth is rarely available for such test videos, so users may need to spend significant time manually tracking and annotating targets, and the manual annotation is subject to human operator errors. Using public benchmark datasets that come with ground truth, such as PETS [21] [33] [34], the ETH dataset [35] [36], and the Technical University of Darmstadt (TUD) dataset [37] [38] [39], might be an alternative, but these benchmark datasets sometimes give little guidance because the types of targets and the complexities of the tracking events they contain are not comparable to those in the application of interest. In addition, the ground truth of the public benchmark datasets was manually annotated by human operators using video annotation tools, and the quality of and information in the ground truth vary significantly [40].
In this paper we propose a simulation model that generates benchmark data for multi-target tracking with fully controlled settings of target appearance, motion, target split and merge, target birth and death, and noise level in the video scenes. Each generated benchmark dataset includes a simulated video and the ground-truth tracking, including the individual target locations and the time traces of their merge and split events. We believe that using benchmark datasets simulated with complexity comparable to the application of interest leads to a more accurate evaluation of existing multi-target tracking algorithms and thus a better choice of algorithm for the application.
The remainder of this paper is organized as follows. In Section 2, we review the public benchmark datasets popularly used for multi-target tracking. In Section 3, we describe our simulation model. In Section 4, we describe the performance metrics used for evaluating multi-target tracking algorithms with the simulator, and in Section 5 we present examples of applying the simulator to evaluate multi-target tracking algorithms.
2. Limitation of Public Benchmark Datasets
There are several public benchmark datasets in computer vision for testing and evaluating multi-target tracking algorithms. This section briefly reviews some of the popular ones, listed below:
・ Performance Evaluation of Tracking and Surveillance (PETS) [21] [33] [34];
・ Technical University of Darmstadt (TUD) [37] [38] [39];
・ ETH dataset:
- BIWI Walking Pedestrian dataset [35] [36];
- Pedestrian Mobile Scene Analysis [41] [42] [43];
・ Caviar [33] [44] [45].
The first dataset has been provided as part of the paper competition for the International Workshop on Performance Evaluation of Tracking and Surveillance. Different test video scenarios, mostly for video surveillance, are provided each year. For example, the PETS 2000 and PETS 2001 datasets are designed for tracking outdoor people and vehicles. PETS-ECCV 2004 has a number of video clips recorded for the CAVIAR project, including people walking alone, meeting others, window shopping, fighting and passing out, and leaving a package in a public place. The PETS 2006 dataset is surveillance data of public spaces for detecting left-luggage events. The PETS 2007 dataset considers both volume crime (theft) and a threat scenario (unattended luggage). The datasets for PETS 2009, PETS 2010 and PETS 2012 consider crowd image analysis and include crowd count and density estimation, tracking of individuals within a crowd, and detection of separate flows and specific crowd events. Many of the datasets consist of three or four video scenarios categorized by the level of object tracking difficulty.
The second dataset is maintained by the Technical University of Darmstadt (TUD) in Germany. The video frames in the TUD dataset are taken by a single camera and include three video sequences of multiple pedestrians at different places. Splits and merges among objects often occur because pedestrian images overlap in the single camera view.
The third dataset is maintained by the Department of Electrical and Computer Engineering at ETH. There are two popular datasets in the ETH collection. The first is a video of walking pedestrians taken from the top of a hotel, so it has a bird's-eye view. Like the TUD dataset, it is taken with a single camera. Due to the limited view angle of the single camera, targets appear and disappear in almost every frame. Merging and splitting events are also distinguishable in this data, although they appear less frequently than in the TUD dataset (about one per four frames). The second dataset from ETH is a video of walking pedestrians recorded from a frontal viewpoint instead of the bird's-eye perspective. It is mainly used for detection purposes, but it can be used for testing multi-target tracking algorithms as well. Like the first dataset, it is taken by a single camera; the birth and death of targets occur in every frame, and merging and splitting events do not occur at all. Compared to the PETS and TUD datasets, the ETH datasets have simpler split and merge patterns among targets. If a multi-target tracking algorithm focuses only on handling the birth and death of targets, the ETH dataset is an ideal choice for testing it.
The last dataset is the Caviar dataset, maintained by the computer vision research group at the University of Edinburgh. Compared to the previous datasets, it contains fewer targets, one or two people per scene. Target merges, splits, births and deaths occur sequentially. The Caviar dataset has a lower level of difficulty than the aforementioned datasets because there are only two targets per frame. With a known number of targets (i.e., two people) and sequential events, the Caviar dataset is a good first choice for testing a multi-target tracking algorithm before moving on to the PETS, TUD, or ETH datasets.
Although users can test and evaluate multi-target tracking algorithms with these datasets, they cannot modify them to reproduce the complexity of the video scenarios that exist in the multi-target tracking application of their interest. The simulation model proposed in this paper overcomes this limitation. Our simulation model allows users to control each target's appearance, the number of targets, and each target's motion. In addition, users can generate different events, such as merging, splitting, birth and death, as well as control the frequency of these events.
3. Discrete Event Simulator for Generating Multi-Target Tracking Benchmarks
As depicted in Figure 1, the simulator for generating benchmark data is a discrete event simulator described by a system state $S_t$ at time $t$ and its state transition events over discrete time steps $t = 0, 1, \ldots, T$. The system state $S_t$ is a collection of the state vectors for each target $i$,
・ $(x_{i,t}, y_{i,t})$: the centroid coordinate of target $i$,
・ $B_{i,t}$: a vector of a finite number of $(x, y)$-coordinates that represents the outline of target $i$,
・ $\theta_{i,t}$: the rotation angle of the appearance of target $i$,
and the system-level state variables,
・ $n_t$: the number of targets at time $t$, and
・ $M_t$: a symmetric matrix of merging states; $M_t(i, j) = 1$ implies targets $i$ and $j$ are merged at time $t$, and $M_t(i, j) = 0$ otherwise.
Figure 1. Discrete event simulation model for generating benchmark datasets.
For time $t$, a synthetic image $I_t$ is generated from the system state $S_t$ under different imaging conditions; see details in Section 3.7. The system state $S_t$ is recorded and used to generate the ground truth of the simulated image sequence.
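To make the bookkeeping concrete, the following is a minimal Python sketch of how the per-target and system-level state described above could be stored. The class and field names are our own illustrative choices, not part of the original simulator.

```python
# Minimal sketch of the simulator's system state S_t (illustrative names only).
from dataclasses import dataclass, field
import numpy as np

@dataclass
class TargetState:
    centroid: np.ndarray   # (x_{i,t}, y_{i,t}), shape (2,)
    outline: np.ndarray    # B_{i,t}: (x, y)-coordinates of the boundary, shape (k, 2)
    angle: float           # theta_{i,t}: rotation angle of the appearance

@dataclass
class SystemState:
    targets: list = field(default_factory=list)   # one TargetState per live target
    merge_matrix: np.ndarray = field(              # M_t, symmetric, zeros and ones
        default_factory=lambda: np.zeros((0, 0), dtype=int))

    @property
    def n_targets(self) -> int:                    # n_t
        return len(self.targets)
```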
At time 0, the system state variables are initialized with $n_0$ specified by the simulator user, $M_0$ set to an $n_0 \times n_0$ matrix of zeros, and the target-level state vectors initialized randomly by the following $q$ functions,
$$B_{i,0} \sim q_B, \qquad (x_{i,0}, y_{i,0}) \sim q_{x,y}, \qquad \theta_{i,0} \sim q_\theta, \qquad i = 1, \ldots, n_0.$$
The details of the $q$ functions are described in Section 3.1 and Section 3.2.
The system state is changed from $S_t$ to $S_{t+1}$ at time $t + 1$ by generating the following series of events sequentially; a minimal sketch of one such transition step is given after this list.
1) Split: a split event is considered for each of the merges active at time $t$. When the split condition is satisfied for a merged pair, the merged target is split back into target $i$ and target $j$, which resets $M_{t+1}(i, j) = M_{t+1}(j, i) = 0$. The split condition is described in Section 3.5.
2) Motion and Rotation: for each target $i$, the motion function $f$ and the rotation function $g$ update $(x_{i,t+1}, y_{i,t+1})$ and $\theta_{i,t+1}$. More details on $f$ and $g$ are given in Section 3.3.
3) Merge: targets $i$ and $j$ that exist at time $t$ merge when the merging condition described in Section 3.4 is satisfied, which sets $M_{t+1}(i, j) = M_{t+1}(j, i) = 1$.
4) Birth: a birth event sets $n_{t+1} = n_t + b_{t+1}$, where $b_{t+1}$ is the number of birth events, which increases the number of target-level state vectors by $b_{t+1}$. In the discrete event simulation, the birth process is modeled as a Poisson arrival process with mean birth rate $\lambda$ per unit time interval, where the inter-birth time between two consecutive birth events follows an exponential distribution with mean $1/\lambda$ [46]. The rate parameter may change over time, which is denoted by $\lambda_t$. Following this approach, we sample $b_{t+1}$ from $\mathrm{Poisson}(\lambda_{t+1})$, where $\lambda_{t+1}$ is the expected number of birth events at time $t + 1$ and can be specified by simulator users to change the level of birth events. The state vector of each of the $b_{t+1}$ newly born targets is initialized by the $q$ functions, and $M_{t+1}$ is augmented by appending $b_{t+1}$ columns and $b_{t+1}$ rows of zeros.
5) Death: the death of target $i$ occurs when its new centroid $(x_{i,t+1}, y_{i,t+1})$ moves out of the pre-specified image region. When it occurs, $n_{t+1}$ is reduced by one and the corresponding target-level state vectors are removed.
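The following Python sketch illustrates the order of the five transition events within one time step. The helper names (split_condition, merge_condition, sample_motion, sample_rotation, sample_new_target) are placeholders for the conditions and functions of Sections 3.1 through 3.6, not the authors' implementation, and the bookkeeping is simplified (for example, merged pairs are not forced to share a motion here). Targets are assumed to be TargetState-like records as in the earlier state sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def step(targets, M, lam, img_size, split_condition, merge_condition,
         sample_motion, sample_rotation, sample_new_target):
    """One state transition S_t -> S_{t+1} (illustrative sketch)."""
    n = len(targets)
    # 1) Split: re-examine every currently merged pair.
    for i in range(n):
        for j in range(i + 1, n):
            if M[i, j] == 1 and split_condition():
                M[i, j] = M[j, i] = 0
    # 2) Motion and rotation for every target (the actual simulator applies the
    #    same motion to merged pairs so that they move together).
    for tgt in targets:
        tgt.centroid = tgt.centroid + sample_motion()
        tgt.angle = tgt.angle + sample_rotation()
    # 3) Merge: pairs of existing targets whose outlines overlap are marked merged.
    for i in range(n):
        for j in range(i + 1, n):
            if merge_condition(targets[i], targets[j]):
                M[i, j] = M[j, i] = 1
    # 4) Birth: Poisson number of new targets; pad the merge matrix with zeros.
    b = rng.poisson(lam)
    targets = targets + [sample_new_target() for _ in range(b)]
    M = np.pad(M, ((0, b), (0, b)))
    # 5) Death: drop targets whose centroid left the image region (h, w).
    h, w = img_size
    keep = [k for k, t in enumerate(targets)
            if 0 <= t.centroid[0] < w and 0 <= t.centroid[1] < h]
    return [targets[k] for k in keep], M[np.ix_(keep, keep)]
```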
3.1. Appearance Function $q_B$
The appearance function $q_B$ is a random process that generates the image coordinates on the outline of a target. By default, our simulator generates the circular outline of a target with a random radius $r$,
$$B_{i,0} = \left[\, r\cos\Phi \;\; r\sin\Phi \,\right],$$
where $r$ is drawn randomly with mean $\mu_r$ and standard deviation $\sigma_r$, and $\Phi$ is a vector of equally spaced angles in $[0, 2\pi]$.
When users want to customize target boundaries, they can provide a set of $N$ candidate target boundaries $\{\bar{B}_1, \ldots, \bar{B}_N\}$. The appearance function $q_B$ then randomly chooses one of the candidate boundaries for each target, with the chosen boundary index sampled from $U\{1, N\}$, where $U\{1, N\}$ is the uniform distribution over the integers $1, \ldots, N$.
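As a concrete illustration, the default circular appearance and the user-supplied alternative could be sampled as follows. Treating the radius as normally distributed is our assumption for this sketch; the text above only specifies its mean and standard deviation.

```python
import numpy as np

def sample_circle_outline(mu_r, sigma_r, n_points=64, rng=None):
    """Default appearance q_B: circular outline centered at the origin (sketch)."""
    rng = rng or np.random.default_rng()
    r = max(rng.normal(mu_r, sigma_r), 1.0)   # random radius (assumed normal), kept positive
    phi = np.linspace(0.0, 2.0 * np.pi, n_points, endpoint=False)
    return np.stack([r * np.cos(phi), r * np.sin(phi)], axis=1)  # shape (n_points, 2)

def sample_custom_outline(candidates, rng=None):
    """Customized q_B: pick one of N user-supplied boundaries uniformly at random."""
    rng = rng or np.random.default_rng()
    return candidates[rng.integers(len(candidates))]
```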
3.2. Initial Scene Function $q_{x,y}$ and $q_\theta$
At time 0, the initial scene function determines the initial location and orientation of each target,
$$(x_{i,0}, y_{i,0}) \sim q_{x,y}, \qquad \theta_{i,0} \sim q_\theta,$$
where $q_{x,y}$ evenly distributes the $n_0$ targets over the image region on grid points with spacing $d$ and margin $w$, and $q_\theta$ assigns a random initial orientation. At time $t > 0$, the initial scene function determines the initial location and orientation of a newly born target by drawing from $q_{x,y}$ and $q_\theta$.
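A minimal sketch of the grid-based initial placement follows; the uniform orientation distribution is an assumption, since the distribution of the initial angle is not spelled out above.

```python
import numpy as np

def initial_scene(n0, img_h, img_w, d, w, rng=None):
    """q_{x,y} and q_theta at time 0: grid placement with spacing d and margin w (sketch)."""
    rng = rng or np.random.default_rng()
    xs = np.arange(w, img_w - w, d)
    ys = np.arange(w, img_h - w, d)
    grid = [(float(x), float(y)) for y in ys for x in xs]
    assert len(grid) >= n0, "image region too small for n0 targets at spacing d"
    centroids = np.array(grid[:n0])
    angles = rng.uniform(0.0, 2.0 * np.pi, size=n0)  # assumed uniform initial orientation
    return centroids, angles
```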
3.3. Motion Function f and Rotation g
The motion function $f$ determines each target's location at each time step of the simulation. Users can choose between a Brownian motion [47] [48] [49],
$$(x_{i,t+1}, y_{i,t+1}) = (x_{i,t}, y_{i,t}) + \epsilon_t, \qquad \epsilon_t \sim N(0, \Sigma),$$
and a stochastic diffusion process [50], where $\Sigma$ is a two-by-two positive definite covariance matrix which determines the overall movement direction.
The rotation function $g$ determines the random rotation angle $\theta_{i,t+1}$ of target $i$ at time $t + 1$. Once $(x_{i,t+1}, y_{i,t+1})$ and $\theta_{i,t+1}$ are sampled, the outline of target $i$ at time $t + 1$ is determined by the rigid body transformation of $B_{i,0}$,
$$B_{i,t+1} = B_{i,0}\, R(\theta_{i,t+1})^{\top} + \mathbf{1}\, [x_{i,t+1} \;\; y_{i,t+1}],$$
where $R(\theta)$ is the two-dimensional rotation matrix and $\mathbf{1}$ is a column vector of ones whose length equals the number of rows of $B_{i,0}$.
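A sketch of the Brownian motion update together with the rigid body transformation of the outline follows. The Gaussian increment for the rotation angle is an assumption made for this sketch; the distribution used by $g$ is not specified above.

```python
import numpy as np

def move_and_rotate(centroid, angle, outline0, Sigma, sigma_theta, rng=None):
    """Motion f (Brownian step) and rotation g, then rigid transform of B_{i,0} (sketch)."""
    rng = rng or np.random.default_rng()
    centroid = centroid + rng.multivariate_normal(np.zeros(2), Sigma)  # Brownian step
    angle = angle + rng.normal(0.0, sigma_theta)                       # assumed Gaussian increment
    c, s = np.cos(angle), np.sin(angle)
    R = np.array([[c, -s], [s, c]])                                    # 2-D rotation matrix
    outline = outline0 @ R.T + centroid                                # rotate, then translate
    return centroid, angle, outline
```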
3.4. Merging Condition
A merge event between targets $i$ and $j$ occurs at time $t$ if their outlines spatially overlap after the motion and rotation are applied. Since target $i$'s outline at time $t$ is $B_{i,t}$, the overlap of targets $i$ and $j$ can occur only if the Hausdorff distance between $B_{i,t}$ and $B_{j,t}$ is sufficiently small. If the condition holds, the merge matrix $M_t$ is updated by setting its $(i, j)$ and $(j, i)$ elements to one. Once two targets merge, the simulator applies the same motion to both so that they move together until they split.
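The overlap test can be sketched with SciPy's directed Hausdorff distance, symmetrized over the two outlines; the threshold parameter here is illustrative, not a value taken from the paper.

```python
from scipy.spatial.distance import directed_hausdorff

def merge_condition(outline_i, outline_j, threshold):
    """Merge test (sketch): symmetric Hausdorff distance between two target outlines."""
    d_ij = directed_hausdorff(outline_i, outline_j)[0]
    d_ji = directed_hausdorff(outline_j, outline_i)[0]
    return max(d_ij, d_ji) <= threshold
```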
3.5. Split Condition
Each merged pair is considered for being split back into individual targets. At each time step, whether a merged pair splits is determined by a Bernoulli trial with split probability $p$.
3.6. Determination of the Number of Birth Events $b_t$
The number of birth events at time $t$ is determined by
$$b_t \sim \mathrm{Poisson}(\lambda_t),$$
where $\lambda_t$ is the average number of birth events per time step. The initial state of each newly born target is randomly sampled from $q_B$, $q_{x,y}$, and $q_\theta$.
3.7. Image and Noise Generation Function
For time $t$, a synthetic image $I_t$ is generated from the system state $S_t$. It is a grayscale image of a user-specified size in which every target, outlined by $B_{i,t}$ with centroid $(x_{i,t}, y_{i,t})$ and orientation $\theta_{i,t}$, is colored black (i.e., image intensity = 0), and the remaining background is colored white (i.e., image intensity = 255). We add the following noise components to the synthetic images to simulate different signal-to-noise ratios:
・ Gaussian random noise: a noise value $G_t(u, v) \sim N(0, \sigma^2)$ is generated for each image pixel $(u, v)$ and added to $I_t(u, v)$.
・ Non-uniform illumination: an illumination component $L_t(u, v)$ is generated for each image pixel $(u, v)$ from a function defined by the user, and the output image at time $t$ is obtained by applying $L_t$ to $I_t$ following [51].
・ Both Gaussian random noise and non-uniform illumination: the output image at time $t$ is obtained by applying both $L_t$ and $G_t$ to $I_t$, where $L_t$ and $G_t$ are the non-uniform illumination and the Gaussian random noise generated above, respectively.
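A minimal rendering sketch using scikit-image's polygon rasterization follows. Treating both noise components as additive and clipping the result to [0, 255] are assumptions made for this sketch.

```python
import numpy as np
from skimage.draw import polygon  # rasterizes a polygon interior

def render_frame(outlines, img_h, img_w, sigma_noise=0.0, illum=None, rng=None):
    """Render black targets on a white background, then add noise (sketch)."""
    rng = rng or np.random.default_rng()
    img = np.full((img_h, img_w), 255.0)              # white background
    for b in outlines:                                # b: (k, 2) array of (x, y) points
        rr, cc = polygon(b[:, 1], b[:, 0], shape=(img_h, img_w))
        img[rr, cc] = 0.0                             # black target interior
    if illum is not None:                             # user-defined illumination field L_t
        img = img + illum(img_h, img_w)
    if sigma_noise > 0:                               # Gaussian random noise G_t
        img = img + rng.normal(0.0, sigma_noise, size=img.shape)
    return np.clip(img, 0, 255).astype(np.uint8)
```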
4. Performance Evaluation with the Simulator
The synthetic images generated by the simulator are used as inputs to the multi-target tracking algorithm under test, and the recorded system states $S_t$ are used as the ground truth for computing the performance metrics of the algorithm.
If the algorithm being tested is a tracking-by-detection algorithm [14] [15] [22], target detections are first obtained for all image frames, either by taking the target boundaries from the true system state $S_t$ or by choosing and running an existing target detection algorithm on the simulated images. A tracking-by-detection algorithm associates the detections with target identities via one-to-one, one-to-many or many-to-one associations and outputs a data association $A_t$ for every time frame $t$, a two-column matrix with the detection identifier in the first column and the corresponding target identifier in the second column of each row. The output association can be compared with the ground-truth association generated from $S_t$; in particular, each target location $(x_{i,t}, y_{i,t})$ and the merge state matrix $M_t$ are used. The popular performance metrics for this type of algorithm are the accuracy of the one-to-one, one-to-many (split) and many-to-one (merge) data associations in terms of the false positive rate (FPR) and false negative rate (FNR), defined by the following equations [52]:
$$\mathrm{FPR} = \frac{FP}{FP + TN}, \qquad (1)$$
$$\mathrm{FNR} = \frac{FN}{FN + TP}, \qquad (2)$$
where $FP$ is the number of false positives, $FN$ is the number of false negatives, $TN$ is the number of true negatives, and $TP$ is the number of true positives. The false positive and false negative rates are obtained by comparing $A_t$ with the ground truth $S_t$: $M_t$ contains all merge and split information and the target locations $(x_{i,t}, y_{i,t})$ contain the trace of each individual target, so all one-to-one, one-to-many and many-to-one associations can be extracted from the ground truth and compared directly with $A_t$ to compute the FPR and FNR.
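A small helper for the two rates, given the counts of true and false positives and negatives; this is just Equations (1) and (2) written out, independent of how the associations are extracted.

```python
def fpr_fnr(tp, fp, tn, fn):
    """False positive rate (Eq. 1) and false negative rate (Eq. 2)."""
    fpr = fp / (fp + tn) if (fp + tn) > 0 else 0.0
    fnr = fn / (fn + tp) if (fn + tp) > 0 else 0.0
    return fpr, fnr
```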
If the algorithm being tested is a filtering-based algorithm that estimates each target's location at each time step [27] [28], its estimates can be compared with the ground-truth locations $(x_{i,t}, y_{i,t})$ to compute the popular CLEAR MOT metrics [53], the multiple object tracking precision (MOTP) and the multiple object tracking accuracy (MOTA). The MOTP measures the overall accuracy of the target location estimates and is computed by
$$\mathrm{MOTP} = \frac{\sum_{i,t} d_{i,t}}{\sum_{t} c_t}, \qquad (3)$$
where $d_{i,t}$ is the Euclidean distance between the location estimate and the true location of target $i$ at time $t$, and $c_t$ is the number of cases in which an estimate falls within a small range of the true target location at time $t$. The MOTA is
$$\mathrm{MOTA} = 1 - \frac{\sum_{t} (FN_t + FP_t + MM_t)}{\sum_{t} g_t}, \qquad (4)$$
where $FN_t$, $FP_t$, and $MM_t$ are the numbers of false negatives, false positives, and mismatches at time $t$, respectively, and $g_t$ is the total number of ground-truth targets at time $t$.
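A sketch of the two metrics as defined in Equations (3) and (4), assuming that some matching procedure has already produced the per-frame matched distances and error counts:

```python
import numpy as np

def clear_mot(matched_dists, fn_t, fp_t, mm_t, g_t):
    """MOTP (Eq. 3) and MOTA (Eq. 4) from per-frame matching results (sketch).

    matched_dists: per-frame arrays of distances d_{i,t} for matched targets.
    fn_t, fp_t, mm_t, g_t: per-frame counts of false negatives, false positives,
    mismatches, and ground-truth targets.
    """
    total_dist = sum(float(np.sum(d)) for d in matched_dists)
    total_matches = sum(len(d) for d in matched_dists)
    motp = total_dist / total_matches if total_matches else 0.0
    mota = 1.0 - (sum(fn_t) + sum(fp_t) + sum(mm_t)) / max(sum(g_t), 1)
    return motp, mota
```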
5. Demonstration
In this section, we demonstrate the use of our simulator in generating benchmark data and evaluating multitarget tracking algorithms with the benchmark data.
5.1. Simulate Benchmark Datasets with Our Simulator
For demonstration purposes, we simulated a benchmark dataset with the following input parameters: the total number of image frames $T$; the initial number of targets $n_0$; the target appearance, a circle whose radius has mean $\mu_r$ and standard deviation $\sigma_r$; the spatial distance $d$ between targets in the first frame; the image size; the temporal change in target locations, $\sigma_x$ along the x-direction and $\sigma_y$ along the y-direction; the average number of birth events per time step $\lambda$; the image noise, white noise with variance $\sigma^2$ together with non-uniform illumination defined by a user-specified function; and the probability of split $p$.
Figure 2 shows example images generated with one setting of these input parameters.
5.2. Testing and Evaluating Tracking-by-Detection Algorithms
We demonstrate the use of our simulator to evaluate the multiway data association [22] and the linear programming approach [21]. For the evaluation, we designed the experiment as a $2^3$ factorial design of three variables, the initial distance between the two closest targets ($d$), the target movement speed ($\sigma_x$ and $\sigma_y$), and the target birth rate ($\lambda$), as shown in Table 1. The other simulation inputs were fixed at the same values for all designs.
For each design, we performed ten simulation runs with the same design variables as replicated experiments to reduce random effects on the evaluation outcomes. For each simulation run, we recorded the numbers of merging events, splitting events, target births, and target deaths that occurred during the run. Table 2 shows the average counts over the ten replicated experiments for each design. As observed in Table 2, the complexity of the simulated tracking scenarios changes as we vary the simulation inputs. For example, when we decrease the distance $d$ and/or increase the temporal variation of the target locations, $\sigma_x$ and $\sigma_y$, more merging and splitting events occur. With an increased birth rate $\lambda$, more birth events occur.
Table 1. Design of experiments for evaluating tracking-by-detection algorithms.
Table 2. Average number of target split, merge, birth and death events occurred per experimental run.
For each replication, we input the images generated by our simulator to each of the multiway data association [22] and the linear programming approach [21]. The outcomes of the two methods were compared with the ground truth given by our simulator. As summarized in Section 4, the popular performance metrics for a tracking-by-detection algorithm are the accuracy of the one-to-one, one-to-many (split) and many-to-one (merge) data associations in terms of the false positive rate (FPR) and false negative rate (FNR). The comparison results were summarized by the average FPR and FNR over the ten simulation runs for each design. Table 3 and Table 4 show the comparative results. The multiway data association [22] performed better than the linear programming approach [21] in handling birth events for all experiments. In terms of handling death events, the multiway data association [22] performed better than the linear programming approach [21] except in Designs 4 and 6. The multiway data association [22] detected 2-1 merging and 1-2 splitting events better than the linear programming approach [21] in most designs except Design 1. The linear programming approach [21] failed to track 3-1 merging and 1-3 splitting events, while the multiway data association [22] was able to track them. Overall, the multiway data association [22] performed better than the linear programming approach [21] in detecting all merging and splitting events.
Table 3. The average false positive rate (FPR) and false negative rate (FNR) for the multiway data association [22] and the linear programming approach [21] over ten replicated simulations for Designs 1 through 4.
Table 4. The average false positive rate (FPR) and false negative rate (FNR) for the multiway data association [22] and the linear programming approach [21] over ten replicated simulations for Designs 5 through 8.
5.3. Evaluating Filtering-Based Algorithms
In this section, we use our simulator to evaluate a simple Kalman filtering-based algorithm [1] [54] [55] to see how it performs under different levels of image noise and different numbers of targets. We performed a $2^2$ factorial design of two factors, the image noise variance $\sigma^2$ and the initial number of targets $n_0$, as described in Table 5, while the other simulation inputs were held constant.
For each design, we performed ten simulation runs with the same factor levels. Figure 3 shows some of the generated images under Design 1 and Design 2.
Table 5. Design of experiments for evaluating a Kalman filtering-based algorithm.
Table 6. Average MOTA and MOTP metrics.
Figure 3. Simulated video frame. (a) Design 1; (b) Design 2.
For each experimental run, the Kalman filtering-based algorithm first reads in the simulated images, detects each target in the scene, and finally estimates each target's position with the Kalman filter. We computed the MOTP and MOTA metrics to evaluate the accuracy of the estimation. Table 6 summarizes the average MOTP and MOTA metrics over the ten simulation runs for each design, where higher values imply higher accuracy for both metrics [53]. The performance degradation of the algorithm was significant with the increased level of noise, but the effect of the number of targets on the performance was not significant.
6. Conclusion
We presented a novel simulation model that simulates video frames containing the movements and interactions of multiple targets with fully controlled complexities of target motion, target appearance, image noise level, target birth and death rates, and target merge and split rates. The simulator also generates the ground-truth locations of the targets for the simulated video frames, which makes the simulation model an ideal tool for generating benchmark datasets to evaluate multi-target tracking algorithms. We demonstrated the use of the proposed simulation model to evaluate two tracking-by-detection algorithms and one filtering-based algorithm. In addition, the simulator can easily generate new public benchmark datasets with a high degree of interaction and complexity from the user's inputs.
Acknowledgements
This project was partially supported by the National Science Foundation under grant no. NSF-CMMI-1334012, the Air Force Office of Scientific Research under grant no. FA9550-13-1-0075, and the Florida State University Council on Research and Creativity Planning Grant 036656.
Nomenclature
ETH: Eidgenössische Technische Hochschule
FN: False Negative
FNR: False Negative Rate
FP: False Positive
FPR: False Positive Rate
MCMC: Markov Chain Monte Carlo
MOT: Multiple Object Tracking
MOTA: Multiple Object Tracking Accuracy
MOTP: Multiple Object Tracking Precision
PETS: Performance Evaluation of Tracking and Surveillance
TN: True Negative
TP: True Positive
TUD: Technical University of Darmstadt