Video Shot Boundary Detection Using Normalized Periodogram Distance Metric

Video shot boundary detection is the primary task in content-based video management and retrieval systems. This paper proposes a shot boundary detection strategy that exploits the normalized periodogram to represent video content efficiently. A normalized periodogram based distance metric, namely Distance Left-Right (DLR), is introduced to detect key frames using shot boundaries; it is computed on a sliding sub-window basis. The DLR sequence is used to detect suspected shot boundary frames, and a transition type detection procedure is applied to these suspected frames to discriminate between abrupt and gradual transitions. The proposed methodology yields a precision of 95.02%, recall of 93.15% and F1 score of 94.07% for cuts; a precision of 86.57%, recall of 86.67% and F1 score of 86.61% for gradual transitions; and a precision of 90.6%, recall of 90.02% and F1 score of 90.3% for overall transitions. Experimental results show that the proposed approach is superior to recent shot boundary detection techniques in robustness and simplicity, and that it provides an effective distance metric for detecting shot boundaries.


Introduction
In this internet era, digital video plays a significant role in people's daily lives. Many practical applications such as video retrieval, video surveillance, video content analysis, video indexing and video skimming face a trade-off between complexity and accuracy. The diverse content of video makes video management a challenging task for multimedia researchers. Manual annotation of multimedia data is possible but highly time consuming, which creates the need for automatic vision algorithms to annotate multimedia databases over the Internet. Video Shot Boundary Detection (VSBD) has been widely accepted as a solution to this trade-off and as a tool for the structural analysis of video. Generally, the frames extracted at shot boundaries are few compared to the entire video content, yet they represent the video effectively. A set of frames captured by a single camera is termed a shot. A shot transition can be categorized as cut or smooth, based on the frames involved in the transition, as shown in Figure 1. A cut transition involves a sudden change from one frame to the next, whereas a smooth (gradual) transition spans a sequence of frames produced by editing effects. A gradual transition can be a dissolve, a fade or a wipe. A dissolve involves the very smooth disappearance of the previous content and the gradual appearance of new content in the video. A wipe uses shapes such as a diamond, straight line, star or clock for the frame transition. A fade occurs when the multimedia information disappears into a dark (black) screen. The process of temporally segmenting a video into shots includes three basic steps: 1) frame content representation; 2) similarity/dissimilarity evaluation between frame features; and 3) shot boundary detection [1].
Prior work on shot boundary detection mainly concentrates on abrupt boundary detection, because frames of a sudden transition are easy to detect: the phenomenon involves a great discontinuity between adjacent frames. Most approaches compute a feature dissimilarity measure between adjacent frames and declare a cut transition when the measure exceeds a threshold. Compared to abrupt shot boundary detection (SBD), gradual SBD is complex, as it does not involve a great discontinuity between consecutive frames. Gradual SBD algorithms should also be robust to issues like camera and object motion. The research carried out in VSBD can be categorized into pixel wise, global based, block based and motion activity based techniques. Methods such as [2] [3] use the pixel difference as a common feature; they fail because of the high false alarm rates caused by fast camera operations in a small area. To overcome these drawbacks, global based approaches [4]-[6] have been proposed, which detect the boundary using measures like histogram difference, histogram intersection and weighted histogram difference. Even though global approaches are robust to camera and object motion, they do not detect changes in spatial distribution between two different shots.
Block based approaches have been introduced to improve the SBD accuracy and reduce the computation time. The approaches discussed so far involve features like moment invariants, local feature fusion, entropy, motion vectors, Visual Bag of Words, Edge Change Ratio and feature points. Detecting gradual transitions by training an appropriate model for the corresponding transition has been reported [7]. Cut and fade detection based on the mutual information and joint entropy between consecutive frames [7] is also reported in the literature. Using the edge energy of DC coefficients [8], dissolve shots are detected based on a U-shaped diagram search. Representing the video content directly by gradient and edge based features [9] has been addressed for detecting shot boundaries. In [9], the distribution of variance on the edge information is used to detect dissolve and fade transitions. Edge based shot boundary detection algorithms suffer from poor performance under object and camera motion.
As motion is continuous along a shot, motion is also used as a cue to detect shot changes. Because the camera and objects move smoothly within a shot, the resulting motion field within a shot is continuous. In [10] [11], a block matching algorithm has been proposed that matches a block in the reference frame with all other blocks in the next frame to detect shot boundaries. The performance of these methods depends solely on the thresholding procedure. To overcome this drawback, a multiple feature based cut and gradual detection method that uses fewer thresholds than [11] has been proposed. Though motion based algorithms are computationally expensive, they detect cut transitions easily. One of their major drawbacks is that they can be easily fooled by varying illumination.
Multiple features like pixel wise difference, color histogram and edge histogram are extracted from the video frames and fed as input to a machine learning classifier, such as a support vector machine, for transition classification [8] [12]. An accumulation histogram difference approach that can identify dissolves and fades even under flash lights has been proposed [13]. In recent years, a mutual information and joint entropy based transition detection algorithm [7], which can detect fades and cuts, has also been proposed. A model based shot boundary detection algorithm based on a frame transition parameter [14], an SVD based fast shot boundary detection algorithm [15] and a Walsh Hadamard Transform (WHT) based VSBD technique [16] are among the recent techniques available in the literature.
One of the limitations of the algorithms proposed for VSBD is the lack of a unified approach for detecting all types of transitions in various video streams such as video lectures, news, entertainment shows, sports and movies. Many algorithms proposed for detecting all types of transitions involve a tedious procedure and a high computational cost. Most of the earlier SBD works are evaluated only on benchmark datasets and produce better results at a high computational cost. Hence, the proposed methodology introduces a normalized periodogram distance based Left-Right (LR) ratio to detect abrupt as well as gradual shot boundaries in video, which is efficient and effective in terms of accuracy and computational cost. The main contributions of this work are:
1) A normalized periodogram distance metric based LR ratio is introduced to detect the shots in a given video.
2) The proposed methodology is evaluated on unconstrained videos, including news, entertainment shows, movies and sports, and on the TRECVID 2001 dataset.

Proposed Video Shot Boundary Detection (VSBD) Methodology
This section elaborates the proposed normalized periodogram based $D_{LR}$ metric for detecting abrupt and gradual transitions simultaneously. Given a video, the sequence of frames obtained by partitioning the video is denoted as $\{f_1, f_2, \ldots, f_k\}$. For each frame $f_k$, the power spectrum is estimated using the classical non-parametric periodogram method. The periodograms of the frames can be written as $Per = \{Per_1, Per_2, \ldots, Per_k\}$. Using a suitable sub-window, the normalized periodogram based $D_{LR}$ metric is computed for the feature frames and compared against the statistical threshold $S_{th}$, chosen by trial and error. Frames with a $D_{LR}$ value greater than $S_{th}$ are suspected shot frames. With the suspected frame as centre, a suitable suspected window is selected to decide whether the transition type is abrupt or gradual. The proposed flow graph for VSBD is shown in Figure 2.

Non-Parametric Power Spectrum Estimation
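The periodogram is a non-parametric technique for power spectrum estimation [17]. The periodogram of a random process is the Fourier transform of the autocorrelation of the random sequence. The autocorrelation of an $M \times N$ frame $f$ can be determined by

$$r_f(m, n) = \frac{1}{MN} \sum_{i} \sum_{j} f(i + m, j + n)\, f^{*}(i, j)$$

The periodogram can be written as

$$Per_f(k, l) = \sum_{m} \sum_{n} r_f(m, n)\, e^{-\iota (km + ln)}$$

where $\iota = \sqrt{-1}$. Even though the periodogram is represented using the autocorrelation function, it is necessary to express it in terms of the input frame $f$. Let $f_B(i, j)$ be the pointwise product of $f(i, j)$ and the box window filter $B(i, j)$, which equals 1 inside the $M \times N$ frame and 0 elsewhere:

$$f_B(i, j) = f(i, j)\, B(i, j)$$

The autocorrelation function of $f_B(i, j)$ is

$$r_f(m, n) = \frac{1}{MN}\, f_B(m, n) * f_B^{*}(-m, -n)$$

Using the convolution theorem of the Fourier transform,

$$Per_f(k, l) = \frac{1}{MN}\, F_B(k, l)\, F_B^{*}(k, l) = \frac{1}{MN} \left| F_B(k, l) \right|^2$$

where $F_B(k, l)$ is the Fourier transform of the windowed frame $f_B(i, j)$ of size $M \times N$.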
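As a minimal sketch of this result, the periodogram of a single grayscale frame can be computed directly from its 2-D FFT. The function name `periodogram_2d` and the random test frame are illustrative assumptions, not part of the original method description; the box window is applied implicitly by using the M × N frame as it is.

```python
import numpy as np


def periodogram_2d(frame):
    """Periodogram of a grayscale frame: |F_B(k, l)|^2 / (M * N)."""
    f = np.asarray(frame, dtype=np.float64)
    M, N = f.shape
    F = np.fft.fft2(f)  # box window: the frame is used unchanged
    return (np.abs(F) ** 2) / (M * N)


# Illustrative frame (random values stand in for real pixel data).
rng = np.random.default_rng(0)
frame = rng.random((48, 64))
per = periodogram_2d(frame)

# Sanity check via Parseval's relation: the periodogram sums to the
# total energy of the frame.
energy_match = np.isclose(per.sum(), (frame ** 2).sum())
```

The Parseval check is a convenient way to confirm the 1/(MN) scaling, since the sum of the periodogram must equal the sum of squared pixel values.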

Properties of Periodogram
The previous section shows that the periodogram is directly proportional to the squared magnitude of the Fourier transform and is very simple to compute. This section summarizes the properties of the periodogram:
1) Bias of the periodogram: The expected value of the periodogram of $f(i, j)$ is the convolution of the power spectrum with the Fourier transform of the Bartlett window, so the periodogram is a biased estimate:

$$E\{Per_f(k, l)\} \propto P_f(k, l) * W_B(k, l)$$

where $P_f(k, l)$ is the power spectrum of $f(i, j)$ and $W_B(k, l)$ is the Fourier transform of the Bartlett window.
2) Variance of the periodogram: The variance of the periodogram does not converge to zero as the frame size grows, so the periodogram $Per_f(k, l)$ is not a consistent estimate of the power spectrum. The variance of the periodogram is proportional to the square of the power spectrum of $f(i, j)$:

$$\mathrm{Var}\{Per_f(k, l)\} \propto P_f^{2}(k, l)$$
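Property 2 can be illustrated numerically. The sketch below is an illustration under stated assumptions, not an experiment from the paper: it uses white Gaussian noise, whose power spectrum is flat, and measures the spread of the periodogram relative to its mean for two frame sizes. The helper names are hypothetical.

```python
import numpy as np


def periodogram_2d(frame):
    """Periodogram |FFT|^2 / (M * N) of a 2-D array."""
    f = np.asarray(frame, dtype=np.float64)
    return np.abs(np.fft.fft2(f)) ** 2 / f.size


def relative_spread(size, seed=0):
    """Standard deviation of the periodogram divided by its mean."""
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal((size, size))  # flat power spectrum
    per = periodogram_2d(noise)
    return per.std() / per.mean()


# If the periodogram were a consistent estimate, the spread would
# shrink for the larger frame; instead it stays of the same order.
spread_small = relative_spread(32)
spread_large = relative_spread(128)
```

For both sizes the spread stays close to the mean spectral level, which is consistent with a variance that scales with the square of the power spectrum rather than with the amount of data.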

DLR Metric Computation and Statistical Threshold Selection
A normalized periodogram distance, a periodogram based metric for shot boundary classification, is detailed in this section. Consider the power spectral estimates of two frames $x$ and $y$ as $Per_x(k, l)$ and $Per_y(k, l)$. The periodogram distance between frames $x$ and $y$ can be written as

$$d(x, y) = \sqrt{\sum_{k} \sum_{l} \left[ Per_x(k, l) - Per_y(k, l) \right]^2}$$

The main intention of using the periodogram in this work is to capture the correlation between frames. Hence the normalized periodogram is sufficient for this objective and is given by

$$NPer_x(k, l) = \frac{Per_x(k, l)}{\hat{\sigma}_x^{2}}$$

Step 1: Select the left "W" frames as sample set "L" and the right "W" frames as sample set "R".
Step 2: Compute the normalized periodogram distance between each sample in L and the centre sample, $D_{LC} = \mathrm{median}(L_j - C)$, where $j$ runs over the frames in the left window and $C$ is the centre NPD frame in the sub-window "W". Similarly, calculate $D_{RC} = \mathrm{median}(R_j - C)$.
Step 3: The same process is repeated for all k frames in the video.

The obtained $D_{LR}$ metric is compared against the statistical threshold $S_{th}$. Frames with a $D_{LR}$ value greater than $S_{th}$ are termed suspected frames. These suspected frames are given as input to the Transition Type Identification Procedure (TTIP), which is detailed in Algorithm 1. The flow graph of the $D_{LR}$ metric computation followed by TTIP is shown in Figure 3. After detecting shot boundaries using the proposed methodology, key frames are extracted based on [18].
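The sub-window computation in Steps 1 to 3 can be sketched as follows. Several details are assumptions made only for illustration, because the text does not spell them out: the distance is taken as the Euclidean distance between log normalized periodograms, $D_{LR}$ is taken as the ratio $D_{LC} / D_{RC}$ (suggested by the term "LR ratio"), and the threshold is taken as the mean plus α times the standard deviation of the $D_{LR}$ sequence. All function names and the synthetic two-shot video are hypothetical.

```python
import numpy as np

EPS = 1e-12  # guards against log(0) and division by zero


def normalized_periodogram(frame):
    """Periodogram |FFT|^2 / (M * N) divided by the frame variance."""
    f = np.asarray(frame, dtype=np.float64)
    per = np.abs(np.fft.fft2(f)) ** 2 / f.size
    return per / (f.var() + EPS)


def npd(nper_x, nper_y):
    """ASSUMED form: Euclidean distance of log normalized periodograms."""
    diff = np.log(nper_x + EPS) - np.log(nper_y + EPS)
    return float(np.sqrt(np.sum(diff ** 2)))


def dlr_sequence(frames, W):
    """ASSUMED D_LR: left median distance over right median distance."""
    npers = [normalized_periodogram(f) for f in frames]
    n = len(npers)
    dlr = np.ones(n)  # neutral value where the 2W + 1 window does not fit
    for c in range(W, n - W):
        d_lc = np.median([npd(npers[j], npers[c]) for j in range(c - W, c)])
        d_rc = np.median([npd(npers[j], npers[c])
                          for j in range(c + 1, c + W + 1)])
        dlr[c] = d_lc / (d_rc + EPS)
    return dlr


def suspected_frames(dlr, alpha=2.0):
    """ASSUMED threshold: mean + alpha * standard deviation of D_LR."""
    s_th = dlr.mean() + alpha * dlr.std()
    return np.flatnonzero(dlr > s_th), s_th


# Synthetic video: two noisy "shots" with a cut at frame 30.
rng = np.random.default_rng(0)
shot_a, shot_b = rng.random((32, 32)), rng.random((32, 32))
frames = [shot_a + 0.01 * rng.standard_normal((32, 32)) for _ in range(30)]
frames += [shot_b + 0.01 * rng.standard_normal((32, 32)) for _ in range(30)]

dlr = dlr_sequence(frames, W=5)
suspects, s_th = suspected_frames(dlr, alpha=2.0)
```

Under these assumptions the ratio stays near 1 inside a shot and rises sharply at the first frames of the new shot. Because the median is taken over W frames, the frames immediately after the cut can also exceed the threshold, which is one reason a subsequent peak selection and transition type step is useful.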

Experimental Results and Performance Evaluation
This section presents the evaluation of the proposed method against existing methodologies for shot boundary detection. Experiments are carried out in MATLAB 8.5 on a Dell Core i3 system. The test dataset, the evaluation measures and the performance of the proposed methodology relative to state-of-the-art methods are detailed below.

Description of Test Videos
To evaluate the performance of the proposed approach, various test videos are downloaded from OPEN VIDEO [19] and YouTube. The test videos include entertainment shows, songs, movies, sports and news videos, and contain both abrupt and gradual transitions. VID1-VID11 are the videos collected from YouTube and OPEN VIDEO, with varied lighting effects and camera motion. The benchmark dataset TRECVID 2001 [20], which is widely used for VSBD purposes, is also evaluated using the proposed methodology. Details of the test videos, such as the number of frames, the duration and the number of shots, are shown in Table 1.

Parameter Selection
The parameters that need to be set in the proposed method are "W", "Z", "α" and "α1". The sub-window size "W" is varied from 5 to 25 in steps of 5 and tested on 5000 frames of the TRECVID data, as shown in Figure 4.
For W = 5, though the precision measure is fair, more false hits occur. For W = 10, the precision and recall measures show a small improvement, but false hits remain. The precision and recall values at W = 15 improve on those at W = 10. Better performance is achieved at W = 20. The value of "α" in (12) is set to 2 by trial and error, and it varies for different videos. The window size "Z", required to determine the type of shot, is chosen as 20, since the minimum gradual transition duration involves 25 frames.

Performance Evaluation
To evaluate the proposed $D_{LR}$ based methodology, the benchmark video dataset TRECVID 2001 is used to select the key frames from the video. The evaluation metrics, precision and recall, are computed using

$$\text{Precision} = \frac{N_C}{N_C + N_F}, \qquad \text{Recall} = \frac{N_C}{N_C + N_M}$$

where $N_C$, $N_F$ and $N_M$ denote the numbers of correctly detected, falsely detected and missed shot boundaries, respectively. As illustrated in Table 2, the performance of the proposed method is compared with a recent methodology, the Walsh Hadamard Transform (WHT) based VSBD process [16]. The proposed approach shows better performance on VID12, VID13 and VID15. Its poor performance on VID14 is due to drastic camera motion. With W = 20, Z = 20, α = 2 and α1 = 20, the proposed approach is also verified on the user collected database, which includes entertainment, news and sports videos. For VID1-VID8 the proposed approach produces strong results, but VID9 includes fast object movement, which the $D_{LR}$ metric mistakes for a shot boundary. On the Core i3 system, the time taken to process consecutive frames using [16] is six times greater than that of the proposed approach. Table 3 depicts the performance of the proposed algorithm on VID1-VID9. Hence, the periodogram based technique is simple and efficient for detecting shot boundaries in diverse video content.
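The evaluation measures can be computed as in the short sketch below. The function names and the example counts are illustrative assumptions; the F1 score is the standard harmonic mean of precision and recall, which is the measure quoted alongside precision and recall in the abstract.

```python
def precision_recall_f1(n_correct, n_false, n_missed):
    """Precision, recall and F1 from detection counts (as fractions)."""
    if n_correct <= 0:
        raise ValueError("at least one correct detection is required")
    precision = n_correct / (n_correct + n_false)
    recall = n_correct / (n_correct + n_missed)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def f1_from_percent(precision_pct, recall_pct):
    """F1 score (in percent) from precision and recall in percent."""
    return 2 * precision_pct * recall_pct / (precision_pct + recall_pct)


# Hypothetical counts: 90 correct, 5 false and 10 missed boundaries.
p, r, f1 = precision_recall_f1(90, 5, 10)

# Consistency check against the cut-transition figures in the abstract.
f1_cut = f1_from_percent(95.02, 93.15)
```

With the reported cut precision of 95.02% and recall of 93.15%, the harmonic mean comes out at roughly 94.08%, in line with the reported F1 score of 94.07% up to rounding.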

Conclusion
A robust and efficient technique for detecting abrupt and gradual shots in a video is presented. The power spectrum is estimated for the video frames and, using a suitable window size, the $D_{LR}$ metric is evaluated on the spectral features extracted from the frames. Suspected video frames are detected by applying a statistical threshold to the computed $D_{LR}$ metric, and a transition type detection procedure is used to classify abrupt and gradual transitions. The proposed periodogram based $D_{LR}$ metric thus shows promising performance on constrained and unconstrained video data for detecting shot boundaries. The proposed method fails under some drastic camera and object movement conditions, which can be improved by including motion features.

Figure 2. Proposed flow graph for key frame extraction.
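Here $\hat{\sigma}_x^{2}$ is the variance estimate of frame $x$, used to normalize its periodogram. The normalized periodogram distance is written as

$$NPD(x, y) = \sqrt{\sum_{k} \sum_{l} \left[ \log NPer_x(k, l) - \log NPer_y(k, l) \right]^2}$$

where $NPer_x$ denotes the normalized periodogram of frame $x$. From property 2 of the periodogram, the variance of the periodogram is proportional to the spectral value, so it is meaningful to use the logarithm of the normalized periodogram. The normalized periodogram distance satisfies the basic properties of a metric, including Property 1, the symmetry property, $Dist(a, b) = Dist(b, a)$, and the triangle inequality, $Dist(a, b) \le Dist(a, c) + Dist(c, b)$. With the knowledge of the Normalized Periodogram Distance (NPD) between consecutive frames, a sub-window of size 2W + 1 is selected for the $D_{LR}$ metric computation, as described in Steps 1 to 3 of the $D_{LR}$ metric computation.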

Figure 3. Proposed flow graph for $D_{LR}$ and TTIP.

Algorithm 1: Transition Type Identification Procedure
Input: Peaks of the $D_{LR}$ metric of suspected frames
Output: cut/gradual
Step 1: For each peak of the $D_{LR}$ metric, choose a window of size 2Z + 1 with the peak value as centre. Select the left "Z" $D_{LR}$ values as sample set $L_Z$ and the right "Z" $D_{LR}$ values as sample set $R_Z$.
Step 2: $T_{DLR} = \min(R_Z) - \min(L_Z)$.
Step 3: If $T_{DLR} > \alpha_1$, the suspected frame is gradual; else the suspected frame is cut.
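Algorithm 1 translates almost line by line into code. The sketch below is illustrative: the function name `transition_type`, the truncation of the window at the ends of the sequence, and the synthetic $D_{LR}$ sequences are assumptions; the defaults Z = 20 and α1 = 20 are the values reported in the experiments.

```python
import numpy as np


def transition_type(dlr, peak, Z=20, alpha1=20.0):
    """Classify a suspected D_LR peak as 'cut' or 'gradual'."""
    dlr = np.asarray(dlr, dtype=np.float64)
    # Step 1: left and right sample sets around the peak
    # (truncated at the sequence boundaries, an assumption).
    left = dlr[max(0, peak - Z):peak]
    right = dlr[peak + 1:peak + 1 + Z]
    if left.size == 0 or right.size == 0:
        raise ValueError("peak is too close to the sequence boundary")
    # Step 2: difference of the window minima.
    t_dlr = right.min() - left.min()
    # Step 3: threshold on T_DLR.
    return "gradual" if t_dlr > alpha1 else "cut"


# Synthetic D_LR sequences (not measured data).
isolated = np.ones(100)
isolated[50] = 60.0            # single sharp peak

sustained = np.ones(100)
sustained[50:75] = 40.0        # metric stays high for 25 frames

label_isolated = transition_type(isolated, peak=50)    # "cut"
label_sustained = transition_type(sustained, peak=50)  # "gradual"
```

The isolated peak gives $T_{DLR} = 0$ because both windows fall back to the baseline, while the sustained elevation keeps the right-window minimum high and yields $T_{DLR} = 39 > \alpha_1$.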

Table 1. Video test data description.

Table 2. Performance comparison with recent methodologies.

Table 3. Performance of proposed $D_{LR}$ metric.