^{1}

^{1}

^{2}

Aiming at the problem of video key frame extraction, a density peak clustering algorithm is proposed, which uses the HSV histogram to transform high-dimensional abstract video image data into quantifiable low-dimensional data, and reduces the computational complexity while capturing image features. On this basis, the density peak clustering algorithm is used to cluster these low-dimensional data and find the cluster centers. Combining the clustering results, the final key frames are obtained. A large number of key frame extraction experiments for different types of videos show that the algorithm can extract different number of key frames by combining video content, overcome the shortcoming of traditional key frame extraction algorithm which can only extract a fixed number of key frames, and the extracted key frames can represent the main content of video accurately.

With the rapid development of multimedia and Internet technology, we are in the era of data explosion. At every moment, there are a lot of data generating such as video, text, images, blogs in all walks of life. Vivid digital video has gradually replaced the monotonous text information, which has become one of the main ways for people to spread information. Either personalized recommendation or content-based video retrieval, it is difficult to analyze a large amount of video data. Key frames are video pictures that can represent the main content of video simply and effectively, they provide a suitable abstraction and framework for video indexing, browsing and retrieval. The use of key frames greatly reduces the amount of data required in video browsing and provides an organizational framework for dealing with video content [

Clustering algorithm is the process of dividing a set of data objects into multiple groups or clusters, which makes the objects in the same cluster have high similarity and the similarity of objects between different clusters is extremely low. From the point of view of pattern recognition, clustering is the discovery of potential patterns in data, helping people to group and classify them to achieve a better understanding of the distribution of data. As a kind of data mining tool, clustering analysis has been widely used in many fields such as biology, information security, intelligent business and web searching. Different clustering algorithms are based on different assumptions and data types, and each clustering algorithm has its limitations and biases. The choice of clustering algorithm often depends on the type of data and the purpose of clustering. For example, some clustering algorithms may work better on one application scenario, but not in another. Clustering algorithm is used to extract video key frames, and the frame images with high similarity in the video are clustered into one class, and these cluster centers are key frames.

Density-based clustering method classifies areas with sufficient high density into clusters, looking for high-density areas separated by low-density areas, and clusters with arbitrary shape can be easily obtained. The density peak clustering algorithm DPCA (clustering by fast search and find of density peaks) [

In the field of image processing and image retrieval, how to extract effective features from image content has become the most concerned issue. The color feature is one of the most significant visual features that widely used in the field of image processing, the main reason is that color is often closely relevant to the object or scene contained in the image. In comparison with other visual features, the color feature has less dependence on the size, direction and perspective of image and also has higher robustness. The earliest example of image retrieval making use of color is a retrieval algorithm based on global color histogram proposed by Swain and Ballard. The retrieval process based on color histogram involves the selection of image color space, the quantization of color space, the definition of color histogram and the calculation of similarity distance between histograms [

According to the tricolor theory, the human eye is more sensitive to red, green and blue, and the majority of colors can be synthesized by different proportions of red, green and blue. The RGB color space is shown in

RGB color model is the most commonly used color model in image processing. As far as editing images are concerned, it is the best color model. Its physical meaning is clear and suitable for the work of color kinescope, but it does not adapt to the visual characteristics of human beings and does not conform to the visual judgment of human eyes on color. For a color, the human eye is most concerned about its chroma, depth, brightness, and synthesizes three parameters to evaluate the color. People without professional knowledge of color cannot directly judge these colors by RGB value, so RGB color space is not in line with people’s perception of color psychology [

HSV color space is a color model oriented to visual perception, in which the color perception of human eye mainly includes three elements: hue, saturation and value [

Color histogram is a widely used color feature in image processing, which describes the proportion of different colors in the entire image, and does not care about the spatial location of each color [

Choosing the appropriate number of color cells (i.e. bin of histogram) and color quantization methods are related to the performance and efficiency requirements of specific applications. In general, the more color intervals there are, the stronger the histogram’s ability to distinguish colors. However, color histograms with many intervals not only increase the computational burden, but are also not conducive to indexing in large image database. Moreover, for some applications, the use of very fine color space partitioning method may not necessarily improve the retrieval effect, especially for those applications that cannot tolerate the omission of relevant images.

In the HSV color model, it is necessary to draw the histogram of its three components (H, S, V) separately, and when there are quite a few colors in the picture, the dimension of each histogram will be higher, so the HSV color space needs quantifying first. According to the characteristics of HSV color model, the following treatments are made in this study:

1) Considering the human visual resolution ability, the hue H component is divided into 12 parts, and the saturation S and value V components are divided into 5 equal parts.

2) Considering the value range of each component and the subjective color perception, the following quantization is performed.

H = { 0 H ∈ [ 346 , 15 ] 1 H ∈ [ 16 , 45 ] 2 H ∈ [ 46 , 75 ] 3 H ∈ [ 76 , 105 ] 4 H ∈ [ 106 , 135 ] 5 H ∈ [ 136 , 165 ] 6 H ∈ [ 166 , 196 ] 7 H ∈ [ 196 , 225 ] 8 H ∈ [ 226 , 255 ] 9 H ∈ [ 256 , 285 ] 10 H ∈ [ 286 , 315 ] 11 H ∈ [ 316 , 345 ] , S = { 0 H ∈ [ 0 , 0.2 ] 1 H ∈ [ 0.2 , 0.4 ] 2 H ∈ [ 0.4 , 0.6 ] 3 H ∈ [ 0.6 , 0.8 ] 4 H ∈ [ 0.8 , 1 ] , V = { 0 H ∈ [ 0 , 0.2 ] 1 H ∈ [ 0.2 , 0.4 ] 2 H ∈ [ 0.4 , 0.6 ] 3 H ∈ [ 0.6 , 0.8 ] 4 H ∈ [ 0.8 , 1 ]

3) Based on the perceptual characteristics of human eyes to color, that is, the sensitivity of human eyes to the H component is greater than the S component, the sensitivity to the S component is greater than the V component, and then these three color components are merged into one-dimensional feature vectors, as shown in Equation (1).

F = 5 H + 3 S + 2 V (1)

Therefore, the value range of F is 0 - 75. As shown in

The main idea of density clustering algorithm is to find high density regions separated by low density regions. DPCA, a density peak clustering algorithm, can use visualization to help find clusters with different densities. It requires that each cluster has a maximum density point as the cluster center, each cluster center attracts and connects the points with lower density around it, and different cluster centers are relatively far away [

Every data object has 76-dimensional attribute values, which can be expressed as x i = { x i 1 , ⋯ , x i d , ⋯ , x i D } (that D = 76). The distance between sample point x i and x j is calculated by Euclidean distance, as shown in Equation (2).

d i s t ( x i , x j ) = ∑ d = 1 D ( x i d − x j d ) 2 (2)

The local density ρ i of the data object x i is defined as follows:

ρ i = ∑ x j ∈ U χ ( d i s t ( x i , x j ) − d i s t c u t o f f ) (3)

where χ ( x ) is an indicator function, which is defined as follows:

χ ( x ) = { 1 x ≤ 0 0 x > 0

The d i s t c u t o f f item indicates the cutoff distance, and in literature [

The distance δ i from higher density point of data object x i is defined as follows:

δ i = { max x j ∈ U , j ≠ i ( d i s t ( x i , x j ) ) ρ i ≥ ∀ ρ j min j : ρ j > ρ i ( d i s t ( x i , x j ) ) otherwise (4)

When the local density ρ i of data object x i is the global maximum value, the relative distance δ i is the maximum distance between any other data object x j and x i . Otherwise, some data objects x j whose local density is greater than ρ i are found, and the relative distance δ i is the minimum distance between data objects x j and x i .

It can be seen that DPCA aims to find data objects with large local density and relative distance as cluster centers. These cluster centers attract and connect the points with low density around them, and they are relatively far away from each other. The local density ρ i and relative distance δ i of each data object x i are calculated, and a two-dimensional decision map is generated based on these two attribute values, where the horizontal axis is the local density ρ i and the vertical axis is the relative distance δ i . Some data points in the upper-right corner of the decision map can represent different cluster centers because of their high local density and large relative distance from other clusters.

Open CV, an open source computer vision library, is used to read a 511 frame test video and convert each frame of the video from the RGB color space to the HSV color space. According to the HSV histogram quantization formula, the quantized values of each channel are calculated and merged into HSV color level F on one channel according to Equation (1) ( F ∈ [ 0 , 75 ] ).Then the number of pixels appearing on each HSV color level F is counted based on the HSV histogram method, and each frame of the video is converted into a HSV histogram, which is expressed as a 76-dimensional eigenvector in numerical form, that is x i = { x i 1 , ⋯ , x i d , ⋯ , x i D } (D = 76).

Each frame of the video has been converted into a 76-dimensional feature vector, so the size of the input data is 511 × 76 . The distance of any two sample points is calculated according to the distance metric Equation (2) and stored in the distance matrix M, which is a 511 × 511 symmetric matrix. The value on the diagonal line of matrix M is all zero, M [ i , j ] corresponds to the distance between data object x i and x j , and M [ i , j ] = M [ j , i ] .

The empirical parameter t ∈ [ 1 % ∼ 2 % ] , and some experiments has been carried out at t = 1 % and t = 2 % respectively. The distance of any two data objects in dataset U is calculated and sorted incrementally, the value of d i s t c u t o f f takes the numeric value at the t position in the incremental sequence. Therefore, the larger t is, the larger the cutoff distance d c u t o f f is, and the greater the local density ρ i of data object x i is. The experimental results are shown in

Next, the local density ρ i of each data object x i is calculated using the density calculation Equation (3), and the relative distance δ i of each data object x i is calculated with the relative distance calculation Equation (4). Finally, the decision map is generated.

t | d c u t o f f |
---|---|

1% | 1766.30 |

1.5% | 2419.76 |

2% | 3196.17 |

When t takes different values, the local density and relative distance of each sample point and the distribution of sample points on decision maps will be different. In order to make the local density of data sample large, t = 2% was adopted as the experimental scheme in this study. In

Next, look at the 128 and 249 sample points, which have the potential to serve as cluster centers and are also recorded as key frames of the video. Finally, five key frames are obtained. In the decision map, sample points with the density of 0 on the horizontal axis are noise points or outliers, and there are no similar sample points around them. Therefore, these off-group points can be directly ignored in searching for cluster centers. The experimental results are shown in

Frame | ρ i | δ i |
---|---|---|

16 | 12 | 73,045.71 |

128 | 8 | 39,697.45 |

249 | 5 | 49,962.86 |

306 | 20 | 53,948.63 |

484 | 54 | 127,402.09 |

Since this study is mainly to find the key frames of video, it is not concerned about which sample points are included in every cluster, so it is only necessary to examine whether each sample point has potential to be a cluster center. In order to prevent the influence of different attributes during the experiment, the normalization of local density ρ i and relative distance δ i can be considered. The normalized decision map is shown in

Aiming at the problem of video key frame extraction, this paper proposes a density peak clustering algorithm, which uses the HSV histogram to transform high-dimensional abstract video image data into quantifiable two-dimensional input matrix. In fact, the video key frame is a relatively subjective concept. Extracting key frames with the density peak clustering algorithm can combine the characteristics of video content well. The extracted key frames can better represent the main content of video, they have low redundancy, good noise resistance, and can form clusters with arbitrary shape without the need to set up the initial parameters artificially.

This work was supported by the Natural Science Foundation of China under Grant Nos. 51668043 and 61262016 and the Gansu Natural Science Foundation of China under Grant No.18JR3RA156.

The authors declare no conflicts of interest regarding the publication of this paper.

Zhao, H., Wang, T. and Zeng, X.Y. (2018) A Clustering Algorithm for Key Frame Extraction Based on Density Peak. Journal of Computer and Communications, 6, 118-128. https://doi.org/10.4236/jcc.2018.612012