A Clustering Algorithm for Key Frame Extraction Based on Density Peak

Aiming at the problem of video key frame extraction, a density peak clustering algorithm is proposed, which uses the HSV histogram to transform high-dimensional abstract video image data into quantifiable low-dimensional data, and reduces the computational complexity while capturing image features. On this basis, the density peak clustering algorithm is used to cluster these low-dimensional data and find the cluster centers. Combining the clustering results, the final key frames are obtained. A large number of key frame extraction experiments for different types of videos show that the algorithm can extract different number of key frames by combining video content, overcome the shortcoming of traditional key frame extraction algorithm which can only extract a fixed number of key frames, and the extracted key frames can represent the main content of video accurately.

greatly reduces the amount of data required in video browsing and provides an organizational framework for dealing with video content [1]. Key frame extraction has been recognized as one of the important research issues in video information retrieval [2].
Clustering algorithm is the process of dividing a set of data objects into multiple groups or clusters, which makes the objects in the same cluster have high similarity and the similarity of objects between different clusters is extremely low. From the point of view of pattern recognition, clustering is the discovery of potential patterns in data, helping people to group and classify them to achieve a better understanding of the distribution of data. As a kind of data mining tool, clustering analysis has been widely used in many fields such as biology, information security, intelligent business and web searching. Different clustering algorithms are based on different assumptions and data types, and each clustering algorithm has its limitations and biases. The choice of clustering algorithm often depends on the type of data and the purpose of clustering. For example, some clustering algorithms may work better on one application scenario, but not in another. Clustering algorithm is used to extract video key frames, and the frame images with high similarity in the video are clustered into one class, and these cluster centers are key frames.
Density-based clustering method classifies areas with sufficient high density into clusters, looking for high-density areas separated by low-density areas, and clusters with arbitrary shape can be easily obtained. The density peak clustering algorithm DPCA (clustering by fast search and find of density peaks) [3] is a new density-based clustering algorithm, which can find clusters with different densities by visualized method, quickly find the density peak points (i.e. cluster centers) of data sets, and efficiently allot sample points and eliminate outliers [4].
In the field of image processing and image retrieval, how to extract effective features from image content has become the most concerned issue. The color feature is one of the most significant visual features that widely used in the field of image processing, the main reason is that color is often closely relevant to the object or scene contained in the image. In comparison with other visual features, the color feature has less dependence on the size, direction and perspective of image and also has higher robustness. The earliest example of image retrieval making use of color is a retrieval algorithm based on global color histogram proposed by Swain and Ballard. The retrieval process based on color histogram involves the selection of image color space, the quantization of color space, the definition of color histogram and the calculation of similarity distance between histograms [5]. Several issues will be involved in extracting key frames by clustering algorithm. Firstly, it is necessary to select the appropriate color space to describe the color features. Secondly, use a certain quantitative method to express the color feature as a vector form. Finally, define a criterion to measure the similarity between images in color [6]. Image clustering is much more complex, because most of the image data is high-dimensional and the amount of data is large. All the image data has to be loaded into memory for calculation, so it is not only computationally large, but also prone to memory leaks. In view of this, this paper proposes a density peak clustering algorithm which combines the characteristics of HSV histogram, uses the HSV histogram to simplify calculation and effectively improves the quality and efficiency of key frame extraction.

RGB Color Model
According to the tricolor theory, the human eye is more sensitive to red, green and blue, and the majority of colors can be synthesized by different proportions of red, green and blue. The RGB color space is shown in Figure 1 is most concerned about its chroma, depth, brightness, and synthesizes three parameters to evaluate the color. People without professional knowledge of color cannot directly judge these colors by RGB value, so RGB color space is not in line with people's perception of color psychology [7]. In addition, the RGB color space is uneven, so the visual difference between two colors cannot be expressed directly by the distance between two color points in the color space [8].

HSV Color Model
HSV color space is a color model oriented to visual perception, in which the color perception of human eye mainly includes three elements: hue, saturation and value [9]. The HSV color model corresponds to a conical subset in the cylindrical coordinate system, as shown in Figure 2. The V-axis represents brightness, the distance from the V-axis represents saturation S, and the angle of rotation around the V-axis represents hue H. The top surface of the cone corresponds to 1 V = , and the color with the maximum brightness and saturation is located on the circumference of the top surface of the cone [10]. Androutsos et al. roughly divided the HSV color space by experiment: the areas with brightness greater than 75% and saturation greater than 20% were bright color areas, the areas with brightness less than 25% were black areas, the areas with brightness greater than 75% and saturation less than 20% were white areas, and the others were color areas [11]. The HSV model is similar to the painter's method of color matching. By changing the color intensity and depth, different tones of a pure color can be obtained. That is, adding white to a pure color to change the color intensity and adding black to change the color depth. It can be seen that the three elements of hue, saturation and brightness in the HSV color space have a clear structure, are easy to understand and closely related to the way people feel the color. In order to capture the features of video frames better, this paper uses HSV color model to carry out subsequent experimental analysis.

HSV Histogram
Color histogram is a widely used color feature in image processing, which describes the proportion of different colors in the entire image, and does not care about the spatial location of each color [12]. As shown in Figure 3, the gray histogram is to count all the pixels in the image and get a unified concept of the overall gray level. Among them, the horizontal axis represents the grayscale value (generally taken 0 -255), and the vertical axis is the number of pixels corresponding to each gray value in the image.  Choosing the appropriate number of color cells (i.e. bin of histogram) and color quantization methods are related to the performance and efficiency requirements of specific applications. In general, the more color intervals there are, the stronger the histogram's ability to distinguish colors. However, color histograms with many intervals not only increase the computational burden, but are also not conducive to indexing in large image database. Moreover, for some applications, the use of very fine color space partitioning method may not necessarily improve the retrieval effect, especially for those applications that cannot tolerate the omission of relevant images.
In the HSV color model, it is necessary to draw the histogram of its three components (H, S, V) separately, and when there are quite a few colors in the picture, the dimension of each histogram will be higher, so the HSV color space needs quantifying first. According to the characteristics of HSV color model, the following treatments are made in this study:  3) Based on the perceptual characteristics of human eyes to color, that is, the sensitivity of human eyes to the H component is greater than the S component, the sensitivity to the S component is greater than the V component, and then these three color components are merged into one-dimensional feature vectors, as shown in Equation (1).
Therefore, the value range of F is 0 -75. As shown in Figure 4, a frame of video is converted into a histogram of 76 bin, in which the horizontal axis represents 76 dimensions of one-dimensional feature vector F and the vertical axis represents the number of pixels appearing on each dimension in an image.

Density Peak Clustering Algorithm
The main idea of density clustering algorithm is to find high density regions separated by low density regions. DPCA, a density peak clustering algorithm, can use visualization to help find clusters with different densities. It requires that each cluster has a maximum density point as the cluster center, each cluster center attracts and connects the points with lower density around it, and different cluster centers are relatively far away [2]. That is, the density peak clustering algorithm is based on two assumptions: 1) the density of cluster centers is greater than that of their neighbors, and 2) the distance between different cluster centers and the higher density point is relatively large. Therefore, there are two main quantities that need to be calculated: local density i ρ and distance from higher density points i δ .

Distance Metric
Every data object has 76-dimensional attribute values, which can be expressed as (that D = 76). The distance between sample point i x and j x is calculated by Euclidean distance, as shown in Equation (2).

Local Density
The local density i ρ of the data object i x is defined as follows: where ( ) x χ is an indicator function, which is defined as follows: item indicates the cutoff distance, and in literature [2], it is pointed out that the value range of empirical parameter . The distance between any two data objects in dataset U is calculated and sorted incrementally, the value of cutoff dist takes the numeric value at the t position in the incremental sequence. The local density formula describes that the local density i ρ of each data object i x is equal to the number of data points where the distance from the object i x is less than the cutoff distance cutoff dist .

Distance from Higher Density Points
The distance i δ from higher density point of data object i x is defined as follows: When the local density i ρ of data object i x is the global maximum value, the relative distance i δ is the maximum distance between any other data object j x and i x . Otherwise, some data objects j x whose local density is greater than i ρ are found, and the relative distance i δ is the minimum distance between data objects j x and i x . It can be seen that DPCA aims to find data objects with large local density and relative distance as cluster centers. These cluster centers attract and connect the points with low density around them, and they are relatively far away from each other. The local density i ρ and relative distance i δ of each data object i x are calculated, and a two-dimensional decision map is generated based on these two attribute values, where the horizontal axis is the local density i ρ and the vertical axis is the relative distance i δ . Some data points in the upper-right corner of the decision map can represent different cluster centers because of their high local density and large relative distance from other clusters.  ).Then the number of pixels appearing on each HSV color level F is counted based on the HSV histogram method, and each frame of the video is converted into a HSV histogram, which is expressed as a 76-dimensional eigenvector in numerical form, that is

DPCA Clustering Process
Each frame of the video has been converted into a 76-dimensional feature vector, so the size of the input data is 511 76 × . The distance of any two sample points is calculated according to the distance metric Equation (2)  M i j corresponds to the distance between data object i x and j x , and The empirical parameter , and some experiments has been carried out at 1% t = and 2% t = respectively. The distance of any two data objects in dataset U is calculated and sorted incrementally, the value of cutoff dist takes the numeric value at the t position in the incremental sequence. Therefore, the larger t is, the larger the cutoff distance cutoff d is, and the greater the local density i ρ of data object i x is. The experimental results are shown in Table   1.
Next, the local density i ρ of each data object i x is calculated using the density calculation Equation (3), and the relative distance i δ of each data object i x is calculated with the relative distance calculation Equation (4). Finally, the decision map is generated. Figure 5 and Figure 6 are decision maps when t = 1% and t = 2% respectively.   When t takes different values, the local density and relative distance of each sample point and the distribution of sample points on decision maps will be different. In order to make the local density of data sample large, t = 2% was adopted as the experimental scheme in this study. In Figure 6, the sample point 484 in the upper right corner has the maximum local density, which indicates that the number of sample points similar to the sample point 484 in dataset U is the largest. In addition, sample points 16 and 306 have large local density and relative distance, they can also be selected as cluster centers. These three cluster centers can represent different clusters, and their corresponding video frames are also strong representative. Therefore, the 16, 306 and 484 frames of the video are key frames.
Next, look at the 128 and 249 sample points, which have the potential to serve as cluster centers and are also recorded as key frames of the video. Finally, five key frames are obtained. In the decision map, sample points with the density of 0 on the horizontal axis are noise points or outliers, and there are no similar sample points around them. Therefore, these off-group points can be directly ignored in searching for cluster centers. The experimental results are shown in Table 2.   Since this study is mainly to find the key frames of video, it is not concerned about which sample points are included in every cluster, so it is only necessary to examine whether each sample point has potential to be a cluster center. In order to prevent the influence of different attributes during the experiment, the normalization of local density i ρ and relative distance i δ can be considered. The normalized decision map is shown in Figure 7.

Conclusion
Aiming at the problem of video key frame extraction, this paper proposes a density peak clustering algorithm, which uses the HSV histogram to transform high-dimensional abstract video image data into quantifiable two-dimensional input matrix. In fact, the video key frame is a relatively subjective concept. Extracting key frames with the density peak clustering algorithm can combine the characteristics of video content well. The extracted key frames can better represent the main content of video, they have low redundancy, good noise resistance, and can form clusters with arbitrary shape without the need to set up the initial parameters artificially.