
To reduce the computational cost of combining a probabilistic graphical model with a deep neural network for semantic segmentation, a local region conditional random field (LRCRF) model is investigated that selectively applies the conditional random field (CRF) to the most active regions in the image. The fully convolutional network structure is optimized with the ResNet-18 structure and dilated convolution to expand the receptive field. The tracking network is also improved based on SiameseFC by considering the frame relations in consecutive-frame traffic scene maps. Moreover, the segmentation results obtained from greyscale input data sets are more stable and effective than those obtained using RGB images for deep neural network feature extraction. The experimental results show that the proposed method takes advantage of the image features directly and achieves good real-time performance and high segmentation accuracy.

In the past twenty years, deep convolutional neural networks have gradually become a powerful tool for analysing images in various computer vision fields [

At present, semantic image segmentation applications are divided into two main directions [

ineffective region-dependent effect for the two cars to the left of the image. This type of inaccurate judgement of critical objects on the road makes a fully convolutional network difficult to put into use in autonomous driving. The other model combines a fully convolutional neural network with a conditional random field model, which pays more attention to the segmentation effect.

The algorithm uses a conditional random field model to optimize the segmentation result of a fully convolutional neural network [

Conditional random fields are more likely to group pixels with similar locations and colours into the same category. The dependency relationships at the pixel level can be captured by this method, and objects’ boundaries can be seen clearly. However, the mean-field inference process in the conditional random field algorithm is similar to the iteration that occurs in a bilateral filter. The resulting high computational complexity makes real-time operation difficult to achieve.

In the first of the above research directions, the representative networks include the fully convolutional network (FCN), SegNet and Unet, among others [

The second research direction for semantic segmentation combines the conditional random field model into the traditional image segmentation algorithm. The conditional random field is used to obtain the dependency relationships between pixels. The representative network structures in this direction are DeepLab v2 and CRFasRNN [

In autonomous driving, although the representative networks of the first approach can be applied in a real-time semantic segmentation network, they cannot guarantee sufficient segmentation accuracy. The representative networks of the second approach can guarantee sufficient accuracy but cannot be applied in real time. Thus, this paper attempts to change the way the conditional random field is applied to make it suitable for real-time systems. By observing the traffic-scene image, we found that paying too much attention to the accuracy of the segmentation results (e.g., DeepLab, CRFasRNN) is wasteful given the conditions required for real-time monitoring of traffic scenes. Additionally, in traffic scenes, the accuracy improvements after applying the conditional random field are not obvious. In

obtained by only traditional convolutional neural networks. Those areas have a common feature: they are obviously highly distinguishable. For these high-continuity areas, a traditional convolutional neural network can obtain a good segmentation result. For the non-continuous areas, as shown in

To apply a semantic segmentation network to real-time traffic-monitoring, this paper proposes a new semantic segmentation network that is combined with a probabilistic graph model. This approach not only improves the use of the traditional probabilistic graph model but also considers the frame-to-frame relationships. Compared with other semantic segmentation networks, our method has the following advantages [

1) Our method adopts a special way of applying the conditional random field model that first selects the areas that will profit most (the areas with the most important objects, such as an area containing a person, bicycle or motor vehicle) as the areas to be optimized. Then, the conditional random field model is applied in the selected areas. This approach can be referred to as a locally connected conditional random field model. It limits the calculation by concentrating only on the most pertinent areas, which substantially reduces the computational cost.

2) We modify the fully convolutional network structure by replacing the VGG-16 structure in the original DeepLab v2 model with the ResNet-18 structure. Dilated (atrous) convolution is used to expand the receptive field during the convolution process [

3) We use the improved SiameseFC tracking network to capitalise on the strong correlation information between frames [

4) The inputs to SiameseFC are not the usual RGB images but the segmentation result maps, in which each pixel's category is encoded as a grey-level value in the range 0 - 255. This approach makes it possible to effectively track the areas where additional attention is needed. The segmentation result is produced by the joint action of the fully connected convolutional neural network on the grey image and the conditional random field model, which highlights both the spatial and boundary information of target objects. This feature greatly reduces the operational complexity of the feature extraction process. In addition, using grey images avoids the need for massive a priori colour information about the target object during the traditional network tracking process. The features extracted from the grey images through the deep neural network are more stable and more effective than those extracted from RGB images. Therefore, the samples required to train the network can be obtained easily by clipping a few pictures around the current frame target area.
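For illustration only (the paper does not give the exact class-to-grey mapping), one plausible encoding spreads the class indices evenly over the 0 - 255 grey range before the result map is fed to the tracker:

```python
import numpy as np

def labels_to_grey(label_map, num_classes):
    """Map class indices 0..num_classes-1 onto evenly spaced grey levels 0..255.

    Illustrative sketch: the actual encoding used by the authors may differ.
    """
    step = 255 // max(num_classes - 1, 1)
    return (label_map * step).astype(np.uint8)
```

Any injective index-to-grey mapping would serve the same purpose; spreading the levels merely keeps distinct categories visually separable in the tracking input.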

The proposed model consists of two parts, each of which plays an independent role in the semantic segmentation network. The first part is LRCRF, which is described in this paper. During the training process, the DeepLab-Resnet 18 network's segmentation results are refined by the LRCRF. In addition, loss minimization between the output results and the input labels of LRCRF is used to optimize the segmentation network parameters. A flowchart of the LRCRF process is shown in

The LRCRF model proposed in this paper consists of two parts, as described above. One is DeepLab-Resnet 18, and the other establishes a local conditional random field model.

A rough segmentation result is obtained by the DeepLab-Resnet 18 structure, which is derived from DeepLab v2, but the VGG-16 network in DeepLab v2 is replaced with Resnet-18 [

For any input traffic scene image, DeepLab-Resnet 18 is used to obtain a rough segmentation result. From this result, the maximum enclosing rectangles of the pedestrian, bicycle and motor vehicle areas are recorded. These areas are then used as the input to the second structure. The area selection process is shown in
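The selection step above can be sketched as follows. The class indices and helper names are hypothetical, and `crf_refine` stands in for the local CRF inference of the second structure:

```python
import numpy as np

# Hypothetical class indices for the "important" categories.
IMPORTANT_CLASSES = {11: "pedestrian", 12: "bicycle", 13: "motor vehicle"}

def select_active_regions(label_map, classes=IMPORTANT_CLASSES):
    """Return the maximum enclosing rectangle (y0, y1, x0, x1) per class."""
    boxes = {}
    for c in classes:
        ys, xs = np.nonzero(label_map == c)
        if ys.size:  # class present in the coarse segmentation
            boxes[c] = (ys.min(), ys.max() + 1, xs.min(), xs.max() + 1)
    return boxes

def refine_locally(image, label_map, crf_refine):
    """Apply a CRF refinement function only inside the selected boxes."""
    out = label_map.copy()
    for c, (y0, y1, x0, x1) in select_active_regions(label_map).items():
        out[y0:y1, x0:x1] = crf_refine(image[y0:y1, x0:x1],
                                       label_map[y0:y1, x0:x1])
    return out
```

The rest of the image keeps its coarse labels untouched, which is where the computational saving over a full-image CRF comes from.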

The second structure establishes a local area conditional random field model using the following steps. First, the areas recorded above are taken from the original picture as input. For any input area, we treat every pixel as a node; then, we rearrange all the pixels in the region into a vector. Thus, any input area $X = (x_1, x_2, x_3, \cdots, x_N)$ (where $x_i$ is the pixel value of the i-th point and N is the number of pixels in this area) corresponds to an output area $Y = (y_1, y_2, y_3, \cdots, y_N)$ (where $y_i$ is the segmentation result at the i-th point, taking values in the label set $L = (l_1, l_2, l_3, \cdots, l_N)$

where l i is the i-th label category). These input and output areas appear in pairs and are called the Markov random field [

$P(Y|X) = \frac{1}{Z(X)} \exp\left(-E(Y|X)\right)$ (1)

where $E(Y|X)$ describes the variation trend of the random variable Y and is also called the energy function. Here, $Z(X) = \sum_{Y} \exp\left(-E(Y|X)\right)$ is a normalizing factor for the probability value of the potential function. From the above expression, it is obvious that our goal is to evaluate the output Y for which the energy function $E(Y|X)$ is at its minimum. According to the definition of a conditional random field, the expression of the energy function can be described as follows:

$E(Y|X) = \sum_{i} \varphi_u(y_i) + \sum_{i<j} \varphi_p(y_i, y_j)$ (2)

where $\varphi_u(y_i)$ is a unary potential function describing the probability that the i-th pixel point is assigned label $y_i$; that is, it describes the cost of assigning label $y_i$ to the i-th pixel point, and $\varphi_p(y_i, y_j)$ is a binary potential function that describes the cost of assigning a pair of labels to pixels i and j. In this model, the unary potential function is taken from a fully connected convolutional neural network; that is, the predictive value of every pixel label is obtained through a fully connected convolutional network. Because the unary potential function considers neither the smoothness property of the picture nor the dependency relationships between pixels, the binary potential function is designed to compensate for this defect; it performs a picture-smoothing process and encourages assigning positionally adjacent pixels with similar colours to the same label. According to [

$\varphi_p(x_i, x_j) = \mu(x_i, x_j) \sum_{m=1}^{M} w^{(m)} k^{(m)}(f_i, f_j)$ (3)

where $w^{(m)}$ is the weight of the m-th Gaussian kernel; $k^{(m)}$ is the m-th Gaussian kernel (m = 1, ∙∙∙, M), with M the total number of Gaussian kernels; and the method for selecting the Gaussian kernels is the same as that of the binary potential function [

$\varphi_p(x_i, x_j) = \begin{cases} \mu(x_i, x_j)\, w\, k(f_i, f_j), & S < \text{thre} \\ \mu(x_i, x_j) \sum_{m=1}^{M} w^{(m)} k^{(m)}(f_i, f_j), & S \ge \text{thre} \end{cases}$ (4)

where S is the size of the selected area, and the threshold value is pre-set. For specific spatial and colour features, the formula can be described as follows:

$\varphi_p(x_i, x_j) = \begin{cases} \mu(x_i, x_j)\exp\left(-\dfrac{\|I_i - I_j\|^2}{\sigma^2}\right), & S < \text{thre} \\ \mu(x_i, x_j)\left(w^{(1)}\exp\left(-\dfrac{\|I_i - I_j\|^2}{\sigma^2} - \dfrac{\|p_i - p_j\|^2}{\theta^2}\right) + w^{(2)}\exp\left(-\dfrac{\|p_i - p_j\|^2}{\theta^2}\right)\right), & S \ge \text{thre} \end{cases}$ (5)

where $I_i$, $I_j$ represent the colour feature values of the two pixels, $p_i$, $p_j$ represent their spatial position feature values, and $w^{(1)}$, $w^{(2)}$, $\sigma$ and $\theta$ represent the parameters of the mixed Gaussian kernel model obtained through learning.
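A direct, unoptimised NumPy transcription of the two-branch kernel in Equation (5) is sketched below. Practical CRF implementations replace this O(N²) pairwise evaluation with efficient high-dimensional filtering, and the parameter values here are illustrative only:

```python
import numpy as np

def pairwise_potential(I, p, labels, S, thre, w1=1.0, w2=1.0,
                       sigma=10.0, theta=3.0):
    """Sum of binary potentials phi_p over all pixel pairs, as in Eq. (5).

    I: (N,) colour features; p: (N, 2) positions; labels: (N,) assignments.
    Small regions (S < thre) use only the colour kernel; larger regions use
    the appearance + smoothness kernel mixture.
    """
    dI2 = (I[:, None] - I[None, :]) ** 2
    dp2 = ((p[:, None, :] - p[None, :, :]) ** 2).sum(-1)
    mu = (labels[:, None] != labels[None, :]).astype(float)  # Potts compatibility
    if S < thre:
        k = np.exp(-dI2 / sigma**2)
    else:
        k = w1 * np.exp(-dI2 / sigma**2 - dp2 / theta**2) \
            + w2 * np.exp(-dp2 / theta**2)
    i, j = np.triu_indices(len(I), k=1)  # each unordered pair once (i < j)
    return (mu * k)[i, j].sum()
```

With a Potts compatibility function, pairs sharing the same label contribute nothing; the penalty falls only on similar, nearby pixels that were assigned different labels.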

By evaluating the unary and binary potential function, E ( Y | X ) is obtained; then, P ( Y | X ) and the segmentation result Y are obtained by the conditional random field. Finally, the network parameters are trained by evaluating the losses on the final revised result diagram. The process is shown in

By applying the locally connected conditional random field, it is possible to obtain more accurate segmentation results in the selected areas shown above (the pedestrian, bicycle and motor vehicle areas). To make full use of the continuity between frames, we record the areas where the variation in the segmentation results exceeds a certain threshold value after applying the conditional random field (this part is recorded as a misclassified area, and the threshold values are usually set to 0.5, 0.7, and 0.8). Then, these areas are used as baselines for correction. If these areas are followed by misclassifications in the subsequent frame, the uncorrected result is replaced with the corrected reference result.
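The recording rule just described might be sketched as follows (hypothetical helper names; `thre` corresponds to the threshold values mentioned above):

```python
import numpy as np

def changed_fraction(before, after):
    """Fraction of pixels whose label the CRF changed inside a region."""
    return float(np.mean(before != after))

def record_misclassified(regions, thre=0.5):
    """Keep regions where the CRF altered more than `thre` of the pixels.

    regions: list of (box, labels_before_crf, labels_after_crf) tuples.
    Returns the boxes to reuse as correction references in later frames.
    """
    return [box for box, before, after in regions
            if changed_fraction(before, after) > thre]
```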

The proposed SiameseFC model is used for tracking. The training sample used as input to the tracking model is obtained by cropping around the abovementioned misclassified area [

For the cropped greyscale sample, feature extraction and tracking training are performed using the network structure shown in

In the tracking network, z is the training sample input to the tracking network, and x denotes the pending search area: the area determined by the abovementioned misclassified region, doubled in size, in the segmentation result image. Here, φ represents two different convolutional structures, each designed according to its own environment; these convolution structures are designed to extract image features. The images tracked in this paper are greyscale images of the segmentation results, in which the boundary information of the target objects is well described through the conditional random field process. Architectures with two convolutional layers and pooling layers serve as the structure of φ. This approach substantially improves the tracking speed, and the simple convolution structure still performs well on the greyscale segmentation result map, in which the information is obvious. When defining the loss of the tracking network, the point-by-point loss method in SiameseFC is used. That is, the loss is obtained for each pixel in the final 8 × 8 × 1 feature map. In the map, the misclassified areas correspond to the label 1, and the remaining areas are labelled −1. For each point in this feature map, the loss is obtained as follows:

$l(y, v) = \log\left(1 + \exp(-yv)\right)$ (6)

where $l(y, v)$ is the loss function, v is the output of each point on the 8 × 8 × 1 feature map, and y is the label value of the corresponding point. The final loss $L(y, v)$ is obtained by averaging the loss over all points:

$L(y, v) = \frac{1}{|D|} \sum_{u \in D} l(y[u], v[u])$ (7)

where D is the set of all points in the 8 × 8 × 1 feature map. The tracking network is trained by stochastic gradient descent (SGD) [

$\theta^* = \arg\min_{\theta} L(y, f(z, x; \theta))$ (8)
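Equations (6) and (7) amount to a mean logistic loss over the 8 × 8 × 1 score map; a minimal NumPy sketch of this loss (illustrative, not the authors' training code) is:

```python
import numpy as np

def point_loss(y, v):
    """Logistic loss l(y, v) = log(1 + exp(-y*v)) at a single map position."""
    return np.log1p(np.exp(-y * v))

def map_loss(y_map, v_map):
    """Mean point-wise loss over the 8x8x1 score map (Eq. (7))."""
    return point_loss(y_map, v_map).mean()

# Labels: +1 inside the misclassified area, -1 elsewhere (illustrative map).
y = -np.ones((8, 8))
y[3:5, 3:5] = 1.0
v = np.random.randn(8, 8)   # stand-in for the network scores f(z, x; theta)
loss = map_loss(y, v)       # the scalar that SGD minimises in Eq. (8)
```

Using `log1p` keeps the computation numerically stable for moderately negative `y * v`.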

Finally, in consideration of the uncertain misclassifications and information singularity of the greyscale image of the segmentation result, the tracking network is designed as a form of online learning that works only during the testing stage of the whole model, as shown in

The model training process involves training only the LRCRF field model. The tracking network works only in the testing stage in an online fashion. This is because the prior information of misclassified areas required by the tracking process is missing, and there is not much similarity in far-apart images in different frames. Therefore, the relevant characteristics of the misclassified areas cannot be obtained via pre-training. For any input LRCRF training sample, the first step is to zoom to a fixed size and perform mean-value subtraction for the dataset image; then, the weight coefficient of the convolution kernel in the DeepLab-Resnet 18 network is initialized using the Xavier method [

In the test process, the mean value is subtracted from the input image, which is then sent to the trained LRCRF model. Then, the misclassified areas selected by the abovementioned rules are obtained in the final result image. These areas are cropped from the segmentation result map to obtain a few greyscale samples, which are sent to the improved SiameseFC [

In this experiment, we used a computer equipped with an i5-7790k CPU and a GTX-1060 GPU. The fully connected conditional random field and the mean-field inference processes described in this paper are all iterated 10 times on the GPU. The dataset used in the experiment was acquired from three different source datasets. The first is a small sample dataset containing 106 paved-road scenes. The shooting location is urban roads, the weather conditions are good, and the road environment is relatively simple. This dataset contains 42 categories, including pedestrians, trucks, cars, buses, bicycles, roads, road signs, etc. The images are scaled to 720 × 1080 pixels. The second dataset includes all the images in the CamVid dataset [

The segmentation performance of the proposed model is evaluated by commonly used industry standard metrics, including pixel accuracy, mean pixel accuracy and mean intersection over union [

$PA = \frac{\sum_{i=0}^{k} p_{ii}}{\sum_{i=0}^{k} \sum_{j=0}^{k} p_{ij}}$ (9)

where k is the total number of categories labelled in advance and $p_{ij}$ is the number of pixels whose true category is i and whose predicted category is j. Pixel accuracy describes the percentage of correctly classified pixels among all the pixels in the test data. Mean pixel accuracy is calculated by

$MPA = \frac{1}{k+1} \sum_{i=0}^{k} \frac{p_{ii}}{\sum_{j=0}^{k} p_{ij}}$ (10)

which describes the average of the correct classification rates over the different pixel categories in the test data. Mean intersection over union is calculated as follows:

$MIOU = \frac{1}{k+1} \sum_{i=0}^{k} \frac{p_{ii}}{\sum_{j=0}^{k} p_{ij} + \sum_{j=0}^{k} p_{ji} - p_{ii}}$ (11)

which describes, for each pixel category in the test data, the ratio of correctly classified pixels to the union of the ground-truth and predicted pixels of that category, averaged over all categories.
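Equations (9)-(11) can all be computed directly from a (k + 1) × (k + 1) confusion matrix P, with P[i, j] the number of pixels of true category i predicted as category j; a standard NumPy sketch:

```python
import numpy as np

def segmentation_metrics(P):
    """Pixel accuracy, mean pixel accuracy and mean IoU from confusion matrix P."""
    P = P.astype(float)
    diag = np.diag(P)
    pa = diag.sum() / P.sum()                            # Eq. (9)
    mpa = np.mean(diag / P.sum(axis=1))                  # Eq. (10)
    miou = np.mean(diag / (P.sum(1) + P.sum(0) - diag))  # Eq. (11)
    return pa, mpa, miou
```

A perfect segmentation (a diagonal confusion matrix) yields 1.0 for all three metrics; off-diagonal mass lowers MPA and MIOU for the affected categories.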

We also adopt the region overlap index commonly used in the industry, which measures the intersection-over-union between the area to be tracked and the tracking network's output area and compares it against a pre-set threshold value. When the overlap is larger than the threshold value, the tracking is considered successful.
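The region overlap criterion reduces to a box IoU check; a minimal sketch, with boxes given as (y0, y1, x0, x1) tuples (a convention assumed here, not specified in the paper):

```python
def box_iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (y0, y1, x0, x1)."""
    iy = max(0, min(a[1], b[1]) - max(a[0], b[0]))  # vertical overlap
    ix = max(0, min(a[3], b[3]) - max(a[2], b[2]))  # horizontal overlap
    inter = iy * ix
    area = lambda r: (r[1] - r[0]) * (r[3] - r[2])
    return inter / (area(a) + area(b) - inter)

def tracking_success(pred_box, true_box, thre=0.5):
    """Tracking counts as successful when the overlap exceeds the threshold."""
    return box_iou(pred_box, true_box) > thre
```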

Of the above datasets, the names dataset 1, dataset 2 and dataset 3 are used to denote the small dataset, the CamVid dataset and the CVPR 2018 WAD Video Segmentation Challenge dataset, respectively. To demonstrate the significant improvement in the accuracy of the model described in this paper, we compared it to the segmentation network SegNet, which is commonly used to segment road-traffic scenes in the industry. In this experiment, the basic SegNet network and the LRCRF model described in this paper are pre-trained on the ImageNet dataset. The experimental results are shown in

The model proposed in this paper is applied to the conditional random field for only some special areas. To compare the benefits of applying the conditional random field in those areas, this paper verified the segmentation indicators for certain special areas (e.g., pedestrian, bicycle and automobile) [

| Dataset | Dataset 1 | | | Dataset 2 | | | Dataset 3 | | |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Evaluation index | PA | MPA | MIOU | PA | MPA | MIOU | PA | MPA | MIOU |
| SegNet | 76.5 | 48.8 | 28.7 | 88.6 | 65.9 | 50.2 | 93.4 | 89.5 | 81.3 |
| LRCRF | 80.3 | 50.3 | 29.9 | 94.4 | 68.3 | 52.8 | 98.8 | 93.3 | 86.6 |

| Dataset | Dataset 1 | | Dataset 2 | | Dataset 3 | |
| --- | --- | --- | --- | --- | --- | --- |
| Method | SegNet | LRCRF | SegNet | LRCRF | SegNet | LRCRF |
| Resolution | 720 × 1080 | | 360 × 480 | | 1600 × 1200 | |
| ms/frames | 46.8 | 93.6 | 21.2 | 43.3 | 59.6 | 114.2 |

| Dataset | Dataset 1 | | | | | | Dataset 2 | | | | | |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Evaluation index | PA | | MPA | | MIOU | | PA | | MPA | | MIOU | |
| Areas | F | N | F | N | F | N | F | N | F | N | F | N |
| SegNet | 66.2 | 77.8 | 60.2 | 47.4 | 44.8 | 28.0 | 89.2 | 88.1 | 68.4 | 65.1 | 60.3 | 44.4 |
| LRCRF | 84.5 | 79.7 | 67.7 | 48.6 | 51.2 | 27.6 | 97.2 | 91.8 | 75.3 | 64.3 | 66.4 | 45.0 |

In the second research direction, a fully connected conditional random field model is usually used to refine the segmentation results of the image. However, simply applying the CRF to full scenes may cause inefficiency in the operation process in some areas. To verify that applying the CRF only to special areas, as in the model described in this paper, is better than applying it indiscriminately, the effects of the fully connected conditional random field model and the proposed model are compared. The segmentation accuracy under different basic segmentation network structures is also compared. We used the DeepLab-Resnet 18 structure as the basic segmentation network of DeepLab-Resnet 18-CRF and applied the fully connected conditional random field model to it. The experimental results are shown in

To demonstrate that the model in this paper has a lower time complexity than the traditional fully connected conditional random field model, we report the execution time of the fully connected conditional random field model and the proposed model on different test datasets. The experimental results in

In this paper, the tracking effects of the tracking network are verified using three different datasets. For each test sample in the testing dataset, the segmentation result of the frame preceding the current frame is obtained by LRCRF. Then, the test data are corrected using the tracking reference map. The frame interval between the tracking reference map and the test image is gradually increased, and the refinement effect of the tracking reference map on the test data is validated at the different frame intervals. The accuracy values of the experimental results are shown in

| Dataset | Dataset 1 | | | | | | Dataset 2 | | | | | |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Evaluation index | PA | | MPA | | MIOU | | PA | | MPA | | MIOU | |
| Areas | F | N | F | N | F | N | F | N | F | N | F | N |
| DeepLab v2 | 81.6 | 78.2 | 64.7 | 47.3 | 49.4 | 27.0 | 95.8 | 87.0 | 73.4 | 63.8 | 65.2 | 44.1 |
| DeepLab-Resnet 18-CRF | 85.2 | 81.5 | 68.2 | 50.0 | 51.7 | 28.6 | 97.5 | 92.3 | 75.1 | 65.8 | 67.0 | 46.1 |
| LRCRF | 84.5 | 79.7 | 67.7 | 48.6 | 51.2 | 27.6 | 97.2 | 91.8 | 75.3 | 64.7 | 66.4 | 45.0 |

| Dataset | Dataset 1 | | Dataset 2 | | Dataset 3 | |
| --- | --- | --- | --- | --- | --- | --- |
| Method | DeepLab-Resnet 18-CRF | LRCRF | DeepLab-Resnet 18-CRF | LRCRF | DeepLab-Resnet 18-CRF | LRCRF |
| Resolution | 720 × 1080 | | 360 × 480 | | 1600 × 1200 | |
| ms/frames | 896.4 | 93.6 | 475.2 | 43.3 | 1433.6 | 114.2 |

| Evaluation index | Frames | Dataset 1 | | | Dataset 2 | | | Dataset 3 | | |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| | | PA | MPA | MIOU | PA | MPA | MIOU | PA | MPA | MIOU |
| LRCRF | 0 | 80.3 | 50.7 | 31.2 | 94.4 | 69.1 | 54.8 | 97.8 | 93.3 | 87.6 |
| | 3 | 79.6 | 50.3 | 30.8 | 93.8 | 68.7 | 54.5 | 97.2 | 92.8 | 86.8 |
| | 5 | 78.8 | 49.8 | 30.2 | 92.8 | 68.1 | 53.7 | 96.1 | 92.2 | 85.4 |
| | 7 | 77.3 | 49.1 | 29.1 | 91.2 | 66.8 | 52.2 | 94.4 | 91.4 | 83.8 |
| | 9 | 76.1 | 48.2 | 27.8 | 89.7 | 65.7 | 51.3 | 93.0 | 89.2 | 81.6 |

| Frame interval | 0 | 3 | 5 | 7 | 9 |
| --- | --- | --- | --- | --- | --- |
| Dataset 1 | 93.6 | 65.5 | 60.17 | 57.2 | 53.1 |
| Dataset 2 | 43.3 | 30.0 | 26.8 | 24.3 | 23.1 |
| Dataset 3 | 114.2 | 81.36 | 74.7 | 70.8 | 69.0 |

frame intervals (e.g. 3 frames) between the tracking reference map and test chart.

Finally, the tracking effects are compared when the RGB original image or the segmentation result greyscale image are used as the tracking reference map. The results are shown in

from basic segmentation networks. Eventually, the tracking effect using the RGB original image becomes much lower than the tracking effect when using the greyscale map of the segmentation result.

An improved conditional random field model application and a new region tracking application are proposed in this paper. The proposed method applies the probabilistic graph model to systems with high real-time requirements. The unique features of traffic road scenes are analysed, the conditional random field model is applied only to special areas, and different binary potential functions are selectively used. Using this approach, areas that already have clear boundaries in traffic scenes are ignored, while the conditional random field model is applied to areas with non-smooth or unclear boundaries. This filtering process greatly reduces the time complexity of the system without losing too much precision and further optimizes the operation time of the system. By considering the temporal interdependencies of sequential video images, the tracking technology tracks and modifies the misclassified areas that appear in consecutive frames. The experimental results show that the proposed method achieves state-of-the-art performance.

This research was funded by the Key Research and Development Program of Zhejiang Province (Grants No. 2020C03098).

The authors declare no conflicts of interest regarding the publication of this paper.

Jiang, X., Yu, H.B. and Lv, S.S. (2020) An Image Segmentation Algorithm Based on a Local Region Conditional Random Field Model. Int. J. Communications, Network and System Sciences, 13, 139-159. https://doi.org/10.4236/ijcns.2020.139009