1. Introduction
Stereo matching is a fundamental problem in computer vision that aims to recover the depth information of a 3D scene from a pair of left and right stereo images, and it has been widely used in fields such as robot navigation, autonomous driving, 3D reconstruction, and augmented reality [1]. The core task is to establish correspondences between pixels in the two images, i.e., to find, for each pixel in the left image, its corresponding point in the right image, and to calculate the depth of the scene from the disparity of these matched points. Therefore, the main problems to be solved in stereo matching are the correctness and accuracy of the matched points.
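The relation between disparity and depth mentioned above can be made concrete with a minimal sketch. It assumes a rectified stereo pair and uses the standard pinhole relation Z = f · B / d; the function name and the numeric values are illustrative assumptions, not taken from this paper.

```python
def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Depth Z = f * B / d for a rectified stereo pair.

    disparity_px: horizontal shift of a matched point, in pixels
    focal_px:     focal length in pixels (assumed known from calibration)
    baseline_m:   distance between the two camera centres, in metres
    """
    if disparity_px <= 0:
        # Zero disparity corresponds to a point at infinity.
        return float("inf")
    return focal_px * baseline_m / disparity_px


# Illustrative values only: f = 700 px, B = 0.5 m, d = 35 px.
depth_m = depth_from_disparity(disparity_px=35.0, focal_px=700.0, baseline_m=0.5)
```

Because depth is inversely proportional to disparity, a fixed disparity error produces a larger depth error for distant points, which is one reason matching accuracy matters.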
In traditional methods, stereo matching is summarized into four steps [2]: cost calculation, cost aggregation, disparity calculation, and disparity optimization. In recent years, with the rapid development of deep learning, a growing number of researchers have used deep learning methods to replace these four steps, ultimately forming the popular end-to-end stereo matching networks. Compared with traditional methods, the disparity estimates of these end-to-end networks are greatly improved in ill-posed areas such as weakly textured and discontinuous regions. However, generalization is still the main challenge in applying these networks to real-world scenarios. The generalization ability of a model refers to its ability to adapt to new data after training: the model learns the underlying patterns behind the data, so the trained network can also give appropriate outputs for new data that follow the same patterns. Currently, the common way to achieve generalization is domain generalization based on domain-invariant features. Existing domain generalization methods can be roughly divided into three categories: data manipulation [3], representation learning [4], and policy learning [5]. Some stereo matching networks have already obtained domain-invariant features by performing feature matching. DSMNet [6] designs two trainable neural network layers for domain generalization; by regulating the distribution of the learned representations, the network keeps its features invariant to domain differences. CFNet [7] fuses multiple low-resolution dense cost volumes to expand the receptive field for capturing global representations, guiding the network to learn geometric scene information that is invariant across datasets.
Reference [8] proposes the MS-Net network, which replaces deep learning-based feature extraction with traditional matching functions and confidence measures, shifting the learning process from the color space to the matching space to prevent overfitting to dataset-specific features. The above works transform the input into a domain-invariant feature space, which reduces dependence on dataset-specific features and yields stronger robustness.
Based on the research ideas of the above model, this paper proposes a more widely applicable stereo matching network. In response to the problem of decreasing cross-domain feature consistency, a whitening loss function is introduced during feature extraction. As the loss function decreases, the stereo matching network relies less on matching-unrelated information to form feature representations, thus extending the stereo matching network to real-world scenarios and improving the model’s generalization ability.
This paper is organized as follows: Section 2 discusses related work; Section 3 introduces the TUNet architecture and the improved adaptive stereo matching network ATUNet; Section 4 presents experimental results and analysis; and finally, Section 5 draws conclusions based on the findings.
2. Related Work
2.1. Stereo Matching Networks Based on Deep Learning
In recent years, with the rapid development of Convolutional Neural Networks (CNNs) [9] and the significant improvement in hardware computing power, a growing number of researchers have used deep learning methods to reduce mismatches in the ill-posed areas of stereo matching. Researchers have used CNNs to replace individual steps of traditional binocular stereo matching algorithms, and deep learning-based binocular stereo matching algorithms are accordingly divided into non-end-to-end and end-to-end stereo matching algorithms [10].
Compared with traditional stereo matching algorithms, non-end-to-end stereo matching algorithms can obtain good disparity results in complex scenes, which has greatly promoted the development of stereo matching. However, non-end-to-end algorithms use only local information for cost computation and lack global information, so they still struggle in occluded, low-texture, and repetitive-texture areas. In addition, they rely on a series of cascaded post-processing steps to refine disparities, which complicates training and makes it difficult to optimize the entire stereo matching process directly. Therefore, end-to-end stereo matching algorithms have become a research hotspot in recent years.
The end-to-end stereo matching algorithm inputs a pair of left-right stereo images into a convolutional neural network and, after training, directly outputs the disparity. In 2016, Mayer et al. [11] proposed the first end-to-end stereo matching network, DispNet, which used a convolutional neural network to extract features, computed the correlation between the left and right feature maps, and output disparities at different resolutions through multiple transposed convolution layers. They also contributed a large synthetic dataset called Scene Flow for network training. Du et al. [12] fed foreground segmentation information into the AMNNet network together with the stereo images, improving the generalization performance of the stereo matching network. PSMNet [13] used spatial pyramid pooling and dilated convolution to expand the receptive field, so that global context information is incorporated into the image features. It also stacked 3D hourglass networks that repeatedly process the cost volume from coarse to fine and from fine to coarse to make better use of global information. Cai et al. [8] pointed out that the poor generalization performance of stereo networks is caused by their strong dependence on image appearance and suggested using combinations of matching functions for feature extraction.
The stereo matching algorithm based on the end-to-end deep learning framework consists mainly of four modules: feature extraction, cost volume construction, cost aggregation, and disparity regression, which is consistent with the basic process of traditional stereo matching algorithms. In traditional stereo matching algorithms, manually designed feature descriptors such as SIFT [14] and SURF [15] are usually used for feature extraction. Although these descriptors cannot handle certain difficult regions (such as textureless, overexposed, and repetitive-texture areas), their disparity results are rarely affected by a change of dataset. Therefore, the feature extraction layer in the deep learning framework can be considered a key factor in improving the cross-dataset generalization ability of stereo matching networks. The correlation between the feature channels extracted by this layer captures the style information of images, which has been further explored in style transfer, image-to-image translation, and other fields. Recently, a selective whitening method was proposed in [16] to remove dataset-sensitive style information and thereby reduce the learning of dataset-specific features; there, the style information to remove is selected using manually designed photometric transformations. Inspired by selective whitening, this paper selects information that is sensitive to changes in stereo viewpoint rather than relying only on photometric transformations. This is because the difference between the left and right views in stereo matching is not only photometric but also involves changes in the visible scene.
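To illustrate the cost volume construction and disparity regression modules named in this subsection, the following is a minimal sketch on a single scanline. It assumes a squared-distance matching cost and a soft-argmin regression, omits cost aggregation, and uses hypothetical function names and toy features; it is not the architecture of any network discussed here.

```python
import math


def cost_volume(left_feat, right_feat, max_disp):
    """Cost for each (pixel x, disparity d): squared distance between left[x] and right[x - d]."""
    volume = []
    for x in range(len(left_feat)):
        costs = []
        for d in range(max_disp + 1):
            if x - d < 0:
                costs.append(float("inf"))  # candidate falls outside the right image
            else:
                diff = [a - b for a, b in zip(left_feat[x], right_feat[x - d])]
                costs.append(sum(v * v for v in diff))
        volume.append(costs)
    return volume


def soft_argmin(costs):
    """Disparity regression: softmax over negative costs, then the expected disparity."""
    lowest = min(c for c in costs if math.isfinite(c))
    weights = [math.exp(-(c - lowest)) if math.isfinite(c) else 0.0 for c in costs]
    total = sum(weights)
    return sum(d * w for d, w in enumerate(weights)) / total


# Toy scanline: the left features equal the right features shifted by 2 pixels.
right_feat = [[0.0], [10.0], [20.0], [30.0], [40.0], [50.0]]
left_feat = [[-20.0], [-10.0]] + right_feat[:4]  # left[x] == right[x - 2] for x >= 2
volume = cost_volume(left_feat, right_feat, max_disp=3)
disparities = [soft_argmin(costs) for costs in volume]
```

For pixels with x >= 2 the regressed disparity is very close to 2, the shift built into the toy data. Because the expectation is differentiable, this style of regression can be trained end to end.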
2.2. Factors Affecting the Generalization Ability of Stereo Matching
The key to enhancing the generalization ability of stereo matching networks is to improve their ability to adapt from one dataset to another. Generally speaking, stereo images from different datasets differ significantly in color, contrast, texture, and scene content. As a result, the features that a deep stereo matching network learns from its training dataset may not adapt well to other datasets, ultimately producing erroneous matches when the network estimates disparities on them.
To verify that large image differences across datasets cause erroneous disparity estimation, this paper uses the mainstream PSMNet network model for cross-domain feature visualization. First, PSMNet is trained to convergence on the Scene Flow dataset; then, the outputs of its feature extraction layer on different datasets are visualized and compared at test time. As shown in Figure 1, two sets of stereo image pairs, from the Scene Flow and KITTI 2015 datasets respectively, were selected and fed into the PSMNet network to obtain their feature visualization results.
The output of the feature extraction part of the PSMNet network is a feature tensor of size C × 1/4 H × 1/4 W, where C is the number of feature channels. By comparing the features in the same channel before and after the domain change, the information difference of the features can be observed. To observe this feature change more intuitively, this paper uses the mean method [17] to determine the feature differences of the network across domains. The mean method proceeds as follows: first, calculate the mean of each feature of the network on the first dataset by averaging the network outputs; then, run the trained network forward on the second dataset and record the output of each feature to calculate its mean on the second dataset; finally, compare the two means of each feature. If the mean of a feature on the first dataset differs significantly from that on the second, the feature behaves differently on the two datasets. In each channel, the mean over the pixel dimensions (H, W) is defined by the following formula:
(1)
where h and w index the pixel positions, H represents the height of the pixel dimension, W represents the width of the pixel dimension, and $\epsilon$ is a small constant added to avoid division by zero in the denominator.
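The mean method described above can be sketched as follows. This is a minimal illustration that assumes feature maps are stored as nested lists of shape C × H × W and averages the per-image channel means over a dataset; the function names are hypothetical, and the small constant mentioned in the text is left out of this sketch.

```python
def channel_means(feature_map):
    """feature_map: C channels, each an H x W nested list. Returns one mean per channel."""
    means = []
    for channel in feature_map:
        height = len(channel)
        width = len(channel[0])
        total = sum(sum(row) for row in channel)
        means.append(total / (height * width))
    return means


def dataset_channel_means(feature_maps):
    """Average the per-image channel means over all images drawn from one dataset."""
    per_image = [channel_means(fm) for fm in feature_maps]
    num_channels = len(per_image[0])
    return [sum(m[c] for m in per_image) / len(per_image) for c in range(num_channels)]


# Two toy "images" with C = 2 channels of size 2 x 2.
image_a = [[[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.0], [0.0, 8.0]]]
image_b = [[[3.0, 3.0], [3.0, 3.0]], [[4.0, 4.0], [4.0, 4.0]]]
means_a = channel_means(image_a)
dataset_means = dataset_channel_means([image_a, image_b])
```

Running the same computation on features from two datasets gives two vectors of channel means that can be compared channel by channel.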
In the generalization experiment, we randomly selected the left images from 105 pairs of stereo images in Scene Flow and KITTI 2015 datasets, and then calculated the mean values of the two datasets in the same channel. As shown in Figure 2, the black line represents the mean distribution of channel 1 in Scene Flow dataset, and the blue line represents the mean distribution of channel 1 in KITTI 2015 dataset. The mean value curves show that for a group of images with low sensitivity to color changes, the feature means are relatively close. Conversely, for a group of images with high sensitivity to color changes, the feature means vary greatly. To better explain this, we refer to larger changes as “sensitive changes” and smaller changes as “insensitive changes”. Examples of sensitive and insensitive changes are shown in Figure 3.
Figure 2. Characteristic channel average curve.
3. Method
3.1. Transformer-Based Iterative Update Stereo Matching Network
The framework of the Transformer-based Iterative Update Stereo Matching Network (TUNet) is shown in Figure 4. Through feature transformation, the extracted left and right feature maps are converted into context- and position-aware features that are easier to match. The cost volume is constructed through similarity calculation, and a GRU then iteratively updates the disparity to obtain the disparity estimation result.
3.1.1. Feature Transform Module
During feature extraction, a pair of stereo images Il and Ir is input to two weight-sharing feature extraction networks. The feature extraction network consists of a series of residual layers and downsampling layers that extract left and right features at different resolutions. Then, the attention mechanism of the Transformer [18] is added to aggregate global contextual information through alternating self-attention and cross-attention layers, so that the feature maps processed by the Transformer can produce dense matches in low-texture areas. Meanwhile, relative positional encoding is added to the feature vectors to strengthen the position dependency of the feature maps. Linear Transformers [19] are used to reduce the computational complexity of the alternating self-attention and cross-attention computation.
3.1.2. Disparity Iterative Update Module
The disparity update is performed using a Gated Recurrent Unit (GRU) [20], which is a type of recurrent neural network (RNN) [21] unit used for modeling sequential data. The specific steps are as follows: starting from an initial disparity $d_0$, each iteration produces an update direction $\Delta d$, which is added to the current estimate to give the next one: $d_{k+1} = d_k + \Delta d$. In each iteration, the left feature maps, the correlations, and the current hidden state are input to the GRU, which updates the hidden state and then predicts the new disparity from the updated hidden state.
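The iterative scheme above can be sketched as a simple loop. In this minimal illustration the notation d_k for the estimate at iteration k, the zero initial disparity, and the `toy_update` function are all assumptions; `toy_update` is a hypothetical stand-in for the learned GRU update and only shows how successive updates accumulate.

```python
def iterative_disparity_update(initial_disparity, predict_update, num_iterations):
    """Apply d_{k+1} = d_k + delta_d for a fixed number of iterations."""
    disparity = initial_disparity
    history = [disparity]
    for _ in range(num_iterations):
        delta = predict_update(disparity)  # in the network, this comes from the GRU
        disparity = disparity + delta
        history.append(disparity)
    return disparity, history


def toy_update(disparity, target=12.0):
    # Hypothetical stand-in for the GRU: move halfway toward a fixed target.
    return 0.5 * (target - disparity)


final, history = iterative_disparity_update(0.0, toy_update, num_iterations=4)
```

Each pass refines the previous estimate instead of predicting the disparity from scratch, so the number of iterations trades accuracy against runtime.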
Figure 4. Network architecture of TUNet algorithm.
3.2. Improved Adaptive Recurrent Iterative Update Stereo Matching Network
Building on the TUNet stereo matching network in Section 3.1, this paper introduces an adaptive recurrent iterative update stereo matching network, ATUNet, by incorporating a whitening loss module into the feature extraction module. By suppressing feature components that are inconsistent across domains, the model’s generalization performance is improved.
3.2.1. Whitening Loss Module
Stereo matching networks typically use Batch Normalization (BN) [22] to normalize features. During training, BN normalizes features with batch-wise statistics, while during inference it uses statistics accumulated over the training dataset. This makes stereo matching networks over-reliant on the training dataset and more sensitive to dataset shifts. To extend feature consistency across different datasets, Instance Normalization (IN) [23] layers are used to replace some BN layers. Unlike a BN layer, an IN layer normalizes each channel of each sample over its spatial dimensions, thus avoiding any dependence on batch or dataset statistics. For each sample
$x$, the IN layer normalization process is as follows:
$$\hat{x}_{c} = \frac{x_{c} - \mu_{c}}{\sqrt{\sigma_{c}^{2} + \epsilon}} \qquad (2)$$
In the equation above, $\mu_{c}$ and $\sigma_{c}^{2}$ represent the mean and variance of channel $c$ of the sample, respectively, $c$ represents the index of the feature channel, and $\epsilon$ is a small constant for numerical stability. Although the IN layer normalizes each channel individually, it does not consider the correlation between different channels. To further improve the consistency of feature representation, the whitening loss module can remove the redundancy between features by suppressing the feature covariance components that are sensitive to changes in color and other factors in the dataset, as shown in Figure 5.
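As a concrete illustration of instance normalization, the sketch below normalizes each channel of a single sample with its own mean and variance. It assumes each channel is stored as a flat list of H·W values and uses a stability constant `eps` whose value is an assumption; it is a plain-Python illustration, not the layer used in the network.

```python
import math


def instance_norm(sample, eps=1e-5):
    """sample: C channels, each a flat list of H*W values from one image."""
    normalized = []
    for channel in sample:
        n = len(channel)
        mean = sum(channel) / n
        var = sum((v - mean) ** 2 for v in channel) / n
        scale = math.sqrt(var + eps)  # eps keeps constant channels from dividing by zero
        normalized.append([(v - mean) / scale for v in channel])
    return normalized


# One sample with two channels; the second channel is constant.
sample = [[1.0, 2.0, 3.0, 4.0], [10.0, 10.0, 10.0, 10.0]]
out = instance_norm(sample)
```

The statistics come only from the sample itself, so the result does not depend on other images in the batch or on the training set. Each channel is treated independently, which is why correlations between channels remain untouched.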
Firstly, feature extraction is performed, and then the extracted features are subjected to the following computation:
Step 1: compute the feature vector covariance matrix
:
(3)
Step 2: calculate the feature covariance matrix
between the left image feature vector variance
and its corresponding right image feature vector variance
:
(4)
where covariance matrix
between the i-th and j-th channels represents the sensitivity to viewpoint changes. If the covariance elements between the left and right features have high variances, these elements are considered to be components that are sensitive to viewpoint changes, that is, the correlation between the two features is high. Therefore, these covariance elements should be considered in the whitening loss. To obtain these values, the k-means [24] method can be used to cluster the covariance matrix
and calculate the selective mask.
Step 3: compute the selective mask
:
(5)
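The three steps above can be sketched in plain Python. This is only an illustration under stated assumptions: features are flat lists per channel, the variance in Step 2 is the element-wise variance of the left and right covariance matrices around their average, k-means uses two clusters, and the mask keeps strictly upper-triangular elements. All function names are hypothetical, and the exact formulation in this paper may differ.

```python
def channel_covariance(features):
    """Step 1. features: C channels, each a flat list of H*W values. Returns a C x C matrix."""
    num_channels = len(features)
    num_pixels = len(features[0])
    means = [sum(ch) / num_pixels for ch in features]
    centered = [[v - m for v in ch] for ch, m in zip(features, means)]
    return [[sum(a * b for a, b in zip(centered[i], centered[j])) / num_pixels
             for j in range(num_channels)] for i in range(num_channels)]


def elementwise_variance(cov_left, cov_right):
    """Step 2. Variance of each covariance element across the left and right views."""
    size = len(cov_left)
    variance = [[0.0] * size for _ in range(size)]
    for i in range(size):
        for j in range(size):
            mean = 0.5 * (cov_left[i][j] + cov_right[i][j])
            variance[i][j] = 0.5 * ((cov_left[i][j] - mean) ** 2
                                    + (cov_right[i][j] - mean) ** 2)
    return variance


def two_means_threshold(values, iterations=20):
    """1-D k-means with k = 2; returns the midpoint between the two cluster centres."""
    low, high = min(values), max(values)
    for _ in range(iterations):
        threshold = 0.5 * (low + high)
        low_group = [v for v in values if v <= threshold]
        high_group = [v for v in values if v > threshold]
        if not high_group:  # all values identical: nothing to separate
            break
        low = sum(low_group) / len(low_group)
        high = sum(high_group) / len(high_group)
    return 0.5 * (low + high)


def selective_mask(variance):
    """Step 3. Mask is 1 for strictly upper-triangular elements in the high-variance cluster."""
    size = len(variance)
    upper = [variance[i][j] for i in range(size) for j in range(i + 1, size)]
    threshold = two_means_threshold(upper)
    return [[1 if j > i and variance[i][j] > threshold else 0 for j in range(size)]
            for i in range(size)]
```

Elements selected by the mask are the channel pairs whose covariance changes most between the two views, which the text treats as sensitive to viewpoint changes.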
3.2.2. Whitening Loss Function
Compute whitening loss on the left image feature vector variance:
(6)
where
is an upper triangular matrix, Γ represents the number of layers for loss calculation, and γ represents the corresponding intermediate layer.
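One way to read the whitening loss is as the L1 norm of the masked covariance, averaged over the selected layers. The sketch below assumes exactly that form, with one covariance matrix and one binary upper-triangular mask per layer; the function name and the averaging over layers are assumptions, not a statement of the paper's exact equation.

```python
def whitening_loss(covariances, masks):
    """Mean over layers of the L1 norm of the element-wise masked covariance.

    covariances, masks: lists with one C x C matrix per selected layer.
    """
    layer_losses = []
    for cov, mask in zip(covariances, masks):
        size = len(cov)
        layer_losses.append(sum(abs(cov[i][j]) * mask[i][j]
                                for i in range(size) for j in range(size)))
    return sum(layer_losses) / len(layer_losses)


# Two toy layers with C = 2; the mask selects only the off-diagonal element (0, 1).
covs = [[[1.0, -0.5], [-0.5, 2.0]], [[1.0, 0.3], [0.3, 1.0]]]
masks = [[[0, 1], [0, 0]], [[0, 1], [0, 0]]]
loss = whitening_loss(covs, masks)
```

Minimizing this quantity pushes the selected covariance elements toward zero while leaving the unmasked elements free.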
Finally, the loss function of the stereo matching network with the introduced whitening loss is calculated as follows:
(7)
where
is the disparity loss function, which is calculated using the smooth L1 loss, as shown in Equation (8):
(8)
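For reference, the smooth L1 loss and its combination with a whitening term can be sketched as follows. The sketch uses the common smooth L1 definition with a transition point of 1, averages over pixels with valid ground truth, and introduces `whitening_weight` as a hypothetical balancing factor; none of these specifics are stated in the text above.

```python
def smooth_l1(error):
    """Quadratic for small errors, linear for large ones (transition at |error| = 1)."""
    a = abs(error)
    return 0.5 * a * a if a < 1.0 else a - 0.5


def disparity_loss(predicted, ground_truth):
    """Mean smooth L1 loss over pixels with valid ground truth (None marks missing values)."""
    terms = [smooth_l1(p - g) for p, g in zip(predicted, ground_truth) if g is not None]
    return sum(terms) / len(terms)


def total_loss(predicted, ground_truth, whitening_term, whitening_weight=1.0):
    # whitening_weight is a hypothetical balancing factor, not a value from the paper.
    return disparity_loss(predicted, ground_truth) + whitening_weight * whitening_term


d_loss = disparity_loss([1.0, 5.0, 3.0], [1.5, 3.0, None])
combined = total_loss([1.0, 5.0, 3.0], [1.5, 3.0, None],
                      whitening_term=0.4, whitening_weight=0.5)
```

The linear branch limits the influence of large disparity errors, which makes the loss less sensitive to outliers than a squared error.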
By introducing whitening loss, the stereo matching network can not only reduce its dependence on irrelevant information but also further improve the consistency and generalization of its feature representation. Since the differences between left and right stereo images are usually limited to specific physical features, such as diffuse reflection of light, the network model can learn these generalized physical features from limited training data. This enables the network to better adapt and perform when facing new datasets and scenes, thereby improving its reliability and stability in practical applications. In addition, the introduction of whitening loss can also help the network learn more discriminative features, further enhancing its matching performance and accuracy.
4. Experiments
In this experiment, the proposed stereo matching network model was trained only on the Scene Flow dataset and then tested on the KITTI 2015, Middlebury, and ETH3D datasets to evaluate its cross-dataset generalization ability. The network was implemented in PyTorch and run on an NVIDIA RTX A6000 GPU with 48 GB of memory, and it was trained with a batch size of 8 and the Adam optimizer (
,
). Prior to training, the input images were randomly cropped to 512 × 256. Finally, the network was trained for 15 epochs on the Scene Flow dataset with a learning rate of 0.001.
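The random cropping step can be sketched as below. The sketch assumes that 512 × 256 means width × height, that images are stored as H × W nested lists, and that the same window must be cut from the left image, the right image, and the disparity map so that they stay aligned; the function name is hypothetical.

```python
import random


def random_crop_triplet(left, right, disparity, crop_w=512, crop_h=256, rng=random):
    """Crop the same window from the left image, right image, and disparity map.

    Each input is an H x W nested list; H >= crop_h and W >= crop_w are assumed.
    """
    height, width = len(left), len(left[0])
    top = rng.randint(0, height - crop_h)   # randint is inclusive at both ends
    x0 = rng.randint(0, width - crop_w)

    def crop(image):
        return [row[x0:x0 + crop_w] for row in image[top:top + crop_h]]

    return crop(left), crop(right), crop(disparity)
```

Using one window for all three inputs matters because disparity is defined relative to pixel columns: cropping the two views with different horizontal offsets would change the disparity values.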
4.1. Datasets
4.1.1. Scene Flow
The Scene Flow dataset is a large-scale synthetic dataset made up of the FlyingThings3D, Driving, and Monkaa subsets. It contains 35,454 training and 4,370 test stereo image pairs with a resolution of 960 × 540 pixels, together with dense ground-truth disparity maps. Because the images are rendered, accurate ground truth is available for every pixel, which makes the dataset widely used for training stereo matching networks.
4.1.2. KITTI 2015
The KITTI 2015 dataset contains stereo images of real-world driving scenes, each captured by a stereo camera setup comprising left and right cameras. The dataset includes 200 training scenes and 200 test scenes, with high-resolution images and accurate depth and optical flow information collected by a system of sensors such as laser scanners and cameras. The images cover various scenes, including city streets, highways, and rural roads, and exhibit diverse movements and shape changes of objects such as vehicles, pedestrians, and buildings.
4.1.3. Middlebury
The Middlebury dataset provides image sequences of various resolutions, including Full, Half, and Quarter resolutions, which can be used to test and evaluate algorithms of different accuracy. The images in the dataset cover various scenes, including indoor, outdoor, natural, and artificial scenes, where objects exhibit diverse features such as shape, size, motion, and color. Additionally, this dataset also provides multiple evaluation metrics, such as flow and disparity error, flow and disparity visualization, which can be used to assess the accuracy and performance of algorithms.
4.1.4. ETH3D
The ETH3D dataset includes multiple sets of image sequences captured by multiple cameras, including 27 stereo image pairs for training and 20 stereo image pairs for testing. Each sequence contains complete camera intrinsic and extrinsic parameters and highly accurate 3D point cloud information. Additionally, the dataset provides depth maps, surface normal maps, and surface texture maps in various formats, which can be used to test and compare different 3D reconstruction algorithms.
4.2. Feature Generalization Analysis
To verify the generalization ability of the model, this paper measures feature similarity by comparing the means of the same feature channel extracted by the model on different datasets. The formula for this is:
(9)
where
represents the response difference of the mean, and
and
represent the feature means of two different datasets, respectively.
This paper randomly selected 100 images from the Scene Flow and KITTI 2015 datasets and visualized the response differences of different methods. The results are shown in Figure 6, where the horizontal axis represents the index of the 32 feature channels, and the vertical axis represents the amplitude of the response difference. The smaller the amplitude of the response difference, the closer the means of the information extracted by the feature extraction module on the two datasets.
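A minimal sketch of the response difference follows. It assumes the difference is the signed per-channel gap between the feature means of the two datasets, which is one plausible reading of the definition above; the function names and the toy values are hypothetical.

```python
def response_difference(means_a, means_b):
    """Per-channel signed difference between the feature means of two datasets."""
    return [a - b for a, b in zip(means_a, means_b)]


def max_fluctuation(differences):
    """Largest absolute response difference over all channels."""
    return max(abs(d) for d in differences)


# Toy means for three channels on two datasets (illustrative values only).
diffs = response_difference([1.0, 2.0, 0.5], [0.75, 2.5, 0.5])
peak = max_fluctuation(diffs)
```

Plotting `diffs` against the channel index gives a curve of the kind described for Figure 6, and `peak` summarizes how far the curve strays from zero.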
From Figure 6, it can be seen that different stereo matching models have significant fluctuations in response differences across different datasets. Among them, the stereo matching model with the addition of the whitening loss module has response differences that fluctuate up and down by no more than 0.5 across datasets, and its fluctuation curve is relatively smoother compared to the currently popular PSMNet and the Iterative Stereo Matching Network.
4.3. Contrast Experiment
To evaluate the effectiveness of the whitening loss module, three methods for improving generalization ability, namely instance normalization, domain normalization, and the whitening loss module, were each added to the model in Section 3.1 for experimental comparison. The threshold error rate, i.e., the percentage of pixels whose disparity error exceeds a threshold, was used as the evaluation metric, with a threshold of 3 px for the KITTI 2015 dataset and 2 px for the Middlebury dataset. As shown in Table 1, ATUNet with the added whitening loss module achieved a 35.05% improvement in accuracy at 3 px on the KITTI 2015 dataset and a 14.6% improvement in accuracy at 2 px on the Middlebury dataset compared to the original TUNet model. Compared with the other two methods for improving generalization ability, introducing the whitening loss module into the original model helps the model generalize better to other datasets.
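The threshold error rate can be computed as in the sketch below. It assumes that pixels without ground truth are marked with `None` and excluded, and it applies only the absolute threshold; the official KITTI 2015 benchmark metric additionally uses a relative-error condition, which this sketch leaves out. The function name is hypothetical.

```python
def threshold_error_rate(predicted, ground_truth, threshold_px):
    """Percentage of valid pixels whose absolute disparity error exceeds threshold_px."""
    valid = [(p, g) for p, g in zip(predicted, ground_truth) if g is not None]
    bad = sum(1 for p, g in valid if abs(p - g) > threshold_px)
    return 100.0 * bad / len(valid)


# Toy example: five pixels, one of them without ground truth.
pred = [10.0, 20.0, 35.0, 50.0, 7.0]
gt = [10.5, 24.0, 30.0, None, 7.0]
rate_3px = threshold_error_rate(pred, gt, threshold_px=3.0)
```

Here two of the four valid pixels have an error above 3 px, so the rate is 50%. Lower values are better for this metric.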
To further validate the proposed approach, this paper compared the adaptive recurrent iterative update stereo matching network ATUNet with cross-dataset stereo matching networks and other state-of-the-art end-to-end stereo matching networks on three real datasets. Among the compared models, ATUNet achieved leading performance relative to other Scene Flow pre-trained stereo matching networks and traditional stereo matching algorithms. As shown in Table 2, the 2 px error rate reached 18.1 on the Middlebury dataset, a relative improvement of 30.06% over PSMNet and 15.02% over DSMNet, the most advanced cross-dataset invariant stereo matching network. On the KITTI 2015 dataset, the 3 px error rate reached 6.3, a relative improvement of 68.18% over PSMNet and 3.07% over DSMNet. Figure 7 and Figure 8 show the disparity visualization results of ATUNet on the Middlebury and KITTI 2015 datasets.
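The percentage comparisons in this section read most naturally as relative reductions in error rate. Assuming that interpretation, the arithmetic is sketched below; the function name and the example numbers are hypothetical and are not values from the tables.

```python
def relative_reduction(baseline_error, new_error):
    """Relative reduction of an error rate, in percent of the baseline."""
    return 100.0 * (baseline_error - new_error) / baseline_error


# Illustrative numbers only: an error rate falling from 20.0 to 15.0.
improvement = relative_reduction(baseline_error=20.0, new_error=15.0)
```

A relative reduction depends on the baseline, so the same absolute gain in error rate yields a larger percentage when the baseline error is already small.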
4.4. Experimental Test on ETH3D
Table 2. KITTI 2015 and Middlebury generalization ability.
Figure 7. Visualization effect on Middlebury dataset.
Figure 8. Visualization effect on KITTI 2015 dataset.
In this section, the effectiveness of the proposed method was evaluated on the ETH3D stereo matching dataset, and ATUNet was compared with various traditional and deep learning stereo matching methods. The model was trained only on the Scene Flow dataset and then tested on the test set provided by the ETH3D dataset. The ETH3D visualization results are shown in Figure 9, and the evaluation results are shown in Table 3. ATUNet performed best on the ETH3D dataset: the proportion of pixels whose disparity error exceeds 0.5 was 6.23%, the proportion of pixels whose error exceeds 1.0 was 2.32%, and the average absolute error was 0.16. Compared with the widely used GWCNet [25] model, ATUNet improved these three metrics by 47.5%, 36.6%, and 44.8%, respectively. Table 3 also shows that the proposed ATUNet model with the added whitening loss module generalizes better than the TUNet model, improving the same three metrics by 15.0%, 7.6%, and 11.1%, respectively.
Figure 9. ETH3D training dataset effect diagram.
5. Conclusion
This paper proposes a stereo matching model with wider adaptability, which incorporates a whitening loss module during feature extraction to improve the model’s generalization ability by constraining the variation of sensitive pixels in the feature domain. Experimental results show that the improved network model has good cross-dataset adaptability and can better transfer the training results to other datasets through transfer learning. The proposed method is compared with several existing stereo matching algorithms on multiple datasets and effectively reduces the error matching rate while exhibiting a certain level of robustness.
Acknowledgements
This work was supported by the Natural Science Foundation of China Youth Fund (No. 62001272) and the Shandong Provincial Natural Science Fund (No. ZR2019BF022).