Semantic Constraint Based Unsupervised Domain Adaptation for Cardiac Segmentation

The segmentation of unlabeled medical images is difficult due to the high cost of annotation, and unsupervised domain adaptation is one solution to this problem. In this paper, an improved unsupervised domain adaptation method is proposed. The proposed method considers both global alignment and category-wise alignment. First, we align the appearance of the two domains by image transformation. Second, we align the output maps of the two domains in a global way. Then, we decompose the semantic prediction map by category, aligning the prediction maps in a category-wise manner. Finally, we evaluate the proposed method on the 2017 Multi-Modality Whole Heart Segmentation Challenge dataset and obtain a dice similarity coefficient of 82.1 and an average symmetric surface distance of 4.6, demonstrating the effectiveness of combining global alignment and category-wise alignment.


Introduction
Medical image segmentation is a basic task of intelligent medical diagnosis, which aims at extracting target regions such as organs, tissues or lesions from medical images. In recent years, deep learning has developed rapidly in the field of medical image segmentation [1] [2] [3], but some problems remain to be solved. On the one hand, deep learning needs sufficient annotated data, but the annotation of medical images is highly costly. On the other hand, deep learning assumes that the test data and training data are independent and identically distributed, while the distributions of medical image modalities vary largely, as shown in Figure 1. Therefore, segmentation for a modality with few annotations is difficult.
Domain adaptation is a commonly used method for this problem. It aims at transferring the knowledge of labeled data to few-labeled or unlabeled data, helping to promote their task performance [4]. In domain adaptation, the labeled data are called source domain data, and the few-labeled or unlabeled data are called target domain data. When there are no labeled data in the target domain, the setting is called unsupervised domain adaptation. In this paper, we focus on unsupervised domain adaptation.
Aligning the distributions of source and target domain data is a common strategy for unsupervised domain adaptation. When the distributions are aligned, the two domains can share the same model. The ways of aligning distributions can be divided into two categories: global alignment and category-wise alignment.
The global alignment aligns the marginal distributions of two domains and has been implemented in different spaces.
For example, some unsupervised domain adaptation works implement global alignment in the input image space [5] [6] [7] [8] [9], regarding each input image as a whole sample. By aligning the distributions of input images, the appearance gap of two domains can be narrowed.
Some other works implement global alignment in the feature space [10] [11] [12] [13], taking each feature map as a sample. Once the features of two domains follow the same distribution, they can share one classifier.
In addition, some works implement global alignment in the output space, taking every output map as a sample. The alignment of output maps provides a low-computation alternative to feature alignment, and has been widely used in unsupervised domain adaptation segmentation [14] [15] [16].
Global distribution alignment can effectively align the marginal distributions of the data, but it does not consider the category information within each sample, which may cause misalignment between categories. Therefore, some works additionally consider category-wise alignment, to further regularize the segmentation results of each category. Currently, category-wise alignment in unsupervised domain adaptation segmentation is mainly implemented at the feature level. For example, the work of [26] first assigns a class to each feature vector using the segmentation prediction, then puts the features of the same category into a discriminator to align the segmentation results of that category; in the reverse order, Menta et al. [27] first put the whole feature map into a discriminator, then assign classes to the output map of the discriminator; Zhang et al. [28] calculate each category's center and align the centers of the two domains. In the field of medical image segmentation, category-wise alignment has not yet been explored.
Based on the above works, we propose an improved unsupervised domain adaptation model which combines global alignment and category-wise alignment, and apply it to cross-modality cardiac segmentation. The contributions of the proposed method are as follows: 1) both global distribution alignment and category-wise alignment are introduced to medical image segmentation; 2) category-wise distribution alignment is implemented in the semantic prediction space rather than the feature space.
The organization of the rest of the paper is as follows: in Section 2, we introduce the related works; in Section 3, we illustrate the proposed method; in Section 4, we present and analyze the experimental results; in Section 5, we summarize the whole work.

Generative Adversarial Networks
Generative adversarial networks (GAN) [29] are generative models used to generate data subject to the same distribution as the given data, and are made up of one generator and one discriminator. The generator tries to generate data that look realistic, and the discriminator tries to distinguish between the true and the generated data, so the two modules form an adversarial relationship. By competing with each other, the two modules mutually improve, finally enabling the generator to generate ideal data. This competition process is called adversarial learning. Because GANs require no labels, many unsupervised domain adaptation works adopt adversarial learning to align distributions.
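To make the adversarial relationship concrete, the two competing losses can be sketched numerically. This is a minimal illustration in NumPy; the function names `discriminator_loss` and `generator_loss` are our own, not from the paper:

```python
import numpy as np

def discriminator_loss(d_real, d_fake):
    """Binary cross-entropy the discriminator minimizes: it should
    output values near 1 on real data and near 0 on generated data."""
    d_real = np.clip(d_real, 1e-7, 1 - 1e-7)
    d_fake = np.clip(d_fake, 1e-7, 1 - 1e-7)
    return -np.mean(np.log(d_real)) - np.mean(np.log(1.0 - d_fake))

def generator_loss(d_fake):
    """Non-saturating generator loss: the generator tries to make
    the discriminator output values near 1 on generated data."""
    d_fake = np.clip(d_fake, 1e-7, 1 - 1e-7)
    return -np.mean(np.log(d_fake))

# A confident, correct discriminator has a low loss...
good_d = discriminator_loss(np.array([0.9, 0.95]), np.array([0.05, 0.1]))
# ...while a fooled discriminator (the generator winning) has a high one.
fooled_d = discriminator_loss(np.array([0.5, 0.5]), np.array([0.5, 0.5]))
```

Alternating gradient steps on these two losses is the adversarial learning referred to above.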

SIFA
Synergistic Image and Feature Adaptation (SIFA) [20] is an unsupervised domain adaptation method which creatively proposes the synergistic alignments of image and feature and achieves great performance in cross-modality medical image segmentation.
SIFA first uses CycleGAN to narrow the appearance gap between the two domains for image adaptation. Then, because CycleGAN and the segmentation network share the same encoder, the model has two output spaces; SIFA further aligns the outputs of these two spaces for feature adaptation.
Since the CycleGAN and the segmentation network share the same encoder, the image adaptation and feature adaptation mutually affect each other during training, promoting the synergistic adaptation of image and feature.
In this paper, we adopt the synergistic adaptation strategy of SIFA for global alignment.

Proposed Method
Our proposed method considers both global distribution alignment and category-wise alignment for unsupervised domain adaptation. Figure 2 shows an overview of the proposed method. For global distribution alignment, we use the strategy proposed by SIFA [20]; for category-wise alignment, we introduce a new module to the semantic prediction space. The introduction of the proposed method is divided into five sections: image modality transformation, segmentation network, global alignment in image generating space, global alignment in semantic prediction space and category-wise alignment in semantic prediction space.
In Figure 2, the blue arrows represent the source domain data flow and the red arrows represent the target domain data flow. The fill color of each rectangle represents the modality of the images, where blue represents the source domain modality and red represents the target domain modality. In addition, the color of the text also indicates the origin of the data: blue means the data come from the source domain data x_s, and red means the data come from the target domain data x_t.

Image Modality Transformation
A large distribution difference exists between cross-modality medical images. If we directly apply the model trained by the source images to the target images, the task performance would be poor. In this section, we use CycleGAN to reduce the appearance gap between two domains.
First, we use a generative adversarial network {G_t, D_t} to transform the modality of the source images to that of the target images. The generator G_t aims to transform the source images x_s into target-like images x_{s→t}, and the discriminator D_t tries to distinguish between the generated images x_{s→t} and the real target images x_t, so G_t and D_t form a mutually competing relationship. The corresponding objective function of their adversarial learning is:

L_adv^t(G_t, D_t) = E_{x_t}[log D_t(x_t)] + E_{x_s}[log(1 − D_t(G_t(x_s)))]

where G_t aims to minimize the objective and D_t aims to maximize it. By the adversarial learning of G_t and D_t, the source images are transformed into target-like images.
By imposing the following cycle consistency constraint on the reconstructed images, the generators tend to generate structure-invariant images:

L_cyc = E_{x_s}[ || G_s(G_t(x_s)) − x_s ||_1 ] + E_{x_t}[ || G_t(G_s(x_t)) − x_t ||_1 ]

where G_s denotes the reverse generator that transforms target-modality images back to the source modality (implemented by the encoder E and decoder U in Figure 2). Figure 3 shows the image transformed by the generator G_t: the left is the transformed image x_{s→t}, and the right is the target image x_t. It can be seen that the generated image has the same style as the target image and the same structure as the source image.
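The cycle consistency constraint can be sketched as an L1 reconstruction penalty over both cycles. The sketch below uses toy intensity-shift "generators" for illustration only; none of these names come from the paper:

```python
import numpy as np

def l1(a, b):
    """Mean absolute error between two images."""
    return np.mean(np.abs(a - b))

def cycle_consistency_loss(x_s, x_t, g_t, g_s):
    """L1 penalty forcing source -> target -> source (and the reverse
    cycle) to reproduce the original image, which encourages the
    generators to preserve anatomical structure."""
    return l1(g_s(g_t(x_s)), x_s) + l1(g_t(g_s(x_t)), x_t)

# Toy 'generators' that shift intensity between two modalities:
# an exactly inverse pair yields (near-)zero cycle loss.
g_t = lambda x: x + 0.5   # source -> target-like
g_s = lambda x: x - 0.5   # target -> source-like
x_s = np.random.rand(4, 4)
x_t = np.random.rand(4, 4)
loss = cycle_consistency_loss(x_s, x_t, g_t, g_s)
```

When the two generators fail to invert each other, the loss grows, which is exactly the signal that keeps the translated images structure-consistent.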

Segmentation Network
In Section 3.1, the generator G_t transforms the source images x_s into target-modality images x_{s→t}. Thus the transformed images x_{s→t} and the target images x_t are both of the target modality, and they can share a common segmentation network.
As the encoder E learns the features of x_{s→t} and x_t, we introduce a pixel-wise classifier C after the encoder E, forming a segmentation network C∘E. The training of the segmentation network is supervised by the transformed images x_{s→t} and their labels: since x_{s→t} and x_s have the same structure, they share the same label y_s. Therefore, the objective function of the segmentation network is:

L_seg(E, C) = H(ŷ_{s→t}, y_s) + α · Dice(ŷ_{s→t}, y_s)

where ŷ_{s→t} is the semantic prediction of x_{s→t}, y_s is the one-hot label of x_{s→t}, H is the cross-entropy term, Dice is the dice loss term, and α is a hyperparameter used to balance the two. In the experiments, α is set to 1.
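The combined cross-entropy plus dice objective can be written compactly. A minimal NumPy sketch, assuming softmax predictions of shape (K, H, W); the helper names are ours:

```python
import numpy as np

def cross_entropy(pred, onehot, eps=1e-7):
    """Pixel-wise cross-entropy, averaged over pixels.
    pred, onehot: arrays of shape (K, H, W); pred sums to 1 over K."""
    return -np.mean(np.sum(onehot * np.log(pred + eps), axis=0))

def dice_loss(pred, onehot, eps=1e-7):
    """Soft dice loss: 1 minus the mean per-class dice overlap."""
    inter = np.sum(pred * onehot, axis=(1, 2))
    denom = np.sum(pred, axis=(1, 2)) + np.sum(onehot, axis=(1, 2))
    return 1.0 - np.mean(2.0 * inter / (denom + eps))

def seg_loss(pred, onehot, alpha=1.0):
    """Combined objective: cross-entropy plus alpha * dice loss."""
    return cross_entropy(pred, onehot) + alpha * dice_loss(pred, onehot)

# A perfect one-hot prediction gives a near-zero loss.
onehot = np.zeros((2, 2, 2))
onehot[0] = np.array([[1.0, 0.0], [0.0, 1.0]])
onehot[1] = 1 - onehot[0]
perfect = seg_loss(onehot, onehot)
```

The dice term compensates for class imbalance (small substructures contribute as much as large ones), which is why it is commonly paired with cross-entropy in medical segmentation.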

Global Alignment in Image Generating Space
Image modality transformation alleviates the domain shift between the two domains, but when the domain shift is severe, image adaptation alone may not be enough to achieve ideal adaptation performance. In this section, we further align the distributions of features.
Due to the high dimension of the feature space, aligning features directly in the feature space takes a lot of computation, so we instead align the outputs in low-dimensional output spaces. As shown in Figure 2, the proposed domain adaptation framework has two output spaces. We call the output space of U∘E the image generating space, and the output space of C∘E the semantic prediction space, and we implement distribution alignment in both of them. In this section, we first introduce the alignment in the image generating space.
The blue dotted box in Figure 2 corresponds to this alignment. A discriminator D_s is introduced to the image generating space to distinguish the source of its input images. The corresponding objective function is:

L_adv^u(E, U, D_s) = E_{x_{s→t}}[log D_s(U(E(x_{s→t})))] + E_{x_t}[log(1 − D_s(U(E(x_t))))]

where E and U aim to minimize the objective function, and D_s aims to maximize it.

Global Alignment in Semantic Prediction Space
In this section, we introduce the alignment in the semantic prediction space, corresponding to the orange dotted box in Figure 2. Similar to the alignment in the image generating space, a discriminator D_p is introduced to the semantic prediction space to distinguish the source of its input data. Forwarding the predictions of the two domains to D_p, the objective function of the adversarial learning is:

L_adv^p(E, C, D_p) = E_{x_{s→t}}[log D_p(C(E(x_{s→t})))] + E_{x_t}[log(1 − D_p(C(E(x_t))))]

Considering that the output space is far away from the shallow layers of the network, the gradient of adversarial learning may not effectively back-propagate to the shallow features. Therefore, an auxiliary pixel-wise classifier C_a is further introduced after the second-last feature layer, producing an additional output space which we call the auxiliary semantic prediction space. Similarly, a discriminator D_pa is introduced to this space. The objective function of the auxiliary semantic prediction space alignment is:

L_adv^pa(E, C_a, D_pa) = E_{x_{s→t}}[log D_pa(C_a(E(x_{s→t})))] + E_{x_t}[log(1 − D_pa(C_a(E(x_t))))]

The segmentation loss function of the auxiliary semantic prediction network C_a∘E is:

L_seg^a(E, C_a) = H(ŷ_{s→t}^a, y_s) + α · Dice(ŷ_{s→t}^a, y_s)

Different from the main semantic prediction objectives, E(·) in the two auxiliary objectives denotes the second-last layer features.

Category-Wise Alignment in Semantic Prediction Space
The image modality transformation and output space alignment both treat every input or output image as a whole sample, aligning the distributions from a global perspective, without considering the multiple categories within each image. In this section, we introduce category-wise alignment to the proposed framework, further optimizing the alignment between categories.
The proposed category-wise alignment corresponds to the green dotted box in Figure 2. Aligning the features of the same category is a commonly used method for category-wise alignment. The disadvantages of this kind of method are that it requires a large amount of computation due to the high dimension of the features, and that the category of each feature vector must be inferred from the segmentation results, which is to some extent inconvenient. Intuitively, we can instead implement category-wise alignment directly in the semantic prediction space. On the one hand, for medical images, each foreground category corresponds to a structure of the human body, whose segmentation results should share considerable shape and position consistency between domains. On the other hand, the segmentation result directly gives the probability of each pixel belonging to each category. Meanwhile, the dimension of the semantic prediction space is much lower than that of the feature space, saving a considerable amount of computation. Based on the above points, in this section we align the category-wise distributions in the semantic prediction space.
Specifically, we decompose the semantic prediction map by category and introduce a discriminator D_p^k for each category k. The corresponding objective function is:

L_adv^pk(E, C, D_p^k) = E_{x_{s→t}}[log D_p^k(ŷ_{s→t}^k)] + E_{x_t}[log(1 − D_p^k(ŷ_t^k))]

where ŷ_{s→t}^k and ŷ_t^k are respectively the segmentation results of the k-th category in the two domains. E and C aim to minimize the objective function, and D_p^k aims to maximize it.
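The per-category decomposition itself is a simple channel split of the softmax prediction map. A minimal sketch (function name ours, not the paper's):

```python
import numpy as np

def decompose_by_category(pred):
    """Split a softmax prediction map of shape (K, H, W) into K
    single-category maps of shape (1, H, W). Each map would then be
    fed to its own category-wise discriminator."""
    return [pred[k:k + 1] for k in range(pred.shape[0])]

# Toy prediction over K = 3 categories on a 4x4 map.
logits = np.random.rand(3, 4, 4)
pred = np.exp(logits) / np.exp(logits).sum(axis=0, keepdims=True)
per_cat = decompose_by_category(pred)
```

Because the maps are probability channels rather than high-dimensional feature tensors, each category-wise discriminator operates on a single-channel input, which keeps the extra computation modest.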
Overall, the full objective function of the proposed method combines all the above losses:

L = L_seg + L_seg^a + L_cyc + L_adv^t + L_adv^u + L_adv^p + L_adv^pa + Σ_k L_adv^pk

where all the discriminators aim to maximize the objective function, and the other modules aim to minimize it. All modules are updated in an alternating way. The network architectures follow SIFA [20]: the generator contains three convolution layers, nine residual blocks, two deconvolution layers and two convolution layers in turn; the encoder contains three convolution-pooling operations, eight residual blocks, two deconvolution layers and two convolution layers; the decoder consists of one convolution layer, four residual blocks, three deconvolution layers and one convolution layer; the classifiers contain a convolution layer and an up-sampling operation; and all the discriminators consist of five convolution layers.
When training, the batch size is set to 8, the learning rate is set to 2×10^{-4}, and the Adam optimizer is used. The modules are updated in an alternating order.

Dataset
The proposed method is evaluated on the cardiac dataset of the Multi-Modality Whole Heart Segmentation Challenge 2017 (MMWHS2017) [31]. The dataset contains 20 MRI and 20 CT volumes, which are unpaired and come from different patients. Four important cardiac substructures that do not overlap each other in the 2D coronal view are selected for segmentation: the ascending aorta (AA), the left atrium blood cavity (LAC), the left ventricle blood cavity (LVC) and the myocardium of the left ventricle (MYO). The adaptation direction is from MRI to CT; that is, MRI is the source domain and CT is the target domain. For MRI, sixteen volumes are used for training and four for validation. For CT, fourteen volumes are used for training, two for validation and the remaining four for testing. When training, we use the processed coronal slices provided by SIFA [20], which are clipped, resampled, standardized and augmented on the basis of the original MMWHS2017 dataset.

Evaluation Metrics
We use two commonly used segmentation metrics for evaluation: the dice similarity coefficient (Dice) and the average symmetric surface distance (ASSD). Dice measures the volume overlap between the predicted result and the ground truth, and ASSD measures the surface distance between the two. A higher Dice and a lower ASSD indicate a better prediction. The expressions of Dice and ASSD are as follows:

Dice = 2 |A ∩ B| / (|A| + |B|)

ASSD = ( Σ_{a∈S(A)} d(a, S(B)) + Σ_{b∈S(B)} d(b, S(A)) ) / ( |S(A)| + |S(B)| )

where A and B represent the 3D prediction result and the ground truth respectively, S(·) represents the set of voxels on the 3D surface, and d(v, S) is the minimum Euclidean distance from voxel v to surface S.

All compared methods use the segmentation network in Figure 2 as their segmentation model. Comparing the result of w/o adaptation with the two domain adaptation methods: with no labeled images in the target domain, the domain adaptation methods increase the Dice from 26.7 to 80.0 and 82.1 and decrease the ASSD from 24.5 to 6.0 and 4.6 respectively, greatly improving the numerical performance and demonstrating the effectiveness of domain adaptation.
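Both metrics can be computed directly from binary masks. The following is a small self-contained NumPy sketch (a brute-force ASSD for illustration; production code would typically use a distance transform):

```python
import numpy as np

def dice_coefficient(a, b):
    """Volume overlap between two binary masks, in [0, 1]."""
    a, b = a.astype(bool), b.astype(bool)
    return 2.0 * np.logical_and(a, b).sum() / (a.sum() + b.sum())

def surface_voxels(mask):
    """Coordinates of voxels on the mask surface, i.e. foreground
    voxels with at least one background face-neighbour."""
    m = np.pad(mask.astype(bool), 1)
    interior = m.copy()
    for axis in range(m.ndim):
        interior &= np.roll(m, 1, axis=axis) & np.roll(m, -1, axis=axis)
    surf = m & ~interior
    inner = tuple(slice(1, -1) for _ in range(m.ndim))
    return np.argwhere(surf[inner])

def assd(a, b):
    """Average symmetric surface distance between two binary masks."""
    sa, sb = surface_voxels(a), surface_voxels(b)
    d = np.linalg.norm(sa[:, None, :].astype(float) - sb[None, :, :], axis=-1)
    return (d.min(axis=1).sum() + d.min(axis=0).sum()) / (len(sa) + len(sb))

# Two small cubes offset by one voxel: partial overlap, non-zero ASSD.
a = np.zeros((6, 6, 6)); a[1:4, 1:4, 1:4] = 1
b = np.zeros((6, 6, 6)); b[2:5, 2:5, 2:5] = 1
```

The pairwise-distance matrix makes this O(|S(A)|·|S(B)|), which is fine for small test volumes but motivates distance-transform implementations for full cardiac scans.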

Numerical Results
In addition, we compare our proposed method with SIFA. Our method additionally considers category-wise alignment, which increases the Dice from 80.0 to 82.1 and decreases the ASSD from 6.0 to 4.6, showing that category-wise alignment further improves the adaptation performance.

Visualization Results
We present the visualized segmentation results of different methods in Figure 5. From left to right are: the CT test image, w/o adaptation, SIFA, our method, CT supervision, and the ground truth. The correspondence between colors and cardiac substructures is shown in the legend on the right.
As shown in Figure 5, we compare the performance of our proposed method and SIFA row by row. The first and second rows show that SIFA under-segments and over-segments some substructures: for example, AA is omitted in the first row, and part of the background is mistakenly classified as AA in the second row; our proposed method corrects these deficiencies. In the third row, the LAC segmented by SIFA is not completely closed, leaving a small hole, and in the fourth row the shape of MYO is discontinuous and the segmentation of LVC is also inaccurate. Our method improves the segmentation continuity of the substructures and the cohesion between them.

For future research, we consider introducing category-wise alignment into the appearance transformation, as the appearance difference between the two domains may vary with the region, i.e., for some categories the difference may be large while for others it may be slight. Taking the category information into consideration may further improve the performance of the image appearance transformation.

Funding
This work was supported by the National Natural Science Foundation of China (NSFC, 11771276), and the Shanghai Science and Technology Innovation Action Plan (18441909000).