From U-Net to Swin-Unet Transformers: The Next-Generation Advances in Brain Tumor Segmentation with Deep Learning
1. Introduction
Brain tumors are among the most life-threatening neurological disorders, significantly impacting morbidity and mortality worldwide. According to the World Health Organization (WHO), brain tumors are classified into grades I–IV based on their malignancy, with glioblastoma multiforme (GBM) being the most aggressive [1, 2]. Early and accurate diagnosis is crucial for treatment planning, surgical intervention, and patient prognosis.
Magnetic Resonance Imaging (MRI) is the primary diagnostic tool for brain tumor assessment due to its superior soft-tissue contrast and non-invasive nature [3]. However, manual segmentation of tumor regions by radiologists is time-consuming, subjective, and prone to inter-observer variability. This has led to the development of automated and semi-automated brain tumor segmentation techniques to improve efficiency and reproducibility.
Brain tumor segmentation involves delineating different tumor subregions, such as the enhancing tumor (ET), peritumoral edema (ED), and necrotic core (NCR), from multimodal MRI scans (T1, T1c, T2, FLAIR).
Figure 1 showcases the integration of multiple MRI modalities (FLAIR, T1, T1ce, and T2) to visualize different tissue characteristics and tumor components in the brain. Each modality highlights specific pathological features: FLAIR emphasizes edema, T1 provides anatomical detail, T1ce highlights actively enhancing tumor regions, and T2 assists in differentiating tumor from healthy tissue. The segmented masks, displayed in both custom colors and grayscale, represent the output of an automated segmentation model, classifying tumor subregions such as enhancing tumor, tumor core, and edema with distinct labels.
The 3D U-Net model plays a critical role in this process by processing the entire volumetric MRI data to produce accurate voxel-wise segmentation. Its encoder-decoder architecture captures spatial context in three dimensions, preserving detailed anatomical structures through skip connections. By effectively combining the complementary information from multiple MRI sequences, 3D U-Net generates detailed, multi-class segmentation maps that are essential for precise tumor localization, treatment planning, and prognosis evaluation in clinical practice.
Figure 1. MRI modalities (Flair, T1, T1ce, T2), segmented color and grayscale mask. Color-coded: (background) in dark blue, (non-enhancing tumor) in cyan, (edema) in yellow, and (enhancing tumor) in red.
Over the past decade, advancements in machine learning (ML) and deep learning (DL) have revolutionized segmentation accuracy. Traditional methods, such as thresholding, region-growing, and clustering (K-means, Fuzzy C-means), have been increasingly replaced by convolutional neural networks (CNNs) and transformer-based architectures.
Figure 2 visualizes the segmentation of a brain tumor into its constituent subregions using a multi-class labeling scheme. The first panel (“Original Segmentation”) shows the complete labeled mask, where different tumor components are color-coded: class 0 (background) in dark blue, class 1 (non-enhancing tumor) in cyan, class 2 (edema) in yellow, and class 3 (enhancing tumor) in red. The subsequent panels isolate each class for clearer interpretation. Panel 2 displays the non-tumor region (class 0), while panels 3 to 5 separately highlight the non-enhancing core (class 1), the surrounding edema (class 2), and the enhancing tumor core (class 3), respectively. This breakdown aids in analyzing tumor heterogeneity and is crucial for diagnosis, treatment planning, and model evaluation.
Figure 2. Illustrates the classification of brain tumor segmentation into four categories: class 0—Non-tumor region, class 1—Non-enhancing tumor, class 2—Edema, and class 3—Enhancing tumor.
Despite significant progress, challenges remain, including heterogeneous tumor appearance, class imbalance, and limited annotated datasets. Publicly available datasets like the BraTS (Brain Tumor Segmentation Challenge) have played a pivotal role in benchmarking algorithms [4, 5]. Recent trends include the integration of attention mechanisms, 3D CNNs, and hybrid models to enhance segmentation performance [6].
In recent years, deep learning (DL)-based segmentation has dominated the field, surpassing traditional machine learning (ML) techniques such as random forests, support vector machines (SVMs), and atlas-based methods. The introduction of U-Net [7] revolutionized medical image segmentation, and its 3D variants (e.g., 3D U-Net, V-Net) further improved volumetric tumor analysis.
This review paper provides a comprehensive analysis of state-of-the-art brain tumor segmentation techniques, discussing their strengths, limitations, and future directions. We cover traditional ML approaches, deep learning models, evaluation metrics, and emerging trends in the field.
2. Materials and Methods
2.1. Review Methodology
This review aims to comprehensively analyse the current state of brain tumour segmentation using deep learning techniques, with a particular focus on advanced approaches such as U-Net architectures and transformer-based models. To ensure a systematic and thorough examination of the literature, a structured search was conducted using electronic databases, including PubMed, IEEE Xplore, ScienceDirect, and Google Scholar.
Search Strategy: The search terms used included “brain tumour segmentation,” “deep learning,” “U-Net,” “MRI,” “BraTS challenge,” “glioma segmentation,” “convolutional neural networks,” “activation functions,” “transformers,” and “medical image analysis.” Studies were selected to capture the most recent advancements, especially those related to the BraTS challenges. Only articles published in English were included.
Inclusion Criteria: The review included peer-reviewed journal articles and conference papers that focused on brain tumour segmentation using deep learning techniques. Research involving U-Net architectures or their variants was prioritised, as were papers discussing next-generation techniques for brain tumour segmentation.
Exclusion Criteria: Studies not related to brain tumour segmentation or those not utilising deep learning methods were excluded. Non-English publications were also excluded.
Data Extraction and Synthesis: Relevant information from the selected studies was extracted, including the proposed methods, datasets used, CNN, activation functions, performance metrics, and key findings. Emphasis was placed on studies that provided critical insights into the advancements, challenges, and future directions of U-Net architectures, and transformers in brain tumour segmentation.
The BraTS (Brain Tumor Segmentation) challenges have served as a pivotal benchmark for evaluating AI-driven segmentation methods, catalyzing remarkable progress in the field [3, 5, 8]. Traditional machine learning techniques, while initially useful, struggled with the heterogeneous appearance of tumors in BraTS multi-modal MRI datasets, achieving limited Dice scores (typically 60% - 75%). The introduction of deep learning, particularly U-Net variants [9], dramatically improved performance (Dice ~85% - 90%) by automatically learning discriminative features across T1, T2, FLAIR, and T1ce sequences. Subsequent BraTS editions witnessed transformer-based models like Swin UNETR [10] pushing boundaries further (Dice > 90%) through global context modeling, while diffusion models enhanced edge detection in tumor sub-regions. The challenges also spurred innovations in federated learning [11] to address data privacy concerns and weakly supervised techniques [12] to mitigate annotation bottlenecks. Notably, BraTS-2023 highlighted how ensemble methods combining CNNs and transformers achieved state-of-the-art results (Dice ~92%), demonstrating the synergistic potential of hybrid architectures [6]. These methodological advances, rigorously tested through BraTS, have not only improved algorithmic performance but also translated to more reliable clinical decision-support systems, reducing inter-rater variability from 15% - 20% to under 5% in tumor volume estimation.
2.2. Traditional Machine Learning (ML) Approaches
Before the deep learning era, classical ML techniques formed the foundation of brain tumor analysis by leveraging statistical models and manually engineered features.
2.2.1. Feature-Based Methods
Texture and intensity-based approaches were pivotal in early tumor characterization. Haralick features and Gabor filters extracted textural patterns from MRIs, while Local Binary Patterns (LBP) captured local contrast variations [13]. Histogram-based methods analyzed intensity distributions across T1, T2, and FLAIR sequences to identify abnormal tissue. For morphological analysis, Active Contours (Snakes) and Level Sets evolved initial contours to match tumor boundaries [14], with Region Growing techniques propagating seeds based on intensity similarity [15].
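To make these classical pipelines concrete, the sketch below extracts a small hand-crafted feature vector (GLCM statistics and an LBP histogram) from a single slice, assuming scikit-image (≥ 0.19) and NumPy are available; the array `mri_slice` is a random stand-in for a real T1/T2/FLAIR slice.

```python
# A minimal sketch of classical texture-feature extraction for a 2D MRI slice.
import numpy as np
from skimage.feature import graycomatrix, graycoprops, local_binary_pattern

rng = np.random.default_rng(0)
mri_slice = rng.integers(0, 256, size=(240, 240), dtype=np.uint8)  # stand-in data

# Haralick-style statistics from a gray-level co-occurrence matrix (GLCM)
glcm = graycomatrix(mri_slice, distances=[1], angles=[0, np.pi / 2],
                    levels=256, symmetric=True, normed=True)
contrast = graycoprops(glcm, "contrast").mean()
homogeneity = graycoprops(glcm, "homogeneity").mean()

# Local Binary Patterns capture local contrast variations
lbp = local_binary_pattern(mri_slice, P=8, R=1, method="uniform")
lbp_hist, _ = np.histogram(lbp, bins=10, range=(0, 10), density=True)

feature_vector = np.concatenate([[contrast, homogeneity], lbp_hist])
print(feature_vector.shape)  # (12,) hand-crafted features per slice or patch
```

Such feature vectors were typically computed per voxel or per patch and then fed to the supervised classifiers described next.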
2.2.2. Supervised Learning Classifiers
Supervised methods enabled automated tumor classification using labeled data. Support Vector Machines (SVMs) with RBF kernels effectively separated tumor and healthy tissue by maximizing margin hyperplanes in high-dimensional feature spaces [16]. Random Forests improved robustness through ensemble decision trees that handled multi-class segmentation tasks [17]. While simpler, k-Nearest Neighbors (k-NN) provided baseline performance by classifying voxels based on neighboring annotated samples [18].
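A minimal illustration of such a supervised pipeline is sketched below, assuming scikit-learn; the feature matrix and labels are random placeholders standing in for hand-crafted voxel features and tumor/healthy annotations.

```python
# A minimal sketch of voxel-wise tumor vs. healthy classification with an RBF-kernel SVM.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 12))      # 2000 voxels, 12 hand-crafted features each
y = rng.integers(0, 2, size=2000)    # 0 = healthy, 1 = tumor (dummy labels)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    random_state=0)
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
clf.fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))
```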
2.2.3. Unsupervised Learning Methods
These techniques identified tumor regions without prior labels. K-means and Fuzzy C-Means (FCM) clustered voxels based on intensity similarity, with FCM allowing partial membership to account for tissue heterogeneity [19]. Gaussian Mixture Models (GMMs) offered probabilistic segmentation by fitting MRI intensities to multiple Gaussian distributions, particularly effective for differentiating tumor sub-regions.
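The sketch below illustrates this unsupervised setting, assuming scikit-learn: K-means assigns hard cluster labels to voxel intensity vectors, while a Gaussian Mixture Model yields soft posterior memberships, analogous in spirit to the partial memberships of Fuzzy C-Means.

```python
# A minimal sketch of unsupervised intensity clustering on multimodal voxel intensities.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
voxels = rng.normal(size=(5000, 3))          # 5000 voxels x 3 modalities (dummy)

# Hard assignments from K-means
kmeans_labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(voxels)

# Soft (probabilistic) memberships from a Gaussian Mixture Model
gmm = GaussianMixture(n_components=4, covariance_type="full", random_state=0)
gmm_posteriors = gmm.fit(voxels).predict_proba(voxels)

print(kmeans_labels.shape, gmm_posteriors.shape)   # (5000,) (5000, 4)
```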
2.3. Deep Learning (DL) Approaches
Deep Learning has revolutionized brain tumor segmentation through its ability to automatically learn hierarchical features from medical images without manual feature engineering.
2.3.1. Convolutional Neural Networks (CNNs)
2D CNNs established foundational architectures for medical image analysis. The U-Net architecture [7] became the gold standard by combining an encoder-decoder structure with skip connections to preserve spatial details. SegNet [20] improved computational efficiency by using pooling indices for precise upsampling. For volumetric analysis, 3D U-Net [21] extended this approach to process whole MRI volumes, while V-Net [22] incorporated residual connections to enhance gradient flow in deep networks. DeepMedic [23] introduced parallel processing pathways to capture multi-scale tumor features simultaneously.
2.3.2. Advanced CNN Architectures
Attention mechanisms significantly improved segmentation precision. Attention U-Net [24] learned to focus computational resources on tumor regions while suppressing irrelevant areas, enhancing the localization of complex tumor boundaries. Squeeze-and-Excitation blocks [25] dynamically recalibrated channel-wise feature responses, allowing the network to emphasize informative features while diminishing noise. Hybrid architectures like DenseUNet [26, 27] employed pyramid pooling to capture contextual information at various receptive fields, which is essential for recognizing tumors of varying sizes and shapes. These designs offer richer semantic understanding and more robust feature representation. Further advancements incorporated residual connections, dilated convolutions, and deep supervision to enhance learning stability and accuracy.
Networks like DeepLabV3+ and the High-Resolution Network (HRNet) pushed the boundaries further by integrating high-resolution representations and spatial hierarchies, improving both edge delineation and the recognition of intra-tumoral heterogeneity. Collectively, these innovations in CNN architecture have led to notable improvements in segmentation accuracy, sensitivity, and clinical applicability, marking a significant leap from early deep learning models.
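As an illustration of the channel-recalibration idea behind Squeeze-and-Excitation, the following PyTorch sketch implements an SE block for 3D feature maps; the channel count and reduction ratio are illustrative assumptions rather than settings from a specific published model.

```python
# A minimal PyTorch sketch of a Squeeze-and-Excitation (SE) block for 3D feature maps.
import torch
import torch.nn as nn

class SEBlock3D(nn.Module):
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool3d(1)      # "squeeze": global spatial context
        self.fc = nn.Sequential(                  # "excitation": per-channel gates
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c = x.shape[:2]
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1, 1)
        return x * w                              # recalibrate channel responses

features = torch.randn(1, 32, 16, 16, 16)         # (batch, channels, D, H, W)
print(SEBlock3D(32)(features).shape)               # torch.Size([1, 32, 16, 16, 16])
```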
2.3.3. U-Net Architecture
U-Net is a convolutional neural network (CNN) architecture designed for biomedical image segmentation [7], featuring a symmetric encoder-decoder structure with skip connections to preserve spatial details.
Figure 3 illustrates the U-Net architecture used for image segmentation. The contracting path (encoder) applies repeated 3 × 3 convolutions followed by downsampling, progressively increasing the number of feature channels from 64 up to 1024 at the bottleneck while reducing spatial resolution. The expanding path (decoder) mirrors this structure, using 2 × 2 up-convolutions (denoted as “Up”) to restore resolution and a final 1 × 1 convolution to map the features to class labels. At each level, high-resolution encoder features are concatenated with the upsampled decoder outputs through skip connections, giving the network its symmetric, U-shaped structure. This design is critical for tasks requiring detailed spatial accuracy, such as medical image segmentation, where the ability to capture both local and global features is essential. Overall, the figure encapsulates the U-Net’s hierarchical approach, balancing depth, resolution, and computational cost to optimize performance.
Figure 3. The U-Net architecture for brain tumor segmentation in MRI, illustrating the input MRI scan and corresponding segmented output.
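To make the encoder-decoder idea concrete, the sketch below implements a deliberately tiny 3D U-Net in PyTorch with a single downsampling level and one skip connection; the channel widths and depth are simplifications for illustration, not the configuration used in [7] or in BraTS submissions.

```python
# A compact, illustrative 3D U-Net with one resolution level and a skip connection.
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    return nn.Sequential(
        nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.InstanceNorm3d(out_ch), nn.ReLU(inplace=True),
        nn.Conv3d(out_ch, out_ch, kernel_size=3, padding=1),
        nn.InstanceNorm3d(out_ch), nn.ReLU(inplace=True),
    )

class TinyUNet3D(nn.Module):
    def __init__(self, in_channels=4, num_classes=4):
        super().__init__()
        self.enc1 = conv_block(in_channels, 16)
        self.down = nn.MaxPool3d(2)
        self.enc2 = conv_block(16, 32)                       # bottleneck
        self.up = nn.ConvTranspose3d(32, 16, kernel_size=2, stride=2)
        self.dec1 = conv_block(32, 16)                       # 32 = 16 (skip) + 16 (up)
        self.head = nn.Conv3d(16, num_classes, kernel_size=1)

    def forward(self, x):
        e1 = self.enc1(x)                                    # full resolution
        e2 = self.enc2(self.down(e1))                        # half resolution
        d1 = self.dec1(torch.cat([self.up(e2), e1], dim=1))  # skip connection
        return self.head(d1)                                 # per-voxel class logits

volume = torch.randn(1, 4, 64, 64, 64)                       # 4 MRI modalities
print(TinyUNet3D()(volume).shape)                             # torch.Size([1, 4, 64, 64, 64])
```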
2.3.4. Activation Functions
Activation functions are crucial components of deep learning models, introducing non-linearity that allows networks to approximate complex, non-linear functions and learn intricate patterns from input data [28, 29]. In the context of medical image segmentation, the choice of activation function can significantly influence model performance, convergence speed, and the ability to mitigate issues like the vanishing gradient problem [30, 31].
Table 1 presents a comparative analysis of activation functions widely used in deep learning models, including ReLU, Leaky ReLU, Swish, Mish, ELiSH, HardELiSH, Softsign, and Tanh. The table highlights their individual strengths, such as computational efficiency (ReLU), smooth gradient flow (Swish and Mish), and the ability to avoid vanishing gradients (HardELiSH). Limitations like the dead-neuron problem (ReLU) or computational cost (Mish) are also noted. The table further outlines their contributions to model performance in terms of convergence speed, stability, and accuracy.
These functions are critical for shaping the learning dynamics and overall performance of neural networks, influencing the convergence speed, gradient flow, and ultimately the model’s accuracy.
Mushtaq Salih et al. (2019) conducted a comparative study on activation functions and concluded that the HardELiSH activation function outperformed ReLU, particularly in addressing the vanishing gradient problem. Their findings demonstrated that HardELiSH not only mitigated this issue more effectively than ReLU but also led to an overall improvement in detection accuracy [37].
These advanced activation functions enable deep learning models to overcome the limitations of traditional functions, such as Sigmoid or Tanh, which suffer from the vanishing gradient problem in deeper networks. In the context of brain tumour segmentation, they provide essential benefits:
Improved Gradient Flow: By mitigating the vanishing gradient problem, advanced functions ensure that deeper layers in models like 3D U-Net continue learning effectively, resulting in better segmentation performance.
Enhanced Feature Extraction: These functions allow for more nuanced feature mapping, critical for detecting tumour boundaries and distinguishing different tissue types in MRI data.
Table 1. Comparative analysis of activation functions: strengths and limitations in deep learning for brain tumor segmentation.
| Activation Function | Strengths | Limitations | Contribution to Brain Tumor Segmentation |
| --- | --- | --- | --- |
| ReLU [28] | Simple and fast; effective in shallow networks | Dying ReLU problem (zero gradient); ignores negative input | Used as a baseline in early U-Net models; limited capacity in handling complex tumor boundaries |
| Leaky ReLU [32] | Allows gradient for negative inputs; fixes dying ReLU | Still linear in nature; slight performance gain only; costlier to compute | Improves segmentation robustness over ReLU; better handling of low-intensity tumor regions |
| Swish [30] | Smooth and non-monotonic; promotes better generalization | May slow training in resource-limited settings | Enhances model expressiveness; improves Dice score by capturing complex tumor patterns |
| ELiSH [33] | Strong non-linearity; effective in both positive and negative ranges | Computationally complex; slower convergence | Provides better boundary delineation; improves gradient flow in deep U-Nets |
| HardELiSH [33] | Combines Swish & ELU benefits; fast; excellent gradient flow | Relatively new; needs fine-tuning | Shows superior segmentation accuracy; mitigates vanishing gradient in deep models |
| Mish [34] | Smooth & non-monotonic; good generalization; outperforms Swish in some tasks | High computational cost; may be unstable in very deep networks | Demonstrates high segmentation accuracy; effective in capturing fine tumor details |
| Softsign [35] | Smooth & continuous; bounded output [−1, 1]; simple math | Vanishing gradient; limited dynamic range | Rarely used in modern models; limited impact on tumor segmentation performance |
| Tanh [36] | Bounded output [−1, 1]; smooth & differentiable | Severe vanishing gradient; saturates quickly; slow convergence | Historically used in early models; largely replaced by advanced activations |
Incorporating these advanced activation functions in brain tumour segmentation models can lead to more accurate, robust, and clinically viable results, making them a vital component of modern medical imaging techniques. Continued research into novel activation functions remains crucial for further improving the performance and efficiency of deep learning models in medical image analysis. Mushtaq et al. (2025) [38] provided a detailed review of various activation functions, including their mathematical formulations and graphical representations, offering clear insights into their operational behaviors.
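For reference, the following NumPy sketch implements several of the activation functions listed in Table 1; the ELiSH expression follows the formulation commonly cited in the literature and should be checked against [33] before reuse.

```python
# Reference implementations of several activation functions from Table 1.
import numpy as np

def relu(x):               return np.maximum(0.0, x)
def leaky_relu(x, a=0.01): return np.where(x >= 0, x, a * x)
def sigmoid(x):            return 1.0 / (1.0 + np.exp(-x))
def swish(x):              return x * sigmoid(x)                    # a.k.a. SiLU
def mish(x):               return x * np.tanh(np.log1p(np.exp(x)))  # x * tanh(softplus(x))
def elish(x):              return np.where(x >= 0, x * sigmoid(x),
                                            (np.exp(x) - 1) * sigmoid(x))

x = np.linspace(-4, 4, 9)
for name, fn in [("ReLU", relu), ("LeakyReLU", leaky_relu),
                 ("Swish", swish), ("Mish", mish), ("ELiSH", elish)]:
    print(f"{name:10s}", np.round(fn(x), 3))
```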
2.4. Transformer-Based Models
The advent of Vision Transformers (ViTs) has revolutionized medical imaging by overcoming the limitations of traditional CNNs, particularly in capturing long-range dependencies and global context. Swin UNETR [10] pioneered the use of hierarchical Swin Transformers for 3D medical segmentation, enabling more precise delineation of complex anatomical structures through shifted window-based self-attention. TransBTS [39] further bridged the gap between CNNs and Transformers, synergizing the local feature extraction strength of convolutional networks with the global contextual reasoning of Transformers, leading to robust performance in tumor and lesion segmentation. Meanwhile, the Medical Transformer (MedT) [40] introduced a breakthrough with gated axial attention, significantly improving computational efficiency while processing high-resolution MRI scans, making it feasible to handle large volumetric data without compromising accuracy. These innovations underscore the transformative potential of Transformer-based models in medical imaging, paving the way for more interpretable, scalable, and high-performance AI-driven diagnostic systems.
Swin 3D U-Net
The Swin 3D U-Net is an advanced deep learning architecture that integrates the hierarchical Swin Transformer with the 3D U-Net framework to enhance volumetric medical image segmentation [41]. Unlike traditional U-Nets that rely solely on convolutional operations, this hybrid model leverages the self-attention mechanism of Swin Transformers to capture long-range dependencies in 3D medical scans (e.g., MRI, CT) while maintaining the U-Net’s ability to preserve spatial hierarchies through its encoder-decoder structure. The shifted window (Swin) mechanism improves computational efficiency by processing non-overlapping local windows in 3D space, reducing memory overhead compared to standard Vision Transformers (ViTs). By combining shifted window-based self-attention with 3D convolutions, the Swin 3D U-Net achieves superior performance in tasks like tumor segmentation, organ delineation, and multimodal image analysis, offering better scalability and accuracy than purely convolutional or transformer-based approaches. This architecture is particularly effective for high-resolution 3D datasets where global context and fine-grained localization are critical.
The Swin 3D U-Net architecture consists of an encoder, bottleneck, and decoder with specialized layers for hierarchical feature learning in volumetric medical images.
Encoder: Processes input patches (224 × 224, patch size 4) using Swin Transformer blocks, maintaining feature resolution while progressively downsampling via patch merging layers (2 × reduction in resolution, 2 × increase in dimension). This repeats three times.
Bottleneck: Two Swin Transformer blocks capture deep features without altering resolution or dimension.
Decoder: Uses patch expanding layers to upsample features (2 × resolution increase, halving dimension) and skip connections to fuse multi-scale encoder features, preserving spatial details [42].
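A usage-level sketch of such a Swin-based 3D segmentation network is shown below using MONAI's SwinUNETR implementation, assuming MONAI and PyTorch are installed; constructor arguments (e.g., `img_size`, `feature_size`) vary between MONAI versions, so treat this as illustrative rather than a reference configuration.

```python
# A usage sketch of a Swin-Transformer-based 3D segmentation model via MONAI.
import torch
from monai.networks.nets import SwinUNETR

model = SwinUNETR(
    img_size=(128, 128, 128),   # input patch size (divisible by 32)
    in_channels=4,              # T1, T1ce, T2, FLAIR
    out_channels=3,             # e.g., ET, TC, WT output channels
    feature_size=48,            # width of the first Swin stage
)
x = torch.randn(1, 4, 128, 128, 128)
with torch.no_grad():
    print(model(x).shape)       # torch.Size([1, 3, 128, 128, 128])
```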
2.5. Generative Models
Generative approaches have significantly expanded the possibilities of tumor analysis by leveraging advanced deep learning techniques to improve accuracy and realism in medical imaging. SegAN [43] introduced an adversarial training framework to generate more realistic and precise segmentation masks, enhancing the delineation of tumor boundaries in MRI and CT scans. Meanwhile, CycleGAN [44] addressed the challenge of unpaired data by enabling image-to-segmentation translation without requiring exact correspondences between input and output images, thus facilitating domain adaptation in heterogeneous datasets. More recently, diffusion models like DDPM [45] have achieved state-of-the-art results in tumor segmentation and synthesis by employing iterative denoising processes, which enhance the model’s ability to capture fine-grained details and reduce artifacts. These generative approaches not only improve diagnostic accuracy but also enable synthetic data generation for training robust models in scenarios where annotated medical data is scarce. Furthermore, their application extends to treatment planning, where realistic synthetic images can aid in simulating tumor progression and therapeutic responses. As these methods continue to evolve, they hold great promise for advancing personalized medicine and improving clinical decision-making in oncology.
2.6. Weakly/Self-Supervised Learning
These methods addressed data scarcity challenges. Contrastive learning frameworks [46] enable effective pre-training on unlabeled MRI datasets. Pseudo-labeling and scribble learning approaches [12] significantly reduced annotation requirements while maintaining competitive performance.
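The sketch below shows an InfoNCE/NT-Xent-style contrastive loss of the kind used for such pre-training, assuming PyTorch; the encoder that would produce the two embedding views of an unlabeled MRI patch is omitted for brevity.

```python
# A minimal sketch of a contrastive (InfoNCE-style) pre-training loss.
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.1):
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature      # cosine similarities between views
    targets = torch.arange(z1.size(0))      # matching pairs lie on the diagonal
    return F.cross_entropy(logits, targets)

z1, z2 = torch.randn(8, 128), torch.randn(8, 128)   # embeddings of two augmented views
print(info_nce(z1, z2).item())
```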
2.7. Federated Learning (FL)
Federated learning has emerged as a groundbreaking paradigm for collaborative model development while addressing data privacy concerns in healthcare. FedAvg (Federated Averaging) and FedBN (Federated Batch Normalization) [11] enable multiple medical institutions to jointly train segmentation models without sharing raw patient data. These approaches work by distributing model training across institutions and aggregating only the learned parameters, not the sensitive imaging data. This is particularly valuable for brain tumor segmentation, where datasets are often small and fragmented across hospitals. Recent implementations have demonstrated that FL can achieve comparable performance to centralized training while complying with strict medical data regulations like Health Insurance Portability and Accountability Act (HIPAA), and General Data Protection Regulation (GDPR). Advanced variants now incorporate differential privacy and secure multi-party computation to further enhance data protection during the federated training process.
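At its core, FedAvg is a weighted average of locally trained parameters, as in the hedged PyTorch sketch below; the tiny linear model and the per-site weights are placeholders for a real segmentation network and real sample counts.

```python
# A minimal FedAvg sketch: only parameter tensors are averaged, never raw data.
import copy
import torch
import torch.nn as nn

def federated_average(local_models, weights):
    """Weighted average of state_dicts; `weights` are per-site sample fractions."""
    avg_state = copy.deepcopy(local_models[0].state_dict())
    for key in avg_state:
        avg_state[key] = sum(w * m.state_dict()[key].float()
                             for w, m in zip(weights, local_models))
    return avg_state

global_model = nn.Linear(16, 2)                    # stand-in for a segmentation network
site_models = [copy.deepcopy(global_model) for _ in range(3)]
# ... each site would run local training here on its private data ...
new_state = federated_average(site_models, weights=[0.5, 0.3, 0.2])
global_model.load_state_dict(new_state)            # updated global model for the next round
```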
2.8. Explainable AI (XAI) for Clinical Trust
The increasing complexity of deep learning models has necessitated the development of explainability techniques to facilitate clinical adoption. Grad-CAM (Gradient-weighted Class Activation Mapping) and SHAP (SHapley Additive exPlanations) [47, 48] provide intuitive visualizations that highlight which image regions most influenced the model’s segmentation decisions. These methods help clinicians understand why a model classified certain areas as tumorous, enabling them to verify the algorithm’s reasoning against their medical expertise. Recent advances in XAI for medical imaging now combine attention maps with uncertainty quantification, providing not only localization of important features but also confidence estimates in the predictions. This dual approach has been shown to improve radiologists’ trust and diagnostic efficiency when working with AI-assisted segmentation systems.
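A minimal Grad-CAM sketch is shown below for a toy 2D classifier in PyTorch, using forward and backward hooks on a convolutional layer; the network, target layer, and "tumor" class index are illustrative assumptions, not a published model.

```python
# A minimal Grad-CAM sketch: gradient-weighted activation maps from a conv layer.
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Sequential(
    nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 2),
)
target_layer = model[0]
acts, grads = {}, {}
target_layer.register_forward_hook(lambda m, i, o: acts.update(v=o))
target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(v=go[0]))

x = torch.randn(1, 1, 64, 64, requires_grad=True)   # dummy MRI slice
score = model(x)[0, 1]                               # logit of the hypothetical "tumor" class
score.backward()

weights = grads["v"].mean(dim=(2, 3), keepdim=True)  # global-average-pooled gradients
cam = F.relu((weights * acts["v"]).sum(dim=1))       # weighted activation map
cam = F.interpolate(cam.unsqueeze(1), size=x.shape[-2:], mode="bilinear")
print(cam.shape)                                     # torch.Size([1, 1, 64, 64]) heatmap
```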
2.9. Real-Time Models for Clinical Deployment
The translation of segmentation algorithms into clinical practice requires models that can operate efficiently on medical hardware. Lightweight architectures like MobileNet [49] and EfficientNet [50] have been specifically adapted for this purpose through techniques such as depthwise separable convolutions and neural architecture search. These optimizations enable near real-time tumor segmentation on standard hospital workstations and even mobile devices, with inference times often under one second per MRI slice. Recent work has focused on developing hybrid models that maintain this efficiency while incorporating 3D contextual information crucial for accurate tumor volume estimation. Such models are now being integrated into surgical navigation systems and intraoperative MRI suites, providing surgeons with continuously updated tumor delineations during procedures.
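Much of the efficiency of MobileNet-style models comes from depthwise separable convolutions, illustrated in the PyTorch sketch below; the channel counts are arbitrary and serve only to show the parameter savings.

```python
# A minimal sketch of the depthwise separable convolution used by lightweight CNNs.
import torch
import torch.nn as nn

class DepthwiseSeparableConv2d(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1,
                                   groups=in_ch)                 # one filter per channel
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)  # channel mixing

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

standard = nn.Conv2d(64, 128, kernel_size=3, padding=1)
separable = DepthwiseSeparableConv2d(64, 128)
count = lambda m: sum(p.numel() for p in m.parameters())
print(count(standard), "vs", count(separable), "parameters")  # roughly 8x fewer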
2.10. Multimodal Fusion for Comprehensive Analysis
Modern brain tumor characterization benefits immensely from combining information across multiple imaging modalities. Early fusion networks process concatenated MRI, PET, and DTI inputs through shared feature extractors, while late fusion approaches combine predictions from modality-specific networks [51, 52]. The latest architectures employ attention-based fusion mechanisms that dynamically weight the contribution of each modality based on contextual relevance. Advanced implementations now incorporate cross-modal contrastive learning during pretraining to better align feature spaces across modalities. This multimodal approach has proven particularly valuable for distinguishing tumor recurrence from radiation necrosis and for precisely delineating infiltrative tumor margins that appear ambiguous in single-modality scans.
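The contrast between early and late fusion can be sketched in a few lines of PyTorch, as below; the tiny convolutional encoders, the MRI/PET placeholders, and the equal late-fusion weights are illustrative assumptions.

```python
# A minimal sketch contrasting early and late multimodal fusion.
import torch
import torch.nn as nn

mri, pet = torch.randn(1, 1, 64, 64), torch.randn(1, 1, 64, 64)   # dummy co-registered inputs

# Early fusion: concatenate modalities at the input and share one feature extractor
early_net = nn.Sequential(nn.Conv2d(2, 8, 3, padding=1), nn.ReLU(),
                          nn.Conv2d(8, 2, 1))
early_logits = early_net(torch.cat([mri, pet], dim=1))

# Late fusion: modality-specific networks whose predictions are combined afterwards
mri_net = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(), nn.Conv2d(8, 2, 1))
pet_net = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(), nn.Conv2d(8, 2, 1))
late_logits = 0.5 * mri_net(mri) + 0.5 * pet_net(pet)

print(early_logits.shape, late_logits.shape)   # both torch.Size([1, 2, 64, 64])
```

Attention-based fusion replaces the fixed 0.5/0.5 weighting above with learned, context-dependent weights.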
Table 2 presents a comprehensive comparison of techniques used in brain tumor segmentation across various categories. Traditional machine learning methods like SVM and Random Forests are interpretable but limited by manual feature extraction. CNN architectures such as U-Net and 3D U-Net improve spatial understanding, though they require high computational resources. Advanced CNNs (e.g., Attention U-Net) introduce focus mechanisms but add complexity. Transformer-based models like Swin UNETR capture global context yet demand large datasets. Generative models enhance realism but suffer from training instability. Weak supervision reduces labeling costs at the expense of accuracy. Federated learning ensures privacy in multi-center settings but involves communication challenges. Explainable AI tools foster clinical trust, though their outputs are sometimes unreliable. Real-time models and multimodal fusion enable mobile deployment and cross-modality synergy, respectively, but face trade-offs in accuracy and data alignment.
Table 2. Review summary of related works: techniques, strengths and limitations.
| Category | Key Paper | Technique | Strengths | Limitations |
| --- | --- | --- | --- | --- |
| Traditional ML | Caulier et al. (2011) [53] | Haralick features, Gabor filters | Interpretable, works on small datasets | Manual feature engineering, poor generalization |
| | Zahra et al. (2021) [54] | Active Contours, Level Sets | Precise boundary detection | Sensitive to initialization |
| | Zhang et al. (2015) [55] | SVM with RBF kernel | Effective for binary classification | Struggles with multi-class tasks |
| | Geremia et al. (2013) [56] | Random Forests | Handles multi-class segmentation | Limited to extracted features |
| CNN Architectures | Ronneberger et al. (2015) [7] | U-Net | Skip connections preserve spatial details | 2D version loses volumetric context |
| | Çiçek et al. (2016) [21] | 3D U-Net | Volumetric processing | High memory requirements |
| | Milletari et al. (2016) [22] | V-Net | Residual connections improve gradient flow | Computationally intensive |
| Advanced CNNs | Oktay et al. (2018) [24] | Attention U-Net | Focuses on relevant regions | Additional parameters to train |
| | Hu et al. (2018) [25] | Squeeze-and-Excitation blocks | Channel-wise feature recalibration | Minor computational overhead |
| Transformers | Hatamizadeh et al. (2021) [10] | Swin UNETR | Captures long-range dependencies | Requires large datasets |
| | Wang et al. (2021) [39] | TransBTS | Combines CNN + Transformer strengths | Complex architecture |
| Generative Models | Xue et al. (2018) [43] | SegAN | Produces realistic segmentations | Training instability |
| | Pinaya et al. (2022) [45] | DDPM | High-precision iterative refinement | Slow inference time |
| Weak Supervision | Zhou et al. (2018) [12] | Scribble learning | Reduces annotation burden | Lower accuracy than full supervision |
| Federated Learning | Li et al. (2021) [11] | FedAvg, FedBN | Privacy-preserving multi-center collaboration | Communication overhead |
| Explainable AI | Santos et al. (2024) [48] | Grad-CAM, SHAP | Increases clinical trust | Explanations sometimes unreliable |
| Real-Time Models | Howard et al. (2017) [49] | MobileNet | Mobile/edge device deployment | Reduced accuracy |
| Multimodal Fusion | Zhou et al. (2023) [52] | Attention fusion | Leverages complementary modality information | Requires co-registered data |
3. Evaluation Metrics in Brain Tumor Segmentation
Tumor segmentation in medical imaging relies on robust evaluation metrics to assess model performance accurately. The most common metrics include the Dice Similarity Coefficient (DSC) [57], which measures overlap between predicted and ground truth segmentations, and the Hausdorff Distance (HD) [58], which evaluates boundary agreement. Sensitivity (Recall) and Specificity assess detection accuracy for tumor vs. non-tumor regions, while Precision minimizes false positives. The Jaccard Index (IoU) complements DSC by measuring intersection-over-union [57].
For clinical relevance, Volume Difference (VD) quantifies size discrepancies, and Accuracy provides overall pixel-wise correctness. Advanced metrics like Normalized Mutual Information (NMI) and Receiver Operating Characteristic (ROC) curves help evaluate probabilistic predictions. Recent challenges (e.g., BraTS), which are summarized from 2014 to 2023 [59], also emphasize Uncertainty Quantification to gauge model confidence. These metrics collectively ensure segmentation models meet diagnostic precision and reliability standards in oncology.
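For concreteness, the following NumPy sketch implements the core overlap metrics on binary masks; it is an illustration rather than the official BraTS evaluation code, and the random masks stand in for real predictions and ground truth.

```python
# Minimal NumPy implementations of common segmentation overlap metrics.
import numpy as np

def dice(pred, gt, eps=1e-8):
    inter = np.logical_and(pred, gt).sum()
    return 2.0 * inter / (pred.sum() + gt.sum() + eps)

def jaccard(pred, gt, eps=1e-8):
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / (union + eps)

def sensitivity(pred, gt, eps=1e-8):          # recall on the tumor class
    tp = np.logical_and(pred, gt).sum()
    return tp / (gt.sum() + eps)

def specificity(pred, gt, eps=1e-8):
    tn = np.logical_and(~pred, ~gt).sum()
    return tn / ((~gt).sum() + eps)

rng = np.random.default_rng(0)
gt = rng.random((64, 64, 64)) > 0.7           # dummy ground-truth tumor mask
pred = rng.random((64, 64, 64)) > 0.7         # dummy predicted mask
print(dice(pred, gt), jaccard(pred, gt), sensitivity(pred, gt), specificity(pred, gt))
```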
Table 3 summarizes both standard and emerging evaluation metrics used in tumor segmentation. The Dice Similarity Coefficient (DSC) is the most widely adopted, capturing overall segmentation accuracy. Hausdorff Distance (HD) complements DSC by highlighting boundary mismatches, crucial in clinical edge detection. Sensitivity and specificity evaluate a model’s ability to detect tumor tissue accurately and avoid false alarms, respectively. Precision is especially critical in high-risk diagnoses. Volumetric similarity assesses how well the predicted tumor size matches the actual, which is vital for planning treatments. The Jaccard Index (IoU) offers a stricter alternative to DSC, while Jaccard Distance provides a complementary view for identifying segmentation errors. These metrics collectively ensure robust and clinically meaningful model evaluation.
Table 4 compares several deep learning architectures on the BraTS2021 validation dataset using key metrics such as parameter count, model size, and Dice scores across tumor subregions: Enhancing Tumor (ET), Tumor Core (TC), and Whole Tumor (WT).
Table 3. Standard and emerging metrics for evaluating tumor segmentation.
| Metric | Description | Clinical Relevance |
| --- | --- | --- |
| Dice Similarity Coefficient (DSC) [57] | Measures overlap between predicted and ground-truth regions: DSC = 2TP/(2TP + FP + FN) | Most commonly used; reflects segmentation accuracy |
| Hausdorff Distance (HD) [58] | Measures the maximum distance between the boundary points of two sets | Sensitive to boundary errors |
| Sensitivity (Recall) [60] | True Positives/(True Positives + False Negatives) | Reflects ability to detect tumor pixels |
| Specificity [61] | True Negatives/(True Negatives + False Positives) | Reflects ability to avoid false tumor detection |
| Precision [62] | True Positives/(True Positives + False Positives) | Important in high-risk clinical decisions |
| Volumetric Similarity | Compares total segmented volume vs. actual tumor volume | Key in treatment planning and prognosis estimation |
| Jaccard Index (IoU) [57] | Ratio of intersection to union of predicted and ground-truth regions: IoU = TP/(TP + FP + FN) | Similar to DSC but stricter (always ≤ DSC); widely used in detection tasks |
| Jaccard Distance | Complement of IoU: JD = 1 − IoU | Quantifies dissimilarity; useful for error analysis |
Table 4. Parameter counts, memory footprint, and segmentation performance (mean Dice, ET, TC, WT) of several state-of-the-art models; statistical significance is assessed for each model relative to Swin-Unet3D using the Wilcoxon signed-rank test [44].
| Model Name | Params (M) | Param Size (MB) | Mean Dice | ET Dice | TC Dice | WT Dice |
| --- | --- | --- | --- | --- | --- | --- |
| 3D U-Net | 7.9 | 15.834 | 0.825 | 0.825 | 0.844 | 0.900 |
| V-Net | 45.6 | 182.432 | 0.815 | 0.815 | 0.840 | 0.751 |
| UnetR | 102 | 204.899 | 0.842 | 0.842 | 0.853 | 0.905 |
| TransBTS | 33.0 | 65.975 | 0.824 | 0.824 | 0.843 | 0.889 |
| SwinBTS | 35.7 | 71.394 | 0.828 | 0.828 | 0.843 | 0.896 |
| Attention U-Net | 23.6 | 47.257 | 0.841 | 0.841 | 0.851 | 0.870 |
| Swin Pure Unet3D | 33.6 | 67.163 | 0.817 | 0.817 | 0.822 | 0.885 |
| Swin Unet3D | 33.7 | 67.403 | 0.834 | 0.834 | 0.866 | 0.905 |
Swin-Unet3D achieved the highest TC Dice (0.866) and, together with UnetR, the highest WT Dice (0.905), indicating superior segmentation accuracy, particularly for complete tumor regions. UnetR also showed strong performance across all metrics, including the highest mean Dice (0.842). Lightweight models like 3D U-Net had fewer parameters and a smaller memory footprint but performed moderately in Dice scores. V-Net had a relatively large parameter size but underperformed in WT segmentation (Dice: 0.751). Transformer-based models like TransBTS and SwinBTS demonstrated a good trade-off between performance and complexity. Statistical significance testing (Wilcoxon) showed that Swin-Unet3D significantly outperformed most models in at least one of the tumor subregions.
Hatamizadeh et al. (2022) have introduced Swin UNETR, a novel transformer-based model for 3D brain tumor segmentation in multi-modal MRI, addressing the limitations of traditional FCNNs (e.g., U-Net) in capturing long-range dependencies due to their restricted kernel sizes. By reformulating segmentation as a sequence-to-sequence task, Swin UNETR leverages a hierarchical Swin transformer encoder to process input data as 1D embeddings, extracting multi-scale features through shifted-window self-attention, while a CNN-based decoder connected via skip connections refines the output. This hybrid architecture effectively combines the strengths of transformers (long-range modeling) and CNNs (local feature extraction), achieving state-of-the-art performance in the BraTS 2021 challenge and demonstrating the potential of transformers in medical image analysis [63].
Overall, Swin-Unet3D and UNETR appear to be leading models in terms of both accuracy and generalization across tumor structures, while maintaining reasonable parameter sizes.
Table 5 demonstrates that Swin UNETR achieves the highest overall performance among the evaluated models, with an average Dice score of 0.913, outperforming nnU-Net, SegResNet, and TransBTS across all tumor regions (Enhancing Tumor, Whole Tumor, and Tumor Core). Both nnU-Net and SegResNet exhibit nearly identical performance, each with an average Dice score above 0.907, indicating their consistent and robust segmentation capabilities. TransBTS, while still competitive, trails behind the other models with a lower average Dice score of 0.891, suggesting reduced effectiveness in capturing fine-grained tumor structures. Overall, transformer-based Swin UNETR shows the most promising segmentation performance in this five-fold cross-validation setting.
Table 5. 5-Fold cross-validation (mean dice scores) [63].
| Model | ET Dice | WT Dice | TC Dice | Avg. Dice |
| --- | --- | --- | --- | --- |
| Swin UNETR | 0.891 | 0.933 | 0.917 | 0.913 |
| nnU-Net | 0.883 | 0.927 | 0.913 | 0.908 |
| SegResNet | 0.883 | 0.927 | 0.913 | 0.907 |
| TransBTS | 0.868 | 0.911 | 0.898 | 0.891 |
Wang et al. (2021) [39] have proposed TransBTS, which is a hybrid encoder-decoder architecture that integrates convolutional neural networks (CNNs) with transformer blocks. It leverages CNNs for local feature extraction and transformers for modeling long-range dependencies, enabling accurate brain tumor segmentation with improved contextual understanding.
Table 6 shows that among the compared methods, TransBTS, particularly with test-time augmentation (TTA), achieved the highest Dice scores and the lowest Hausdorff distances across all tumor subregions on the BraTS 2020 validation set. This demonstrates its superior ability to segment tumors accurately while preserving boundary precision, outperforming traditional 3D U-Net and V-Net variants.
Table 6. Performance comparison on BraTS 2020 validation set [39].
| Method | Dice ET (%) | Dice WT (%) | Dice TC (%) | HD ET (mm) | HD WT (mm) | HD TC (mm) |
| --- | --- | --- | --- | --- | --- | --- |
| 3D U-Net | 68.76 | 84.11 | 79.06 | 50.98 | 13.37 | 13.61 |
| Basic V-Net | 61.79 | 84.63 | 75.26 | 47.70 | 20.41 | 12.18 |
| Deeper V-Net | 68.97 | 86.11 | 77.90 | 43.52 | 14.50 | 16.15 |
| Residual 3D U-Net | 71.63 | 82.46 | 76.47 | 37.42 | 12.34 | 13.11 |
| TransBTS (w/o TTA) | 78.50 | 89.00 | 81.36 | 16.72 | 6.47 | 10.47 |
| TransBTS (w/ TTA) | 78.73 | 90.09 | 81.73 | 17.95 | 4.96 | 9.77 |
4. Discussion
Brain tumor segmentation has undergone remarkable advancements, transitioning from traditional machine learning approaches to sophisticated deep learning architectures and hybrid techniques. This discussion synthesizes the key developments, highlights their clinical and technical impacts, and identifies critical challenges that must be addressed to facilitate broader adoption in clinical practice.
4.1. The Transition from Traditional ML to Deep Learning: A Paradigm Shift
Traditional machine learning methods, such as SVMs [64], Random Forests [65], and texture-based feature extraction [66], laid the foundation for automated brain tumor segmentation. These approaches were interpretable and computationally efficient, but their reliance on handcrafted features limited their ability to generalize across diverse datasets [5]. The introduction of deep learning, particularly U-Net [7] and its 3D variants [21], marked a turning point by enabling end-to-end learning of hierarchical features directly from imaging data. This shift significantly improved segmentation accuracy, with Dice scores rising from ~70% (traditional ML) to over 85% (DL) on BraTS benchmarks [9]. However, early CNNs faced challenges in handling multi-modal MRI inconsistencies and required large annotated datasets for training, a limitation partially addressed by data augmentation and transfer learning.
4.2. The Rise of Transformers and Hybrid Architectures
The introduction of Vision Transformers (ViTs) and their medical adaptations (e.g., Swin UNETR [63], TransBTS [39]) addressed CNNs’ inability to model long-range spatial dependencies [10]. These models excel in capturing global context, making them particularly effective for segmenting diffuse tumor margins in glioblastoma [67]. Hybrid architectures, such as CNN-Transformer ensembles, further bridged the gap between local feature extraction and global reasoning, achieving Dice scores > 90% on BraTS 2021 [8]. Despite these advances, Transformers demand substantial computational resources and training data, limiting their accessibility for smaller institutions.
4.3. Generative Models and Weak Supervision: Mitigating Data Scarcity
Generative approaches, including GANs [43] and diffusion models [45], have enhanced segmentation by generating synthetic training data or refining predictions through iterative denoising. Weakly supervised methods [12] reduced reliance on pixel-level annotations by leveraging scribbles or bounding boxes, making AI more feasible for rare tumor subtypes. However, GANs suffer from training instability, while diffusion models are computationally expensive for real-time applications.
Effectiveness of Generative Models and Weakly Supervised Learning
Manual annotation of 3D brain tumor volumes is time-consuming and requires domain expertise, limiting the scalability of supervised learning. Generative models such as GANs and VAEs have been successfully employed to synthesize realistic tumor images and augment training data, improving model generalization and robustness. For instance, Mok & Chung (2018) [68] introduced a coarse-to-fine GAN-based augmentation strategy (CB-GAN) that improved segmentation performance by 3.5% Dice score over traditional augmentation on the BraTS15 dataset. Their two-stage GAN was effective in generating anatomically plausible tumors with refined boundaries, reducing dependence on manual labeling.
In parallel, weakly supervised learning offers a practical alternative by utilizing partial annotations such as image-level labels or scribbles. Mlynarski et al. (2019) [69] demonstrated that combining a small number of fully annotated MRI scans (e.g., 5 or 15) with a larger set of weakly labeled data could achieve up to 78.3% Dice score, closely matching models trained on extensive full supervision. This underscores the potential of mixed supervision in minimizing annotation costs while maintaining segmentation accuracy.
Together, these strategies offer scalable and efficient training solutions, particularly valuable for clinical environments with limited annotation resources.
4.4. Federated Learning and Explainability: Toward Clinical Translation
Federated learning [11] has emerged as a privacy-preserving solution for multi-institutional collaborations, crucial for rare tumor types. This innovation is especially vital for brain tumor segmentation, where access to large, diverse, and annotated datasets is limited, particularly for rare subtypes such as pediatric gliomas or atypical meningiomas. By harnessing the statistical power of geographically distributed data, FL mitigates bias associated with single-institution models and fosters the development of robust, generalizable algorithms applicable across a wide range of patient populations and imaging protocols [70].
Performance Variation Across Brain Tumor Subtypes
Deep learning models, particularly those trained on standard datasets like BraTS, have demonstrated strong performance in segmenting high-grade gliomas (HGGs). However, their effectiveness varies significantly when applied to different brain tumor subtypes, such as pediatric gliomas and atypical meningiomas, which often exhibit distinct morphological and radiographic characteristics.
Pediatric gliomas often present with distinct imaging characteristics, such as more diffuse growth patterns and lower contrast on MRI compared to adult high-grade gliomas (HGGs). These differences contribute to challenges in segmentation, including decreased Dice scores and higher boundary uncertainty. Liu et al. (2023) [71] investigated this issue and reported that models trained on adult glioma datasets exhibit a Dice score reduction of over 10% when applied to pediatric brain tumor cases, underscoring a significant domain shift. Similarly, atypical meningiomas, which are less frequent and show considerable heterogeneity, present additional challenges for model generalization due to limited annotated data and intra-class variability. These findings emphasize the need for subtype-specific training approaches, domain adaptation methods, and more diverse datasets to enhance the accuracy and generalizability of brain tumor segmentation models across different tumor types.
In parallel, the push for explainable AI (XAI) has become indispensable for clinical adoption. Tools such as Gradient-weighted Class Activation Mapping (Grad-CAM), SHAP (SHapley Additive exPlanations), and Integrated Gradients allow clinicians to peer inside the “black box” of deep neural networks, offering intuitive visualizations of which regions in the MRI contribute most to the model’s prediction [47, 48]. These methods have increased clinician trust and enabled model auditing, especially in high-stakes applications like preoperative planning and radiotherapy targeting. However, FL introduces communication overhead, and XAI methods sometimes produce unreliable explanations for complex models.
4.5. Real-Time and Multimodal Systems: The Next Frontier
Lightweight deep learning models such as MobileNet [49], EfficientNet [50], and their derivatives are playing a pivotal role in enabling real-time intraoperative segmentation of brain tumors. These architectures, optimized for computational efficiency and low-latency inference, allow deployment on edge devices and integration into surgical navigation systems, thereby assisting neurosurgeons in achieving maximal tumor resection while preserving healthy tissue [49, 50]. For example, MobileNetV3 and EfficientNet-Lite have demonstrated promising accuracy-speed trade-offs when applied to real-time segmentation tasks on resource-constrained hardware like portable workstations or intraoperative imaging consoles [72].
Concurrently, the integration of multimodal imaging, such as combining structural MRI with functional modalities like Positron Emission Tomography (PET), Diffusion Tensor Imaging (DTI), and MR spectroscopy, has emerged as a powerful strategy for comprehensive tumor characterization [73]. These multimodal approaches leverage complementary features: while MRI offers high-resolution anatomical details, PET reveals metabolic activity, and DTI provides insight into white matter tract integrity.
Deep learning models equipped with multimodal fusion layers, such as attention-guided or cross-modal transformers, are showing promise in exploiting these heterogeneous data streams to improve segmentation precision and tumor sub-region delineation [74]. However, a major challenge lies in the harmonization of imaging protocols across different scanners, vendors, and institutions. Variability in acquisition parameters, field strengths, and contrast agent usage can significantly degrade model generalizability. To address this, techniques such as domain adaptation, intensity normalization, and synthetic modality generation using GANs are actively being explored [75]. Moreover, federated learning paradigms are gaining traction as a means to train models collaboratively across institutions without sharing raw patient data, thereby enhancing data diversity while preserving privacy [70].
4.6. Key Challenges and Future Directions
Despite significant advancements in brain tumor segmentation, several critical challenges persist that hinder widespread clinical adoption. A primary obstacle is the inherent data heterogeneity stemming from variations in MRI scanner protocols and acquisition parameters across institutions, which compromises model generalizability [3]. Additionally, class imbalance remains a persistent issue, as certain tumor subregions like necrotic cores are often underrepresented in training datasets, leading to biased predictions. Perhaps most crucially, the majority of segmentation models have not undergone rigorous prospective clinical validation, creating a translational gap between algorithmic performance and real-world clinical utility. These limitations underscore the need for more robust and standardized approaches to ensure reliable deployment in healthcare settings.
Looking ahead, the field must prioritize several key directions to address these challenges:
First, expanding standardized benchmarks like BraTS to include underrepresented tumor types, such as pediatric and diffuse midline gliomas, would enhance model versatility.
Second, developing efficient models through techniques like neural architecture search and quantization will be essential for real-time deployment on edge devices in surgical settings.
Third, advancing robust explainable AI (XAI) methods will be critical for regulatory approval and clinician trust.
Finally, multimodal federated learning approaches that can harmonize disparate data sources while preserving patient privacy represent a promising avenue for creating more comprehensive and generalizable models.
These strategic focuses will be instrumental in bridging the current gaps between technical innovation and clinical application in neuro-oncology.
The evolution of brain tumor segmentation techniques has brought forth significant strengths that have transformed neuro-oncology research and clinical practice. Deep learning approaches, particularly U-Net and its variants, have demonstrated remarkable capabilities in automatically extracting hierarchical features from multi-modal MRI data, achieving unprecedented segmentation accuracy with Dice scores exceeding 90% in some cases [9]. The advent of transformer-based architectures has further enhanced performance by effectively capturing long-range spatial dependencies in volumetric scans, while generative models have shown promise in addressing data scarcity through synthetic data generation [10, 45]. These technical advancements have been complemented by the development of privacy-preserving federated learning frameworks and explainable AI techniques, which facilitate multi-institutional collaboration and improve clinical interpretability [20, 48]. However, these approaches are not without limitations that require careful consideration.
The computational complexity of advanced architectures, particularly 3D CNNs and transformers, poses significant challenges for clinical deployment due to their high memory requirements and inference times.
Data heterogeneity across imaging protocols and scanners remains a persistent obstacle to model generalizability, while class imbalance in tumor subregions continues to affect segmentation accuracy [3].
Furthermore, the black-box nature of many deep learning models and the lack of standardized clinical validation protocols hinder their widespread adoption in healthcare settings.
Addressing these limitations requires a multi-faceted approach:
Computational challenges can be mitigated through model compression techniques such as quantization and knowledge distillation, which can reduce model size without significant performance degradation [49].
Data heterogeneity could be overcome by developing advanced normalization techniques and domain adaptation methods that account for scanner-specific variations.
To tackle class imbalance, innovative loss functions like focal loss and tailored data augmentation strategies could be employed to better represent rare tumor subregions.
The interpretability gap might be bridged by integrating attention mechanisms with clinically meaningful feature visualizations, while prospective multicenter trials could establish standardized validation protocols [48]. Federated learning frameworks, combined with synthetic data generation, offer a promising solution to data scarcity while maintaining patient privacy.
The path forward for brain tumor segmentation lies in developing more efficient, interpretable, and clinically validated models that can seamlessly integrate into existing healthcare workflows. Future research should focus on creating adaptive systems that can learn from limited annotations, generalize across diverse imaging protocols, and provide clinically actionable insights with measurable confidence intervals. By addressing these challenges through collaborative efforts between computer scientists and clinicians, the next generation of segmentation tools could significantly improve diagnostic accuracy, treatment planning, and patient outcomes in neuro-oncology.
5. Conclusions
Brain tumor segmentation has undergone a profound evolution from classical machine learning with handcrafted features to modern deep learning frameworks that incorporate transformers, generative modeling, and federated learning. These advancements have significantly improved segmentation accuracy, interpretability, and real-time clinical applicability. U-Net and its variants enabled volumetric analysis, while transformers like Swin UNETR brought long-range context modeling. Federated learning has enhanced data privacy and cross-institutional training, while explainable AI techniques have improved clinician trust. However, challenges such as data heterogeneity, class imbalance, and computational demands still hinder real-world deployment. Future research should prioritize lightweight models, robust validation protocols, and clinically meaningful, multimodal systems that integrate seamlessly into neuro-oncology workflows. Bridging the gap between algorithmic performance and clinical reliability remains essential for the successful translation of AI into routine medical practice.
The journey toward fully automated, clinically validated brain tumor segmentation systems is ongoing, but the remarkable progress to date offers compelling evidence of AI’s potential to transform neuro-oncology. By continuing to bridge the gap between technical innovation and clinical needs, researchers can deliver tools that not only achieve high performance on benchmarks but also provide tangible benefits to patients and healthcare providers worldwide. The future of brain tumor segmentation lies in creating adaptable, transparent, and clinically relevant solutions that can keep pace with the evolving landscape of precision medicine.