Introduction
Computer-aided diagnosis systems have become indispensable in disease treatment, aiding both preoperative diagnosis and postoperative rehabilitation monitoring. These systems aim to help clinicians make diagnostic decisions more accurately and objectively [1]. Medical image segmentation is a crucial step in computer-aided clinical diagnosis, and advances in segmentation methods support clinicians in diagnosing diseases and monitoring patient rehabilitation with greater precision and reliability [2]. Segmenting and quantitatively analyzing disease regions not only provides valuable information for pathological diagnosis but also aids in planning treatment strategies and in monitoring disease progression and patient rehabilitation [3], [4]. Currently, the state-of-the-art models for medical image segmentation are variants of U-Net, which comprise an encoder and a decoder [5], [6]. Among these, U-Net [7] is the most frequently used segmentation network. The encoder successively extracts the semantic features of the image, while the decoder progressively recovers its fine-grained details. The skip connection is a critical component of U-Net: it propagates spatial information from each encoder layer, taken before the pooling operation, to the corresponding decoder layer, thereby preventing the loss of this information during pooling. However, despite U-Net’s robust feature representation capabilities, its skip connection scheme struggles to adequately manage multiscale variations in complex medical images [8], [9], [10]. These challenges highlight the need for advanced models and engineering strategies to enhance the precision and utility of machine learning in medical image segmentation, ensuring effectiveness in complex clinical applications.
Some studies have proposed more efficient U-Net variants by enhancing and optimizing the encoder and decoder components, such as Edge U-Net [11], MDU-Net [12], and DCSAU-Net [13]. Additionally, recent research has leveraged the multiscale global modeling capabilities of Transformers [14] to construct more advanced encoders and decoders [15], [16], as demonstrated in models like TransUnet [17], Swin-Unet [18], and DS-TransUNet [19]. Furthermore, efforts have been made to improve U-Net’s precision and reliability by optimizing the skip connection scheme, as evidenced by Attention U-Net [20], UNet++ [21], and UNet3+ [22]. However, these approaches often overlook the semantic gap present in skip connections, where features mapped from the decoder may be semantically discordant with those from the encoder. In 2018, researchers enhanced the U-Net architecture by replacing the direct skip connection (DSC) with short connections consisting of nested dense convolutional blocks, significantly improving its performance [23]. Following this, further advancements were made in 2019 with the design of a semantic enhancement module and a boundary attention module, which were integrated into a parallel pyramid structure, achieving state-of-the-art results [24]. These developments underscore the notable semantic gap between the encoder and decoder when using DSC, which these modifications aim to address.
In 2020, Ibtehaz and Rahman first identified the concept of the semantic gap in U-Net architecture in their work titled “MultiResUNet: Rethinking the U-Net architecture for multimodal biomedical image segmentation” [25]. They emphasized that this semantic gap significantly impairs U-Net’s segmentation performance. Their analysis, based on experimental segmentation results from U-Net across five public databases, demonstrated a semantic gap between the corresponding layers of the encoder and decoder. They noted that merging high-level semantic information from the decoder with low-level spatial information from the encoder via the DSC could detrimentally affect segmentation precision. To address this issue, the study proposed replacing the DSC with a sequence of convolutional layers that include residual connection. This adjustment introduces a structure comprising four
In recent years, a growing body of research has identified the semantic gap as a significant obstacle in medical image segmentation, limiting the effectiveness of segmentation algorithms in transitioning from experimental research to clinical applications [5]. Consequently, this issue has attracted considerable attention and research efforts from scholars. In 2022, a systematic experimental analysis confirmed the existence of the semantic gap, highlighting that the DSC scheme might impair segmentation performance [26]. In some instances, U-Net’s segmentation performance without skip connections surpassed that achieved using skip connections. Concurrently, researchers attempted to mitigate the semantic gap by replacing the DSC with the channel Transformer (CTrans), leveraging the Transformer’s superior multiscale global modeling capabilities [26]. In 2023, a dense skip connection with cross co-attention was designed in U-Net to address the semantic gap, achieving more precise segmentation [27]. Furthermore, in 2024, a Skip-NAT scheme was introduced to replace the DSC, where the encoder’s feature maps are processed by a neighborhood attention Transformer and then fused with the decoder’s corresponding feature maps, minimizing the semantic gap and enhancing U-Net’s segmentation effects [28].
While these studies have recognized the semantic gap and its detrimental impacts, the absence of a quantitative evaluation method for this gap hinders the development of targeted mitigation strategies to enhance segmentation precision and reliability. In summary, the semantic gap issue has significantly constrained the achievement of accurate and reliable automatic medical image segmentation. Therefore, there is an urgent need to understand the patterns of the semantic gap between corresponding layers of the encoder and decoder and to develop methods to eliminate it.
Motivated by these issues, we conducted an in-depth study of the U-Net architecture concerning the semantic gap. Firstly, we quantified the semantic gap between corresponding layers of the decoder and encoder and performed a systematic analysis of their segmentation performance across the 2018 Data Science Bowl challenge (DSB-2018) database [29], the Lesion Boundary Segmentation Challenge dataset (ISIC-2018) [30], [31], the GlaS dataset [32], and MoNuSeg database [33], [34]. We observed two key characteristics:
1) The existence of the semantic gap means that the direct fusion of the decoder’s high-level semantic information and the encoder’s low-level spatial information via the DSC can compromise the model’s segmentation performance when the information is incompatible.
2) The semantic gap varies in severity across different layers between the encoder and decoder, leading to diverse impacts on the model’s segmentation performance. Specifically, shallower layers exhibit a worse semantic gap and a consequent greater degradation in performance, whereas deeper layers present a minor semantic gap and a lesser impact on performance.
Secondly, to address these challenges and enhance the trustworthiness of medical image segmentation, we introduce a multichannel fusion Transformer (MCFT) skip connection to counter the semantic gap’s adverse impacts. We further propose a novel USCT-UNet segmentation network to accomplish higher-precision and more generalizable automatic medical image segmentation. Specifically, we use MCFT blocks to construct U-shaped skip connection (USC) and allocate a variable number of MCFT blocks according to the semantic gap pattern at different layers. This allocation aims to establish long-term dependencies between the decoder’s high-level features and the encoder’s low-level features, thus minimizing the semantic gap and maximizing segmentation precision. Moreover, we designed a spatial channel cross-attention (SCCA) module to guide the fusion of features from the decoder and USC. Notably, our USC and SCCA modules can be easily embedded into other U-Net variants.
Finally, comprehensive experimental results from DSB-2018, ISIC-2018, GlaS and MoNuSeg databases show that our USCT-UNet outperforms U-Net on multiple segmentation evaluation metrics. This improvement not only significantly enhances traditional segmentation pipelines and offers an effective solution for the semantic gap issue, but also demonstrates a practical path towards improving the explainability, generalization, and accountability capabilities of machine learning in health informatics.
The principal contributions of this study are as follows:
1) Our research is the first to quantify the semantic gap between corresponding layers of the encoder and decoder and to provide a systematic analysis of its patterns. We found that the severity of the semantic gap varies among corresponding layers, leading to different impacts on the model’s segmentation performance. These findings suggest that direct, uniform skip connections may be an inappropriate strategy.
2) We propose the USCT-UNet segmentation network to mitigate this semantic gap and enhance segmentation precision. In it, the USC replaces the DSC and allocates a different number of MCFT blocks to each layer according to the magnitude of its semantic gap, and the SCCA module guides the fusion of features from the decoder and the USC module.
3) We conducted extensive experiments on various challenging datasets. The results demonstrate that USCT-UNet effectively eliminates the semantic gap between the decoder and encoder, achieving precise and reliable medical image segmentation, and that it outperforms other advanced segmentation methods.
Related Work
A. Studies on the Semantic Gap in U-Net
Although early research did not explicitly identify the semantic gap in U-Nets, these studies acknowledged the significant role of efficient skip connection schemes in improving U-Net’s segmentation precision. For instance, Attention U-Net [20] employs attention gates instead of the direct skip connection (DSC) (as shown in Fig. 1(b)) to guide feature fusion between the decoder and encoder, effectively highlighting relevant regions while suppressing unimportant background areas. UNet++ [21] introduces a dense connection scheme to replace the DSC, resulting in superior segmentation performance compared to U-Net. UNet3+ [22] designs a full-scale connection scheme to supplant the DSC, achieving advanced performance across multiple databases. These methods enhance segmentation precision by improving the skip connection scheme, yet they do not explicitly address the underlying issue of the semantic gap.
Fig. 1. Comparison of skip connection schemes between the proposed USCT-UNet and other models.
It was not until 2020 that MultiResUNet [25] suggested the potential existence of a semantic gap between corresponding layers of the encoder and decoder, based on observations of U-Net’s segmentation results across five public databases. To address this semantic gap, they designed the residual path (Res Path) with convolutional blocks instead of DSC, as illustrated in Fig. 1(c). These additional nonlinear operations aimed to mitigate the semantic gap. Subsequently, in 2022, UCTransNet [26] confirmed the existence of this semantic gap through systematic analysis and noted that the DSC scheme could impair model performance. They proposed using the channel Transformer (CTrans) as a substitute for the DSC to further reduce the semantic gap, as shown in Fig. 1(d).
In 2023, the literature [35] employed a Transformer-based cross-layer feature enhancement module that fuses feature sets from neighboring encoder layers and integrates them with decoder feature sets. This approach enhances low-layer features through cross-layer feature learning, effectively mitigating semantic gaps and improving medical image segmentation performance. In 2024, the literature [36] introduced a composite attention module combining channel and spatial attention, featuring a three-branch structure with double squeeze-and-excitation blocks, convolutional blocks, and batch normalization. This module replaces the DSC and reduces the semantic gap across different layers between the encoder and decoder.
Although these studies have acknowledged the existence of the semantic gap between the decoder and encoder, they have not provided a comprehensive quantitative measurement or detailed analysis of it. Furthermore, they continue to use a uniform skip connection scheme, suggesting that there is still room for improvement in segmentation performance. To address this, we developed the U-shaped skip connection (USC) solution, which allocates a variable number of MCFT blocks based on the semantic gap laws at different layers between the encoder and decoder, as illustrated in Fig. 1(a).
B. Transformer-Based Segmentation Methods
The Transformer is a neural network architecture that leverages a self-attention mechanism and exhibits exceptional global modeling capabilities [14]. Recently, the Vision Transformer (ViT) [37], which has demonstrated impressive results in image recognition tasks, has prompted research into replacing convolutional neural networks (CNNs) with ViT for certain medical image segmentation tasks [15], [16]. For instance, TransUnet [17] integrates ViT to construct the encoder for U-Net, while Swin-Unet [18] employs the Swin Transformer (SwinT) [38] to develop a Transformer-based U-Net. Similarly, DS-TransUNet [19] introduces a dual-scale encoder to capture both coarse- and fine-grained image features. It is worth noting that all the aforementioned methods retain the DSC component.
In a recent study [39], integrating SwinT into U-Net, along with spatial interaction, feature compression, and relationship aggregation blocks, enhanced the representation of irregularly shaped tumors. Other research [40] utilized SwinT as an encoder to extract image features and designed a cascading upsampling block as a decoder to optimize segmentation results. Additionally, some studies [41] proposed a hybrid network combining CNNs as the encoder and SwinT as the decoder for medical image segmentation, achieving remarkable performance. In another study [35], researchers used SwinT for the encoder and CNNs for the decoder, introducing a cross-layer feature enhancement module and a spatial channel squeeze-excitation module to enhance feature learning across different layers. Despite these advancements, these approaches primarily focus on enhancing the encoder and decoder of U-Net and do not effectively address the semantic gap by improving the skip connection.
Quantitative Analysis of Semantic Gap
A. Databases
In this section, we quantitatively analyze the semantic gap between the layers of the encoder and decoder, assessing their impact on U-Net segmentation performance across four challenging datasets: DSB-2018, ISIC-2018, GlaS, and MoNuSeg. These datasets were chosen because they offer diverse and comprehensive cases for various biomedical imaging challenges, which can effectively indicate the robustness and generalization capabilities of segmentation models. Details regarding each dataset and the distribution of images among the training, validation, and testing sets are outlined in TABLE I.
The 2018 Data Science Bowl Challenge (DSB-2018) dataset [29] presents a wide variety of nuclei images, which are challenging because of the diversity in the structure and morphology of the nuclei. This diversity requires segmentation models to capture precise semantic information, which makes the dataset well suited to studying the semantic gap.
The GlaS dataset [32] focuses on the segmentation of glandular structures in colon histology images, encompassing glands with diverse shapes, sizes, and appearances. These irregular shapes pose a challenge to segmentation models, requiring them to capture precise semantic information about glandular contours, which makes the dataset an ideal platform for testing model performance on complex and irregularly shaped targets.
The MoNuSeg dataset [33], [34] includes nuclear segmentation images from multiple organs, such as the liver, kidney, prostate, and breast. This diversity requires segmentation models to possess strong generalization capabilities, enabling them to perform well across various organ and tissue types. By employing a multi-organ dataset, we aim to evaluate the model’s ability to bridge the semantic gap.
The Lesion Boundary Segmentation Challenge dataset (ISIC-2018) [30], [31] focuses on skin lesion segmentation, encompassing a variety of shapes, sizes, and locations. This diversity aids in assessing the capability of segmentation models to handle complex lesion features, capture variable semantic information, and address the semantic gap issue.
B. Implementation Details and Assessment Metrics
The experiments were conducted using a standalone NVIDIA GeForce RTX 3090 Ti tensor core GPU, an 8-core CPU, and 24 GB RAM, employing the PyTorch framework. To prevent overfitting, data augmentation strategies such as image flipping and random rotation were used. Before being input into the network, the resolution of all images was standardized to
The model’s performance was evaluated using the Dice coefficient (Dice), the mean intersection over union (MIoU), and the Hausdorff distance (HD) [42]. The Dice coefficient, a metric used to assess sample similarity, measures the ratio of twice the intersection of two sets to their total size. MIoU represents the average ratio of the intersection to the union of the ground truth and predicted sets. The formulas for calculating Dice and MIoU are as follows:
\begin{align*} \mathrm{Dice} &= \frac{2\,TP}{FP + 2\,TP + FN} \times 100\%, \tag{1}\\ \mathrm{MIoU} &= \frac{1}{k + 1}\sum_{i = 0}^{k} \frac{TP}{FP + TP + FN} \times 100\%, \tag{2}\end{align*}
The Hausdorff distance (HD) quantifies the spatial separation between the ground truth and predicted sets within a metric space. A smaller HD value indicates a closer correspondence between the model prediction and the actual target region. The formula for calculating HD is as follows:
\begin{equation*} d_{H}(X,Y) = \max\left\{ \max_{x \in X}\left( \min_{y \in Y} d(x,y) \right),\ \max_{y \in Y}\left( \min_{x \in X} d(y,x) \right) \right\}, \tag{3}\end{equation*}
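As an illustration of Eqs. (1)-(3), the following minimal sketch computes the three metrics on small masks stored as nested Python lists. It is not the evaluation code used in this study; the function names are hypothetical, the MIoU is interpreted as the mean of per-class IoU over integer label maps (k + 1 = 2 classes for binary segmentation), and the Hausdorff distance is taken over all foreground pixel coordinates with the Euclidean distance as d.

```python
import math

def confusion_counts(pred, truth):
    """Count TP, FP, FN over two same-shaped binary masks (nested lists of 0/1)."""
    tp = fp = fn = 0
    for row_p, row_t in zip(pred, truth):
        for p, t in zip(row_p, row_t):
            if p and t:
                tp += 1
            elif p and not t:
                fp += 1
            elif t and not p:
                fn += 1
    return tp, fp, fn

def dice_percent(pred, truth):
    """Eq. (1): 2TP / (FP + 2TP + FN), in percent."""
    tp, fp, fn = confusion_counts(pred, truth)
    denom = fp + 2 * tp + fn
    return 100.0 * 2 * tp / denom if denom else 100.0

def iou_percent(pred, truth):
    """Single-class term of Eq. (2): TP / (FP + TP + FN), in percent."""
    tp, fp, fn = confusion_counts(pred, truth)
    denom = fp + tp + fn
    return 100.0 * tp / denom if denom else 100.0

def miou_percent(pred, truth, num_classes=2):
    """Eq. (2), assuming one IoU term per class label 0..num_classes-1."""
    total = 0.0
    for cls in range(num_classes):
        p = [[int(v == cls) for v in row] for row in pred]
        t = [[int(v == cls) for v in row] for row in truth]
        total += iou_percent(p, t)
    return total / num_classes

def foreground_points(mask):
    """Coordinates (row, col) of all nonzero pixels."""
    return [(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v]

def hausdorff(points_x, points_y):
    """Eq. (3): symmetric Hausdorff distance between two non-empty point sets."""
    def directed(a, b):
        return max(min(math.dist(p, q) for q in b) for p in a)
    return max(directed(points_x, points_y), directed(points_y, points_x))

pred = [[1, 1, 0], [0, 1, 0]]
truth = [[1, 0, 0], [0, 1, 1]]
print(dice_percent(pred, truth), miou_percent(pred, truth),
      hausdorff(foreground_points(pred), foreground_points(truth)))
```

The brute-force Hausdorff computation is quadratic in the number of foreground pixels, which is acceptable for illustration but would likely be replaced by a boundary-based or distance-transform implementation in practice.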
C. Quantitative Results and Analysis
This analysis is conducted for various skip connection configurations, as depicted in Fig. 2. The Dice, MIoU, and HD results across the four datasets indicate that the DSC scheme reduces the model’s segmentation performance, whether the skip connections are applied individually or in combination. The performance drop is particularly pronounced when SC1 and SC2 are used separately. This is due to the complex and variable morphology of target regions in these datasets, which exacerbates the semantic gap at corresponding layers between the encoder and decoder; the DSC scheme fails to address this issue effectively.
Fig. 2. Quantifying the influence of semantic gap on U-Net’s segmentation performance across multiple assessment metrics.
The semantic gap between different layers of the encoder and decoder is quantified in TABLEs II, III, and IV. Here, “None” indicates the absence of skip connections, “SC1” implies that only the first level of skip connections is preserved, and “SC4+SC1” means that the first and fourth levels of skip connections are maintained. In this study, the “None” scenario is used as the reference point for gauging the impact of the semantic gap. Skip connections in U-Net typically bypass certain layers, forwarding the output from one layer directly to a subsequent layer and thus preserving spatial information that might otherwise be lost during pooling. In the “None” scenario this bypassing mechanism is absent, so no information is transferred directly between the corresponding layers of the encoder and decoder and no semantic gap can emerge. The “None” scenario therefore acts as a control condition, forming a baseline against which the impact of the semantic gap in scenarios with skip connections can be measured and compared. In these TABLEs, “+” indicates improvement, while “-” signifies decline. A systematic analysis of the quantified semantic gap results reveals three critical findings.
1) Finding 1:
The degree of the semantic gap varies across different layers, leading to different impacts on U-Net’s segmentation performance. The shallower the layer, the more significant the semantic gap and its impact on segmentation performance. Conversely, the deeper the layer, the smaller the semantic gap and its influence. For instance, in TABLE II, on the DSB-2018 dataset using the Dice metric, the semantic gap for “SC1” is −8.54, for “SC2” is −6.51, for “SC3” is −3.02, and for “SC4” is −0.59. Similarly, in TABLE III, on the MoNuSeg dataset using the MIoU metric, the semantic gap for “SC1” is −4.40, for “SC2” is −0.48, for “SC3” is −0.34, and for “SC4” is −0.18. Therefore, adopting a uniform skip connection scheme is unsuitable.
2) Finding 2:
Due to the presence of the semantic gap, the direct skip connection (DSC), which directly fuses features from the decoder and encoder, may impair the model’s segmentation performance. This study evaluates the performance by incrementally adding skip connections at various layers based on “SC4”. For example, in TABLE IV, on the ISIC-2018 dataset using the HD metric, the semantic gap for “SC4” is −0.11. When additional skip connections are added, “SC4+SC1” results in −0.18, “SC4+SC1+SC2” increases to −0.23, and “SC4+SC1+SC2+SC3” further increases to −0.40. Similarly, on the GlaS dataset using the HD metric, “SC4” has a semantic gap of −0.48, “SC4+SC1” increases to −1.37, “SC4+SC1+SC2” rises to −1.97, and “SC4+SC1+SC2+SC3” is −1.79. Thus, direct fusion of decoder and encoder features is impractical.
3) Finding 3:
The detrimental impact of the semantic gap has led to low overall segmentation precision for U-Net across the four datasets, significantly hindering the achievement of precise and reliable automatic medical image segmentation. For example, the MIoU scores of U-Net with DSC are 75.92% on ISIC-2018 dataset and 80.98% on GlaS dataset. In comparison, the MIoU scores of U-Net without DSC are 77.16% and 82.81%, respectively. This indicates a performance degradation due to the semantic gap of 1.24% on ISIC-2018 and 1.83% on GlaS.
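The quantification used in these findings amounts to a signed difference against the “None” baseline. The sketch below reproduces the two MIoU gaps quoted above; the helper name `semantic_gap` and the `lower_is_better` flag are hypothetical, and the sign handling for HD (where a smaller value is better) is an assumption about how the “+” and “-” entries in the TABLEs are defined.

```python
def semantic_gap(score_with_skip, score_none, lower_is_better=False):
    """Signed change relative to the 'None' baseline.

    Negative values mean that adding the skip connection(s) degraded the metric.
    For metrics where lower is better (e.g. HD), the sign is flipped so that
    a negative result still means decline (assumed convention).
    """
    delta = score_with_skip - score_none
    return -delta if lower_is_better else delta

# MIoU (%) of U-Net with and without DSC, as reported in Finding 3.
miou = {
    "ISIC-2018": {"None": 77.16, "DSC": 75.92},
    "GlaS": {"None": 82.81, "DSC": 80.98},
}
gaps = {name: round(semantic_gap(v["DSC"], v["None"]), 2) for name, v in miou.items()}
print(gaps)  # {'ISIC-2018': -1.24, 'GlaS': -1.83}
```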
Based on the results of the quantitative analyses, we identify two key characteristics: 1) The direct skip connection (DSC) exhibits a semantic gap that adversely impacts the model’s segmentation performance; 2) The magnitude of the semantic gap varies across different layers. Consequently, it is essential to develop an efficient skip connection scheme to replace the DSC, thereby eliminating the semantic gap and enhancing the model’s segmentation precision.
Methodology
A. Overall Architecture
To address the challenges posed by the semantic gap, this study introduces the USCT-UNet architecture, an extension of the U-Net. The structure of the proposed USCT-UNet is illustrated in Fig. 3. In this architecture, the U-shaped skip connection (USC) replaces the direct skip connection (DSC), and the fusion of features from the decoder and USC is achieved through the SCCA module in the decoder. The process begins with an image
Fig. 3. Schematic of the overall structure of USCT-UNet. The DSC is replaced with the USC, built using MCFT blocks. An SCCA module is introduced to guide the fusion of feature information between the decoder and the output of the USC.
Algorithm 1 Encoder Feature Embedding
Input: Feature matrix
Output: Feature embedding matrix
Initialize patch size
for
Set the feature matrix
Compute the total number of patches
Divide
for
Flatten each patch into a vector
Embed the flattened vector
Add position encoding
end for
Add a learnable classification token [CLS] at the beginning of the sequence, initialized as a special vector
Combine the classification token and all patch embeddings into the feature embedding matrix:
end for
Output feature embedding matrix
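The steps of Algorithm 1 (patch division, flattening, linear embedding, position encoding, and prepending a classification token) can be sketched in NumPy as follows. This is only an illustration under stated assumptions: the function names, the channel-last (H, W, C) layout, the random matrices standing in for learned embedding and position parameters, and the zero-initialized [CLS] vector are all hypothetical and are not taken from the original implementation.

```python
import numpy as np

def split_patches(feature, patch):
    """Divide an (H, W, C) feature map into non-overlapping patches and flatten each.

    Returns an array of shape (N, patch * patch * C) with N = (H / patch) * (W / patch).
    """
    h, w, c = feature.shape
    assert h % patch == 0 and w % patch == 0, "H and W must be divisible by the patch size"
    n_h, n_w = h // patch, w // patch
    return (feature.reshape(n_h, patch, n_w, patch, c)
                   .transpose(0, 2, 1, 3, 4)
                   .reshape(n_h * n_w, patch * patch * c))

def embed_features(feature, patch, dim, rng):
    """Build a (1 + N, dim) token matrix: [CLS] followed by N embedded patches."""
    patches = split_patches(feature, patch)
    n, flat = patches.shape
    w_embed = rng.normal(0.0, 0.02, size=(flat, dim))  # stand-in for learned embedding weights
    pos = rng.normal(0.0, 0.02, size=(n, dim))         # stand-in for learned position encoding
    tokens = patches @ w_embed + pos
    cls = np.zeros((1, dim))                           # assumed initialization of the [CLS] token
    return np.concatenate([cls, tokens], axis=0)

rng = np.random.default_rng(0)
feature = rng.normal(size=(8, 8, 3))
print(embed_features(feature, patch=2, dim=32, rng=rng).shape)  # (17, 32)
```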
Algorithm 2 Medical Image Segmentation Based on USCT-UNet
Input image
Segmentation result
Encoder Phase: Input X into the U-Net encoder
for
Perform convolution and pooling to extract features
Store the feature map as
end for
Encoder Feature Embedding:
Apply algorithm 1 for
Store the embedded features as
USC Module Processing:
for
Determine the number of MCFT modules to pass
Pass
for each MCFT module do
Perform feature embedding using Eq. (4) to obtain Q, K, V
end for
Store the disambiguated features as
end for
Decoder Phase:
Upsample
for
Input
Perform feature embedding on
Apply linear embedding to
Compute matching features
Reconstruct
Upsample
end for
Apply a
B. U-Shaped Skip Connection
In response to the observed characteristics of the semantic gap, this study introduces the USC scheme using the MCFT to replace the DSC scheme, thereby eliminating the semantic gaps between the encoder and decoder, as shown in Fig. 3. Specifically, the severity of the semantic gap determines the number of MCFT blocks used to address it. For example, “SC1,” with the most pronounced semantic gap, consists of four MCFT blocks. Similarly, “SC2” contains three MCFT blocks, “SC3” has two, and “SC4,” with the least severe semantic gap, includes a single MCFT block. The output from each encoder layer,
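The allocation rule described above (four, three, two, and one MCFT blocks for “SC1” to “SC4”) can be expressed as a small lookup. In this sketch, `apply_usc` and `block_fn` are hypothetical names, and `block_fn` is a placeholder for a single MCFT block; for simplicity each level is processed independently, whereas the actual MCFT also draws on the fused multichannel features.

```python
# Number of MCFT blocks per skip level, as stated in the text.
MCFT_BLOCKS = {"SC1": 4, "SC2": 3, "SC3": 2, "SC4": 1}

def apply_usc(features, block_fn):
    """Pass each skip-level feature through its allotted number of blocks.

    features: dict mapping a skip level name to its encoder feature.
    block_fn: callable standing in for one MCFT block (placeholder).
    """
    out = {}
    for name, x in features.items():
        for _ in range(MCFT_BLOCKS[name]):
            x = block_fn(x)
        out[name] = x
    return out

# Toy check: count how many times each level is processed.
counts = apply_usc({"SC1": 0, "SC2": 0, "SC3": 0, "SC4": 0}, lambda x: x + 1)
print(counts)  # {'SC1': 4, 'SC2': 3, 'SC3': 2, 'SC4': 1}
```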
C. Multichannel Fusion Transformer
To address the semantic gap between the encoder and decoder, we employ a multichannel fusion Transformer (MCFT) for multiscale global modeling on the output features of each encoder layer,
1) Encoder Feature Embedding:
We first use filters of size
2) Multihead Channel Self-Attention:
We then linearly map
\begin{equation*} \mathbf{Q}_{k} = \mathbf{T}_{k}\mathbf{W}_{Q_{k}}, \quad \mathbf{K} = \mathbf{T}_{\Sigma}\mathbf{W}_{K}, \quad \mathbf{V} = \mathbf{T}_{\Sigma}\mathbf{W}_{V}, \tag{4}\end{equation*}
After LN,
\begin{equation*} \mathbf{CSA}_{k} = \mathrm{SoftMax}\left( \mathrm{SN}\left( \frac{\mathbf{Q}_{k}^{\mathrm{T}}\mathbf{K}}{\sqrt{C_{\Sigma}}} \right) \right)\mathbf{V}^{\mathrm{T}}, \tag{5}\end{equation*}
MCSA is formed by combining multiple CSA, which can be computed in parallel to obtain the output result. With m heads, the output result
\begin{equation*} \mathbf{MCSA}_{k} = \frac{\mathbf{CSA}_{k}^{1} + \mathbf{CSA}_{k}^{2} + \cdots + \mathbf{CSA}_{k}^{m}}{m} + \mathbf{T}_{k}, \tag{6}\end{equation*}
Then, applying the IMLP and residual operator, we derive the final output feature
\begin{equation*} \mathbf{O}_{k} = \mathrm{IMLP}\left( \mathrm{LN}\left( \mathbf{MCSA}_{k} \right) \right) + \mathrm{LN}\left( \mathbf{MCSA}_{k} \right), \tag{7}\end{equation*}
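To make the shapes in Eqs. (4)-(7) concrete, the following NumPy sketch traces one MCFT block. Several points are assumptions rather than details confirmed by the text: tokens are laid out as (d, C) matrices with d tokens and C channels, SN is approximated by normalizing the whole similarity map to zero mean and unit variance, the LN step preceding Eq. (5) is omitted, the averaged heads are transposed back to (d, C_k) so that the residual with T_k is well defined, and the IMLP is passed in as an arbitrary callable.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def normalize(x, eps=1e-5):
    """Stand-in for SN: zero mean, unit variance over the whole similarity map."""
    return (x - x.mean()) / np.sqrt(x.var() + eps)

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def csa(t_k, t_sigma, w_q, w_k, w_v):
    """One head of Eqs. (4)-(5). t_k: (d, C_k); t_sigma: (d, C_sigma)."""
    q = t_k @ w_q                  # (d, C_k)
    k = t_sigma @ w_k              # (d, C_sigma)
    v = t_sigma @ w_v              # (d, C_sigma)
    c_sigma = t_sigma.shape[1]
    attn = softmax(normalize(q.T @ k / np.sqrt(c_sigma)), axis=-1)  # (C_k, C_sigma)
    return attn @ v.T              # (C_k, d)

def mcsa(t_k, t_sigma, heads):
    """Eq. (6): average the heads, then add the residual T_k."""
    outs = [csa(t_k, t_sigma, *w) for w in heads]
    return np.mean(outs, axis=0).T + t_k  # transposed back to (d, C_k): an assumption

def mcft_output(mcsa_k, imlp):
    """Eq. (7): IMLP(LN(MCSA_k)) + LN(MCSA_k)."""
    z = layer_norm(mcsa_k)
    return imlp(z) + z

rng = np.random.default_rng(0)
d, c_k, c_sigma, m = 6, 4, 12, 2
t_k, t_sigma = rng.normal(size=(d, c_k)), rng.normal(size=(d, c_sigma))
heads = [(rng.normal(size=(c_k, c_k)),
          rng.normal(size=(c_sigma, c_sigma)),
          rng.normal(size=(c_sigma, c_sigma))) for _ in range(m)]
o_k = mcft_output(mcsa(t_k, t_sigma, heads), imlp=lambda z: z)  # identity IMLP placeholder
print(o_k.shape)  # (6, 4)
```

Because the attention map has shape (C_k, C_Σ), its size depends on the channel counts rather than on the number of spatial tokens, which is the sense in which the attention operates across channels.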
D. Improved Multilayer Perceptron
The original multilayer perceptron (MLP) used in the Transformer structure employs a two-layer linear mapping mechanism, which may not fully capture the complex shape features present in medical images. To address this limitation, we have developed an improved multilayer perceptron (IMLP) with three linear layers, incorporating batch normalization (BN) and a dropout layer, as shown in Fig. 4.
The feature sequence from the MCSA is first processed by a linear layer, transforming it into a 784-dimensional feature sequence. After undergoing BN and Gaussian Error Linear Unit (GELU) activation, the sequence enters the next linear layer, which retains the same dimensionality. The sequence then passes through another round of BN and GELU activation before entering the third linear layer, which reduces its dimensionality back to the original size. Following a dropout layer, the final output provides a comprehensive representation of the complex semantic information in the medical image.
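The layer order described above (linear to 784 dimensions, BN, GELU, linear, BN, GELU, linear back to the input size, dropout) can be sketched in NumPy as follows. The sketch is illustrative only: the parameter initialization, the dropout rate of 0.1, the tanh approximation of GELU, and the use of batch statistics without learned scale and shift in BN are assumptions, and the function names are hypothetical.

```python
import numpy as np

def gelu(x):
    # tanh approximation of the Gaussian Error Linear Unit
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def batch_norm(x, eps=1e-5):
    # Normalize each feature over the token axis (no learned scale or shift)
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

def dropout(x, rate, rng):
    # Identity when no generator is given (inference); inverted dropout otherwise
    if rng is None or rate == 0.0:
        return x
    keep = rng.random(x.shape) >= rate
    return x * keep / (1.0 - rate)

def init_params(dim, hidden=784, seed=0):
    """Random stand-ins for the three learned linear layers."""
    rng = np.random.default_rng(seed)
    shapes = [(dim, hidden), (hidden, hidden), (hidden, dim)]
    return [(rng.normal(0.0, 0.02, size=s), np.zeros(s[1])) for s in shapes]

def imlp(x, params, rate=0.1, rng=None):
    """x: (tokens, dim) -> (tokens, dim) through three linear layers."""
    (w1, b1), (w2, b2), (w3, b3) = params
    h = gelu(batch_norm(x @ w1 + b1))   # dim -> 784
    h = gelu(batch_norm(h @ w2 + b2))   # 784 -> 784
    return dropout(h @ w3 + b3, rate, rng)  # 784 -> dim, then dropout

x = np.random.default_rng(1).normal(size=(10, 16))
y = imlp(x, init_params(16))
print(y.shape)  # (10, 16)
```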
E. Spatial Channel Cross-Attention
In the architecture shown in Fig. 3, we design a spatial channel cross-attention (SCCA) module to facilitate the fusion of features from the USC module and the corresponding decoder layer. Specifically, the decoder output features
\begin{equation*} \mathbf{M}_{k} = \mathrm{SoftMax}\left( \mathrm{SN}\left( \frac{\mathbf{Q}_{k}^{\mathrm{T}}\mathbf{K}}{\sqrt{C_{k}}} \right) \right)\mathbf{V}^{\mathrm{T}} \tag{8}\end{equation*}
F. Computational Complexity
This study designs the USC and SCCA modules based on the self-attention mechanism. Both modules employ softmax attention as described in Eq. (5) and Eq. (8), which involves matrix operations to compute the similarity between all Q-K pairs, resulting in a complexity of \begin{equation*} \mathrm {O}(H \times W \times {\kappa ^{2}} \times 2C_{in} \times C_{out}), \tag {9}\end{equation*}
Experiments and Analysis of Results
A. K-Fold Cross-Validation
To rigorously evaluate the segmentation precision of the U-Net and our proposed USCT-UNet models, we conducted 5-fold cross-validation experiments across four distinct datasets. The results are summarized in TABLE V, which indicates that the USCT-UNet consistently outperforms the U-Net in terms of segmentation performance across all datasets. For instance, for ISIC-2018 dataset, the USCT-UNet achieved a 4.79% improvement in Dice, a 5.70% increase in MIoU, and a reduction of 2.15 in HD. Similarly, for GlaS dataset, the USCT-UNet demonstrated a 2.24% enhancement in Dice, a 3.46% enhancement in MIoU, and a 3.26 decrease in HD. Additionally, the USCT-UNet showed substantial improvements in these evaluation metrics for both DSB-2018 and MoNuSeg datasets.
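For readers unfamiliar with the protocol, a 5-fold split can be generated as in the sketch below. The function name, the shuffling seed, and the interleaved fold assignment are illustrative assumptions; the exact partitioning used in the experiments is not specified in the text.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, val_indices) for each of the k folds.

    Every sample appears in exactly one validation fold.
    """
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # interleaved split keeps fold sizes within one sample
    for i in range(k):
        val = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, val

for fold, (train, val) in enumerate(k_fold_indices(23, k=5)):
    print(fold, len(train), len(val))
```

The reported metric for each dataset would then be the average over the five validation folds.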
B. Ablation Studies
We conducted an ablation study on the four challenging datasets, with the results presented in TABLE VI. In these experiments, we systematically evaluated the effects of the USC module, the SCCA module, and their combination on model performance. The results demonstrate that the USC module significantly enhances model performance across all four datasets. For instance, on the ISIC-2018 dataset, the Dice for the UNet+USC model improves to 87.79%, which is 3.92% higher than the 83.87% of the UNet. On the DSB-2018 dataset, the MIoU for the UNet+USC model improves to 82.08%, which is 2.85% higher than the 79.23% of the UNet. Similarly, incorporating the SCCA module substantially improves model performance. For example, on the MoNuSeg dataset, the MIoU for the UNet+SCCA model reaches 65.81%, which is 3.48% higher than the 62.33% of the UNet. On the GlaS dataset, the HD for the UNet+USC model decreases to 26.49, which is 1.60 lower than the 28.09 of the UNet.
The model’s performance reaches its optimal level when both the USC and SCCA modules are combined. For instance, on the GlaS dataset, the Dice and MIoU for the UNet+USC+SCCA model improve to 91.16% and 84.44%, respectively, and on the DSB-2018 dataset they rise to 89.12% and 83.15%. These enhancements demonstrate that the USC module effectively mitigates the negative impact of the semantic gap, while the SCCA module proficiently integrates features from both the USC and the decoder. However, these performance gains come at the cost of increased computational complexity. The parameters and FLOPs of the combined model increase to 31.15M and 62.05G, respectively, which may affect the model’s real-time performance and deployment efficiency.
We further investigate the role of the USC module in USCT-UNet, specifically how it addresses the semantic gap between the decoder and encoder. This section includes a detailed ablation study of the USC scheme using four datasets, with results provided in Fig. 5. The Dice, MIoU, and HD results across the four datasets demonstrate that the USC scheme significantly enhances the model’s segmentation performance. The improvements are particularly pronounced on more challenging datasets, such as MoNuSeg and ISIC-2018, because the morphological diversity and complexity of the target regions in these datasets make it more difficult for segmentation models to capture accurate semantic information; our USC scheme effectively overcomes this challenge.
Results of the ablation experiments to assess the contribution of the USC options across multiple assessment metrics.
The effectiveness of the proposed USC module in addressing the semantic gap between different layers of the encoder and decoder has been quantified through the data presented in TABLEs VII, VIII, and IX. In these TABLEs, “+” indicates improvement. These results demonstrate significant improvements in the model’s ability to handle semantic mismatches across layers. The findings reveal three significant improvements facilitated by the USC scheme compared to the DSC scheme (as detailed in TABLEs II, III, and IV).
Across the four datasets, all variations of the USC scheme positively affect the segmentation process. This underscores the efficacy of USCT-UNet in bridging the semantic gap between corresponding layers of the decoder and encoder. For instance, on the DSB-2018 dataset, compared to the configuration without skip connections (“None”), “SC1,” “SC2,” “SC3,” and “SC4” increased Dice by 1.79%, 1.62%, 1.02%, and 1.87%, respectively. On the ISIC-2018 dataset, they enhanced MIoU by 2.48%, 2.19%, 2.15%, and 2.34%, respectively.
The USC module eliminates the adverse impacts associated with the direct skip connection (DSC) while maximizing its benefits. For example, on the MoNuSeg dataset, transitioning from “SC4” to “SC4+SC1,” “SC4+SC1+SC2,” and “SC4+SC1+SC2+SC3” increased the Dice gain from 1.42% to 3.28%, 4.07%, and 5.60%, respectively. On the GlaS dataset, the same transitions changed the HD reduction from 1.16 to 1.01, 1.07, and 1.47, respectively.
The USC module significantly improved the model’s segmentation performance. Specifically, on the GlaS dataset, USCT-UNet achieved a Dice of 91.16%, an MIoU of 84.44%, and an HD of 24.83. On the DSB-2018 dataset, the model achieved a Dice of 89.12%, an MIoU of 83.15%, and an HD of 15.11. These results underscore the effectiveness of the USC scheme in bridging the semantic gap between the decoder and encoder.
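For readers less familiar with the three metrics used throughout this section, the following is a minimal sketch of Dice, IoU, and the Hausdorff distance (HD) for binary masks. It rests on several assumptions that the text does not confirm: masks are nested lists of 0/1 values, MIoU is taken as the mean IoU over images, and HD is the symmetric Hausdorff distance over all foreground pixels rather than boundary points or a percentile variant. All function names are illustrative.

```python
import math


def foreground(mask):
    """Set of (row, col) coordinates where a binary mask is non-zero."""
    return {(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v}


def dice(pred, target):
    """Dice = 2|P n T| / (|P| + |T|); taken as 1.0 when both masks are empty."""
    p, t = foreground(pred), foreground(target)
    total = len(p) + len(t)
    return 1.0 if total == 0 else 2 * len(p & t) / total


def iou(pred, target):
    """IoU = |P n T| / |P u T|; taken as 1.0 when both masks are empty."""
    p, t = foreground(pred), foreground(target)
    union = len(p | t)
    return 1.0 if union == 0 else len(p & t) / union


def mean_iou(pairs):
    """Assumption: MIoU is the average IoU over (prediction, target) pairs."""
    return sum(iou(p, t) for p, t in pairs) / len(pairs)


def hausdorff(pred, target):
    """Symmetric Hausdorff distance in pixels between the foreground sets."""
    p, t = foreground(pred), foreground(target)
    if not p or not t:
        return math.inf  # assumption: undefined case reported as infinity

    def directed(a, b):
        # Largest distance from a point in a to its nearest point in b.
        return max(min(math.dist(x, y) for y in b) for x in a)

    return max(directed(p, t), directed(t, p))


# Toy example: the prediction misses one foreground pixel of the target.
pred = [[1, 1, 0], [0, 1, 0], [0, 0, 0]]
target = [[1, 1, 0], [1, 1, 0], [0, 0, 0]]

print(f"Dice={dice(pred, target):.4f}, IoU={iou(pred, target):.2f}, "
      f"HD={hausdorff(pred, target):.1f}")
```

Higher Dice and MIoU indicate better region overlap, while a lower HD indicates that the worst-case boundary error is smaller, which is why the HD improvements in this section appear as decreases.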
C. Segmentation Results Visualisation
In this section, we present visual comparisons of the segmentation results produced by the proposed USCT-UNet and benchmark models to demonstrate USCT-UNet’s superiority in medical image segmentation tasks. These results are depicted in Fig. 6, where the red boxes indicate areas where USCT-UNet outperforms the other models. USCT-UNet delivers segmentation outcomes that resemble the ground truth more closely than those of U-Net and UCTransNet. As illustrated in Fig. 6, USCT-UNet accurately highlights the correct pathological regions while suppressing false positives, and it delineates continuous boundary edges, demonstrating its effectiveness in precise and robust segmentation.
D. Comparison With State-of-the-Art Related Methods
In this subsection, we comprehensively analyze the segmentation performance, parameters, and computational complexity of USCT-UNet across four datasets. We compare its performance with the latest CNN and Transformer-based methods, as detailed in TABLEs X and XI. The results indicate that USCT-UNet achieves the highest Dice and MIoU on all four datasets: 89.12% and 83.15% on DSB-2018, 88.66% and 81.62% on ISIC-2018, 91.16% and 84.44% on GlaS, and 80.07% and 66.99% on MoNuSeg. Additionally, compared to methods such as HRNetV2 [44], TransFuse [49], UDTransNet [53], UCTransNet [26], and LCAMix [52], our USCT-UNet demonstrates design efficiency by significantly improving precision while maintaining a low parameter count (31.15M) and computational complexity (62.05G FLOPs), highlighting its potential for broad clinical deployment and application.
Conclusion
Precise and automated segmentation of medical images is a pivotal component in clinical diagnostics and rehabilitation monitoring. This study systematically evaluates the semantic gap between corresponding layers of the U-Net architecture, identifying a disparity in the magnitude of this gap across various layers, with more pronounced gaps in the upper layers and less severe ones in the deeper layers. This semantic gap negatively impacts the segmentation process when using direct skip connections (DSC).
To address this issue and ensure reliable and accurate automated segmentation of medical images, we propose the USCT-UNet architecture, which integrates U-Net with U-shaped skip connections (USC) and a spatial channel cross-attention (SCCA) module. The USC, constructed using multichannel fusion Transformer blocks, replaces the direct skip connections (DSC). The SCCA module, developed using self-attention mechanisms, facilitates the fusion of the decoder’s output features with those from the USC. Experimental results confirm that the proposed method effectively eliminates the semantic gap between the decoder and encoder, significantly enhancing state-of-the-art medical image segmentation performance on several benchmark datasets.
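As a rough illustration of the kind of attention-based fusion summarized above, the sketch below implements generic scaled dot-product cross-attention in plain Python. It is not the authors’ SCCA module: it omits learned projections, multiple heads, and any separate spatial or channel branches, and the choice of decoder features as queries with USC features as keys and values is an assumption made purely for illustration, as are all names and toy values.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query attends over all key/value pairs."""
    d = len(keys[0])  # key dimension used for scaling
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors, one output channel at a time.
        fused = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(fused)
    return outputs


# Hypothetical toy tokens: two decoder tokens attend over three USC tokens.
decoder_tokens = [[1.0, 0.0], [0.0, 1.0]]
usc_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

fused = cross_attention(decoder_tokens, usc_tokens, usc_tokens)
print(fused)
```

Because every query is compared with every key, the cost of this operation grows with the product of the two token counts, which is the quadratic behaviour that linear-complexity attention mechanisms aim to avoid.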
Despite the excellent segmentation performance achieved by USCT-UNet, there are two potential areas for improvement: 1) Segmenting a 3D image requires converting it from 3D to 2D, which may result in the loss of correlation information between slices. 2) The inclusion of the USC and SCCA modules significantly increases computational complexity. Therefore, future enhancements to the USCT-UNet architecture will focus on constructing 3D models and designing an attention mechanism with linear computational complexity.