USCT-UNet: Rethinking the Semantic Gap in U-Net Network From U-Shaped Skip Connections With Multichannel Fusion Transformer

Abstract:

Medical image segmentation is a crucial component of computer-aided clinical diagnosis, with state-of-the-art models often being variants of U-Net. Despite their success, these models’ skip connections introduce an unnecessary semantic gap between the encoder and decoder, which hinders their ability to achieve the high precision required for clinical applications. Awareness of this semantic gap and its detrimental influences has increased over time. However, a quantitative understanding of how this semantic gap compromises accuracy and reliability remains lacking, which emphasizes the need for effective mitigation strategies. In response, we present the first quantitative evaluation of the semantic gap between corresponding layers of U-Net and identify two key characteristics: 1) The direct skip connection (DSC) exhibits a semantic gap that negatively impacts models’ performance; 2) The magnitude of the semantic gap varies across different layers. Based on these findings, we re-examine this issue through the lens of skip connections. We introduce a Multichannel Fusion Transformer (MCFT) and propose a novel USCT-UNet architecture, which incorporates U-shaped skip connections (USC) to replace DSC, allocates varying numbers of MCFT blocks based on the semantic gap magnitude at different layers, and employs a spatial channel cross-attention (SCCA) module to facilitate the fusion of features between the decoder and USC. We evaluate USCT-UNet on four challenging datasets, and the results demonstrate that it effectively eliminates the semantic gap. Compared to using DSC, our USC and SCCA strategies achieve maximum improvements of 4.79% in the Dice coefficient, 5.70% in mean intersection over union (MIoU), and 3.26 in Hausdorff distance.

Published in: IEEE Transactions on Neural Systems and Rehabilitation Engineering ( Volume: 32)

Page(s): 3782 - 3793

Date of Publication: 26 September 2024

PubMed ID: 39325601

DOI: 10.1109/TNSRE.2024.3468339

CCBY - IEEE is not the copyright holder of this material. Please follow the instructions via https://creativecommons.org/licenses/by/4.0/ to obtain full-text articles and stipulations in the API documentation.

SECTION I.

Introduction

Computer-aided diagnosis systems have become an indispensable technology for disease treatment, aiding in both preoperative diagnosis and postoperative rehabilitation monitoring. These systems aim to help clinicians make diagnostic decisions more accurately and objectively [1]. Medical image segmentation is a crucial step in computer-aided clinical diagnosis. The development and integration of advanced methods in medical image segmentation are essential for enhancing diagnostic accuracy and improving patient rehabilitation. These innovations support clinicians in diagnosing diseases and monitoring patient rehabilitation with greater precision and reliability [2]. Segmenting and quantitatively analyzing disease regions not only provides valuable information for pathological diagnosis but also significantly aids in planning treatment strategies and in monitoring disease progression and patient rehabilitation [3], [4]. Currently, the state-of-the-art models for medical image segmentation are variants of U-Net, which include an encoder and a decoder [5], [6]. Among these, U-Net [7] is the most frequently utilized segmentation network. The encoder successively extracts the semantic features of the image, while the decoder progressively recovers its fine-grained details. The skip connection is a critical component of U-Net: it propagates spatial information from each encoder layer, taken before the pooling operation, to the corresponding decoder layer, thereby preventing the loss of this information during pooling. However, despite U-Net’s robust feature representation capabilities, its skip connection scheme struggles to adequately manage multiscale variations in complex medical images [8], [9], [10]. These challenges highlight the need for advanced models and engineering strategies to enhance the precision and utility of machine learning in medical image segmentation, ensuring effectiveness in complex clinical applications.

Some studies have proposed more efficient U-Net variants by enhancing and optimizing the encoder and decoder components, such as Edge U-Net [11], MDU-Net [12], and DCSAU-Net [13]. Additionally, recent research has leveraged the multiscale global modeling capabilities of Transformers [14] to construct more advanced encoders and decoders [15], [16], as demonstrated in models like TransUnet [17], Swin-Unet [18], and DS-TransUNet [19]. Furthermore, efforts have been made to improve U-Net’s precision and reliability by optimizing the skip connection scheme, as evidenced by Attention U-Net [20], UNet++ [21], and UNet3+ [22]. However, these approaches often overlook the semantic gap present in skip connections, where features mapped from the decoder may be semantically discordant with those from the encoder. In 2018, researchers enhanced the U-Net architecture by replacing the direct skip connection (DSC) with short connections consisting of nested dense convolutional blocks, significantly improving its performance [23]. Following this, further advancements were made in 2019 with the design of a semantic enhancement module and a boundary attention module, which were integrated into a parallel pyramid structure, achieving state-of-the-art results [24]. These developments underscore the notable semantic gap between the encoder and decoder when using DSC, which these modifications aim to address.

In 2020, Ibtehaz and Rahman first identified the concept of the semantic gap in the U-Net architecture in their work titled “MultiResUNet: Rethinking the U-Net architecture for multimodal biomedical image segmentation” [25]. They emphasized that this semantic gap significantly impairs U-Net’s segmentation performance. Their analysis, based on experimental segmentation results from U-Net across five public databases, demonstrated a semantic gap between the corresponding layers of the encoder and decoder. They noted that merging high-level semantic information from the decoder with low-level spatial information from the encoder via the DSC could detrimentally affect segmentation precision. To address this issue, the study proposed replacing the DSC with a sequence of convolutional layers with a residual connection. This adjustment introduces a structure comprising four $3\times 3$ convolution layers followed by a residual connection employing $1\times 1$ convolution operations.

In recent years, a growing body of research has identified the semantic gap as a significant obstacle in medical image segmentation, limiting the effectiveness of segmentation algorithms in transitioning from experimental research to clinical applications [5]. Consequently, this issue has attracted considerable attention and research efforts from scholars. In 2022, a systematic experimental analysis confirmed the existence of the semantic gap, highlighting that the DSC scheme might impair segmentation performance [26]. In some instances, U-Net’s segmentation performance without skip connections surpassed that achieved using skip connections. Concurrently, researchers attempted to mitigate the semantic gap by replacing the DSC with the channel Transformer (CTrans), leveraging the Transformer’s superior multiscale global modeling capabilities [26]. In 2023, a dense skip connection with cross co-attention was designed in U-Net to address the semantic gap, achieving more precise segmentation [27]. Furthermore, in 2024, a Skip-NAT scheme was introduced to replace the DSC, where the encoder’s feature maps are processed by a neighborhood attention Transformer and then fused with the decoder’s corresponding feature maps, minimizing the semantic gap and enhancing U-Net’s segmentation effects [28].

While these studies have recognized the semantic gap and its detrimental impacts, the absence of a quantitative evaluation method for this gap hinders the development of targeted mitigation strategies to enhance segmentation precision and reliability. In summary, the semantic gap issue has significantly constrained the achievement of accurate and reliable automatic medical image segmentation. Therefore, there is an urgent need to understand the patterns of the semantic gap between corresponding layers of the encoder and decoder and to develop methods to eliminate it.

Motivated by these issues, we conducted an in-depth study of the U-Net architecture concerning the semantic gap. Firstly, we quantified the semantic gap between corresponding layers of the decoder and encoder and performed a systematic analysis of their segmentation performance across the 2018 Data Science Bowl challenge (DSB-2018) database [29], the Lesion Boundary Segmentation Challenge dataset (ISIC-2018) [30], [31], the GlaS dataset [32], and MoNuSeg database [33], [34]. We observed two key characteristics:

1) The existence of the semantic gap means that the direct fusion of the decoder’s high-level semantic information and the encoder’s low-level spatial information via the DSC can compromise the model’s segmentation performance when the information is incompatible.
2) The semantic gap varies in severity across different layers between the encoder and decoder, leading to diverse impacts on the model’s segmentation performance. Specifically, shallower layers exhibit a worse semantic gap and a consequent greater degradation in performance, whereas deeper layers present a minor semantic gap and a lesser impact on performance.

Secondly, to address these challenges and enhance the trustworthiness of medical image segmentation, we introduce a multichannel fusion Transformer (MCFT) skip connection to counter the semantic gap’s adverse impacts. We further propose a novel USCT-UNet segmentation network to accomplish higher-precision and more generalizable automatic medical image segmentation. Specifically, we use MCFT blocks to construct U-shaped skip connection (USC) and allocate a variable number of MCFT blocks according to the semantic gap pattern at different layers. This allocation aims to establish long-term dependencies between the decoder’s high-level features and the encoder’s low-level features, thus minimizing the semantic gap and maximizing segmentation precision. Moreover, we designed a spatial channel cross-attention (SCCA) module to guide the fusion of features from the decoder and USC. Notably, our USC and SCCA modules can be easily embedded into other U-Net variants.

Finally, comprehensive experimental results from DSB-2018, ISIC-2018, GlaS and MoNuSeg databases show that our USCT-UNet outperforms U-Net on multiple segmentation evaluation metrics. This improvement not only significantly enhances traditional segmentation pipelines and offers an effective solution for the semantic gap issue, but also demonstrates a practical path towards improving the explainability, generalization, and accountability capabilities of machine learning in health informatics.

The principal contributions of this study are as follows:

1) Our research is the first to quantify the semantic gap between corresponding layers of the encoder and decoder, providing a systematic analysis of its patterns. We found that the semantic gap’s severity varies among different corresponding layers, leading to diverse impacts on the model’s segmentation performance. These findings suggest that simple direct and uniform skip connections may be inappropriate strategies.
2) We proposed the USCT-UNet segmentation network to mitigate this semantic gap and enhance segmentation precision. In it, the USC replaces the DSC and allocates different numbers of MCFT blocks based on the semantic gap magnitude at various layers, and the SCCA module guides the fusion of features from the decoder and the USC module.
3) We conducted extensive experiments on various challenging datasets, and the results demonstrate that USCT-UNet effectively eliminates the semantic gap between the decoder and encoder, achieving precise and reliable medical image segmentation. When compared with other advanced segmentation methods, our method demonstrates superior performance.

SECTION II.

Related Work

A. Studies on the Semantic Gap in U-Net

Although early research did not explicitly identify the semantic gap in U-Nets, these studies acknowledged the significant role of efficient skip connection schemes in improving U-Net’s segmentation precision. For instance, Attention U-Net [20] employs attention gates instead of the direct skip connection (DSC) (as shown in Fig. 1(b)) to guide feature fusion between the decoder and encoder, effectively highlighting relevant regions while suppressing unimportant background areas. UNet++ [21] introduces a dense connection scheme to replace the DSC, resulting in superior segmentation performance compared to U-Net. UNet3+ [22] designs a full-scale connection scheme to supplant the DSC, achieving advanced performance across multiple databases. These methods enhance segmentation precision by improving the skip connection scheme, yet they do not explicitly address the underlying issue of the semantic gap.

Fig. 1.

Comparison of skip connection schemes between the proposed USCT-UNet and other models.

It was not until 2020 that MultiResUNet [25] suggested the potential existence of a semantic gap between corresponding layers of the encoder and decoder, based on observations of U-Net’s segmentation results across five public databases. To address this semantic gap, they designed the residual path (Res Path) with convolutional blocks instead of DSC, as illustrated in Fig. 1(c). These additional nonlinear operations aimed to mitigate the semantic gap. Subsequently, in 2022, UCTransNet [26] confirmed the existence of this semantic gap through systematic analysis and noted that the DSC scheme could impair model performance. They proposed using the channel Transformer (CTrans) as a substitute for the DSC to further reduce the semantic gap, as shown in Fig. 1(d).

In 2023, the literature [35] employed a Transformer-based cross-layer feature enhancement module that fuses feature sets from neighboring encoder layers and integrates them with decoder feature sets. This approach enhances low-layer features through cross-layer feature learning, effectively mitigating semantic gaps and improving medical image segmentation performance. In 2024, the literature [36] introduced a composite attention module combining channel and spatial attention, featuring a three-branch structure with double squeeze-and-excitation blocks, convolutional blocks, and batch normalization. This module replaces the DSC and reduces the semantic gap across different layers between the encoder and decoder.

Although these studies have acknowledged the existence of the semantic gap between the decoder and encoder, they have not provided a comprehensive quantitative measurement or detailed analysis of it. Furthermore, they continue to use a uniform skip connection scheme, suggesting that there is still room for improvement in segmentation performance. To address this, we developed the U-shaped skip connection (USC) solution, which allocates a variable number of MCFT blocks based on the semantic gap patterns at different layers between the encoder and decoder, as illustrated in Fig. 1(a).

B. Transformer-Based Segmentation Methods

The Transformer is a neural network architecture that leverages a self-attention mechanism and exhibits exceptional global modeling capabilities [14]. Recently, the Vision Transformer (ViT) [37], which has demonstrated impressive results in image recognition tasks, has prompted research into replacing convolutional neural networks (CNNs) with ViT for certain medical image segmentation tasks [15], [16]. For instance, TransUnet [17] integrates ViT to construct the encoder for U-Net, while Swin-Unet [18] employs the Swin Transformer (SwinT) [38] to develop a Transformer-based U-Net. Similarly, DS-TransUNet [19] introduces a dual-scale encoder to capture both coarse- and fine-grained image features. It is worth noting that all the aforementioned methods retain the DSC component.

In a recent study [39], integrating SwinT into U-Net, along with spatial interaction, feature compression, and relationship aggregation blocks, enhanced the representation of irregularly shaped tumors. Other research [40] utilized SwinT as an encoder to extract image features and designed a cascading upsampling block as a decoder to optimize segmentation results. Additionally, some studies [41] proposed a hybrid network combining CNNs as the encoder and SwinT as the decoder for medical image segmentation, achieving remarkable performance. In another study [35], researchers used SwinT for the encoder and CNNs for the decoder, introducing a cross-layer feature enhancement module and a spatial channel squeeze-excitation module to enhance feature learning across different layers. Despite these advancements, these approaches primarily focus on enhancing the encoder and decoder of U-Net and do not effectively address the semantic gap issue by improving the skip connection.

SECTION III.

Quantitative Analysis of Semantic Gap

A. Databases

In this section, we quantitatively analyze the semantic gap between the layers of the encoder and decoder, assessing their impact on U-Net segmentation performance across four challenging datasets: DSB-2018, ISIC-2018, GlaS, and MoNuSeg. These datasets were chosen because they offer diverse and comprehensive cases for various biomedical imaging challenges, which can effectively indicate the robustness and generalization capabilities of segmentation models. Details regarding each dataset and the distribution of images among the training, validation, and testing sets are outlined in TABLE I.

The 2018 Data Science Bowl Challenge (DSB-2018) dataset [29] presents a wide variety of nuclei images, offering challenges due to the diversity in the structure and morphology of the nuclei. This diversity requires segmentation models to capture precise semantic information, which makes the dataset useful for studying how well a model resolves the semantic gap.
The GlaS dataset [32] focuses on the segmentation of glandular structures in colon histology images, encompassing glands with diverse shapes, sizes, and appearances. These irregular shapes pose a challenge to segmentation models, requiring them to capture precise semantic information of glandular contours, making the dataset an ideal platform for testing model performance in segmenting complex and irregularly shaped images.
The MoNuSeg dataset [33], [34] includes nuclear segmentation images from multiple organs, such as the liver, kidney, prostate, and breast. This diversity in data requires segmentation models to possess strong generalization capabilities, enabling them to perform well across various organ and tissue types. By employing a multi-organ dataset, we aim to evaluate the model’s ability to bridge the semantic gap.
The Lesion Boundary Segmentation Challenge dataset (ISIC-2018) [30], [31] focuses on skin lesion segmentation, encompassing a variety of shapes, sizes, and locations. This diversity aids in assessing the capability of segmentation models to handle complex lesion features, capture variable semantic information, and address the semantic gap issue.

TABLE I Details of the Medical Segmentation Databases

B. Implementation Details and Assessment Metrics

The experiments were conducted using a standalone NVIDIA GeForce RTX 3090 Ti tensor core GPU, an 8-core CPU, and 24 GB RAM, employing the PyTorch framework. To prevent overfitting, data augmentation strategies such as image flipping and random rotation were used. Before being input into the network, the resolution of all images was standardized to $448\times 448$ pixels. Model training was performed using the Adam optimizer with an initial learning rate of 0.001, employing cross-entropy and Dice loss functions. A 5-fold cross-validation was applied to enhance the robustness of the results.

The model’s performance was evaluated using the Dice coefficient (Dice), the mean intersection over union (MIoU), and the Hausdorff distance (HD) [42]. The Dice coefficient, a metric used to assess sample similarity, measures the ratio of twice the intersection of two sets to their total size. MIoU represents the average ratio of the intersection to the union of the ground truth and predicted sets. The formulas for calculating Dice and MIoU are as follows:

$\begin{align*} Dice & = \frac{2 \times TP}{FP + 2 \times TP + FN} \times 100\%, \tag{1}\\ MIoU & = \frac{1}{k + 1}\sum\limits_{i = 0}^{k} \frac{TP}{FP + TP + FN} \times 100\%, \tag{2}\end{align*}$

where TP denotes the pixels correctly classified as foreground, FP refers to the pixels mistakenly categorized as foreground, and FN refers to the pixels incorrectly identified as background. In (2), the counts are taken per class $i$, and $k + 1$ is the number of classes, including the background. Both Dice and MIoU values range from 0 to 100%, with higher values signifying a higher degree of similarity between the two sets.
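As a rough illustration of Eqs. (1) and (2), the following sketch computes Dice and MIoU for a pair of binary masks. It is not the authors’ evaluation code: the function names, the nested-list mask representation, and the convention that two empty masks score 100% are assumptions made here, and MIoU is taken as the average of the foreground and background IoU (that is, $k = 1$).

```python
def confusion_counts(pred, target):
    """Count TP, FP, FN for the foreground class of two binary masks.

    Both masks are nested lists of 0/1 with identical shape (an assumption
    of this sketch, chosen to avoid external dependencies).
    """
    tp = fp = fn = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                tp += 1  # predicted foreground, truly foreground
            elif p and not t:
                fp += 1  # predicted foreground, truly background
            elif t and not p:
                fn += 1  # predicted background, truly foreground
    return tp, fp, fn


def dice_percent(pred, target):
    """Eq. (1): 2*TP / (FP + 2*TP + FN), expressed in percent."""
    tp, fp, fn = confusion_counts(pred, target)
    denom = fp + 2 * tp + fn
    # Assumed convention: two empty masks count as a perfect match.
    return 100.0 if denom == 0 else 100.0 * 2 * tp / denom


def iou_percent(pred, target):
    """Per-class term of Eq. (2): TP / (FP + TP + FN), in percent."""
    tp, fp, fn = confusion_counts(pred, target)
    denom = fp + tp + fn
    return 100.0 if denom == 0 else 100.0 * tp / denom


def miou_percent(pred, target):
    """Eq. (2) for a binary task: mean of foreground and background IoU."""
    inv_pred = [[1 - p for p in row] for row in pred]
    inv_target = [[1 - t for t in row] for row in target]
    return (iou_percent(pred, target) + iou_percent(inv_pred, inv_target)) / 2


pred = [[1, 1, 0], [0, 1, 0]]
target = [[1, 0, 0], [0, 1, 1]]
print(round(dice_percent(pred, target), 2))  # 66.67 (TP=2, FP=1, FN=1)
print(miou_percent(pred, target))            # 50.0
```

In practice these counts would normally be computed on array types rather than nested lists; the list version is used here only to keep the example dependency-free.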

The Hausdorff distance (HD) quantifies the spatial separation between the ground truth and predicted sets within a metric space. A smaller HD value indicates a closer correspondence between the model prediction and the actual target region. The formula for calculating HD is as follows:

$\begin{align*} d_{H}(X,Y) = \max \left\{ \max\limits_{x \in X} \left( \min\limits_{y \in Y} d(x,y) \right),\ \max\limits_{y \in Y} \left( \min\limits_{x \in X} d(y,x) \right) \right\}, \tag{3}\end{align*}$

where $d_{H}(X,Y)$ denotes the HD between images X and Y; $d(x,y)$ symbolizes the distance from each pixel point $x_{i}$ in image X to the nearest pixel point $y_{j}$ in image Y, and similarly, $d(y,x)$ indicates the distance from each pixel point $y_{j}$ in image Y to the closest pixel point $x_{i}$ in image X.
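Eq. (3) can be sketched directly as two nested max/min passes over the foreground pixels. The version below is a brute-force illustration, not the implementation used in the experiments: it assumes Euclidean distance in pixel units, uses every foreground pixel rather than only boundary pixels, and all function names are hypothetical.

```python
import math


def foreground_points(mask):
    """Return (row, col) coordinates of all non-zero pixels in a nested-list mask."""
    return [(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v]


def directed_hausdorff(xs, ys):
    """Largest distance from a point in xs to its nearest point in ys."""
    return max(min(math.dist(x, y) for y in ys) for x in xs)


def hausdorff_distance(mask_x, mask_y):
    """Symmetric Hausdorff distance of Eq. (3) between two binary masks."""
    xs, ys = foreground_points(mask_x), foreground_points(mask_y)
    if not xs or not ys:
        # Eq. (3) is undefined when one of the point sets is empty.
        raise ValueError("Hausdorff distance is undefined for an empty mask")
    return max(directed_hausdorff(xs, ys), directed_hausdorff(ys, xs))


mask_x = [[1, 0, 0, 0]]
mask_y = [[1, 0, 0, 1]]
print(hausdorff_distance(mask_x, mask_y))  # 3.0: the extra pixel in Y is 3 px from X
```

The cost is proportional to the product of the two foreground sizes, so for $448\times 448$ images a spatial index or a library routine would be the more realistic choice.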

C. Quantitative Results and Analysis

This analysis is conducted for various skip connection configurations, as depicted in Fig. 2. The Dice, MIoU, and HD results across the four datasets indicate that the DSC scheme reduces the model’s segmentation performance, whether applied individually or in combination. The performance drop is particularly pronounced when SC1 and SC2 are used separately. This is due to the complex and variable morphology of target regions in these datasets, which exacerbates the semantic gap at corresponding layers between the encoder and decoder. The DSC scheme fails to effectively address this issue.

Fig. 2.

Quantifying the influence of semantic gap on U-Net’s segmentation performance across multiple assessment metrics.

The semantic gap between different layers of the encoder and decoder is quantified in TABLEs II, III, and IV. Here, “None” indicates the absence of skip connections, “SC1” implies that only the first level of skip connections is preserved, and “SC4+SC1” means that the first and fourth levels of skip connections are maintained. The “None” scenario serves as the reference point for gauging the impact of the semantic gap. Skip connections in U-Net forward the output of an encoder layer directly to the corresponding decoder layer, preserving spatial information that might otherwise be lost during pooling. In the “None” scenario this bypass is absent, so no information is transferred directly between corresponding encoder and decoder layers and no semantic gap can arise. The “None” scenario therefore acts as a control condition: a baseline against which the impact of the semantic gap in configurations with skip connections can be measured and compared. In these TABLEs, “+” indicates improvement, while “-” signifies decline. A systematic analysis of the quantified semantic gap results reveals three critical findings.

TABLE II Quantitative Analysis of Semantic Gap Influence on Dice Metric at Different Layers

TABLE III Quantitative Analysis of Semantic Gap Influence on MIoU Metric at Different Layers

TABLE IV Quantitative Analysis of Semantic Gap Impact on HD Metric at Different Layers

1) Finding 1:

The degree of the semantic gap varies across different layers, leading to different impacts on U-Net’s segmentation performance. The shallower the layer, the more significant the semantic gap and its impact on segmentation performance. Conversely, the deeper the layer, the smaller the semantic gap and its influence. For instance, in TABLE II, on the DSB-2018 dataset using the Dice metric, the semantic gap for “SC1” is −8.54, for “SC2” is −6.51, for “SC3” is −3.02, and for “SC4” is −0.59. Similarly, in TABLE III, on the MoNuSeg dataset using the MIoU metric, the semantic gap for “SC1” is −4.40, for “SC2” is −0.48, for “SC3” is −0.34, and for “SC4” is −0.18. Therefore, adopting a uniform skip connection scheme is unsuitable.

2) Finding 2:

Due to the presence of the semantic gap, the direct skip connection (DSC), which directly fuses features from the decoder and encoder, may impair the model’s segmentation performance. This study evaluates the performance by incrementally adding skip connections at various layers on top of “SC4”. For example, in TABLE IV, on the ISIC-2018 dataset using the HD metric, the semantic gap for “SC4” is −0.11. When additional skip connections are added, “SC4+SC1” yields −0.18, “SC4+SC1+SC2” yields −0.23, and “SC4+SC1+SC2+SC3” yields −0.40. Similarly, on the GlaS dataset using the HD metric, “SC4” has a semantic gap of −0.48, “SC4+SC1” yields −1.37, “SC4+SC1+SC2” yields −1.97, and “SC4+SC1+SC2+SC3” yields −1.79. Thus, direct fusion of decoder and encoder features is impractical.

3) Finding 3:

The detrimental impact of the semantic gap has led to low overall segmentation precision for U-Net across the four datasets, significantly hindering the achievement of precise and reliable automatic medical image segmentation. For example, the MIoU scores of U-Net with DSC are 75.92% on the ISIC-2018 dataset and 80.98% on the GlaS dataset. In comparison, the MIoU scores of U-Net without DSC are 77.16% and 82.81%, respectively. This indicates a performance degradation due to the semantic gap of 1.24% on ISIC-2018 and 1.83% on GlaS.
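The arithmetic behind these figures can be sketched as a signed difference against the “None” baseline. The snippet below only reproduces the two MIoU examples quoted in Finding 3; the function name and dictionary layout are illustrative assumptions, and it covers only metrics where higher is better.

```python
def semantic_gap(score_with_skip, score_none):
    """Signed change relative to the "None" baseline.

    A negative value means the skip-connection configuration scored worse
    than the network without skip connections.
    """
    return round(score_with_skip - score_none, 2)


# MIoU values (in percent) quoted in Finding 3.
miou = {
    "ISIC-2018": {"None": 77.16, "DSC": 75.92},
    "GlaS": {"None": 82.81, "DSC": 80.98},
}

gaps = {name: semantic_gap(s["DSC"], s["None"]) for name, s in miou.items()}
print(gaps)  # {'ISIC-2018': -1.24, 'GlaS': -1.83}
```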

Based on the results of the quantitative analyses, we identify two key characteristics: 1) The direct skip connection (DSC) exhibits a semantic gap that adversely impacts the model’s segmentation performance; 2) The magnitude of the semantic gap varies across different layers. Consequently, it is essential to develop an efficient skip connection scheme to replace the DSC, thereby eliminating the semantic gap and enhancing the model’s segmentation precision.

SECTION IV.

Methodology

A. Overall Architecture

To address the challenges posed by the semantic gap, this study introduces the USCT-UNet architecture, an extension of the U-Net. The structure of the proposed USCT-UNet is illustrated in Fig. 3. In this architecture, the U-shaped skip connection (USC) replaces the direct skip connection (DSC), and the fusion of features from the decoder and USC is achieved through the SCCA module in the decoder. The process begins with an image $X \in \mathbb{R}^{H \times W \times 3}$ (where H and W represent the height and width of the image, respectively) being fed into the encoder for representation learning, resulting in feature maps $\textbf{E}_{k}$ for $k = 1, \ldots, 4$. These feature maps then undergo feature embedding to generate $\textbf{T}_{k}$, which are then fed into the USC for semantic disambiguation, yielding the features $\textbf{O}_{k}$. Simultaneously, $\textbf{E}_{4}$ undergoes pooling and convolution operations to obtain $\textbf{E}_{5}$, which is fed into the decoder, producing the feature maps $\textbf{D}_{k}$. Finally, $\textbf{O}_{k}$ and $\textbf{D}_{k}$ are fused through the SCCA module in the decoder, and the output features from the final layer of the decoder undergo a convolution operation to produce the segmentation result $Y \in \mathbb{R}^{H \times W \times 1}$. Further details are provided below and in Algorithms 1 and 2.

Fig. 3.

Schematic of the overall structure of USCT-UNet. The DSC is replaced with the USC, built using MCFT blocks. A SCCA module is introduced to guide the fusion of feature information between the decoder and the output of the USC.

Algorithm 1 Encoder Feature Embedding

Input: Feature matrix $\textbf{E}_{k} \in \mathbb{R}^{(H \times W/k^{2}) \times C_{k}}$

Output: Feature embedding matrix $\textbf{T}_{k} \in \mathbb{R}^{d \times C_{k}}$

Initialize patch size $P \times P$ and target embedding dimension $d$
for $k = 1$ to $4$ do
  Set the feature matrix $\textbf{E}_{k}$ dimensions as $(H \times W/k^{2}) \times C_{k}$
  Compute the total number of patches $N_{k} = (H \times W/k^{2})/P^{2}$
  Divide $\textbf{E}_{k}$ into $N_{k}$ patches, each of size $P \times P \times C_{k}$
  for $i = 1$ to $N_{k}$ do
    Flatten each patch into a vector $z_{ki} \in \mathbb{R}^{P^{2} \cdot C_{k}}$
    Embed the flattened vector $z_{ki}$ into dimension $d$: $z^{p}_{ki} = z_{ki} \cdot \mathbf{F}_{k} + \mathbf{b}_{k}$, with $\mathbf{F}_{k} \in \mathbb{R}^{(P^{2} \times C_{k}) \times d}$ and $\mathbf{b}_{k} \in \mathbb{R}^{d}$
    Add position encoding $\mathbf{P}_{ki}$: $z_{ki} = z^{p}_{ki} + \mathbf{P}_{ki}$
  end for
  Add a learnable classification token [CLS] at the beginning of the sequence, initialized as a special vector $\mathbf{z}_{k0} \in \mathbb{R}^{d}$
  Combine the classification token and all patch embeddings into the feature embedding matrix: $\textbf{T}_{k} = [\mathbf{z}_{k0}; z_{k1}; z_{k2}; \ldots; z_{kN_{k}}]$
end for
Output feature embedding matrix $\textbf{T}_{k} \in \mathbb{R}^{d \times C_{k}}$
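The per-level steps of Algorithm 1 (split into patches, flatten, project linearly, add position encodings, prepend a [CLS] token) can be sketched with NumPy as follows. This is an illustrative sketch only: the function names are hypothetical, the projection, bias, and position encodings are random or zero stand-ins for learnable parameters, and the output is laid out as one row per token, a convention chosen here rather than taken from the paper.

```python
import numpy as np


def split_into_patches(feature_map, patch_size):
    """Split an (H, W, C) feature map into flattened, non-overlapping P x P patches."""
    h, w, c = feature_map.shape
    p = patch_size
    if h % p or w % p:
        raise ValueError("height and width must be divisible by the patch size")
    n_h, n_w = h // p, w // p
    # (n_h, p, n_w, p, c) -> (n_h, n_w, p, p, c) -> (N, p * p * c)
    patches = feature_map.reshape(n_h, p, n_w, p, c).transpose(0, 2, 1, 3, 4)
    return patches.reshape(n_h * n_w, p * p * c)


def embed_feature_map(feature_map, patch_size, embed_dim, rng):
    """Follow the steps of Algorithm 1 for a single encoder level."""
    tokens = split_into_patches(feature_map, patch_size)
    num_patches, flat_dim = tokens.shape
    # Stand-ins for the learnable parameters F_k, b_k and P_ki (assumed initialisation).
    proj = rng.standard_normal((flat_dim, embed_dim)) * 0.02
    bias = np.zeros(embed_dim)
    pos = rng.standard_normal((num_patches, embed_dim)) * 0.02
    embedded = tokens @ proj + bias + pos  # linear embedding, then position encoding
    cls = np.zeros((1, embed_dim))         # stand-in for the learnable [CLS] token
    return np.concatenate([cls, embedded], axis=0)  # shape (N + 1, embed_dim)


rng = np.random.default_rng(0)
feature_map = rng.standard_normal((8, 8, 3))
sequence = embed_feature_map(feature_map, patch_size=4, embed_dim=16, rng=rng)
print(sequence.shape)  # (5, 16): four patches plus the [CLS] token
```

In a trainable model the stand-in arrays would be framework parameters (for example in PyTorch, which the experiments use), and the patch projection is often implemented as a strided convolution instead of an explicit reshape.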

Algorithm 2 Medical Image Segmentation Based on USCT-UNet

Require: Input image $X \in \mathbb{R}^{H \times W \times 3}$

Ensure: Segmentation result $Y \in \mathbb{R}^{H \times W \times 1}$

Encoder Phase: Input X into the U-Net encoder
for $k = 1$ to 4 do
  Perform convolution and pooling to extract features
  Store the feature map as $\textbf{E}_{k}$
end for
Encoder Feature Embedding:
Apply Algorithm 1 for $\textbf{E}_{k}$
Store the embedded features as $\textbf{T}_{k}$
USC Module Processing:
for $k = 1$ to 4 do
  Determine the number of MCFT modules to pass $\textbf{T}_{k}$ through
  Pass $\textbf{T}_{k}$ through the corresponding number of MCFT modules
  for each MCFT module do
    Perform feature embedding using Eq. (4) to obtain Q, K, V
    Conduct semantic disambiguation learning using Eqs. (5)–(7)
  end for
  Store the disambiguated features as $\textbf{O}_{k}$
end for
Decoder Phase:
$\textbf{E}_{4}$ undergoes pooling and convolution operations to obtain $\textbf{E}_{5}$
Upsample $\textbf{E}_{5}$ using deconvolution to obtain $\textbf{D}_{4}$
for $k = 4$ to 1 do
  Input $\textbf{O}_{k}$ and $\textbf{D}_{k}$ into the SCCA module in the decoder
  Perform feature embedding on $\textbf{D}_{k}$ using Eq. (4) to obtain Q
  Apply linear embedding to $\textbf{O}_{k}$ using Eq. (4) to obtain K, V
  Compute matching features $\textbf{M}_{k}$ using Eq. (8)
  Reconstruct $\textbf{M}_{k}$ and multiply with $\textbf{D}_{k}$ to update $\textbf{D}_{k}$
  Upsample $\textbf{D}_{k}$ using deconvolution to obtain $\textbf{D}_{k-1}$
end for
Apply a $1 \times 1$ convolution to $\textbf{D}_{1}$ to produce the final result Y

B. U-Shaped Skip Connection

In response to the observed characteristics of the semantic gap, this study introduces the USC scheme using the MCFT to replace the DSC scheme, thereby eliminating the semantic gaps between the encoder and decoder, as shown in Fig. 3. Specifically, the severity of the semantic gap determines the number of MCFT blocks used to address it. For example, “SC1,” with the most pronounced semantic gap, consists of four MCFT blocks. Similarly, “SC2” contains three MCFT blocks, “SC3” has two, and “SC4,” with the least severe semantic gap, includes a single MCFT block. The output from each encoder layer, $\textbf {E}_{k}$ , is converted into a two-dimensional matrix and then input into the USC for multichannel feature fusion and multiscale modeling. The USC output feature, $\textbf {O}_{k}$ , is fused with the decoder output feature, $\textbf {D}_{k}$ , within the SCCA module in the decoder, thereby restoring the image’s fine-grained details.
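
The allocation rule above can be written down as a small lookup. The following is an illustrative sketch only: the names `MCFT_BLOCKS` and `mcft_blocks_for` are hypothetical, and the closed-form expression is merely a compact way to reproduce the four block counts stated in the text, not a formula given by the paper.

```python
# Block counts per skip connection, exactly as stated in the text.
MCFT_BLOCKS = {"SC1": 4, "SC2": 3, "SC3": 2, "SC4": 1}


def mcft_blocks_for(level: int, num_levels: int = 4) -> int:
    """Hypothetical helper: number of MCFT blocks for skip connection `level`.

    Level 1 is the shallowest skip connection (largest semantic gap).
    The closed form is an assumption that happens to match the four
    stated counts; the text does not give a general formula.
    """
    if not 1 <= level <= num_levels:
        raise ValueError(f"level must be in 1..{num_levels}, got {level}")
    return num_levels - level + 1


for name, count in MCFT_BLOCKS.items():
    level = int(name[2:])  # "SC3" -> 3
    assert mcft_blocks_for(level) == count
    print(f"{name}: {count} MCFT block(s)")

print("total MCFT blocks:", sum(MCFT_BLOCKS.values()))
```

Under this allocation the four skip connections together use ten MCFT blocks, with the shallowest connection receiving the most capacity.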

C. Multichannel Fusion Transformer

To address the semantic gap between the encoder and decoder, we employ a multichannel fusion Transformer (MCFT) for multiscale global modeling on the output features of each encoder layer, $\textbf {E}_{k}$ . As shown in Fig. 3, the MCFT block consists of a feature embedding layer, a linear layer, two normalization layers (LN), a multihead channel self-attention (MCSA) mechanism, a dropout layer, and an improved multilayer perceptron (IMLP).

1) Encoder Feature Embedding:

We first use filters of size $\left ({{\frac {P}{2^{k - 1}},\frac {P}{2^{k - 1}}}}\right)$, for $k=1,2,3,4$, with a stride of $\frac {P}{2^{k - 1}}$, to flatten the multiscale feature maps ${\textbf {E}_{k}} \in {\mathbb {R}^{(H \times W/{k^{2}}) \times {C_{k}}}}$ from the encoder into two-dimensional patch tokens, ${\textbf {T}_{k}} \in {\mathbb {R}^{d \times {C_{k}}}}$, where $d = \frac {HW}{P^{2}}$, H is the input image height, W is the input image width, and $C_{k}$ is the number of channels in the feature map. This operation leaves the channel dimension $C_{k}$ unchanged. More details can be found in Algorithm 1.
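
To make the patching scheme concrete, the sketch below computes the filter size and the resulting number of tokens at each encoder level. It rests on assumptions that are not taken from the paper: the function name `patch_grid` is hypothetical, the input size 224 and base patch size 16 are arbitrary, and each encoder level is assumed to halve the spatial resolution of the previous one (standard 2 × 2 pooling).

```python
def patch_grid(height: int, width: int, base_patch: int, level: int):
    """Return (patch_size, num_tokens) for encoder level `level` (1-based).

    Assumes the feature map at level k has spatial size
    (height / 2^(k-1), width / 2^(k-1)); this is an assumption of the sketch.
    """
    scale = 2 ** (level - 1)
    if base_patch % scale or height % base_patch or width % base_patch:
        raise ValueError("sizes must divide evenly for non-overlapping patches")
    patch = base_patch // scale                # filter size and stride P / 2^(k-1)
    map_h, map_w = height // scale, width // scale
    return patch, (map_h // patch) * (map_w // patch)


for k in range(1, 5):
    patch, tokens = patch_grid(224, 224, 16, k)
    print(f"level {k}: patch {patch}x{patch}, tokens {tokens}")
```

With these assumed sizes every level yields the same number of tokens, which is what allows the token matrices of the four levels to be concatenated along the channel axis.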

2) Multihead Channel Self-Attention:

We then linearly map $\textbf {T}_{k}$ to the query vector ${\textbf {Q}_{k}}\in {\mathbb {R}^{d \times {C_{k}}}}$, and the channel-wise concatenation $\textbf {T}_{\Sigma }$ of all token matrices to the key vector $\textbf {K} \in {\mathbb {R}^{d \times {C_{\Sigma } }}}$ and value vector $\textbf {V} \in {\mathbb {R}^{d \times {C_{\Sigma } }}}$. This is expressed as

$\begin{equation*} {\textbf {Q}_{k}} = {\textbf {T}_{k}}{\textbf {W}_{Q_{k}}}, \quad \textbf {K} = {\textbf {T}_{\Sigma }}{\textbf {W}_{K}}, \quad \textbf {V} = {\textbf {T}_{\Sigma }}{\textbf {W}_{V}}, \tag {4}\end{equation*}$

where ${\textbf {W}_{Q_{k}}} \in {\mathbb {R}^{d \times {C_{k}}}}$, ${\textbf {W}_{K}} \in {\mathbb {R}^{d \times {C_{\Sigma } }}}$, and ${\textbf {W}_{V}} \in {\mathbb {R}^{d \times {C_{\Sigma } }}}$ are the weights for $\textbf {Q}_{k}$, K, and V, respectively. Additionally, ${C_{\Sigma } } = \mathrm {Concat}({C_{1}},{C_{2}},{C_{3}},{C_{4}})$ and ${\textbf {T}_{\Sigma } } = \mathrm {Concat} ({\textbf {T}_{1}},{\textbf {T}_{2}},{\textbf {T}_{3}},{\textbf {T}_{4}})$.

After LN, $\textbf {Q}_{k}$, K, and V are passed to the channel self-attention (CSA) for attention calculation. The output $\textbf {CSA}_{k}$ is defined as

$\begin{equation*} {\textbf {CSA}_{k}} = \mathrm {SoftMax}\left ({\mathrm {SN}\left ({\frac {\textbf {Q}_{k}^{\mathrm {T}}\textbf {K}}{\sqrt {C_{\Sigma }}}}\right )}\right ){\textbf {V}^{\mathrm {T}}}, \tag {5}\end{equation*}$

where $\mathrm {SN}(\cdot)$ denotes switch normalization (SN) [43], and $\textbf {Q}_{k}^{\mathrm {T}}$ and $\textbf {V}^{\mathrm {T}}$ are the transposes of $\textbf {Q}_{k}$ and V, respectively.

MCSA is formed by combining multiple CSA heads, which can be computed in parallel. With m heads, the output $\textbf {MCSA}_{k}$ is computed as

$\begin{equation*} {\textbf {MCSA}_{k}} = \frac {\textbf {CSA}_{k}^{1} + \textbf {CSA}_{k}^{2} + \cdots + \textbf {CSA}_{k}^{m}}{m} + {\textbf {T}_{k}}. \tag {6}\end{equation*}$

Then, applying the IMLP and residual operator, we derive the final output feature ${\textbf {O}_{k}} \in {\mathbb {R}^{d \times {C_{k}}}}$ as

$\begin{equation*} {\textbf {O}_{k}} = \mathrm {IMLP}\left ({\mathrm {LN}\left ({\textbf {MCSA}_{k}}\right )}\right ) + \mathrm {LN}\left ({\textbf {MCSA}_{k}}\right ). \tag {7}\end{equation*}$

By repeating Eqs. (5)–(7) L times, L MCFT blocks can be assembled. This work assigns a different number of MCFT blocks to each skip connection, corresponding to the severity of the semantic gap.
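
The channel attention of Eqs. (4)–(6) can be illustrated with a minimal NumPy sketch. Several choices here are assumptions rather than details from the paper: the projection weights are taken as square channel-mixing matrices so that the products in Eq. (4) are well defined, switch normalization is replaced by a plain standardization placeholder, the layer normalization before attention is omitted, the $(C_{k}, d)$ result of Eq. (5) is transposed back so that the residual with $\textbf {T}_{k}$ in Eq. (6) lines up, and all sizes are toy values.

```python
import numpy as np


def standardize(x, eps=1e-5):
    """Placeholder for switch normalization: zero mean, unit variance over the matrix."""
    return (x - x.mean()) / (x.std() + eps)


def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def csa_head(t_k, t_sigma, w_q, w_k, w_v):
    """One channel self-attention head, following Eqs. (4) and (5).

    t_k: (d, C_k) tokens of level k; t_sigma: (d, C_sigma) concatenated tokens.
    Returns a (d, C_k) matrix (the (C_k, d) product of Eq. (5), transposed back).
    """
    q = t_k @ w_q                                   # (d, C_k)
    k = t_sigma @ w_k                               # (d, C_sigma)
    v = t_sigma @ w_v                               # (d, C_sigma)
    scores = q.T @ k / np.sqrt(t_sigma.shape[1])    # (C_k, C_sigma) channel similarities
    attn = softmax(standardize(scores), axis=-1)    # each row sums to 1
    return (attn @ v.T).T                           # (C_k, d) -> (d, C_k)


def mcsa(t_k, t_sigma, heads):
    """Average the heads and add the residual T_k, as in Eq. (6)."""
    mean_head = sum(csa_head(t_k, t_sigma, *w) for w in heads) / len(heads)
    return mean_head + t_k


rng = np.random.default_rng(0)
d, channels = 16, [4, 8, 16, 32]                    # toy sizes, not the paper's
tokens = [rng.standard_normal((d, c)) for c in channels]
t_sigma = np.concatenate(tokens, axis=1)            # (16, 60)
c_k, c_sigma = channels[0], sum(channels)
heads = [
    (rng.standard_normal((c_k, c_k)),
     rng.standard_normal((c_sigma, c_sigma)),
     rng.standard_normal((c_sigma, c_sigma)))
    for _ in range(2)                               # m = 2 heads
]
mcsa_1 = mcsa(tokens[0], t_sigma, heads)
print(mcsa_1.shape)                                 # (16, 4)
```

Note that the attention matrix has shape $(C_{k}, C_{\Sigma })$: similarities are computed between channels rather than between spatial positions, which is the sense in which the attention is channel-wise.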

D. Improved Multilayer Perceptron

The original multilayer perceptron (MLP) used in the Transformer structure employs a two-layer linear mapping mechanism, which may not fully capture the complex shape features present in medical images. To address this limitation, we have developed an improved multilayer perceptron (IMLP) with three linear layers, incorporating batch normalization (BN) in the middle layer and a dropout layer, as shown in Fig. 4.

Fig. 4. Comparison between the improved MLP and the original MLP.

The feature sequence from the MCSA is first processed by a linear layer, transforming it into a 784-dimensional feature sequence. After undergoing BN and Gaussian Error Linear Unit (GELU) activation, the sequence enters the next linear layer, which retains the same dimensionality. The sequence then passes through another round of BN and GELU activation before entering the third linear layer, which reduces its dimensionality back to the original size. Following a dropout layer, the final output provides a comprehensive representation of the complex semantic information in the medical image.
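
The layer order described above can be sketched as a forward pass in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: biases and the learnable BN scale and shift are omitted, GELU uses the common tanh approximation, the weights are random, and the input width of 196 is an arbitrary choice. Only the hidden width of 784 and the order of operations come from the text.

```python
import numpy as np


def gelu(x):
    """Tanh approximation of the GELU activation."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))


def batch_norm(x, eps=1e-5):
    """Normalize each feature column over the rows (no learnable scale/shift)."""
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)


def imlp(x, w1, w2, w3, drop_prob=0.0, rng=None):
    """Linear -> BN -> GELU -> Linear -> BN -> GELU -> Linear -> Dropout."""
    h = gelu(batch_norm(x @ w1))       # expand to the hidden width
    h = gelu(batch_norm(h @ w2))       # same width
    y = h @ w3                         # project back to the input width
    if drop_prob > 0.0:
        rng = np.random.default_rng() if rng is None else rng
        keep = rng.random(y.shape) >= drop_prob
        y = y * keep / (1.0 - drop_prob)   # inverted dropout (training mode)
    return y


rng = np.random.default_rng(0)
in_dim, hidden = 196, 784              # 784 is from the text; 196 is assumed
x = rng.standard_normal((64, in_dim))  # 64 rows of toy features
w1 = 0.02 * rng.standard_normal((in_dim, hidden))
w2 = 0.02 * rng.standard_normal((hidden, hidden))
w3 = 0.02 * rng.standard_normal((hidden, in_dim))

y = imlp(x, w1, w2, w3)
y_drop = imlp(x, w1, w2, w3, drop_prob=0.5, rng=rng)
print(y.shape, y_drop.shape)
```

The output has the same shape as the input, which is what allows the residual addition in Eq. (7).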

E. Spatial Channel Cross-Attention

In the architecture shown in Fig. 3, we design a spatial channel cross-attention (SCCA) module to fuse features from the USC module with those of the corresponding decoder layer. Specifically, the decoder output features $\textbf {D}_{k}$ are first upscaled using a convolution operation and then passed through a feature embedding layer to obtain the two-dimensional patch token ${\textbf {T}_{k}} \in {\mathbb {R}^{d \times {C_{k}}}}$, which is then linearly mapped to the query vector ${\textbf {Q}_{k}} \in {\mathbb {R}^{d \times {C_{k}}}}$ using Eq. (4). Simultaneously, the features $\textbf {O}_{k}$ from the USC module are linearly mapped to the key vector $\textbf {K} \in {\mathbb {R}^{d \times {C_{k}}}}$ and the value vector $\textbf {V} \in {\mathbb {R}^{d \times {C_{k}}}}$ through Eq. (4). The query vector $\textbf {Q}_{k}$ and the value vector V are then transposed to obtain $\textbf {Q}^{\mathrm {T}}_{k}$ and $\textbf {V}^{\mathrm {T}}$, respectively, swapping the spatial and channel dimensions of the feature maps. Next, $\textbf {Q}^{\mathrm {T}}_{k}$, K, and $\textbf {V}^{\mathrm {T}}$ are used in Eq. (8) for the cross-attention computation. The resulting output feature map $\textbf {M}_{k}$ is then reconstructed and multiplied with $\textbf {D}_{k}$ to update $\textbf {D}_{k}$, which completes the fusion of information from the USC module and the corresponding decoder layer.

$\begin{equation*} \textbf {M}_{k} = \mathrm {SoftMax}\left ({\mathrm {SN}\left ({\frac {\textbf {Q}^{\mathrm {T}}_{k}\textbf {K}}{\sqrt {C_{k}}}}\right )}\right )\textbf {V}^{\mathrm {T}}. \tag {8}\end{equation*}$
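
The shape bookkeeping of Eq. (8) can be checked with a short NumPy sketch. As assumptions of this illustration, the projection weights are square channel-mixing matrices, switch normalization is replaced by a plain standardization placeholder, and the "reconstruct and multiply" step is approximated by an element-wise product in token space; the function name `scca` and all sizes are hypothetical.

```python
import numpy as np


def scca(dec_tokens, usc_tokens, w_q, w_k, w_v, eps=1e-5):
    """Cross-attention of Eq. (8) followed by a multiplicative update.

    dec_tokens: (d, C_k) tokens embedded from the decoder feature D_k.
    usc_tokens: (d, C_k) USC output O_k.
    Returns the updated decoder tokens, shape (d, C_k).
    """
    q = dec_tokens @ w_q                                   # query from the decoder
    k = usc_tokens @ w_k                                   # key from the USC output
    v = usc_tokens @ w_v                                   # value from the USC output
    scores = q.T @ k / np.sqrt(dec_tokens.shape[1])        # (C_k, C_k)
    scores = (scores - scores.mean()) / (scores.std() + eps)  # placeholder for SN
    scores = scores - scores.max(axis=-1, keepdims=True)   # for numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)               # row-wise softmax
    m_k = attn @ v.T                                       # (C_k, d) matching features
    # Assumed stand-in for "reconstruct M_k and multiply with D_k":
    return dec_tokens * m_k.T


rng = np.random.default_rng(0)
d, c_k = 16, 4                                             # toy sizes
dec = rng.standard_normal((d, c_k))
usc = rng.standard_normal((d, c_k))
w_q, w_k, w_v = (rng.standard_normal((c_k, c_k)) for _ in range(3))
updated = scca(dec, usc, w_q, w_k, w_v)
print(updated.shape)                                       # (16, 4)
```

Unlike Eq. (5), the keys and values here come from a different source (the USC output) than the queries (the decoder), so the $(C_{k}, C_{k})$ attention matrix relates decoder channels to USC channels.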

F. Computational Complexity

This study designs the USC and SCCA modules based on the self-attention mechanism. Both modules employ softmax attention as described in Eq. (5) and Eq. (8), which involves matrix operations to compute the similarity between all Q-K pairs, resulting in a complexity of $\mathrm {O}({N_{k}^{2}}\cdot d)$ . In the U-Net architecture, convolution operations are the primary source of computational complexity, which can be expressed as:

$\begin{equation*} \mathrm {O}(H \times W \times {\kappa ^{2}} \times 2C_{in} \times C_{out}), \tag {9}\end{equation*}$

where the number of input image channels is $C_{in}=3$ and the convolution kernel size is $\kappa =3$. The output feature map channels $C_{out}$ are 64, 128, 256, and 512, respectively. Given that $C_{in}$ and $\kappa$ are relatively small, the complexity can be simplified to $\mathrm {O}(N \cdot C_{out})$, with $N = H \times W$. Therefore, the computational complexity of USCT-UNet is approximately $\mathrm {O}({N_{k}^{2}}\cdot d + N \cdot C_{out})$. Although this is higher than that of U-Net, attention is computed on the output feature maps of the encoder and decoder layers rather than on the original input image, so the increase in complexity is not substantial.
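
To get a feel for the magnitudes involved, the two complexity terms can be evaluated numerically. The sketch below simply plugs numbers into the expressions inside the $\mathrm {O}(\cdot)$ notation; the input size 224, patch size 16, and token dimension 64 are hypothetical values chosen for illustration, and the function names are not from the paper.

```python
def conv_cost(height, width, kernel, c_in, c_out):
    """Term inside the O(.) of Eq. (9)."""
    return height * width * kernel ** 2 * 2 * c_in * c_out


def attention_cost(num_tokens, dim):
    """Term inside O(N_k^2 * d) for softmax attention over `num_tokens` tokens."""
    return num_tokens ** 2 * dim


H = W = 224          # hypothetical input size
PATCH = 16           # hypothetical patch size
DIM = 64             # hypothetical token dimension

for c_out in (64, 128, 256, 512):
    print(f"C_out={c_out:3d}  conv term={conv_cost(H, W, 3, 3, c_out):,}")

tokens = (H // PATCH) * (W // PATCH)
on_tokens = attention_cost(tokens, DIM)
on_pixels = attention_cost(H * W, DIM)
print(f"attention on {tokens} tokens: {on_tokens:,}")
print(f"attention on {H * W} pixels: {on_pixels:,}")
print(f"ratio: {on_pixels // on_tokens}")
```

Because the attention term is quadratic in the number of positions, attending over a few hundred tokens instead of every pixel reduces that term by several orders of magnitude under these assumed sizes.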

SECTION V.

Experiments and Analysis of Results

A. K-Fold Cross-Validation

To rigorously evaluate the segmentation precision of U-Net and our proposed USCT-UNet, we conducted 5-fold cross-validation experiments on four distinct datasets. The results, summarized in TABLE V, indicate that USCT-UNet consistently outperforms U-Net across all datasets. For instance, on the ISIC-2018 dataset, USCT-UNet achieved a 4.79% improvement in Dice, a 5.70% improvement in MIoU, and a reduction of 2.15 in HD. Similarly, on the GlaS dataset, USCT-UNet achieved a 2.24% improvement in Dice, a 3.46% improvement in MIoU, and a reduction of 3.26 in HD. USCT-UNet also showed substantial improvements in these metrics on the DSB-2018 and MoNuSeg datasets.
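
For readers unfamiliar with the overlap metrics, the sketch below computes the Dice coefficient and intersection over union for a single pair of binary masks using their standard definitions. It is illustrative only: the function name and the toy masks are hypothetical, the convention for two empty masks is an assumption, and the averaging used for MIoU as well as the Hausdorff distance are not shown.

```python
def dice_and_iou(pred, target):
    """Dice and IoU for two equal-length sequences of 0/1 labels (flattened masks)."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    pred_sum = sum(1 for p in pred if p)
    target_sum = sum(1 for t in target if t)
    union = pred_sum + target_sum - inter
    if union == 0:
        # Both masks empty: treated as perfect agreement (a convention, assumed here).
        return 1.0, 1.0
    dice = 2 * inter / (pred_sum + target_sum)
    iou = inter / union
    return dice, iou


pred = [1, 1, 1, 0, 0, 0, 1, 0]     # toy predicted mask
target = [1, 1, 0, 0, 0, 1, 1, 0]   # toy ground-truth mask
dice, iou = dice_and_iou(pred, target)
print(f"Dice = {dice:.2f}, IoU = {iou:.2f}")   # Dice = 0.75, IoU = 0.60
```

Both metrics lie in [0, 1] with higher values indicating better overlap, whereas HD is a distance for which lower values are better.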

TABLE V Results of 5-Fold Cross-Validation

B. Ablation Studies

We conducted an ablation study on four challenging datasets, with the results presented in TABLE VI. In these experiments, we systematically evaluated the effects of the USC module, the SCCA module, and their combination on model performance. The results demonstrate that the USC module significantly enhances model performance across all four datasets. For instance, on the ISIC-2018 dataset, the Dice for the UNet+USC model improves to 87.79%, which is 3.92% higher than the 83.87% of the UNet. On the DSB-2018 dataset, the MIoU for the UNet+USC model improves to 82.08%, which is 2.85% higher than the 79.23% of the UNet. Similarly, incorporating the SCCA module substantially improves model performance. For example, on the MoNuSeg dataset, the MIoU for the UNet+SCCA model reaches 65.81%, which is 3.48% higher than the 62.33% of the UNet. On the GlaS dataset, the HD for the UNet+USC model decreases to 26.49, which is 1.60 lower than the 28.09 of the UNet.
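
The differences quoted above can be re-derived from the reported values. The snippet below uses only numbers stated in this paragraph; the helper name `improvement` is hypothetical, and results are rounded to two decimals to avoid floating-point noise.

```python
# (dataset, metric, model, baseline U-Net value, model value) as quoted in the text
reported = [
    ("ISIC-2018", "Dice", "UNet+USC", 83.87, 87.79),
    ("DSB-2018", "MIoU", "UNet+USC", 79.23, 82.08),
    ("MoNuSeg", "MIoU", "UNet+SCCA", 62.33, 65.81),
    ("GlaS", "HD", "UNet+USC", 28.09, 26.49),
]


def improvement(metric, baseline, value):
    """Positive result means better; HD is a distance, so smaller is better."""
    delta = baseline - value if metric == "HD" else value - baseline
    return round(delta, 2)


gains = [improvement(metric, base, val) for _, metric, _, base, val in reported]
for (dataset, metric, model, _, _), gain in zip(reported, gains):
    print(f"{dataset:10s} {metric:4s} {model:10s} improvement = {gain:.2f}")
```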

TABLE VI Results of the Ablation Experiments to Assess the Contribution of USCT-UNet’s Components on Four Datasets

The model’s performance reaches its optimal level when both the USC and SCCA modules are combined. For instance, on the GlaS dataset, the Dice and MIoU for the UNet+USC+SCCA model improve to 91.16% and 84.44%, respectively. On the DSB-2018 dataset, the Dice and MIoU rise to 89.12% and 83.15%, respectively. These enhancements demonstrate that the USC module effectively mitigates the negative impact of the semantic gap, while the SCCA module proficiently integrates features from both the USC and the decoder. However, these performance gains come at the cost of increased computational complexity: the parameter count and FLOPs of the combined model increase to 31.15M and 62.05G, respectively, which may affect the model’s real-time performance and deployment efficiency.

We further investigate the role of the USC module in USCT-UNet, specifically how it addresses the semantic gap between the decoder and encoder. This section includes a detailed ablation study of the USC scheme on the four datasets, with results provided in Fig. 5. The Dice, MIoU, and HD results across the four datasets demonstrate that the USC scheme significantly enhances the model’s segmentation performance. The improvements are particularly pronounced on more challenging datasets, such as MoNuSeg and ISIC-2018, due to the morphological diversity and complexity of the target regions in these datasets. These factors make it more difficult for segmentation models to capture accurate semantic information, whereas our USC scheme effectively overcomes this challenge.

Fig. 5. Results of the ablation experiments to assess the contribution of the USC options across multiple assessment metrics.

The effectiveness of the proposed USC module in addressing the semantic gap between different layers of the encoder and decoder has been quantified through the data presented in TABLEs VII, VIII, and IX. In these TABLEs, “+” indicates improvement. These results demonstrate significant improvements in the model’s ability to handle semantic mismatches across layers. The findings reveal three significant improvements facilitated by the USC scheme compared to the DSC scheme (as detailed in TABLEs II, III, and IV).

1) Across the four datasets, all variations of the USC scheme positively affect the segmentation process. This underscores the efficacy of USCT-UNet in bridging the semantic gap between corresponding layers of the decoder and encoder. For instance, on the DSB-2018 dataset, compared to the configuration without skip connections (“None”), “SC1,” “SC2,” “SC3,” and “SC4” increased Dice by 1.79%, 1.62%, 1.02%, and 1.87%, respectively. On the ISIC-2018 dataset, they enhanced MIoU by 2.48%, 2.19%, 2.15%, and 2.34%, respectively.
2) The USC module eliminates the adverse impacts associated with the direct skip connection (DSC) while maximizing its benefits. For example, on the MoNuSeg dataset, moving from “SC4” to “SC4+SC1,” “SC4+SC1+SC2,” and “SC4+SC1+SC2+SC3” raised the Dice improvement from 1.42% to 3.28%, 4.07%, and 5.60%, respectively. On the GlaS dataset, the same configurations changed the HD improvement from 1.16 to 1.01, 1.07, and 1.47, respectively.
3) The USC module significantly improved the model’s segmentation performance. Specifically, on the GlaS dataset, USCT-UNet achieved a Dice of 91.16%, an MIoU of 84.44%, and an HD of 24.83. On the DSB-2018 dataset, the model achieved a Dice of 89.12%, an MIoU of 83.15%, and an HD of 15.11. These results underscore the effectiveness of the USC scheme in bridging the semantic gap between the decoder and encoder.

TABLE VII Results of the Ablation Experiments to Assess the Contribution of the USC Options on Dice Metric

TABLE VIII Results of the Ablation Experiments to Assess the Contribution of the USC Options on MIoU Metric

TABLE IX Results of the Ablation Experiments to Assess the Contribution of the USC Options on HD Metric

C. Segmentation Results Visualisation

In this section, we present visual comparisons of the segmentation results produced by the proposed USCT-UNet and benchmark models to demonstrate USCT-UNet’s superiority in medical image segmentation tasks. These results are depicted in Fig. 6, where the red boxes indicate areas where USCT-UNet outperforms the other models. USCT-UNet delivers segmentation outcomes that resemble the ground truth more closely than those of U-Net and UCTransNet. As illustrated in Fig. 6, USCT-UNet accurately highlights the correct pathological regions while suppressing false positives and delineates continuous boundary edges, demonstrating its effectiveness in precise and robust segmentation.

Fig. 6. Qualitative comparative results on four challenging medical image databases.

D. Comparison With State-of-the-Art Related Methods

In this subsection, we comprehensively analyze the segmentation performance, parameters, and computational complexity of USCT-UNet across four datasets. We compare its performance with the latest CNN and Transformer-based methods, as detailed in TABLEs X and XI. The results indicate that USCT-UNet achieves the highest Dice and MIoU scores on every dataset: 89.12% and 83.15% on DSB-2018, 88.66% and 81.62% on ISIC-2018, 91.16% and 84.44% on GlaS, and 80.07% and 66.99% on MoNuSeg. Additionally, compared to methods such as HRNetV2 [44], TransFuse [49], UDTransNet [53], UCTransNet [26], and LCAMix [52], our USCT-UNet demonstrates design efficiency by significantly improving precision while maintaining a low parameter count (31.15M) and computational cost (62.05G FLOPs), highlighting its potential for broad clinical deployment and application.

TABLE X Comparison With the Latest Relevant Methods on DSB-2018 and ISIC-2018

TABLE XI Comparison With the Latest Relevant Methods on GlaS and MoNuSeg

SECTION VI.

Conclusion

Precise and automated segmentation of medical images is a pivotal component in clinical diagnostics and rehabilitation monitoring. This study systematically evaluates the semantic gap between corresponding layers of the U-Net architecture, identifying a disparity in the magnitude of this gap across various layers, with more pronounced gaps in the upper layers and less severe ones in the deeper layers. This semantic gap negatively impacts the segmentation process when using direct skip connections (DSC).

To address this issue and ensure reliable and accurate automated segmentation of medical images, we propose the USCT-UNet architecture, which integrates U-Net with U-shaped skip connections (USC) and a spatial channel cross-attention (SCCA) module. The USC, constructed using multichannel fusion Transformer blocks, replaces the direct skip connections (DSC). The SCCA module, developed using self-attention mechanisms, facilitates the fusion of the decoder’s output features with those from the USC. Experimental results confirm that the proposed method effectively eliminates the semantic gap between the decoder and encoder, significantly enhancing state-of-the-art medical image segmentation performance on several benchmark datasets.

Despite the excellent segmentation performance achieved by USCT-UNet, there are two potential areas for improvement: 1) Segmenting a 3D image requires converting it from 3D to 2D, which may result in the loss of correlation information between slices. 2) The inclusion of the USC and SCCA modules significantly increases computational complexity. Therefore, future enhancements to the USCT-UNet architecture will focus on constructing 3D models and designing an attention mechanism with linear computational complexity.
