Introduction

Tongue diagnosis is one of the most important diagnostic methods in Traditional Chinese Medicine (TCM) and serves as a non-invasive and convenient assessment of human health1. TCM posits that the tongue is closely linked to an individual’s health. When there are abnormalities or diseases in the body, the characteristics of the tongue undergo significant changes. Doctors evaluate the severity of the condition by analyzing changes in the patient’s tongue features and making corresponding diagnoses2. The tip of the tongue corresponds to the lungs, shoulders, heart, and brain, while the body of the tongue reflects the spleen and stomach; the sides relate to the liver and gallbladder, and the back corresponds to the kidneys and intestines3. An essential step in TCM diagnosis is examining the tongue and tongue coating. Different tongue characteristics correspond to varying states of the body. For instance, a red tongue with pronounced teeth marks may indicate excessive dampness within the body. If the tongue surface exhibits fissures of varying depth and shape, it is termed a “fissure tongue,” often a symptom of yin deficiency and excessive heat.

Tongue fissures and teeth marks are two common tongue features in TCM diagnosis4, holding significant clinical diagnostic value. Tongue fissures typically refer to longitudinal or transverse fissures on the tongue surface, often related to insufficient qi and blood or dysfunction of internal organs. Based on the shape and distribution of these fissures, TCM practitioners can infer the patient’s organ conditions and overall health. Teeth marks appear as indentations along the tongue’s edges, usually associated with dysfunction of the spleen and stomach or qi deficiency5. By analyzing tongue fissures and teeth marks, TCM practitioners can make more accurate assessments of diseases and formulate tailored treatment plans, thus enhancing the comprehensiveness and personalization of diagnoses6.

From the above, it is evident that effectively detecting the status of tongue teeth marks and fissures can help anticipate a patient’s constitution, thereby improving the prediction of potential diseases. Utilizing computer vision and neural networks to automatically extract features related to tongue teeth marks and fissures can assist doctors in accurately assessing patients’ physiological conditions. Providing precise feature predictions can furnish doctors with comprehensive reference information. Additionally, such tools can be integrated into applications or mini-programs to assist patients in self-assessment, thus allowing for proactive avoidance of health risks.

Tang et al.7 proposed using a cascaded Convolutional Neural Network (CNN) for tongue image segmentation and landmark prediction, combined with Deformable Convolutions (DCN) for fine-grained classification. Although this improved detection accuracy, it increased model complexity and made hyperparameter tuning sensitive. Lo et al.8 introduced a system using HSI color space transformation, polar coordinate transformation, and active contour models for tongue image segmentation and analysis of tongue surface features. This system can accurately detect features such as fissures and teeth marks but is limited by lighting conditions and dataset size. Wu et al.9 proposed a multimodal intelligent diagnostic model (MMTV) based on EEG and tongue images for assisting depression diagnosis, incorporating a dual-stream input mechanism and self-attention into EEGNet, as well as a multi-step pre-training approach for the ViT model, all of which enhance feature extraction capabilities; however, it requires substantial computational resources, which may limit its application in resource-constrained environments. Ni et al.10 presented an improved capsule network model (TongueCaps) for tongue color classification in TCM, though capsule networks are generally considered less interpretable than CNNs, potentially affecting their transparency and reliability in clinical applications. Wang et al.11 achieved over 90% accuracy in identifying teeth-marked tongues using ResNet34, although the study noted a relatively small sample size and potential class imbalance, which could affect model performance. Li et al.12 employed multi-instance learning and CNN feature recognition for teeth-marked tongues, with experimental results indicating reasonable accuracy, but the specifics of dataset size, diversity, and experimental settings warrant further consideration.

In recent years, with the development of object detection algorithms, the YOLO series has gained popularity in various fields due to its lightweight, fast, and efficient design. The introduction of deep learning has significantly enhanced object detection performance, particularly through the excellent image feature extraction capabilities of Convolutional Neural Networks (CNNs). Deep learning-based object detection methods can be categorized into two main types: two-stage methods and single-stage methods. Two-stage methods, represented by the R-CNN13 series (including R-CNN, Fast R-CNN14, and Mask R-CNN15), rely on stepwise processing for candidate region generation and classification, offering high accuracy but with greater computational overhead. In contrast, single-stage methods such as YOLO (from YOLOv117 through YOLOv818) and SSD16 directly output bounding boxes and classes through a single network prediction, balancing speed and accuracy and making them suitable for real-time detection scenarios. Among single-stage methods, the YOLO (You Only Look Once) series stands out for its simplified structure and efficient detection speed. YOLOv1 introduced an innovative framework that reformulates object detection as a single regression problem; while fast, its accuracy was somewhat lacking. With successive iterations, YOLOv2, YOLOv3, and YOLOv4 gradually improved accuracy and robustness while maintaining high-speed detection. More recent versions, such as YOLOv519, YOLOv7, and YOLOv8, further optimized the model architecture, achieving a better balance between speed and accuracy and making them applicable to a wider range of scenarios. SSD (Single Shot Multibox Detector), another single-stage method, predicts over multi-scale feature maps, balancing speed and accuracy and making it particularly suited to real-time applications. Additionally, RetinaNet introduces Focal Loss to address the imbalance between positive and negative samples, effectively enhancing detection performance for small objects and making it suitable for large-scale datasets.

In summary, for tongue feature detection, deploying a lightweight and fast model on hardware is essential. Although two-stage object detection algorithms offer high accuracy, they require more time and resources. Single-stage object detection algorithms still have room for improvement in accuracy. This paper utilizes an improved YOLOv8 to recognize tongue teeth marks and fissures, ensuring enhanced accuracy while maintaining high speed to meet the demands of the application scenario.

The main contributions of this paper are as follows:

1. Development of an Improved Tongue Diagnosis Image Detection Model Based on YOLOv8: Building upon the traditional YOLOv8 model, this paper proposes a novel improved model specifically designed for tongue diagnosis image detection. By optimizing the model architecture, particularly for the subtle features of the tongue and complex backgrounds, we enhance the model's detection capability.

2. C2f_DCNv3 and SE Attention Mechanism: This study innovatively integrates Deformable Convolution (DCN)20 and the Squeeze-and-Excitation (SE)21 attention mechanism into the YOLOv8 model. Deformable convolution allows adaptive adjustment of the convolution kernel's sampling positions, enabling more precise capture of irregular features such as tongue fissures and teeth marks. The C2f_DCNv3 module, through weighted fusion within the module, not only retains the feature-capturing capability of DCNv3 but also significantly reduces the model's parameter count. The SE attention mechanism dynamically adjusts the weights of feature channels, enhancing the model's focus on critical features. The combination of C2f_DCNv3 and SE leads to a substantial improvement in detection accuracy and efficiency.

3. Dataset and Model Performance: The dataset used for this model was annotated by professional practitioners at the Shanxi University of Traditional Chinese Medicine Hospital. Experimental results show that the improved model reduces the computational cost from the original YOLOv8's 28.8 GFLOPs to 22.91 GFLOPs, a roughly 20% reduction, while maintaining high performance and significantly enhancing computational efficiency. Compared to the original YOLOv8 model, our proposed improved model demonstrates a significant performance increase in detecting tongue fissures and teeth marks, achieving a total mean Average Precision (mAP) of 92.77%. Specifically, the mAP for tongue fissures reached 91.34%, a 5.24 percentage point improvement over the original model, with enhancements in F1 score, precision, and recall. The mAP for tongue teeth marks increased to 94.21%. These experiments validate the effectiveness of the improved model in complex tongue feature detection.

YOLOv8 algorithm

The YOLOv8 architecture continues the design philosophy of YOLOv5 and incorporates optimizations and improvements in various aspects. Despite its impressive performance in many applications, YOLOv8 still faces limitations when dealing with complex backgrounds and irregular features. Traditional convolution operations extract features within a fixed receptive field, which may lack the adaptability needed for complex or deformed targets like tongue fissures and teeth marks. Consequently, YOLOv8 may experience feature loss or false detections in such cases, leading to decreased accuracy. Furthermore, when processing highly textured or complex background medical images, background interference can adversely affect precise target recognition.

Therefore, improving the YOLOv8 model to enhance its detection capabilities for irregular and complex targets has become a crucial research direction for enhancing its performance in medical imaging applications. This paper further optimizes the multi-scale feature fusion in the Neck module, enabling the model to better handle targets of varying sizes, particularly small pathological features in complex backgrounds. Through these improvements, YOLOv8 significantly enhances its ability to identify fine-grained targets while maintaining efficient detection speeds.

Improved YOLOv8 network

YOLOv8 improvements

In the improved YOLOv8 model, as shown in Fig. 1, the original C2f module has been replaced with the lightweight C2f_DCNv3 module. Compared to the original C2f module, C2f_DCNv3 significantly reduces both the number of parameters and computational costs while expanding the receptive field for spatial information, further enhancing the model's detection capabilities. The C2f_DCNv3 module incorporates Deformable Convolution (DCN), which strengthens the model's ability to capture complex and irregular fine features.

Figure 1. yolov8-our.

Additionally, the Squeeze-and-Excitation (SE) attention mechanism has been integrated between the Backbone and Neck frameworks. This not only retains the Backbone’s effective feature extraction but also enhances the model’s focus on critical features within the original image, improving feature weighting capabilities. The SE mechanism adaptively adjusts the feature weights across different channels, enabling the model to maintain high-precision detection of targets even in complex backgrounds.

This improvement, combining the spatial deformation capability of DCNv3 with the channel attention of the SE mechanism, significantly enhances the accuracy and robustness of the YOLOv8 model in tongue image detection tasks. It provides stronger support for detecting complex morphological features such as tongue indentations and fissures, which are crucial in Traditional Chinese Medicine (TCM) tongue diagnosis.

By optimizing the network architecture, the improved YOLOv8 model ensures high precision in specific detection tasks. Particularly in the complex scenario of tongue diagnosis, YOLOv8’s efficient feature extraction and target localization capabilities enable accurate identification of fine features on the tongue’s surface. The enhanced model not only offers rapid and precise detection capabilities but also meets the real-time detection requirements in TCM diagnostics, providing robust support for intelligent tongue diagnosis.

Deformable convolution DCNv3

Traditional convolution uses fixed convolutional kernels, as shown in Fig. 2a, which limits the ability to model geometric transformations such as object deformations. Since such deformations are common in tasks like object detection and medical image segmentation, we introduce deformable convolution, as illustrated in Fig. 2b,c. This enhancement aims to improve the accuracy of detecting features like tongue indentations and fissures. By allowing the convolutional kernel to adaptively adjust its sampling positions, DCNv3 effectively captures irregular patterns and complex structures, leading to better performance in identifying subtle features in tongue images.

Figure 2. DCNv3.

Standard convolution samples input features at fixed positions, as shown in Eq. (1). In contrast, Deformable Convolution Networks (DCN) introduce learnable offsets, as described in Eq. (2), allowing the network to dynamically adapt to the spatial structures of objects. These offsets, as described in Eqs. (3) and (4), are predicted from the input feature map and applied during the convolution operation, enabling the model to focus on regions of interest that may not align with the regular grid; in Eq. (3), \(G(q,p)\) is the bilinear interpolation kernel used to sample the feature map at the fractional position p, with q ranging over the integral spatial locations around p. By incorporating spatial flexibility, DCNv3 can learn more complex geometric variations and object shapes, making it particularly effective for tasks involving irregular or deformable objects, such as indentations and fissures. Compared to DCNv122 and DCNv223, DCNv3 is further optimized for prediction accuracy and computational efficiency. Thus, incorporating DCNv3 into the YOLOv8 model for detecting tongue features like indentations and fissures can significantly improve the model's performance. Given that these features often exhibit substantial morphological variability, DCNv3 allows for more flexible capture of these changes, enhancing the model's generalization ability and diagnostic accuracy.

Regular conv

$$\begin{aligned} y(p_{0})&= \sum _{p_{n}\in R } w(p_{n}) \cdot x(p_{0}+p_{n}) \end{aligned}$$
(1)

Deformable conv

$$\begin{aligned} y(p_{0})&= \sum _{p_{n}\in R } w(p_{n})\cdot x(p_{0}+p_{n}+\Delta p_{n}) \end{aligned}$$
(2)
$$\begin{aligned} x(p)&= \sum _{q} G(q,p)\cdot x(q) \end{aligned}$$
(3)
$$\begin{aligned} p&= p_{0}+p_{n}+\Delta p_{n} \end{aligned}$$
(4)
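To make Eqs. (1)-(4) concrete, the following minimal sketch contrasts regular and deformable sampling using torchvision's deform_conv2d (a DCNv2-style operator that shares the same offset mechanism; DCNv3 adds grouped, normalized modulation on top). The tensors and shapes here are illustrative assumptions, not the paper's implementation.

```python
import torch
from torchvision.ops import deform_conv2d

x = torch.randn(1, 16, 32, 32)          # input feature map x
weight = torch.randn(16, 16, 3, 3)      # w(p_n) for a 3x3 kernel
# One (dy, dx) pair per kernel position: 2 * 3 * 3 = 18 offset channels.
offset = torch.randn(1, 18, 32, 32)     # learnable offsets, the Delta p_n of Eq. (2)

# Standard convolution (Eq. 1) corresponds to all-zero offsets.
y_regular = deform_conv2d(x, torch.zeros_like(offset), weight, padding=1)
# Deformable convolution (Eq. 2) samples at p_0 + p_n + Delta p_n; the
# fractional positions are resolved by bilinear interpolation (Eqs. 3-4).
y_deform = deform_conv2d(x, offset, weight, padding=1)
print(y_regular.shape, y_deform.shape)  # both torch.Size([1, 16, 32, 32])
```

In a trained network the offset tensor is itself produced by a small convolution over the input features, so the sampling grid deforms differently at every spatial location.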

C2f_DCNv3 model

Figure 3. C2f_DCNv3.

As shown in Fig. 3, the C2f_DCNv3 module addresses the challenges associated with increased computational complexity and parameter count as the network depth increases, which can limit the model's applicability on resource-constrained devices. This module optimizes the network structure by introducing the Bottleneck_DCNv3. The input image first passes through a convolutional layer with a kernel size of 1x1. This layer not only alters the dimensionality but also facilitates cross-channel information exchange, simplifying the model. The output from this convolutional layer is then split into two equal parts, each with the dimensions given in Eq. (5).

$$\begin{aligned} h \times w \times \frac{C_{out}}{2} \end{aligned}$$
(5)

Each partition is processed through a Bottleneck_DCNv3 structure, which further reduces the parameter count and computational complexity while maintaining the model's expressive power. The outputs of the Bottleneck_DCNv3 structures are merged along the channel dimension, resulting in a combined feature map with the dimensions given in Eq. (6), where n is the expansion coefficient of the Bottleneck layer. Finally, the merged feature map is processed by another convolutional layer, which also employs a 1x1 kernel with a stride of 1, ultimately yielding an output with \(C_{out}\) channels.

$$\begin{aligned} h \times w \times (n + 2) \frac{C_{out}}{2} \end{aligned}$$
(6)
Figure 4. DCNv3-YOLO.

As shown in Fig. 4, the left side depicts the DCNv3-YOLO module. The input features are first reshaped using a Permute operation before being fed into the DCNv3 module, allowing it to learn from a broader receptive field. The features are then adjusted back to their original shape using Permute, followed by batch normalization (BN) and an activation function to produce the output.

As shown in Fig. 5, the Bottleneck_DCNv3 structure is the core of this model, utilizing depthwise separable convolutions to reduce both the parameter count and computational complexity. The input feature map first passes through a depthwise separable convolution layer (with K = 3, s = 1, p = 1), which halves the number of output channels to h\(\times\)w\(\times\)c/2. Subsequently, batch normalization is applied to reduce internal covariate shift, enhancing the model's generalization capability. The output of the batch normalization is then passed through an activation function (ReLU) to increase the model's non-linear expressiveness. The second component is the deformable convolution, which allows the receptive field to adapt to irregular features. Finally, if the shortcut flag is set to true, a residual connection is applied, resulting in a final output feature map with dimensions of h\(\times\)w\(\times\)c.

Figure 5. Bottleneck_DCNv3.

The C2f_DCNv3 module significantly reduces the number of model parameters and computational load through the use of depthwise separable convolutions and deformable convolutions, thus enhancing the operational efficiency of the model. Following this, batch normalization and activation functions help maintain the model’s expressive capability. Compared to the original C2f module, this structure effectively compresses and enhances the input features by integrating depthwise separable and deformable convolutions.
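Putting the pieces together, a simplified PyTorch sketch of the structures described above might look as follows. DCNv3YOLO here is a stand-in built on torchvision's DCNv2-style operator (the official DCNv3 uses a channels-last layout, hence the Permute steps in Fig. 4), and all class names and channel bookkeeping are illustrative assumptions rather than the authors' exact code.

```python
import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d

class DCNv3YOLO(nn.Module):
    """Stand-in for the DCNv3-YOLO block of Fig. 4: offsets -> deformable conv -> BN -> activation."""
    def __init__(self, c_in, c_out, k=3):
        super().__init__()
        self.offset = nn.Conv2d(c_in, 2 * k * k, k, padding=k // 2)  # predicts Delta p_n
        self.weight = nn.Parameter(torch.empty(c_out, c_in, k, k))
        nn.init.kaiming_uniform_(self.weight)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.ReLU()

    def forward(self, x):
        y = deform_conv2d(x, self.offset(x), self.weight, padding=1)
        return self.act(self.bn(y))

class BottleneckDCNv3(nn.Module):
    """Fig. 5: depthwise separable conv (halving channels) -> BN -> ReLU -> DCNv3 -> optional shortcut."""
    def __init__(self, c, shortcut=True):
        super().__init__()
        self.dw = nn.Conv2d(c, c, 3, stride=1, padding=1, groups=c)  # depthwise, k=3, s=1, p=1
        self.pw = nn.Conv2d(c, c // 2, 1)                            # pointwise: h x w x c/2
        self.bn = nn.BatchNorm2d(c // 2)
        self.act = nn.ReLU()
        self.dcn = DCNv3YOLO(c // 2, c)                              # restores c channels
        self.shortcut = shortcut

    def forward(self, x):
        y = self.dcn(self.act(self.bn(self.pw(self.dw(x)))))
        return x + y if self.shortcut else y                         # h x w x c output

class C2f_DCNv3(nn.Module):
    """Fig. 3: 1x1 conv -> split into halves (Eq. 5) -> n bottlenecks -> concat (Eq. 6) -> 1x1 conv."""
    def __init__(self, c_in, c_out, n=1):
        super().__init__()
        self.cv1 = nn.Conv2d(c_in, c_out, 1)
        self.m = nn.ModuleList(BottleneckDCNv3(c_out // 2) for _ in range(n))
        self.cv2 = nn.Conv2d((n + 2) * (c_out // 2), c_out, 1)       # (n + 2) * c_out / 2 -> c_out

    def forward(self, x):
        y = list(self.cv1(x).chunk(2, dim=1))                        # two h x w x c_out/2 halves
        y.extend(m(y[-1]) for m in self.m)
        return self.cv2(torch.cat(y, dim=1))

x = torch.randn(1, 64, 40, 40)
print(C2f_DCNv3(64, 64, n=2)(x).shape)  # torch.Size([1, 64, 40, 40])
```

Note how the concatenation of the two halves plus the n bottleneck outputs reproduces the \((n+2)\,C_{out}/2\) channel count of Eq. (6) before the final 1x1 projection.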

SE module

As shown in Fig. 6, the SE module effectively enhances the representation power of the network by allowing it to focus on important features, thereby improving overall model performance.

Figure 6. SE attention.

Squeeze Operation: The squeeze operation applies global average pooling (denoted \(F_{sq}\), Eq. (7)) to the output feature map U, generating a channel descriptor \(f_{g}\) of size 1\(\times\)1\(\times\)C. This effectively compresses the spatial dimensions, producing a representation that summarizes the global information of each channel and allows the model to capture the overall significance of features across the input.

$$\begin{aligned} f_{g}&= F_{sq}(U) \end{aligned}$$
(7)

Excitation Operation: As shown in Eq. (8), in the excitation operation, the feature map fg is fed into a fully connected layer (FC) followed by a ReLU24 activation function. This is then passed through another fully connected layer with a sigmoid25 activation function, resulting in a scaling vector fs. This vector captures the importance of each channel by dynamically adjusting their weights, effectively enhancing the model’s focus on the most relevant features for improved performance.

$$\begin{aligned} f_{s}&= \sigma (FC(ReLU(FC(f_{g})))) \end{aligned}$$
(8)

Scale Operation: As shown in Eq. (9), the scale operation involves element-wise multiplication of the original feature map U with the scaling vector fs. This operation, denoted as Fscale, produces the final output feature map X. By applying this scaling, the model emphasizes important channels while suppressing less relevant ones, thereby enhancing the overall feature representation and improving detection accuracy.

$$\begin{aligned} X&= F_{scale}(U, f_{s}) \end{aligned}$$
(9)
Figure 7. SE attention process.

The structure of the SE attention mechanism is shown in Fig. 7. After passing through the C2f module, the feature maps enter a global pooling layer for compression. Subsequently, the outputs are processed through the ReLU and Sigmoid functions to obtain the scaling factors. This process effectively adjusts the weight of each channel, enhancing the model's focus on important channels and improving overall feature representation.
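As a reference, a minimal PyTorch sketch of Eqs. (7)-(9) follows. The reduction ratio r is an assumed hyperparameter (the original SE paper uses r = 16), not a value reported here.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels, r=16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)   # F_sq: global average pooling (Eq. 7)
        self.excite = nn.Sequential(             # FC -> ReLU -> FC -> sigmoid (Eq. 8)
            nn.Linear(channels, channels // r),
            nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels),
            nn.Sigmoid(),
        )

    def forward(self, u):
        b, c, _, _ = u.shape
        f_g = self.squeeze(u).view(b, c)         # 1 x 1 x C channel descriptor f_g
        f_s = self.excite(f_g).view(b, c, 1, 1)  # per-channel scaling vector f_s
        return u * f_s                           # F_scale: reweight channels (Eq. 9)

x = torch.randn(2, 64, 20, 20)
print(SEBlock(64)(x).shape)  # torch.Size([2, 64, 20, 20])
```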

Results

Datasets and environments

Image quality26 is critical for effective training, so dedicated imaging equipment was used to collect the images. As shown in Fig. 9, the dataset used in this study was provided by the Affiliated Hospital of Shanxi University of Traditional Chinese Medicine, with tongue diagnosis features annotated by experienced physicians from the hospital.

Figure 8. Dataset.

Figure 9. Tongue data collection.

As shown in Fig. 8, all images were divided into training, validation, and test sets. Since each image may contain multiple annotations for teeth marks and fissures, the actual number of annotations is higher than the image counts suggest. During training, all images were automatically resized to 576x576. The optimizer used in this study was the SGD27 optimizer, which is better suited than the Adam28 optimizer to detecting fine features such as teeth marks and fissures. The batch size was set to 12, and an adaptive learning rate was applied to train the model. The experimental environment included an Ubuntu 20.04 system, three NVIDIA GeForce RTX 4090 GPUs, and the PyTorch framework.
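For reproducibility, this training setup could be expressed with the Ultralytics YOLOv8 API roughly as follows. The model and dataset YAML paths are placeholders, and the epoch count is an assumption, since it is not stated in the paper.

```python
from ultralytics import YOLO

# Hypothetical config for the improved model (C2f_DCNv3 + SE); path is a placeholder.
model = YOLO("yolov8-c2f-dcnv3-se.yaml")
model.train(
    data="tongue.yaml",   # placeholder dataset config with train/val/test splits
    imgsz=576,            # images resized to 576x576, as stated above
    batch=12,
    optimizer="SGD",      # SGD preferred over Adam for these fine features
    cos_lr=True,          # one way to realize an adaptive learning-rate schedule
    epochs=300,           # assumed; the paper does not report the epoch count
)
```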

Experimental process

Figures 10 and 11 illustrate the training and validation processes for the different models. All models show gradual convergence as the number of epochs increases, with the loss values decreasing rapidly. This indicates that the models learned effectively during training.

Figure 10. Training loss.

Figure 11. Validation loss.

During the training process, the mean Average Precision (mAP) is also a crucial metric. mAP is widely used in object detection tasks to assess performance and measures the average precision of the model in identifying objects across different categories. Specifically, mAP@0.5 refers to the mAP calculated when the IoU (Intersection over Union) threshold is set to 0.5.
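As a concrete reminder of the IoU criterion underlying mAP@0.5, the following small Python function computes IoU for axis-aligned boxes in (x1, y1, x2, y2) format; a prediction counts as a true positive when its IoU with a same-class ground-truth box is at least 0.5.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)   # overlap area
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 = 0.142..., below the 0.5 threshold
```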

Figure 12 demonstrates the mAP changes throughout the training process. As seen in the figure, with an increasing number of training epochs, the YOLOv8+C2f_DCNv3+SE (our model) exhibits excellent performance. In comparison, models incorporating other attention mechanisms such as CBAM, ECA, or those using only SE or C2f_DCNv3 alone, show lower mAP@0.5 values under the same training conditions. Moreover, when compared to non-YOLO models like the Faster R-CNN model, our model demonstrates a significant advantage.

The final mAP@0.5 value for the C2f_DCNv3+SE model is the highest, reaching 92.4%. This further validates the effectiveness of the C2f_DCNv3 and SE modules in enhancing object detection accuracy and efficiency.

Figure 12. mAP@0.5.

Ablation study

In this study, the original YOLOv8 model was used as the baseline for experiments, and various attention mechanisms such as CBAM29 and ECA30 were incorporated for comparison. Additionally, we conducted experiments with C2f_DCNv3 and SE separately to demonstrate the superiority of our proposed model. Precision was used to evaluate the accuracy of the model's predictions for tongue teeth marks and tongue fissures, reflecting the proportion of true positives among the predicted positives. Recall measures whether all instances of tongue teeth marks and fissures were detected by the model. The F1 score is the harmonic mean of Precision and Recall, and mAP (mean Average Precision) indicates overall performance, with higher values suggesting better model effectiveness; the formulas are given below. The following two tables show the prediction results for tongue teeth marks and fissures under the different models in the ablation experiments. As shown in Table 1, it is evident that after incorporating the C2f_DCNv3 and SE attention mechanisms, the model's performance significantly surpassed that of the other models. Specifically, in the prediction of fissures, the mAP reached 91.34%, much higher than the other models, and the F1 score was 0.87, the best among the compared models. Although the Recall score was 84.94%, slightly lower than that of Faster R-CNN, the overall performance of our model retained a clear advantage.
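For reference, these metrics are computed as follows, where TP, FP, and FN denote true positives, false positives, and false negatives:

$$\begin{aligned} Precision&= \frac{TP}{TP+FP}, \qquad Recall = \frac{TP}{TP+FN}, \qquad F1 = \frac{2\cdot Precision\cdot Recall}{Precision+Recall} \end{aligned}$$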

Table 1 Fissures.

Table 2 shows the prediction performance for Toothmark. The mAP and F1 score reached the highest values, demonstrating the superior generalization ability of the model compared to other models. This highlights the overall performance advantage of our approach in accurately detecting toothmarks.

Table 2 Toothmark.

The Precision-Recall (PR) curve represents the relationship between Precision and Recall; the closer the PR curve is to the top right corner, the better the model's performance. As shown in Figs. 13 and 14, with the incorporation of the C2f_DCNv3 and SE modules, the model demonstrates significant advantages in predicting toothmarks and tongue fissures. This indicates that our model excels at detecting such small and subtle features. It is clearly visible from the figures that the PR curve of the C2f_DCNv3 and SE combination is closest to the top right corner, further validating the effectiveness of the model. Since Faster-RCNN31, EfficientDet32, and CenterNet33 are non-YOLO models, the PR curves include only Faster-RCNN to represent the comparison with non-YOLO models.

Figure 13. Tongue fissures PR curve.

Figure 14. Toothmark PR curve.

Table 3 presents the mAP values for predictions on the test set. Unlike the training and validation sets, the test set better reflects the model's generalization ability. From the table, it is evident that our model not only achieves a significantly higher mAP than the other models but also has noticeably lower GFLOPs. This demonstrates that our model can significantly reduce computational complexity while maintaining high accuracy, achieving efficient performance.

Table 3 mAP@0.5 and GFLOPS.

To enhance the computational efficiency of our model, we applied several lightweighting techniques to reduce the FLOPs (floating point operations). Through optimizations such as fusing convolution and batch normalization layers and introducing the C2f_DCNv3 module with reduced channel widths, the model's complexity was effectively reduced: the total computational cost fell from 28.8 GFLOPs to 22.91 GFLOPs, a 20.5% reduction. This reduction was achieved while maintaining model accuracy, demonstrating the efficiency of the proposed optimizations.
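As an illustration of the conv-batch-norm fusion mentioned above, the following sketch folds a BatchNorm2d's affine transform into the preceding Conv2d's weights and bias, a standard inference-time optimization; this is illustrative, not the authors' exact code.

```python
import torch
import torch.nn as nn

def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Fold BN(y) = gamma * (y - mean) / sqrt(var + eps) + beta into the conv."""
    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      stride=conv.stride, padding=conv.padding, bias=True)
    scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)  # per-channel gamma / sigma
    fused.weight.data = conv.weight.data * scale.view(-1, 1, 1, 1)
    bias = conv.bias.data if conv.bias is not None else torch.zeros_like(bn.running_mean)
    fused.bias.data = (bias - bn.running_mean) * scale + bn.bias.data
    return fused

conv, bn = nn.Conv2d(8, 16, 3, padding=1), nn.BatchNorm2d(16)
for _ in range(3):                          # populate BN running statistics
    bn(conv(torch.randn(4, 8, 10, 10)))
conv.eval(); bn.eval()
x = torch.randn(1, 8, 10, 10)
with torch.no_grad():                       # fused conv matches conv + BN at inference
    print(torch.allclose(bn(conv(x)), fuse_conv_bn(conv, bn)(x), atol=1e-5))  # True
```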

Figure 15 shows the prediction results, where (a) represents a tongue with teeth marks, (c) represents a tongue with fissures, and (b) represents a tongue with both teeth marks and fissures. Panels (d), (e), and (f) correspond to the prediction results for the aforementioned images. It can be observed that the model is able to accurately predict these fine pathological features, such as teeth marks and fissures, with high precision, which meets the clinical needs of Traditional Chinese Medicine (TCM).

Figure 15. Tongue detection results.

Conclusion and limitations

In this study, we proposed an improved YOLOv8 model for the efficient detection of tongue fissures and teeth marks, built around the C2f_DCNv3 module. By introducing deformable convolution and the SE attention mechanism, the model demonstrated superior adaptability and accuracy in handling complex backgrounds and irregular features. Experimental results showed that the improved model achieved a mean Average Precision (mAP) of 92.77% in detecting tongue fissures and teeth marks, significantly enhancing detection performance while improving computational efficiency. This achievement not only provides more reliable technical support for Traditional Chinese Medicine (TCM) tongue diagnosis but also opens new application prospects for medical image analysis in related fields. At present, the model cannot be applied to video recognition; in the future, it can be further adapted to accommodate a wider range of scenarios.