Knowledge discovery of diseases symptoms and rehabilitation measures in Q&A communities

Zhang, Yanli; Wang, Tao; Wang, Yan; Cao, Jingyu

doi:10.1038/s41598-025-98300-9

Article
Open access
Published: 19 April 2025

Yanli Zhang¹,
Tao Wang⁴,
Yan Wang² &
…
Jingyu Cao³

Scientific Reports volume 15, Article number: 13593 (2025)

Abstract

Rehabilitation-related diseases have long recovery times, making frequent hospital visits impractical for patients. There is a high demand for online rehabilitation advice, but valuable Q&A information in online health communities remains largely untapped, leading to wasted medical resources. This study developed a BERT-BiGRU-attention model to extract three types of entity relationships: disease symptoms, appropriate rehabilitation measures, and inappropriate rehabilitation measures. This model achieved optimal knowledge extraction results. We then used a clustering analysis model to group disease-related knowledge, helping to uncover useful information for rehabilitation patients, assist in medical diagnosis, and enhance health education.

Introduction

Medical Rehabilitation is a systematic project that helps patients restore physical function, psychological state, and social participation capabilities through multidimensional interventions¹. The importance of medical rehabilitation has become increasingly prominent, yet the field currently faces two core challenges: (1) Fragmented Knowledge: Rehabilitation services require the integration of multidisciplinary knowledge (e.g., physical therapy, psychological interventions), but the lack of information-sharing mechanisms among institutions leads to uneven resource allocation and inconsistent service quality; (2) Lagging Standardization: Traditional clinical rehabilitation guidelines often rely on generalized protocols, failing to address individual patient needs (e.g., personalized interventions for elderly fracture patients requiring simultaneous management of osteoporosis and cardiopulmonary function)².

With the surge in demand for home-based rehabilitation (globally, up to one-third of the population, or 2.41 billion people, may benefit from rehabilitation during illness or injury, and 68% of chronic disease rehabilitation patients in the U.S. rely on remote guidance), online health communities have emerged as critical channels for patients to access self-management solutions³. These platforms transcend temporal and spatial barriers by integrating patient-provider interaction UGC, such as Q&A data and rehabilitation diaries⁴. However, platform data utilization presents dual contradictions: patient-side inefficiencies due to unstructured Q&A data (e.g., a 37% duplication rate in queries like “post-knee replacement swelling management”) and platform-side resource waste from insufficient fine-grained knowledge mining^5,6.

Our research aims to systematically construct a dynamic “disease-symptom-rehabilitation measure” association framework, transforming tacit experiential knowledge from unstructured UGC into explicit representations to provide rehabilitation decision support while enabling knowledge sharing. To achieve this, we designed a BERT-BiGRU-attention model to extract relationships among three entity types—disease, symptom descriptions, and rehabilitation measures—from two million rehabilitation Q&A entries. Entity clustering further uncovered additional associative knowledge. This work holds significant value for supplementing and integrating clinical rehabilitation data, overcoming the static limitations of traditional guidelines. Additionally, the extracted knowledge enables dynamic matching of rehabilitation measures with patient profiles, empowering patients in self-managed rehabilitation.

Related work

Research on knowledge extraction in online health communities

The massive user base of online health communities generates a wealth of user-generated content that holds largely untapped information. Extracting knowledge from this content can offer new insights and solutions to health issues. Prior work has extracted important relationships between diseases, symptoms, and medications^7,8; analyzed the distribution of topics within breast cancer communities, tracking topic dynamics and changes in disease severity over time⁹; used user comments to support wiser choices when selecting experts¹⁰; and identified depressive emotions¹¹. Extraction of medication knowledge from online health communities also plays a crucial role in the field of relationship extraction. Examples include extracting adverse drug events (ADEs) from user-generated content in online health communities^12,13, identifying new indications for drugs beyond their package inserts^14,15,16, detecting prescription drug abuse¹⁷, extracting relationships between drugs and their effects using the Medical Dictionary¹⁸, and extracting dietary recommendation knowledge from online health websites using rule-based methods, among others¹⁹.

In the context of multimodal rehabilitation knowledge mining, the vast unstructured user-generated content (UGC) from online rehabilitation Q&A communities, including colloquial expressions and metaphors, provides novel pathways to explore relationships among diseases, symptoms, and rehabilitation measures. This supports the construction of dynamic rehabilitation knowledge graphs, integrates UGC with clinical guideline knowledge systems, and enhances rehabilitation decision-making.

Research in medical knowledge discovery

Early scholars primarily used pattern matching and machine learning techniques to extract disease-drug relationships from medical texts. Pattern matching relied on syntactic analysis and expert-defined rules, yielding low recall rates. Methods employing syntax analysis and semantic dependencies have been widely adopted to extract relationships among diseases, symptoms, and drugs from diverse datasets. For instance, Iqbal et al. used rule-based extraction to identify drug-side effect relationships from electronic medical records²⁰. In tasks like the I2B2/VA challenge, machine learning was applied to extract relationships among medical concepts in clinical records, including problems, examinations, and treatments²¹. Support vector machines and kernel methods have been effective in extracting relationships between chemicals and diseases from PubMed literature²². Various machine learning algorithms have also been employed to extract semantic relationships such as cure, prevention, and side effects from clinical records and discharge summaries²³, as well as relationships related to patients’ medical problems (diseases, examinations, and treatments) from discharge summaries and medical literature²⁴. Machine learning methods have also been widely applied to extract adverse drug events (ADEs)^12,13, new indications for drug labels¹⁴, and drug-drug interactions²⁵. Despite these advancements, machine learning remains constrained by the complexity of feature engineering, and its performance therefore requires further improvement.

Artificial intelligence, particularly deep learning, has revolutionized various industries. Unlike traditional machine learning methods that require manual features and domain knowledge, deep learning automates feature extraction through multi-layer neural networks, reducing human effort significantly²⁶. This technology has found wide application in health information processing. For instance, it has been used to extract Bacteria Biotope events²⁷, and classify relationships in medical contexts such as problems-treatment, problems-detection, and inter-medical problem relationships²⁸. Deep learning frameworks have also been effective in tasks like chemical-induced diseases²⁹, nodule detection³⁰, medical labeling, and scanning³¹. Additionally, it has extensive applications in extracting adverse drug events³², drug-drug interactions³³, and the therapeutic effects of drugs on disease^34,35,36. It leverages advanced techniques like BERT word vectors and word embeddings to enhance natural language processing capabilities in health informatics³⁶.

Current research primarily focuses on structured data sources like electronic medical records, discharge summaries, and medical literature abstracts for extracting disease-related relationships. However, these datasets are small in scale, limiting the breadth of knowledge gained from relation extraction. Moreover, the effectiveness of these methods on large, colloquial, and unstructured text datasets is not optimal. Although some scholars have explored using deep learning to extract disease-related relationships from large, unstructured datasets, there remains untapped potential in utilizing vast question-answer data from online health communities. This data, generated by millions of users, could provide valuable insights to enhance knowledge bases and support clinical decision-making. Given that deep learning is pivotal in natural language processing and the primary method for addressing disease-related relationship extraction, this study aims to address the challenge of extracting knowledge from unstructured online Q&A rehabilitation communities between doctors and patients using deep learning techniques, and to leverage the acquired knowledge to facilitate rehabilitation-related health management.

Rehabilitation support systems

Rehabilitation support systems drive the intelligent development of disease recovery through technological innovation and diversified application scenarios. In the realm of personalized rehabilitation recommendations, Al-Remawi and Aburub utilize convolutional neural networks (CNN) to analyze radiomic features (e.g., lymphatic vessel texture) in postoperative breast cancer patients, accurately identifying biomarkers such as VEGF-C³⁷. They further generate personalized nutrition plans via intelligent platforms, significantly improving patient rehabilitation outcomes. The integration of remote rehabilitation and IoT technologies expands application scenarios—for example, wearable sensors combined with decision tree (DT) algorithms enable remote ankle rehabilitation systems to synchronize patient motion data through fiber-optic sensors. Based on DT classification models, these systems provide personalized recommendations to optimize training intensity in real-time¹, thereby reducing reinjury risks.

In the field of sports rehabilitation, researchers employed the KIMORE dataset to evaluate chronic low back pain rehabilitation. Using a GCN-LSTM model to analyze five types of lumbar movements (e.g., trunk rotation, squats), the prediction error for scoring was reduced by 40% compared to traditional methods³⁸. In sports injury optimization, wearable knee sensors monitor gait symmetry, while immersive training enhances patient engagement. Combined with LSTM-based reinjury risk prediction³⁹, secondary surgery rates are markedly reduced.

For AI-driven psychological rehabilitation interventions, technologies such as natural language processing (NLP), random forests, and long short-term memory (LSTM) networks provide real-time support for cancer patients. For example, Woebot (based on cognitive behavioral therapy) and the meditation app Headspace dynamically adjust CBT content through interactive dialogues and sentiment analysis, integrating NLP and random forest (RF) algorithms to significantly alleviate anxiety and depressive symptoms in cancer patients⁴⁰.

In the domain of knowledge base construction and system integration, researchers emphasize the importance of structured knowledge bases. The intelligent medical rehabilitation system proposed by Hou et al. collects patient physiological data in real time through an IoT sensing layer, integrates cross-institutional information sharing via a cloud platform, and supports remote personalized protocol design by physicians (e.g., exercise intensity control for osteoporosis patients)². Additionally, its rehabilitation knowledge base query module transforms unstructured doctor-patient Q&A (UGC) into standardized clinical guidelines. Similarly, Wang et al. developed a remote stroke rehabilitation system that combines a MySQL database with a PHP development framework to automate rehabilitation prescription generation, video guidance, and training reports, reducing reliance on offline resources and alleviating regional healthcare disparities. This system enabled 68% of chronic disease patients in the U.S. to complete training via remote guidance⁴¹. These integrated knowledge systems play a critical role in addressing the static limitations of traditional guidelines.

However, the field still faces multiple challenges: low adoption rates of smart rehabilitation devices and inconsistent evaluation metrics for rehabilitation knowledge across clinical scenarios. Future development must focus on the innovative integration of technology and knowledge. Technologically, federated learning and edge computing can enable low-latency, high-privacy remote rehabilitation management. For standardization, interdisciplinary frameworks involving clinicians, data scientists, and ethicists are needed to establish unified evaluation systems. In knowledge sharing, multi-modal rehabilitation knowledge (e.g., text, imaging data, sensor signals, and patient reports) should be mined and fused through diverse channels to promote collaborative refinement of rehabilitation knowledge systems. Building on this, our study proposes a BERT-BiGRU-attention model to extract implicit rehabilitation knowledge from user-generated content (UGC) in online health Q&A communities. Through clustering analysis, we categorize this knowledge into groups (e.g., disease-symptom clusters, intervention recommendation clusters), revealing latent associations. This work holds significant value for enhancing rehabilitation knowledge systems, empowering patient self-management, supporting personalized clinical decision-making, and improving health education efficiency.

Methodology

We build a rehabilitation knowledge extraction model using a BERT-BiGRU-attention architecture, comprising input, BERT word embedding, BiGRU, attention, and output layers (see Fig. 1).

The BERT-BiGRU-attention model has significant advantages in extracting relationships among diseases, symptoms, and rehabilitation measures. BERT (Bidirectional Encoder Representations from Transformers) learns bidirectional semantic representations from large-scale pre-training on a vocabulary enriched with medical entities and dictionaries. When combined with annotated data, it effectively captures complex contextual associations in medical texts through deep semantic reasoning. For example, implicit relationships between disease terms (e.g., post-stroke hemiplegia) and symptoms (e.g., finger stiffness) can be more accurately encoded via BERT’s contextual embeddings. The Bidirectional Gated Recurrent Unit (BiGRU) captures both forward and backward sequential dependencies while offering higher computational efficiency than BiLSTM¹⁸, making it suitable for modeling long-distance associations in lengthy medical narratives. In a sentence such as “Long-term finger stiffness after post-stroke hemiplegia requires enhanced hand functional exercises,” BiGRU effectively models the disease-symptom relationship (post-stroke hemiplegia → finger stiffness) and links it to the rehabilitation measure (enhanced hand functional exercises). The attention mechanism dynamically allocates weights to highlight critical semantic segments²⁷. For instance, in the above sentence, the model automatically focuses on the connections between post-stroke hemiplegia, finger stiffness, and enhanced hand functional exercises, mitigating interference from irrelevant terms or non-correlated descriptions.

Therefore, our research employs the BERT-BiGRU-attention model, leveraging the tripartite advantages of pre-trained semantic understanding, sequential modeling, and feature prioritization to efficiently address complex relationship extraction among diseases, symptoms, and rehabilitation measures in medical texts.

First, the input layer preprocesses the corpus by concatenating patient questions with doctor responses to create a complete relationship extraction corpus. Next, corpus entries containing only one type of entity are filtered out, and the remaining entries are annotated using a novel tagging method that marks entity positions with special symbols, in the format “relation@entity1$entity2$text”.

After annotating the corpus, the text is vectorized in the word embedding layer. The rehabilitation Q&A community corpus presents challenges such as non-standard vocabulary, varied grammar, and polysemy. Patients typically describe their conditions and ask questions first, with doctors responding by analyzing and advising based on these descriptions. Doctors often use pronouns to refer to patient conditions. Utilizing the BERT model is essential to capture contextual information through bidirectional Transformer architecture, enhancing the precision and effectiveness of word representations.

Third, the word vectors are fed into the neural network for training. The rehabilitation knowledge corpus features entities in both questions and answers, and only parts of the text indicate their relationship amid considerable noise. Combining a BiGRU network, which handles long-text dependencies, with an attention mechanism, which prioritizes relevant text, improves entity relationship prediction by filtering out irrelevant noise.

Finally, vectors learned from the neural network are processed through a fully connected layer and softmax normalization. The label with the highest probability determines the classification label for the corpus, resulting in a triple (entity1, entity2, relationship). To facilitate knowledge extraction, all triples are categorized and summarized into a relationship dictionary based on entity relationships, capturing rehabilitation knowledge across different relationships. Additionally, to enhance intuitive understanding, a visual relationship network graph is generated as output.

Input layer in relationship extraction model

The input layer of the rehabilitation knowledge relationship extraction model mainly involves data preprocessing and annotation. In medical Q&A scenarios, there is typically a strong contextual link between patient questions and doctor responses. Extracting relationships between entities solely from questions or answers presents challenges. For instance:

Patient Question: “I have pain below the patella when squatting or going up and down stairs. What could it be?”

Doctor Response: “Based on your description, it could be arthritis. You should rest and take care of it. You can apply traditional Chinese medicine or medicated wine locally, take oral trauma pills, and use anti-inflammatory and analgesic drugs. Combining these measures with traditional Chinese medical therapy can help alleviate and improve the condition.”

Patient questions alone provide a corpus for extracting “symptom” entities without relationships between them. Doctor responses alone provide a corpus where relationships between “disease” and “rehabilitation measures” can be extracted. Combining patient questions and doctor responses into a unified corpus allows extraction of relationships involving three types of entities. Therefore, for comprehensive rehabilitation knowledge mining, patient questions and doctor responses are concatenated. Additionally, cases where the concatenated corpus contains only one type of entity are excluded.

During the relationship extraction phase, the corpus undergoes annotation to identify two types of entities and their relationships. These relationships include DS (symptom of disease), SFD (suitable for disease), NSFD (not suitable for disease), and UKN (unknown). Annotated corpus entries are formatted as “relationship@entity1$entity2$text.”

In relationship extraction tasks, two common annotation methods for corpora are used: (1) Only annotating the two entities of interest; (2) Position-marker annotation: Annotating both entities with special markers indicating their original positions in the corpus. This approach aids in accurately identifying sentences with implicit entity relationships. To enhance model learning effectiveness, it’s crucial to manage entity order in the corpus. Thus, in our model, we employ different symbols for marking entity positions. Specifically, “#” denotes the position of entity 1, and “*” denotes the position of entity 2. For instance, when annotating the relationship between “arthritis” and “trauma pills” in a corpus, the annotation example is as follows: “SFD@arthritis$trauma pills$ Consideration is #. You need to rest and take care of it, you can apply traditional Chinese medicine or medicated wine locally, and take * internally.” (Note: This is a simplified example; actual experimental corpora retain complete patient questions and doctor responses.)
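The position-marker format described above can be illustrated with a minimal Python sketch. The function name `annotate` and the shortened example sentence are hypothetical, and the sketch assumes that each entity's first mention is the one replaced by its marker; it does not handle overlapping mentions or text that already contains “#” or “*”.

```python
def annotate(relation: str, entity1: str, entity2: str, text: str) -> str:
    """Build one 'relation@entity1$entity2$text' line with position markers."""
    if entity1 not in text or entity2 not in text:
        raise ValueError("both entities must occur in the text")
    # Replace the first mention of each entity with its marker:
    # '#' for entity 1 and '*' for entity 2 (assumed reading of the format).
    marked = text.replace(entity1, "#", 1).replace(entity2, "*", 1)
    return f"{relation}@{entity1}${entity2}${marked}"


example = annotate(
    "SFD",
    "arthritis",
    "trauma pills",
    "It could be arthritis. You should rest and take oral trauma pills.",
)
print(example)
# SFD@arthritis$trauma pills$It could be #. You should rest and take oral *.
```

Using two distinct markers keeps the entity order recoverable from the text alone, which is the motivation given above for managing entity order in the corpus.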

BERT word embedding layer in relationship extraction model

After obtaining annotated corpora, the textual data is passed into the BERT word embedding layer to get vector representations. In rehabilitation knowledge relationship extraction in Q&A communities, there are some issues: first, the grammar and structure of the corpora are non-standard; second, the same vocabulary can have different meanings and entity types depending on the context, which can affect relationship extraction results. Therefore, using the BERT model’s bidirectional Transformer architecture helps provide more accurate vector representations based on contextual meanings.

In a Q&A setting, patients first describe their condition and ask questions, with doctors responding by analyzing and offering guidance based on the patient’s account. For instance:

Patient: “When I was young, I fell from upstairs and possibly broke a bone in my waist. I applied some random treatment and didn’t go to the hospital. Now, I occasionally feel pain. I’m a 22-year-old male. What’s the best examination if I want to get a check-up?”

Doctor: “In your situation, you could start with an X-ray, but an MRI would be better. It gives a clearer view of your lumbar vertebrae and surrounding tissues. Make sure to sleep on a firm mattress and avoid long periods of standing or sitting.”

Here, the phrase “In your situation” refers to the patient’s specific condition. To accurately represent the semantics of such passages, word embedding methods must take into account contextual information and manage long-text dependencies. BERT, with its bidirectional transformers and multi-head self-attention, is well-equipped for this. Thus, BERT is used as the word embedding layer, utilizing Google’s BERT-base Chinese model with 12 Transformer layers. The sentence length is set to 200, with each character represented by a 768-dimensional feature vector.

BiGRU + attention layer in relationship extraction model

After obtaining word embedding vectors from the corpus, they are fed into the neural network for model training.

In the task of extracting rehabilitation knowledge relationships from online Q&A communities, the two entities involved are often found in the patient’s question and the doctor’s response. For example:

Patient: “I’m an athlete. After running last week, the outer side of my ankle started hurting, and the ligament above the ankle is painful whenever I move it. What’s the issue? Any treatment advice?”

Doctor: “From your description, it seems to be a sports injury. It’s recommended to get an MRI of the ankle joint…”.

In this example, “pain on the outer side of the ankle” is a symptom, “sports injury” is a disease, and “MRI of the ankle joint” is a rehabilitation measure. These entities appear in both the question and the response. To extract relationships between them, it’s essential to consider the context and handle long-text dependencies. Thus, bidirectional gated recurrent units (BiGRU) are chosen for analysis.

In relation extraction tasks, due to noisy text, only part of the corpus typically expresses the relationship between two entities. Using an attention mechanism helps increase the weight of relevant text, filtering out noise and enhancing the model’s ability to predict relationships. The structure of this layer is shown in Fig. 1.

Step 1 The BiGRU layer receives the BERT word embeddings as input for each corpus, represented as X ∈ R^(200 × 768), where the sentence length is 200 and each word vector is 768-dimensional. X passes through forward and backward GRU layers, each with 32 hidden units. The output feature vectors from both GRU layers are concatenated to get H ∈ R^(200 × 64).

Step 2 Introducing the attention mechanism:

$${\text{M }}={\text{ tanh}}\left( {\text{H}} \right)$$

(1)

$$\alpha \,=\,{\text{softmax}}({\omega ^{\text{T}}}{\text{M}})$$

(2)

$${\text{c }}={\text{ H}}{\alpha ^{\text{T}}}$$

(3)

where ω is a trainable parameter vector and c ∈ R⁶⁴.
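To make the shapes in Eqs. (1)–(3) concrete, the following is a minimal pure-Python sketch of attention pooling over the BiGRU output. It assumes one reading of the dimensions: H is stored as 200 rows (time steps) of 64 features, ω has 64 entries, one attention weight is produced per time step, and c is the weighted sum of the rows of H. The random values stand in for real BiGRU outputs and trained parameters, and `attention_pool` is a hypothetical name.

```python
import math
import random


def attention_pool(H, w):
    """Attention pooling over time steps.

    H: list of T rows, each a list of d BiGRU features (T x d).
    w: list of d weights (the trainable vector, fixed here for illustration).
    Returns (alpha, c): T attention weights and the d-dimensional context vector.
    """
    # M = tanh(H), applied element-wise.
    M = [[math.tanh(h) for h in row] for row in H]
    # One score per time step: dot product of w with that step's row of M.
    scores = [sum(wi * mi for wi, mi in zip(w, row)) for row in M]
    # Softmax over time steps (subtracting the max for numerical stability).
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    alpha = [e / total for e in exps]
    # c = attention-weighted sum of the rows of H.
    d = len(w)
    c = [sum(a * row[j] for a, row in zip(alpha, H)) for j in range(d)]
    return alpha, c


rng = random.Random(0)
T, d = 200, 64  # sentence length and BiGRU output size from the text
H = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(T)]
w = [rng.uniform(-1, 1) for _ in range(d)]
alpha, c = attention_pool(H, w)
print(len(alpha), len(c))  # 200 64
```

Under this reading, time steps whose features align with ω receive larger weights, so noisy parts of the concatenated question and answer contribute less to c.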

Output layer in the rehabilitation knowledge relationship extraction model

Relationship extraction in deep learning is essentially a classification task. The vectors learned from the neural network are passed into a fully connected layer to reduce the output dimensions. The vector’s dimensionality corresponds to the number of classification labels. Then, softmax is applied to normalize the vectors, ensuring each output element is within the range [0, 1].

$$f(z_j)=\frac{e^{z_j}}{\sum_{i=1}^{n}e^{z_i}}$$

(4)

Here, z_j is the j-th element of the vector from the fully connected layer, n is the number of classification labels, and f(z_j) represents the probability of classifying the corpus as the j-th label. The label with the highest probability is selected as the classification label for the corpus.
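The softmax-and-select step can be sketched in a few lines of Python. The logit values below are hypothetical, and the label order in `LABELS` is an assumption made for illustration; only the four label names come from the annotation scheme described earlier.

```python
import math

LABELS = ["DS", "SFD", "NSFD", "UKN"]  # assumed label order


def softmax(z):
    """Normalize a list of scores into probabilities that sum to 1."""
    top = max(z)  # subtract the max for numerical stability
    exps = [math.exp(v - top) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def predict_label(logits):
    """Return the most probable label and the full probability vector."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs


# Hypothetical output of the fully connected layer for one corpus entry.
label, probs = predict_label([0.2, 2.1, -0.5, 0.3])
print(label)  # SFD
```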

After extracting relationships from each corpus, triples (entity1, entity2, relationship) are generated. These triples are organized into DS, SFD, and NSFD dictionaries, indexed by disease name. The DS dictionary includes diseases and their symptoms, the SFD lists diseases with suitable rehabilitation measures, and the NSFD lists unsuitable measures. Additionally, relationship network diagrams are provided for better visualization.
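As a rough illustration of how triples could be grouped into the three dictionaries, consider the sketch below. It assumes that the first element of each triple is the disease and that UKN triples are simply discarded; the triples themselves are hypothetical examples, and `build_dictionaries` is not a name from the study.

```python
from collections import defaultdict


def build_dictionaries(triples):
    """Group (disease, other_entity, relation) triples by relation and disease."""
    dictionaries = {
        "DS": defaultdict(set),
        "SFD": defaultdict(set),
        "NSFD": defaultdict(set),
    }
    for disease, other, relation in triples:
        if relation in dictionaries:  # UKN triples are dropped (assumption)
            dictionaries[relation][disease].add(other)
    return dictionaries


triples = [
    ("arthritis", "pain below the patella", "DS"),
    ("arthritis", "trauma pills", "SFD"),
    ("arthritis", "rest", "SFD"),
    ("sports injury", "ankle pain", "UKN"),
]
dicts = build_dictionaries(triples)
print(sorted(dicts["SFD"]["arthritis"]))  # ['rest', 'trauma pills']
```

Storing values as sets removes duplicate measures that recur across many answers for the same disease.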

Rehabilitation knowledge clustering analysis model

In the rehabilitation Q&A knowledge mining task, clustering analysis is required after knowledge extraction. A clustering analysis model based on k-means++ is constructed to cluster diseases in the DS, SFD, and NSFD dictionaries obtained from the relationship extraction phase. This model includes an input layer, clustering layer, and output layer, as shown in Fig. 2.

The following sections will explain each layer in the Rehabilitation Knowledge Clustering Analysis Model.

Input layer in the rehabilitation knowledge clustering analysis model

In the rehabilitation knowledge relationship extraction model, three relationship dictionaries are obtained and categorized based on entity relationships: DS dictionary, SFD dictionary, and NSFD dictionary. Each dictionary uses the disease name as the key and includes symptoms, suitable rehabilitation measures, and unsuitable rehabilitation measures as values.

In the input layer of the rehabilitation knowledge clustering analysis model, the values of the relationship dictionaries are vectorized using the TF-IDF (term frequency–inverse document frequency) method. This approach helps the clustering algorithm focus on more representative rehabilitation knowledge. For instance, in our corpus, common phrases in the SFD dictionary include “surgical treatment,” “medication treatment,” and “rehabilitation treatment,” where the term “treatment” appears frequently. Using this as a clustering basis may lack representativeness for different diseases, so its weight can be reduced.
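The down-weighting effect can be seen in a small sketch that uses the plain definition tf × log(N/df). The three token lists are hypothetical stand-ins for SFD dictionary values, already split into words; the study's actual tokenization and TF-IDF variant are not specified here.

```python
import math
from collections import Counter


def tfidf(documents):
    """documents: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(documents)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical SFD values for three diseases.
docs = [
    ["surgical", "treatment", "plaster", "fixation"],
    ["medication", "treatment", "rest"],
    ["rehabilitation", "treatment", "functional", "exercise"],
]
weights = tfidf(docs)
print(weights[0]["treatment"], weights[0]["surgical"] > 0)  # 0.0 True
```

With this unsmoothed formula, a term that occurs in every document receives weight zero. Library implementations often use a smoothed inverse document frequency, which lowers the weight of such terms without removing them entirely.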

Next, principal component analysis (PCA) is used to reduce the dimensionality of the text vectors. Dimensionality reduction helps eliminate noise and unimportant features, enhancing clustering effectiveness.

Clustering layer in the rehabilitation knowledge clustering analysis model

After obtaining the dimensionality-reduced text vectors, the k-means++ algorithm is used for clustering. The traditional k-means algorithm randomly selects k sample points as initial cluster centers, and this choice can significantly affect the final results. The k-means++ algorithm improves this process by prioritizing points that are farther away from the already selected centers when choosing the (n+1)th center point.
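The seeding step can be sketched as follows. In the standard k-means++ procedure, each new center is sampled with probability proportional to the squared distance from a point to its nearest already chosen center. The two-dimensional points, the seed, and the function names are illustrative assumptions, and only the initialization is shown, not the subsequent k-means iterations.

```python
import random


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeanspp_init(points, k, seed=0):
    """Choose k initial centers from points using k-means++ seeding."""
    rng = random.Random(seed)
    centers = [rng.choice(points)]  # first center: uniform at random
    while len(centers) < k:
        # Squared distance from each point to its nearest chosen center.
        d2 = [min(squared_distance(p, c) for c in centers) for p in points]
        if sum(d2) == 0:
            # Every point coincides with a center; fall back to uniform choice.
            centers.append(rng.choice(points))
        else:
            # Far-away points are proportionally more likely to be picked.
            centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers


points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0)]
centers = kmeanspp_init(points, k=3)
print(centers)
```

Because an already chosen point has distance zero to itself, it cannot be drawn again while any other point remains at a positive distance.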

Output layer in the rehabilitation knowledge clustering analysis model

In the output layer, clustering results from the three relationship dictionaries are output separately.

Rehabilitation knowledge mining evaluation metrics

Supervised learning algorithms are used for relationship extraction in the rehabilitation Q&A community. In these models, precision, recall, and F-score are commonly used to evaluate performance in relationship extraction in the medical field. Precision (P), recall (R), and F1-score are calculated as follows:

$$P=\frac{TP}{TP+FP}$$

(5)

$$R=\frac{TP}{TP+FN}$$

(6)

$$F1\text{-score}=\frac{2\cdot P\cdot R}{P+R}$$

(7)

In these formulas, TP refers to True Positives, FN to False Negatives, FP to False Positives, and TN to True Negatives. The metrics P and R represent the model’s precision and recall, while the F1-score, which combines P and R, serves as a third evaluation metric.
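As a worked illustration of Eqs. (5)–(7), the sketch below computes the three metrics from confusion counts. The counts are hypothetical and are not results from this study; returning 0.0 when a denominator is zero is a convention chosen here.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1-score from confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Hypothetical counts for one relation label.
p, r, f1 = precision_recall_f1(tp=80, fp=20, fn=10)
print(round(p, 3), round(r, 3), round(f1, 3))
```

For a multi-class task such as this one, the counts would be computed per relation label and then averaged; the averaging scheme is not stated in this section.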

AUC (ROC Area) evaluates how well a model distinguishes positive and negative classes by plotting True Positive Rate (TPR) vs. False Positive Rate (FPR) across thresholds. A higher AUC (closer to 1) means better performance, suited for balanced data.

AUPR (PR Area) assesses positive class prediction quality by balancing Precision (accuracy) and Recall (coverage) in the PR curve. It excels in imbalanced data (e.g., rare events). AUC highlights overall class separation, while AUPR focuses on precision-recall trade-offs for positives, making them complementary.

All of these metrics range from 0 to 1, with higher values indicating better performance.

For clustering analysis in the rehabilitation Q&A community, unsupervised learning algorithms are employed. The silhouette coefficient, denoted as s, evaluates these algorithms by measuring cohesion and separation, even without true cluster labels.

$$s=\frac{b-a}{\max(a,b)}$$

(8)

Here, ‘a’ denotes the average distance between a sample point and the other points in the same cluster, reflecting cohesion, while ‘b’ denotes the average distance from the sample point to the points of the nearest other cluster, reflecting separation. The silhouette coefficient ranges from −1 to 1, with values closer to 1 signifying better model performance.
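The silhouette coefficient of a single sample can be computed directly from these two averages, as in the sketch below. The points are hypothetical, Euclidean distance is an assumption, and a sample that is alone in its cluster is given a score of 0 by the usual convention. The score for a whole clustering is the mean over all samples.

```python
import math


def mean_distance(sample, points):
    return sum(math.dist(sample, p) for p in points) / len(points)


def silhouette(sample, own_cluster, other_clusters):
    """own_cluster: the other members of the sample's cluster (sample excluded)."""
    if not own_cluster:
        return 0.0  # singleton cluster: score defined as 0 by convention
    a = mean_distance(sample, own_cluster)  # cohesion
    b = min(mean_distance(sample, cluster) for cluster in other_clusters)  # separation
    if max(a, b) == 0:
        return 0.0
    return (b - a) / max(a, b)


# a = 1 (distance to the one cluster mate), b = min(5, 10) = 5.
s = silhouette((0.0, 0.0), [(0.0, 1.0)], [[(3.0, 4.0)], [(10.0, 0.0)]])
print(round(s, 3))  # 0.8
```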

Experimental design for knowledge mining in rehabilitation Q&A community

Two experiments focus on the knowledge mining goals of the rehabilitation Q&A community: relation extraction and clustering analysis. In the relation extraction experiment, a preprocessed Q&A corpus is annotated with three types of entity relations: symptoms linked to diseases, suitable rehabilitation measures, and unsuitable ones. The BERT-BiGRU-attention model is used for relation extraction. The annotated dataset is split into training and testing sets, with optimal hyperparameters chosen for training. The model’s performance is evaluated on the test set and compared to other baseline models to validate its effectiveness.

Using the results from relation extraction, the identified relations are organized into a relation dictionary, which is then used for clustering analysis. This analysis clusters diseases based on their symptoms and appropriate or inappropriate rehabilitation measures. A specific disease category is analyzed to offer recommendations for managing rehabilitation knowledge in online health communities.
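One way to picture the relation dictionary is as a mapping from each disease to the entities attached to it under each relation type, which can then be turned into a binary vector per disease for clustering. The sketch below is an assumption about the layout, since this section does not specify the format; the relation names, function names, and placeholder entities are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

# Hypothetical relation labels for the three extracted relation types.
RELATION_TYPES = ("symptom", "appropriate", "inappropriate")

Triple = Tuple[str, str, str]  # (disease, relation, entity)


def build_relation_dictionary(triples: Iterable[Triple]) -> Dict[str, Dict[str, Set[str]]]:
    """Group extracted triples by disease and relation type."""
    dictionary = defaultdict(lambda: {r: set() for r in RELATION_TYPES})
    for disease, relation, entity in triples:
        if relation not in RELATION_TYPES:
            continue  # undefined (UKN) relations are skipped in this sketch
        dictionary[disease][relation].add(entity)
    return dict(dictionary)


def to_feature_vectors(dictionary: Dict[str, Dict[str, Set[str]]]):
    """Encode each disease as a 0/1 vector over all (relation, entity) pairs."""
    vocabulary: List[Tuple[str, str]] = sorted(
        {(r, e) for rels in dictionary.values() for r, ents in rels.items() for e in ents}
    )
    vectors = {
        disease: [1 if e in rels[r] else 0 for r, e in vocabulary]
        for disease, rels in dictionary.items()
    }
    return vocabulary, vectors


# Placeholder triples (invented, not results from the paper).
toy_triples = [
    ("disease_a", "symptom", "symptom_1"),
    ("disease_a", "inappropriate", "measure_1"),
    ("disease_a", "UKN", "measure_2"),
    ("disease_b", "symptom", "symptom_1"),
    ("disease_b", "appropriate", "measure_1"),
]
relation_dict = build_relation_dictionary(toy_triples)
vocab, vectors = to_feature_vectors(relation_dict)
print(vocab)
print(vectors)
```

Keeping the relation type inside each feature means that the same measure counts as a different feature when it is appropriate for one disease and inappropriate for another, so the two cases are not conflated during clustering.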

Experimental dataset

The experimental dataset is obtained from the 120ask website (https://www.120ask.com), one of China’s largest online medical consultation platforms, featuring extensive physician resources and a well-developed management system. It has a specific Q&A section for “Rehabilitation Medicine,” where orthopedic postoperative rehabilitation questions are especially detailed and useful for knowledge mining.

For this experiment, data from March 2017 to October 2022 was collected through web scraping in the “Orthopedic Postoperative Rehabilitation” section, resulting in 37,879 entries compiled into a CSV file. The dataset includes patient information, question timestamps, titles and content, number of responding doctors, their account and professional details, response timestamps, and answer content.

In this experiment, the “question title” and “question content” refer to patient inquiries on a Q&A website, while the “answer content” includes doctors’ responses. Each patient inquiry usually has multiple doctor responses, which are fully collected during data scraping. Together, a patient inquiry and its corresponding responses form one data record.
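The record structure described above, one patient inquiry together with all of its doctor responses, can be sketched as follows. The CSV layout is an assumption: the example supposes one row per answer and uses hypothetical column names (`question_title`, `question_content`, `answer_content`), which may differ from the actual scraped file.

```python
import csv
import io
from dataclasses import dataclass, field
from typing import List


@dataclass
class QARecord:
    """One patient inquiry plus every doctor response to it."""
    title: str
    content: str
    answers: List[str] = field(default_factory=list)


def load_records(csv_text: str) -> List[QARecord]:
    """Group answer rows that share the same question into one record."""
    records = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        key = (row["question_title"], row["question_content"])
        if key not in records:
            records[key] = QARecord(title=key[0], content=key[1])
        records[key].answers.append(row["answer_content"])
    return list(records.values())  # dicts keep insertion order


# Placeholder rows (invented) standing in for the scraped CSV.
SAMPLE_CSV = (
    "question_title,question_content,answer_content\n"
    "q1,c1,a1\n"
    "q1,c1,a2\n"
    "q2,c2,a3\n"
)
sample_records = load_records(SAMPLE_CSV)
print(len(sample_records), sample_records[0].answers)
```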

Data preprocessing

Preprocessing for rehabilitation knowledge mining proceeds in four steps. First, the corpus is filtered to remove irrelevant content, such as questions about nucleic acid testing for fractures or the cost of fixed supports. Second, typographical errors and special characters in the medical Q&A corpus are corrected; some of these mistakes originate from speech-to-text input. Third, for the relation extraction experiments, the “question title,” “question content,” and “answer content” of each record are combined into a single line, and the medical entities identified in it by named entity recognition are used as input. Finally, instances containing only one type of entity are filtered out, yielding the final corpus for relation extraction.
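The preprocessing steps can be sketched as small Python helpers. This is an illustrative outline under stated assumptions rather than the authors' pipeline: the keyword list, the function names, and the `(mention, type)` entity format are hypothetical, the cleaning step only removes control characters and extra whitespace, and typo correction is left out because it would need a domain lexicon.

```python
import re
from typing import Iterable, List, Tuple

# Hypothetical keywords marking content unrelated to rehabilitation.
IRRELEVANT_KEYWORDS = ("nucleic acid", "cost")

Entity = Tuple[str, str]  # (mention, entity type), assumed output of an NER step


def is_relevant(text: str) -> bool:
    """Step 1: drop questions that mention an irrelevant keyword."""
    lowered = text.lower()
    return not any(keyword in lowered for keyword in IRRELEVANT_KEYWORDS)


def clean_text(text: str) -> str:
    """Step 2 (simplified): remove control characters and collapse whitespace."""
    text = re.sub(r"[\x00-\x1f\x7f]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def combine_fields(title: str, content: str, answers: Iterable[str]) -> str:
    """Step 3: merge title, question, and all answers into a single line."""
    return clean_text(" ".join([title, content, *answers]))


def has_multiple_entity_types(entities: List[Entity]) -> bool:
    """Step 4: keep only instances with at least two distinct entity types."""
    return len({entity_type for _, entity_type in entities}) >= 2


line = combine_fields("Ankle  fracture", "pain\tafter surgery", ["rest\n", "avoid stairs"])
print(line)
```

Running the filters before the combination step keeps the single-line instances short, and the final entity-type check ensures that every retained instance can contain at least one candidate entity pair.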

Rehabilitation knowledge mining experiment process

Experimental environment

The experiment uses Python 3 and TensorFlow 1.15.0, a Google open-source platform for machine learning and deep learning, applicable in fields like computer vision and natural language processing.

Entity definition and experimental procedure

For rehabilitation knowledge mining, three entities are defined: diseases, symptoms, and rehabilitation measures. Simple entities like “going up and down stairs” were labeled as rehabilitation measures. Relationships between rehabilitation measures and diseases are categorized as “inappropriate” (e.g., “reduce,” “avoid”) or “appropriate” (e.g., “increase,” “improve”). Undefined relationships, noted as UKN, represent unclear links, such as between “fractured ankle” and treatment options in the question about conservative treatment versus surgery. The definitions of entities and relationships are shown in Tables 1 and 2.
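The relation scheme can be made concrete with a small cue-word lookup. The sketch below is only an aid for reading the definitions: it is not the annotation procedure used in the study, the cue lists contain just the example words quoted above, and the function name is hypothetical.

```python
# Cue words quoted in the relation definitions; real annotation guidelines
# would presumably cover many more expressions.
INAPPROPRIATE_CUES = ("reduce", "avoid")
APPROPRIATE_CUES = ("increase", "improve")


def suggest_relation(context: str) -> str:
    """Suggest a measure-disease relation label from cue words in the context."""
    text = context.lower()
    has_inappropriate = any(cue in text for cue in INAPPROPRIATE_CUES)
    has_appropriate = any(cue in text for cue in APPROPRIATE_CUES)
    if has_inappropriate and not has_appropriate:
        return "inappropriate"
    if has_appropriate and not has_inappropriate:
        return "appropriate"
    return "UKN"  # no cue, or conflicting cues: leave the relation undefined


print(suggest_relation("Avoid going up and down stairs"))
print(suggest_relation("Conservative treatment or surgery?"))
```

Contexts with no cue word, or with cues of both kinds, fall back to UKN, mirroring the way unclear links are left undefined in the scheme.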

Table 1 Named entity definition in rehabilitation knowledge.
