Application of the LDA model to identify topics in telemedicine conversations on the X social network
BMC Health Services Research volume 25, Article number: 369 (2025)
Abstract
The evolution of global society in the post-COVID-19 era is marked by the almost obligatory use of digital media in many sectors, including the health sector. Both patients and health professionals frequently use social media to express their concerns about, or interest in, telemedicine. The present research focuses on these social media comments because they represent a valuable data source for researchers. We analyse unstructured tweet texts written by Internet users and apply machine learning, specifically the Latent Dirichlet Allocation algorithm, to model X data and identify tweet topics. The results provide professionals with information on the issues and influencing factors that matter most to telemedicine consumers.
Background
The use of new technologies has transformed society, affecting communication, information seeking and ways of working. Telemedicine, as a remote health practice through ICTs, has grown exponentially, especially after the pandemic.
Objective
We apply a mixed methodology, using both qualitative and quantitative techniques, to explore the conversational topics generated about telemedicine through comments posted by users on X. This allows us to identify primary, secondary, and residual themes.
Methods
Natural Language Processing (NLP) and Machine Learning techniques, specifically the Latent Dirichlet Allocation (LDA) model, were used to analyse 156,633 comments extracted from “X” related to telemedicine topics.
Results
The study revealed several issues to be addressed. Data was collected using keywords such as "teleconsultation" and "telemedicine". We can see that the most frequent words in the comments include words such as "health", "service", "doctor" and "patient". The themes identified were grouped into four dimensions: general information, benefits sought, specific information and professional issues. The results showed that 60.1% of the comments focused on generic telemedicine topics, ease of use and service information. “X” queries were observed to be public and general in nature, focusing on benefits and accessibility, while disease or treatment specific topics were less frequent.
Conclusions
The results provide information for the proper development and study of telemedicine through social networks. “X” is a platform mainly used for general telemedicine queries, with convenience and accessibility as the main benefits mentioned. The results suggest that online telemedicine interactions are complex and offer valuable insights for improving telemedicine communication strategies. Future research could explore the use of hashtags and analyse differences in interaction patterns according to user profile, providing a deeper understanding of audiences' behaviour on social networks. These findings underline the importance of considering audience preferences to improve the effectiveness of telemedicine communications.
Introduction
The use of new technologies has set a new global pace during the past decade, with an ever-increasing presence and impact on society [1]. This social phenomenon has attained global dimensions, driven by interconnectivity. Real-time online communications have multiplied the digital experiences that favour individual interactions and social connections, and this is also the case within the healthcare sector.
Telemedicine is often defined as medical practice through the use of ICT (Information and Communication Technologies), where patients and medical professionals are in different locations [2]. Although telemedicine has been used for years, the demand for remote patient monitoring services has grown exponentially since the pandemic, with particular interest from US and European governments in promoting teleconsultations. Data released by Global Market Insights [3] show that the digital health market has reached a value of $233.5 billion and is expected to grow to $981.5 billion by 2032. The telemedicine concept has never been so trendy, so necessary, and so disruptive [4].
Meanwhile, technology has also changed the way people, both patients and professionals, share and access information within the health sector [5], and these tools for consultation and information exchange have become vital. This was evidenced during the COVID-19 pandemic by the huge amount of user-generated content on the topic. In fact, searching for health information on social networks has become a daily routine for Internet users because of (1) the large amount of content on these media, (2) the high level of interactivity among users and, most of all, (3) the easy access to information [6, 7]. Social networks have shown a particularly high capacity to disseminate information and user experiences [6]. Many users take into account other patients’ evaluations of, or experiences with, their general practitioners before making a personal decision, while others follow their instincts after reading attractive publications [8].
Easy access to information is mainly due to the speed with which social networks deliver it and to their convenience, since they can be accessed any time, anywhere [9]; even real-time medical evaluations or remote interactions with health experts are possible. Finally, social media provide space for patient/professional interactions and also for peer interactions (patients with patients or professionals with professionals); these conversations sometimes encourage, or even determine, the choice of a particular doctor [9]. Social networks are thus heavily involved in the eWOM (electronic word-of-mouth) phenomenon.
According to Farsi et al. [9], patients use social networks with the following objectives: (1) Health information, (2) Telemedicine services, (3) Search for some health care professional, (4) User support and shared experiences, (5) Positive influence on health behavior. On the other hand, healthcare providers use social networks with the following objectives: (1) Promotion of healthy habits, (2) Professional development or promotion of practice, (3) Recruitment, (4) Professional networks and stress relief, (5) Professional medical education, (6) Telemedicine, (7) Scientific research and (8) Critical issues on public health care.
Some researchers use the data shared on social networks by future clients or users to study patient experiences and identify relevant behavioural factors [10]. In fact, social networks provide an interesting source of data for researchers because consumers express themselves freely and seek information on topics that are currently relevant to them, as during the COVID-19 pandemic [11]. Social networks are valuable Internet-based applications for users to get quick information, shared experiences or peer support [12], and an important data source for researchers to analyse their behaviour.
However, despite this advantage, this information source has rarely been used by researchers to analyse consumer behaviour in the field of telemedicine, notwithstanding the already mentioned interest in internet conversations on healthcare and telemedicine topics. The study of these comments, made by social media users, could significantly alleviate the existing lack of data on how telemedicine is perceived and used by both patients and professionals.
Our objective was to develop a staged method to understand how telemedicine is used. The main objective of the study is to reveal the interests most relevant to telemedicine users through their online comments. Identifying the telemedicine-related issues that matter to patients and professionals is essential for listing primary, secondary and residual issues. To do so, we formulated the following question:
- RQ1: What are the most important telemedicine-related topics (issues) for doctors and patients?
Our study focuses on the dynamics of telemedicine-related conversation topics on social networks, particularly on the X social network. Research published on the digital world has shown the X social network to be a favourite of users and professionals, with different uses for each group. While patients use it to get information on particular health problems [13], medical professionals use X as a tool to disseminate knowledge and information among the population [14]. For professionals, the internet is a very useful tool to promote positive healthcare habits and behaviours [9], because information can be widely shared in very little time and is easy for the public to access.
Methodology
For the present study, we decided to use Natural Language Processing (NLP) and the Latent Dirichlet Allocation (LDA) model, a technique based on Machine Learning (ML) that has been widely and successfully applied to automatically detect topics, with the Python programming language [15, 16]. Topic detection algorithms, including LDA, are useful for grouping documents and organizing large amounts of textual data. They allow information to be retrieved from unstructured texts, which are then grouped according to context patterns. This technique has been used and recommended not only to detect recurring topics, but also to discover hidden dimensions or patterns in a collection of texts [10, 17,18,19]. Once the methodology was chosen, the workflow shown in Fig. 1 was applied to carry out the study:
Data collection
Telemedicine tweets published on the X social network were collected using the Snscrape library. After validation by two telemedicine professionals and two researchers specialising in online user behaviour, we selected four Spanish keywords, ‘teleconsultation’, ‘virtual assistance’, ‘virtual doctor’ and ‘telemedicine’, to be used as extraction parameters. The search was carried out for each keyword separately. During the data collection phase we extracted 26,052 tweets with the word ‘teleconsultation’, 129,858 with ‘telemedicine’, 687 with ‘virtual assistance’ and 36 with ‘virtual doctor’: a total of 156,633 tweets in Portuguese, Spanish, Italian and English.
Other complementary data on contents and personal information were also retrieved together with the tweet such as: tweet ID number, name of user posting the tweet, number of retweets or comments, number of ‘likes’, type of language used, other users mentioned in the tweet, tweet URL, user location or hashtags contained in the tweet.
Selection, cleaning, transformation and pre-processing
After data collection, tweets were gathered into a single data frame and duplicate or incomplete texts were removed. In line with the main aim of the present study, only tweets in Spanish were then selected, giving a total of 109,662 tweets in the final database.
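The selection step described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the authors' code: the record layout (dictionaries with `text` and `lang` keys), the `es` language code and the function name `select_spanish_tweets` are assumptions made for the example.

```python
def select_spanish_tweets(tweets):
    """Drop incomplete and duplicate tweets, then keep only the Spanish ones.

    `tweets` is a list of dicts with hypothetical keys "text" and "lang".
    """
    seen_texts = set()
    selected = []
    for tweet in tweets:
        text = (tweet.get("text") or "").strip()
        if not text:
            continue  # incomplete record: no usable text
        if text in seen_texts:
            continue  # exact duplicate of a tweet already seen
        seen_texts.add(text)
        if tweet.get("lang") == "es":
            selected.append(tweet)
    return selected


sample = [
    {"id": 1, "text": "La telemedicina ahorra tiempo", "lang": "es"},
    {"id": 2, "text": "La telemedicina ahorra tiempo", "lang": "es"},  # duplicate
    {"id": 3, "text": "Telemedicine saves time", "lang": "en"},        # not Spanish
    {"id": 4, "text": "", "lang": "es"},                               # incomplete
    {"id": 5, "text": "Teleconsulta disponible hoy", "lang": "es"},
]
spanish = select_spanish_tweets(sample)
print([t["id"] for t in spanish])  # [1, 5]
```

Duplicates are detected here on the exact tweet text; a real pipeline might instead compare tweet IDs or normalised text.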
A first technical step was necessary to adapt the textual format before text processing could be applied: uppercase letters were converted to lowercase, and all non-alphabetic characters such as @, !, ?, numbers, symbols and special characters were removed. A second step focused on the removal of all stop-words, since these words carry no significant weight or are irrelevant for data interpretation [17, 20]. The NLTK library was used at this stage, and a total of 320 stop-words were deleted from the data frame.
All final texts were then divided into word units (tokens) to be analysed individually. During the lemmatization process, word forms sharing the same meaning were grouped to produce the cleaned tokens used during the mining stage. This is shown in Table 1:
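The two cleaning steps above can be sketched as follows. The study used the NLTK stop-word list; to keep the sketch self-contained, a tiny hand-written Spanish stop-word set stands in for it, which is an assumption, as are the names `STOP_WORDS`, `NON_ALPHA` and `clean_tweet`.

```python
import re

# Tiny illustrative stop-word set (an assumption); the study used NLTK's list.
STOP_WORDS = {"la", "el", "de", "en", "y", "que", "a", "los", "las",
              "un", "una", "por", "con", "para", "es"}

# Anything that is not a lowercase Spanish letter or whitespace becomes a space.
NON_ALPHA = re.compile(r"[^a-záéíóúüñ\s]")


def clean_tweet(text):
    """Lowercase, strip non-alphabetic characters, split and drop stop-words."""
    lowered = text.lower()
    letters_only = NON_ALPHA.sub(" ", lowered)
    return [word for word in letters_only.split() if word not in STOP_WORDS]


tokens = clean_tweet("¡La TELEMEDICINA ahorra tiempo en 2020! @usuario ¿Y la consulta?")
print(tokens)  # ['telemedicina', 'ahorra', 'tiempo', 'usuario', 'consulta']
```

Note that removing `@` leaves the mentioned handle itself as a token; whether such handles should be dropped entirely is a design choice the text does not specify.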
Data mining
We used topic detection algorithms to detect the most relevant topics for X users. Numerous algorithms are available for building topic detection models. Each system applies a different algorithm; however, all models are based on the same fundamental assumption: each text contains various topics, and each topic contains a collection of words. In the present study, we used the unsupervised machine learning model known as LDA. According to Stevens et al. [17], LDA learns the relationships between words, topics, and texts by assuming that texts are generated by a specific probabilistic model. Such a model therefore provides better results when dealing with a large volume of texts, which is exactly the case for the present study. The model takes texts as input and outputs the latent topics in the corpus; it also automatically detects and extracts latent semantic relationships from large volumes of information [17, 20]. These characteristics influenced our choice of topic detection model.
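For reference, the probabilistic model that LDA assumes is usually written as the generative process below. The notation is the standard textbook one and is not taken from the study: K topics, D texts, N_d words in text d, and Dirichlet hyperparameters alpha and beta are all symbols introduced here for illustration.

```latex
\begin{align*}
\varphi_k &\sim \mathrm{Dirichlet}(\beta), & k &= 1,\dots,K
  && \text{(word distribution of topic } k\text{)} \\
\theta_d &\sim \mathrm{Dirichlet}(\alpha), & d &= 1,\dots,D
  && \text{(topic distribution of text } d\text{)} \\
z_{d,n} \mid \theta_d &\sim \mathrm{Categorical}(\theta_d), & n &= 1,\dots,N_d
  && \text{(topic of the } n\text{-th word in text } d\text{)} \\
w_{d,n} \mid z_{d,n}, \varphi &\sim \mathrm{Categorical}(\varphi_{z_{d,n}})
  &&&& \text{(the observed word)}
\end{align*}
```

Fitting the model means inferring the hidden quantities (the topic mixture of each text and the word distribution of each topic) from the observed words alone.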
To build our LDA topic model, we created a dictionary and corpus with the extracted tweets. The tokens, cleaned up during the previous stage, were also imported to create a specific dictionary and generate a corpus for the LDA model.
This dictionary contains 185,302 words. To reduce possible noise during the creation of our model, words were filtered by frequency: words appearing only once, and those representing a fraction of the total corpus size below 0.25%, were removed. This was the case for 25 items. Combining the fraction of the total corpus size with the absolute count was useful because it gave a more precise measure of the relative frequency of terms. We were left with a final corpus of 100,000 words to construct our model.
Once the dictionary and corpus had been obtained for the two data frames, we used the Latent Dirichlet Allocation (LDA) implementation from the Gensim library in Python. Tweets were fed in as a text corpus and the LDA algorithm was applied for topic detection. The three main preset parameters were: number of topics: 8 for all comments; words per topic: 5 for all comments; texts handled by alpha: 10.
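A frequency filter of this kind can be sketched with the standard library alone. The sketch reads "fraction of the total corpus size" as the share of texts that contain a word, which is one possible interpretation and an assumption; the function name `filter_vocabulary`, its parameters and the toy threshold are likewise illustrative and not the authors' settings.

```python
from collections import Counter


def filter_vocabulary(docs, min_count=2, min_doc_fraction=0.0025):
    """Keep tokens seen at least `min_count` times overall and present in at
    least `min_doc_fraction` of the documents; return filtered docs and vocab."""
    term_counts = Counter(tok for doc in docs for tok in doc)
    doc_counts = Counter(tok for doc in docs for tok in set(doc))
    n_docs = len(docs)
    keep = {
        tok for tok in term_counts
        if term_counts[tok] >= min_count
        and doc_counts[tok] / n_docs >= min_doc_fraction
    }
    filtered = [[tok for tok in doc if tok in keep] for doc in docs]
    return filtered, keep


docs = [
    ["telemedicina", "consulta", "salud"],
    ["telemedicina", "doctor"],
    ["consulta", "telemedicina", "plataforma"],
    ["salud", "telemedicina", "consulta"],
]
# A deliberately high threshold so that the toy corpus shows both filters at work.
filtered, vocab = filter_vocabulary(docs, min_count=2, min_doc_fraction=0.6)
print(sorted(vocab))  # ['consulta', 'telemedicina']
```

In this toy run, "doctor" and "plataforma" are dropped because they appear only once, and "salud" because it appears in only half of the texts.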
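To make the modelling step concrete, the sketch below fits a small LDA model and lists the top words per topic. It is not the Gensim implementation used in the study: Gensim's LdaModel relies on online variational Bayes, whereas this self-contained example swaps in a plain collapsed Gibbs sampler so that it runs with the standard library only. The function names, the hyperparameter values (`alpha`, `beta`, `iterations`, `seed`) and the toy corpus are all assumptions; the demo uses 2 topics because the corpus is tiny, while the study used 8 topics and 5 words per topic.

```python
import random


def lda_gibbs(docs, num_topics=8, alpha=0.1, beta=0.01, iterations=200, seed=0):
    """Collapsed Gibbs sampling for LDA on tokenised docs.

    Returns (topic-word counts, document-topic counts, vocabulary)."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    vocab_size = len(vocab)
    n_dk = [[0] * num_topics for _ in docs]              # doc d, topic k
    n_kw = [[0] * vocab_size for _ in range(num_topics)]  # topic k, word w
    n_k = [0] * num_topics                                # tokens per topic
    assignments = []
    for d, doc in enumerate(docs):  # random initial topic for every token
        z_doc = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_doc.append(k)
            n_dk[d][k] += 1
            n_kw[k][word_id[w]] += 1
            n_k[k] += 1
        assignments.append(z_doc)
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                wid = word_id[w]
                k_old = assignments[d][n]
                # Remove the token's current assignment from the counts.
                n_dk[d][k_old] -= 1
                n_kw[k_old][wid] -= 1
                n_k[k_old] -= 1
                # Unnormalised conditional probability of each topic.
                weights = [
                    (n_dk[d][k] + alpha) * (n_kw[k][wid] + beta)
                    / (n_k[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                k_new = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][n] = k_new
                n_dk[d][k_new] += 1
                n_kw[k_new][wid] += 1
                n_k[k_new] += 1
    return n_kw, n_dk, vocab


def top_words(n_kw, vocab, words_per_topic=5):
    """Most frequently assigned words for each topic."""
    return [
        [vocab[i] for i in sorted(range(len(vocab)), key=lambda i: -row[i])[:words_per_topic]]
        for row in n_kw
    ]


docs = [
    ["telemedicina", "consulta", "doctor", "paciente", "salud"],
    ["consulta", "doctor", "paciente", "cita", "salud"],
    ["internet", "plataforma", "tecnologia", "recursos", "servicio"],
    ["plataforma", "tecnologia", "internet", "servicio", "acceso"],
    ["doctor", "paciente", "salud", "consulta", "telemedicina"],
    ["recursos", "internet", "plataforma", "acceso", "tecnologia"],
]
n_kw, n_dk, vocab = lda_gibbs(docs, num_topics=2)
topics = top_words(n_kw, vocab, words_per_topic=5)
for k, words in enumerate(topics):
    print(k, words)
```

The document-topic counts in `n_dk` can also be used to assign each text to its dominant topic, which is how texts are classified into the discovered topics.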
All three parameters were given to the algorithm so that the model could adjust the distribution of topics within texts and of keywords within topics until the results showed an appropriate keyword distribution. A topic can be defined as a probability distribution over the words of the vocabulary. In practice, it is a collection of prominent keywords, that is, words with a high probability for that particular topic, which is very useful for identifying what the topic is about. This process unveils the hidden topics within the collection, classifies the texts into the discovered topics, and uses that classification to organize, summarize, and search all texts. Once the machine learning process was completed, the results shown in Figs. 2 and 3 were obtained. At this stage of the study, we were able to answer our second preliminary question.
We analysed public tweets published in the X social network so as to identify all telemedicine-related conversational topics. And, though the X data is publicly accessible, we implemented rigorous ethical measures to guarantee user privacy and protect all sensitive data during the study.
Findings
Figure 4 represents the most frequent words returned by the system. These are all closely related to the very concept of telemedicine: ‘health’, ‘service’, ‘doctor’ and ‘patient’. We also found words for internet-related tools, like ‘Internet’, ‘resources’, ‘technology’, ‘platform’; action-based words like ‘consultation’, ‘to receive’, ‘provide’, ‘attention’, ‘services’; moments like ‘day’, ‘today’, ‘time’, ‘now’; and even circumstances like ‘pandemic’, ‘covid19’, ‘distance’, ‘home’. All these words are linked, in some way, to the telemedicine context, as reflected by the word cloud.
To avoid possible interpretation errors with the words in the cloud, we established a top 20 of most extracted terms, as represented in Fig. 2.
The top 20 most relevant terms also show a very strong connection with the teleconsultation topic. They also cover more general topics, like medical consultations, and functional aspects related to technology and resources.
Most commented topics in all tweets were grouped in 8 categories, as shown in Fig. 3.
The figure represents the most salient terms in the different comments, such as ‘doctor’, ‘provides’, ‘resources’, ‘consultation’, ‘technological’, ‘Internet’, ‘to receive’, ‘health’ and ‘virtual assistance’. It also shows 5 clearly differentiated groups. Two pairs of groups are superimposed: topic 5 with topic 7, and topic 4 with topic 6. Topics 6 and 8 also partially overlap. This would indicate that some groups are related, though different.
Groups 3, 5 and 7 all refer to requests for information on specific aspects of telemedicine: general consultation, how to use the app and how to request services. Since all of them concern information requests, they are placed in the same dimension.
These eight topics can be described through 4 main dimensions: (1) General information, (2) Researched benefits, (3) Specific information and (4) Professional topics. These dimensions are of importance to interpret better and understand the overlapping of groups. Such a dimensional difference between these two overlapping groups is paramount.
We present (in the following Table 2) all topics, their relevance, and related keywords for each topic.
These results by dimension underline the outstanding weight of ‘Specific information’ with 35.1%, followed by ‘Professional topics’ (23.7%), ‘General information’ (21.5%) and, finally, ‘Researched benefits’ (19.7%). However, when focusing on individual topics, three of them account for 60.1% of the tweets (‘Generic Telemedicine’, ‘Ease of Use’ and ‘Service Information’).
Discussion and conclusion
Our study produced several significant findings, which are aligned with the literature published on telemedicine uses. Among others, our results coincide with those reported by Farsi [12]. Firstly, both point to the abundance of information exchanged during the pandemic (a 7-month period in 2020), as reflected in Fig. 4. Secondly, the topics we detected also coincide with the three main topics reviewed by researchers: (1) services (‘attention’, ‘consultation’, ‘virtual assistance’, etc.), (2) ICTs (‘internet’, ‘resource’, ‘system’, etc.) and (3) process (‘request’, ‘to receive’, ‘provides’, etc.).
Our findings also reveal detailed information about how X members use the platform, since they tend to search for information on general aspects. Users consult fairly generic topics, like telemedicine information or advantages, without delving into specifics such as particular diseases or medical specialists. Our results differ from those of Farsi [12], which included various platforms and a wider variety of topics, including specific ones like cancer, chronic disease and chemical treatment, among others. This difference suggests that the type of platform matters: it seems to determine the type of telemedicine consultation. On X, most tweets are public and generic in nature, while on more specialised platforms, like those analysed by Farsi [12], the audience shows interest in more specialised matters.
Our results also unveil the existence of two well-differentiated user profiles, professionals and patients, which reflect different concerns and uses. In both cases, we observed the emergence of a ‘novice user’ of telemedicine. This novice profile is reflected in the abundance of general comments typical of first-time users, whose lack of experience shows in their search for basic information. This is also the case for professionals seeking interlocutors to get information on professional business issues, for example on starting or launching their online activity.
Finally, we noted two other relevant issues. Firstly, the most frequently used words in these tweets are not necessarily the most important ones; common words are not always the most meaningful for the ongoing discussion. This is the case for the word ‘COVID’, for example: it is included in the top 20 most mentioned words, but it does not appear among the topic keywords. The second issue concerns the massive amount of interaction between users (over 66,400 tweets), a fact that underlines the importance of the peer support role played by social networks for both patients and healthcare professionals.
Our study also differs substantially from previous literature on telemedicine from a methodological point of view. The difference lies mainly in the use of keywords, rather than the hashtags used in previous studies, as the search tool on comments. Hashtags help to avoid data biases and ease the tedious work of cleaning up non-relevant comments; however, not all internet users include hashtags when posting comments. Studying only tweets with hashtags therefore restricts the information gathered, excluding everything contained in comments without hashtags. Our study, based on keywords rather than hashtags, overcomes this limitation.
The semantic similarity of the extracted words was based on proximity in the vector space; this plays an essential role in the creation of groups by associating words with similar meanings. One limitation is that words with multiple meanings can be grouped into overlapping topics, because no distinction can be made between the different meanings of a word in different contexts. This can result in the misgrouping of words with shared meanings but different contextual uses.
X is a generic platform where Internet users, both patients and professionals, often consult information about telemedicine. Primary searches focus on patient aspects like medical advantages, telemedicine convenience, and accessibility. Marketers thus need to take the preferences of these users into account to improve the effectiveness of telemedicine communications and ensure greater user engagement and understanding. Telemedicine service managers should also be aware of the importance of user information and education regarding ICT tools. Aspects such as how to request an appointment, how to navigate platforms and how to use certain tools are commonly discussed among consumers, particularly among new internet or social media users. Improving users’ digital literacy with these tools is paramount to increasing their trust and understanding, and it also enables a wider use of telemedicine. This could make a real difference for the development and continuous improvement of telemedicine services worldwide.
New lines of research
The identification of the themes used by X members presented here covers the first part of the telemedicine uses and results obtained during our study. These first results suggest possible differences between tweets with and without peer support, and different behaviours depending on whether the user is a professional or a member of the general public. Specific lines of interest for further studies therefore include the different behavioural patterns, topics of interest and uses of social networks observed between users with and without peer support, or between professionals and non-professionals. Results along these lines could positively affect how professionals plan their information strategies, how information is disclosed on the internet, and patients’ experiences with telemedicine itself.
Other future research lines proposed by our team concern the longitudinal analysis of these uses over time, to better study and understand user behaviour. The massive number of comments over different periods (before, during, and after COVID-19) opens up many possible lines of enquiry, for example on how discussion trends change over time.
In conclusion, the findings of the present research underline the complexity and richness of online interactions and provide telemedicine managers with valuable insights into the behaviour of social network audiences. The proposed research lines should expand our knowledge of online interactions in the context of telemedicine and benefit both healthcare professionals and the general public, whose engagement with these platforms can help improve the future of healthcare and of global communication.
Data availability
Data is provided in the following repository https://docs.google.com/spreadsheets/d/1fzxekofjyvL601wrmQtBffOmzAX5PPQA/edit?usp=sharing&ouid=112451748020460408104&rtpof=true&sd=true.
Abbreviations
- LDA: Latent Dirichlet Allocation
- WHO: World Health Organization
- ICT: Information and Communication Technologies
References
Behl A, Nigam A, Vrontis D. Guest editorial overview:‘ mapping the future of consumer behaviour using disruptive technologies.’ J Consum Behav. 2024;23(4):1854–8.
Peetso T. Telemedicine: the time to hesitate is over! Eurohealth OBSERVER Eurohealth Incorporating Euro Observer. 2014;20(3):15–7. https://www.united4health.
Swain R, Kharad S. Direct-to-consumer (DTC) genetic testing market. Glob Market Insights. 2023. https://www.gminsights.com/industry-analysis/direct-to-consumer-dtc-genetic-testing-market
Silva GS, Schwamm LH. Advances in stroke: Digital Health. Stroke. 2021;52(1):351–5. https://doi.org/10.1161/STROKEAHA.120.033239. Wolters Kluwer Health.
Chaet D, Clearfield R, Sabin JE, Skimming K. Ethical practice in telehealth and telemedicine. J Gen Intern Med. 2017;32(10):1136–40. https://doi.org/10.1007/s11606-017-4082-2.
Afful-Dadzie E, Afful-Dadzie A, Egala SB. Social media in health communication: a literature review of information quality. Health Inform Manage J. 2023;52(1):3–17. https://doi.org/10.1177/1833358321992683. SAGE Publications Inc.
Ojeda-Martín Á, López-Morales P, Jáuregui-Lobera I, Herrero-Martín G. Use of social networks and risk of suffering from ED in young people. J Negat No Positive Results. 2021;6(10):1289–307.
Alalawi A, Aljuaid H, Natto ZS. The effect of social media on the choice of dental patients: a cross-sectional study in the City of Jeddah, Saudi Arabia. Patient Prefer Adherence. 2019;13:1685–92. https://doi.org/10.2147/PPA.S213704.
Farsi D, Martinez-Menchaca HR, Ahmed M, Farsi N. Social media and health care (Part II): narrative review of social media use by patients. J Med Internet Res. 2022;24(1). https://doi.org/10.2196/30379. JMIR Publications Inc.
Mishra M. Customer experience: extracting topics from tweets. Int J Market Res. 2022;64(3):334–53. https://doi.org/10.1177/14707853211047515.
Garcia K, Berton L. Topic detection and sentiment analysis in Twitter content related to COVID-19 from Brazil and the USA. Appl Soft Comput. 2021;101:107057. https://doi.org/10.1016/j.asoc.2020.107057.
Farsi D. Social media and health care, Part I: literature review of social media use by health care providers. J Med Internet Res. 2021;23(4). https://doi.org/10.2196/23205. JMIR Publications Inc.
Antheunis ML, Tates K, Nieboer TE. Patients’ and health professionals’ use of social media in health care: motives, barriers and expectations. Patient Educ Couns. 2013;92(3):426–31. https://doi.org/10.1016/j.pec.2013.06.020.
Sánchez Conde J, et al. Análisis cualitativo de la cuenta de twitter de la Federación de Asociaciones de Matronas de España. Enfermería Glob. 2022;21:488–513. https://doi.org/10.6018/eglobal.502891.
Da C, Duan Y, Ji Z, Chen J, Xia H, Weng Y, … & Cai T. Assessing the needs of patients with breast cancer and their families across various treatment phases using a Latent Dirichlet Allocation model: a text-mining approach to online health communities. Support Care Cancer. 2024;32(5):314.
Zhou S, Kan P, Huang Q, Silbernagel J. A guided latent dirichlet allocation approach to investigate real-time latent topics of Twitter data during hurricane Laura. J Inform Sci. 2023;49(2):465–79.
Stevens K, Kegelmeyer P, Andrzejewski D, Buttler D. Exploring Topic Coherence over many models and many topics. Association for Computational Linguistics. 2012. http://mallet.cs.umass.edu/
Aggarwal S, Gour A. Peeking inside the Minds of tourists using a novel web analytics approach. J Hospitality Tourism Manage. 2020;45:580–91. https://doi.org/10.1016/j.jhtm.2020.10.009.
Pardo C, Pagani M, Savinien J. The strategic role of social media in business-to-business contexts. Ind Mark Manage. 2022;101:82–97. https://doi.org/10.1016/j.indmarman.2021.11.010.
Tran B, Xuan Nghiem, Son Sahin, Oz V, Manh T, Vu T, Manh CA, Tam Wilson, Ho CSH, Ho Roger CM. Modeling research topics for artificial intelligence applications in medicine: Latent Dirichlet allocation application study. J Med Internet Res. 2019;21(11):1–13.
Acknowledgements
Not applicable.
Authors’ information
The authors involved have different levels of professional qualifications. MSM is an academic at the University of Malaga and a professional expert in technological fields and social networks, working on the development and study of the behaviour of individuals through new technologies. PAU is a professor at the University of Malaga with extensive experience in qualitative studies related to consumer behaviour and has a solid research career. FWC is an expert in data mining and Big Data studies. All authors have experience in qualitative consumer behaviour studies.
Funding
Not applicable.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Martín, M.S., Chen, FW. & Urbistondo, P.A. Application of the LDA model to identify topics in telemedicine conversations on the X social network. BMC Health Serv Res 25, 369 (2025). https://doi.org/10.1186/s12913-025-12493-3
DOI: https://doi.org/10.1186/s12913-025-12493-3