Abstract
Short texts such as messages, tweets, and comments have recently become a large portion of online text data. They are limited in length and differ from traditional documents in their shortness and sparseness. As a result, short text tends to be ambiguous, and the degree of ambiguity varies from one language to another. Because Arabic is a highly inflectional language in which a single word can have multiple meanings, short text representation plays a vital role in any Text Mining task. To address these issues, we propose an efficient representation for short text based on concepts instead of terms, using BabelNet as an external knowledge resource. However, during conceptualization, searching for the concepts that correspond to a polysemous term returns multiple matches. Assigning a term to a concept is therefore a crucial step, and we believe that short text similarity can help overcome the problem of mapping a term to its corresponding concept. In this paper, we reintroduce the Web-based Kernel function for measuring the semantic relatedness between concepts in order to disambiguate an expression that matches multiple concepts. The proposed method has been evaluated using an Arabic short text categorization system, and the results obtained demonstrate the value of our contribution.
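The disambiguation step described above can be illustrated with a minimal sketch. It assumes that each candidate concept is represented by a short label, that every short text is expanded with retrieved snippets, and that the candidate whose expanded vector is most similar to the expanded context is selected. The `fetch_snippets` function, the `TOY_SNIPPETS` corpus, and the English example are hypothetical stand-ins for a real search engine and for BabelNet lookups. Raw term frequencies are used here in place of TF-IDF weights for brevity. It is not the implementation evaluated in the paper.

```python
import math
from collections import Counter

# Hypothetical stand-in for a Web search engine: query -> retrieved snippets.
# A real system would query the Web; this toy corpus is an assumption.
TOY_SNIPPETS = {
    "bank river water": [
        "the river bank was flooded with water",
        "fishing on the bank of the river",
    ],
    "bank (financial institution)": [
        "a bank lends money and holds deposits",
        "the bank raised interest rates on loans",
    ],
    "bank (river side)": [
        "the bank of a river is the land beside the water",
        "river bank erosion after the flood",
    ],
}


def fetch_snippets(query):
    """Return snippets for a query (falls back to the query itself)."""
    return TOY_SNIPPETS.get(query, [query])


def unit_vector(text):
    """L2-normalised term-frequency vector of a text."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {t: c / norm for t, c in counts.items()} if norm else {}


def expand(query):
    """Expanded representation: normalised centroid of the snippet vectors."""
    centroid = Counter()
    for snippet in fetch_snippets(query):
        for term, weight in unit_vector(snippet).items():
            centroid[term] += weight
    norm = math.sqrt(sum(w * w for w in centroid.values()))
    return {t: w / norm for t, w in centroid.items()} if norm else {}


def kernel(text_a, text_b):
    """Similarity of two short texts: inner product of their expansions."""
    vec_a, vec_b = expand(text_a), expand(text_b)
    return sum(w * vec_b.get(t, 0.0) for t, w in vec_a.items())


def disambiguate(context, candidate_concepts):
    """Pick the candidate concept most similar to the context."""
    return max(candidate_concepts, key=lambda c: kernel(context, c))


candidates = ["bank (financial institution)", "bank (river side)"]
best = disambiguate("bank river water", candidates)
```

In this toy setting `best` is the river-side sense, because its snippets share more vocabulary with the expanded context. With real data, TF-IDF weighting would reduce the influence of frequent function words on the score.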



Cite this article
Bekkali, M., Lachkar, A. An effective short text conceptualization based on new short text similarity. Soc. Netw. Anal. Min. 9, 1 (2019). https://doi.org/10.1007/s13278-018-0544-8
DOI: https://doi.org/10.1007/s13278-018-0544-8