Abstract
Single-cell RNA sequencing (scRNA-seq) has emerged as a pivotal tool for exploring cellular landscapes across diverse species and tissues. Precise annotation of cell types is essential for understanding these landscapes, relying heavily on empirical knowledge and curated cell marker databases. In this study, we introduce MarkerGeneBERT, a natural language processing (NLP) system designed to extract critical information from the literature regarding species, tissues, cell types, and cell marker genes in the context of single-cell sequencing studies. Leveraging MarkerGeneBERT, we systematically parsed full-text articles from 3702 single-cell sequencing-related studies, yielding a comprehensive collection of 7901 cell markers representing 1606 cell types across 425 human tissues/subtissues, and 8223 cell markers representing 1674 cell types across 482 mouse tissues/subtissues. Comparative analysis against manually curated databases demonstrated that our approach achieved 76% completeness and 75% accuracy, while also unveiling 89 cell types and 183 marker genes absent from existing databases. Furthermore, we successfully applied the compiled brain tissue marker gene list from MarkerGeneBERT to annotate scRNA-seq data, yielding results consistent with original studies. Our findings underscore the efficacy of NLP-based methods in expediting and augmenting the annotation and interpretation of scRNA-seq data, providing a systematic demonstration of the transformative potential of this approach. The 27323 manually reviewed sentences used to train MarkerGeneBERT and the source code are hosted at https://github.com/chengpeng1116/MarkerGeneBERT.
Introduction
Single-cell sequencing technology has opened a burgeoning field of research across numerous species and tissues owing to its exceptional resolution at the single-cell level1. This advancement has laid the foundation for comprehensive exploration of cellular landscapes, allowing for precise delineation of all cell types within distinct tissues and organs. Achieving a thorough annotation of diverse cell types necessitates the identification of potential cell types within tissues and the subsequent aggregation of the corresponding cell type marker genes via comprehensive literature reviews or by referencing existing databases. Notably, existing tools such as CellAssign2 and scCATCH3 provide coarse-grained annotation by leveraging such databases4,5,6. Additionally, various databases, including CellMarker2.07, PanglaoDB8, singleCellBase9, PCMDB10, and CancerSEA11, have been established, offering an extensive collection of cell markers for different species and tissue types. These databases are predominantly sourced through manual review and curation of scientific articles, enabling the acquisition of highly accurate marker genes; however, this approach demands substantial human effort and time.
Numerous text mining-based methodologies have been implemented in various research fields for identifying entities of interest and discerning the relationships between these entities by parsing syntactic dependencies within the text. For instance, Shetty et al. developed a language model called MaterialsBERT, which was trained on 2.4 million abstracts from the polymer literature to autonomously extract various properties of organic and polymer materials from abstracts12. Gu et al. employed a pretrained NLP text mining system called MarkerGenie to identify entities of interest, such as diseases, microbiomes, genes, and metabolites, that were mentioned in texts. After entity identification, the system parses the syntactic structure of the text and extracts contextual features for each word, thereby distinguishing between the types of relationships (diagnostic, predictive, prognostic, predisposing, or treatment related) among diseases, microbiomes, genes, and metabolites13. Naseri et al. utilized an NLP pipeline to identify pain-related medical terms from largely unstructured and non-standardized clinical consultation notes, subsequently predicting pain scores based on recognized pain terms14. Doddahonnaiah et al. utilized a precompiled cell type and gene vocabulary to assess the correlation between gene and cell type entities by calculating their co-occurrence frequency within more than 26 million biomedical documents15. In summary, these published methods provided a more efficient and comprehensive analysis of research articles than manual curation by aiding in the identification of rare or novel entities of interest along with their interrelations.
In this study, we present MarkerGeneBERT, an NLP-based system designed for the automatic extraction of cell marker genes from single-cell sequencing studies. Leveraging biomedical corpora such as CRAFT16, JNLPBA17, and BIONLP13CG18 along with a text classification model trained on a manually curated dataset of 27323 sentences, MarkerGeneBERT aims to automatically identify cell and gene entities while removing false positive associations. We collected 3702 single-cell sequencing articles published from January 2017 to June 2023 from free-text PubMed and PubMed Central, processed them with MarkerGeneBERT to extract cell marker genes, and validated our findings against manually curated databases. Moreover, we applied our marker gene list using scCATCH for cell cluster annotation in brain tissue samples, yielding results consistent with prior studies. An overview of MarkerGeneBERT is given in Fig. 1; the system consists of four main components: literature retrieval, extraction of marker-related sentences, establishment of cell-marker associations, and inference of species, tissue, and disease information within the articles.
Methods
Data collection
The main texts of single-cell RNA sequencing studies were downloaded and parsed from free-text PubMed and PubMed Central. Specifically, we employed the R package "RISmed"19 to retrieve literature using the search terms "Animals"[MeSH Terms] AND "Single-Cell Analysis"[MeSH Terms] OR "single-cell" AND "expression" within a specified time frame. These rules enabled us to obtain a comprehensive collection of PMIDs from single-cell research-related studies. Subsequently, using the R package "easyPubMed"20, we acquired basic information such as titles, abstracts, and literature sources for each PMID. For the literature sourced from PMC, we utilized the R package "europepmc" to retrieve the main text documents and systematically extracted sections, including the introduction, methods, and results. For other manually collected literature in PDF format, we employed the Python library "scipdf_parser" to parse the PDF files and extract the same sections (introduction, methods, and results) from the parsed output.
Marker-related sentence classification model
Supervised training data generation for marker-related sentence classification
To identify marker-related sentences in the main text of the literature, specifically those containing both cell and gene names in a particular syntactic structure, such as "Gene A is a marker of Cell B" or "Gene A (specific to Cell B)", we constructed a text classification model based on the "textcat" component of spaCy21 and trained it on a manually annotated marker-related dataset curated by our team.
Specifically, a total of 62000 main text sentences were initially collected from approximately 900 single-cell RNA sequencing studies. Subsequently, over ten bioinformatics engineers with expertise in single-cell research manually screened these sentences, all of which contained both cells and genes, to isolate those describing cell marker genes. This step narrowed the 62000 initial sentences down to 27323 sentences. The 27323 sentences were then randomly shuffled and redistributed to the same bioinformatics engineers, who labeled them manually according to predefined rules (Table 1) and their personal expertise. Collation and review of the annotated sentences were conducted by two senior bioinformatics engineers. Any sentences with disputed annotations were discussed and, where necessary, re-annotated.
Text preprocessing of marker-related sentences
Text preprocessing has been a traditionally important step for NLP tasks. It transforms text into a more digestible form so that machine learning algorithms can perform better. Specifically, the 27323 sentences were initially input into the SciBERT model22, and the tokenizer and parser components of the SciBERT model were used for part-of-speech tagging and syntactic dependency parsing of the sentences. This process generated contiguous spans of tokens, including words, punctuation symbols, and whitespace. Non-gene entity tokens were subsequently lemmatized and converted to lowercase characters. Additionally, tokens classified as stop words, punctuation marks (excluding parentheses), or numerical values were filtered out. Lastly, based on prior knowledge, we selectively retained parentheses only when the token inside or preceding the parentheses constituted a gene name.
We performed text preprocessing on each sentence, and the cleaned sentences were exclusively utilized for training a text classification model.
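The filtering rules described above can be illustrated with a minimal sketch. It is not the SciBERT-based pipeline used in this study: the regular-expression tokenizer, the short stop-word list, and the three-gene vocabulary are hypothetical stand-ins, and lemmatization is omitted so that the example has no external dependencies.

```python
import re

# Hypothetical toy resources (assumptions for illustration only).
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "by", "and", "as", "in", "was", "were"}
GENE_VOCAB = {"CD3E", "GFAP", "PECAM1"}

# Words (optionally containing + or -) or single punctuation characters.
TOKEN_RE = re.compile(r"[A-Za-z0-9][A-Za-z0-9+\-]*|[^\sA-Za-z0-9]")


def _kept_parens(tokens, genes):
    """Indices of parentheses to keep: a pair survives only if a gene symbol
    sits inside it or immediately before the opening parenthesis."""
    keep, stack = set(), []
    for i, tok in enumerate(tokens):
        if tok == "(":
            stack.append(i)
        elif tok == ")" and stack:
            start = stack.pop()
            inside = tokens[start + 1:i]
            before = tokens[start - 1] if start > 0 else None
            if before in genes or any(t in genes for t in inside):
                keep.update((start, i))
    return keep


def preprocess(sentence, genes=GENE_VOCAB, stop_words=STOP_WORDS):
    """Lowercase non-gene tokens and drop stop words, numbers, and punctuation,
    keeping parentheses only around or after gene symbols."""
    tokens = TOKEN_RE.findall(sentence)
    keep_parens = _kept_parens(tokens, genes)
    cleaned = []
    for i, tok in enumerate(tokens):
        if tok in genes:
            cleaned.append(tok)  # gene symbols keep their original case
        elif tok in ("(", ")"):
            if i in keep_parens:
                cleaned.append(tok)
        elif tok.isdigit() or not tok[0].isalnum():
            continue  # numerical values and other punctuation
        elif tok.lower() not in stop_words:
            cleaned.append(tok.lower())
    return cleaned


example = preprocess("Astrocytes (GFAP) were identified in 3 samples (see Methods).")
```

In this sketch the parentheses around "GFAP" are retained because they enclose a gene symbol, whereas those around "see Methods" are dropped.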
Marker-related sentence classification model construction
We built the marker-related sentence classification model with the TextCategorizer (TextCat) component of the spaCy natural language processing library. The model combined a bag-of-words approach with a neural network; we set the "vectors" parameter to "en_core_web_trf", a transformer-based English pipeline, and kept the default values for the other parameters. Using the 27323 retained sentences as the training dataset, the trained model output a probability for each sentence, which we used to assess how credibly the sentence described a cell type and its marker genes.
To determine an appropriate probability threshold for distinguishing marker-related sentences, the original training dataset was evenly divided into 10 parts, ensuring a 1:1 ratio of label 0 and label 1 in each subset. Subsequently, a tenfold cross-validation approach was employed, where 9 parts of the original training set were used as a new training set to train the text classification model, while the remaining 1 part served as the validation set for evaluating the model’s performance. The sentences from the validation set were inputted into the model, yielding a predicted probability corresponding to the likelihood of the sentence being a marker-related sentence. We evaluated the precision and recall of the model at different probability thresholds and calculated the F1 score. Finally, based on the variations in F1 score under different thresholds, an appropriate threshold was selected.
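The threshold selection procedure can be sketched as follows, assuming that each validation fold is available as a pair of label and predicted-probability lists. The function names and the toy folds are illustrative and are not taken from the MarkerGeneBERT source code.

```python
def precision_recall_f1(labels, probs, threshold):
    """Precision, recall, and F1 when sentences with prob > threshold are called positive."""
    preds = [p > threshold for p in probs]
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def select_threshold(folds, thresholds):
    """Return the threshold with the highest mean F1 across validation folds.

    folds: list of (labels, probs) pairs, one per held-out fold.
    """
    def mean_f1(threshold):
        return sum(precision_recall_f1(y, p, threshold)[2] for y, p in folds) / len(folds)

    return max(thresholds, key=mean_f1)


# Toy validation folds (made-up numbers, for illustration only).
toy_folds = [
    ([1, 1, 0, 0], [0.90, 0.80, 0.60, 0.20]),
    ([1, 0, 1, 0], [0.85, 0.65, 0.75, 0.10]),
]
best = select_threshold(toy_folds, [0.5, 0.7, 0.9])
```

Averaging the F1 score over folds before taking the maximum is one reasonable reading of "based on the variations in F1 score under different thresholds"; other selection rules would fit the same description.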
Entity extraction
Named entity recognition (NER)
As shown in Table 2, each scispacy23 NER model was originally built to identify distinct entity types. Using the spaCy Python package, we integrated these four NER models with default parameters to extract cell, species, tissue, and disease entities. Optimizing text-to-token conversion was not a primary focus of our study; for instance, when extracting cell entities, we combined the results from several NER models rather than relying on the partial output of any single model, to improve the completeness of entity extraction. To increase processing speed, we disabled the "tagger", "parser", "attribute_ruler", and "lemmatizer" components within the NER models. On average, processing a single sentence required approximately 4 s, and the total runtime scaled linearly with the number of articles. Peak memory consumption during execution was approximately 21 GB.
Generation of gene vocabulary
The complete set of human and mouse protein-coding genes was obtained from the GTF file in Cell Ranger v5.0.1, and gene entities were extracted using exact string matching.
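A minimal sketch of this dictionary-based step is given below. It assumes a GENCODE-style GTF in which gene records carry `gene_type "protein_coding"` and `gene_name` attributes; the toy records, the helper names, and the boundary rule (a symbol must not be flanked by letters or digits) are assumptions made for illustration.

```python
import re


def load_gene_symbols(gtf_lines):
    """Collect gene_name values from protein-coding 'gene' records of a GTF file."""
    symbols = set()
    for line in gtf_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "gene":
            continue
        if '"protein_coding"' not in fields[8]:
            continue
        match = re.search(r'gene_name "([^"]+)"', fields[8])
        if match:
            symbols.add(match.group(1))
    return symbols


def build_gene_matcher(symbols):
    """Compile a case-sensitive exact matcher; expects a non-empty symbol set."""
    # Longest symbols first so that "HLA-DRA" is preferred over a shorter prefix.
    alternatives = "|".join(re.escape(s) for s in sorted(symbols, key=len, reverse=True))
    return re.compile(rf"(?<![A-Za-z0-9])(?:{alternatives})(?![A-Za-z0-9])")


# Toy GTF records (coordinates and identifiers are made up).
toy_gtf = [
    "#!genome-build toy",
    'chr1\ttoy\tgene\t1\t100\t.\t+\t.\tgene_id "G1"; gene_type "protein_coding"; gene_name "GFAP";',
    'chr1\ttoy\tgene\t200\t300\t.\t+\t.\tgene_id "G2"; gene_type "lncRNA"; gene_name "MALAT1";',
    'chr6\ttoy\tgene\t1\t100\t.\t-\t.\tgene_id "G3"; gene_type "protein_coding"; gene_name "HLA-DRA";',
]
symbols = load_gene_symbols(toy_gtf)
matcher = build_gene_matcher(symbols)
found = matcher.findall("GFAP-positive astrocytes lacked HLA-DRA and MALAT1.")
```

Matching is case sensitive in this sketch, which avoids confusing common English words with gene symbols that share their spelling.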
Cell entity recognition
First, each sentence was parsed, and cell names were extracted using three NER models independently (Fig. 2). Specifically, the "en_ner_craft_md" model identified entities with the entity type "CL" as cell names, the "en_ner_jnlpba_md" model recognized entities with the entity types "CELL_TYPE" and "CELL_LINE" as cell names, and the "en_ner_bionlp13cg_md" model identified entities with the entity type "CELL" as cell names. Subsequently, we performed exact string matching on the same sentence using the comprehensive cell names obtained from the cell ontology database24. Finally, we retained the cell names that were extracted by at least two sources as the cell names present in the respective sentences.
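The two-source agreement rule can be sketched as a simple vote. The source labels and example names below are hypothetical, and lowercasing before comparison is an assumption rather than a documented step of the pipeline.

```python
from collections import Counter


def vote_cell_names(candidates_by_source, min_sources=2):
    """Keep cell names proposed by at least `min_sources` independent sources.

    candidates_by_source: mapping from a source label (an NER model or the
    Cell Ontology dictionary) to the cell names it found in one sentence.
    """
    votes = Counter()
    for names in candidates_by_source.values():
        # Each source votes at most once per name.
        votes.update({name.lower() for name in names})
    return {name for name, count in votes.items() if count >= min_sources}


# Toy output for one sentence (made-up values).
toy_candidates = {
    "en_ner_craft_md": ["T cell", "macrophage"],
    "en_ner_jnlpba_md": ["T cell"],
    "en_ner_bionlp13cg_md": ["macrophage", "tumor"],
    "cell_ontology": ["T cell", "neuron"],
}
kept = vote_cell_names(toy_candidates)
```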
To mitigate incomplete capture of cell names, as in "CD4 + T cells", where the three models may extract different cell entities, we compared and completed the cell names identified by different models at the same position within the text. For example, when two models extracted "CD4 + T cells" and "T cells" as cell entities at the same position, we completed "T cells" to "CD4 + T cells."
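One way to implement this completion step is to pool the character spans returned by all models and, wherever spans overlap, keep the longest one. The sketch below assumes entities are available as (start, end, text) tuples; the helper `span_of` is hypothetical and exists only to build the toy input.

```python
def complete_overlapping(spans):
    """Resolve overlapping entity spans by keeping the longest mention.

    spans: iterable of (start, end, text) tuples pooled from all models.
    """
    merged = []
    # Sort by start position; for equal starts, the longest span comes first.
    for span in sorted(spans, key=lambda s: (s[0], -(s[1] - s[0]))):
        if merged and span[0] < merged[-1][1]:
            # Overlaps the previously kept span: prefer the longer of the two.
            if span[1] - span[0] > merged[-1][1] - merged[-1][0]:
                merged[-1] = span
        else:
            merged.append(span)
    return merged


def span_of(text, sentence):
    """Hypothetical helper: locate the first occurrence of `text` in `sentence`."""
    start = sentence.index(text)
    return (start, start + len(text), text)


sentence = "CD4 + T cells were enriched near macrophages."
pooled = [
    span_of("T cells", sentence),        # shorter mention from one model
    span_of("CD4 + T cells", sentence),  # full mention from another model
    span_of("macrophages", sentence),
]
completed = [text for _, _, text in complete_overlapping(pooled)]
```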
A full-text-based strategy for extracting species and tissue entities
We employed a full-text-based strategy in which the literature was divided into sections such as abstracts, methods, and results, and entity recognition was performed using NER models on each section, followed by comprehensive analysis and judgment.
Species entity recognition. The extraction of species entities primarily relied on MeSH (Medical Subject Headings) terms, the controlled vocabulary thesaurus used by the National Library of Medicine (NLM) for indexing articles in PubMed. For each study, we utilized the "en_ner_craft_md" model to identify species entities from the MeSH terms. If no species entities were identified from the MeSH term text provided by PubMed, we further performed species entity recognition based on the overall structure of the full text. Specifically, we employed the "en_ner_craft_md" model to identify species entities separately from the text in the title section, the Methods section, and the first paragraph of the Results section. The most frequently occurring species entity was selected as the species type studied in the respective literature.
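The fallback logic can be sketched as follows, assuming the species mentions have already been extracted by an NER model. Applying the majority rule to MeSH-derived mentions as well, and lowercasing before counting, are assumptions of this sketch.

```python
from collections import Counter


def infer_species(mesh_entities, section_entities):
    """Pick the study species from MeSH terms, falling back to full-text sections.

    mesh_entities: species mentions recognized in the MeSH term text.
    section_entities: mapping from section name (title, methods, first Results
    paragraph) to the species mentions recognized in that section.
    """
    if mesh_entities:
        pool = list(mesh_entities)
    else:
        pool = [entity for entities in section_entities.values() for entity in entities]
    if not pool:
        return None  # no species could be inferred
    return Counter(entity.lower() for entity in pool).most_common(1)[0][0]


from_mesh = infer_species(["Mice"], {"title": ["human"]})
from_text = infer_species([], {"title": ["mouse"], "methods": ["mouse", "human"], "results": ["mouse"]})
```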
Tissue entity recognition. We utilized the "en_ner_bionlp13cg_md" model for the recognition of tissue entities. Specifically, for each study, we identified tissue entities separately from the MeSH term text, the text in the title section, and the sentences within the full text that contained keywords related to single-cell sequencing, such as "single-cell" and "dissociation". If a tissue entity was identified in all three text sections of the literature, it was considered the correct tissue type. Otherwise, we supplemented the recognition of tissue entities by analyzing the text in the first paragraph of the Results section and the Methods section of the article. We calculated the frequency of each entity extracted from different text sources and ranked them accordingly. Additionally, we determined the frequency of co-occurrence between each entity and keywords related to single-cell sequencing in the same sentence, as well as the frequency of co-occurrence between each entity and all cell entities identified in the literature. The top two tissue types based on the cumulative ranks from these three ranking results were considered candidate tissue types according to the literature.
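The rank aggregation used for candidate tissues can be sketched as below. The three count tables are assumed to have been computed beforehand; the dense ranking with shared ranks for ties and the alphabetical tie-break are choices made for this illustration and are not specified in the text.

```python
def dense_rank(counts):
    """Rank entities by count (1 = highest); tied counts share a rank."""
    ordered = sorted(set(counts.values()), reverse=True)
    position = {value: i + 1 for i, value in enumerate(ordered)}
    return {entity: position[value] for entity, value in counts.items()}


def candidate_tissues(frequency, keyword_cooccurrence, cell_cooccurrence, top_n=2):
    """Return the top_n tissues by the sum of their ranks in three count tables."""
    tables = (frequency, keyword_cooccurrence, cell_cooccurrence)
    tissues = set().union(*tables)
    ranks = [dense_rank({t: table.get(t, 0) for t in tissues}) for table in tables]
    total = {t: sum(r[t] for r in ranks) for t in tissues}
    # Lower cumulative rank is better; names break ties deterministically.
    return sorted(tissues, key=lambda t: (total[t], t))[:top_n]


# Toy counts for one article (made-up numbers).
top_two = candidate_tissues(
    frequency={"brain": 12, "hippocampus": 9, "blood": 3},
    keyword_cooccurrence={"hippocampus": 5, "brain": 4},
    cell_cooccurrence={"brain": 7, "hippocampus": 7, "blood": 1},
)
```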
Disease entity recognition
We employed the "en_ner_bc5cdr_md" model to identify disease entities from the title section, which were considered the disease types studied in the respective literature. If no disease entities were detected, the literature was assumed to be "normal" by default.
Cell type–gene relation classification
To extract cell marker genes from marker-related sentences, we initiated the process by retaining sentences that concurrently contained both cell and gene names, as identified through entity recognition. Subsequently, these sentences underwent text preprocessing before being input into a text classification model. Upon surpassing a predetermined probability threshold, the original sentences were further classified into two types: those conducive to extracting cell-gene relationship pairs based on predefined rules and those necessitating manual extraction of cell-gene relationship pairs (Table 3).
For sentences meeting the criteria for extracting cell-gene relationship pairs based on predefined rules, the tagger and parser components of the SciBERT model were employed to parse the syntactic structure of the sentences and generate a syntactic dependency tree (Fig. 3). Each syntactic dependency tree consisted of numerous subtrees, where a subtree was defined as a sequence encompassing a token and all its syntactic descendants. This process delineated the relationships between tokens, allowing for the extraction of cell-gene relationship pairs located within the same subtree. In addition, sentence structures that conformed to the pattern "cell name (gene name)" were directly selected for the extraction of cell–gene pairs.
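The subtree-based pairing can be illustrated with a small, dependency-free sketch. Because every token belongs to the subtree of the sentence root, the sketch assumes that each gene is paired with the cell names found in the smallest subtree that contains both; this reading, the head-index encoding (the root points to itself), and the hand-written parse are assumptions and not the exact rules of Table 3.

```python
def subtree(index, heads):
    """Indices of a token and all of its syntactic descendants."""
    children = {}
    for i, head in enumerate(heads):
        if head != i:  # the root is encoded as its own head
            children.setdefault(head, []).append(i)
    stack, seen = [index], []
    while stack:
        node = stack.pop()
        seen.append(node)
        stack.extend(children.get(node, []))
    return sorted(seen)


def cell_gene_pairs(tokens, heads, labels):
    """Pair each gene with the cells in the smallest subtree containing both."""
    pairs = []
    for g, label in enumerate(labels):
        if label != "GENE":
            continue
        node = g
        while True:
            cells = [i for i in subtree(node, heads) if labels[i] == "CELL"]
            if cells or heads[node] == node:
                break
            node = heads[node]  # climb one level towards the root
        pairs.extend((tokens[c], tokens[g]) for c in cells)
    return pairs


# Hand-written toy parse (not the output of a real parser).
tokens = ["Microglia", "expressing", "P2RY12", "were", "abundant", ",", "while",
          "astrocytes", "marked", "by", "GFAP", "were", "rare", "."]
heads = [3, 0, 1, 3, 3, 3, 11, 11, 7, 8, 9, 3, 11, 3]
labels = ["CELL", None, "GENE", None, None, None, None,
          "CELL", None, None, "GENE", None, None, None]
pairs = cell_gene_pairs(tokens, heads, labels)
```

Under this assumption, coordinated cell names attached to one another in the parse can still yield spurious pairs, which is one reason why some sentences require manual extraction.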
Statistics
All the statistical analyses were performed in R (version 4.1). The performance of the marker-related sentence classification model was evaluated using the precision, recall, and F1 score of the predicted entity tag compared to the ground truth labels.
Results
Identification of gene and cell entities using MarkerGeneBERT
Pretrained NER models for entity extraction have proven to be effective in various research fields. MarkerGeneBERT integrates three pretrained NER models based on diverse biomedical corpora. Additionally, we incorporated cell names curated from the Cell Ontology database for exact string matching. Because gene names are standardized, MarkerGeneBERT recognized gene entities using only the gene symbols sourced from the GTF file in Cell Ranger. Further details can be found in the Methods section.
As detailed in the Methods section, 27323 sentences, initially labeled with cell and gene names and manually annotated by our team for the marker-related sentence classification model, were used to validate the performance of "en_ner_bionlp13cg_md", "en_ner_craft_md", "en_ner_jnlpba_md", and MarkerGeneBERT in identifying cell and gene entities. Compared to the three pretrained NER models used individually, MarkerGeneBERT demonstrated higher precision and recall in the extraction of cell and gene names (Table 4). Specifically, for gene name identification, MarkerGeneBERT achieved an F1 score of 87% (precision: 89%, recall: 99%), surpassing the second-best model by 20%. In terms of cell name identification, MarkerGeneBERT obtained an F1 score of 92% (precision: 86%, recall: 98%), outperforming the second-best model by 8%, thus representing the optimal trade-off between precision and recall.
Cell–biomarker associative binary classification
We introduced a supervised marker-related text classification model to determine which sentences included not only cell entities and gene entities but also specific syntactic patterns indicating that a gene is a marker of a cell. More details about the model and training dataset construction process are available in the Methods section.
To evaluate the performance of the marker-related text classification model in distinguishing specific syntactic patterns indicating that genes are markers of cells, we partitioned the training dataset into 10 subsets, randomly selecting 9 subsets for model training and reserving one subset for validation. The evaluation results depicted in Fig. 4A demonstrated a mean average precision (mAP) of 0.876 (ranging from 0.84 to 0.91), a mean precision of 0.844 (ranging from 0.8 to 0.9), and a mean recall of 0.734 (ranging from 0.56 to 0.78).
The model assigned each sentence a predicted probability value. A sentence was classified as a marker-related sentence if the predicted probability value was greater than the threshold, so the threshold setting strongly affected the performance of our model. We calculated the F1 score for different thresholds, as illustrated in Fig. 4B, and selected a threshold of 0.7, at which the F1 score was highest across the different validation sets.
For the remaining marker-related sentences whose predicted probability was greater than 0.7, we employed syntactic structure-based analysis within each sentence to identify and extract reliable cell-marker relationship pairs. The extraction criteria are described in detail in the Methods section.
In addition, we employed an appropriate NER model, as shown in Table 2, to assess the species, organs, and disease information in each study. Further details are provided in the Methods section.
Statistics of the NLP system extraction results
We employed MarkerGeneBERT to extract 3280 cell types and 16124 genes from 3702 literature sources (Supplemental Table 1). Compared to existing databases manually curated by domain experts over the years, our model achieved competitive retrieval results (Table 5). The peak memory usage of our system, including all the scripts and models, was 21 GB, and the parsing and entity extraction of one paper could be completed in 7 min.
Concordance between MarkerGeneBERT and manually curated databases
To validate the accuracy of the system for detecting cell entities, gene entities, cell-marker pairs, species, tissue, and disease information, we conducted a comparison with CellMarker2.0, widely recognized as the gold standard for manual curation. As our methodology chiefly extracted gene markers from main text, we specifically compared gene markers from 1027 articles present in both CellMarker2.0 and our database. Other articles were excluded due to reasons such as unavailability for download or because the markers were sourced from supplemental materials; additional details are available in Supplemental Fig. 1.
The MarkerGeneBERT identifies most cell and gene entities recorded in databases
In these 1027 studies, CellMarker2.0 manually curated a total of 4646 cell types with 12,874 marker genes, of which the main text covered 3185 cell types and 8683 marker genes; approximately 84% of the valuable information was derived from the main text (Supplemental Fig. 2). MarkerGeneBERT identified 90.8% of the marker gene entities (7890/8683) and 92.7% of the cell type entities (2954/3185) in these common studies (Fig. 5A).
Through a systematic comparison of the results extracted from each literature source with those of CellMarker2.0, MarkerGeneBERT revealed an additional 1764 cell types associated with the marker genes (Fig. 5B). Among the 1764 newly identified cell types, 1344 were initially excluded by CellMarker2.0 in the corresponding literature; however, these were reported in other studies of the same tissue.
It is noteworthy that 89 cell types were not cataloged in CellMarker2.0, primarily comprising tissue-specific cell types. These cells, including enteric mesothelial fibroblasts from the intestine and retinal progenitor cells from ocular tissue, exhibited low frequencies.
Additionally, 302 cell types were recorded in CellMarker2.0 but not for the corresponding tissues. We categorized these 89 newly recorded cell types and 302 reported cell types according to their tissue information (Fig. 6). These cell types primarily represent functional cells distributed across different tissues. For instance, in the literature related to human gastric tissue, cancer-associated fibroblasts (CAFs), as central components of the tumor microenvironment in primary and metastatic tumors, profoundly influence the behavior of cancer cells and are involved in cancer progression through extensive interactions with cancer cells and other stromal cells25. Our method can directly record CAFs in both cancer and gastric tissues. The detailed cell marker information is available in Supplemental Table 2, and the additional cell types and marker genes identified by MarkerGeneBERT have been manually reviewed.
High consistency of the marker gene list between the MarkerGeneBERT and the database
For each study, we assessed the consistency of the cell marker genes identified between CellMarker2.0 and MarkerGeneBERT. As illustrated in Fig. 7, approximately 47% of the cell types and their corresponding marker gene pairs were the same in the CellMarker2.0 database and MarkerGeneBERT. Additionally, for approximately 23% of the cell types, all the marker genes extracted by MarkerGeneBERT were present in CellMarker2.0, and they accounted for 87% of the corresponding marker genes recorded in CellMarker2.0. Completeness fell short of 100% primarily because certain cell types have multiple marker genes recorded within a single document, and MarkerGeneBERT may have filtered out some of these marker genes based on preset conditions (Supplemental Fig. 3). Nevertheless, most such cell markers showed a high level of precision, often reaching 100%. Overall, MarkerGeneBERT exhibited a high percentage of true positives, and there was a high level of consistency between the results extracted by MarkerGeneBERT and those recorded in the CellMarker2.0 database.
The completeness and accuracy of marker extraction for each cell in every study. Accuracy: for a given cell in a specific article, the number of markers both recorded in the CellMarker2.0 database and extracted by the NLP-based model, divided by the number of markers extracted by the model. Completeness: for a given cell in a specific article, the number of markers both recorded in the CellMarker2.0 database and extracted by the model, divided by the number of markers recorded in the CellMarker2.0 database
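These two definitions reduce to set operations on the marker lists of one cell type in one article. The sketch below uses made-up marker sets, and returning 0.0 when a set is empty is a convention chosen for the example.

```python
def accuracy_completeness(extracted, curated):
    """Accuracy and completeness of extracted markers for one cell type in one article.

    extracted: marker genes extracted by the model.
    curated: marker genes recorded in the manually curated database.
    """
    extracted, curated = set(extracted), set(curated)
    shared = extracted & curated
    accuracy = len(shared) / len(extracted) if extracted else 0.0
    completeness = len(shared) / len(curated) if curated else 0.0
    return accuracy, completeness


# Toy marker sets for a single cell type (illustrative only).
acc, comp = accuracy_completeness(
    extracted={"CD3D", "CD3E", "CD4", "IL7R"},
    curated={"CD3D", "CD3E", "CD8A"},
)
```

Here two of the four extracted markers are also curated, giving an accuracy of 0.5, and two of the three curated markers were recovered, giving a completeness of about 0.67.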
Additionally, for approximately 13% of the cells, MarkerGeneBERT recovered all of the marker genes reported in CellMarker2.0 and, on average, obtained 25% more marker genes that were not recorded by CellMarker2.0. We traced some of the newly discovered marker genes back to the original text and found that CellMarker2.0 may have ignored marker genes inconsistent with the main research themes of the paper, or may have extracted only the first half of the information in a passage while ignoring the second half.
Consistency of species, tissue, and disease
We compared the consistency of species, tissue and disease information extracted from 1540 studies between the NLP system and CellMarker2.0. Overall, the consistency rates were 75% for species information, 77% for tissue information, and 66% for disease information (Fig. 8).
The primary reason for the lower-than-expected consistency stemmed from our emphasis on organizing and analyzing information extracted from the full texts of the specific studies, summarizing the main species, tissues, and disease types studied. In contrast, the CellMarker2.0 database uses literature IDs as indices to trace cell markers referenced from other literature sources, capturing the associated species, tissue, and disease information from both reference and specific literature. Consequently, there is variance in the information recorded by these two methods in the same study.
Increased cell type annotation efficiency through multi-marker annotation strategies
MarkerGeneBERT collected 166 brain cell types from approximately 190 studies, including some cell types not previously cataloged in CellMarker2.0, such as tissue-resident memory T cells, neuroblasts, and myeloid-derived suppressor cells (Supplemental Table 3). We applied these 166 brain cell types and their compiled marker gene lists to published posterior hippocampus single-cell RNA data for cell type annotation using scCATCH, a cell type annotation tool based on a preset marker gene list. As illustrated in Fig. 9A, the cell type annotations obtained directly by scCATCH were almost the same as the labels in the original paper26. Notably, among the top 5 differentially expressed genes (DEGs) identified by scCATCH for cell type annotation, seven were newly discovered in our database and not recorded in CellMarker2.0 (Fig. 9B). This indicates that while many cell types possess representative marker genes, such as the CD3 marker for immune cells, which is mentioned and used in numerous articles, a more comprehensive list of marker genes can enhance the annotation efficiency of automated cell type annotation methods or tools.
Consistency between the cell type annotation results of single-cell sequencing in hippocampus tissue and the original annotation results. A Cell type annotation results of single-cell sequencing in hippocampus tissue. B The seven top DEGs used for scCATCH cell type annotation were newly extracted by MarkerGeneBERT
Discussion
In the coming years, single-cell sequencing technology is expected to be applied to a wider range of species and tissue types. This necessitates that researchers possess effective capabilities for annotating and analyzing such data. Although there are several manually curated marker gene databases and methods for constructing corresponding databases, manual curation remains prohibitively time-consuming and introduces potential biases, particularly when confronted with complex cell type annotations.
NLP models for extracting entity relationships from text have been extensively applied across diverse domains. Doddahonnaiah et al. used a co-occurrence approach, calculating the co-occurrence frequencies of 500 pre-selected cell names and common genes in biomedical documents and thereby inferring cell marker genes15. However, as discussed in their work and as mentioned in a related study13, the co-occurrence approach has several limitations: a pair of entities with a low co-occurrence frequency can be reliable but may go undetected. Additionally, the pre-selection of cell names significantly constrains the scalability of the method. In contrast to co-occurrence methods, we optimized a dependency analysis method based on NLP models to capture all mentioned cell names within the text and extract cell–gene pairs that are related in grammatical structure. Additionally, using manually annotated marker-related sentences created by our team, we developed a marker-related text classification model to ensure that sentences containing cell–gene pairs are inherently marker related, thereby filtering out cell–gene pairs that merely co-occur within a sentence.
This study has several limitations. First, we utilized only the main text of the literature for extracting cell marker genes; however, in some related studies, certain cell markers are presented in the figures or Supplemental materials. Second, there is no standard tissue nomenclature; for example, "ascites" is frequently mentioned in single-cell studies but is not considered a tissue. We currently report two candidate tissues per article to meet the need for automated extraction of tissue information as far as possible. Unlike the recognition of cell and gene entities, correct tissue information typically relies on a comprehensive understanding of the entire text. Although we attempted to extract tissue information separately from the titles, abstracts, results, and methods sections, we relied on the frequency of different tissues to select the appropriate tissue; in some cases, the tissues extracted were still erroneous, so the accuracy of extracting tissue information was not particularly high. Third, to balance computational resources and time, we analyzed only the human and mouse scRNA-seq literature. Because gene names differ between species, the existing data preparation for human and mouse cannot be directly applied to new species. Extracting cell marker genes from the literature on other species would require organizing and constructing a new gene dictionary. Finally, although the most apparent advantage of an NLP system is its ability to swiftly extract sentences containing key terms (cell and gene) and determine their associative relationships, there is no unified name or classification system for cells, and cell names are largely influenced by personal writing habits, such as T cells, t cell, B cells, CD4 + T cells, cd4 t cells, and CD8 + T cells.
For better construction of cell marker databases, classifying different cell names according to the true cell type remains an enormous challenge for both human manual and NLP automated methods.
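The naming problem can be made concrete with a small Python sketch that applies purely surface-level normalization to the example names above. The normalization rules and function names are assumptions chosen for illustration; they are not the procedure used in this study, and they do not map names onto true cell types.

```python
import re
from collections import defaultdict


def normalize_cell_name(name: str) -> str:
    """Apply surface-level normalization (hypothetical rules)."""
    s = name.strip().lower()
    s = re.sub(r"\s*\+\s*", "+ ", s)      # "cd4 + t cells" -> "cd4+ t cells"
    s = re.sub(r"\s+", " ", s).strip()    # collapse repeated whitespace
    s = re.sub(r"\bcells\b", "cell", s)   # plural -> singular
    return s


def group_cell_names(names):
    """Group raw names by their normalized form, keeping input order."""
    groups = defaultdict(list)
    for name in names:
        groups[normalize_cell_name(name)].append(name)
    return dict(groups)


RAW_NAMES = ["T cells", "t cell", "B cells",
             "CD4 + T cells", "cd4 t cells", "CD8 + T cells"]
GROUPS = group_cell_names(RAW_NAMES)
```

In this sketch, "T cells" and "t cell" collapse into one group, but "CD4 + T cells" and "cd4 t cells" remain separate because the "+" sign is kept. Deciding whether such variants denote the same cell type requires biological knowledge rather than string rules, which is the challenge described above.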
When comparing the cell markers from MarkerGeneBERT and CellMarker2.0, we found 802 cell types recorded in CellMarker2.0 whose texts were also detected by MarkerGeneBERT but were classified as non-marker-related because the model assigned them significantly lower predicted probabilities (Supplemental Fig. 4). In addition, the texts of 242 other cell types met the model threshold; however, the complex grammatical structures of these texts currently challenge our methods for identifying cell–marker pairs within them.
singleCellBase was previously published by our team9; all 913 cell types, their corresponding marker genes, the supporting evidence for their relationships, and the 618 source single-cell analysis publications were manually collected, extracted, and standardized. Because gene markers and cell names were standardized and renamed within singleCellBase, some differences from the original texts might exist. We compared the outputs of MarkerGeneBERT with the cell–marker gene pairs recorded in singleCellBase across the 618 publications on a one-to-one, per-publication basis. The assessment revealed a completeness of 71% and an accuracy of 75%, slightly below the results obtained in the comparison with CellMarker2.0.
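As a hedged illustration of how such a per-publication comparison could be scored, the Python sketch below treats completeness as the fraction of curated pairs that were recovered and accuracy as the fraction of extracted pairs that match the curated set. These definitions are assumptions made for this sketch, and the pairs are invented toy data rather than results from this study.

```python
def completeness_and_accuracy(extracted, curated):
    """Score extracted (cell, gene) pairs against a curated set.

    Assumed definitions for this sketch:
      completeness = matched pairs / curated pairs
      accuracy     = matched pairs / extracted pairs
    """
    matched = extracted & curated
    completeness = len(matched) / len(curated) if curated else 0.0
    accuracy = len(matched) / len(extracted) if extracted else 0.0
    return completeness, accuracy


# Toy pairs for a single hypothetical publication.
CURATED = {
    ("t cell", "CD3E"),
    ("b cell", "CD79A"),
    ("b cell", "MS4A1"),
    ("nk cell", "NKG7"),
}
EXTRACTED = {
    ("t cell", "CD3E"),
    ("b cell", "CD79A"),
    ("b cell", "MS4A1"),
    ("t cell", "GAPDH"),
    ("b cell", "ACTB"),
}

COMPLETENESS, ACCURACY = completeness_and_accuracy(EXTRACTED, CURATED)
```

Here three of the four curated pairs are recovered and three of the five extracted pairs are correct. Because both scores depend on exact string matches, any renaming of cells or genes in the curated database lowers them, even when the underlying extraction is correct.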
Compared with manual curation, the MarkerGeneBERT model has reached a usable level and can significantly accelerate the extraction of cell marker information. Subsequent manual review of the results would, of course, improve them further. Our future goal is to incorporate a more diverse training dataset for the marker-related text classification model so that it can screen a wider variety of marker-related texts.
Conclusion
We developed an NLP-based text mining system named MarkerGeneBERT to identify cells and marker genes from both the text and tables in the literature sourced from PubMed and PubMed Central. Using manually annotated marker-related sentences, we constructed a supervised text classification model to initially screen texts containing both gene and cell names; cells and marker genes were then extracted from those texts according to marker-related patterns. Our cell marker gene identification pipeline achieved 75% accuracy and 76% completeness when compared with CellMarker2.0, demonstrating that NLP text mining can extract cell marker genes effectively.
Data availability
All data generated or analyzed during this study are included in this published article and its supplementary information files. MarkerGeneBERT and its source code are hosted at https://github.com/chengpeng1116/MarkerGeneBERT.
References
Jovic, D. et al. Single-cell RNA sequencing technologies and applications: A brief overview. Clin. Transl. Med. 12(3), e694 (2022).
Zhang, A. W. et al. Probabilistic cell-type assignment of single-cell RNA-seq for tumor microenvironment profiling. Nat. Methods 16(10), 1007–1015 (2019).
Shao, X. et al. scCATCH: Automatic annotation on cell types of clusters from single-cell RNA sequencing data. iScience 23(3), 100882 (2020).
Aran, D. et al. Reference-based analysis of lung single-cell sequencing reveals a transitional profibrotic macrophage. Nat. Immunol. 20(2), 163–172 (2019).
Pliner, H. A., Shendure, J. & Trapnell, C. Supervised classification enables rapid annotation of cell atlases. Nat. Methods 16(10), 983–986 (2019).
Cao, Z. J., Wei, L., Lu, S., Yang, D. C. & Gao, G. Searching large-scale scRNA-seq databases via unbiased cell embedding with cell BLAST. Nat. Commun. 11(1), 3458 (2020).
Hu, C. et al. Cell marker 2.0: An updated database of manually curated cell markers in human/mouse and web tools based on scRNA-seq data. Nucleic Acids Res. 51(D1), D870–D876 (2023).
Franzen, O., Gan, L. M. & Bjorkegren, J. L. M. PanglaoDB: A web server for exploration of mouse and human single-cell RNA sequencing data. Database (Oxford) 2019, (2019).
Meng, F. L. et al. singleCellBase: A high-quality manually curated database of cell markers for single cell annotation across multiple species. Biomark. Res. 11(1), 83 (2023).
Jin, J. et al. PCMDB: A curated and comprehensive resource of plant cell markers. Nucleic Acids Res. 50(D1), D1448–D1455 (2022).
Yuan, H. et al. CancerSEA: A cancer single-cell state atlas. Nucleic Acids Res. 47(D1), D900–D908 (2019).
Shetty, P. et al. A general-purpose material property data extraction pipeline from large polymer corpora using natural language processing. NPJ. Comput. Mater. 9(1), 52 (2023).
Gu, W. et al. MarkerGenie: An NLP-enabled text-mining system for biomedical entity relation extraction. Bioinform. Adv. 2(1), vbac035 (2022).
Naseri, H. et al. Development of a generalizable natural language processing pipeline to extract physician-reported pain from clinical reports: Generated using publicly-available datasets and tested on institutional clinical reports for cancer patients with bone metastases. J. Biomed. Inform. 120, 103864 (2021).
Doddahonnaiah, D. et al. A literature-derived knowledge graph augments the interpretation of single cell RNA-seq datasets. Genes (Basel) https://doi.org/10.3390/genes12060898 (2021).
Bada, M. et al. Concept annotation in the CRAFT corpus. BMC Bioinform. 13, 161 (2012).
Collier, N. & Kim, J.-D. Introduction to the bio-entity recognition task at JNLPBA. In: Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications (NLPBA/BioNLP), pp. 73–78 (2004).
Pyysalo, S. et al. Overview of the cancer genetics and pathway curation tasks of BioNLP shared task 2013. BMC Bioinform. 16(Suppl 10), S2 (2015).
Kovalchik, S. Download content from NCBI databases. R package version 4(0):2021 (2014).
Fantini, D. & Fantini, M. D. Package ‘easyPubMed’. In: CRAN (2017).
Honnibal, M. & Montani, I. spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. To appear 7(1):411–420 (2017).
Beltagy, I., Lo, K. & Cohan, A. SciBERT: A pretrained language model for scientific text. arXiv preprint (2019).
Neumann, M., King, D., Beltagy, I. & Ammar, W. ScispaCy: Fast and robust models for biomedical natural language processing. arXiv preprint (2019).
Diehl, A. D. et al. The cell ontology 2016: Enhanced content, modularization, and ontology interoperability. J. Biomed. Semantics 7(1), 44 (2016).
Yang, D., Liu, J., Qian, H. & Zhuang, Q. Cancer-associated fibroblasts: From basic science to anticancer therapy. Exp. Mol. Med. 55(7), 1322–1332 (2023).
Ayhan, F. et al. Resolving cellular and molecular diversity along the hippocampal anterior-to-posterior axis in humans. Neuron 109(13), 2091-2105 e2096 (2021).
Funding
Not applicable.
Author information
Contributions
Conceptualization, X.L.Z., P.C., Y.P. and Y.Z.L.; methodology, P.C. and X.L.Z.; formal analysis, P.C.; investigation, X.L.Z.; data curation, P.Y. and Y.Z.L.; writing—S.C. and P.C.; writing—review and editing, P.Y. and Y.Z.L.; funding acquisition, Y.P. and Y.Z.L. All authors read and approved the final manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher's note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Cheng, P., Peng, Y., Zhang, XL. et al. A natural language processing system for the efficient extraction of cell markers. Sci Rep 14, 21183 (2024). https://doi.org/10.1038/s41598-024-72204-6