Abstract
Natural language processing is utilized in a wide range of fields, where words in text are typically transformed into feature vectors called embeddings. BioConceptVec is a specific example of embeddings tailored for biology, trained on approximately 30 million PubMed abstracts using models such as skip-gram. Generally, word embeddings are known to solve analogy tasks through simple vector arithmetic. For example, subtracting the vector for man from that of king and then adding the vector for woman yields a point that lies closer to queen in the embedding space. In this study, we demonstrate that BioConceptVec embeddings, along with our own embeddings trained on PubMed abstracts, contain information about drug–gene relations and can predict target genes from a given drug through analogy computations. We also show that categorizing drugs and genes using biological pathways improves performance. Furthermore, we illustrate that vectors derived from known relations in the past can predict unknown future relations in datasets divided by year. Despite the simplicity of implementing analogy tasks as vector additions, our approach demonstrated performance comparable to that of large language models such as GPT-4 in predicting drug–gene relations.
Introduction
Natural language processing (NLP) is a computer technology for processing human language. NLP is used in various applications, such as machine translation1,2, sentiment analysis3,4, and sentence similarity computation5,6. Many of these applications use models such as skip-gram7,8 or BERT4 to convert words in text to embeddings or distributed representations, which are feature vectors with hundreds of dimensions. In particular, skip-gram is a method that learns high-performance embeddings by predicting the surrounding words of a word in a sentence. These embeddings form the foundational building blocks of recent advances in large language models, enabling models to understand and generate human language with unprecedented accuracy and coherence.
Mikolov et al.7 showed that skip-gram embeddings have properties such as \(\mathbf{u}_{king} - \mathbf{u}_{man } + \mathbf{u}_{woman }\approx \mathbf{u}_{queen },\) where the vector \(\mathbf{u}_w\) represents the embedding of word w. In the embedding space, the vector differences such as \(\mathbf{u}_{king } - \mathbf{u}_{man }\) and \(\mathbf{u}_{queen } - \mathbf{u}_{woman }\) can be seen as vectors representing the royalty relation. Since these relations are not explicitly taught during the training of skip-gram, the embeddings acquire these properties spontaneously. Solving questions such as “If man corresponds to king, what does woman correspond to?” requires understanding the relations between words. This type of problem is known as an analogy task and is used to evaluate a model’s language comprehension and reasoning skills. In this example, the model must understand the royalty relations from man to king and apply a similar relation to woman. Allen and Hospedales9 explained why analogy computation with such simple vector arithmetic can effectively solve analogy tasks.
Recently, NLP methods have also gained attention in the field of biology10,11,12. However, since traditional skip-gram models are typically trained on web corpora, they do not properly handle the specialized terms that appear in biological texts. In particular, when multiple words represent the same concept, these concepts should be normalized beforehand and represented by the same embedding. With this in mind, Chen et al.13 proposed a method to compute word embeddings of biological concepts using approximately 30 million PubMed abstracts in which the mentions of these concepts had been previously normalized using PubTator14. PubTator is an online tool that supports the automatic annotation of biomedical text and aims at extracting specific information efficiently. Specifically, it can identify the mentions of biological concepts and entities within text and classify them into appropriate categories (e.g., diseases, genes, drugs). Chen et al. trained four embedding models, including skip-gram, named the resulting embeddings BioConceptVec, and assessed their usefulness in two ways: intrinsic and extrinsic evaluations. For intrinsic evaluations, they identified related genes based on drug–gene and gene–gene interactions using cosine similarity of the embeddings. For extrinsic evaluations, they performed protein–protein interaction prediction and drug–drug interaction extraction using neural network classifiers with the embeddings. However, the performance of these embeddings on analogy tasks has yet to be explored.
In this study, we consider analogy tasks in biology using word embeddings. We trained skip-gram embeddings similarly to BioConceptVec, aiming to compare them with BioConceptVec embeddings and to develop embeddings from datasets divided by year in an extended experiment. To evaluate the performance of analogy computation, we focus on predicting target genes from a given drug. These target genes are associated with proteins that the drug interacts with, and these interactions are called drug-target interactions (DTIs)15,16. In this paper, we will refer to these connections as drug–gene relations. Our research aims to show that embeddings learned from biological text data contain information about drug–gene relations. In the example of \(\mathbf{u}_{king } - \mathbf{u}_{man } + \mathbf{u}_{woman } \approx \mathbf{u}_{queen }\), the vector differences \(\mathbf{u}_{king } - \mathbf{u}_{man }\) and \(\mathbf{u}_{queen } - \mathbf{u}_{woman }\) represent the royalty relation. However, there are multiple drug–gene pairs with drug–gene relations, and a single drug often has multiple target genes. Therefore, we calculate the vector differences between the embeddings of each drug and its target genes and average these vector differences to define a mean vector representing the relation.
To evaluate the performance in predicting drug–gene relations, we use data derived from KEGG17,18,19 as the ground truth. We first consider analogy tasks in the global setting where all drugs and genes are included and demonstrate high performance in solving these tasks. Next, to consider more detailed analogy tasks, we use information common to both drugs and genes. Specifically, we focus on biological pathways to categorize drugs and genes. To do this, we use a list of human pathways from the KEGG API. We group all drugs and genes that are associated with the same pathway into a single category and define vectors representing drug–gene relations for each pathway. In this pathway-wise setting, we demonstrate that using these vectors in the analogy computation can improve its performance. Finally, as an application of analogy computation, we investigate the potential of our approach to predict unknown drug–gene relations. For this purpose, we divide the vocabulary by year to distinguish between known and unknown drug–gene relations. We then train skip-gram embeddings using PubMed abstracts published before the specified year and redefine vectors representing the known relations to predict unknown relations. The experimental results show that our approach can predict unknown relations to a certain extent.
The structure of this paper is as follows. First, we explain related work and detail our approach in this study. In the following sections, we perform experiments for each setting and evaluate the performance using metrics such as top-1 accuracy. Finally, we discuss the prediction results and then conclude.
Related work
Analogy computation of word embeddings trained from corpora in a particular field of natural science can be used to predict the relations that exist between specialized concepts or terms. Tshitoyan et al.20 have shown that the relation between specialized concepts in the field of materials science such as “\(\text {ferromagnetism} - \text {NiFe} + \text {IrMn} \approx \text {antiferromagnetism}\)” can be correctly predicted using analogy computation of word embeddings trained with domain-specific literature. This success in the field of materials science demonstrates the potential of analogy computation of word embeddings to predict relationships between specialized concepts in scientific domains.
In the biomedical field, capturing the relations between biomedical concepts, such as drug–drug and protein–protein interactions, is an important issue. Word embeddings trained on biomedical domain corpora have been used to predict drug–drug interactions for new drugs21 and to construct networks of gene–gene interactions22.
While word embeddings have shown promise in various biomedical applications, their potential for predicting drug–gene relations remains relatively unexplored. To the best of our knowledge, there is limited research specifically focused on utilizing analogy computation of word embeddings to predict drug–gene relations in the biological domain. Our study aims to address this gap by investigating the application of analogy computation of word embeddings to drug–gene relation prediction.
In addition, neural networks have been used in a number of studies to predict these relations23,24,25,26,27. However, it is important to note that these relation extraction approaches differ from our proposed task. While they are useful for identifying explicitly stated relationships in text, our focus is on leveraging the latent semantic information captured by word embeddings to predict potential drug–gene relations, even when such relationships are not explicitly mentioned in the literature.
The technologies developed in NLP, not limited to word embeddings, have been applied to a wide variety of problems in the biomedical domain, demonstrating their utility in this field. Müller et al.28 have created an online literature search and curation system using an “ontology dictionary” obtained by text mining, and Friedman et al.29 have used syntactic parsing to extract and structure information about cellular pathways from biological literature. Yeganova et al.30 normalize synonyms in the biomedical literature by word embedding similarity. Furthermore, Du et al.31 applied the word embedding algorithm to gene co-expression across the transcriptome to compute vector representations of genes. In recent years, there has also been intensive research on fine-tuning BERT4 and other pre-trained language models to the biomedical domain10,32,33.
Methods
Analogy tasks
This section illustrates a basic analogy task, a problem setting commonly used in benchmark tests to evaluate the performance of word embedding models. We define \(\mathscr {V}\) as the vocabulary set, and \(\mathbf{u}_w \in \mathbb {R}^K\) as the word embedding of word \(w \in \mathscr {V}\). In skip-gram models, the “closeness of word embeddings” measured by cosine similarity correlates well with the “closeness of word meanings”7. For example, in word2vec7, \(\cos (\mathbf{u}_{media },\mathbf{u}_{press })=0.601>\cos (\mathbf{u}_{media },\mathbf{u}_{car })=0.020\). This result shows that the embedding of media is closer to the embedding of press than to the embedding of car. This is consistent with the rational human interpretation that media is semantically closer to press than to car. The task for predicting relations between words is known as an analogy task, and word embeddings can effectively solve analogy tasks using vector arithmetic7. For example, to solve the question “If man corresponds to king, what does woman correspond to?”, using pre-trained skip-gram embeddings, we search for \(w\in \mathscr {V}\) that maximizes \(\cos (\mathbf{u}_{king } - \mathbf{u}_{man } + \mathbf{u}_{woman },\mathbf{u}_w)\), and find \(w={queen }\), indicating
\(\mathbf{u}_{king} - \mathbf{u}_{man} + \mathbf{u}_{woman} \approx \mathbf{u}_{queen}. \qquad (1)\)
This analogy computation is illustrated in Fig. 1a. Here, the vector \(\mathbf{v}_{royalty }\) represents the royalty relation between man and king, and is defined as the vector difference between \(\mathbf{u}_{king }\) and \(\mathbf{u}_{man }\):
\(\mathbf{v}_{royalty} := \mathbf{u}_{king} - \mathbf{u}_{man}. \qquad (2)\)
Adding the royalty relation vector \(\mathbf{v}_{royalty }\) to \(\mathbf{u}_{woman }\) then yields \(\mathbf{u}_{queen }\):
\(\mathbf{u}_{woman} + \mathbf{v}_{royalty} \approx \mathbf{u}_{queen}. \qquad (3)\)
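As a concrete illustration of this computation, the short sketch below ranks the vocabulary by cosine similarity to \(\mathbf{u}_{king} - \mathbf{u}_{man} + \mathbf{u}_{woman}\) using gensim; the file name and the exact word keys are placeholders, not the embeddings used in this study.

```python
# A minimal sketch of the analogy computation, assuming a pre-trained
# word2vec/skip-gram model saved in the word2vec binary format
# (the file name "embeddings.bin" is a placeholder).
import numpy as np
from gensim.models import KeyedVectors

kv = KeyedVectors.load_word2vec_format("embeddings.bin", binary=True)

def analogy(a: str, b: str, c: str, topn: int = 5):
    """Rank words w by cos(u_b - u_a + u_c, u_w)."""
    query = kv[b] - kv[a] + kv[c]
    query /= np.linalg.norm(query)
    # similar_by_vector ranks the whole vocabulary by cosine similarity.
    return kv.similar_by_vector(query, topn=topn)

print(analogy("man", "king", "woman"))  # "queen" is expected to rank near the top
```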
Analogy computations by adding the relation vector. (a) Example of a basic analogy task. The question “If man corresponds to king, what does woman correspond to?” is solved by adding the relation vector \(\mathbf{v}_{royalty }\) to \(\mathbf{u}_{woman }\). (b) Example of an analogy task in setting G. The target genes of the drug d are \(g_1\), \(g_2\), and \(g_3\), predicted by adding the relation estimator \(\hat{\mathbf{v}}\) to the drug embedding \(\mathbf{u}_d\).
Analogy tasks for drug–gene pairs
In this section, based on the basic analogy task from the previous section, we explain analogy tasks for predicting target genes from a drug. First, we consider the global setting, where all drugs and genes in the vocabulary set are used. Then, we consider the pathway-wise setting, where drugs and genes are categorized based on biological pathways.
Global setting
As a first step, based on the basic analogy task, we consider analogy tasks in the global setting, where all drugs and genes are used. We define \(\mathscr {D}\subset \mathscr {V}\) and \(\mathscr {G}\subset \mathscr {V}\) as the sets of all drugs and genes, respectively. We also define \(\mathscr {R}\subset \mathscr {D}\times \mathscr {G}\) as the set of drug–gene pairs with drug–gene relations. Thus, if a drug \(d\in \mathscr {D}\) and a gene \(g\in \mathscr {G}\) have a drug–gene relation, the pair (d, g) is in \(\mathscr {R}\). For illustrating the positions of drugs and genes in the 100-dimensional BioConceptVec skip-gram embeddings, we sampled 200 drug–gene pairs in \(\mathscr {R}\) and plotted them in Fig. 2a using Principal Component Analysis (PCA). The vector difference between the mean embeddings of drugs and genes points in roughly the same direction as the vector differences between the embeddings of each sampled drug and its target gene.
2-d visualization of embeddings using PCA. Drugs and genes are shown in blue and orange, respectively. Solid lines represent the relations. The symbols red star and red square represent the mean embeddings of the drugs and genes, respectively. The direction of the dashed line connecting these two symbols can be considered as the drug–gene relation vector. (a) Randomly sampled 200 drug–gene pairs from \(\mathscr {R}.\) The mean embeddings are computed by Eq. (9). (b) Drugs \(d \in \mathscr {D}_p\) and genes \(g \in \mathscr {G}_p\) for \(p=\)ErbB signaling pathway. Those that have drug–gene relations are shown in boxes. The mean embeddings are computed by Eq. (S2) in Supplementary Information 1.1.1.
To apply the analogy computation of Eq. (3) to drug–gene pairs \((d,g)\in \mathscr {R}\), we consider the analogy tasks for predicting the gene g from the drug d. Using the vector \(\mathbf{v}\) representing the drug–gene relation, we predict \(\mathbf{u}_g\) by adding the relation vector \(\mathbf{v}\) to \(\mathbf{u}_d\) as follows:
\(\mathbf{u}_d + \mathbf{v} \approx \mathbf{u}_g. \qquad (4)\)
Thus, we need methods for estimating the relation vector \(\mathbf{v}\). In Eq. (2), the royalty relation vector \(\mathbf{v}_\text {royalty }\) is calculated from the embeddings of the given pair (man, king). Therefore, we estimate the relation vector \(\mathbf{v}\) from the embeddings of genes and drugs. The following equation gives the vector difference between the mean vectors of \(\mathscr {D}\) and \(\mathscr {G}\) as a naive estimator of \(\mathbf{v}\):
\(\hat{\mathbf{v}}_\text {naive} := \textrm{E}_\mathscr {G}\{\mathbf{u}_g\} - \textrm{E}_\mathscr {D}\{\mathbf{u}_d\}, \qquad (5)\)
\(\textrm{E}_\mathscr {D}\{\mathbf{u}_d\} := \frac{1}{|\mathscr {D}|}\sum _{d\in \mathscr {D}}\mathbf{u}_d, \quad \textrm{E}_\mathscr {G}\{\mathbf{u}_g\} := \frac{1}{|\mathscr {G}|}\sum _{g\in \mathscr {G}}\mathbf{u}_g. \qquad (6)\)
In Eq. (6), \(\textrm{E}_\mathscr {D}\{\cdot \}\) and \(\textrm{E}_\mathscr {G}\{\cdot \}\) represent the sample means over the set of all drugs \(\mathscr {D}\) and the set of all genes \(\mathscr {G}\), respectively. However, the definition of \(\hat{\mathbf{v}}_\text {naive}\) in Eq. (5) includes the embeddings of unrelated genes and drugs. Therefore, a better estimator may be defined by using only the pairs \((d,g)\in \mathscr {R}\). We consider the estimator \(\hat{\mathbf{v}}\) as the mean of the vector differences \(\mathbf{u}_g-\mathbf{u}_d\) for \((d, g) \in \mathscr {R}\) as follows:
\(\hat{\mathbf{v}} := \textrm{E}_\mathscr {R}\{\mathbf{u}_g-\mathbf{u}_d\}, \qquad (7)\)
where \(\textrm{E}_\mathscr {R}\{\cdot \}\) is the sample mean over the set of drug–gene pairs \(\mathscr {R}\). For easy comparison with Eq. (5), Eq. (7) is rewritten as the difference of mean vectors:
\(\hat{\mathbf{v}} = \textrm{E}_\mathscr {R}\{\mathbf{u}_g\} - \textrm{E}_\mathscr {R}\{\mathbf{u}_d\}, \qquad (8)\)
\(\textrm{E}_\mathscr {R}\{\mathbf{u}_d\} := \frac{1}{|\mathscr {R}|}\sum _{(d,g)\in \mathscr {R}}\mathbf{u}_d, \quad \textrm{E}_\mathscr {R}\{\mathbf{u}_g\} := \frac{1}{|\mathscr {R}|}\sum _{(d,g)\in \mathscr {R}}\mathbf{u}_g. \qquad (9)\)
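For concreteness, a minimal sketch of the two estimators is given below, assuming a placeholder embedding lookup `emb` (concept ID to vector) and placeholder collections `drugs`, `genes`, and `R` prepared elsewhere.

```python
# Sketch of the naive estimator of Eq. (5) and the relation estimator of
# Eq. (7). `emb` maps concept IDs to numpy arrays; `drugs`, `genes`, and `R`
# stand for the sets D, G, and the relation pairs R (placeholder names).
import numpy as np

def naive_estimator(emb, drugs, genes):
    # Eq. (5): difference between the mean gene and mean drug embeddings.
    return (np.mean([emb[g] for g in genes], axis=0)
            - np.mean([emb[d] for d in drugs], axis=0))

def relation_estimator(emb, R):
    # Eq. (7): mean of u_g - u_d over the related pairs (d, g) in R.
    return np.mean([emb[g] - emb[d] for d, g in R], axis=0)
```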
To measure the performance of the estimator \(\hat{\mathbf{v}}\) in Eq. (7), we prepare the evaluation of the analogy tasks. We define D and G as the sets of drugs and genes contained in \(\mathscr {R}\), respectively. Specifically, we define \(D := \pi _\mathscr {D}(\mathscr {R})\) as the set of drugs d such that \((d, g) \in \mathscr {R}\) for some genes g, and \(G := \pi _\mathscr {G}(\mathscr {R})\) as the set of genes g such that \((d, g) \in \mathscr {R}\) for some drugs d, where the projection operations \(\pi _\mathscr {D}\) and \(\pi _\mathscr {G}\) are defined as
\(\pi _\mathscr {D}(\mathscr {R}) := \{d \in \mathscr {D} \mid (d,g)\in \mathscr {R}\ \text {for some}\ g\in \mathscr {G}\}, \quad \pi _\mathscr {G}(\mathscr {R}) := \{g \in \mathscr {G} \mid (d,g)\in \mathscr {R}\ \text {for some}\ d\in \mathscr {D}\}. \qquad (10)\)
We also define \([d]\subset \mathscr {G}\) as the set of genes that have drug–gene relations with a drug \(d\in D\), and \([g]\subset \mathscr {D}\) as the set of drugs that have drug–gene relations with a gene \(g\in G\). These are formally defined as
\([d] := \{g\in \mathscr {G} \mid (d,g)\in \mathscr {R}\}, \quad [g] := \{d\in \mathscr {D} \mid (d,g)\in \mathscr {R}\}. \qquad (11)\)
Given the above, we perform the analogy computation in the following setting.
Setting G. In the analogy tasks, the set of answer genes for a query drug \(d \in D\) is [d]. The predicted gene is \(\hat{g}_d = {{\,\textrm{argmax}\,}}_{g\in \mathscr {G}}\cos (\mathbf{u}_d+\hat{\mathbf{v}}, \mathbf{u}_g)\) and if \(\hat{g}_d \in [d]\), then the prediction is considered correct. We define \(\hat{g}_d^{(k)}\) as the k-th ranked \(g \in \mathscr {G}\) based on \(\cos (\mathbf{u}_d+\hat{\mathbf{v}}, \mathbf{u}_g)\). For the top-k accuracy, if any of the top k predictions \(\hat{g}_d^{(1)}, \ldots , \hat{g}_d^{(k)} \in [d]\), then the prediction is considered correct.
Unlike in the basic analogy task, there may be multiple target genes for a single drug. While the basic analogy task is one-to-one, we address one-to-many analogy tasks in this study, where a single source may correspond to multiple targets34. Analogy computation for setting G is illustrated in Fig. 1b , where the set of answer genes [d] consists of \(g_1\), \(g_2\), and \(g_3\). If the predicted gene \(\hat{g}_d\) is one of these three genes, then the prediction is considered correct. Note that the analogy tasks for predicting drugs from a query gene can also be defined similarly. See Supplementary Information 1.3 for details.
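A sketch of how setting G can be evaluated is shown below; `emb`, `gene_ids`, and `answers` (mapping each query drug to its answer set \([d]\)) are assumed placeholder inputs, and the centering described later under Evaluation metrics is omitted here.

```python
# Sketch of top-k accuracy in setting G: rank all genes by cosine similarity
# to u_d + v_hat and count a hit if any of the top k genes is in [d].
import numpy as np

def topk_accuracy(emb, gene_ids, answers, v_hat, k=10):
    G = np.stack([emb[g] for g in gene_ids])
    G = G / np.linalg.norm(G, axis=1, keepdims=True)   # unit-normalize gene vectors
    hits = 0
    for d, answer_set in answers.items():
        q = emb[d] + v_hat
        sims = G @ (q / np.linalg.norm(q))             # cosine similarities
        top = [gene_ids[i] for i in np.argsort(-sims)[:k]]
        hits += any(g in answer_set for g in top)
    return hits / len(answers)
```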
Pathway-wise setting
As a next step, based on the analogy tasks in setting G, we consider more detailed analogy tasks. To do this, we consider the analogy tasks in the pathway-wise setting, where drugs and genes are categorized using biological pathways. We define \(\mathscr {P}\) as the set of pathways p, and \(\mathscr {D}_p \subset \mathscr {D}\) and \(\mathscr {G}_p \subset \mathscr {G}\) as the sets of drugs and genes categorized in each pathway \(p \in \mathscr {P}\), respectively. We then restrict the set \(\mathscr {R}\) to each pathway p and define the subset of \(\mathscr {R}\) as:
\(\mathscr {R}_p := \mathscr {R}\cap (\mathscr {D}_p\times \mathscr {G}_p). \qquad (12)\)
A specific example of these sets for the ErbB signaling pathway is shown in Table 1, and their BioConceptVec skip-gram embeddings are illustrated in Fig. 2b. The vector difference between the mean embeddings of drugs and genes roughly points in the same direction as the vector differences between the embeddings of each drug and its target gene in \(\mathscr {R}_p\), although such a two-dimensional illustration should be interpreted with caution.
For drug–gene pairs \((d,g) \in \mathscr {R}_p\), we consider the analogy tasks for predicting the target genes g from a drug d. To solve these analogy tasks, we use the relation vector \(\mathbf{v}_p\), which represents the relation between drugs and target genes categorized in the same pathway p. We predict \(\mathbf{u}_g\) by adding the relation vector \(\mathbf{v}_p\) to \(\mathbf{u}_d\), expecting that
\(\mathbf{u}_d + \mathbf{v}_p \approx \mathbf{u}_g. \qquad (13)\)
Equation (13) corresponds to Eq. (4). Therefore, similar to the estimator \(\hat{\mathbf{v}}\) in Eq. (7), we define an estimator \(\hat{\mathbf{v}}_p\) for the relation vector \(\mathbf{v}_p\) as the mean of the vector differences \(\mathbf{u}_g-\mathbf{u}_d\) for \((d, g) \in \mathscr {R}_p\):
\(\hat{\mathbf{v}}_p := \textrm{E}_{\mathscr {R}_p}\{\mathbf{u}_g-\mathbf{u}_d\}, \qquad (14)\)
where \(\textrm{E}_{\mathscr {R}_p}\{\cdot \}\) is the sample mean over the set of drug–gene pairs \(\mathscr {R}_p\).
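A sketch of the pathway-wise estimators of Eq. (14) is shown below, assuming a placeholder dictionary `R_by_pathway` that maps each pathway ID to its list of drug–gene pairs \(\mathscr {R}_p\).

```python
# Sketch of Eq. (14): one relation vector per pathway, averaged over the
# drug-gene pairs in R_p. `emb` and `R_by_pathway` are placeholder inputs.
import numpy as np

def pathway_relation_estimators(emb, R_by_pathway):
    return {p: np.mean([emb[g] - emb[d] for d, g in pairs], axis=0)
            for p, pairs in R_by_pathway.items()}
```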
To measure the performance of the estimator \(\hat{\mathbf{v}}_p\), we prepare the evaluation of the analogy tasks. Similar to D and G, we define \(D_p\) and \(G_p\) as the sets of drugs and genes contained in \(\mathscr {R}_p\), respectively. Using the operations of Eq. (10), we define \(D_p:=\pi _{\mathscr {D}_p}(\mathscr {R}_p)\subset \mathscr {D}_p\) as the set of drugs d such that \((d, g) \in \mathscr {R}_p\) for some genes g, and \(G_p:=\pi _{\mathscr {G}_p}(\mathscr {R}_p)\subset \mathscr {G}_p\) as the set of genes g such that \((d, g) \in \mathscr {R}_p\) for some drugs d. We also define \([d]_p\subset \mathscr {G}_p\) as the set of genes that have drug–gene relations with a drug \(d\in D_p\), and \([g]_p\subset \mathscr {D}_p\) as the set of drugs that have drug–gene relations with a gene \(g\in G_p\). Similar to Eq. (11), these sets are formally defined as
\([d]_p := \{g\in \mathscr {G}_p \mid (d,g)\in \mathscr {R}_p\}, \quad [g]_p := \{d\in \mathscr {D}_p \mid (d,g)\in \mathscr {R}_p\}. \qquad (15)\)
Given the above, we perform the analogy computation in the following two settings.
Setting P1. For the target genes that have drug–gene relations with a drug d, only genes categorized in the same pathway p as the drug d are considered correct. In other words, for a query drug \(d \in D_p\), the set of answer genes is \([d]_p\). The search space is the set of all genes \(\mathscr {G}\), not limited to \(\mathscr {G}_p\), the set of genes categorized in the pathway p. The predicted gene is \(\hat{g}_d = {{\,\textrm{argmax}\,}}_{g \in \mathscr {G}} \cos (\mathbf{u}_d + \hat{\mathbf{v}}_p, \mathbf{u}_g)\), and if \(\hat{g}_d \in [d]_p\), then the prediction is considered correct. We define \(\hat{g}_d^{(k)}\) as the k-th ranked \(g\in \mathscr {G}\) based on \(\cos (\mathbf{u}_d + \hat{\mathbf{v}}_p, \mathbf{u}_g)\). For the top-k accuracy, if any of the top k predictions \(\hat{g}_d^{(1)}, \ldots , \hat{g}_d^{(k)}\) is in \([d]_p\), then the prediction is considered correct.
Setting P2. The gene predictions \(\hat{g}_d\) and \(\hat{g}_d^{(k)}\) are defined exactly the same as those in setting P1, but the answer genes are defined the same as in setting G. That is, for the target genes that have drug–gene relations with a drug d, genes are considered correct regardless of whether they are categorized in the same pathway p as the drug d or not. In other words, for a query drug \(d \in D\), the set of answer genes is [d], and the prediction is considered correct if \(\hat{g}_d \in [d]\). For the top-k accuracy, if any of the top k predictions \(\hat{g}_d^{(1)}, \ldots , \hat{g}_d^{(k)}\) is in [d], then the prediction is considered correct. Note that the experiment is performed for \(d \in \mathscr {D}_p \cap D\) for each p.
Figure S1 in Supplementary Information 1.1.1 shows the differences between settings P1 and P2 using specific examples. Table S2 in Supplementary Information 1.3 summarizes the queries, answer sets, and search spaces in settings G, P1, and P2.
Analogy tasks for drug–gene pairs by year
In this section, based on analogy tasks for drug–gene pairs, we explain analogy tasks in the setting where datasets are divided by year. To do so, we first use embeddings trained on PubMed abstracts up to year y and consider analogy tasks in a global setting such as setting G. Next, we separate the drug–gene relations into “known” or “unknown” based on their chronological appearance in the PubMed abstracts. Using the embeddings from the datasets divided by year, we test whether embeddings of “known” relations have the ability to predict “unknown” relations. We then consider the pathway-wise settings by year, where drugs and genes are categorized based on pathways in the datasets divided by year. In these settings, we use the year-specific embeddings and consider analogy tasks in settings such as P1 and P2. Finally, we evaluate whether “unknown” relations can be predicted by “known” relations.
Global setting by year
First, using embeddings trained on PubMed abstracts up to year y, we consider analogy tasks in a global setting. In preparation, we define \(y_d\) as the year when a drug d first appeared in a PubMed abstract and \(y_g\) as the year when a gene g first appeared in a PubMed abstract.
Consider a fixed year y. When learning embeddings using PubMed abstracts up to year y as training data, we define \(\mathscr {D}^y := \{ d \in \mathscr {D} \mid y_d \le y\}\) and \(\mathscr {G}^y := \{ g \in \mathscr {G} \mid y_g \le y\}\) as the sets of drugs and genes that appeared up to year y, respectively. The set of drug–gene pairs that have drug–gene relations and whose drugs and genes appeared up to year y is expressed as
\(\mathscr {R}^y := \{(d,g)\in \mathscr {R} \mid d\in \mathscr {D}^y,\ g\in \mathscr {G}^y\}. \qquad (16)\)
Similar to the global setting, for drug–gene pairs \((d,g) \in \mathscr {R}^y\), we consider the analogy tasks for predicting the target genes g from a drug d. To solve these analogy tasks, we use the relation vector \(\mathbf{v}^y\). We predict \(\mathbf{u}_g\) by adding the relation vector \(\mathbf{v}^y\) to \(\mathbf{u}_d\):
\(\mathbf{u}_d + \mathbf{v}^y \approx \mathbf{u}_g. \qquad (17)\)
Equation (17) corresponds to Eq. (4). Therefore, we define the estimator \(\hat{\mathbf{v}}^{y}\) for the relation vector \(\mathbf{v}^{y}\) by using \(\mathscr {R}^{y}\) instead of \(\mathscr {R}\) in the estimator \(\hat{\mathbf{v}}\) in Eq. (7). Given the above, similar to setting G, we perform the analogy tasks in the following setting.
Setting Y1. Using embeddings trained on PubMed abstracts up to year y, we consider analogy tasks in setting G. Thus, if y is the most recent, it simply corresponds to setting G.
See Supplementary Information 1.2.1 for more details on the setting.
Global setting to predict unknown relations by year
Based on analogy tasks in the global setting by year, we consider analogy tasks to predict unknown relations using known relations. Specifically, drug–gene relations that appeared up to year y are considered known, while those that appeared after year y are considered unknown. We then use embeddings trained on PubMed abstracts up to year y, redefine vectors representing known relations, and use these vectors to predict unknown relations. In preparation, we define \(y_{(d,g)}\) as the year when both drug d and gene g first appeared together in a PubMed abstract. We consider \(y_{(d,g)}\) as a substitute for the year when the relation (d, g) was first identified. By definition, \(\max \{y_d, y_g\} \le y_{(d,g)}\) holds. The relation (d, g) is interpreted as either known by year y if \(y_{(d,g)} \le y\) or unknown by year y if \(y < y_{(d,g)}\).
We define two subsets of \(\mathscr {R}^y\) based on whether \(y_{(d,g)} \le y\) or \(y < y_{(d,g)}\). To do this, we define two intervals, \(L_y:=(-\infty , y]\) and \(U_y:=(y, \infty )\). Using \(L_y\) and \(U_y\), we define the subsets \(\mathscr {R}^{y\mid L_y}\) and \(\mathscr {R}^{y\mid U_y}\) of \(\mathscr {R}^y\) as follows:
\(\mathscr {R}^{y\mid L_y} := \{(d,g)\in \mathscr {R}^y \mid y_{(d,g)}\in L_y\}, \quad \mathscr {R}^{y\mid U_y} := \{(d,g)\in \mathscr {R}^y \mid y_{(d,g)}\in U_y\}. \qquad (18)\)
The set of “known” relations is expressed as \(\mathscr {R}^{y \mid L_y}\), and the set of “unknown” relations is expressed as \(\mathscr {R}^{y \mid U_y}\). By definition, \(\mathscr {R}^{y \mid L_y} \cap \mathscr {R}^{y \mid U_y} = \emptyset\) and \(\mathscr {R}^{y \mid L_y} \cup \mathscr {R}^{y \mid U_y}\subset \mathscr {R}^y\).
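The split of Eq. (18) can be sketched as follows, assuming a placeholder dictionary `first_cooccurrence_year` that records \(y_{(d,g)}\) for the pairs that co-occur in some abstract.

```python
# Sketch of Eq. (18): split R^y into "known" and "unknown" relations by the
# year y_{(d,g)} of first co-occurrence; pairs that never co-occur in the
# corpus fall into neither subset.
def split_known_unknown(R_y, first_cooccurrence_year, y):
    known, unknown = set(), set()
    for d, g in R_y:
        y_dg = first_cooccurrence_year.get((d, g))
        if y_dg is None:
            continue
        (known if y_dg <= y else unknown).add((d, g))
    return known, unknown
```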
In analogy tasks, we use “known” \(\mathscr {R}^{y \mid L_y}\) and then predict the target genes g from a drug d for (d, g) in “unknown” \(\mathscr {R}^{y \mid U_y}\). Using the relation vector \(\mathbf{v}^{y\mid L_y}\), which represents the drug–gene relations in \(\mathscr {R}^{y \mid L_y}\), we predict \(\mathbf{u}_g\) by adding the relation vector \(\mathbf{v}^{y\mid L_y}\) to \(\mathbf{u}_d\):
\(\mathbf{u}_d + \mathbf{v}^{y\mid L_y} \approx \mathbf{u}_g. \qquad (19)\)
Equation (19) corresponds to Eq. (17). Therefore, we define the estimator \(\hat{\mathbf{v}}^{y\mid L_y}\) for the relation vector \(\mathbf{v}^{y\mid L_y}\) by using \(\mathscr {R}^{y\mid L_y}\) instead of \(\mathscr {R}\) in the estimator \(\hat{\mathbf{v}}\) in Eq. (7). Given the above, we perform the analogy tasks in the following setting.
Setting Y2. Using embeddings trained on PubMed abstracts up to year y, we consider analogy tasks with a reduced answer set. For the target genes that have drug–gene relations with a drug d, only genes whose relations appeared after year y are considered correct. In other words, only new discoveries are counted as correct.
See Supplementary Information 1.2.2 for more details on the setting. Note that \(y_d\), \(y_g\), and \(y_{(d,g)}\) are defined based on the year the drugs, genes, and drug–gene relations appeared in PubMed abstracts. Thus, they do not fully correspond to their actual years of discovery, and we only consider the analogy tasks in these hypothetical settings.
Pathway-wise setting by year
As a next step, based on the analogy tasks in settings P1, P2, and Y1, we consider the analogy tasks in the pathway-wise setting by year, where drugs and genes are categorized based on pathways in datasets divided by year. To do this, we perform the analogy tasks in the following two settings.
Setting P1Y1. Using embeddings trained on PubMed abstracts up to year y, we consider analogy tasks in setting P1. Thus, if y is the most recent, it corresponds to setting P1.
Setting P2Y1. Using embeddings trained on PubMed abstracts up to year y, we consider analogy tasks in setting P2. Thus, if y is the most recent, it corresponds to setting P2.
See Supplementary Information 1.2.3 for details on the analogy tasks and these settings.
Pathway-wise setting to predict unknown relations by year
Furthermore, based on the analogy tasks in settings P1, P2, and Y2, we consider analogy tasks to predict unknown relations using known relations in the pathway-wise setting by year. To do this, we perform the analogy tasks in the following two settings.
Setting P1Y2. Using embeddings trained on PubMed abstracts up to year y, we consider analogy tasks in setting P1 with a reduced answer set. For the target genes that have drug–gene relations with a drug d, only genes categorized in the same pathway p as d, and whose relations appeared after year y, are considered correct.
Setting P2Y2. Using embeddings trained on PubMed abstracts up to year y, we consider analogy tasks in setting P2 with a reduced answer set. For the target genes that have drug–gene relations with a drug d, only genes whose relations appeared after year y are considered correct, regardless of whether they are categorized in the same pathway p as the drug d or not.
See Supplementary Information 1.2.4 for details on the analogy tasks and these settings.
Embeddings
BioConceptVec13 provides four pre-trained 100-dimensional word embeddings: CBOW7, skip-gram7,8, GloVe35, and fastText36. Since CBOW is a simpler model than skip-gram and skip-gram performs better in analogy tasks compared to GloVe9, we used BioConceptVec skip-gram embeddings for our experiments. Note that fastText can essentially be considered as a skip-gram using n-grams. To complement the pre-trained BioConceptVec embeddings, we further trained 300-dimensional skip-gram embeddings on the publicly available PubMed abstracts. As with the original BioConceptVec, to train our embeddings, we used PubTator14 to convert six major biological concepts (genes, mutations, diseases, chemicals, cell lines, and species) in PubMed abstracts into their respective IDs, followed by tokenization using NLTK37. Note that since widely used embeddings such as word2vec and GloVe typically have 300 dimensions, we set the dimensions of our embeddings to 300 instead of the original 100. The hyperparameters used to train our embeddings are shown in Table S4 in Supplementary Information 1.4.
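As an illustration, skip-gram embeddings of this kind can be trained with gensim as sketched below; the hyperparameter values are placeholders rather than the settings in Table S4, and `sentences` stands for the concept-normalized, tokenized abstracts prepared elsewhere.

```python
# Illustrative skip-gram training on concept-normalized PubMed abstracts.
# `sentences` is an iterable of token lists in which PubTator concept
# mentions have been replaced by their concept IDs (prepared elsewhere).
from gensim.models import Word2Vec

model = Word2Vec(
    sentences=sentences,
    vector_size=300,  # our embeddings use 300 dimensions
    sg=1,             # sg=1 selects skip-gram rather than CBOW
    window=5,         # placeholder context window
    min_count=5,      # placeholder minimum word-occurrence threshold
    workers=8,
)
model.wv.save("pubmed_skipgram.kv")  # hypothetical output path
```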
For BioConceptVec and our skip-gram embeddings, Table 2 shows some basic statistics for setting G. Due to differences in training data size and in the minimum word-occurrence threshold, the sizes of certain sets such as \(|\mathscr {D}|\) and \(|\mathscr {G}|\) differ significantly between BioConceptVec and our skip-gram embeddings, but the sizes of other sets, such as \(|\mathscr {R}|\), which represents the size of overall relations, show somewhat similar trends. Figure S2 in Supplementary Information 1.4 shows the distribution of the sizes of the answer sets for each drug d under settings G, P1, and P2 for both BioConceptVec and our skip-gram embeddings. Table S1 in Supplementary Information 1.1.1 shows the statistics for settings P1 and P2.
Datasets
Corpus. Following Chen et al.13, we used PubMed (https://pubmed.ncbi.nlm.nih.gov/) abstracts to train our skip-gram embeddings. We used about 35 million abstracts up to the year 2023, while they used about 30 million abstracts.
Drug–gene relations. For drug–gene relations, we used publicly available data from AsuratDB38, which collects information from various databases including the KEGG17,18,19 database. KEGG (Kyoto Encyclopedia of Genes and Genomes) is a comprehensive database system that integrates a wide range of bioinformatics information such as genomics, chemical reactions, and biological pathways.
Biological pathways. We obtained a list of human pathways from the KEGG API (https://rest.kegg.jp/list/pathway/hsa) for use in our experiments. For each pathway, we again used the KEGG API to define sets of drugs and genes. To avoid oversimplification of analogy tasks, we excluded the pathways where only one type of drug or gene had drug–gene relations.
For more details, see Data availability section and Supplementary Information 1.5.
Baselines
To evaluate the performance of predicting target genes by adding the relation vector to a drug embedding, we compare it to baseline methods.
As a simple yet reasonable baseline, we randomly sampled genes from the set G containing the genes that have drug–gene relations with at least one drug. To increase the probability of sampling genes with more drug–gene relations, a gene \(g \in G\) was sampled with a probability proportional to |[g]|, the size of the set of drugs that have correct drug–gene relations with g. In other words, sampling was performed with probabilities proportional to the number of queries that have correct relations. For simplicity, this sampling method was consistently used across settings G, P1, and P2. We refer to this baseline as the random baseline. In this baseline, we repeated the experiments 10 times and used the average score as the final result. Similar random baselines can also be applied to settings by year and to settings for predicting drugs from genes. See Supplementary Information 1.6 for details.
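A sketch of this random baseline is given below, assuming a placeholder dictionary `related_drugs` that maps each gene \(g \in G\) to its set of related drugs \([g]\).

```python
# Sketch of the random baseline: rank the genes in G by weighted sampling
# without replacement, with probability proportional to |[g]|.
import numpy as np

def random_ranking(related_drugs, seed=0):
    rng = np.random.default_rng(seed)
    genes = list(related_drugs)
    weights = np.array([len(related_drugs[g]) for g in genes], dtype=float)
    order = rng.choice(len(genes), size=len(genes), replace=False,
                       p=weights / weights.sum())
    return [genes[i] for i in order]
```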
The analogy task we consider can be regarded as a task of predicting relationships between entities. Therefore, we adopted Knowledge Graph Embedding (KGE)39 as a general baseline method for predicting drug–gene relations. KGE learns embeddings for head entities, tail entities, and relations from triplets by leveraging the explicit structure of the knowledge graph. Using the learned embeddings, it predicts a specific tail corresponding to a given head and relation. Among various KGE methods, we used TransE40, one of the most representative approaches, with the embedding dimensions set to the commonly used size of 500. Similar to the analogy computation, TransE predicts the tail embedding by adding the relation vector to the head embedding. However, while the analogy computation relies solely on document data and pathway information, TransE leverages explicit relations in the knowledge graph to directly learn embeddings and relation vectors. Word embeddings such as skip-gram are learned from large corpora (e.g., PubMed abstracts) without being tied to any particular research field or specific target task. Consequently, they capture broad linguistic and semantic information (not restricted to drug–gene relations), enabling them to be applied to various tasks (e.g., analogy tasks41, sentence similarity tasks5). In contrast, KGE is learned based on knowledge and relational information tailored to a specific task, resulting in embeddings specialized for that task. For details on the dataset splits and hyperparameters used for training TransE, see Supplementary Information 1.8.
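For reference, TransE scores a triple \((h, r, t)\) by the distance \(\Vert \mathbf{u}_h + \mathbf{u}_r - \mathbf{u}_t\Vert\); the sketch below illustrates only this tail-ranking step and omits embedding training, which in our experiments follows the setup in Supplementary Information 1.8.

```python
# Sketch of TransE's scoring principle: predict the tail by translating the
# head with the relation vector and ranking candidate tails by distance.
import numpy as np

def transe_rank_tails(head_vec, rel_vec, tail_matrix):
    # Smaller ||h + r - t|| means a more plausible triple; return tail
    # indices ordered from most to least plausible.
    distances = np.linalg.norm(head_vec + rel_vec - tail_matrix, axis=1)
    return np.argsort(distances)
```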
In addition to this, as strong generative model baselines, we used the Generative Pre-trained Transformer (GPT)42 series to predict the top 10 target genes from a query drug by zero-shot. We used the GPT-3.543, GPT-444, and GPT-4o (https://platform.openai.com/docs/models/gpt-4o) models with the temperature hyperparameter set to 0. See Supplementary Information 1.7 for more details on the models and prompt template for the predictions.
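A hypothetical zero-shot query is sketched below using the OpenAI Python client; the wording of the prompt and the example drug are illustrative only, and the actual template is given in Supplementary Information 1.7.

```python
# Hypothetical zero-shot prediction of target genes with GPT-4o; the prompt
# below is an illustrative placeholder, not the template used in this study.
from openai import OpenAI

client = OpenAI()  # reads the API key from the environment
response = client.chat.completions.create(
    model="gpt-4o",
    temperature=0,
    messages=[{
        "role": "user",
        "content": ("List the top 10 human target genes of the drug gefitinib "
                    "as official gene symbols, one per line."),
    }],
)
print(response.choices[0].message.content)
```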
Evaluation metrics
As evaluation metrics, we use the top-k accuracy explained in each setting, especially the top-1 and top-10 accuracies. We also use Mean Reciprocal Rank (MRR) as another evaluation metric. MRR is a statistical measure for evaluating search performance, computed as the mean of the reciprocal ranks at which a correct answer first appears in the predicted results. Note that, in the random baseline, all genes \(g \in G\) are ranked without duplication through sampling with probabilities proportional to |[g]|. By treating these rankings as prediction results, we can calculate not only top-1 and top-10 accuracies but also MRR. Since MRR cannot be calculated for the GPT models due to the limited number of prediction candidates output by GPT, only top-1 and top-10 accuracies are calculated.
When evaluating performance using top-1 accuracy, top-10 accuracy, and MRR, the embeddings of all genes \(\mathscr {G}\) (or \(\mathscr {G}^y\) in settings by year) are centered using the mean embedding \(\textrm{E}_\mathscr {G}\{\mathbf{u}_g\}\) in Eq. (6) (or \(\textrm{E}_{\mathscr {G}^y}\{\mathbf{u}_g\}\)). Since cosine similarity is affected by the origin, centering mitigates this effect. In the case of setting G, for example, we actually calculate the value of \(\cos (\mathbf{u}_d-\textrm{E}_\mathscr {G}\{\mathbf{u}_g\}+\hat{\mathbf{v}}, \mathbf{u}_g-\textrm{E}_\mathscr {G}\{\mathbf{u}_g\})\) instead of \(\cos (\mathbf{u}_d+\hat{\mathbf{v}}, \mathbf{u}_g)\).
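A sketch of this centering and of MRR is given below; `ranked_gene_lists` and `answer_sets` are placeholder inputs aligned per query.

```python
# Sketch of gene-embedding centering and of Mean Reciprocal Rank (MRR).
import numpy as np

def center_and_query(gene_matrix, drug_vec, v_hat):
    # Subtract the mean gene embedding from both sides before cosine ranking,
    # as described above.
    mean_gene = gene_matrix.mean(axis=0)
    return gene_matrix - mean_gene, drug_vec - mean_gene + v_hat

def mean_reciprocal_rank(ranked_gene_lists, answer_sets):
    reciprocal_ranks = []
    for ranked, answers in zip(ranked_gene_lists, answer_sets):
        rank = next((i + 1 for i, g in enumerate(ranked) if g in answers), None)
        reciprocal_ranks.append(1.0 / rank if rank is not None else 0.0)
    return float(np.mean(reciprocal_ranks))
```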
Results
Gene prediction performance in settings G, P1, and P2
Table 3 shows the results of experiments with BioConceptVec and our skip-gram embeddings in settings G, P1, and P2. For GPT-3.5, GPT-4, and GPT-4o, only the best results are shown; see Supplementary Information 2.1.3 for detailed results. In setting G, the prediction by simply adding the global relation vector \(\hat{\mathbf{v}}\) resulted in a top-1 accuracy of about 0.3, a top-10 accuracy of over 0.6, and an MRR also over 0.4. These results show that the embeddings have the ability to interpret the drug–gene relations. As with the basic analogy task, these relations were not explicitly provided during the training of the embeddings.
In settings P1 and P2, prediction by adding the pathway-wise relation vector \(\hat{\mathbf{v}}_p\) showed better performance than in setting G. For example, our skip-gram embeddings achieved a top-1 accuracy of over 0.5 in both settings P1 and P2. This is probably because the queries are specific drugs categorized in some pathways, and we use the pathway information to calculate the relation vectors. Since the analogy tasks are considered for each pathway, these settings are also likely to make the tasks easier than those in setting G.
The performance of the random baseline in settings G, P1, and P2 was consistently low across all evaluation metrics. Although random sampling was restricted to genes in the set G, these results demonstrate that random prediction is challenging in these analogy tasks. This also confirms the superior performance of prediction by adding relation vectors. TransE, as a KGE method, naturally outperforms our proposed approach across all three evaluation metrics. Interestingly, however, the top-10 accuracy of the prediction by vector addition is comparable to that of TransE. This suggests that our method has the latent potential to achieve performance close to TransE without directly exploiting explicit relational information. We then hypothesize that the strong performance of TransE observed in Table 3 depends on the size of the training data. In the Discussion section, we compare analogy computation and TransE under varying proportions of the training data (Fig. 4). Among the GPT series, GPT-4o performed the best, followed by GPT-4, which outperformed GPT-3.5. In top-1 accuracy, the GPT series performed better than predictions made by adding relation vectors. However, in top-10 accuracy, the difference becomes smaller. Notably, predictions using \(\mathbf{v}_p\) with our skip-gram embeddings in settings P1 and P2 performed comparably to GPT-4o and outperformed both GPT-3.5 and GPT-4.
In addition, our skip-gram outperformed BioConceptVec except for the top-1 accuracy in setting G. This is probably because BioConceptVec has a dimensionality of 100 and the size of the search space is \(|\mathscr {G}|=117282,\) while our skip-gram has a dimensionality of 300 and the size of the search space is \(|\mathscr {G}|=28284\), as shown in Table 2.
For the results in Table 3, we used the better estimators in Eqs. (7) and (14), not the naive estimators in Eqs. (5) and (S3), and calculated the cosine similarities after centering the embeddings of all genes \(\mathscr {G}\) (or \(\mathscr {G}^y).\) The results of the naive estimators and the centering ablation study are shown in Table S7 in Supplementary Information 2.1.1. The results of the analogy tasks for predicting drugs from a query gene are shown in Table S8 in Supplementary Information 2.1.2.
Gene prediction performance in settings Y1 and Y2
Figure 3a and b show the results of experiments in settings Y1 and Y2 with our skip-gram embeddings trained on the year-specific divisions up to year y. BioConceptVec embeddings are not appropriate for the settings by year, since they are pre-trained on the entire dataset without year-specific divisions.
Setting Y1 is easier than setting Y2, and results in high evaluation metric scores. For example, from \(y=1980\) to \(y=2023\), the top-1 accuracy reaches about 0.3, and the top-10 accuracy is about 0.6. In setting Y1, as in setting G, all relations contained in \(\mathscr {R}\) are considered correct, so all the scores for \(y=2023\) are identical to those of setting G. For example, a top-10 accuracy of 0.686 is confirmed in both Fig. 3a and Table 3.
In setting Y2, since only genes whose relations appeared after year y are included in the answer set, the tasks are more challenging. Therefore, the evaluation metric scores are lower compared to those in setting Y1, where genes whose relations appeared up to year y are also included in the answer set. Yet, in setting Y2, the top-1 accuracy from \(y=1985\) to \(y=2015\) is about 0.1, the top-10 accuracy is over 0.3, and the MRR is about 0.2. Although in setting Y2 the relations to be predicted appeared after year y and are not used to calculate \(\hat{\mathbf{v}}^{y\mid L_y}\), these results show that adding \(\hat{\mathbf{v}}^{y\mid L_y}\) to \(\mathbf{u}_d\) can predict these relations.
In setting Y1, the number of queries increases steadily from older to more recent years, while in setting Y2, the number of queries increases initially and then decreases over time. In addition, the size of the search space is small for older years (e.g., \(|\mathscr {G}^{1975}|=725\)) and increases for later years (e.g., \(|\mathscr {G}^{2020}|=47509\)). As a result, in both settings Y1 and Y2, the datasets for older years such as \(y=1975\) and \(y=1980\) have fewer queries and smaller search spaces, resulting in unusually high evaluation metric scores. In setting Y2, the datasets for later years have fewer queries and larger search spaces, which tends to produce lower evaluation metric scores. In both settings, the performance of the random baseline is extremely high for older years such as \(y=1975\) and \(y=1980\), where the number of genes related to drugs is small. However, after \(y=1985\), our method consistently outperforms the random baseline. For more detailed results for settings Y1 and Y2, see Table S12 in Supplementary Information 2.2.1.
Gene prediction performance in settings P1Y1, P2Y1, P1Y2, and P2Y2
Figure 3c and d show the results of experiments with our skip-gram embeddings in settings P1Y1 and P1Y2. The results in settings P1Y1 and P1Y2 not only follow similar trends to those in settings Y1 and Y2 but also consistently show higher evaluation metric scores. From the definition of setting P1Y1, the scores for \(y=2023\) are identical to those of setting P1. For example, a top-10 accuracy of 0.862 is confirmed in both Fig. 3c and Table 3.
The analogy tasks in settings P1Y1 and P1Y2 use detailed pathway information. Therefore, predicting the relations that appeared after year y is easier in settings P1Y1 and P1Y2 than in settings Y1 and Y2. In addition, the results for the older years show high evaluation metric scores in settings P1Y1 and P1Y2, similar to those in settings Y1 and Y2. The performance of the random baseline also shows similar trends to those observed in settings Y1 and Y2.
For the results in settings P2Y1 and P2Y2, see Fig. S4 in Supplementary Information 2.2.2. They are very similar to those in settings P1Y1 and P1Y2. For detailed results for settings P1Y1, P2Y1, P1Y2, and P2Y2, see Table S15 in Supplementary Information 2.2.2.
Biological insights from predicting drug–gene relations
In this section, we investigate the biological insights obtained through analogy tasks by adding the relation vector in settings G, P1, and P2. For this purpose, we focus on genes and drugs categorized in the ErbB signaling pathway as shown in Table 1 and Fig. 2b. Furthermore, since our skip-gram embeddings showed better performance compared to the BioConceptVec skip-gram embeddings (Table 3), we adopted our embeddings for this analysis. \(\mathscr {D}_p\), \(\mathscr {G}_p\), and \(\mathscr {R}_p\) of our embeddings are identical to those of the BioConceptVec embeddings in Table 1.
In Table 4, we have listed the predicted target genes for drugs that are categorized in the ErbB signaling pathway and whose answer set size in setting G is two or more. For all drugs meeting these criteria, at least one answer gene was included in the top 10 predicted target genes for any of the settings. Therefore, it can be said that the prediction of drug–gene relations through analogy tasks functioned appropriately. Additionally, we provide several examples demonstrating that some genes ranked high in the predictions, even though they were not in the answer set, can still be interpreted biologically.
First, the target genes of Bosutinib are ABL1 and SRC, which are both known to be non-receptor tyrosine kinases (non-RTKs)45. Among the top 10 predicted target genes, TXK46 and JAK247 can also be categorized as non-RTKs, sharing several properties with ABL1 and SRC. This implies that the structural and biochemical similarities of these genes may have been reflected in their high-dimensional representations.
For Masoprocol, we were able to predict the correct target gene EGFR as the fifth prediction in settings P1 and P2. However, although EGFR is the sole target gene of Masoprocol according to the information deposited in KEGG, Masoprocol has also been reported to inhibit lipoxygenase activity48. Since ALOX5 codes for a lipoxygenase, this prediction should not be considered inaccurate. The diseases treated with Masoprocol include actinic keratosis49, and it has been previously known that the mutation of TP53 is involved in the onset of this disease50,51. Also, the gene that was ranked the highest in this setting, ALOX5, is reported to be one of the transcription targets of TP5352. Together, this information suggests that TP53 is deeply related to the target gene and symptoms of Masoprocol. Therefore, although it may not be a direct target, its high rank in the prediction is justified.
When we set Poziotinib as the query, EGFR and ERBB2 were correctly ranked the highest in settings P1 and P2. However, in setting G, ROS1, EML4, and ALK were also found among the highly ranked targets. Since Poziotinib was initially developed as an effective drug to treat lung cancer with HER2 mutation53, this context appears to be reflected in the embeddings, as ROS1 and EML4-ALK fusion genes are also characteristic mutations and therapeutic targets in lung cancer54,55.
Selumetinib and Trametinib both target MAP2K1 and MAP2K2, and in both cases these genes were found within the top 10 predictions. Aside from these target genes, we observed BRAF and PIK3CA ranked highly among the predicted genes. Given the history that these drugs were both developed to treat cancers with the BRAF V600 mutation and that combination treatment with inhibition of the PI3K-AKT pathway has been explored56,57,58, we assume that such background is reflected in these results.
Discussion
Connections to trends in existing studies
Mikolov et al.59 have shown that skip-gram achieves a top-1 accuracy of 0.56 on the Microsoft Research Syntactic Analogies Dataset41. Given this, the performance of BioConceptVec and our skip-gram embeddings in the analogy tasks for drug–gene pairs, as presented in Table 3, is comparable to that of skip-gram in the basic analogy tasks. Also, Tshitoyan et al.20 reported an overall accuracy of 60.1% when using analogy computation of word embeddings to predict 29,046 relationships across various concepts in materials science, including element names, crystal symmetries, and magnetic properties. Our results for drug–gene relation prediction show a similar level of accuracy, suggesting that the approach of using analogy computation of word embeddings can be effective in the biomedical domain as well. This demonstrates the potential of this method across different scientific disciplines. However, it is important to note that the accuracy varies significantly across different types of relationships in the materials science study. For instance, predictions for chemical element names showed high accuracy (71.4%), while crystal structure names had lower accuracy (18.7%). This variability suggests that the effectiveness of word embedding analogies may depend on the specific type of relationship being predicted.
Strengths and insights from prediction by adding the relation vector
To test the hypothesis that the superior performance of TransE depends on the size of the training data, we compared the performance of our method with that of TransE in setting P1 by varying the proportion of training data used to compute relation vectors (see Supplementary Information 2.4 for experimental details). As Fig. 4 shows, TransE’s performance degrades as the proportion of training data decreases, whereas our method maintains nearly constant performance. In particular, our method outperforms TransE when the available training data is small. This is because our approach derives relation vectors from already-trained word embeddings, while TransE learns all embeddings from scratch using the training data. These findings suggest that our method can function efficiently even with limited datasets.
Changes in evaluation metrics on the test data for analogy computation and TransE under setting P1 as the proportion of training data varies. Analogy computation maintains consistent performance regardless of the proportion of training data, and outperforms TransE when the proportion of training data is small. See Supplementary Information 2.4 for experimental details.
In addition, in Table 3, predictions using \(\mathbf{v}_p\) with our skip-gram embeddings in settings P1 and P2 performed on par with GPT-4o and outperformed both GPT-3.5 and GPT-4. It is rather surprising that simple vector addition can achieve performance comparable to large language models. Given that the GPT series is trained on vast amounts of text data, using relation vector addition is an efficient method. Furthermore, the results showing that target genes can be predicted with performance comparable to or better than large language models imply that high-quality information about drug–gene relations is encoded in the embeddings. This also suggests that the computation of analogies may be a fundamental inference principle within large language models.
Furthermore, we divided the datasets by year and redefined the vectors representing known relations to predict unknown relations. The experimental results presented in Fig. 3 demonstrate that our approach can predict unknown future relations to some extent. The experiments conducted with year-split datasets are a novel attempt not found in existing research, and confirmed the effectiveness of predicting unknown future events.
Biological information intrinsic to the embedding space
In Table 4, predicted genes in the ErbB signaling pathway are closely related to the genes or symptoms that have direct molecular interactions with the target genes, or alternative therapeutic targets for the target disease. This may suggest that the high-dimensional representations of genes and drugs calculated from biomedical texts not only capture the simple drug–gene relations we aimed to predict in this study but also integrate broader, higher-order information.
Limitations
This study has several limitations.
First, we focus exclusively on drug–gene relations. Consequently, the applicability of this method to other biological relationships remains untested.
Additionally, as shown in Table 3, our approach for predicting genes from drugs by simply adding relation vectors performs worse than KGE models such as TransE in terms of accuracy. While the goal of this study is not to improve performance but to explore the information intrinsic to the embedding space, efforts to improve performance remain an important future direction.
Furthermore, the embeddings used in this study are static embeddings learned with skip-gram, which cannot handle out-of-vocabulary concepts. Assigning a single embedding to a concept that serves different roles depending on the context (e.g., concepts used differently in the past and present) may also be suboptimal.
Future directions
Building on the limitations of this study, several future directions can be considered.
First, since BioConceptVec includes concepts such as diseases and mutations, it would be useful to investigate whether analogy computations using relation vectors can also be applied to relationships beyond drugs and genes. This could serve as a valuable follow-up study.
Additionally, using dynamic embeddings instead of static embeddings represents another promising direction. Since models like BERT4 compute embeddings based on context, taking advantage of this feature may allow more accurate modeling of drug–gene relations. For example, this approach could help capture relations that vary with context, as well as drugs whose applications evolve over time.
Furthermore, it would be interesting to investigate how the properties of drugs or genes affect performance in analogy tasks. For example, if the drug–gene relations of a drug have been extensively studied, can related genes for that drug be more easily predicted? To address this question, we conducted a simple experiment to examine the correlation between the size of the answer set for each drug and the rank of the search results predicted by adding relation vectors. Scatter plots of these values for settings G, P1, and P2, using BioConceptVec and our embeddings, are shown in Fig. S5 in Supplementary Information 2.3. Mathematically, a larger answer set is expected to make prediction easier, resulting in search result ranks closer to 1. The results showed a slight negative correlation in setting G, as expected (BioConceptVec: \(-0.201\), our embeddings: \(-0.196\)). However, in settings P1 and P2, little to no correlation was observed (BioConceptVec: 0.054 and \(-0.102\), our embeddings: 0.079 and \(-0.041\)). This suggests that in settings P1 and P2, incorporating pathway information into relation vectors helps maintain good search results even when the answer sets are small. By performing such detailed analyses, we can further deepen our understanding of the biological knowledge intrinsic to the embedding space.
Conclusions
In this study, we used embeddings learned from biological texts and performed analogy tasks to predict drug–gene relations. We defined vectors representing these relations and showed that these vectors can accurately predict the target genes for given drugs. Additionally, we categorized drugs and genes based on biological pathways and defined vectors representing drug–gene relations for each pathway. Analogy computations with these vectors showed performance improvement. Our analogy computations demonstrated performance comparable to analogy tasks in other fields and predictions by state-of-the-art large language models, reinforcing the effectiveness of relationship prediction through simple vector addition. Moreover, the experiments with year-split datasets demonstrated that it is possible to predict unknown future relations. Not only were the predictions highly accurate, but the top genes predicted from drugs by our analogy computations were also confirmed to be reasonable from the perspective of biological expertise.
Data availability
The datasets generated and analysed during the current study are available in the GitHub repository https://github.com/shimo-lab/Drug-Gene-Analogy. External resources analysed in this study are publicly available from the following locations: ASURATDB, https://github.com/keita-iida/ASURATDB; BioConceptVec, https://github.com/ncbi/BioConceptVec; drug and gene sets for each pathway (e.g., the ErbB signalling pathway, hsa04012), https://rest.kegg.jp/get/hsa04012; KEGG pathway list, https://rest.kegg.jp/list/pathway/hsa; PubMed abstracts corpus, https://ftp.ncbi.nlm.nih.gov/pubmed.
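For convenience, the snippet below sketches how the KEGG resources listed above can be retrieved programmatically, assuming the third-party requests library; the parsing shown is illustrative and does not reproduce the preprocessing used in our repository.

```python
# Illustrative retrieval of the KEGG resources listed above.
import requests

# Human pathway list: tab-separated lines of "pathway ID<TAB>pathway name".
pathways = requests.get("https://rest.kegg.jp/list/pathway/hsa", timeout=30)
for line in pathways.text.strip().split("\n")[:5]:
    pathway_id, name = line.split("\t")
    print(pathway_id, name)

# Flat-file entry for a single pathway (ErbB signalling pathway, hsa04012).
entry = requests.get("https://rest.kegg.jp/get/hsa04012", timeout=30)
print(entry.text[:500])
```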
Code availability
Our code is available at https://github.com/shimo-lab/Drug-Gene-Analogy.
References
Vaswani, A. et al. Attention is all you need. In Guyon, I. et al. (eds.) Advances in Neural Information Processing Systems, vol. 30 (Curran Associates, Inc., 2017).
Raffel, C. et al. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 21(140), 1–67 (2020).
Socher, R. et al. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, EMNLP 2013, 18-21 October 2013, Grand Hyatt Seattle, Seattle, Washington, USA, A meeting of SIGDAT, a Special Interest Group of the ACL, 1631–1642 (ACL, 2013).
Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 4171–4186. https://doi.org/10.18653/v1/N19-1423 (Association for Computational Linguistics, Minneapolis, Minnesota, 2019).
Cer, D., Diab, M., Agirre, E., Lopez-Gazpio, I. & Specia, L. SemEval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), 1–14. https://doi.org/10.18653/v1/S17-2001 (Association for Computational Linguistics, Vancouver, Canada, 2017).
Mu, J. & Viswanath, P. All-but-the-top: Simple and effective postprocessing for word representations. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings (OpenReview.net, 2018).
Mikolov, T., Chen, K., Corrado, G. & Dean, J. Efficient estimation of word representations in vector space. In Bengio, Y. & LeCun, Y. (eds.) 1st International Conference on Learning Representations, ICLR 2013, Scottsdale, Arizona, USA, May 2-4, 2013, Workshop Track Proceedings (2013).
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S. & Dean, J. Distributed representations of words and phrases and their compositionality. In Burges, C. J. C., Bottou, L., Ghahramani, Z. & Weinberger, K. Q. (eds.) Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013. Proceedings of a meeting held December 5-8, 2013, Lake Tahoe, Nevada, United States, 3111–3119 (2013).
Allen, C. & Hospedales, T. Analogies explained: Towards understanding word embeddings. In Chaudhuri, K. & Salakhutdinov, R. (eds.) Proceedings of the 36th International Conference on Machine Learning, vol. 97 of Proceedings of Machine Learning Research, 223–231 (PMLR, 2019).
Lee, J. et al. Biobert: A pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 36, 1234–1240. https://doi.org/10.1093/bioinformatics/btz682 (2020).
Giorgi, J., Bader, G. & Wang, B. A sequence-to-sequence approach for document-level relation extraction. In Proceedings of the 21st Workshop on Biomedical Language Processing, 10–25. https://doi.org/10.18653/v1/2022.bionlp-1.2 (Association for Computational Linguistics, Dublin, Ireland, 2022).
Luo, R. et al. Biogpt: Generative pre-trained transformer for biomedical text generation and mining. Brief. Bioinform. https://doi.org/10.1093/bib/bbac409 (2022).
Chen, Q. et al. Bioconceptvec: Creating and evaluating literature-based biomedical concept embeddings on a large scale. PLoS Comput. Biol. https://doi.org/10.1371/journal.pcbi.1007617 (2020).
Wei, C., Kao, H. & Lu, Z. Pubtator: A web-based text mining tool for assisting biocuration. Nucleic Acids Res. 41, W518–W522. https://doi.org/10.1093/nar/gkt441 (2013).
Sachdev, K. & Gupta, M. K. A comprehensive review of feature based methods for drug target interaction prediction. J. Biomed. Inform. 93, 103159 (2019).
Djeddi, W. E., Hermi, K., Ben Yahia, S. & Diallo, G. Advancing drug-target interaction prediction: A comprehensive graph-based approach integrating knowledge graph embedding and protbert pretraining. BMC Bioinform. 24, 488 (2023).
Kanehisa, M. & Goto, S. KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 28, 27–30. https://doi.org/10.1093/nar/28.1.27 (2000).
Kanehisa, M. Toward understanding the origin and evolution of cellular organisms. Protein Sci. 28, 1947–1951. https://doi.org/10.1002/pro.3715 (2019).
Kanehisa, M., Furumichi, M., Sato, Y., Matsuura, Y. & Ishiguro-Watanabe, M. KEGG: biological systems database as a model of the real world. Nucleic Acids Res. 53, D672–D677. https://doi.org/10.1093/nar/gkae909 (2025).
Tshitoyan, V. et al. Unsupervised word embeddings capture latent knowledge from materials science literature. Nature 571, 95–98. https://doi.org/10.1038/s41586-019-1335-8 (2019).
Shtar, G., Greenstein-Messica, A., Mazuz, E., Rokach, L. & Shapira, B. Predicting drug characteristics using biomedical text embedding. BMC Bioinform. 23, 526. https://doi.org/10.1186/s12859-022-05083-1 (2022).
Alachram, H., Chereda, H., Beißbarth, T., Wingender, E. & Stegmaier, P. Text mining-based word representations for biomedical data analysis and protein–protein interaction networks in machine learning tasks. PLoS One 16, 1–20. https://doi.org/10.1371/journal.pone.0258623 (2021).
Liu, S., Tang, B., Chen, Q. & Wang, X. Drug–drug interaction extraction via convolutional neural networks. Comput. Math. Methods Med. 2016, 6918381. https://doi.org/10.1155/2016/6918381 (2016).
Sahu, S. K. & Anand, A. Drug–drug interaction extraction from biomedical texts using long short-term memory network. J. Biomed. Inform. 86, 15–24 (2018).
Jiang, Z., Li, L. & Huang, D. A general protein–protein interaction extraction architecture based on word representation and feature selection. Int. J. Data Min. Bioinform. 14, 276–291. https://doi.org/10.1504/IJDMB.2016.074878 (2016).
Quan, C., Luo, Z. & Wang, S. A hybrid deep learning model for protein-protein interactions extraction from biomedical literature. Appl. Sci. https://doi.org/10.3390/app10082690 (2020).
Zhang, Y. et al. A hybrid model based on neural networks for biomedical relation extraction. J. Biomed. Inform. 81, 83–92. https://doi.org/10.1016/j.jbi.2018.03.011 (2018).
Müller, H.-M., Kenny, E. E. & Sternberg, P. W. Textpresso: An ontology-based information retrieval and extraction system for biological literature. PLOS Biol. https://doi.org/10.1371/journal.pbio.0020309 (2004).
Friedman, C., Kra, P., Yu, H., Krauthammer, M. & Rzhetsky, A. GENIES: A natural-language processing system for the extraction of molecular pathways from journal articles. Bioinformatics 17(Suppl. 1), S74–S82 (2001).
Yeganova, L. et al. Better synonyms for enriching biomedical search. J. Am. Med. Inform. Assoc. 27, 1894–1902. https://doi.org/10.1093/jamia/ocaa151 (2020).
Du, J. et al. Gene2vec: Distributed representation of genes based on co-expression. BMC Genom. 20, 82. https://doi.org/10.1186/s12864-018-5370-x (2019).
Peng, Y., Yan, S. & Lu, Z. Transfer learning in biomedical natural language processing: An evaluation of BERT and ELMo on ten benchmarking datasets. In Proceedings of the 18th BioNLP Workshop and Shared Task, 58–65. https://doi.org/10.18653/v1/W19-5006 (Association for Computational Linguistics, Florence, Italy, 2019).
Fang, L., Chen, Q., Wei, C.-H., Lu, Z. & Wang, K. Bioformer: An efficient transformer language model for biomedical text mining. Preprint at https://arxiv.org/abs/2302.01588 (2023).
Kutuzov, A., Velldal, E. & Øvrelid, L. One-to-X analogical reasoning on word embeddings: A case for diachronic armed conflict prediction from news texts. In Tahmasebi, N., Borin, L., Jatowt, A. & Xu, Y. (eds.) Proceedings of the 1st International Workshop on Computational Approaches to Historical Language Change, 196–201. https://doi.org/10.18653/v1/W19-4724 (Association for Computational Linguistics, Florence, Italy, 2019).
Pennington, J., Socher, R. & Manning, C. D. Glove: Global vectors for word representation. In Moschitti, A., Pang, B. & Daelemans, W. (eds.) Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of the ACL, 1532–1543. https://doi.org/10.3115/v1/d14-1162 (ACL, 2014).
Bojanowski, P., Grave, E., Joulin, A. & Mikolov, T. Enriching word vectors with subword information. Trans. Assoc. Comput. Linguist. 5, 135–146. https://doi.org/10.1162/tacl_a_00051 (2017).
Bird, S. & Loper, E. NLTK: The natural language toolkit. In Proceedings of the ACL Interactive Poster and Demonstration Sessions, 214–217 (Association for Computational Linguistics, Barcelona, Spain, 2004).
Iida, K., Kondo, J., Wibisana, J. N., Inoue, M. & Okada, M. ASURAT: Functional annotation-driven unsupervised clustering of single-cell transcriptomes. Bioinformatics 38, 4330–4336. https://doi.org/10.1093/bioinformatics/btac541 (2022).
Ali, M. et al. Bringing light into the dark: A large-scale evaluation of knowledge graph embedding models under a unified framework. IEEE Trans. Pattern Anal. Mach. Intell. 44, 8825–8845. https://doi.org/10.1109/TPAMI.2021.3124805 (2022).
Bordes, A., Usunier, N., García-Durán, A., Weston, J. & Yakhnenko, O. Translating embeddings for modeling multi-relational data. In Burges, C. J. C., Bottou, L., Ghahramani, Z. & Weinberger, K. Q. (eds.) Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013. Proceedings of a meeting held December 5-8, 2013, Lake Tahoe, Nevada, United States 2787–2795 (2013).
Mikolov, T., Yih, W. & Zweig, G. Linguistic regularities in continuous space word representations. In Vanderwende, L., Daumé III, H. & Kirchhoff, K. (eds.) Human Language Technologies: Conference of the North American Chapter of the Association of Computational Linguistics, Proceedings, June 9-14, 2013, Westin Peachtree Plaza Hotel, Atlanta, Georgia, USA, 746–751 (The Association for Computational Linguistics, 2013).
Brown, T. B. et al. Language models are few-shot learners. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M. & Lin, H. (eds.) Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual (2020).
Ouyang, L. et al. Training language models to follow instructions with human feedback. In Koyejo, S. et al. (eds.) Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022 (2022).
OpenAI. GPT-4 technical report. Preprint at https://doi.org/10.48550/arXiv.2303.08774 (2023).
Siveen, K. S. et al. Role of non receptor tyrosine kinases in hematological malignances and its targeting by natural products. Mol. Cancer 17, 31. https://doi.org/10.1186/s12943-018-0788-y (2018).
Maruyama, T., Nara, K., Yoshikawa, H. & Suzuki, N. Txk, a member of the non-receptor tyrosine kinase of the Tec family, forms a complex with poly(ADP-ribose) polymerase 1 and elongation factor 1\(\alpha\) and regulates interferon-\(\gamma\) gene transcription in Th1 cells. Clin. Exp. Immunol. 147, 164–175. https://doi.org/10.1111/j.1365-2249.2006.03249.x (2007).
Hu, X., Li, J., Fu, M., Zhao, X. & Wang, W. The JAK/STAT signaling pathway: From bench to clinic. Signal Transduct. Target. Ther. 6, 1–33. https://doi.org/10.1038/s41392-021-00791-1 (2021).
Tappel, A. L., Lundberg, W. O. & Boyer, P. D. Effect of temperature and antioxidants upon the lipoxidase-catalyzed oxidation of sodium linoleate. Arch. Biochem. Biophys. 42, 293–304. https://doi.org/10.1016/0003-9861(53)90359-2 (1953).
Callen, J. P., Bickers, D. R. & Moy, R. L. Actinic keratoses. J. Am. Acad. Dermatol. 36, 650–653. https://doi.org/10.1016/S0190-9622(97)70265-2 (1997).
Park, W.-S. et al. P53 mutations in solar keratoses. Hum. Pathol. 27, 1180–1184. https://doi.org/10.1016/S0046-8177(96)90312-3 (1996).
Brash, D. E. Roles of the transcription factor p53 in keratinocyte carcinomas. Br. J. Dermatol. 154, 8–10. https://doi.org/10.1111/j.1365-2133.2006.07230.x (2006).
Gilbert, B. et al. 5-Lipoxygenase is a direct p53 target gene in humans. Biochim. Biophys. Acta (BBA) Gene Regul. Mech. 1849, 1003–1016. https://doi.org/10.1016/j.bbagrm.2015.06.004 (2015).
Elamin, Y. Y. et al. Poziotinib for EGFR exon 20-mutant NSCLC: Clinical efficacy, resistance mechanisms, and impact of insertion location on drug sensitivity. Cancer Cell 40, 754-767.e6. https://doi.org/10.1016/j.ccell.2022.06.006 (2022).
Davies, K. D. et al. Identifying and targeting ROS1 gene fusions in non-small cell lung cancer. Clin. Cancer Res. 18, 4570–4579. https://doi.org/10.1158/1078-0432.CCR-12-0550 (2012).
Sasaki, T., Rodig, S. J., Chirieac, L. R. & Jänne, P. A. The biology and treatment of EML4-ALK non-small cell lung cancer. Eur. J. Cancer 46, 1773–1780. https://doi.org/10.1016/j.ejca.2010.04.002 (2010).
US Food and Drug Administration. FDA approves dabrafenib plus trametinib for adjuvant treatment of melanoma with BRAF V600E or V600K mutations (2018).
Patel, S. P. & Kim, K. B. Selumetinib (AZD6244; ARRY-142886) in the treatment of metastatic melanoma. Expert Opin. Investig. Drugs 21, 531–539. https://doi.org/10.1517/13543784.2012.665871 (2012).
Tolcher, A. W. et al. A phase I dose-escalation study of oral MK-2206 (allosteric AKT inhibitor) with oral selumetinib (AZD6244; MEK inhibitor) in patients with advanced or metastatic solid tumors. J. Clin. Oncol. 29, 3004–3004. https://doi.org/10.1200/jco.2011.29.15_suppl.3004 (2011).
Acknowledgements
This study was partially supported by JSPS KAKENHI Grant Numbers 22H05106 and 23H03355, and by JST CREST Grant Number JPMJCR21N3.
Author information
Contributions
H.Y., R.H., K.A., Ma.O. and H.S. conceived the experiments; H.Y., R.H., K.A., Y.Z. and K.M. conducted the experiments; H.Y., R.H., K.A., K.M., S.S., Y.Z., Ma.O. and H.S. analysed the results; R.H. and Mo.O. surveyed existing research. All authors reviewed the manuscript.
Ethics declarations
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Cite this article
Yamagiwa, H., Hashimoto, R., Arakane, K. et al. Predicting drug–gene relations via analogy tasks with word embeddings. Sci Rep 15, 17240 (2025). https://doi.org/10.1038/s41598-025-01418-z