Introduction

Recently, researchers have introduced knowledge graphs (KGs) into recommendation systems to improve performance. These graphs contain rich semantic and structural information, which contributes to more accurate, diversified and interpretable recommendations. This approach is called knowledge graph-aware recommendation (KGR), which strengthens the representations of users and items by encoding additional item-level semantic associations1. Given the natural ability of graph neural networks (GNNs) to encode collaborative signals and to extract multi-hop relationships, using GNNs for KGR has become one of the mainstream modeling frameworks for recommendation systems2,3. These models propagate item features over the KG and use a GNN to model the KG in support of the recommendation task. The performance of a KGR model depends largely on the quality of the KG. However, real-world KGs are often incomplete and suffer from a serious long-tail distribution, in which a few nodes hold a large number of edges, much as a handful of celebrities on a social network have far more followers than most ordinary users. This imbalance4 produces a large number of super-nodes, which leads to sparse supervision signals and prevents the model from accurately capturing user preferences.

Contrastive learning5 learns more robust and discriminative features by comparing different representations of the same data, and has achieved remarkable results in computer vision and other fields. These results demonstrate its potential for handling sparse and unbalanced data. When contrastive learning is introduced into recommendation systems, comparing various representations of items allows the model to uncover deeper semantic associations and thus attend effectively to long-tail nodes. Graph contrastive learning rests on two core components: data enhancement and contrastive learning. Specifically, data enhancement creates multiple views of each node by generating multiple variants for it, while contrastive learning focuses on the consistency between these views to promote more robust feature learning. Many researchers currently focus on graph data enhancement. For instance, references6,7 proposed structure-based graph enhancement techniques, and reference8 studied feature-based methods. These methods have been shown to significantly improve model performance. Although models based on graph contrastive learning have made progress in recommendation systems, they face the following challenges:

1) Most studies focus on data enhancement techniques and pay less attention to improving contrastive learning itself, which limits the potential of the model.

2) Current contrastive learning usually compares only different views of the same node and ignores the relationships between users and items. This fails to fully exploit user-item relationships and may reduce the accuracy and personalization of recommendations.

To solve the above problems, this paper proposes a recommendation algorithm framework, namely the Multi-level-Knowledge Dynamic Graph Attention Network Momentum Contrast (ML-KDGATMoco). This framework integrates the KG and the user-item interactions into a unified graph structure (as shown in Fig. 1) and handles it in the following ways:

Fig. 1 Collaborative KG.

1) Combining the three types of nodes in the graph structure (users, items and knowledge entities) and their relationships, a three-level contrastive learning task is constructed, with random edge dropout serving as an auxiliary self-supervision signal to strengthen model performance.

2) A GAT is used for deep modeling of the graph structure, and the KG representation is further optimized through a dynamic attention mechanism and a nonlinear transformation.

3) In the contrastive learning stage, a momentum encoder and its updating strategy are introduced to construct high-quality negative samples, thereby improving the effect of contrastive learning.

4) Multi-level contrastive learning is used as an auxiliary task and jointly trained with the primary supervised task of knowledge-aware recommendation to effectively alleviate data sparsity.

The main contributions of this paper are as follows:

1) A three-level contrastive learning strategy is designed, covering user-user, item-item and user-interactive item contrastive learning. It is realized through a nonlinear transformation and a momentum-encoder-based negative sample construction strategy.

2) A dynamic attention mechanism is introduced into the GNN architecture, which effectively improves the modeling of the KG and enhances the overall performance of the recommendation system.

3) Contrastive learning is used as an auxiliary self-supervised task and trained jointly with the knowledge-aware recommendation task. This strategy promotes the representation learning of the model and therefore improves performance on the recommendation task.

Related work

Recommendation based on KG

Integrating KGs into the recommendation system as auxiliary information can improve the accuracy of recommendations and provide additional interpretability for the results. Research on KG-based recommendation follows three main approaches. The first is the embedding-based method, which relies on Knowledge Graph Embedding (KGE) techniques: for example, the CFKG model9 builds on TransE10, while the CKE model11 builds on TransR12. These methods incorporate the KG by generating item embeddings and then combining them with the user-item interaction data. The second is the path-based method. The KG and the user-item interaction data together form a heterogeneous graph, and researchers have proposed meta-path methods13 to mine such graphs in depth. For instance, Xiting Wang et al.14 utilized multi-level reinforcement learning for recommendation, and S. Guo et al.15 applied similar techniques to disease diagnosis, extending path-based strategies beyond standard recommendation. The third adopts GNN technology. These methods propagate information recursively across multi-hop nodes, thereby capturing long-range relational structure. For example, the knowledge graph attention network (KGAT) model3 combines the KG with the user-item graph and realizes end-to-end modeling of the high-order relationships between users and items. Building upon these GNN-based approaches, recent sophisticated models like TGCL4SR16 and M3KGR17 have emerged, focusing on the temporal and multi-modal aspects of knowledge graphs to improve sequential recommendation.

The integration of KGs’ semantics with GNNs’ insights sets a path for developing recommendation systems that are accurate and aligned with users’ detailed preferences.

Application of contrastive learning in recommendation systems

In traditional recommendation systems, methods such as collaborative filtering18 and matrix decomposition19 perform robustly when data are ample. However, their efficacy diminishes notably when data are sparse.

Contrastive learning, as a novel paradigm, has garnered considerable attention for its exceptional ability to learn under few-shot conditions20,21. This approach sharpens feature representations by minimizing the distance between positive pairs and maximizing the distance between negative pairs in the feature space. Recent progress includes the KAUR model, which improves efficiency by directly contrasting node embeddings from various graph views22. Additionally, the KGCL4 and KGCCL23 frameworks have been proposed for sequential recommendation; they leverage knowledge graph embeddings to better capture hierarchical structures in user behavior sequences, thereby enhancing the effectiveness of contrastive learning in recommendation systems. Furthermore, research on SGL-WA illustrates the advantages of minimal graph augmentations in optimizing recommendation systems24.

In their pioneering work, Oord et al. introduced the Contrastive Predictive Coding (CPC) framework25, which demonstrated the efficacy of feature learning via mutual information maximization and provided a robust theoretical grounding for contrastive learning. Building on the principles of CPC, this study designs three levels of contrastive learning as follows:

(1) At the user-user level, analyzing behavioral similarities among users through contrastive learning over their historical activities uncovers latent preference clusters. This approach effectively mitigates data scarcity at the user level.

(2) At the user-item level, examining users' varying preferences for different items enables a deeper understanding of their needs and preferences. Using this approach, the precision of personalized recommendation can be significantly improved.

(3) At the item-item level, exploring and contrasting the interconnectedness among items, such as recognizing similar products, not only addresses item-level sparsity but also increases the content diversity of the recommendations.

By incorporating contrastive learning as a crucial element in the training process, this study further refines the quality of embeddings. This approach enhances the embeddings’ discriminative prowess, which increases the accuracy and stability of the recommendation system, and strengthens its aptitude to handle sparse and varied data.

Method

Task definition

Before introducing the ML-KDGATMoco model, the following concepts and symbols are defined first.

User-Item Graph: The user-item graph \({\mathcal{G}}_{1}\) records the interactions between users and items and is represented by \(\{(u,{y}_{ui},i)|u\in \mathcal{U},i\in \mathcal{I}\}\), where \(\mathcal{U}\) is the user set, \(\mathcal{I}\) is the item set and \({y}_{ui}\) indicates the interaction between user \(u\) and item \(i\), defined as follows:

$${y}_{ui}=\begin{cases}1, & \text{if interaction } (u,i) \text{ is observed;}\\ 0, & \text{otherwise.}\end{cases}$$
(1)

KG: The KG provides auxiliary information about items and is denoted \({\mathcal{G}}_{2}\). It consists of a large number of entity-relation-entity triplets \((h,r,t)\), where \(h\in \mathcal{E}, r\in \mathcal{R}, t\in \mathcal{E}\) denote the head entity, relation and tail entity of the triplet, respectively.

CKG: The CKG is the combination of the user-item graph and the KG, encoding both user behavior and item knowledge. Each user interaction is represented as a \((u,Interact,i)\) triplet, where an additional connection between the user \(u\) and the item \(i\) is added whenever \({y}_{ui}=1\). According to the item-entity alignment matrix, the user-item graph can be seamlessly integrated with the KG into a unified graph \(\mathcal{G}=\{(h,r,t)|h,t\in {\mathcal{E}}^{\prime},r\in {\mathcal{R}}^{\prime}\}\), where \({\mathcal{E}}^{\prime}=\mathcal{E}\cup \mathcal{U}\) and \({\mathcal{R}}^{\prime}=\mathcal{R}\cup \{Interact\}\).

Task definition: For a given collaborative KG \(\mathcal{G}\) composed of \({\mathcal{G}}_{1}\) and \({\mathcal{G}}_{2}\), the recommendation task is to predict whether a user \(u\) has a potential interest in an item \(i\) with which it has not interacted. The task constructs a prediction function \({\widehat{y}}_{ui}=F(u,i;\Theta )\), where \({\widehat{y}}_{ui}\) is the probability that user \(u\) is interested in item \(i\) and \(\Theta\) denotes the model parameters.

Model description

In this study, the ML-KDGATMoco model is designed under the framework of self-supervised graph learning. Its primary task is a Knowledge-aware Dynamic Graph Attention Network (KDGAT), and self-supervised learning is used as an auxiliary task to enhance model performance. The overall framework of the model is shown in Fig. 2. The self-supervised task generates two groups of variant subgraphs by randomly dropping some of the CKG's edges, and then performs contrastive learning at three levels: user-user interactions, item-item interactions and user-item interactions.

Fig. 2 Model framework of ML-KDGATMoco.

The contrastive learning model is mainly composed of the following five parts.

(1) Data enhancement layer: This layer enhances the generalization ability of the model by randomly dropping some edges of the CKG, generating two groups of slightly different subgraphs.

(2) Embedding layer: The model parameterizes each node (i.e. user or item) as an embedding vector while retaining the structural information of the CKG subgraph.

(3) Dynamic attention embedding dissemination layer: This layer recursively propagates embeddings from a node's neighbors to update its representation. To capture different aspects of knowledge and weigh the importance of neighbors, the model adopts a knowledge-aware dynamic attention mechanism.

(4) Nonlinear transformation layer: After aggregation in the dynamic attention embedding dissemination layer, the model further extracts features and enhances expressiveness by passing the user and item representations through a nonlinear transformation layer.

(5) Contrastive learning layer: The model performs contrastive learning at three levels and learns to distinguish positive from negative sample pairs by minimizing the contrastive loss. This loss is combined with the other losses to jointly optimize the model.

To facilitate the integration of contrastive learning in this graph attention framework, node embeddings produced by GATs are used to establish contrastive pairs. This strategic incorporation plays an important role in sharpening the model’s focus on the discriminative attributes of nodes. The process involves generating positive pairs from adjacent nodes and negative pairs from non-adjacent nodes, which are dynamically updated through a momentum encoder. This tailored integration effectively boosts the model’s ability to differentiate between similar and dissimilar nodes and significantly enhances the handling of sparse and dynamically evolving datasets.

Data enhancement layer

The data enhancement layer provides training data for subsequent contrastive learning by generating two groups of data-enhanced CKG subgraphs. Unlike in the CV and NLP fields, graph data enhancement must consider the complex connections between users and items. In this paper, the Edge Dropout7 method, a graph data enhancement technique, is adopted. It randomly removes edges from the graph with a certain probability: a mask vector over the edges is defined, and each edge is randomly retained or deleted according to this mask vector. The corresponding formula is:

$${s}_{1}\left(\mathcal{G}\right)=\left({\mathcal{E}}^{\prime},{{\varvec{A}}}_{1}\odot {\mathcal{R}}^{\prime}\right),{s}_{2}\left(\mathcal{G}\right)=\left({\mathcal{E}}^{\prime},{{\varvec{A}}}_{2}\odot {\mathcal{R}}^{\prime}\right)$$
(2)

where \({{\varvec{A}}}_{1},{{\varvec{A}}}_{2}\in \{\text{0,1}{\}}^{|{\mathcal{R}}^{\prime}|}\) are two mask vectors acting on the edge set \({\mathcal{R}}^{\prime}\) of the CKG, so that only some neighbors contribute to each node representation. By deleting some edges of the original CKG, two different subgraphs are generated, each of which captures part of the local structure of the original graph. This data enhancement method not only improves the generalization ability of the model to unknown data but also enhances its robustness to potential noise and outliers.
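For concreteness, the following sketch illustrates Eq. (2) in PyTorch, assuming the CKG edges are stored as a [2, |R'|] index tensor; the function name `edge_dropout` and the toy graph are illustrative rather than taken from the paper's implementation:

```python
import torch

def edge_dropout(edge_index: torch.Tensor, keep_prob: float) -> torch.Tensor:
    """Keep each edge independently with probability keep_prob (one mask vector A in Eq. 2)."""
    mask = torch.rand(edge_index.size(1)) < keep_prob  # A ∈ {0,1}^{|R'|}
    return edge_index[:, mask]                          # A ⊙ R'

# Toy CKG with 5 edges; two independently masked subgraphs s1(G) and s2(G)
# serve as the two contrastive views.
edge_index = torch.tensor([[0, 0, 1, 2, 3],
                           [1, 2, 3, 4, 4]])
s1 = edge_dropout(edge_index, keep_prob=0.8)  # dropout rate ρ = 0.2
s2 = edge_dropout(edge_index, keep_prob=0.8)
```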

Embedded layer

KGE is a technique that maps the entities and relations of a KG into vector representations while preserving their semantic relationships and the graph structure in the vector space. For the CKG, this paper chooses the TransR12 method, a translation model that maps entities and relations into different vector spaces. TransR optimizes the vector representations of entities and relations by treating the tail entity as a "translation" of the head entity by the relation vector; the relation thus acts as a "bridge" between entities. Specifically, for a given triplet \((h,r,t)\), TransR first projects the vector \({e}_{h}\) of the head entity \(h\) and the vector \({e}_{t}\) of the tail entity \(t\) into the vector space of the relation \(r\). The model then learns the vector representations of entities and relations by minimizing the "translation" distance between the head and tail entities in that relation space.

Here \({e}_{h}^{r}={{\varvec{W}}}_{r}{e}_{h}\) and \({e}_{t}^{r}={{\varvec{W}}}_{r}{e}_{t}\) denote the projections of \({e}_{h}\) and \({e}_{t}\) into the space of relation \(r\). The scoring function is:

$$g(h,r,t)=||{{\varvec{W}}}_{r}{e}_{h}+{e}_{r}-{{\varvec{W}}}_{r}{e}_{t}|{|}_{2}^{2}$$
(3)

where \({e}_{h},{e}_{t}\in {\mathbb{R}}^{d}\) and \({e}_{r}\in {\mathbb{R}}^{k}\) represent the embeddings of \(h\), \(t\) and \(r\), respectively. The relation transformation matrix \({{\varvec{W}}}_{r}\in {\mathbb{R}}^{k\times d}\) of relation \(r\) linearly transforms entities from the original \(d\)-dimensional entity space into the \(k\)-dimensional relation space. A lower value of \(g(h,r,t)\) indicates that the triplet \((h,r,t)\) is more likely to hold in the KG.

When training the TransR model, both positive and negative examples should be considered to encourage the distinction between them. The loss function is defined as:

$${\mathcal{L}}_{KG} = {\sum }_{(h,r,t,{t}^{\prime})\in T}-\mathit{ln}\sigma \left(g\left(h,r,{t}^{\prime}\right)-g\left(h,r,t\right)\right)$$
(4)

where \(T =\{(h,r,t,{t}^{\prime})|(h,r,t)\in \mathcal{G},(h,r,{t}^{\prime})\notin \mathcal{G}\}\), \((h,r,{t}^{\prime})\) is a negative example obtained by negative sampling, and \(\sigma (\cdot )\) is the sigmoid function. Compared with other KGE methods such as TransE10, the advantage of TransR is that it can better handle the complex relationships between entities and is thus more suitable for the embedding layer of this model. By mapping entities and relations into different vector spaces, TransR improves the model's ability to represent the KG structure and strengthens the regularization of direct relations between entities in the vector representations.
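As a minimal sketch of Eqs. (3) and (4), the following PyTorch code computes the TransR score and the pairwise KG loss; the tensor shapes and function names are assumptions for illustration rather than the paper's actual implementation:

```python
import torch
import torch.nn.functional as F

def transr_score(e_h, e_t, e_r, W_r):
    """Eq. (3): g(h,r,t) = || W_r e_h + e_r - W_r e_t ||_2^2; lower means more plausible."""
    proj_h = e_h @ W_r.T            # project head entity into the relation-r space
    proj_t = e_t @ W_r.T            # project tail entity into the relation-r space
    return ((proj_h + e_r - proj_t) ** 2).sum(dim=-1)

def kg_loss(e_h, e_r, e_t, e_t_neg, W_r):
    """Eq. (4): encourage the positive triple to score lower than the corrupted one."""
    pos = transr_score(e_h, e_t, e_r, W_r)
    neg = transr_score(e_h, e_t_neg, e_r, W_r)
    return -F.logsigmoid(neg - pos).sum()

# Example with batch size 4, entity dimension d = 64, relation dimension k = 32.
d, k = 64, 32
e_h, e_t, e_t_neg = (torch.randn(4, d) for _ in range(3))
e_r, W_r = torch.randn(4, k), torch.randn(k, d)
loss = kg_loss(e_h, e_r, e_t, e_t_neg, W_r)
```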

Dynamic attention embedding dissemination layer

Dynamic attention embedding dissemination layer is an architecture based on the graph convolution network26. It captures the high-order connectivity between entities in the KG by recursively propagating the embedding information of entities. Each layer consists of three core modules: information dissemination, knowledge-aware dynamic attention calculation and information aggregation.

Information dissemination module: An entity can appear in multiple triplets and disseminate information through them. For an entity \(h\), the set of triplets \({\mathcal{N}}_{h} =\{(h,r,t)|(h,r,t)\in \mathcal{G}\}\) in which \(h\) appears as the head entity is defined as the neighborhood of \(h\)27. Each neighbor is connected by a triplet and transmits information through the following linear combination:

$${e}_{{\mathcal{N}}_{h}} = {\sum }_{\left(h,r,t\right)\in {\mathcal{N}}_{h}}\pi \left(h,r,t\right){e}_{t}$$
(5)

where: \(\pi (h,r,t)\) is the attenuation factor that controls the information dissemination on each edge \((h,r,t)\), which represents how much information propagates from \(t\) to \(h\) under the relation \(r\) condition.

Knowledge-aware dynamic attention calculation module: In the traditional GAT, the ranking of attention weights over neighbor nodes is fixed regardless of the query node. This paper adopts the dynamic attention mechanism proposed in reference6, which allows the weight ranking of neighboring nodes to change as the query node changes. To enhance the model's stability and its adaptability to complex data structures, layer normalization and residual connections are also introduced into the dynamic attention mechanism. Before the attention coefficients are calculated, every embedding involved, including the head entity embedding \(e_h\), the tail entity embedding \(e_t\) and the relation vector \(e_r\), is passed through a residual connection and layer normalization:

$$e_x^{\prime} = LayerNorm(e_x + Transform(e_x ))$$
(6)

where \(e_x\) denotes the original embedding vector of node \(x\), which can be the head entity embedding \(e_h\), the tail entity embedding \(e_t\) or the relation vector \(e_r\), and \(e_x^{\prime}\) is its transformed counterpart. The functions \(LayerNorm( \cdot )\) and \(Transform( \cdot )\) correspond to layer normalization and the transformation inside the residual connection, respectively. Layer normalization standardizes the features of each layer to stabilize learning, while the residual connection allows information and gradients to flow through the network, mitigating the vanishing gradient problem. These operations enhance the model's capability to process complex data structures and guarantee robust learning dynamics.

By adopting optimized node and contact vector representations, the attenuation factor formula for calculating the dynamic attention weight is as follows:

$$\pi (h,r,t)={a}^{T}\mathit{LeakyReLU}({\varvec{W}}\cdot [{e}_{h}^{\prime}||{e}_{t}^{\prime}||{e}_{r}^{\prime}])$$
(7)

where LeakyReLU28 is the nonlinear activation function, \(a\) and \({\varvec{W}}\) are learnable parameters, and || denotes vector concatenation. Then, to make these scores comparable across the neighbors of a node, all attenuation factors are normalized by the softmax function to obtain the final dynamic attention weights:

$$\pi (h,r,t)=\frac{\mathit{exp}(\pi (h,r,t))}{{\sum }_{(h,{r}^{\prime},{t}^{\prime})\in {\mathcal{N}}_{h}}\mathit{exp}(\pi (h,{r}^{\prime},{t}^{\prime}))}$$
(8)
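To make Eqs. (6)-(8) concrete, the sketch below normalizes each embedding with a residual transform and layer normalization, scores every triple with LeakyReLU over the concatenated vectors, and softmax-normalizes the scores over the triples sharing the same head entity. The module layout (a single linear `Transform`, the dimensions and the variable names) is an assumption for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicTripleAttention(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.transform = nn.Linear(dim, dim)           # Transform(·) in Eq. (6)
        self.norm = nn.LayerNorm(dim)                  # LayerNorm(·) in Eq. (6)
        self.W = nn.Linear(3 * dim, dim, bias=False)   # W in Eq. (7)
        self.a = nn.Linear(dim, 1, bias=False)         # a^T in Eq. (7)

    def forward(self, e_h, e_r, e_t, head_index):
        # Eq. (6): residual connection + layer normalization for each embedding.
        e_h, e_r, e_t = (self.norm(x + self.transform(x)) for x in (e_h, e_r, e_t))
        # Eq. (7): unnormalized attenuation factor per triple (h, r, t).
        score = self.a(F.leaky_relu(self.W(torch.cat([e_h, e_t, e_r], dim=-1)))).squeeze(-1)
        # Eq. (8): softmax over the triples that share the same head entity.
        exp = (score - score.max()).exp()              # shift by the max for numerical stability
        denom = torch.zeros(int(head_index.max()) + 1).scatter_add_(0, head_index, exp)
        return exp / denom[head_index]                 # π(h, r, t)
```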

Information aggregation module: In the last module, the entity representation \({e}_{h}\) and neighbor vector representation \({e}_{{\mathcal{N}}_{h}}\)are aggregated by using the Bi-Interaction aggregator3. That allows the embedding of the head entity to interact with the embedding of its neighbors and updates the representation of the head entity. The specific definition of \(f(\cdot )\) is as follows:

$${f}_{Bi-Interaction}=\mathit{LeakyReLU}\left({{\varvec{W}}}_{1}({e}_{h}+{e}_{{\mathcal{N}}_{h}} )\right)+\mathit{LeakyReLU}\left({{\varvec{W}}}_{2}({e}_{h}\odot {e}_{{\mathcal{N}}_{h}} )\right)$$
(9)

where \({{\varvec{W}}}_{1},{{\varvec{W}}}_{2}\in {\mathbb{R}}^{{d}^{\prime}\times d}\) are learnable weight matrices, \({d}^{\prime}\) is the transformation size, and \(\odot\) denotes the element-wise product. In this way, each dynamic attention embedding dissemination layer can adjust how information is disseminated according to the neighborhood of the current entity, so the network can flexibly adapt to structural changes in the KG.

By stacking multiple such dissemination layers, information from higher-hop neighbor nodes can be collected. After \(l\) layers of iteration, the representation of the head entity \(h\) is updated to:

$${e}_{h}^{(l)}=f\left({e}_{h}^{(l-1)},{e}_{{\mathcal{N}}_{h}}^{(l-1)}\right)$$
(10)

where \({e}_{{\mathcal{N}}_{h}}^{(l-1)}\) is the neighbor representation of the head entity at layer \(l-1\), which carries the information disseminated from multi-hop neighbor nodes:

$${e}_{{\mathcal{N}}_{h}}^{(l-1)} ={\sum }_{(h,r,t)\in {\mathcal{N}}_{h}}\pi (h,r,t){e}_{t}^{(l-1)}$$
(11)

where: \({e}_{t}^{(l-1)}\) is the representation of the tail entity \(t\) generated by the previous information dissemination step.
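Given the attention weights \(\pi (h,r,t)\) from Eq. (8), one dissemination layer (Eqs. (5), (9)-(11)) can be sketched as follows; the class and argument names are illustrative assumptions, not the paper's code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BiInteractionLayer(nn.Module):
    """One dissemination layer: neighbor aggregation (Eqs. 5/11) + Bi-Interaction update (Eq. 9)."""

    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.W1 = nn.Linear(in_dim, out_dim)   # W1, applied to e_h + e_{N_h}
        self.W2 = nn.Linear(in_dim, out_dim)   # W2, applied to e_h ⊙ e_{N_h}

    def forward(self, e, head_index, tail_index, pi):
        """e: [N, d] entity embeddings from the previous layer; pi: [E] attention weights."""
        # Eqs. (5)/(11): e_{N_h} = Σ_{(h,r,t)} π(h,r,t) · e_t, accumulated per head entity.
        e_nh = torch.zeros_like(e).index_add_(0, head_index, pi.unsqueeze(-1) * e[tail_index])
        # Eq. (9): Bi-Interaction aggregator combining the sum and element-wise product terms.
        return F.leaky_relu(self.W1(e + e_nh)) + F.leaky_relu(self.W2(e * e_nh))  # e^{(l)}, Eq. (10)
```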

Nonlinear transformation layer

After \(L\) aggregation layers, multiple hierarchical representations of the user \(u\) and item \(i\) in the original CKG are obtained, namely \(\{{e}_{u}^{(1)},\cdot \cdot \cdot ,{e}_{u}^{(L)}\}\) and \(\{{e}_{i}^{(1)},\cdot \cdot \cdot ,{e}_{i}^{(L)}\}\), together with the multi-level representations of the user and item nodes \({u}_{1}\), \({i}_{1}\) and \({u}_{2}\), \({i}_{2}\) from the two edge-dropout-enhanced subgraphs \({s}_{1}(\mathcal{G})\) and \({s}_{2}(\mathcal{G})\). To enhance the expressive ability of these representations, the experiment follows the method in reference3 and adopts a layer aggregation mechanism that fuses the user and item representations of every layer into a comprehensive representation through concatenation:

$${e}_{u}^{*}={e}_{u}^{(1)}\parallel \cdot \cdot \cdot \parallel {e}_{u}^{\left(L\right)}, {e }_{i}^{*}={e}_{i}^{(1)}\parallel \cdot \cdot \cdot \parallel {e}_{i}^{(L)}$$
$${e}_{{u}_{1}}^{*}={e}_{{u}_{1}}^{(1)}\parallel \cdot \cdot \cdot \parallel {e}_{{u}_{1}}^{\left(L\right)}, { e}_{{i}_{1}}^{*}={e}_{{i}_{1}}^{(1)}\parallel \cdot \cdot \cdot \parallel {e}_{{i}_{1}}^{(L)}$$
$${e}_{{u}_{2}}^{*}={e}_{{u}_{2}}^{(1)}\parallel \cdot \cdot \cdot \parallel {e}_{{u}_{2}}^{\left(L\right)}, {e}_{{i}_{2}}^{*}={e}_{{i}_{2}}^{(1)}\parallel \cdot \cdot \cdot \parallel {e}_{{i}_{2}}^{(L)}$$
(12)

Given that contrastive learning aims at maximizing the consistency of positive sample pairs, information beneficial to recommendation tasks can be sacrificed in the optimization process. To alleviate this problem, as inspired by SimCLR20, a learnable nonlinear transformation layer \(g(\cdot )\) is introduced before the comprehensive representation is sent to the contrastive learning layer. This transformation layer is realized by a multi-layer perceptron (MLP) composed of a hidden layer with a ReLU activation function:

$${z}_{input}=g({e}_{input}^{*})={{\varvec{W}}}^{(2)}\sigma ({{\varvec{W}}}^{(1)}{e}_{input}^{*})$$
(13)

where \(\sigma (\cdot )\) denotes the ReLU nonlinear activation function and \({e}_{input}^{*}\) with \(input\in \{u,{u}_{1},{u}_{2},i,{i}_{1},{i}_{2}\}\) is the comprehensive user or item representation being transformed. After the \(g(\cdot )\) transformation, \({z}_{u}\), \({z}_{{u}_{1}}\), \({z}_{{u}_{2}}\), \({z}_{i}\), \({z}_{{i}_{1}}\) and \({z}_{{i}_{2}}\) are obtained and used for subsequent contrastive learning.

By introducing the nonlinear transformation layer, the information of the aggregation layer is retained, and the expressive ability of the model before applying the contrastive loss is enhanced, which helps improve the performance of the final recommendation system.
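A minimal sketch of the projection head \(g(\cdot)\) of Eq. (13), applied to the layer-concatenated representation of Eq. (12); the dimensions below (three layers of 64 dimensions) are placeholders rather than the paper's reported configuration:

```python
import torch
import torch.nn as nn

class ProjectionHead(nn.Module):
    """g(e*) = W2 · ReLU(W1 · e*), Eq. (13)."""

    def __init__(self, in_dim: int, hidden_dim: int, out_dim: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden_dim, bias=False),   # W1
            nn.ReLU(),                                    # σ(·)
            nn.Linear(hidden_dim, out_dim, bias=False),   # W2
        )

    def forward(self, e_star: torch.Tensor) -> torch.Tensor:
        return self.mlp(e_star)

# e_star is the concatenation e^(1) || ... || e^(L) from Eq. (12);
# here L = 3 layers of 64 dimensions each is assumed for illustration.
g = ProjectionHead(in_dim=3 * 64, hidden_dim=64, out_dim=64)
z_u = g(torch.randn(8, 3 * 64))   # projected representations z fed to contrastive learning
```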

Contrastive learning layer

Contrastive learning relies on a large number of negative samples to generate high-quality representations. For example, SimCLR20 proposed an end-to-end model trained with very large batches (up to 4096). However, an overly large batch consumes substantial memory and GPU memory, because every additional sample noticeably increases the computation and storage cost, which is not conducive to optimizing the training process. To address this, Memory Bank29 maintains an independent dictionary that stores the representations of all samples as negative samples and updates them periodically. Nevertheless, the infrequent updates of the representations in the Memory Bank can make the stored vectors inconsistent and affect the uniformity of the model. For this reason, MoCo21 adopts a contrastive learning strategy with a dynamic queue and a momentum encoder: the parameters of the momentum encoder follow those of the query encoder but are updated slowly to stay consistent, and the queue operates in "first in, first out" mode, which ensures the continuity and consistency of the dictionary. Based on this, MoCo is introduced here to construct high-quality negative samples and enhance the model's feature learning ability.

In this study, a novel contrastive learning framework designed to operate across three distinct levels, including user-user, item-item, and user-interactive item, is introduced. Each level targets specific aspects of recommendation systems to refine and enhance predictive accuracy and personalization capabilities. Taking the user-to-user contrast as an example, the KDGATMoCo algorithm is developed, which effectively captures the differences between user characteristics through the Momentum Encoder. The implementation details of KDGATMoCo are shown in Algorithm 1.

Algorithm 1 KDGATMoCo.

First, KDGATMoCo initializes two encoders, kdgat_q and kdgat_k, to generate the representations of queries and keys, respectively. The key encoder is updated in step with the query encoder through the MoCo momentum mechanism, with the momentum coefficient set to 0.999 by default to ensure the consistency and stability of the key representations. The momentum update is expressed as:

$${\theta }_{k}\leftarrow m{\theta }_{k}+\left(1-m\right){\theta }_{q}$$
(14)

where \({\theta }_{q}\) is the parameter of the query encoder, \({\theta }_{k}\) is the parameter of the key encoder, and \(m\in [\text{0,1})\) is the momentum update coefficient.
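Equation (14) is the standard MoCo momentum update. A minimal PyTorch sketch follows (the encoder objects and the function name are placeholders):

```python
import torch

@torch.no_grad()
def momentum_update(encoder_q, encoder_k, m: float = 0.999):
    """Eq. (14): θ_k ← m·θ_k + (1 − m)·θ_q, so the key encoder drifts slowly toward the query encoder."""
    for p_q, p_k in zip(encoder_q.parameters(), encoder_k.parameters()):
        p_k.data.mul_(m).add_(p_q.data, alpha=1 - m)
```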

In the case of user-user contrastive learning, the representations of user nodes are utilized as queries, and a group of user representations are taken as keys in the dictionary. The model is optimized by the InfoNCE30 loss function, a pivotal choice for its ability to modulate the similarities between sample representations via a finely tuned temperature hyperparameter \(T\). InfoNCE loss is defined as:

$${\mathcal{L}}_{q} = -\mathit{log}\frac{\mathit{exp}(q\cdot {k}_{+}/T)}{{\sum }_{i=0}^{K}\mathit{exp}(q\cdot {k}_{i}/T)}$$
(15)

where \(K\) is the number of negative samples, \({k}_{+}\) is the key that matches the query \(q\), and \(T\) is the temperature hyperparameter.
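The InfoNCE loss of Eq. (15) can be sketched as below for a batch of queries, their matching keys from the momentum encoder, and a queue of \(K\) negative keys. The L2 normalization of representations follows common MoCo practice and is an assumption here, as are the tensor names:

```python
import torch
import torch.nn.functional as F

def info_nce(q, k_pos, queue, T: float = 1.0):
    """Eq. (15): -log exp(q·k+ / T) / Σ_i exp(q·k_i / T).

    q:      [B, d] query representations (e.g. user embeddings from kdgat_q)
    k_pos:  [B, d] matching key representations from the momentum encoder kdgat_k
    queue:  [K, d] negative keys stored in the MoCo dictionary queue
    """
    q, k_pos, queue = (F.normalize(x, dim=-1) for x in (q, k_pos, queue))
    l_pos = (q * k_pos).sum(dim=-1, keepdim=True)        # [B, 1] positive logits
    l_neg = q @ queue.T                                   # [B, K] negative logits
    logits = torch.cat([l_pos, l_neg], dim=1) / T
    labels = torch.zeros(q.size(0), dtype=torch.long)     # the positive key sits at index 0
    return F.cross_entropy(logits, labels)                # cross-entropy form of InfoNCE
```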

The total loss of user-user contrastive learning is calculated as:

$${\mathcal{L}}_{cl}^{u-u} = {\sum }_{j=0}^{K}{\mathcal{L}}_{{q}_{j}}$$
(16)

In the same way, the loss of item-item contrastive learning \({\mathcal{L}}_{cl}^{i-i}\) and the loss of user-interactive item contrastive learning \({\mathcal{L}}_{cl}^{u-i}\) can be calculated. Finally, the total loss of the model is formed by adding up the contrastive learning losses of these three levels to guide the training of the model:

$${\mathcal{L}}_{mlcl} = {\mathcal{L}}_{cl}^{u-u}+{\mathcal{L}}_{cl}^{i-i}+{\mathcal{L}}_{cl}^{u-i}$$
(17)

Therefore, KDGATMoCo can effectively employ the user and item information in the KG, and improve the accuracy and efficiency of the recommendation system.

Joint training loss function

To optimize the ML-KDGATMoCo model, a joint loss function is adopted. The function combines the main loss of the recommendation model, the KG training loss of the embedded layer, and three groups of contrastive learning losses. The loss function of joint optimization can be expressed as:

$${\mathcal{L}}_{joint} = {{\lambda }_{1}\mathcal{L}}_{rec}+{{\lambda }_{2}\mathcal{L}}_{KG}+{\lambda }_{3}{\mathcal{L}}_{mlcl}+{\lambda }_{4}\parallel \Theta {\parallel }_{2}^{2}$$
(18)

where \(\Theta\) represents the model parameter set, \(\parallel \Theta {\parallel }_{2}^{2}\) is the \({L}_{2}\) regularization term, and \({\lambda }_{1}\), \({\lambda }_{2}\), \({\lambda }_{3}\), \({\lambda }_{4}\) are coefficients that balance the contribution of each loss component to the overall optimization.

\({\mathcal{L}}_{rec}\) is the main loss. The widely used BPR loss31 is selected, with the following formula:

$${\mathcal{L}}_{rec} = {\sum }_{(u,i,j)\in \mathcal{O}}-\mathit{ln}\sigma \left(\widehat{y}(u,i)-\widehat{y}(u,j)\right)$$
(19)

where: \(\mathcal{O}=\left\{(u,i,j)\left|(u,i)\in {\mathcal{R}}^{+},(u,j)\in {\mathcal{R}}^{-}\right.\right\}\) represents the training set, and \(\sigma \left(\cdot \right)\) is the sigmoid function. \(\widehat{y}(u,i)\) calculates the similarity between the user and the item after all the dissemination layers are aggregated. In other words, their matching score is as follows:

$$\widehat{y}\left(u,i\right)={{e}_{u}^{*}}^{T}{e}_{i}^{*}$$
(20)
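As a sketch of Eqs. (19)-(20), the BPR loss with the inner-product matching score can be written as follows; the joint objective of Eq. (18) is then a weighted sum of this loss, the KG loss and the multi-level contrastive loss. Tensor and variable names are placeholders:

```python
import torch
import torch.nn.functional as F

def bpr_loss(e_u, e_i_pos, e_i_neg):
    """Eqs. (19)-(20): push ŷ(u,i) for an observed item i above ŷ(u,j) for an unobserved item j."""
    y_pos = (e_u * e_i_pos).sum(dim=-1)    # ŷ(u, i) = e_u*ᵀ e_i*
    y_neg = (e_u * e_i_neg).sum(dim=-1)    # ŷ(u, j)
    return -F.logsigmoid(y_pos - y_neg).sum()

# Joint objective of Eq. (18), with λ1..λ4 as balancing hyperparameters:
# loss = λ1 * L_rec + λ2 * L_KG + λ3 * L_mlcl + λ4 * ||Θ||_2^2
```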

Through designing this joint loss function, the model can not only effectively learn the feature representation of users and items, but also meaningfully employ the structured information in the KG. Accordingly, more accurate and personalized predictions can be realized in the recommendation system.

Experiment and analysis

In this paper, the performance of the ML-KDGATMoco model is verified on the RecBole32 platform. The experimental environment is a single GPU (RTX 2080 Ti) with an Intel Core i7-8700K@4.7 GHz CPU and 128 GB of memory.

Dataset

Two datasets, ML-1m_KG and Amazon-books-KG, are used. Based on KB4Rec33, they link the MovieLens and Amazon-books datasets to the Freebase knowledge graph (KGF). These datasets are preprocessed by 15-core filtering and 5-core filtering to improve data quality. Table 1 summarizes the statistics of the preprocessed datasets.

Table 1 Statistics of the datasets.

Comparison models

In the experiment, several representative models in RecBole are selected for comparison:

(1) SGL7: An innovative contrastive learning recommendation model that generates augmented views through random walk sampling and random edge/node dropout to enhance the learning of user preferences.

(2) KGCN2: As a model based on a GNN, KGCN uses a KG as auxiliary information to improve the accuracy and interpretability of the recommendation system.

(3) SimGCL24: By directly adding uniform noise to embeddings, which creates contrastive views, SimGCL smoothly adjusts representation uniformity, thereby demonstrating significant advantages in recommendation accuracy and training efficiency over augmentation-based methods.

(4) KGAT3: KGAT uses the GNN to model the complex relationships in the KG, assists the recommendation process, and emphasizes the in-depth utilization of the KG structure.

(5) XSimGCL34: This model uses a simple, effective noise-based method to create contrastive samples, enhancing accuracy while reducing computational costs.

Experimental evaluation indicators and experimental settings

In the experiment, the dataset is preprocessed as follows: 80% of each user's interaction records are randomly selected as the training set and the remaining 20% as the test set. In addition, 10% of the interaction records are randomly selected from the training set as the validation set for tuning the hyperparameters. All items that a user has not interacted with are treated as negative samples.

To comprehensively evaluate the effectiveness of the Top-K recommendation, several evaluation indicators are adopted: Recall@K, MRR@K, NDCG@K, Hit@K and Precision@K, with K set to 10. Choosing K = 10 reflects the characteristics of the selected movie and book recommendation datasets: in such scenarios, users tend to browse more recommended content to find items of interest, so improving Hit@10 is more practical than improving Hit@1. All evaluation indicators are averaged over all users in the test set.

To ensure the fairness and comparability of the experiment, the parameter settings of all baseline models follow the configuration officially recommended by RecBole. Specifically, the user and item embedding dimensions on both datasets are set to 64, the training batch size is 4096, and the number of training epochs is 500. For the ML-KDGATMoco framework, the parameter settings of the main supervised module are consistent with the KGAT model.

In the self-supervised contrastive learning module, the initial random Edge Dropout rate is set to 0.2, the loss function coefficients are set to 0.01, the temperature parameter is set to 1, and Adam is used as the optimizer. In addition, an early stopping strategy is adopted: if Recall@10 on the validation set does not improve for 50 consecutive epochs, training is terminated early.

Analysis of experimental results

The experimental results are shown in Table 2. Compared with the baseline models, the ML-KDGATMoco model shows a significant performance improvement on every evaluation indicator. In particular, on the sparse Amazon-books-KG dataset, the proposed model improves over the baselines by 15.6%, 15.8%, 16.6%, 8.6% and 9.3% on the five evaluation indicators (Recall@10 through Precision@10), respectively. This result highlights the powerful ability of ML-KDGATMoco to handle sparse datasets.

Table 2 Performance comparison with baselines on two datasets.

On the ML-1m_KG dataset, the KGAT model performs best among the baselines on most indicators. However, on the more challenging and sparser Amazon-books-KG dataset, the XSimGCL model shows obvious advantages. Through its self-supervised graph learning method, XSimGCL effectively uses graph structure information to learn node and edge representations, and therefore performs well on sparsity problems within large-scale KGs.

In contrast, the KGAT model may struggle to learn effective entity and relation information from extremely sparse datasets. Although its attention mechanism helps identify the importance of entities and relations, it may not be efficient enough on sparse datasets with few samples and thus has difficulty allocating attention.

Compared with these models, the outstanding performance of ML-KDGATMoco on the Amazon-books-KG dataset is mainly due to its multi-level contrastive learning and its optimization strategy for sparse data. This shows that combining contrastive learning with the KG can effectively improve the performance of the recommendation system on sparse data.

Ablation experiment

To verify that the self-supervised contrastive learning task and its sub-modules in ML-KDGATMoco effectively improve the overall recommendation performance, individual modules are removed, yielding three variants: MLCL-d, MLCL-m and MLCL-p. Specifically, MLCL-d performs contrastive learning with the static attention network used in KGAT; MLCL-m performs contrastive learning with the common in-batch negative sampling strategy; MLCL-p performs contrastive learning without the nonlinear transformation layer.

The three variants are re-evaluated on the two datasets. The results are shown in Table 3.

Table 3 Results of the ablation study.

It can be concluded:

1) All performance indicators of the contrastive learning variant MLCL-d, which replaces the dynamic attention network with a static attention network, decline to different degrees on both datasets. In particular, on the Amazon-books-KG dataset, each indicator decreases by 3.79-5.85 points, and the drops in Recall@10, NDCG@10 and Precision@10 exceed 5 points. This shows that the dynamic attention mechanism models a KG with sparse data better than the static attention mechanism and plays an important role in improving model performance.

2) Removing the MoCo module or the nonlinear transformation layer reduces performance on both datasets. Relatively speaking, the MoCo module has the greater impact: its removal costs 1.90-3.41 points on each indicator of the ML-1m_KG dataset and 1.95-4.26 points on each indicator of the Amazon-books-KG dataset, whereas removing the nonlinear transformation layer costs about 1.30-2.93 points on ML-1m_KG and about 0.88-3.11 points on Amazon-books-KG. This verifies that introducing a momentum encoder to expand the negative samples and a nonlinear transformation layer to retain more feature information both play a positive role in contrastive learning.

Data sparsity experiment

To test whether ML-KDGATMoco is robust to the data sparsity problem common in knowledge-aware recommendation systems, experiments are conducted on user groups with different sparsity levels. The test set is divided into four groups according to the number of user interactions, keeping the total number of interactions in each group as similar as possible. Taking the ML-1m_KG dataset as an example, the groups contain users with fewer than 148, 302, 543 and 2315 interactions, respectively. Figures 3 and 4 show the NDCG@10 results for the different user groups on the two datasets. As can be seen, ML-KDGATMoco still learns node representations well under extremely sparse data, maintaining a higher NDCG than the baseline models, and the sparser the data, the more obvious this advantage becomes.

Fig. 3 Performance comparison of the data sparsity on ML-1m_KG.

Fig. 4 Performance comparison of the data sparsity on Amazon-books-KG.

Parameter discussion

The ML-KDGATMoco model has two key hyperparameters: the dropout rate \(\rho\) of edge dropout and the temperature hyperparameter \(T\) of the InfoNCE loss function. The edge dropout rate controls the degree of edge dropping in the data enhancement module of contrastive learning: the larger the value, the greater the difference between the newly constructed subgraphs and the original graph, and the smaller the value, the smaller the difference. The temperature hyperparameter affects the sensitivity of the InfoNCE loss function. By varying these two parameters and observing the performance changes of the model on the ML-1m_KG and Amazon-books-KG datasets, their influence on model performance can be evaluated.

The experimental results (Fig. 5) show that although fine-tuning these parameters can improve performance, the overall performance of the model is also affected by other factors, such as the model architecture and the quality and complexity of the training data. The performance indicators first increase and then decrease as the parameters change, which indicates that careful parameter tuning is required to achieve optimal performance.

Fig. 5 Performance change of hyperparameters on two datasets. (a) Performance change of the \(\rho\) parameter. (b) Performance change of the \(T\) parameter.

Conclusion

By combining a KG with a recommendation system, this study explores the application of contrastive learning in knowledge-aware recommendation. First, the KG and the user-item interactions are regarded as a unified graph structure, and three levels of contrastive learning tasks are designed and trained jointly with the recommendation task. In addition, by introducing a momentum encoder and a dynamic attention mechanism, the performance of the model on sparse data is effectively improved. The experimental results on two real datasets confirm the superior performance of the ML-KDGATMoco model in knowledge-aware recommendation, especially its ability to handle sparse datasets. In the future, the research will further focus on new applications of contrastive learning in KG-based recommendation systems.