Graph convolutional network and self-attentive for sequential recommendation

Kaifeng Guo; Guolei Zeng

doi:10.7717/peerj-cs.1701

Fuzhou University, Fuzhou, Fujian, China

Published: 2023-12-01
Accepted: 2023-10-25
Received: 2023-08-16

Academic Editor: Xiangjie Kong

Subject Areas: Artificial Intelligence, Neural Networks
Keywords: Sequential recommendation, Contrastive learning, Graph convolutional network, Deep learning

Copyright: © 2023 Guo and Zeng
Licence: This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Computer Science) and either DOI or URL of the article must be cited.

Cite this article: Guo K, Zeng G. 2023. Graph convolutional network and self-attentive for sequential recommendation. PeerJ Computer Science 9:e1701 https://doi.org/10.7717/peerj-cs.1701

The authors have chosen to make the review history of this article public.

Abstract

Sequential recommender systems (SRS) aim to provide personalized recommendations to users in the context of large-scale datasets and complex user behavior sequences. However, the effectiveness of most existing embedding techniques in capturing the intricate relationships between items remains suboptimal, with a significant concentration of item embedding vectors that hinder the improvement of final prediction performance. Nevertheless, our study reveals that the distribution of item embeddings can be effectively dispersed through graph interaction networks and contrastive learning. In this article, we propose a graph convolutional neural network to capture the complex relationships between users and items, leveraging the learned embedding vectors of nodes to represent items. Additionally, we employ a self-attentive sequential model to predict outcomes based on the item embedding sequences of individual users. Furthermore, we incorporate instance-wise contrastive learning (ICL) and prototype contrastive learning (PCL) during the training process to enhance the effectiveness of representation learning. Broad comparative experiments and ablation studies were conducted across four distinct datasets. The experimental outcomes clearly demonstrate the superior performance of our proposed GSASRec model.

Introduction

Personalized recommendation has become a dominant and widely adopted approach in various real-world applications, empowering users with tailored item suggestions that cater to their individual interests (Cheng et al., 2016; Fayyaz et al., 2020). The core task of a recommender system revolves around predictive modeling, which aims to predict the likelihood of user-item interactions, encompassing various forms of engagement like clicks, ratings, and purchases, among others. This predictive capability serves as the foundation of effective recommendation systems, enabling them to provide users with relevant and appealing item recommendations, thereby enhancing user satisfaction and engagement. Therefore, accurately capturing user preferences is a critical aspect (Peng, Sugiyama & Mine, 2022).

Collaborative filtering (CF) has emerged as a potent solution for recommendation systems. It relies on historical user-item interactions, such as purchases or clicks, with the assumption that users with similar behavior are likely to exhibit similar preferences for items (Cheng et al., 2018; He et al., 2017). One of its key advantages is that it does not rely on explicit feature engineering or content analysis, allowing it to discover hidden patterns and relationships between users and items solely based on user interactions. This makes collaborative filtering a powerful method for personalized recommendation in various domains. Extensive research on CF-based recommenders has been conducted, leading to remarkable achievements in this field (Koren, 2008; He et al., 2018; Wang et al., 2019; He et al., 2020). However, collaborative filtering methods cannot directly model the temporal relationships among user behaviors, so they may miss how user behavior evolves over time and perform suboptimally on recommendation problems involving temporal dependencies.

Sequential recommendation (SR), in contrast, is a branch of recommendation systems that focuses on providing personalized recommendations by considering the temporal order of user behavior sequences. User interests and preferences are known to evolve and change gradually. To handle the temporal dependency issues in SR, researchers have developed specialized models such as BERT4Rec (Sun et al., 2019) and SASRec (Kang & McAuley, 2018). These models organize users’ actions, such as browsing, purchasing, adding to cart, and other interactions, in chronological order and employ attention mechanisms or positional encoding to gain a better understanding of how user interests evolve over time.

Numerous sequential recommendation studies (Xie et al., 2020; Chen et al., 2022; Zhou et al., 2020; Liu et al., 2021a; Li et al., 2023) explore more effective ways of learning embedded representations of sequential items. One such approach is contrastive learning (Chen et al., 2020), where a positive sample sequence is obtained through sequence augmentation methods, while other sequences serve as negative samples. By encouraging the model to increase the similarity between the encodings of positive sample sequences and decrease the similarity with negative sample sequences, the model’s representational capacity is enhanced. Consequently, the model becomes better equipped to differentiate between the long-term and short-term interests and intentions of distinct users. Through this approach, the model gains a more comprehensive understanding of the intricate patterns embedded in users’ sequential behaviors, thus yielding more accurate and personalized recommendations.

However, sequential recommendation models often face challenges in directly learning the similarities between users and items, as well as item-item and user-user relationships. In contrast, collaborative filtering methods, such as multi-layer graph convolutions on user-item interaction graphs, can effectively unearth the underlying connections between items and users. For instance, users with similar behavior sequences are likely to have similar embedded representations, leading to higher similarity scores. Consequently, if two users have similar embedded representations for certain items, their overall item representations should also exhibit a higher degree of similarity.

Therefore, by combining collaborative filtering with sequential recommendation, we can address these issues. In this regard, we propose a method that utilizes user interaction graph convolutions to extract item embeddings and then employs a sequential recommendation model to predict the user’s next actions. To further enhance the model’s effectiveness, we incorporate instance contrastive learning and prototype contrastive learning to improve its representational capacity.

In summary, this article makes the following contributions:

We propose that combining interactive graphs and attention-based sequence models can complement each other’s limitations. We have empirically demonstrated that the fusion of these two techniques can indeed effectively enhance model performance.
During the training phase, we employ a multi-task learning approach by integrating instance-wise contrastive learning and prototype contrastive learning. We have verified that the combination of these two contrastive learning methods can further improve model effectiveness.
Extensive experiments are carried out on four widely-used public datasets, showcasing the consistent superiority of our proposed approach over various competitive baselines. Additionally, we conducted multiple sets of ablation experiments to validate the effectiveness of each module.

Related work

Collaborative filtering

Collaborative filtering (CF) is a popular approach in recommendation systems that involves learning latent features, or embeddings, to represent users and items. The prediction is then performed based on these embedding vectors. Matrix factorization is one of the early CF models, where users’ interaction history is not explicitly considered, and only the user ID is projected to the embedding. However, subsequent research has shown that incorporating user interaction history can improve the quality of embeddings and prediction performance.

An example of this is the utilization of user interaction history in predicting numerical ratings, as demonstrated by SVD++ (Koren, 2008). Additionally, Neural Attentive Item Similarity (NAIS) assigns varying degrees of importance to items present in the interaction history, leading to more accurate item ranking predictions (He et al., 2018). The key to these enhancements lies in leveraging the subgraph structure of a user’s interaction history, particularly considering their one-hop neighbors, which effectively enhances the process of embedding learning.

To further leverage the subgraph structure, Wang et al. (2019) propose NGCF, a state-of-the-art CF model inspired by graph convolution network (GCN) (Wu et al., 2019). NGCF adopts the propagation rule of GCN, which involves feature transformation, neighborhood aggregation, and nonlinear activation, to refine embeddings. While NGCF has shown promising results, it inherits many operations from GCN without justifying their relevance to the CF task. This design choice introduces unnecessary complexity, particularly when applied to user-item interaction graphs, where each node has only a one-hot ID without rich attribute information. LightGCN (He et al., 2020) introduces a novel approach that propagates user and item embeddings linearly onto the user-item interaction graph, leveraging the weighted summation of embeddings learned across all layers as the ultimate embedding. This method exhibits significant performance improvements over NGCF, as evidenced by our experimental results.

Sequential recommendation

Sequential recommendation has garnered significant research attention in recent years, aiming to accurately capture users’ dynamic interests by modeling their past behavior sequences. Early approaches in this field focused on utilizing Markov chains to model item-to-item transaction patterns. For instance, FPMC combined Markov chains with matrix factorization techniques to integrate sequential patterns and users’ general interests (Rendle, Freudenthaler & Schmidt-Thieme, 2010).

In light of the rise of deep learning, a multitude of deep sequential recommendation models have emerged, harnessing neural networks to capture both long-term and short-term preferences from behavioral sequences. Recurrent neural networks (RNNs) gained prominence due to their ability to encode sequential dependencies. For example, GRU4Rec employed gated recurrent units (GRUs) to model user interests (Hidasi et al., 2015). Another avenue of research delved into the use of convolutional neural networks (CNNs) for sequential recommendation (Yan et al., 2019).

The success of attention mechanisms in natural language processing tasks has motivated their adoption in sequential recommendation. Attention-based models have shown promise in capturing complex dependencies in behavior sequences. SASRec introduced the use of unidirectional attention mechanisms to assign adaptive weights to interacted items (Kang & McAuley, 2018). BERT4Rec improved upon this approach by employing bidirectional attention mechanisms with a Cloze task (Sun et al., 2019). LSAN proposed a light-weight approach with a temporal context-aware embedding and a twin-attention network (Li et al., 2021). ASReP addressed data sparsity by leveraging an attention mechanism on revised user behavior sequences (Liu et al., 2021b). DuoRec (Qiu et al., 2022) introduced techniques to improve semantic preservation and address the representation degeneration problem in recommendation systems.

Contrastive learning for recommendation

Contrastive learning (CL) has garnered significant attention in various research domains such as computer vision, natural language processing, and recommender systems. In the context of recommender systems, the focus of contrastive learning lies in optimizing mutual information between positively transformed data samples while simultaneously enhancing the discriminability of negative samples. Traditional recommender systems often rely on large amounts of labeled user behavioral data, which are often difficult to obtain and may result in subpar recommendations for new users and rare items. In contrast, contrastive learning, with its label-free self-supervised learning approach, exhibits remarkable advantages in recommender systems.

Early works in contrastive learning for recommendation focused on utilizing deep neural networks (DNNs) to enhance collaborative filtering-based recommendation by leveraging item attributes (Yao et al., 2020). These models utilized a two-tower architecture to compare positive and negative samples and learn effective item representations. Another line of research employed contrastive learning within graph neural networks (GNNs) to improve collaborative filtering methods using only item IDs as features (Wu et al., 2020).

In the domain of sequential recommendation, contrastive self-supervised learning (SSL) has been utilized to capture associations among items, subsequences, and characteristics found in user behavior sequences (Zhou et al., 2020). These models adopt a two-stage approach that pre-trains with contrastive SSL objectives and then fine-tunes on next-item prediction. However, this two-stage training hinders information sharing between the SSL and next-item prediction tasks, potentially constraining overall performance. To overcome this limitation, recent studies have proposed multi-task training frameworks incorporating a contrastive objective to improve user representations (Xie et al., 2020; Liu et al., 2021a). Furthermore, ICLRec, presented by Chen et al. (2022), introduces clustering techniques to extract users’ intent distributions from their behavior sequences. By leveraging clustering, ICLRec identifies distinct patterns of user intent embedded within the data.

Preliminaries

Problem settings

Let $V$ and $U$ represent the sets of items and users, respectively. We denote the interaction sequence of a user $u \in U$ as $S_{u} = \{v_{1}, v_{2}, \dots, v_{T}\}$, where $T$ is the total number of items in the sequence and the items are ordered chronologically. Each item $v_{i} \in S_{u}$ is associated with an order index $i = 1, 2, \dots, T$, indicating its position in the sequence. Our objective is to produce a ranked list of the top $K$ items that user $u$ is most likely to interact with at the next time step $T + 1$.

Proposed model

In this section, we will introduce our proposed graph convolution and self-attention model, named GSASRec. GSASRec is primarily composed of interaction graph convolution (IGC) layers and self-attention layers. We will proceed to describe each layer of the model in the order of forward propagation, along with the contrastive learning methods utilized in the model.

Embedding layer

We represent a user $u$ and an item $i$ by their respective embedding vectors $e_{u} \in \mathbb{R}^{d}$ and $e_{i} \in \mathbb{R}^{d}$, where $d$ is the embedding dimension. This amounts to creating two parameter matrices that act as embedding look-up tables:

$E_{u} = [e_{u_{1}}, e_{u_{2}}, \dots, e_{u_{t}}]$

$E_{i} = [e_{i_{1}}, e_{i_{2}}, \dots, e_{i_{m}}]$

where $t$ represents the total number of users, while $m$ corresponds to the total number of items. For the input sequence $S_{u} = \{v_{1}, v_{2}, \dots, v_{n}\}$, data augmentation techniques such as masking, cropping, noising, and reordering are applied to obtain two augmented sequences $S_{u}' = \{v_{1}', v_{2}', \dots, v_{n}'\}$ and $S_{u}'' = \{v_{1}'', v_{2}'', \dots, v_{n}''\}$. Then, based on the $E_{i}$ table, we can acquire their embeddings $E_{S_{u}} = \{e_{v_{1}}, e_{v_{2}}, \dots, e_{v_{n}}\} \in \mathbb{R}^{n \times d}$, $E_{S_{u}'} = \{e_{v_{1}'}, e_{v_{2}'}, \dots, e_{v_{n}'}\} \in \mathbb{R}^{n \times d}$ and $E_{S_{u}''} = \{e_{v_{1}''}, e_{v_{2}''}, \dots, e_{v_{n}''}\} \in \mathbb{R}^{n \times d}$.
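The augmentation step can be illustrated with a minimal Python sketch. The text names masking, cropping, noising, and reordering but does not specify their parameters, so the ratio of 0.5, the reserved `MASK_ID`, and the helper names below are all assumptions made for illustration; noising is omitted, and padding cropped sequences back to length $n$ is left out.

```python
import random

MASK_ID = 0  # hypothetical item id reserved for the mask token


def mask(seq, ratio, rng):
    """Replace a random subset of positions with MASK_ID."""
    n_mask = int(len(seq) * ratio)
    chosen = set(rng.sample(range(len(seq)), n_mask))
    return [MASK_ID if i in chosen else v for i, v in enumerate(seq)]


def crop(seq, ratio, rng):
    """Keep one random contiguous sub-sequence."""
    length = max(1, int(len(seq) * ratio))
    start = rng.randint(0, len(seq) - length)
    return seq[start:start + length]


def reorder(seq, ratio, rng):
    """Shuffle one random contiguous window and leave the rest in place."""
    length = max(1, int(len(seq) * ratio))
    start = rng.randint(0, len(seq) - length)
    window = seq[start:start + length]  # slicing copies, so seq is untouched
    rng.shuffle(window)
    return seq[:start] + window + seq[start + length:]


def two_views(seq, rng, ratio=0.5):
    """Draw two augmented views of the same sequence."""
    ops = [mask, crop, reorder]
    return rng.choice(ops)(seq, ratio, rng), rng.choice(ops)(seq, ratio, rng)


rng = random.Random(0)
s_u = [11, 12, 13, 14, 15, 16]
view_1, view_2 = two_views(s_u, rng)
```

Both views are then mapped through the same item embedding table, so the two augmented sequences share parameters with the original sequence.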

Interaction graph convolution layer

LightGCN (He et al., 2020) incorporates graph convolution neural networks into collaborative filtering, taking into account the latent relationships between users and items, as well as between items themselves. However, during prediction, it does not consider the temporal order of item sequences. Therefore, in this work, we leverage graph convolution neural networks to extract latent embedding information, with a focus on capturing the sequential characteristics of items, as illustrated in Fig. 1.

Based on the training data, we construct the user-item interaction matrix $R \in \mathbb{R}^{t \times m}$ and the item-user interaction matrix $R^{T} \in \mathbb{R}^{m \times t}$. With these matrices in place, we define the graph convolution network as follows:

$e_{u}^{(k + 1)} = \sum_{i \in N_{u}} \frac{1}{\sqrt{| N_{u} |} \sqrt{| N_{i} |}} e_{i}^{(k)}$

$e_{i}^{(k + 1)} = \sum_{u \in N_{i}} \frac{1}{\sqrt{| N_{i} |} \sqrt{| N_{u} |}} e_{u}^{(k)}$

where $e_{i}^{(k + 1)}$ represents the updated representation of node $i$ after the $(k + 1)$-th iteration. The sum is taken over all the neighboring nodes $u$ of node $i$, denoted by $N_{i}$. The term $\frac{1}{\sqrt{| N_{i} |} \sqrt{| N_{u} |}}$ is a normalization factor that accounts for the degrees of nodes $i$ and $u$, and $e_{u}^{(k)}$ is the representation of node $u$ in the $k$-th iteration. When $k = 0$, we initialize $e_{i}^{(0)} = e_{i} \in E_{i}$ and $e_{u}^{(0)} = e_{u} \in E_{u}$. This update rule is used in graph convolution networks to aggregate neighboring node features and update the representation of each node in the graph.

As items undergo multiple graph convolutions, and the sequence model focuses solely on item sequences for recommendations, we extract only the item embedding representations for the subsequent layers. We aggregate the outputs of various convolution layers to obtain the graph embedding representation for item $i$ .

$e_{i}^{(k)} = \frac{1}{K} \sum_{k = 0}^{K} e_{i}^{(k)}$

where $K$ represents the number of graph convolution layers utilized in the model.
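To make the propagation and aggregation rules concrete, the following is a small pure-Python sketch on a toy interaction graph. The dictionary-based graph representation and the function names are assumptions made for illustration; a practical implementation would use sparse matrix products in a tensor library. The aggregation divides by $K$, as in the formula above.

```python
import math


def propagate(user_items, user_emb, item_emb):
    """One propagation step with 1 / (sqrt|N_u| * sqrt|N_i|) normalisation."""
    dim = len(next(iter(item_emb.values())))
    # Invert the user -> items map to obtain N_i for every item.
    item_users = {}
    for u, items in user_items.items():
        for i in items:
            item_users.setdefault(i, set()).add(u)
    new_user = {u: [0.0] * dim for u in user_emb}
    new_item = {i: [0.0] * dim for i in item_emb}
    for u, items in user_items.items():
        for i in items:
            w = 1.0 / (math.sqrt(len(items)) * math.sqrt(len(item_users[i])))
            for k in range(dim):
                new_user[u][k] += w * item_emb[i][k]  # user aggregates items
                new_item[i][k] += w * user_emb[u][k]  # item aggregates users
    return new_user, new_item


def item_graph_embeddings(user_items, user_emb, item_emb, num_layers):
    """Sum item embeddings over layers 0..K and scale by 1/K, as in the text."""
    dim = len(next(iter(item_emb.values())))
    layers = [item_emb]
    for _ in range(num_layers):
        user_emb, item_emb = propagate(user_items, user_emb, item_emb)
        layers.append(item_emb)
    return {
        i: [sum(layer[i][k] for layer in layers) / num_layers for k in range(dim)]
        for i in layers[0]
    }


# Toy graph: user u1 interacted with items a and b, user u2 with item b only.
user_items = {"u1": {"a", "b"}, "u2": {"b"}}
user_emb = {"u1": [1.0], "u2": [2.0]}
item_emb = {"a": [3.0], "b": [4.0]}
graph_item_emb = item_graph_embeddings(user_items, user_emb, item_emb, num_layers=1)
```

Only the item embeddings are returned, since the subsequent sequence model consumes item sequences and discards the user embeddings.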

Self-attention layer

To represent the temporal order within a sequence, we employ positional embedding. Assuming the positional embedding is represented as $P \in ℝ^{n \times d}$ , we add it to the embedding of the behavioral sequence:

$\hat{E_{p}} = \begin{bmatrix} e_{v_{1}}^{(k)} + P_{1} \\ e_{v_{2}}^{(k)} + P_{2} \\ \dots \\ e_{v_{n}}^{(k)} + P_{n} \end{bmatrix}$

$\hat{E_{p}}' = \begin{bmatrix} e_{v_{1}'}^{(k)} + P_{1} \\ e_{v_{2}'}^{(k)} + P_{2} \\ \dots \\ e_{v_{n}'}^{(k)} + P_{n} \end{bmatrix}$

$\hat{E_{p}}'' = \begin{bmatrix} e_{v_{1}''}^{(k)} + P_{1} \\ e_{v_{2}''}^{(k)} + P_{2} \\ \dots \\ e_{v_{n}''}^{(k)} + P_{n} \end{bmatrix}$

We then apply a self-attention mechanism followed by feed-forward network layers:

$E_{A} = \mathrm{Attention}(\hat{E_{p}} W^{Q}, \hat{E_{p}} W^{K}, \hat{E_{p}} W^{V})$

$F = \mathrm{ReLU}(E_{A} W^{(1)} + b^{(1)}) W^{(2)} + b^{(2)}$

where the matrices $W^{Q}, W^{K}, W^{V} \in \mathbb{R}^{d \times d}$ and the matrices $Q, K, V \in \mathbb{R}^{n \times d}$. $W^{(1)}$ and $W^{(2)} \in \mathbb{R}^{d \times d}$ serve as parameter matrices, while $b^{(1)}$ and $b^{(2)} \in \mathbb{R}^{d}$ represent bias vectors. The attention mechanism is expressed as follows:

$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{T}}{\sqrt{d}}\right) V$

Similarly, from ${\hat{E_{p}}}^{'}$ and ${\hat{E_{p}}}^{''}$ , we can obtain $F^{'}$ and $F^{''}$ .
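The encoder equations of this subsection can be traced with a short pure-Python sketch. It follows the formulas literally (single head, one block) and leaves out components the text does not spell out, such as causal masking, multiple heads, residual connections, and layer normalization; the helper names are illustrative assumptions.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    top = max(row)
    exps = [math.exp(x - top) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """softmax(Q K^T / sqrt(d)) V."""
    d = len(q[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, v)


def feed_forward(x, w1, b1, w2, b2):
    """ReLU(x W1 + b1) W2 + b2, applied position-wise."""
    hidden = [[max(0.0, h + b) for h, b in zip(row, b1)] for row in matmul(x, w1)]
    return [[o + b for o, b in zip(row, b2)] for row in matmul(hidden, w2)]


def encode(item_emb, pos_emb, wq, wk, wv, w1, b1, w2, b2):
    """Add positional embeddings, then apply attention and the FFN."""
    e_p = [[e + p for e, p in zip(er, pr)] for er, pr in zip(item_emb, pos_emb)]
    e_a = attention(matmul(e_p, wq), matmul(e_p, wk), matmul(e_p, wv))
    return feed_forward(e_a, w1, b1, w2, b2)


identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [0.0, 0.0]
f_out = encode([[1.0, 0.0], [1.0, 0.0]], [zeros, zeros],
               identity, identity, identity, identity, zeros, identity, zeros)
```

The same `encode` call, applied to the two augmented sequences with shared weights, would give the outputs corresponding to $F'$ and $F''$.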

Recommendation learning

For the output sequence $F = \{f_{1}, f_{2}, \dots, f_{n}\}$ of the feed-forward network (FFN), we can compute the binary cross-entropy loss at each step of the recommendation model:

$L_{Rec} = - \sum_{t = 1}^{n - 1} \log (σ (f_{t} \cdot e_{v_{t + 1}})) - \log (σ (f_{n} \cdot e_{v_{\hat{y}}})) - \sum_{t = 1}^{n} \sum_{j \notin S_{u}} \log (1 - σ (f_{t} \cdot e_{v_{j}}))$

where $f_{t}$ represents the FFN output at position $t$, $\hat{y}$ represents the index within $E_{i}$ of the target item at position $n + 1$ in the sequence, $σ$ denotes the sigmoid function, and $e_{v_{\hat{y}}}$ signifies the embedding representation of the training label.
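As a hedged illustration of this loss, the sketch below evaluates it for explicit lists of vectors. Passing the negatives as a list is an assumption for illustration: the formula sums over every item outside the user's sequence, whereas implementations commonly sample a few negatives per position.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def rec_loss(outputs, targets, negatives):
    """Binary cross-entropy over all positions of one sequence.

    outputs:   f_1 .. f_n, the encoder outputs
    targets:   embeddings of v_2 .. v_n followed by the label at position n + 1
    negatives: embeddings of items that are not in the user's sequence
    """
    loss = 0.0
    for f_t, e_pos in zip(outputs, targets):
        loss -= math.log(sigmoid(dot(f_t, e_pos)))            # positive term
        for e_neg in negatives:
            loss -= math.log(1.0 - sigmoid(dot(f_t, e_neg)))  # negative term
    return loss


example = rec_loss(outputs=[[0.0, 0.0]], targets=[[1.0, 1.0]], negatives=[[1.0, 1.0]])
```

With a zero output vector every score is $σ(0) = 0.5$, so the example evaluates to $2 \ln 2$, which is a convenient sanity check.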

Instance-wise contrastive learning

For a training batch $B = \{F_{1}, F_{2}, \dots, F_{b}, F_{1}', F_{2}', \dots, F_{b}', F_{1}'', F_{2}'', \dots, F_{b}''\}$, where $b$ is the number of original sequences and $3 \cdot b$ is the total number of sequences in one batch, comprising the original sequences and their two augmented sequences, we aim to maximize the similarity between $F_{i}$ and its corresponding augmented sequences $F_{i}'$ and $F_{i}''$, as well as the similarity between the two augmented sequences themselves. Additionally, we seek to minimize the similarity between $F_{i}$, $F_{i}'$, $F_{i}''$, and the other sequences in the batch, thereby achieving contrastive learning. Hence, we can compute the InfoNCE loss for the batch $B$:

$L_{CL} (F_{i}, F_{i}') = - \log \frac{e^{sim (F_{i}, F_{i}') / τ}}{\sum_{j = 1, j \neq i}^{b} [e^{sim (F_{i}, F_{j}) / τ} + e^{sim (F_{i}, F_{j}') / τ} + e^{sim (F_{i}, F_{j}'') / τ}]}$

$L_{ICL} = \sum_{i = 1}^{b} [L_{CL} (F_{i}, F_{i}') + L_{CL} (F_{i}', F_{i}'') + L_{CL} (F_{i}'', F_{i})]$

where $sim (\cdot)$ represents the tensor similarity function, which is used to calculate the similarity between tensors.
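The batch loss can be sketched in a few lines of Python. This is an illustration under stated assumptions: each sequence representation is taken to be a single flat vector, `sim` is taken to be the dot product (the text leaves the similarity function unspecified), and, as in the formula above, the denominator ranges only over the other sequences in the batch, so the batch must contain at least two original sequences.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def l_cl(anchor, positive, negatives, tau):
    """InfoNCE term for one (anchor, positive) pair."""
    pos = math.exp(dot(anchor, positive) / tau)
    neg = sum(math.exp(dot(anchor, other) / tau) for other in negatives)
    return -math.log(pos / neg)


def icl_loss(orig, view_1, view_2, tau=1.0):
    """Sum the three pairwise terms over every sequence in the batch."""
    b = len(orig)
    total = 0.0
    for i in range(b):
        # Negatives: all three versions of every other sequence j != i.
        negatives = [views[j] for j in range(b) if j != i
                     for views in (orig, view_1, view_2)]
        total += l_cl(orig[i], view_1[i], negatives, tau)
        total += l_cl(view_1[i], view_2[i], negatives, tau)
        total += l_cl(view_2[i], orig[i], negatives, tau)
    return total


batch = [[1.0, 0.0], [0.0, 1.0]]
loss = icl_loss(batch, batch, batch, tau=1.0)
```

In this toy batch each positive pair has similarity 1 and each of the three negatives has similarity 0, so every term equals $\ln 3 - 1$ and the total is $6(\ln 3 - 1)$.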

Prototype contrastive learning

Prototype contrastive learning aims to learn feature representations by comparing the similarity between samples and prototypes. This process makes the feature representations of similar samples closer while pushing those of dissimilar samples further apart, resulting in the formation of distinct clusters. Typically, this learning is conducted after multiple rounds of training, specifically when the instance contrastive learning loss approaches relative stability. We interpret the embedding encoding of a user’s entire sequence as the representation of their long-term interest. Generally, users with similar behavioral sequences exhibit close long-term interest embeddings. Hence, adopting prototype contrastive learning can bring the embedding encodings of similar behavioral sequences closer, placing them within the same category. This approach is advantageous for recommendation systems as it facilitates recommending similar items to users with shared interests.

We apply k-means clustering $M$ times to the embedding representations of all user sequences in the data. For each run $m$ $(1 \leq m \leq M)$, we randomly select several points as the initial cluster centroids, denoted as $C = \{c_{1}^{m}, c_{2}^{m}, \dots, c_{| C |}^{m}\}$. After several iterations of clustering in the $m$-th run, we fix the cluster centroids. Subsequently, we define the function:

$g ({\bar{f}}_{i}) = \arg \min_{j} d ({\bar{f}}_{i}, c_{j}^{m})$

where $d ({\bar{f}}_{i}, c_{j}^{m}) = \sqrt{\sum_{k = 1}^{d} {({\bar{f}}_{i, k} - c_{j, k}^{m})}^{2}}$ and ${\bar{f}}_{i} = \frac{1}{| F_{i} |} \sum_{f \in F_{i}} f$. The function $g (\cdot)$ identifies the nearest cluster centroid $c_{j}^{m}$ for each averaged embedding ${\bar{f}}_{i}$, calculated as the mean of all embeddings $f$ within the set $F_{i}$. We leverage the pre-computed cluster centers for contrastive learning and compute the loss function as follows:

$L_{PCL}^{m} ({\bar{f}}_{i}, c_{g ({\bar{f}}_{i})}^{m}) = - \log \frac{e^{({\bar{f}}_{i} \cdot c_{g ({\bar{f}}_{i})}^{m})}}{\sum_{j = 1, j \neq g ({\bar{f}}_{i})}^{| C |} e^{({\bar{f}}_{i} \cdot c_{j}^{m})}}$

$L_{PCL} ({\bar{f}}_{i}) = \frac{1}{M} \sum_{m = 1}^{M} L_{PCL}^{m} ({\bar{f}}_{i}, c_{g ({\bar{f}}_{i})}^{m})$
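A minimal sketch of the prototype loss follows, assuming the centroids of each of the $M$ k-means runs have already been computed and fixed (the clustering itself is not shown). The function names are illustrative, and the denominator excludes the assigned centroid, as in the formula above, which means the loss can take negative values.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mean_pool(seq_out):
    """Average the per-position outputs of one sequence into a single vector."""
    n = len(seq_out)
    return [sum(col) / n for col in zip(*seq_out)]


def nearest(f_bar, centroids):
    """Index of the centroid with the smallest Euclidean distance to f_bar."""
    return min(range(len(centroids)), key=lambda j: math.dist(f_bar, centroids[j]))


def pcl_loss(f_bar, centroid_runs):
    """Average the prototype loss over the M clustering runs."""
    total = 0.0
    for centroids in centroid_runs:
        g = nearest(f_bar, centroids)
        pos = math.exp(dot(f_bar, centroids[g]))
        neg = sum(math.exp(dot(f_bar, c)) for j, c in enumerate(centroids) if j != g)
        total += -math.log(pos / neg)
    return total / len(centroid_runs)


f_bar = mean_pool([[1.0, 0.0], [3.0, 0.0]])   # -> [2.0, 0.0]
runs = [[[2.0, 0.0], [0.0, 2.0]]] * 2         # two runs with identical centroids
loss = pcl_loss(f_bar, runs)
```

Here the pooled vector coincides with the first centroid, so the positive score is $e^{4}$, the single negative score is $e^{0}$, and the loss is $-4$ in both runs.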

Multi-task learning

To enhance model performance, data efficiency, and generalization capability, and to address challenges such as data scarcity and overfitting, we adopt a multitask learning approach, as shown in Fig. 2, to integrate recommendation, instance contrastive learning, and prototype contrastive learning tasks. Specifically, we jointly optimize the loss functions of these tasks:

Figure 2: The overview of GSASRec in the training stage.
We assume that the input sequence of examples goes through data augmentation techniques, such as introducing noise, to generate two positive sample sequences (only one is shown in the figure). In this process, we randomly replace $i_{9}$ and $i_{6}$ with $i_{7}$ and $i_{2}$ , respectively. Subsequently, the encoded sequences undergo multitask learning, involving instance contrastive learning and prototype contrastive learning.

DOI: 10.7717/peerj-cs.1701/fig-2

$L = L_{Rec} + λ \cdot L_{ICL} + β \cdot L_{PCL}$

where $λ$ and $β$ are adjustable parameters used to balance the importance of the losses.
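The joint objective can be written as a small helper. This is a sketch: the default weights and the epoch at which the prototype term is switched on are taken from the values reported in the Implementation section, while the function name and the zero-based epoch comparison are assumptions.

```python
def total_loss(l_rec, l_icl, l_pcl, epoch, lam=0.9, beta=0.1, pcl_start_epoch=160):
    """Combine the three task losses; the prototype term joins after a warm-up."""
    loss = l_rec + lam * l_icl
    if epoch >= pcl_start_epoch:
        loss += beta * l_pcl
    return loss


early = total_loss(1.0, 2.0, 3.0, epoch=0)     # recommendation + instance terms only
late = total_loss(1.0, 2.0, 3.0, epoch=160)    # all three terms
```

Delaying the prototype term reflects the earlier remark that prototype contrastive learning is applied once the instance-wise loss has become relatively stable.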

Experiments

In this section, an extensive assessment is conducted to evaluate the recommendation efficacy of our GSASRec model, designed for sequential recommendation tasks. Our evaluation entails a comprehensive analysis that includes a comparative study between GSASRec and previous sequential recommenders. Subsequently, we delve into a thorough investigation to explore the influence of crucial components and hyperparameters integrated within GSASRec’s architecture. This systematic examination aims to shed light on the model’s strengths and potential areas for further enhancement, contributing to the advancement of sequential recommendation techniques powered by deep learning methodologies.

Experimental setting

Datasets

We conduct experiments on four widely adopted benchmark datasets, whose statistics are summarized in Table 1. The Amazon review dataset (McAuley et al., 2015) contributes three subcategories: Sports, Beauty and Toys. Yelp is a prominent dataset for business recommendation. Following Xie et al. (2020), we preprocess the datasets by eliminating users with fewer than five interactions.

Table 1:

Statistics of experimental datasets.

Dataset	#Users	#Items	#Interactions	Density (%)
Sports	35,598	18,357	296,337	0.05
Beauty	22,363	12,101	198,502	0.07
Toys	19,412	11,924	167,597	0.07
Yelp	22,845	16,552	243,703	0.06

DOI: 10.7717/peerj-cs.1701/table-1

Note:

Density (%) = $\frac{\#\mathrm{Interactions}}{\#\mathrm{Users} \times \#\mathrm{Items}} \times 100$.
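As a quick arithmetic check, the density column of Table 1 can be recomputed from the user, item, and interaction counts. The short Python sketch below does this; the dictionary layout and function name are illustrative, and the result is multiplied by 100 to express it as a percentage.

```python
# (users, items, interactions) copied from Table 1
stats = {
    "Sports": (35598, 18357, 296337),
    "Beauty": (22363, 12101, 198502),
    "Toys": (19412, 11924, 167597),
    "Yelp": (22845, 16552, 243703),
}


def density_percent(users, items, interactions):
    """Share of observed user-item pairs, expressed as a percentage."""
    return 100.0 * interactions / (users * items)


densities = {name: round(density_percent(*row), 2) for name, row in stats.items()}
```

For Sports, for example, $296{,}337 / (35{,}598 \times 18{,}357) \approx 4.5 \times 10^{-4}$, that is, about 0.05%.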

Evaluation metrics

To evaluate the performance of our approach, we utilize two widely recognized Top-K metrics (NDCG@K and HR@K) as proposed by a previous work (Krichene & Rendle, 2020). The formula for NDCG@K is as follows:

$NDCG@K = \frac{DCG@K}{IDCG@K}$

where $DCG@K = \sum_{i = 1}^{K} \frac{rel_{i}}{\log_{2} (i + 1)}$ and $rel_{i}$ is the relevance score of the item at position $i$ in the ranked list. $IDCG@K$ is the maximum possible $DCG@K$ achievable for a perfect ranking. It is calculated by sorting the items by their true relevance scores in descending order and then calculating $DCG@K$ for this ideal ranking.

HR@K is a binary evaluation metric, commonly used for the performance evaluation of recommendation systems. The formula for HR@K is as follows:

$HR@K = \frac{\text{Number of relevant items in recommendations}}{K}$

where the number of relevant items in recommendations is the number of items related to user interests in the first $K$ recommended results.

Overall, HR@K measures the percentage of recommendation lists that contain at least one ground-truth item within the top K positions. On the other hand, NDCG@K assesses the ranking quality by giving higher scores to hits at higher-ranked positions. These metrics provide a quantitative measure of how effective each model is at recommending relevant items within the top K positions. By comparing NDCG@K or HR@K scores, we can determine which model is better at surfacing relevant content to users. Higher scores indicate more effective recommendations. To ensure consistency, we set the value of K to 5 and 10 for both metrics.
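The two metrics can be illustrated with a short Python sketch. It assumes the common evaluation setting in which each user has a single held-out ground-truth item and the hit indicator is averaged over users; the function names and the toy rankings are illustrative and not taken from the experiments.

```python
import math


def dcg_at_k(rels, k):
    """Discounted cumulative gain over the first k relevance scores."""
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(rels[:k], start=1))


def ndcg_at_k(rels, k):
    """DCG normalised by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(rels, reverse=True), k)
    return dcg_at_k(rels, k) / ideal if ideal > 0 else 0.0


def hit_at_k(ranked_items, target, k):
    """1.0 if the ground-truth item appears in the top k, else 0.0."""
    return 1.0 if target in ranked_items[:k] else 0.0


# Two toy users: the first has its target ranked second, the second misses it.
rankings = [(["a", "b", "c"], "b"), (["a", "b", "c"], "z")]
hr_at_2 = sum(hit_at_k(items, target, 2) for items, target in rankings) / len(rankings)
ndcg_user_1 = ndcg_at_k([0, 1, 0], k=2)
```

With one relevant item at rank 2, the NDCG is $1 / \log_{2} 3 \approx 0.63$, whereas the same hit at rank 1 would score 1; this is the position sensitivity that HR@K lacks.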

Baseline methods

We compare GSASRec with the following baseline methods:

BPR-MF (Rendle et al., 2012) proposed a generic learning algorithm based on stochastic gradient descent with bootstrap sampling.
Caser (Tang & Wang, 2018) proposed a convolutional sequence embedding recommendation model, which effectively captures both general preferences and sequential patterns in recommendation tasks.
GRU4Rec (Hidasi et al., 2015) proposed a novel session-based recommendation model based on GRUs, which effectively captures temporal dependencies in user behavior sequences.
SASRec (Kang & McAuley, 2018) utilized self-attention mechanism for sequential recommendation.
BERT4Rec (Sun et al., 2019) adopted BERT as the sequential recommendation model.
$S^{3}$-Rec (Zhou et al., 2020) adopted a self-supervised learning approach, where items in the user behavior sequence are masked, and the masked sequence is used to predict the masked items.
CL4SRec (Xie et al., 2020) proposes the use of data augmentation in contrastive learning to enhance the effectiveness of recommendation systems.
ICLRec (Chen et al., 2022) leveraged clustering to learn user intent and validated the rationality of this approach.

Implementation

We set the embedding size to 64, the maximum sequence length to 50, the batch size to 256, and the number of training epochs to 300. The learning rate is 0.001, and the prototype contrastive learning loss is applied from epoch 160 onward. Our model architecture comprises three graph convolutional layers followed by two self-attention blocks with two attention heads. We set $λ$ to 0.9 and $β$ to 0.1, and we run the clustering procedure $M = 3$ times. The model is implemented in PyTorch and trained on an NVIDIA GeForce RTX 3070 GPU, on a machine with 64 GB of RAM.

Overall performance

Table 2 reports the results obtained by the various methods on the different datasets. First, we observe that incorporating sequential patterns in user behavior sequences enhances the performance of sequential models like SASRec and Caser, which surpass the non-sequential approach BPR-MF. This highlights the significance of mining sequential patterns, with GRU4Rec also exhibiting improved results over BPR-MF. Caser, which leverages a convolutional module over sequential tokens stacked as a matrix, performs on par with GRU4Rec. SASRec, the pioneer in utilizing unidirectional attention for sequence encoding, significantly improves on previous deep learning-based models.

Second, with the rise of self-supervised and contrastive learning techniques in recommendation systems, BERT4Rec, S3-Rec, and CL4SRec have leveraged such objectives to enhance performance over pure sequential recommendation models. However, the two-stage training strategy employed in S3-Rec obstructs information sharing between tasks, resulting in suboptimal outcomes. CL4SRec, in contrast, consistently outperforms the earlier baselines, showcasing the efficacy of contrastive self-supervised learning in enriching sequence representations at an individual user level. Its additional objective, which contrasts two distinct views of the same sequence, contributes significantly to this performance. ICLRec then combines the advantages of previous approaches and introduces user intent extraction, which also relies on contrastive learning, resulting in significant further improvements.

Finally, our proposed GSASRec model achieves even greater improvements compared to ICLRec. Unlike ICLRec, we enhance the model’s representational capacity by applying graph convolution to the user-item interaction graph. Moreover, we perform multiple rounds of prototype clustering to mitigate noise interference and introduce a data augmentation method for instance-wise contrastive learning.

Table 2: Overall performance.

Dataset	Metric	BPR	GRU4Rec	Caser	SASRec	BERT4Rec	S3-Rec	CL4SRec	ICLRec	GSASRec	Improve
Sports	HR@5	0.0101	0.0136	0.0140	0.0219	0.0177	0.0158	0.0229	0.0282	0.0306 $\pm$ 0.0012	8.51%
	HR@10	0.0194	0.0278	0.0231	0.0336	0.0326	0.0265	0.0373	0.0431	0.0462 $\pm$ 0.0008	7.19%
	NDCG@5	0.0048	0.0096	0.0086	0.0128	0.0105	0.0098	0.0131	0.0182	0.0209 $\pm$ 0.0006	14.83%
	NDCG@10	0.0063	0.0136	0.0126	0.0169	0.0155	0.0135	0.0185	0.0230	0.0258 $\pm$ 0.0005	12.17%
Beauty	HR@5	0.0134	0.0165	0.0258	0.0367	0.0194	0.0327	0.0402	0.0493	0.0518 $\pm$ 0.0011	5.07%
	HR@10	0.0301	0.0365	0.0421	0.0627	0.0401	0.0594	0.0686	0.0736	0.0788 $\pm$ 0.0013	7.06%
	NDCG@5	0.0045	0.0087	0.0131	0.0236	0.0189	0.0176	0.0231	0.0324	0.0344 $\pm$ 0.0007	6.17%
	NDCG@10	0.0058	0.0143	0.0256	0.0281	0.0254	0.0269	0.0318	0.0401	0.0426 $\pm$ 0.0010	6.23%
Yelp	HR@5	0.0131	0.0154	0.0156	0.0161	0.0186	0.0175	0.0218	0.0245	0.0257 $\pm$ 0.0004	4.90%
	HR@10	0.0246	0.0265	0.0254	0.0265	0.0292	0.0283	0.0354	0.0408	0.0429 $\pm$ 0.0012	4.91%
	NDCG@5	0.0760	0.1070	0.0097	0.0101	0.0118	0.0115	0.0131	0.0153	0.0161 $\pm$ 0.0005	5.22%
	NDCG@10	0.0119	0.0136	0.0129	0.0135	0.0173	0.0162	0.0188	0.0207	0.0216 $\pm$ 0.0008	4.34%
Toys	HR@5	0.0120	0.0098	0.0164	0.0467	0.0277	0.0144	0.0536	0.0590	0.0621 $\pm$ 0.0018	5.25%
	HR@10	0.0206	0.0177	0.0274	0.0655	0.0449	0.0553	0.0816	0.0834	0.0863 $\pm$ 0.0035	3.47%
	NDCG@5	0.0081	0.0061	0.0109	0.0310	0.0177	0.0131	0.0369	0.0406	0.0421 $\pm$ 0.0013	3.69%
	NDCG@10	0.0113	0.0173	0.0274	0.0649	0.0198	0.0371	0.0434	0.0481	0.0502 $\pm$ 0.0015	4.36%

DOI: 10.7717/peerj-cs.1701/table-2

Notes:

Bold indicates the best result among all methods, while underlining represents the highest result among previous methods.

$\text{Improve}(\%) = \frac{\text{our model score} - \text{highest result among previous methods}}{\text{highest result among previous methods}}$.
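As a quick check of the Improve column, the formula above can be evaluated directly. The function name is hypothetical; the two scores are the Sports HR@5 entries for GSASRec and ICLRec from Table 2.

```python
def improvement_pct(ours: float, best_previous: float) -> float:
    """Relative gain of our score over the best previous score, in percent."""
    if best_previous <= 0:
        raise ValueError("best_previous must be positive")
    return (ours - best_previous) / best_previous * 100.0


# Sports HR@5 from Table 2: GSASRec 0.0306 vs. ICLRec 0.0282
print(f"{improvement_pct(0.0306, 0.0282):.2f}%")  # prints 8.51%
```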

Figure 3 presents the model’s performance at each epoch. It is important to highlight that we introduced the prototype contrastive learning loss at epoch 160, as depicted in Fig. 3E. This led to a noticeable increase in the computed loss values, resulting in distinctive fluctuations and an overall upward trend in the curves, particularly evident in the Toys (Fig. 3C) and Beauty (Fig. 3B) datasets.

Figure 3: The training curves of GSASRec, which are evaluated through training loss, and testing HR@k and NDCG@k per epoch on the Sports, Beauty, Toys, and Yelp datasets.

DOI: 10.7717/peerj-cs.1701/fig-3
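The HR@k and NDCG@k values in Table 2 and Fig. 3 can be illustrated with a short sketch. It assumes a leave-one-out protocol with a single held-out item per user, ranked against the candidate items, and takes the 0-based rank of that item as input. These assumptions and the function names are illustrative and are not taken from the released code.

```python
import math
from typing import Sequence


def hit_ratio_at_k(ranks: Sequence[int], k: int) -> float:
    """Fraction of users whose held-out item is ranked within the top k (0-based ranks)."""
    return sum(1 for r in ranks if r < k) / len(ranks)


def ndcg_at_k(ranks: Sequence[int], k: int) -> float:
    """Mean NDCG@k with one relevant item per user, so the ideal DCG equals 1."""
    return sum(1.0 / math.log2(r + 2) for r in ranks if r < k) / len(ranks)


# Hypothetical ranks of the held-out item for four users
ranks = [0, 3, 12, 7]
print(hit_ratio_at_k(ranks, 5))   # 0.5
print(hit_ratio_at_k(ranks, 10))  # 0.75
print(ndcg_at_k(ranks, 5))
```

HR@k only counts whether the item appears in the top k, whereas NDCG@k also rewards a higher position through the logarithmic discount.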

Ablation study

Impact of parameters $λ$ and $β$

To evaluate the impact of the loss function weights on the model’s performance, we conducted experiments with multiple sets of $λ$ and $β$ values and assessed the model’s NDCG@10 performance on the four datasets (Fig. 4). The model performs best when $λ$ is set to 0.9 and $β$ to 0.1, and its performance deteriorates when $β$ is set to 0 or to a value greater than 0.1. We attribute the degradation at larger $β$ to the prototype contrastive loss, whose value becomes much larger than that of the instance contrastive loss after a certain number of epochs. Consequently, the model overly emphasizes the prototype contrastive task and stops optimizing the sequential recommendation task and the instance contrastive task. It is therefore necessary to keep the weight of the prototype contrastive task reasonably small.

Figure 4: The performance of ablation experiments on the parameters $λ$ and $β$.

DOI: 10.7717/peerj-cs.1701/fig-4
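The role of the two weights can be sketched as follows. The additive form (recommendation loss plus $λ$ times the instance contrastive loss plus $β$ times the prototype contrastive loss) and the function name are assumptions made for illustration. The text itself only reports $λ$ = 0.9, $β$ = 0.1, and that the prototype loss is enabled from epoch 160.

```python
def total_loss(rec_loss: float, icl_loss: float, pcl_loss: float, epoch: int,
               lam: float = 0.9, beta: float = 0.1, pcl_start_epoch: int = 160) -> float:
    """Weighted sum of the three training objectives (assumed additive form)."""
    loss = rec_loss + lam * icl_loss
    if epoch >= pcl_start_epoch:
        # The prototype term is only added once prototype learning has started.
        loss += beta * pcl_loss
    return loss


# Hypothetical loss values: the prototype loss is ten times the instance loss.
print(total_loss(1.0, 2.0, 20.0, epoch=100))
print(total_loss(1.0, 2.0, 20.0, epoch=200))
```

With these hypothetical values, a small $β$ scales the large prototype term down to roughly the size of the instance term, which matches the argument that its weight needs to be kept low.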

Impact of model components on recommendation performance

To validate the effectiveness of the individual model components, we conducted ablation experiments on the four datasets, using NDCG@10 as the evaluation metric. In each experiment, we removed one specific functionality from our proposed model, GSASRec, in order to gauge its contribution to the overall performance.

Figure 5 presents the results of these ablation experiments. Each label in the figure names a functionality removed from the GSASRec model, where ‘w/o’ stands for ‘without’. Specifically, ‘ICL’ denotes instance-wise contrastive learning, ‘PCL’ denotes prototype contrastive learning, ‘IGCL’ denotes the interaction graph convolution layer, and ‘SAL’ denotes the self-attention layer. When the self-attention layer is removed, we use the embeddings of individual items obtained from the graph convolution layer as the objects for contrastive learning. We also include ‘ICLRec’ as the baseline model, as it is the best-performing model among the instance-wise contrastive learning methods. The results show that instance-wise contrastive learning has a substantial positive impact on the model’s overall performance, indicating its crucial role in enhancing recommendation accuracy. It is followed closely by the self-attention layer, which also contributes to improved recommendation results. The results further highlight the importance of prototype contrastive learning, particularly for the Toys dataset, where it has a noteworthy influence on recommendation performance. This observation suggests that prototype contrastive learning can help address dataset-specific challenges. Finally, the interaction graph convolution layer is a significant component of our model, consistently leading to substantial performance improvements across all the evaluated datasets. This finding underlines the efficacy of incorporating graph-based interactions to capture complex relationships between users and items, and reinforces the importance of graph-based learning methods in recommender systems.

Figure 5: The performance of ablation experiments on the four datasets.

DOI: 10.7717/peerj-cs.1701/fig-5

Impact of layer combination

In our model, we aggregate the outputs of k convolutional layers to obtain the embedding representation. The choice of k significantly influences the quality of this representation. Figure 6 illustrates the effect of varying the number of convolutional layers on the four datasets, where ‘Number of Layer’ corresponds to the upper limit of k in the model. The curve trends in the figure are generally consistent, with the model’s performance peaking at three layers in most cases. For the Beauty dataset (Fig. 6B), however, the four-layer graph convolutional network slightly outperforms the three-layer one. We infer that the suboptimal performance for k below three may stem from the model’s inability to fully capture the complex relationships and features present in the graph data. Shallow graph convolutional networks propagate information only among local neighbor nodes, which makes it difficult to capture global structures and longer-range dependencies. As the number of convolutional layers increases, the model progressively expands its receptive field and uses more of the graph structure for feature propagation and learning. Nevertheless, excessively deep networks may encounter vanishing or exploding gradients, leading to a plateau in model performance beyond three or four layers.

Figure 6: The results of different graph convolution layer settings in the four datasets.

DOI: 10.7717/peerj-cs.1701/fig-6
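The aggregation over k layers discussed in this subsection can be sketched with NumPy. The sketch assumes a LightGCN-style scheme: parameter-free propagation over the symmetrically normalized user-item graph, followed by a mean over all layer outputs, including the initial embeddings. The exact aggregation used in GSASRec may differ, and the function and variable names are hypothetical.

```python
import numpy as np


def propagate_and_aggregate(interactions: np.ndarray, embeddings: np.ndarray,
                            num_layers: int) -> np.ndarray:
    """Run `num_layers` graph convolutions and average all layer outputs.

    interactions: (num_users, num_items) binary interaction matrix.
    embeddings:   (num_users + num_items, dim) initial node embeddings (layer 0).
    """
    num_users, num_items = interactions.shape
    n = num_users + num_items
    # Bipartite adjacency: users link to items and items link back to users.
    adj = np.zeros((n, n))
    adj[:num_users, num_users:] = interactions
    adj[num_users:, :num_users] = interactions.T
    # Symmetric normalization D^{-1/2} A D^{-1/2}; isolated nodes keep a zero row.
    degree = adj.sum(axis=1)
    inv_sqrt = np.zeros_like(degree)
    nonzero = degree > 0
    inv_sqrt[nonzero] = degree[nonzero] ** -0.5
    norm_adj = inv_sqrt[:, None] * adj * inv_sqrt[None, :]

    layer_outputs = [embeddings]
    for _ in range(num_layers):
        layer_outputs.append(norm_adj @ layer_outputs[-1])
    # Mean over layers 0..num_layers (assumed aggregation).
    return np.mean(layer_outputs, axis=0)


# Toy graph: 2 users, 3 items, 4-dimensional embeddings
rng = np.random.default_rng(0)
interactions = np.array([[1.0, 1.0, 0.0], [0.0, 1.0, 1.0]])
initial = rng.normal(size=(5, 4))
final = propagate_and_aggregate(interactions, initial, num_layers=3)
item_embeddings = final[2:]  # rows after the users hold the item embeddings
print(item_embeddings.shape)
```

Each additional layer lets a node draw on neighbors one hop further away, which is the receptive-field growth described above. Averaging over layers keeps the contribution of the shallow, more local representations.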

Conclusions

In this article, we present a novel sequence recommendation model that integrates interactive graph convolutional networks (GCNs) and employs various contrastive learning techniques to enhance its performance. Specifically, we leverage multiple layers of graph convolution to capture latent relationships between items in the user-item interaction graph. The outputs of these graph convolution layers are aggregated to obtain item embedding representations. Furthermore, we incorporate attention mechanisms and position embedding encoding into the sequence model, combining the advantages of interactive graph convolutions with sequence recommendation models. To further improve the model’s representation capabilities, we employ instance contrastive learning and prototype contrastive learning techniques. The introduction of these contrastive learning techniques enables our model to better capture the underlying structures and patterns in the data, leading to improved recommendation performance. We have conducted extensive comparative experiments and ablation studies to demonstrate the superiority of our proposed method.

Supplemental Information

Source code.

To ensure a fair comparison of models, the majority of the code in the files datasets.py, utils.py, modules.py, main.py, and data_augmentation.py is adapted from the code provided by S3-Rec and ICLRec. However, we have implemented the code for models.py and trainers.py according to our proposed approach.

DOI: 10.7717/peerj-cs.1701/supp-1

[1] Chen T, Kornblith S, Norouzi M, Hinton G. 2020. A simple framework for contrastive learning of visual representations. In: Daumé III H, Singh A, eds. Proceedings of the 37th International Conference on Machine Learning, Volume 119 of Proceedings of Machine Learning Research. PMLR. 1597-1607

[2] Chen Y, Liu Z, Li J, McAuley J, Xiong C. 2022. Intent contrastive learning for sequential recommendation.

[3] Cheng Z, Ding Y, Zhu L, Kankanhalli MS. 2018. Aspect-aware latent factor model: rating prediction with ratings and reviews. CoRR

[4] Cheng H, Koc L, Harmsen J, Shaked T, Chandra T, Aradhye H, Anderson G, Corrado G, Chai W, Ispir M, Anil R, Haque Z, Hong L, Jain V, Liu X, Shah H. 2016. Wide & deep learning for recommender systems. CoRR

[5] Fayyaz Z, Ebrahimian M, Nawara D, Ibrahim A, Kashef R. 2020. Recommendation systems: algorithms, challenges, metrics, and business opportunities. Applied Sciences 10(21):7748

[6] He X, Deng K, Wang X, Li Y, Zhang Y, Wang M. 2020. LightGCN: simplifying and powering graph convolution network for recommendation. CoRR

[7] He X, He Z, Song J, Liu Z, Jiang Y, Chua T. 2018. NAIS: neural attentive item similarity model for recommendation. CoRR

[8] He X, Liao L, Zhang H, Nie L, Hu X, Chua T. 2017. Neural collaborative filtering. CoRR

[9] Hidasi B, Karatzoglou A, Baltrunas L, Tikk D. 2015. Session-based recommendations with recurrent neural networks. ArXiv preprint

[10] Kang W, McAuley JJ. 2018. Self-attentive sequential recommendation. CoRR

[11] Koren Y. 2008. Factorization meets the neighborhood: a multifaceted collaborative filtering model.

[12] Krichene W, Rendle S. 2020. On sampled metrics for item recommendation.

[13] Li Y, Chen T, Zhang P, Yin H. 2021. Lightweight self-attentive sequential recommendation. CoRR

[14] Li X, Sun A, Zhao M, Yu J, Zhu K, Jin D, Yu M, Yu R. 2023. Multi-intention oriented contrastive learning for sequential recommendation.

[15] Liu Z, Chen Y, Li J, Yu PS, McAuley JJ, Xiong C. 2021a. Contrastive self-supervised sequential recommendation with robust augmentation. CoRR

[16] Liu Z, Fan Z, Wang Y, Yu PS. 2021b. Augmenting sequential recommendation with pseudo-prior items via reversely pre-training transformer.

[17] McAuley JJ, Targett C, Shi Q, van den Hengel A. 2015. Image-based recommendations on styles and substitutes. CoRR

[18] Peng S, Sugiyama K, Mine T. 2022. Less is more: reweighting important spectral graph features for recommendation. ArXiv preprint

[19] Qiu R, Huang Z, Yin H, Wang Z. 2022. Contrastive learning for representation degeneration problem in sequential recommendation. CoRR

[20] Rendle S, Freudenthaler C, Gantner Z, Schmidt-Thieme L. 2012. BPR: bayesian personalized ranking from implicit feedback. CoRR

[21] Rendle S, Freudenthaler C, Schmidt-Thieme L. 2010. Factorizing personalized Markov chains for next-basket recommendation.

[22] Sun F, Liu J, Wu J, Pei C, Lin X, Ou W, Jiang P. 2019. BERT4Rec: sequential recommendation with bidirectional encoder representations from transformer. CoRR

[23] Tang J, Wang K. 2018. Personalized Top-N sequential recommendation via convolutional sequence embedding. CoRR

[24] Wang X, He X, Wang M, Feng F, Chua T. 2019. Neural graph collaborative filtering. CoRR

[25] Wu J, Wang X, Feng F, He X, Chen L, Lian J, Xie X. 2020. Self-supervised graph learning for recommendation. CoRR

[26] Wu F, Zhang T, de Souza AH, Fifty C, Yu T, Weinberger KQ. 2019. Simplifying graph convolutional networks. CoRR