Fairness modeling for topics with different scales in short texts

PeerJ Computer Science

Introduction

In the context of online platforms, user-generated data in the form of brief textual communications such as tweets, comments and posts has emerged as a pivotal resource for data mining and knowledge discovery (Laureate, Buntine & Linger, 2023), with applications in domains including knowledge recommendation and network management. Nevertheless, such data is often brief and limited in information, and its volume can vary significantly across different topics. In particular, data with potential value is often much smaller in volume than data on trending topics. This variability necessitates the fair modelling of both mature topics and topics with sparse data within a noisy, mixed-data environment. In the domain of knowledge discovery, the timely identification and rapid detection of such valuable information is of critical importance. The ability to consistently model topics of varying scale, or to map topics at different stages of development onto a common space, would enhance the detection of new knowledge and improve the prediction of knowledge evolution trends. Therefore, we focus on the fair modelling of short text data.

The objective of topic modeling techniques is to automatically identify and extract latent topics from text data. Nonetheless, most conventional topic modelling techniques, including Latent Dirichlet Allocation (LDA) (Blei, Ng & Jordan, 2001), generally depend on substantial text datasets to produce consistent and meaningful topics (Sia & Duh, 2021). Issues such as data sparsity and a lack of sufficient contextual information considerably reduce the effectiveness of existing methods, limiting their applicability in many real-world scenarios. Topics that attract significant attention are usually associated with large-scale datasets, so traditional topic models are effective at identifying their latent topics: the substantial volume of data enables the model to statistically extract more accurate and meaningful topic features. However, emerging topics frequently face the challenge of limited data, which results in suboptimal performance of traditional topic models in these domains. In the absence of sufficient relevant data, the model may struggle to identify the underlying structure of emerging topics, and may even fail to effectively distinguish emerging topics from established ones.

In large-scale environments, popular topics tend to have significantly larger data volumes than topics that are nascent or in the early stages of popularity. To identify all topics despite imbalanced data volumes, we propose a topic discovery method that achieves fair modelling across datasets of different scales. The method represents documents with a graph structure and then applies topic modelling techniques to derive latent topic representations. In this way, reliance shifts from the scale of the data to the correlation of the data, reducing the impact of scale differences. Specifically, normalized pointwise mutual information (NPMI) is introduced to represent the text data as a graph, with word correlations used to alleviate the effects of data sparsity. The graph model captures both lexical associations and semantic relationships within the text, providing more reliable information for subsequent topic modelling.

Furthermore, we employ graph algorithms to identify cohesive semantic clusters within the document graph. These semantic clusters exhibit high internal consistency and interrelation, and they support a clearer and more precise representation of topics in short texts. Finally, a novel modelling technique, MixTM, is proposed for extracting latent topic structures within the semantic sets. This method extends conventional models by transforming the document collection into a mixed word-pair set, thereby expanding the contextual range between words. This alleviates data sparsity and reduces the impact of small-scale data on topic discovery. The primary contributions of this article are as follows:

  • A novel graph-based methodology is proposed for the discovery of topics, which effectively mitigates the issue of data scale differences between topics. This enables fair topic modelling across datasets of varying scales, and is particularly effective in the discovery of topics within small-scale data.

  • A mixed topic modeling approach based on bi-grams and tri-grams is proposed, which enhances the contextual relationships between words by leveraging the constraints between them.

The remainder of the article is organised as follows: In “Related Work”, the existing literature on graph-based models and topic modelling is reviewed. In “Topic Discovering Methods”, a detailed description of the MixTM-G method is provided. In “Results”, the quality of the topics generated by the model and its ability to discover topics are evaluated from both objective and subjective perspectives. Finally, in “Conclusions”, the conclusions of the article are presented.

Related work

Graph-based representation of text

In the graphical representation of text, the graph structure effectively captures the complex relationships within a document (Yao, Mao & Luo, 2019). The nodes and edges in the graph represent multi-level associations between words and documents. Depending on the specific task, the method of graph construction can be adapted. For example, Wu, Xu & Zheng (2025), Cheng et al. (2025) and Hua et al. (2024) explore these complex relationships by modelling documents as heterogeneous graphs consisting of word nodes and document nodes. The edge weights in these graphs are derived from two components: the pointwise mutual information (PMI) between words and the term frequency-inverse document frequency (TF-IDF) between words and documents. By applying convolutional operations to the document graph, global word information can be effectively integrated, thereby improving model performance. Song et al. (2025) captures the potential associations between documents and words by introducing edges between original short texts, edges between short texts and their associated long texts, and edges between semantically similar words, building on a document-word bipartite graph framework.

Typically, word co-occurrence information is obtained by counting how often words occur together. However, in short texts word frequencies are often low, making it difficult to obtain reliable co-occurrence statistics. PMI measures the true association between two words by comparing their joint probability with the product of their individual probabilities, thus highlighting significant semantic relationships between word pairs. Since our goal is fair modelling of data at different scales, PMI can be used to transform the text into a graph model, replacing the data-volume dependence of traditional topic models with word correlations. However, PMI is prone to extreme-value problems (Bouma, 2009), which can undermine its stability. To address this, our model uses NPMI to compute word associations, thereby increasing the stability of the association measure.

Topic modeling for short texts

Short text topic modelling aims to automatically identify the underlying topics in short texts. While traditional topic modelling methods such as LDA perform well on longer documents, they face significant challenges when applied to short texts. Some studies use word embedding-based methods (Kaleem et al., 2024; Kinariwala & Deshmukh, 2023; Liu et al., 2022; Rashid et al., 2023; Reuter et al., 2024; Uddin et al., 2024), relying on pre-trained models such as bidirectional encoder representations from transformers (BERT) (Kenton & Toutanova, 2019) and the generative pre-trained transformer (GPT) (Brown, 2020) to convert words and texts into vector representations. Other studies introduce models with different priors to capture richer semantic relations. Zhang et al. (2025) adds an integrated Gaussian and logistic coding augmentation network to the neural inference network to improve the performance and interpretability of topic extraction. Koochemeshkian & Bouguila (2024) combines BERTopic-derived embeddings with multi-granularity clustered topic modelling (MGCTM) and introduces a generalized Dirichlet distribution and a Beta-Liouville distribution as priors to improve the expressiveness of MGCTM. Wang et al. (2024) proposes SET, a topic model based on the Wasserstein autoencoder, to capture the semantic knowledge of words. Qiu et al. (2024) proposes the Prompted Topic Model (PTM), which uses prompt learning and bypasses the structural constraints of LDA and the variational autoencoder (VAE), thereby overcoming shortcomings of traditional topic models. These models map words into continuous spaces, capturing richer semantic information.

Some studies address the issue of short text sparsity by introducing different models for detecting emerging topics in short texts. Shi, Du & Kou (2020) uses the “spike and slab” method to distinguish between emerging and ordinary topics and to track emerging events. Zhu et al. (2022) combines recurrent neural networks (RNNs) with spatio-temporal information to mitigate data sparsity. Asgari-Chenaghlu et al. (2021) uses a combination of transformers and graph mining techniques to improve topic detection in multimodal data. While these methods perform well on large datasets, they remain highly dependent on data size when applied to smaller datasets: limited data may prevent the model from fully capturing semantic relationships or detecting emerging topics, which degrades detection performance. In this article, we replace the reliance on data size in traditional topic modelling with data correlation, allowing more robust and fair topic modelling across datasets of different scales.

Topic discovering methods

To ensure fair topic discovery across different data scales, a novel topic discovery approach, MixTM-G, is proposed. It consists of two key components: graph model construction and topic modelling. The graph model masks the discrepancies in data volume between topics at different scales, while the mixed bi-gram and tri-gram approach proposed for topic modelling strengthens the contextual dependencies in short-text data. Figure 1 presents an overview of the method. The two components are elaborated below.

Figure 1: The overview of our method.

The document collection is converted into a graph model by computing the NPMI of word pairs within each window. In this graph model, nodes correspond to words and edge weights correspond to NPMI values. Strong semantic sets are then constructed using graph algorithms. Topic modelling is performed with MixTM, a word-pair mixing approach, and finally the topics of the small-scale data are discovered.

Graph model construction

In order to account for differences in data scale across topics, reliance on data volume is replaced by data correlation. Specifically, NPMI is introduced to measure the mutual relationship between words. The formula is as follows:

$$\mathrm{NPMI}(x,y)=\frac{\mathrm{PMI}(x,y)}{-\log p(x,y)},\qquad \mathrm{PMI}(x,y)=\log\frac{p(x,y)}{p(x)\,p(y)},$$

where $p(x,y)$ represents the probability of co-occurrence of $x$ and $y$, and $p(x)$ represents the probability of occurrence of $x$.

The value of NPMI ranges from −1 to 1. A value of 1 indicates that the two words have perfect co-occurrence, i.e., they always appear together (e.g., “bat” and “ball” in a sports corpus, where they are inseparable). A value of 0 implies statistical independence, i.e., their co-occurrence is no more frequent than random chance. A value of −1 represents perfect mutual exclusion, where the words never occur together (e.g., “hot” and “cold” in a corpus where it is logically impossible for them to occur together).

In practical applications (Salle & Villavicencio, 2023; Yao, Mao & Luo, 2019), for ease of analysis, the value of NPMI usually takes the range of 0 to 1, with values closer to 1 indicating a stronger correlation between word pairs and values closer to 0 indicating a weaker or negligible correlation. This helps to mitigate the effects of data volume and shifts the focus to the relationships between word pairs. Building on this, we model the collection of documents as a graph.
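To make this concrete, the following minimal sketch (in Python) estimates NPMI from sliding-window co-occurrence counts. The window width, the counting scheme, and the function name are illustrative assumptions, not the paper's exact settings.

```python
import math
from collections import Counter
from itertools import combinations

def npmi_scores(docs, window=10):
    """Estimate NPMI for word pairs from sliding-window co-occurrence counts.

    docs   : list of tokenized documents (lists of words)
    window : sliding-window width (an illustrative default)
    Returns a dict mapping (w_i, w_j) -> NPMI in [-1, 1].
    """
    word_count, pair_count, n_windows = Counter(), Counter(), 0
    for tokens in docs:
        for start in range(max(1, len(tokens) - window + 1)):
            win = set(tokens[start:start + window])
            n_windows += 1
            word_count.update(win)
            pair_count.update(combinations(sorted(win), 2))

    scores = {}
    for (x, y), c_xy in pair_count.items():
        p_xy = c_xy / n_windows
        p_x, p_y = word_count[x] / n_windows, word_count[y] / n_windows
        pmi = math.log(p_xy / (p_x * p_y))
        denom = -math.log(p_xy)           # NPMI = PMI / (-log p(x, y))
        if denom > 0:                     # guard against p(x, y) == 1
            scores[(x, y)] = scores[(y, x)] = pmi / denom
    return scores
```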

Specifically, let $G=(V,E)$ be an undirected graph, where $V=\{w_1, w_2, \ldots, w_n\}$ is the set of all words in the document collection. The edges $E$ represent the relationships between words, with edge weight $O_{i,j}=\mathrm{NPMI}(w_i,w_j)$.

Using the NPMI metric, we can measure the mutual information between word pairs. If the data set for certain topics is smaller but more focused, the mutual information between word pairs will be higher, resulting in a higher NPMI value. Conversely, if noise words occur frequently in different topic clusters, the NPMI values between these word pairs will be lower. When several words have strong correlations, there will be several edges between them in the graph, forming a dense subgraph that reflects their close association as a topic. In this case, the document collection graph can be seen as an aggregation of strong semantic structures. Therefore, the task can be advanced by searching for complete subgraphs within the graph. A complete graph is one in which every node is directly connected to every other node. This high connectivity indicates a very close relationship between the nodes, which is consistent with the expectation of topic aggregation.

We use the Bron-Kerbosch algorithm (Bron & Kerbosch, 1973), a classical graph algorithm designed to find cliques in an undirected graph, to identify complete subgraphs. The algorithm efficiently finds all maximal cliques through a recursive process and handles undirected graphs of different densities, whether dense or sparse. We feed the mutual-information text graph to the algorithm as raw data and extract all maximal cliques. These cliques represent sets of words with strong semantic associations.
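As an illustration of this step, the sketch below builds the NPMI-weighted word graph and extracts maximal cliques with networkx, whose find_cliques routine implements a Bron-Kerbosch variant. The edge-weight threshold and minimum clique size are assumed values chosen only for the example.

```python
import networkx as nx

def build_semantic_sets(npmi, threshold=0.3, min_size=3):
    """Build an NPMI-weighted word graph and return maximal cliques as
    strong semantic sets. `threshold` and `min_size` are illustrative."""
    g = nx.Graph()
    for (wi, wj), score in npmi.items():
        if score >= threshold:
            g.add_edge(wi, wj, weight=score)
    # nx.find_cliques enumerates maximal cliques (Bron-Kerbosch with pivoting).
    return [set(clique) for clique in nx.find_cliques(g) if len(clique) >= min_size]

# Example usage: semantic_sets = build_semantic_sets(npmi_scores(tokenized_docs))
```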

Topic modeling

We propose a novel model, MixTM, which addresses the problem of data sparsity by considering a wider range of contextual information. This model is well suited to situations with limited data, as it captures subtle semantic relationships within the text and provides richer topic representations.

We construct documents as a set of mixed word pairs, assuming that the words in each pair belong to the same topic. For example, the short text “company department chip” can be represented as the word-pair set {“company department”, “company chip”, “department chip”, “company department chip”}. Let α and β be the prior parameters for the topic distribution θ and the topic-word distribution φ, respectively; Dir denotes the Dirichlet distribution and Multi the multinomial distribution. The text generation process of the model is as follows:

  • 1. For each topic z: draw topic-word distribution φ_z ~ Dir(β)

  • 2. Draw topic distribution θ ~ Dir(α)

  • 3. For each term in the set of word pairs B:

  • Draw z ~ Multi(θ)

  • Draw term ~ Multi(φ_z)

  • term = (w_i, w_j) or term = (w_i, w_j, w_k)

The graphical representation of the model is shown in Fig. 2. By constructing documents as a mixture of bi-grams and tri-grams, MixTM provides a richer contextual representation than the bi-gram model, significantly improving the model’s ability to distinguish between different topics. In addition, this combination captures more complex relationships between words, making the prediction of the next word more constrained and contextually informed. As a result, the topic modelling process becomes more accurate and robust, particularly in short or sparse texts where limited contextual information could hinder the discovery of meaningful topics.

Figure 2: Graphical representation model of MixTM.
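As a small illustration of the mixed word-pair construction described above, the sketch below expands a tokenized short text into its mixed set of two-word and three-word terms; the function name is hypothetical.

```python
from itertools import combinations

def mixed_word_pairs(tokens):
    """Expand a short text into its mixed bi-gram/tri-gram word-pair set,
    e.g. ['company', 'department', 'chip'] ->
    [('company', 'department'), ('company', 'chip'), ('department', 'chip'),
     ('company', 'department', 'chip')]."""
    return list(combinations(tokens, 2)) + list(combinations(tokens, 3))
```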

In the document generation process described above, we need to infer the parameters: θ, φ, and the latent variable k. Given the simplicity and effectiveness of the Collapsed Gibbs Sampling method, we use it to estimate these parameters. The sampling formula for Gibbs sampling is as follows:

$$p(k_b \mid \mathbf{K}_{\neg b}, \mathbf{B}, \alpha, \beta) \propto \frac{p(\mathbf{K}, \mathbf{B} \mid \alpha, \beta)}{p(\mathbf{K}_{\neg b}, \mathbf{B}_{\neg b} \mid \alpha, \beta)}.$$

The formula represents the conditional probability distribution of the latent topic $k_b$ of the target word pair $b$, given the topics of all word pairs except $b$ and the hyperparameters α and β. The joint probability of all word pairs is denoted by $p(\mathbf{K}, \mathbf{B} \mid \alpha, \beta)$, and the joint probability of all word pairs except pair $b$ is denoted by $p(\mathbf{K}_{\neg b}, \mathbf{B}_{\neg b} \mid \alpha, \beta)$. The joint probability of all word pairs is calculated as follows:

$$p(\mathbf{K}, \mathbf{B} \mid \alpha, \beta) = p(\mathbf{K} \mid \alpha)\, p(\mathbf{B} \mid \mathbf{K}, \beta) = \prod_{d=1}^{D} \frac{\Delta(\mathbf{n}_d^{K} + \alpha)}{\Delta(\alpha)} \prod_{k=1}^{K} \frac{\Delta(\mathbf{n}_k^{B} + \beta)}{\Delta(\beta)}.$$

In this context, $n_d^K$ represents the frequency of each topic $k$ in document $d$, $n_k^B$ represents the number of times a word pair is assigned to topic $k$, and $\Delta(\cdot)$ denotes the Dirichlet delta function. The formula represents the joint generation probability of the topic structure and the word distribution given the hyperparameters α and β. Combining Formulas (2) and (3), the final sampling formula can be derived as follows.

For pairs composed of two words $(w_i, w_j)$:

$$p(k_b = k \mid \mathbf{K}_{\neg b}, \mathbf{B}, \alpha, \beta) \propto \frac{n_{k,\neg b} + \alpha}{\sum_{k=1}^{K} n_{k,\neg b} + K\alpha} \times \frac{(n_{k,\neg b}^{w_i} + \beta)(n_{k,\neg b}^{w_j} + \beta)}{\left(\sum_{w=1}^{W} n_{k,\neg b}^{w} + W\beta\right)\left(\sum_{w=1}^{W} n_{k,\neg b}^{w} + 1 + W\beta\right)}.$$

For pairs composed of three words $(w_i, w_j, w_g)$:

$$p(k_b = k \mid \mathbf{K}_{\neg b}, \mathbf{B}, \alpha, \beta) \propto \frac{n_{k,\neg b} + \alpha}{\sum_{k=1}^{K} n_{k,\neg b} + K\alpha} \times \frac{(n_{k,\neg b}^{w_i} + \beta)(n_{k,\neg b}^{w_j} + \beta)(n_{k,\neg b}^{w_g} + \beta)}{\left(\sum_{w=1}^{W} n_{k,\neg b}^{w} + W\beta\right)\left(\sum_{w=1}^{W} n_{k,\neg b}^{w} + 1 + W\beta\right)\left(\sum_{w=1}^{W} n_{k,\neg b}^{w} + 2 + W\beta\right)},$$

where $n_{k,\neg b}$ denotes the number of word pairs assigned to topic $k$ after excluding word pair $b$, and $n_{k,\neg b}^{w_i}$ denotes the number of occurrences of word $w_i$ in topic $k$.

The Gibbs sampling process commences with the random selection of an initial state, assigning an initial topic to each word pair. The topic assignments are then resampled using the sampling formulas above, and the process is iterated until a stable state is reached, at which point the topic distribution for each word pair is obtained. Formulas (6) and (7) are then employed to calculate the parameters θ and φ, where $n_k^w$ denotes the number of times word $w$ is assigned to topic $k$.

$$\varphi_{k,w} = \frac{n_k^{w} + \beta}{\sum_{w} n_k^{w} + W\beta}$$

$$\theta_k = \frac{n_k + \alpha}{|B| + K\alpha}.$$
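The following is a simplified collapsed Gibbs sampling sketch for MixTM-style mixed terms, under the reconstruction above: counts are maintained incrementally, each term's topic is resampled with the unnormalized weights of the sampling formulas, and φ and θ are then estimated as in Formulas (6) and (7). Variable names, hyperparameter defaults and the initialization are illustrative assumptions, not the authors' implementation.

```python
import random
from collections import defaultdict

def gibbs_train(terms, vocab, K, alpha=0.05, beta=0.01, iters=100, seed=0):
    """Collapsed Gibbs sampling sketch for mixed word-pair terms.

    terms : list of tuples of words (two- or three-word terms)
    vocab : list of distinct words
    Returns (theta, phi) estimated from the final topic assignments."""
    rng = random.Random(seed)
    W = len(vocab)
    n_k = [0] * K                                  # number of terms assigned to topic k
    n_kw = [defaultdict(int) for _ in range(K)]    # word counts per topic
    z = []
    for t in terms:                                # random initialization
        k = rng.randrange(K)
        z.append(k)
        n_k[k] += 1
        for w in t:
            n_kw[k][w] += 1

    for _ in range(iters):
        for i, t in enumerate(terms):
            k_old = z[i]                           # remove term i from the counts
            n_k[k_old] -= 1
            for w in t:
                n_kw[k_old][w] -= 1
            weights = []
            for k in range(K):                     # unnormalized sampling weights
                p = (n_k[k] + alpha) / (sum(n_k) + K * alpha)
                total = sum(n_kw[k].values())
                for j, w in enumerate(t):
                    p *= (n_kw[k][w] + beta) / (total + j + W * beta)
                weights.append(p)
            k_new = rng.choices(range(K), weights=weights)[0]
            z[i] = k_new                           # add term i back with its new topic
            n_k[k_new] += 1
            for w in t:
                n_kw[k_new][w] += 1

    theta = [(n_k[k] + alpha) / (len(terms) + K * alpha) for k in range(K)]
    phi = [{w: (n_kw[k][w] + beta) / (sum(n_kw[k].values()) + W * beta)
            for w in vocab} for k in range(K)]
    return theta, phi
```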

The model cannot obtain the document-topic distribution directly from the learning process. To infer document topics, the topic distribution of each document is assumed to equal the expectation of the topic proportions of the word pairs generated from the document:

$$p(k \mid d) = \sum_{b} p(k \mid b)\, p(b \mid d),$$

and $p(k \mid b)$ can be calculated by the Bayesian formula:

$$p(k \mid b) = \frac{p(k)\, p(w_i \mid k)\, p(w_j \mid k)}{\sum_{k} p(k)\, p(w_i \mid k)\, p(w_j \mid k)} \quad \text{or} \quad p(k \mid b) = \frac{p(k)\, p(w_i \mid k)\, p(w_j \mid k)\, p(w_g \mid k)}{\sum_{k} p(k)\, p(w_i \mid k)\, p(w_j \mid k)\, p(w_g \mid k)},$$

where $p(k) = \theta_k$ and $p(w_i \mid k) = \varphi_{k, w_i}$. The term $p(b \mid d)$ can be calculated using a simple uniform distribution:

$$p(b \mid d) = \frac{n_d(b)}{\sum_{b} n_d(b)},$$

where $n_d(b)$ denotes the frequency of word pair $b$ in document $d$.
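A minimal sketch of this inference step follows, assuming theta and phi come from the Gibbs sampling sketch above; the smoothing constant for unseen words is an assumption added only for numerical safety.

```python
from math import prod

def doc_topic_distribution(doc_terms, theta, phi):
    """Infer p(k | d) as the expectation of the topic proportions of the
    document's word-pair terms (p(b | d) is uniform over those terms)."""
    K = len(theta)
    p_k_d = [0.0] * K
    for term in doc_terms:
        # p(k | b) proportional to theta_k * prod_w phi_{k, w}  (Bayes' rule)
        joint = [theta[k] * prod(phi[k].get(w, 1e-12) for w in term)
                 for k in range(K)]
        norm = sum(joint) or 1.0
        for k in range(K):
            p_k_d[k] += (joint[k] / norm) / len(doc_terms)
    return p_k_d
```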

Results

Two experiments are set up. The first is the small-data topic discovery experiment, in which small-scale data are added to public datasets to evaluate the model’s ability to discover topics. The second is the semantic representation quality experiment, in which quantitative metrics are used to evaluate the quality of the topics generated by the method.

Dataset

QA-data (Yan et al., 2013): This dataset is collected from the Baidu Zhidao website. THUCNews (http://thuctc.thunlp.org/): This is a news text classification dataset provided by the Tsinghua NLP group. 20NG (http://qwone.com/∼jason/20Newsgroups/): The 20 Newsgroups dataset is a standard dataset that includes 20 different topics, commonly used for topic modeling classification tasks.

In the preprocessing stage, for the Chinese dataset, we use the jieba tool to segment the texts, remove stop words, and discard texts shorter than two tokens. For the English dataset, we use the nltk tool to tokenize the texts, convert characters to lower case, discard texts shorter than two tokens, and filter out illegal characters. The dataset statistics are reported in Table 1.
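The sketch below mirrors the preprocessing just described; the stop-word list, the alphabetic-token filter used to drop illegal characters, and the function names are illustrative assumptions.

```python
import jieba
from nltk.tokenize import word_tokenize   # assumes the nltk punkt tokenizer models are installed

def preprocess_zh(texts, stopwords):
    """Chinese preprocessing: jieba segmentation, stop-word removal,
    and dropping texts shorter than two tokens."""
    docs = []
    for text in texts:
        tokens = [t for t in jieba.lcut(text) if t.strip() and t not in stopwords]
        if len(tokens) >= 2:
            docs.append(tokens)
    return docs

def preprocess_en(texts):
    """English preprocessing: nltk tokenization, lowercasing, filtering
    non-alphabetic tokens, and dropping texts shorter than two tokens."""
    docs = []
    for text in texts:
        tokens = [t.lower() for t in word_tokenize(text) if t.isalpha()]
        if len(tokens) >= 2:
            docs.append(tokens)
    return docs
```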

Table 1:
Dataset statistics.
Dataset Number of texts Vocabulary size Labels
QA-data 57,686 10,195 33
THUCNews 80,006 21,052 7
20NG 16,309 1,509 20
DOI: 10.7717/peerj-cs.2936/table-1

Baselines

(1) BTM (Yan et al., 2013): A topic model specifically designed for short texts, which captures word co-occurrences in word pairs (biterms) to discover latent topics. (2) PYSTM (Niu, Zhang & Li, 2021): A topic model that uses the Pitman-Yor process for self-aggregation. (3) NQTM (Wu et al., 2020): A topic model that combines negative sampling and quantization techniques for short text topic modeling. (4) NSTM (Zhao et al., 2020): A neural topic model that learns document topics by minimizing the optimal transport (OT) distance between word distributions in documents. (5) ETM (Dieng, Ruiz & Blei, 2020): A topic model that combines traditional topic modeling with word embeddings. (6) BERTopic (Grootendorst, 2022): A topic model that generates document embeddings from pre-trained transformer-based language models, clusters these embeddings, and finally generates topic representations using a class-based TF-IDF procedure. (7) TSCTM (Wu, Luu & Dong, 2022): A topic model that employs a new contrastive learning method with efficient positive and negative sampling strategies based on topic semantics.

Evaluation metrics

  • Topic coherence: This metric measures the quality of a topic by assessing the semantic coherence of the N words with the highest probability in each generated topic-word distribution. A higher coherence score indicates greater topic consistency. In the experiments, the widely used metrics Cv (Röder, Both & Hinneburg, 2015) and NPMI (Aletras & Stevenson, 2013) are employed to compute the coherence of the top N words:

    $$C_V(k) = \frac{1}{N}\sum_{i=1}^{N} s_{\cos}\big(\nu_{\mathrm{NPMI}}(w_i),\, \nu_{\mathrm{NPMI}}(w_{1:N})\big)$$

    $$\nu_{\mathrm{NPMI}}(w_i) = \{\mathrm{NPMI}(w_i, w_j)\}_{j=1,2,\ldots,N}$$

    $$\nu_{\mathrm{NPMI}}(w_{1:N}) = \Big\{\sum_{i=1}^{N} \mathrm{NPMI}(w_i, w_j)\Big\}_{j=1,2,\ldots,N}$$

  where $s_{\cos}$ denotes the cosine similarity function. For the whole corpus, we use the average Cv, $\frac{1}{K}\sum_{k=1}^{K} C_V(k)$, and the average NPMI, $\frac{1}{K}\sum_{k=1}^{K} \mathrm{NPMI}(k)$, to compute the coherence.

  • Purity and NMI: In the experiments, the topic with the highest probability in a document’s topic distribution is taken as the document’s predicted label. The Purity and NMI metrics are then used to assess the clustering efficacy of the model; higher Purity and NMI indicate better clustering performance (a short computation sketch follows this list):

    $$\mathrm{purity}(\Omega, C) = \frac{1}{n}\sum_{i=1}^{K} \max_{j} |w_i \cap c_j|$$

    $$\mathrm{NMI}(\Omega, C) = \frac{\sum_{i,j} \frac{|w_i \cap c_j|}{n} \log \frac{n\,|w_i \cap c_j|}{|w_i|\,|c_j|}}{\Big(-\sum_{i} \frac{|w_i|}{n}\log\frac{|w_i|}{n} - \sum_{j} \frac{|c_j|}{n}\log\frac{|c_j|}{n}\Big)\big/2}.$$
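The computation sketch mentioned above: the predicted label of a document is its most probable topic, purity is computed directly from the definition, and NMI is delegated to scikit-learn's normalized_mutual_info_score as a stand-in for the formula above. Integer-coded true labels are assumed.

```python
import numpy as np
from sklearn.metrics import normalized_mutual_info_score

def purity(true_labels, pred_labels):
    """Purity: each predicted cluster is credited with its most frequent true label."""
    true, pred = np.asarray(true_labels), np.asarray(pred_labels)
    correct = 0
    for cluster in np.unique(pred):
        members = true[pred == cluster]
        correct += np.bincount(members).max()
    return correct / len(true)

def evaluate_clustering(doc_topic, true_labels):
    """doc_topic: (n_docs, K) array of p(k | d); the predicted label is the argmax topic."""
    pred = np.argmax(np.asarray(doc_topic), axis=1)
    return purity(true_labels, pred), normalized_mutual_info_score(true_labels, pred)
```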

Evaluation of the MixTM in topic discovery

Quantitative metrics are used to measure the quality of the topics generated by the model. The same number of topics is set for MixTM and the comparison methods. Cv, NPMI, Purity and NMI are calculated and analysed to objectively assess and compare the semantic quality of the different algorithms.

The coherence scores of all models are shown in Tables 2, 3, 4 and Figs. 3, 4, 5. For the two short text datasets, QA-data and THUCNews, the proposed model achieves superior topic coherence compared with the alternative models. With respect to Cv scores, the highest values are obtained when the number of keywords is small; as the number of keywords increases, the Cv scores decline. This decline indicates that the hybrid word-pair approach introduces some irrelevant words, which disrupts topic coherence. Conversely, the proposed model consistently attains higher NPMI scores. NPMI is a normalised measure that attenuates the impact of extreme keywords on the coherence results, reducing the influence of low-frequency keywords and chance associations. These findings suggest that the proposed model produces more coherent and more stable results. For the normal-length text set 20NG, the proposed model demonstrates high coherence overall. Among the compared methods, NQTM and NSTM have higher Cv scores but lower NPMI scores, probably because incidentally co-occurring word pairs in the generated topic keywords inflate the Cv scores. BERTopic performs best on the 20NG dataset, likely because its pre-trained knowledge base can capture the jargon and complexity of 20NG; the longer texts contain sufficient semantic information for the model to learn better semantic embeddings and thus more coherent topic information.

Table 2:
Cv scores of QA-data.
Model k = 50 k = 100 k = 150
5 10 15 20 5 10 15 20 5 10 15 20
BTM 0.63 0.54 0.49 0.45 0.43 0.48 0.52 0.54 0.37 0.46 0.54 0.58
PYSTM 0.27 0.34 0.42 0.45 0.24 0.35 0.41 0.45 0.23 0.34 0.41 0.45
NQTM 0.32 0.51 0.61 0.67 0.27 0.52 0.64 0.70 0.30 0.54 0.71 0.73
NSTM 0.23 0.46 0.59 0.66 0.22 0.47 0.59 0.66 0.22 0.47 0.60 0.67
ETM 0.23 0.54 0.67 0.74 0.22 0.54 0.67 0.74 0.51 0.39 0.39 0.41
TSCTM 0.35 0.39 0.44 0.48 0.32 0.41 0.48 0.53 0.30 0.43 0.51 0.56
BERTopic 0.64 0.50 0.41 0.38 0.65 0.46 0.39 0.39 0.62 0.43 0.39 0.42
MixTM 0.69 0.57 0.49 0.45 0.67 0.50 0.41 0.36 0.66 0.49 0.40 0.36
DOI: 10.7717/peerj-cs.2936/table-2
Table 3:
Cv scores of THUCNews.
Model k = 50 k = 100 k = 150
5 10 15 20 5 10 15 20 5 10 15 20
BTM 0.40 0.44 0.50 0.54 0.33 0.45 0.53 0.59 0.27 0.43 0.54 0.61
PYSTM 0.31 0.28 0.32 0.34 0.31 0.28 0.31 0.34 0.29 0.29 0.31 0.34
NQTM 0.27 0.58 0.68 0.76 0.27 0.52 0.63 0.69 0.28 0.51 0.62 0.69
NSTM 0.30 0.60 0.72 0.78 0.30 0.60 0.72 0.78 0.30 0.60 0.72 0.78
ETM 0.27 0.58 0.70 0.77 0.27 0.58 0.70 0.77 0.46 0.39 0.41 0.45
TSCTM 0.30 0.51 0.62 0.68 0.28 0.52 0.64 0.70 0.27 0.54 0.65 0.71
BERTopic 0.57 0.40 0.37 0.39 0.55 0.38 0.38 0.42 0.53 0.37 0.39 0.44
MixTM 0.58 0.41 0.37 0.38 0.59 0.42 0.36 0.38 0.56 0.43 0.40 0.43
DOI: 10.7717/peerj-cs.2936/table-3
Table 4:
Cv scores of 20NG.
Model k = 20 k = 30 k = 50
5 10 15 20 5 10 15 20 5 10 15 20
BTM 0.70 0.66 0.65 0.63 0.71 0.65 0.63 0.62 0.69 0.64 0.62 0.60
PYSTM 0.66 0.62 0.64 0.69 0.62 0.59 0.61 0.66 0.60 0.55 0.57 0.62
NQTM 0.42 0.29 0.24 0.23 0.42 0.30 0.26 0.25 0.41 0.27 0.25 0.25
NSTM 0.64 0.55 0.52 0.48 0.63 0.54 0.51 0.47 0.62 0.53 0.48 0.46
ETM 0.66 0.60 0.58 0.57 0.65 0.60 0.58 0.57 0.63 0.58 0.56 0.54
TSCTM 0.76 0.71 0.68 0.65 0.71 0.64 0.61 0.58 0.68 0.60 0.56 0.52
BERTopic 0.79 0.73 0.67 0.63 0.77 0.69 0.63 0.60 0.73 0.64 0.59 0.55
MixTM 0.70 0.67 0.64 0.63 0.71 0.68 0.67 0.65 0.72 0.68 0.65 0.63
DOI: 10.7717/peerj-cs.2936/table-4
Figure 3: NPMI scores on QA-data datasets.

Figure 4: NPMI scores on THUCNews datasets.

Figure 5: NPMI scores on 20NG datasets.

The clustering results (Figs. 6, 7, 8) show that the proposed approach improves text clustering compared with the previous models. Both BTM and our model use word pairs for text modelling, so the size of a word’s contextual range is a determining factor: the larger the contextual range, the more co-occurrence patterns it contains, which benefits topic identification and document clustering. Our model extends word-pair construction from two to three words, and the shared word-pair construction window therefore captures richer word co-occurrence information; this leads to better clustering performance than BTM. However, the PYSTM and NQTM models perform poorly on the 20NG dataset. This may be because the plain-text dataset contains a lot of repetitive information, and overfitting can prevent the model from capturing the differences between similar document labels, causing all documents to be clustered into the same category. The ETM approach may perform weakly due to insufficient embedding quality.

Figure 6: Clustering scores of QA-data dataset.

Figure 7: Clustering scores of THUCNews dataset.

Figure 8: Clustering scores of 20NG.

Efficiency analysis

The computational cost on the 20NG and QA-data datasets was measured. The 20NG dataset consists of normal-length texts, whereas QA-data consists of short texts. All models were run under the same hardware conditions. Each model was executed for 100 iterations, and the mean time per iteration was computed. The results are presented in Table 5.

Table 5:
Average time of each iteration.
Models Average time per iteration (s), 20NG Average time per iteration (s), QA-data
BTM 23.51 2.14
PYSTM 9.34 29.11
NQTM 11.15 0.34
NSTM 8.56 35
ETM 1.84 29.29
BERTopic 0.32 0.83
TSCTM 0.71 11.35
MixTM 68.98 3.37
DOI: 10.7717/peerj-cs.2936/table-5

The most computationally efficient model is BERTopic, on both normal-length and short texts. This is most likely because it uses pre-trained BERT to generate document embeddings, which makes the subsequent clustering (UMAP+HDBSCAN) computationally lightweight. The proposed model, because it reconstructs the text into a collection of mixed word pairs, takes longer to train: the longer each text is, the more word pairs are produced. Accordingly, the mean time per iteration of the proposed model is larger on the 20NG dataset and smaller on QA-data.

Evaluation of small-scale data

In the experiment, 10 short text samples related to the topic of “Dongpo-Pork” were added to both the QA-data and THUCNews datasets. The added short texts describe the characteristics of Dongpo-Pork, its historical origins, and everyday cooking methods. The models were trained with K = 50, 100, and 150 topics to identify topics related to the small-scale data, and the top 15 words are displayed. These results serve as the evaluation basis for topic discovery in small-scale data. Furthermore, the diversity metric was used to evaluate the quality of the generated topics.

The results of the small-scale topic discovery experiment are presented in Tables 6 and 7. The more relevant words a generated topic contains, and the higher those words rank, the better the topic’s semantic representation, indicating a stronger ability to identify small-scale topics. The results show that NSTM and ETM were unable to identify keywords related to the “Dongpo-Pork” topic. In the NQTM and BTM models, related words appeared, but they were primarily noise words for the current topic. While the PYSTM model did identify some relevant topics, the generated topics were mixed. In contrast, the present method successfully identified relevant topic words such as “Dongpo-Pork,” “pork,” “officialdom,” “Chinese,” and “price,” and produced the most relevant words across different topics, with the highest rankings. These findings indicate that the proposed model outperforms existing methods in identifying topics from limited data.

Table 6:
Topics selected by the Word “Dongpo-Pork” in the QA-data collections.
Model K Top15 words
BTM 50 Opera, drama, dramas, culture, color, representation, makeup, art, distinction, character, stage, center, Dongpo-Pork, legacy, song
100 Opera, drama, makeup, color, culture, Dongpo-Pork, stage, distinction, song, character, section, legacy, type, technique, image
150 Opera, drama, dramas, culture, makeup, art, color, center, character, stage, Dongpo-Pork, substance, image, distinction, movie
PYSTM 50 Event, crisis, nutrition, Dongpo-Pork, banana, pass, pork, cuisine, sister, textbook, milk powder, Chinese, water, citizen
100 Shop, Dongpo-Pork, mall, pork, cuisine, counter, people, thought, floor, reason, water, conviction, seller, cake, register
150 Fraction, Dongpo-Pork, cuisine, reality, pork, tidying, decimal, addition, rat, elephant, differ, woman, people, article, fraction
NQTM 100 List, link, pass, maritime, medicine, strength, trademark, minister, Dongpo-Pork, practice, root, oil, hardware, institution, loan
MixTM 50 Number, company, message, phone, address, name, proof, boss, business, record, account, station, agent, student, Dongpo-Pork
100 Brand, model, fancy, man, white, Dongpo-Pork, school, color, student, price, pork, method, teacher, clothes, number
150 Dongpo-Pork, pork, office, Chinese, price, type, lead, Hangzhou, adult, citizen, China, cuisine, dish, function, tradition
MixTM-G 50 Tradition pork Dongpo-Pork price trigger soft sterilization function bright husband cuisine Hangzhou office China lead
100 Pork tradition Dongpo-Pork price soft husband trigger function bright husband cuisine Hangzhou office China lead
150 Pork Dongpo-Pork tradition soft sterilization function bright husband cuisine trigger price Hangzhou office China lead
DOI: 10.7717/peerj-cs.2936/table-6
Table 7:
Topics selected by the word “Dongpo-Pork” in the THUCNews collections.
Model K Top15 words
BTM 50 Value index consumer rate sale core annual order retail goods college house finally business Dongpo-Pork
100 Value rate sale person consume index consumer import core annual order retail Dongpo-Pork goods college house
PYSTM 150 Gum pork Dongpo-Pork tax oilfield house transport matrix win machine price base pipe retail point
MixTM 50 Earth magnet railway concept magnets people stock Xinhua impact shares Dongpo-Pork sport GuoHeng dragon price
100 Passion mother speed jersey white number stack Dongpo-Pork advice love handsome wizards desert image Putian
150 Pork Dongpo-Pork office price type lead home blanching adult citizen China trade dish salary tradition
MixTM-G 50 Trust member Dongpo-Pork trigger soft color appetizer cycle soup Hangzhou price handware foresight kiss lead
100 Dish pork Dongpo-Pork soft sterilization trigger appetizer bright cuisine Hangzhou price husband fondness China lead
150 Pork Dongpo-Pork soft dish sterilization appetizer bright cuisine Hangzhou trigger price husband fondness China lead
DOI: 10.7717/peerj-cs.2936/table-7

A comparison of MixTM and MixTM-G reveals that incorporating graph modelling enables the model to identify potential small-scale topics, even with a limited number of topics, and helps it stabilise at an optimal state. This is likely due to the fact that graph modelling effectively captures the relationships between words, thereby rendering the model more flexible in its handling of small-scale topics. Representing words and their interconnections as a graph structure enables the model to detect potential correlations and, even in conditions where data is sparse, extract features of small-scale topics. The graph structure facilitates the model’s capacity to manage noise and uncertainty, thereby ensuring more consistent learning outcomes across varying numbers of topics. This enhanced stability contributes to the model’s reliability in practical applications and facilitates more accurate reflection of latent topics in text. In summary, the integration of graph modelling into MixTM-G enhances its capacity to identify small-scale topics, while concurrently ensuring the model’s stability and efficacy across diverse contexts.

Conclusions

This study proposes a topic modelling method specifically designed for small-scale data in short texts. The method represents texts with a graph model and leverages correlations between words, rather than data volume, to form strong semantic clusters. The experimental results demonstrate that the method outperforms the comparison algorithms in topic discovery on small-scale data. Furthermore, given that short-text data in practical applications frequently arrives as data streams, future research could extend the approach to dynamic topic modelling, providing useful references for practical applications in related fields.

Supplemental Information

Code and datasets.

DOI: 10.7717/peerj-cs.2936/supp-1