A dynamic density-based clustering method based on K-nearest neighbor

Abstract

Many density-based clustering algorithms proposed in the literature can find clusters with different shapes, sizes, and densities, and they also detect noise points well. However, many of these methods require static input parameters that must be defined by the user. Since it is difficult for users to determine these parameters in large data sets, setting them properly plays an important role in obtaining a suitable clustering. A challenge in this domain is therefore how to reduce the number of input parameters and thereby the errors caused by user involvement. To handle this challenge, a dynamic density-based clustering (DDBC) method is proposed in this paper; it requires the smallest number of user-set parameters because most of them are determined automatically. The method can dynamically distinguish close clusters with different densities. Additionally, it can detect outliers and noise points before the clustering process starts, without scanning these points. Several real and artificial data sets were used to examine the efficiency of the proposed method, and its outcomes were compared with those of other algorithms in this domain. The comparative results confirm the acceptable performance of DDBC and its higher accuracy in clustering tasks.
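A minimal sketch of the kind of K-nearest-neighbor density estimate that KNN-driven density-based clustering builds on; the neighborhood size k and the density definition (inverse mean distance to the k nearest neighbors) are illustrative assumptions, not the exact choices made in DDBC.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_density(X, k=10):
    """Return a per-point density score: higher means a denser neighborhood."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)   # +1 because each point is its own neighbor
    dist, _ = nn.kneighbors(X)                        # shape (n, k+1); first column is 0
    mean_dist = dist[:, 1:].mean(axis=1)              # average distance to the k true neighbors
    return 1.0 / (mean_dist + 1e-12)

# Points whose density falls far below that of their neighbors can be flagged
# as noise before clustering starts, in the spirit of the pre-clustering
# outlier handling described in the abstract.
```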

Transfer learning for concept drifting data streams in heterogeneous environments

Abstract

Learning in non-stationary environments remains challenging because the underlying probability distributions are dynamic and unknown. The issue is even more problematic when labeled data are lacking for a specific domain, which makes labeled data from a related but different domain highly valuable. This paper addresses streaming data classification and introduces a heterogeneous unsupervised domain adaptation method. To handle the uncertainty caused by distribution discrepancy and concept drift, the proposed method prioritizes the target-domain data with the highest uncertainty, as they indicate changes in the data distribution. It uses fuzzy-based feature-level adaptation and optimizes its parameters through accelerated optimization. Additionally, it employs instance selection in the source domain to identify qualified samples, further enhancing classification and adaptation. Three settings of the proposed method are configured, and five state-of-the-art methods are selected as competitors. Across different types of concept drift, experiments on four benchmark datasets demonstrate the superiority of the proposed method in terms of accuracy and computational time. The Wilcoxon statistical test confirms a statistically significant difference between the evaluation results of the proposed method and those of the competing methods.
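A hedged sketch of one ingredient mentioned above: prioritizing the target-domain samples whose predictions are most uncertain, since high uncertainty signals a possible change in the data distribution. Predictive entropy is used here as an assumed uncertainty measure; the paper's actual fuzzy adaptation and optimization steps are not reproduced.

```python
import numpy as np

def most_uncertain(proba, budget=50):
    """proba: (n_samples, n_classes) predicted class probabilities for target data.
    Returns indices of the `budget` samples with the highest predictive entropy."""
    eps = 1e-12
    entropy = -(proba * np.log(proba + eps)).sum(axis=1)
    return np.argsort(entropy)[::-1][:budget]

# Usage idea: feed the selected target samples into the adaptation step first,
# since they are the most likely to reflect a drifted distribution.
```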

Similarity enhancement of heterogeneous networks by weighted incorporation of information

Abstract

In many real-world datasets, different aspects of information are combined, so the data is usually represented as a heterogeneous graph whose nodes and edges have different types. Learning representations in heterogeneous networks is an important topic, since embedding methods can be used to extract important details from such networks. In this paper, we introduce a new framework for embedding heterogeneous graphs. Our model relies on weighted heterogeneous networks with a star structure that take structural and attribute similarity into account as well as semantic knowledge. The target nodes form the center of the star, and the different attributes of the target nodes form its points. The edge weights are calculated from three aspects: natural language processing of the texts, the relationships between different attributes of the dataset, and the co-occurrence of each attribute pair in the target nodes. We strengthen the similarities between the target nodes by examining the latent connections between the attribute nodes; we find these indirect connections by considering approximate shortest paths between the attributes. By applying the effect of the star components to the central component, the heterogeneous network is reduced to a homogeneous graph with enhanced similarities, which we can then embed to capture similar target nodes. We evaluate our framework on the clustering task and show that our method is more accurate than previous unsupervised algorithms on real-world datasets.
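A minimal sketch (not the authors' implementation) of collapsing a star-structured heterogeneous network into a weighted homogeneous graph over the target nodes: two target nodes become connected with a weight accumulated from the attribute nodes they share. The per-edge weights `w[(target, attr)]` are assumed to be given by the three weighting aspects described above.

```python
from collections import defaultdict
from itertools import combinations

def collapse_star(attr_to_targets, w):
    """attr_to_targets: dict mapping an attribute node to the target nodes linked to it.
    w: dict mapping (target, attribute) to its edge weight in the heterogeneous graph.
    Returns a dict mapping (target_i, target_j) to a similarity weight in the homogeneous graph."""
    homo = defaultdict(float)
    for attr, targets in attr_to_targets.items():
        for u, v in combinations(sorted(targets), 2):
            homo[(u, v)] += w[(u, attr)] * w[(v, attr)]   # a shared attribute strengthens similarity
    return dict(homo)
```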

Mining Top-K constrained cross-level high-utility itemsets over data streams

Abstract

Cross-Level High-Utility Itemset Mining (CLHUIM) aims to discover interesting relationships between hierarchy levels by introducing a taxonomy of items. Because current CLHUIM algorithms struggle with large search spaces, researchers have proposed mining Top-K cross-level high-utility itemsets (CLHUIs). However, the results obtained by these methods often contain redundant itemsets with large differences in hierarchy level and a large proportion of itemsets at higher abstraction levels, so they neglect detailed information and cannot report itemsets within a specified hierarchy range. Additionally, they cannot handle dynamic transactional data. To address these problems, this paper proposes the Top-K Constrained Cross-Level High-Utility Itemset Mining (TKCCLHM) algorithm to efficiently mine Top-K itemsets across different hierarchy levels over data streams. First, a new hierarchical-level concept is introduced to control the abstraction level of the introduced items, and Top-K itemsets are mined within a specific hierarchy range based on this concept. Second, a sliding-window-based data structure called the Sliding Window-based Utility Projection List (SUPL) is designed and combined with transaction projection techniques to mine CLHUIs efficiently. Last, a Batch and Utility Hash Table (BUHT) structure that stores batch and (generalized) item utility information is proposed, along with a new threshold-raising strategy. Extensive experiments on six datasets with taxonomy information demonstrate that the proposed algorithm achieves significant improvements in runtime and scalability compared to state-of-the-art algorithms.
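A simplified, hedged sketch of two generic ideas mentioned above: a sliding window over transaction batches and a Top-K structure whose smallest retained utility acts as the rising threshold. Real CLHUI mining enumerates itemsets across taxonomy levels with projection lists; here only per-item utilities are aggregated, purely for illustration.

```python
import heapq
from collections import deque, defaultdict

def topk_item_utilities(batches, window_size=3, k=5):
    """batches: iterable of batches; each batch is a list of transactions,
    and each transaction is a list of (item, utility) pairs."""
    window = deque(maxlen=window_size)           # the oldest batch is evicted automatically
    for batch in batches:
        window.append(batch)
        utility = defaultdict(float)
        for b in window:
            for transaction in b:
                for item, u in transaction:
                    utility[item] += u
        topk = heapq.nlargest(k, utility.items(), key=lambda kv: kv[1])
        threshold = topk[-1][1] if len(topk) == k else 0.0   # the raised threshold prunes candidates
        yield topk, threshold
```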

JMFEEL-Net: a joint multi-scale feature enhancement and lightweight transformer network for crowd counting

Abstract

Crowd counting based on convolutional neural networks (CNNs) has made significant progress in recent years. However, the limited receptive field of CNNs makes it difficult to capture global features for comprehensive contextual modeling, resulting in insufficient accuracy in count estimation. In comparison, vision transformer (ViT)-based counting networks have demonstrated remarkable performance by exploiting their powerful global contextual modeling capabilities, but ViT models come with higher computational cost and training difficulty. In this paper, we propose a novel network named JMFEEL-Net, which uses joint multi-scale feature enhancement and a lightweight transformer to improve crowd counting accuracy. Specifically, we use a high-resolution CNN as the backbone network to generate high-resolution feature maps. In the back-end network, we propose a multi-scale feature enhancement module to address the low recognition accuracy caused by multi-scale variations, especially when counting small-scale objects in dense scenes. Furthermore, we introduce an improved lightweight ViT encoder to effectively model complex global contexts. We also adopt a multi-density map supervision strategy to learn crowd distribution features from feature maps of different resolutions, thereby improving the quality of the density maps and the training efficiency. To validate the effectiveness of the proposed method, we conduct extensive experiments on four challenging datasets, namely ShanghaiTech Part A/B, UCF-QNRF, and JHU-Crowd++, and achieve very competitive counting performance.
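An illustrative PyTorch sketch of a generic multi-scale feature enhancement block in the spirit described above: parallel convolutions with different dilation rates enlarge the receptive field and are fused back into the input channels. The channel size and dilation rates are assumptions, not JMFEEL-Net's actual design.

```python
import torch
import torch.nn as nn

class MultiScaleEnhance(nn.Module):
    def __init__(self, channels=64, dilations=(1, 2, 3)):
        super().__init__()
        # One 3x3 branch per dilation rate; padding = dilation keeps the spatial size.
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, kernel_size=3, padding=d, dilation=d)
            for d in dilations
        )
        self.fuse = nn.Conv2d(channels * len(dilations), channels, kernel_size=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        multi = torch.cat([self.relu(b(x)) for b in self.branches], dim=1)
        return x + self.fuse(multi)               # residual fusion keeps the original features

# Example: enhanced = MultiScaleEnhance()(torch.randn(1, 64, 96, 128))
```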

Sentiment analysis of tweets using text and graph multi-views learning

Abstract

With the surge of deep learning frameworks, various studies have attempted to address the challenges of tweet sentiment analysis (data sparsity, under-specificity, noise, and multilingual content) through text- and network-based representation learning approaches. However, few studies have combined the benefits of textual and structural (graph) representations for sentiment analysis of tweets. This study proposes a multi-view learning framework (end-to-end and ensemble-based) that leverages both text-based and graph-based representation learning to enrich the tweet representation for sentiment classification. The efficacy of the proposed framework is evaluated on three datasets against suitable baseline counterparts. The experimental studies show that combining the textual and structural views achieves better performance on sentiment classification tasks than its counterparts.
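A hedged sketch of the ensemble-style fusion idea above: a tweet's text embedding and its node (graph) embedding are concatenated into one multi-view representation before classification. The embedding sources are placeholders; any sentence encoder and any node-embedding method could supply them.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fuse_views(text_emb, graph_emb):
    """text_emb: (n_tweets, d_text) text embeddings; graph_emb: (n_tweets, d_graph) node embeddings."""
    return np.concatenate([text_emb, graph_emb], axis=1)

# Usage idea:
# X = fuse_views(text_emb, graph_emb)
# clf = LogisticRegression(max_iter=1000).fit(X, y)   # y: sentiment labels
```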

Session-based recommendation with fusion of hypergraph item global and context features

Abstract

Session-based recommendation (SBR) aims to predict the items that users are likely to click next based on their recent click history. Learning item features from existing session data to capture users' current preferences is the main problem in the session-based recommendation domain, and fusing global and local information is an effective way to learn these preferences more accurately. In this paper, we propose a session-based recommendation model with fusion of hypergraph item global and context features (FHGIGC), which learns users' current preferences by fusing item global and contextual features. Specifically, the model first constructs a global hypergraph and a local hypergraph and uses a hypergraph neural network to learn global and local item features from relevant session information and item contextual information, respectively. Then, the learned features are fused by an attention mechanism to obtain the final item and session features. Finally, personalized recommendations are generated for users based on the fused features. Experiments conducted on three session-based recommendation datasets demonstrate that the FHGIGC model improves the accuracy of recommendations.
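A minimal sketch of one generic hypergraph convolution step of the kind such models build on: sessions act as hyperedges over the items they contain, and item features are smoothed via the normalized incidence matrix, X' = D_v^{-1} H D_e^{-1} H^T X. This is a standard formulation, not the FHGIGC model itself.

```python
import numpy as np

def hypergraph_conv(sessions, n_items, X):
    """sessions: list of item-index lists (each session is one hyperedge).
    X: (n_items, d) item feature matrix. Returns the smoothed item features."""
    H = np.zeros((n_items, len(sessions)))
    for e, items in enumerate(sessions):
        H[items, e] = 1.0                          # incidence: item belongs to session (hyperedge) e
    d_v = np.maximum(H.sum(axis=1), 1.0)           # node (item) degrees
    d_e = np.maximum(H.sum(axis=0), 1.0)           # hyperedge (session) degrees
    return (H / d_v[:, None]) @ ((H / d_e).T @ X)  # D_v^{-1} H D_e^{-1} H^T X
```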

Discriminative boundary generation for effective outlier detection

Abstract

Outlier detection is often considered challenging due to the inherent class imbalance in datasets, where the small number of available outliers is insufficient to describe their overall distribution. This makes it difficult for classifiers to learn the demarcation (boundary) between normal samples and outliers, which is key to accurate detection. In this paper, we propose a novel discriminative boundary generation framework, called BoG. The framework extracts the border samples of the dataset and expands them to form initial boundary outliers. Through adversarial training with a GAN, the boundary outliers are further augmented and, together with the boundary normal data, provide valuable demarcation information for the classifier. Two method variants are proposed under the BoG framework to balance detection efficiency and effectiveness. Extensive experiments show that the proposed framework achieves significant improvements over existing outlier detection methods.
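A hedged sketch of the first stage described above: border samples of the normal data are located (here via a large average distance to their k nearest neighbors) and perturbed with noise to form initial boundary outliers. The GAN-based augmentation stage is omitted, and k, the quantile, and the noise scale are illustrative assumptions rather than the paper's settings.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def initial_boundary_outliers(X_normal, k=10, quantile=0.9, noise_scale=0.1, seed=0):
    """Return (border samples, expanded boundary outliers) for a set of normal samples."""
    rng = np.random.default_rng(seed)
    dist, _ = NearestNeighbors(n_neighbors=k + 1).fit(X_normal).kneighbors(X_normal)
    score = dist[:, 1:].mean(axis=1)                       # sparse neighborhood => likely border point
    border = X_normal[score >= np.quantile(score, quantile)]
    expanded = border + rng.normal(scale=noise_scale, size=border.shape)
    return border, expanded
```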

Mining technology trends in scientific publications: a graph propagated neural topic modeling approach

Abstract

The past decades have witnessed significant progress in scientific research, where new technologies emerge and traditional technologies constantly evolve. As a critical task in the Science of Science (SciSci), automatically mining technology trends from massive collections of scientific publications has attracted broad research interest across communities. While existing approaches achieve remarkable performance, many critical challenges remain, such as data sparsity, cross-document influence, and temporal dependency. To this end, this paper proposes a technical-term-based, graph-propagated neural topic model for mining technology trends in scientific publications. Specifically, we first use the documents' citation relations and technical terms to construct a heterogeneous graph. Then, we design a term propagation network that spreads the technical terms over the heterogeneous graph to overcome their sparseness. In addition, we develop a dynamic embedded topic modeling method to capture the cross-document temporal dependencies of technical terms, which can reveal how the distribution of technical terms changes over time. Finally, extensive experiments on real-world scientific datasets validate the effectiveness and interpretability of our approach compared with state-of-the-art baselines.
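A minimal sketch of the term-propagation idea above: sparse technical-term weights are spread along citation links by multiplying with a row-normalized adjacency matrix, so a document inherits part of the term signal of the papers it cites. The mixing coefficient alpha and the number of propagation steps are assumed hyperparameters, and the full neural topic model is not reproduced.

```python
import numpy as np

def propagate_terms(A, T, alpha=0.5, steps=2):
    """A: (n_docs, n_docs) citation adjacency matrix.
    T: (n_docs, n_terms) document-term weight matrix (typically very sparse)."""
    row_sum = np.maximum(A.sum(axis=1, keepdims=True), 1.0)
    P = A / row_sum                                        # row-stochastic propagation matrix
    out = T.copy()
    for _ in range(steps):
        out = (1 - alpha) * T + alpha * (P @ out)          # keep original terms, mix in neighbors'
    return out
```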