Top-K Pseudo Labeling for Semi-Supervised Image Classification

In this paper, a top-k pseudo labeling method for semi-supervised self-learning is proposed. Pseudo labeling is a key technique in semi-supervised self-learning: the quality of the generated pseudo labels largely determines both the convergence of the neural network and the accuracy it attains. The authors use top-k pseudo labeling to generate pseudo labels during the training of a semi-supervised neural network model. The proposed labeling method substantially helps the network learn features from unlabeled data, is easy to implement, and relies only on the network's predictions and the hyper-parameter k. Experimental results show that the proposed method works well for semi-supervised learning on the CIFAR-10 and CIFAR-100 datasets. In addition, a variant of top-k labeling for supervised learning, named top-k regulation, is proposed; experiments show that various models achieve higher test-set accuracy when trained with top-k regulation.
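For illustration, the following is a minimal sketch of one plausible top-k labeling rule, assuming PyTorch and a uniform spread of label mass over the k most probable classes; the paper's exact selection and weighting criteria may differ.

```python
import torch
import torch.nn.functional as F

def topk_soft_pseudo_labels(logits: torch.Tensor, k: int = 3) -> torch.Tensor:
    """Hypothetical top-k pseudo-labeling rule (an assumption, not the
    paper's exact formulation): spread each unlabeled sample's label mass
    uniformly over its k most probable classes."""
    probs = F.softmax(logits, dim=1)
    _, topk_idx = probs.topk(k, dim=1)        # indices of the k best classes
    targets = torch.zeros_like(probs)
    targets.scatter_(1, topk_idx, 1.0 / k)    # uniform mass on the top-k
    return targets                            # train with soft cross-entropy

# usage: pseudo-label a batch of unlabeled images with the current model
# logits = model(unlabeled_batch)
# targets = topk_soft_pseudo_labels(logits, k=3)
```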

Spatiotemporal Data Prediction Model Based on a Multi-Layer Attention Mechanism

Spatiotemporal data prediction is of great significance in the fields of smart cities and smart manufacturing. Current spatiotemporal data prediction models rely heavily on traditional spatial views or a single temporal granularity, and therefore miss important knowledge such as dynamic spatial correlations, periodicity, and mutability. This paper addresses these challenges by proposing a multi-layer attention-based predictive model. The key idea is to use a multi-layer attention mechanism to model the dynamic spatial correlation of different features; multi-granularity historical features are then fused to predict future spatiotemporal data. Experiments on real-world data show that the proposed model outperforms six state-of-the-art benchmark methods.
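A minimal sketch of this general architecture, assuming PyTorch; the layer sizes, number of granularities, and fusion scheme are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class MultiLayerSpatialAttention(nn.Module):
    """Stacked attention layers model dynamic correlations across spatial
    locations; features from several temporal granularities (e.g., recent
    and daily histories) are then fused for the final prediction."""
    def __init__(self, d_model=64, n_heads=4, n_layers=2, n_granularities=2):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            for _ in range(n_layers)
        )
        self.fuse = nn.Linear(n_granularities * d_model, d_model)
        self.head = nn.Linear(d_model, 1)

    def forward(self, granularity_feats):
        # granularity_feats: list of (batch, n_locations, d_model) tensors
        pooled = []
        for x in granularity_feats:
            for attn in self.layers:          # dynamic spatial correlation
                x, _ = attn(x, x, x)
            pooled.append(x)
        fused = self.fuse(torch.cat(pooled, dim=-1))
        return self.head(fused)               # (batch, n_locations, 1)
```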

Estimating the Number of Clusters in High-Dimensional Large Datasets

Clustering is a basic primitive of exploratory data analysis. To obtain valuable results, the parameters of the clustering algorithm, in particular the number of clusters, must be set appropriately. Existing methods for determining the number of clusters perform well on small low-dimensional datasets, but effectively determining the optimal number of clusters on large high-dimensional datasets remains a challenging problem. In this paper, the authors design a method that overcomes the shortcomings of existing estimation methods and accurately and quickly estimates the optimal number of clusters on large-scale high-dimensional datasets. Extensive experiments show that it (1) outperforms existing estimation methods in accuracy and efficiency, (2) generalizes across different datasets, and (3) is suitable for high-dimensional large datasets.
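The authors' method is not reproduced here, but the following baseline sketch shows the kind of estimation it improves upon: maximizing the silhouette score of k-means over candidate values of k on a random subsample, where subsampling keeps the cost tolerable on large datasets.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def estimate_k(X, k_max=15, sample_size=2000, seed=0):
    """Baseline sketch (not the authors' method): pick the number of
    clusters maximizing the silhouette score of k-means on a subsample."""
    rng = np.random.default_rng(seed)
    if len(X) > sample_size:                  # subsample large datasets
        X = X[rng.choice(len(X), sample_size, replace=False)]
    scores = {}
    for k in range(2, k_max + 1):
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    return max(scores, key=scores.get)        # k with the best silhouette
```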

A New Outlier Detection Algorithm Based on Fast Density Peak Clustering Outlier Factor

Outlier detection is an important field in data mining, with applications in fraud detection, fault detection, and other areas. This article addresses two problems of the density peak clustering algorithm: it requires manual parameter setting and its time complexity is high. First, the density estimate of density peak clustering is replaced by a k-nearest-neighbors estimate, using a KD-tree index structure to compute the k nearest neighbors of each data object. Cluster centers are then selected automatically as the objects maximizing the product of density and distance. In addition, the central relative distance and the fast density peak clustering outlier factor are defined to characterize the outlier degree of data objects, and an outlier detection algorithm based on this factor is devised. Experiments on artificial and real datasets validate the effectiveness and time efficiency of the proposed algorithm in comparison with several conventional and recent algorithms.
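A sketch of the two ingredients described above, assuming scikit-learn's KD-tree and one common convention for the kNN density (the inverse of the mean neighbor distance); the paper's exact formulas may differ.

```python
import numpy as np
from sklearn.neighbors import KDTree

def knn_density(X, k=10):
    """kNN density via a KD-tree: objects whose k nearest neighbors are
    close receive a high density (inverse mean neighbor distance)."""
    dist, _ = KDTree(X).query(X, k=k + 1)   # first neighbor is the point itself
    return 1.0 / dist[:, 1:].mean(axis=1)

def density_peak_scores(X, k=10):
    """gamma = density * distance to the nearest denser object; the objects
    with the largest gamma are selected automatically as cluster centers."""
    rho = knn_density(X, k)
    n = len(X)
    delta = np.full(n, np.inf)
    for i in range(n):
        denser = rho > rho[i]
        if denser.any():
            delta[i] = np.linalg.norm(X[denser] - X[i], axis=1).min()
    delta[np.isinf(delta)] = delta[np.isfinite(delta)].max()  # densest point
    return rho * delta
```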

Iterative and Semi-Supervised Design of Chatbots Using Interactive Clustering

Chatbots represent a promising tool to automate the processing of requests in a business context. However, despite major progress in natural language processing technologies, constructing a dataset deemed relevant by business experts is a manual, iterative, and error-prone process. To assist these experts during modelling and labelling, the authors propose an active learning methodology coined Interactive Clustering. It relies on interactions between computer-guided segmentation of the data into intents and response-driven human annotations that impose constraints on clusters to improve relevance. This article applies Interactive Clustering to a realistic dataset and measures the optimal settings required for a relevant segmentation in a minimal number of annotations. The usability of the method is discussed in terms of computation time and of the compromise achieved between business relevance and classification performance during training. In this context, Interactive Clustering appears as a suitable methodology combining human and computer initiatives to efficiently develop a usable chatbot.
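A heavily simplified sketch of such a loop; `annotate` is a hypothetical callback standing in for the business expert, and the crude must-link handling below is an illustration rather than the authors' constrained-clustering procedure.

```python
import numpy as np
from sklearn.cluster import KMeans

def interactive_clustering(X, n_intents, annotate, rounds=5, seed=0):
    """Sketch: cluster utterance embeddings, ask the expert to annotate a
    few pairs, then re-cluster with must-linked texts pulled together.
    annotate(labels) -> (must_link, cannot_link) lists of index pairs."""
    X = X.copy()
    for _ in range(rounds):
        labels = KMeans(n_clusters=n_intents, n_init=10,
                        random_state=seed).fit_predict(X)
        must_link, cannot_link = annotate(labels)
        for i, j in must_link:               # crude constraint enforcement:
            mean = (X[i] + X[j]) / 2         # move must-linked pairs toward
            X[i], X[j] = mean, mean          # each other before re-clustering
        violated = sum(labels[i] == labels[j] for i, j in cannot_link)
        if not must_link and violated == 0:  # expert satisfied: stop
            break
    return labels
```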

Boat Detection in Marina Using Time-Delay Analysis and Deep Learning

An autonomous acoustic system based on two bottom-moored hydrophones, a two-input audio board, and a small single-board computer was installed at the entrance of a marina to detect entering and exiting boats. The system computes windowed time-lagged cross-correlations to find the consecutive time delays between the hydrophone signals and derives a signal that is a function of the boats' angular trajectories. Since its installation, the single-board computer has performed online prediction with a signal-processing-based algorithm that achieved an accuracy of 80%. To improve system performance, a convolutional neural network (CNN) is trained with the acquired data to perform real-time detection. Two classification tasks were considered (binary and multiclass) to detect both a boat's presence and its direction of navigation. Finally, a trained CNN was implemented on a single-board computer to ensure that prediction can be performed in real time.
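A sketch of the windowed time-lagged cross-correlation step, assuming NumPy; the window and hop sizes are illustrative.

```python
import numpy as np

def windowed_time_delays(sig_a, sig_b, fs, win_s=1.0, hop_s=0.5):
    """For each analysis window, the lag maximizing the cross-correlation
    between the two hydrophone signals estimates the inter-sensor time
    delay; its evolution over time tracks a boat's angular trajectory."""
    win, hop = int(win_s * fs), int(hop_s * fs)
    delays = []
    for start in range(0, len(sig_a) - win, hop):
        a = sig_a[start:start + win]
        b = sig_b[start:start + win]
        corr = np.correlate(a - a.mean(), b - b.mean(), mode="full")
        lag = corr.argmax() - (win - 1)      # best lag in samples
        delays.append(lag / fs)              # convert to seconds
    return np.array(delays)
```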

Efficient Open Domain Question Answering With Delayed Attention in Transformer-Based Models

Open Domain Question Answering (ODQA) over a large-scale corpus of documents (e.g., Wikipedia) is a key challenge in computer science. Although Transformer-based language models such as BERT have shown an ability to outperform humans at extracting answers from small pre-selected passages of text, they suffer from high complexity when the search space is much larger. The most common way to deal with this problem is to add a preliminary information retrieval step that strongly filters the corpus and keeps only the relevant passages. In this article, the authors consider a more direct and complementary solution, which consists in restricting the attention mechanism in Transformer-based models to allow a more efficient management of computations. The resulting variants are competitive with the original models on the extractive task and, in the ODQA setting, allow a significant acceleration of predictions and sometimes even an improvement in answer quality.
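A rough sketch of the idea with a standard PyTorch encoder layer: early layers mask question-passage interactions, so passage representations can be precomputed and reused across questions, while later layers restore full attention. The exact masking scheme below is an assumption, not the paper's precise method.

```python
import torch
import torch.nn as nn

def delayed_attention_mask(q_len, p_len, joint: bool):
    """Build a (q_len+p_len)-square boolean mask; True blocks attention.
    joint=False: question and passage attend only within their own segment.
    joint=True: full cross-attention is restored."""
    n = q_len + p_len
    mask = torch.zeros(n, n, dtype=torch.bool)
    if not joint:
        mask[:q_len, q_len:] = True   # question cannot see passage
        mask[q_len:, :q_len] = True   # passage cannot see question
    return mask

# usage: segment-restricted early layer, then a fully joint later layer
layer = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
x = torch.randn(1, 12 + 100, 64)      # 12 question + 100 passage tokens
early = layer(x, src_mask=delayed_attention_mask(12, 100, joint=False))
late = layer(early, src_mask=delayed_attention_mask(12, 100, joint=True))
```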

A Method for Generating Comparison Tables From the Semantic Web

This paper presents Versus, the first automatic method for generating comparison tables from knowledge bases of the Semantic Web. For this purpose, it introduces the contextual reference level, a measure that evaluates whether a feature is relevant for comparing a set of entities. This measure relies on contexts, i.e., sets of entities similar to the compared entities, and its principle is to favor features whose values for the compared entities are reference values (frequent ones) in these contexts. The proposal efficiently evaluates the contextual reference level from a public SPARQL endpoint limited by a fair-use policy. Using a new benchmark based on Wikidata, the experiments show that the contextual reference level identifies the features deemed relevant by users with high precision and recall. In addition, the proposed optimizations significantly reduce the number of required queries, for properties as well as for inverse relations. Interestingly, this experimental study also shows that inverse relations bring out a large number of numerical comparison features.
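As an illustration of the kind of endpoint interaction involved, the hypothetical query below counts how often each property occurs within a context set (here, countries) on the public Wikidata endpoint; it is not one of Versus's actual queries.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Illustrative only: property frequencies within a context set, the raw
# material from which a contextual reference level could be derived.
sparql = SPARQLWrapper("https://query.wikidata.org/sparql")
sparql.setQuery("""
SELECT ?property (COUNT(?entity) AS ?freq) WHERE {
  ?entity wdt:P31 wd:Q6256 .        # context: entities that are countries
  ?entity ?property ?value .
}
GROUP BY ?property
ORDER BY DESC(?freq)
LIMIT 20
""")
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["property"]["value"], row["freq"]["value"])
```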

Concept of Temporal Pretopology for the Analysis of Structural Changes

Pretopology is a mathematical model developed from a weakening of the topological axiomatic. It was initially used in the economic, social, and biological sciences, then in pattern recognition and image analysis, and has more recently been applied to the analysis of complex networks. Pretopology makes it possible to work in a mathematical framework with weak properties, and its non-idempotent operator, called the pseudo-closure, permits the implementation of iterative algorithms. It offers a formalism that generalizes graph-theoretic concepts and allows problems to be modelled universally. In this paper, the authors extend this mathematical model to analyze complex data with spatiotemporal dimensions. They define the notion of a temporal pretopology based on a temporal function, give an example of a temporal function based on a binary relation, and construct a temporal pretopology from it. They define two new notions of temporal substructures that aim at representing the evolution of substructures, and propose algorithms to extract these substructures. They experiment with the proposal on two datasets and two real economic datasets.
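A minimal sketch of a pseudo-closure built from a binary relation, and of the iterative expansion this non-idempotent operator enables; the temporal extension itself is not reproduced here.

```python
def pseudo_closure(A, relation):
    """Pseudo-closure from a binary relation: a(A) adds every element
    related to some element of A. The operator is extensive but, unlike a
    topological closure, not required to be idempotent, so iterating it
    can keep growing the set."""
    return A | {x for (x, y) in relation if y in A}

def closure(A, relation):
    """Iterate the pseudo-closure to a fixed point: the smallest closed
    set containing A."""
    while True:
        B = pseudo_closure(A, relation)
        if B == A:
            return A
        A = B

# toy usage: elements linked through a directed relation
R = {(2, 1), (3, 2), (4, 9)}
print(closure({1}, R))   # {1, 2, 3}: expansion stops when nothing new joins
```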

Improvement of Data Stream Decision Trees

The classification of data streams has become a significant and active research area. The principal characteristics of data streams are the large amount of arriving data, the high speed and rate of its arrival, and changes in its nature and distribution over time. Hoeffding Tree is a method for incrementally building decision trees; since its introduction, it has become one of the most popular tools for data stream classification, and several improvements have since emerged. Hoeffding Anytime Tree, recently introduced, is considered one of the most promising of these: it offers higher accuracy than the Hoeffding Tree in most scenarios, at a small additional computational cost. In this work, the authors contribute three improvements to Hoeffding Anytime Tree, tested on known benchmark datasets. The experimental results show that two of the proposed variants make better use of Hoeffding Anytime Tree's properties: they learn faster while providing the same desired accuracy.
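For context, Hoeffding Anytime Tree is available in the `river` library under the name Extremely Fast Decision Tree, which allows the baseline comparison sketched below; the three variants proposed by the authors are not part of river.

```python
from river import datasets, evaluate, metrics, tree

# Prequential (test-then-train) comparison of the two baseline trees on a
# small built-in stream; grace_period controls how often splits are tried.
for name, model in [
    ("Hoeffding Tree", tree.HoeffdingTreeClassifier(grace_period=100)),
    ("Hoeffding Anytime Tree",
     tree.ExtremelyFastDecisionTreeClassifier(grace_period=100)),
]:
    acc = evaluate.progressive_val_score(
        dataset=datasets.Phishing(), model=model, metric=metrics.Accuracy()
    )
    print(name, acc)
```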