Bake off redux: a review and experimental evaluation of recent time series classification algorithms

Abstract

In 2017, a research paper (Bagnall et al. Data Mining and Knowledge Discovery 31(3):606-660. 2017) compared 18 Time Series Classification (TSC) algorithms on 85 datasets from the University of California, Riverside (UCR) archive. This study, commonly referred to as a ‘bake off’, identified that only nine algorithms performed significantly better than the Dynamic Time Warping (DTW) and Rotation Forest benchmarks that were used. The study categorised each algorithm by the type of features it extracts from time series data, forming a taxonomy of five main algorithm types. This categorisation of algorithms alongside the provision of code and accessible results for reproducibility has helped fuel an increase in popularity of the TSC field. Over six years have passed since this bake off; the UCR archive has expanded to 112 datasets and a large number of new algorithms have been proposed. We revisit the bake off, seeing how each of the proposed categories has advanced since the original publication, and evaluate the performance of newer algorithms against the previous best-of-category using an expanded UCR archive. We extend the taxonomy to include three new categories to reflect recent developments. Alongside the originally proposed distance, interval, shapelet, dictionary and hybrid based algorithms, we compare newer convolution and feature based algorithms as well as deep learning approaches. We introduce 30 classification datasets either recently donated to the archive or reformatted to the TSC format, and use these to further evaluate the best performing algorithm from each category. Overall, we find that two recently proposed algorithms, MultiROCKET+Hydra (Dempster et al. 2022) and HIVE-COTEv2 (Middlehurst et al. Mach Learn 110:3211-3243. 2021), perform significantly better than other approaches on both the current and new TSC problems.
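As a point of reference for the DTW benchmark mentioned above, the following is a minimal sketch of the standard dynamic-programming DTW distance between two univariate series; pairing it with a 1-nearest-neighbour classifier gives the classic TSC baseline. The function name and example data are illustrative and are not taken from the bake off's code.

```python
import numpy as np

def dtw_distance(x, y):
    """Minimal dynamic-programming DTW distance between two 1-D series."""
    n, m = len(x), len(y)
    # cost[i, j] = cost of the best warping path aligning x[:i] with y[:j]
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (x[i - 1] - y[j - 1]) ** 2
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    return np.sqrt(cost[n, m])

# Two series with the same shape but a small phase shift: DTW absorbs the shift.
a = np.array([0.0, 1.0, 2.0, 1.0, 0.0])
b = np.array([0.0, 0.0, 1.0, 2.0, 1.0])
print(dtw_distance(a, b))
```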

Hyper-distance oracles in hypergraphs

Abstract

We study point-to-point distance estimation in hypergraphs, where the query is parameterized by a positive integer s, which defines the required level of overlap for two hyperedges to be considered adjacent. To answer s-distance queries, we first explore an oracle based on the line graph of the given hypergraph and discuss its limitations: the line graph is typically orders of magnitude larger than the original hypergraph. We then introduce HypED, a landmark-based oracle with a predefined size, built directly on the hypergraph, thus avoiding the materialization of the line graph. Our framework allows us to approximately answer vertex-to-vertex, vertex-to-hyperedge, and hyperedge-to-hyperedge s-distance queries for any value of s. A key observation at the basis of our framework is that as s increases, the hypergraph becomes more fragmented. We show how this can be exploited to improve the placement of landmarks, by identifying the s-connected components of the hypergraph. For this latter task, we devise an efficient algorithm based on the union-find technique and a dynamic inverted index. We experimentally evaluate HypED on several real-world hypergraphs and demonstrate its versatility in answering s-distance queries for different values of s. Our framework answers such queries in fractions of a millisecond while allowing fine-grained control of the trade-off between index size and approximation error at creation time. Finally, we demonstrate the usefulness of the s-distance oracle in two applications, namely hypergraph-based recommendation and the approximation of the s-closeness centrality of vertices and hyperedges in the context of protein-protein interactions.
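To make the notion of s-connected components concrete, here is a minimal union-find sketch that groups hyperedges sharing at least s vertices. It checks all hyperedge pairs naively, whereas HypED avoids this quadratic scan via a dynamic inverted index; all names and example data are illustrative.

```python
from itertools import combinations

def s_connected_components(hyperedges, s):
    """Group hyperedges into s-connected components: two hyperedges are
    s-adjacent if they share at least s vertices, and components are the
    transitive closure of s-adjacency (computed with union-find)."""
    parent = list(range(len(hyperedges)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    def union(i, j):
        parent[find(i)] = find(j)

    # Naive O(m^2) overlap check; HypED replaces this with an inverted index.
    for i, j in combinations(range(len(hyperedges)), 2):
        if len(set(hyperedges[i]) & set(hyperedges[j])) >= s:
            union(i, j)

    components = {}
    for i in range(len(hyperedges)):
        components.setdefault(find(i), []).append(i)
    return list(components.values())

# As s grows, the hypergraph fragments into more, smaller components.
hes = [{1, 2, 3}, {2, 3, 4}, {4, 5}, {6, 7}]
print(s_connected_components(hes, 1))  # [[0, 1, 2], [3]]
print(s_connected_components(hes, 2))  # [[0, 1], [2], [3]]
```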

An academic recommender system on large citation data based on clustering, graph modeling and deep learning

Abstract

Recommendation (recommender) systems (RS) have played a significant role in both research and industry in recent years. In academia, there is a need to help researchers discover the most appropriate and relevant scientific information through recommendations. Nevertheless, we argue that there is a major gap between academic state-of-the-art RS and real-world problems. In this paper, we present a novel multi-staged RS based on clustering, graph modeling and deep learning that runs on a full dataset (a scientific digital library) with millions of users and items (papers). We run several experiments to tune our system and compare three versions of our RS in terms of recall and NDCG metrics. The results show that a multi-staged RS that utilizes a variety of techniques and algorithms is able to handle real-world problems and large academic datasets. In this way, we suggest a way to close, or at least narrow, the gap between research and industrial RS.
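For reference, the two reported evaluation metrics can be computed as follows; this is a generic sketch with binary relevance and toy data, not the paper's evaluation code.

```python
import math

def recall_at_k(recommended, relevant, k):
    """Fraction of the relevant items that appear in the top-k recommendations."""
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / len(relevant) if relevant else 0.0

def ndcg_at_k(recommended, relevant, k):
    """Normalized discounted cumulative gain with binary relevance."""
    dcg = sum(1.0 / math.log2(rank + 2)
              for rank, item in enumerate(recommended[:k]) if item in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

# Example: papers ranked by the recommender vs. papers a researcher actually cited.
ranked = ["p7", "p3", "p9", "p1", "p5"]
cited = {"p3", "p5", "p8"}
print(recall_at_k(ranked, cited, 5))  # 2 of the 3 relevant papers retrieved
print(ndcg_at_k(ranked, cited, 5))
```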

Exploring aspect-based sentiment analysis: an in-depth review of current methods and prospects for advancement

Abstract

Aspect-based sentiment analysis (ABSA) is a natural language processing technique that seeks to recognize and extract the sentiment connected to various qualities or aspects of a specific good, service, or entity. It entails dissecting a text into its component pieces, determining the elements or aspects being discussed, and then examining the attitude expressed about each aspect. The main objective of this research is to present a comprehensive understanding of ABSA, including its potential, ongoing trends and advancements, structure, practical applications, real-world implementation, and open issues. Current sentiment analysis aims to enhance granularity at the aspect level, with two main objectives: aspect extraction and sentiment polarity classification. Three main families of methods are used for aspect extraction: pattern-based, machine learning, and deep learning. These methods can capture both syntactic and semantic features of text without relying heavily on high-level feature engineering, which was a requirement in earlier approaches. Going beyond traditional surveys, this article also provides a comprehensive survey of the procedure for carrying out this task and of the applications of ABSA. Each strategy is evaluated, compared, and investigated to fully comprehend its benefits and drawbacks. Finally, the difficulties of ABSA are reviewed to determine future directions.

GroupMO: a memory-augmented meta-optimized model for group recommendation

Abstract

Group recommendation aims to suggest desired items to a group of users. Existing methods achieve promising results in predicting the preferences of data-rich groups. However, they can be ineffective for cold-start groups, whose sparse interactions prevent the model from understanding their intent. Although the cold-start problem can be alleviated by meta-learning, we cannot use the same initialization for all groups because their preferences vary. To tackle this problem, this paper proposes a memory-augmented meta-optimized model for group recommendation, namely GroupMO. Specifically, we adopt a clustering method to assemble groups with similar profiles into the same cluster and design a representative group profile memory that uses those clusters to guide the preliminary initialization of the group embedding network for each group. Besides, we design a group shared preference memory to guide the initialization of the prediction network at a finer granularity for different groups, so that shared knowledge can be better transferred to groups with similar preferences. Moreover, we incorporate these two memories to optimize the meta-learning process. Finally, extensive experiments on two real-world datasets demonstrate the superiority of our model.
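The following heavily simplified sketch illustrates the general idea of cluster-guided initialization, that is, groups with similar profiles drawing their initialization from a shared memory slot rather than from a single global initialization. It is an assumption-laden illustration (KMeans clustering, random profile features, a fixed memory), not GroupMO's actual architecture.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
group_profiles = rng.normal(size=(500, 16))  # e.g. aggregated member features
emb_dim = 32

# Profile memory: one initialization vector per profile cluster (here random;
# in a meta-learning setting these would themselves be learned).
kmeans = KMeans(n_clusters=8, n_init=10, random_state=0).fit(group_profiles)
profile_memory = rng.normal(scale=0.01, size=(8, emb_dim))

def init_group_embedding(profile):
    """Initialize a (possibly cold-start) group from its nearest cluster's memory slot."""
    cluster = kmeans.predict(profile.reshape(1, -1))[0]
    return profile_memory[cluster].copy()

new_group = rng.normal(size=16)  # a cold-start group with few interactions
print(init_group_embedding(new_group).shape)  # (32,)
```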

Protecting the privacy of social network data using graph correction

Abstract

Today, the rapid development of online social networks, together with their low cost, easy communication, and quick access with minimal facilities, has made them an attractive and very influential phenomenon. The users of these networks tend to share their sensitive and private information with friends and acquaintances. As a result, the data of these networks has become a very important source of information about users, their interests, feelings, and activities. Analyzing this information can be very useful in predicting the behavior of users in dealing with various issues. However, publishing this data for data mining can violate the privacy of users. Consequently, privacy protection for social network data has become an important and attractive research topic. In this context, various algorithms have been proposed, all of which meet privacy requirements by changing the information as well as the graph structure. However, due to high processing costs and long execution times, these algorithms are not well suited to anonymizing big data. In this research, we improve the speed of data anonymization by using a number factorization technique to select and delete the best edges in the graph correction stage. We also use the chaotic krill herd algorithm to add edges, selecting them by considering the joint effect of all edges on the graph structure so that the graph's utility is preserved. The evaluation results on real-world datasets show the efficiency of the proposed algorithm, compared with state-of-the-art methods, in reducing execution time and maintaining the utility of the anonymized graph.

An approach for fuzzy group decision making and consensus measure with hesitant judgments of experts

Abstract

In some actual decision-making problems, experts may be hesitant to judge the performance of alternatives, which leads them to provide decision matrices with incomplete information. However, most existing estimation methods for incomplete information in group decision-making (GDM) neglect the hesitant judgments of experts, possibly making the group decision outcomes unreasonable. Considering the hesitation degrees of experts in decision judgments, an approach is proposed based on triangular intuitionistic fuzzy numbers (TIFNs) and the TODIM (interactive and multiple criteria decision-making) method for GDM and consensus measurement. First, TIFNs are applied to handle incomplete information arising from the hesitant judgments of experts. Second, considering the risk attitudes of experts, a decision-making model is proposed to rank alternatives for GDM with incomplete information. Subsequently, based on measuring the concordance between solutions, a consensus model is presented to measure the group's and individuals' consensus degrees. Finally, an illustrative example is presented to show the detailed implementation procedure of the proposed approach. Comparisons with some existing estimation methods verify the effectiveness of the proposed approach for handling incomplete information, and the impact and necessity of experts' hesitation degrees are discussed through a sensitivity analysis.
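For readers unfamiliar with TIFNs, one common formulation (following the widely used triangular definition; the paper may adopt a variant) writes a TIFN as $\tilde{a} = \langle (a, b, c); w_{\tilde{a}}, u_{\tilde{a}} \rangle$, with membership and non-membership functions

```latex
\mu_{\tilde{a}}(x) =
\begin{cases}
\dfrac{w_{\tilde{a}}(x-a)}{b-a}, & a \le x < b,\\
w_{\tilde{a}}, & x = b,\\
\dfrac{w_{\tilde{a}}(c-x)}{c-b}, & b < x \le c,\\
0, & \text{otherwise,}
\end{cases}
\qquad
\nu_{\tilde{a}}(x) =
\begin{cases}
\dfrac{(b-x)+u_{\tilde{a}}(x-a)}{b-a}, & a \le x < b,\\
u_{\tilde{a}}, & x = b,\\
\dfrac{(x-b)+u_{\tilde{a}}(c-x)}{c-b}, & b < x \le c,\\
1, & \text{otherwise,}
\end{cases}
```

where $0 \le w_{\tilde{a}} \le 1$, $0 \le u_{\tilde{a}} \le 1$ and $w_{\tilde{a}} + u_{\tilde{a}} \le 1$; the gap $1 - \mu_{\tilde{a}}(x) - \nu_{\tilde{a}}(x)$ captures the expert's hesitation degree.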

Range control-based class imbalance and optimized granular elastic net regression feature selection for credit risk assessment

Abstract

Credit risk, stemming from the failure of a contractual party, is a significant variable in financial institutions. Assessing credit risk involves evaluating the creditworthiness of individuals, businesses, or entities to predict the likelihood of defaulting on financial obligations. While financial institutions categorize consumers based on creditworthiness, there is no universally defined set of attributes or indices. This research proposes Range control-based class imbalance and Optimized Granular Elastic Net regression (ROGENet) for feature selection in credit risk assessment. The dataset exhibits severe class imbalance, which is addressed using the Range-Controlled Synthetic Minority Oversampling TEchnique (RCSMOTE). The balanced data then undergo Granular Elastic Net regression with hybrid Gazelle sand cat Swarm Optimization (GENGSO) for feature selection. Elastic net, ensuring sparsity and grouping for correlated features, proves beneficial for assessing credit risk. ROGENet provides a detailed perspective on credit risk evaluation, surpassing conventional methods. The oversampling-based feature selection enhances the accuracy of the minority class by 99.4%, 99%, 98.6% and 97.3%, respectively.
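The overall pipeline shape, oversampling followed by elastic-net-based feature selection, can be sketched as follows. Plain SMOTE (from the external imbalanced-learn package) stands in for RCSMOTE, a logistic model with an elastic-net penalty stands in for the granular elastic net regression, and a fixed penalty setting stands in for GENGSO's tuning, so this illustrates the general technique rather than ROGENet itself; all data and parameters are synthetic.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from imblearn.over_sampling import SMOTE

# Synthetic, heavily imbalanced stand-in for a credit-default dataset.
X, y = make_classification(n_samples=2000, n_features=30, n_informative=8,
                           weights=[0.95, 0.05], random_state=0)

# Step 1: rebalance the minority (default) class by synthetic oversampling.
X_bal, y_bal = SMOTE(random_state=0).fit_resample(X, y)

# Step 2: elastic-net penalty combines L1 (sparsity) and L2 (grouping of
# correlated features); features with non-zero weights are kept.
model = LogisticRegression(penalty="elasticnet", solver="saga",
                           l1_ratio=0.5, C=0.1, max_iter=5000)
model.fit(X_bal, y_bal)

selected = np.flatnonzero(np.abs(model.coef_[0]) > 1e-6)
print(f"{len(selected)} of {X.shape[1]} features selected:", selected)
```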

A fuzzy rough set-based horse herd optimization algorithm for map reduce framework for customer behavior data

Abstract

A large number of association rules often reduces the reliability of data mining results; hence, a dimensionality reduction technique is crucial for data analysis. When analyzing massive datasets, existing models take more time to scan the entire database because they discover items and transactions that are not needed for the analysis. For this purpose, the Fuzzy Rough Set-based Horse Herd Optimization (FRS-HHO) algorithm is proposed and integrated with the MapReduce framework to minimize query retrieval time and improve performance. The HHO algorithm removes unnecessary items and transactions with minimal support value from the dataset and maximizes a fitness function based on multiple objectives, namely support, confidence, interestingness, and lift, to evaluate the quality of association rules. The feature value of each item in the population is obtained by a MapReduce-based fitness function to generate optimal frequent itemsets in minimal time. Horse Herd Optimization (HHO) is employed to solve high-dimensional optimization problems. The proposed FRS-HHO approach takes less execution time across dimensions and has a space complexity of 38% for a total of 10k transactions. It also offers a speedup rate of 17% and a 12% decrease in input–output communication cost compared with other approaches. Overall, the proposed FRS-HHO model enhances performance in terms of execution time, space complexity, and speed.
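For concreteness, the rule-quality objectives named above (support, confidence, and lift) can be computed as in the following sketch; the function name and toy transactions are illustrative, not taken from the paper.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Support, confidence, and lift of the rule antecedent -> consequent,
    the kind of objectives combined in a rule-quality fitness function."""
    n = len(transactions)
    both = sum(1 for t in transactions if antecedent <= t and consequent <= t)
    ante = sum(1 for t in transactions if antecedent <= t)
    cons = sum(1 for t in transactions if consequent <= t)
    support = both / n
    confidence = both / ante if ante else 0.0
    lift = confidence / (cons / n) if cons else 0.0
    return support, confidence, lift

# Example on a toy customer-behavior dataset.
txns = [{"bread", "milk"}, {"bread", "butter"}, {"milk", "butter"},
        {"bread", "milk", "butter"}, {"milk"}]
print(rule_metrics(txns, {"bread"}, {"milk"}))  # approx. (0.4, 0.67, 0.83)
```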