ACM Transactions on Knowledge Discovery from Data (TKDD)

Optimal Algebraic Breadth-First Search for Sparse Graphs

May 10, 2021 ACM Transactions on Knowledge Discovery from Data (TKDD)

Paul Burkhardt

Algebraic methods for graph algorithms have grown in popularity with the development of the GraphBLAS library and other sparse matrix methods. An exemplar of these approaches is Breadth-First Search (BFS). The algebraic BFS algorithm is simply a recurrence of matrix-vector multiplications with the n × n adjacency matrix, but the many redundant operations over nonzeros ultimately lead to suboptimal performance. An optimal algebraic BFS should therefore be of keen interest, especially if it is easily integrated with existing matrix methods. Current methods, notably in the GraphBLAS, use a Sparse Matrix masked-Sparse Vector multiplication in which the input vector is kept in a sparse representation in each step of the BFS, and nonzeros in the vector are masked in subsequent steps.
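
The recurrence described above can be sketched in a few lines of Python. This is only an illustration of the generic masked sparse-vector BFS, not the optimal algorithm proposed in the article and not the GraphBLAS API. The adjacency structure (a dict mapping each vertex to its neighbour set) and the name `algebraic_bfs` are assumptions made for the example.

```python
def algebraic_bfs(adj, source):
    """Level-synchronous BFS phrased as masked sparse matrix-vector products.

    adj maps each vertex to the set of its neighbours, i.e. the nonzeros of
    one row of the adjacency matrix (hypothetical input format).
    """
    levels = {source: 0}   # visited vertices double as the mask
    frontier = {source}    # sparse input vector: only nonzeros are stored
    depth = 0
    while frontier:
        depth += 1
        next_frontier = set()
        for u in frontier:                # touch only nonzero vector entries
            for v in adj.get(u, ()):      # Boolean multiply-add over row u
                if v not in levels:       # mask out already-visited vertices
                    next_frontier.add(v)
        for v in next_frontier:
            levels[v] = depth
        frontier = next_frontier
    return levels


# Small undirected example graph (assumed data).
graph = {0: {1, 2}, 1: {0, 3}, 2: {0, 3}, 3: {1, 2, 4}, 4: {3}}
levels = algebraic_bfs(graph, 0)
```

Each pass of the `while` loop plays the role of one matrix-vector product. The inner loop still inspects edges that lead to already-visited vertices, which is the kind of redundant work over nonzeros the abstract refers to.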

Density Guarantee on Finding Multiple Subgraphs and Subtensors

May 10, 2021 ACM Transactions on Knowledge Discovery from Data (TKDD)

Quang-huy Duong, Heri Ramampiaro, Kjetil Nørvåg, Thu-lan Dam

Dense subregion (subgraph & subtensor) detection is a well-studied area with a wide range of applications, and numerous efficient approaches and algorithms have been proposed. Approximation approaches are commonly used for detecting dense subregions due to the complexity of the exact methods. Existing algorithms are generally efficient for dense subtensor and subgraph detection, and can perform well in many applications. However, most existing works rely on the state-of-the-art greedy 2-approximation algorithm, which provides solutions with only a loose theoretical density guarantee. The main drawback of most of these algorithms is that they can estimate only one subtensor, or subgraph, at a time, with a low guarantee on its density.
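
For readers unfamiliar with the greedy 2-approximation mentioned above, the following is a minimal sketch of the classic peeling procedure for a single dense subgraph, assuming density is defined as the number of edges divided by the number of vertices. It is not the multi-subgraph method of the article; the function name and the edge-list input format are assumptions.

```python
def greedy_densest_subgraph(edges):
    """Greedy peeling: repeatedly delete a minimum-degree vertex and keep the
    densest intermediate vertex set (density = |E| / |V|)."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue                      # ignore self-loops
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    remaining = set(adj)
    num_edges = sum(len(nbrs) for nbrs in adj.values()) // 2
    best_set = set(remaining)
    best_density = num_edges / len(remaining) if remaining else 0.0

    while len(remaining) > 1:
        # A linear scan keeps the sketch short; a bucket queue would be faster.
        u = min(remaining, key=lambda x: len(adj[x]))
        for v in adj[u]:
            adj[v].discard(u)             # detach u from its neighbours
        num_edges -= len(adj[u])
        del adj[u]
        remaining.remove(u)
        density = num_edges / len(remaining)
        if density > best_density:
            best_density, best_set = density, set(remaining)
    return best_set, best_density


# A 4-clique on {0, 1, 2, 3} with a pendant path 3-4-5 attached (assumed data).
example_edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (3, 4), (4, 5)]
dense_set, dense_value = greedy_densest_subgraph(example_edges)
```

The procedure returns exactly one vertex set, whose density is guaranteed to be at least half of the optimum. That is the limitation the abstract points out.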

Graph Neural Networks for Fast Node Ranking Approximation

May 10, 2021 ACM Transactions on Knowledge Discovery from Data (TKDD)

Sunil Kumar Maurya, Xin Liu, Tsuyoshi Murata

Graphs arise naturally in numerous situations, including social graphs, transportation graphs, web graphs, and protein graphs. One of the important problems in these settings is to identify which nodes are important in the graph and how they affect the graph structure as a whole. Betweenness centrality and closeness centrality are two commonly used node ranking measures for identifying influential nodes in terms of information spread and connectivity. Both are considered shortest-path-based measures, because their calculation assumes that information flows between nodes via shortest paths. However, exact calculations of these centrality measures are computationally expensive and prohibitive, especially for large graphs.
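
To make the cost of the exact computation concrete, here is a minimal sketch of exact closeness centrality on an unweighted, undirected graph, assuming the common definition "number of reachable nodes divided by the sum of shortest-path distances". It is a baseline illustration, not the graph neural network approximation proposed in the article; the function names and input format are assumptions.

```python
from collections import deque


def bfs_distances(adj, source):
    """Shortest-path (hop) distances from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def closeness_centrality(adj):
    """Exact closeness: one full BFS per node, so O(n * (n + m)) overall."""
    scores = {}
    for v in adj:
        dist = bfs_distances(adj, v)
        total = sum(dist.values())
        reachable = len(dist) - 1          # exclude the node itself
        scores[v] = reachable / total if total > 0 else 0.0
    return scores


# Path graph 0 - 1 - 2 (assumed data): the middle node is the most central.
path_graph = {0: {1}, 1: {0, 2}, 2: {1}}
scores = closeness_centrality(path_graph)
```

Because every node needs its own traversal, the running time grows with the product of the node count and the graph size, which is why exact computation becomes prohibitive on large graphs.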

Tiered Sampling: An Efficient Method for Counting Sparse Motifs in Massive Graph Streams

May 10, 2021 ACM Transactions on Knowledge Discovery from Data (TKDD)

Lorenzo De Stefani, Erisa Terolli, Eli Upfal

We introduce Tiered Sampling, a novel technique for estimating the count of sparse motifs in massive graphs whose edges are observed in a stream. Our technique requires only a single pass over the data and uses a memory of fixed size M, which can be orders of magnitude smaller than the number of edges. Our methods address the challenging task of counting sparse motifs (subgraph patterns) that have a low probability of appearing in a sample of M edges in the graph, which is the maximum amount of data available to the algorithms in each step. To obtain an unbiased and low variance estimate of the count, we partition the available memory into tiers (layers) of reservoir samples.
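
The building block named above, a reservoir sample of fixed size M maintained in a single pass, can be sketched as follows. This is plain single-tier reservoir sampling, not the tiered scheme or the motif estimator of the article; the function name, the `seed` parameter, and the edge representation are assumptions.

```python
import random


def reservoir_sample_edges(edge_stream, memory_size, seed=None):
    """Keep a uniform random sample of at most memory_size edges from a
    stream of unknown length, using one pass and O(memory_size) memory."""
    rng = random.Random(seed)
    reservoir = []
    for t, edge in enumerate(edge_stream, start=1):
        if t <= memory_size:
            reservoir.append(edge)         # fill phase: keep everything
        else:
            j = rng.randrange(t)           # uniform integer in [0, t)
            if j < memory_size:            # happens with probability M / t
                reservoir[j] = edge        # evict a uniformly chosen slot
    return reservoir


# Toy stream of 1000 distinct edges (assumed data), sampled into M = 50 slots.
stream = [(i, i + 1) for i in range(1000)]
sample = reservoir_sample_edges(stream, 50, seed=7)
```

At every point in the stream each edge seen so far is in the reservoir with the same probability. A motif whose edges rarely fall into the sample together is therefore hard to count from a single reservoir, which is the difficulty the tiers are meant to address.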

Improved Customer Lifetime Value Prediction With Sequence-To-Sequence Learning and Feature-Based Models

May 10, 2021 ACM Transactions on Knowledge Discovery from Data (TKDD)

Josef Bauer, Dietmar Jannach

The prediction of the Customer Lifetime Value (CLV) is an important asset for tool-supported marketing by customer relationship managers. Since standard methods based on purchase recency, frequency, and past profit and revenue statistics often have limited predictive power, advanced machine learning (ML) techniques have been applied to this task in recent years. However, existing approaches are often not fully capable of modeling certain temporal patterns that are commonly found in practice, such as periodic purchasing behavior of customers. To address these shortcomings, we propose a novel method for CLV prediction based on a combination of several ML techniques. At its core, our method consists of a tailored deep learning approach based on encoder–decoder sequence-to-sequence recurrent neural networks with augmented temporal convolutions.

Deep Graph Matching and Searching for Semantic Code Retrieval

May 10, 2021 ACM Transactions on Knowledge Discovery from Data (TKDD)

Xiang Ling, Lingfei Wu, Saizhuo Wang, Gaoning Pan, Tengfei Ma, Fangli Xu, Alex X. Liu, Chunming Wu, Shouling Ji

Code retrieval aims to find, in a large corpus of source code repositories, the code snippet that most closely matches a natural language query. Recent work mainly uses natural language processing techniques to process both query texts (i.e., human natural language) and code snippets (i.e., machine programming language), but neglects the deep structured features of query texts and source code, both of which contain rich semantic information. In this article, we propose an end-to-end deep graph matching and searching (DGMS) model based on graph neural networks for the task of semantic code retrieval. To this end, we first represent both natural language query texts and programming language code snippets as unified graph-structured data, and then use the proposed graph matching and searching model to retrieve the best matching code snippet.

Streaming Social Event Detection and Evolution Discovery in Heterogeneous Information Networks

May 10, 2021 ACM Transactions on Knowledge Discovery from Data (TKDD)

Hao Peng, Jianxin Li, Yangqiu Song, Renyu Yang, Rajiv Ranjan, Philip S. Yu, Lifang He

Events happen in the real world and in real time, and can be planned and organized for occasions such as social gatherings, festival celebrations, influential meetings, or sports activities. Social media platforms generate a lot of real-time text information regarding public events with different topics. However, mining social events is challenging because events typically exhibit heterogeneous texture and their metadata are often ambiguous. In this article, we first design a novel event-based meta-schema to characterize the semantic relatedness of social events and then build an event-based heterogeneous information network (HIN) integrating information from an external knowledge base. Second, we propose a novel Pairwise Popularity Graph Convolutional Network, named PP-GCN, which takes weighted meta-path instance similarity and textual semantic representation as inputs, to perform fine-grained social event categorization and learn the optimal weights of meta-paths in different tasks.

Utility Mining Across Multi-Dimensional Sequences

May 10, 2021 ACM Transactions on Knowledge Discovery from Data (TKDD)

Wensheng Gan, Jerry Chun-Wei Lin, Jiexiong Zhang, Hongzhi Yin, Philippe Fournier-Viger, Han-Chieh Chao, Philip S. Yu

Knowledge extraction from databases is a fundamental task in the database and data mining communities and has been applied to a wide range of real-world applications and situations. Unlike support-based mining models, the utility-oriented mining framework integrates utility theory to provide more informative and useful patterns. Time-dependent sequence data are commonly seen in real life. Sequence data have been widely utilized in many applications, such as analyzing sequential user behavior on the Web, influence maximization, route planning, and targeted marketing. Unfortunately, all the existing algorithms overlook the fact that the processed data not only contain rich features (e.g., occurrence quantity, risk, and profit), but may also be associated with multi-dimensional auxiliary information; for example, a transaction sequence can be associated with purchaser profile information.

Adaptive Influence Maximization: If Influential Node Unwilling to Be the Seed

May 10, 2021 ACM Transactions on Knowledge Discovery from Data (TKDD)

Jianxiong Guo, Weili Wu

The influence maximization problem seeks a small subset of nodes that maximizes the expected influence spread, and it has been researched intensively. Prior studies all assume that each user in the selected seed set is activated successfully and then spreads the influence. However, in real scenarios, not all users in the seed set are willing to be influencers. We therefore associate each user with a probability with which we can activate her as a seed, and we may attempt to activate her many times. In this article, we study the adaptive influence maximization with multiple activations (Adaptive-IMMA) problem, where we select a node in each iteration and observe whether she accepts to be a seed: if yes, we wait to observe the influence diffusion process; if no, we can attempt to activate her again at a higher cost or select another node as a seed.

Self-Adaptive Skeleton Approaches to Detect Self-Organized Coalitions From Brain Functional Networks Through Probabilistic Mixture Models

May 10, 2021 ACM Transactions on Knowledge Discovery from Data (TKDD)

Kai Liu, Hongbo Liu, Tomas E. Ward, Hua Wang, Yu Yang, Bo Zhang, Xindong Wu

Detecting self-organized coalitions from functional networks is one of the most important ways to uncover functional mechanisms in the brain. Detecting them raises well-known technical challenges in terms of scale imbalance, outliers, and hard examples. In this article, we propose a novel self-adaptive skeleton approach to detect coalitions through an approximation method based on probabilistic mixture models. The nodes in the networks are characterized in terms of robust k-order complete subgraphs (k-clique) as essential substructures. The k-clique enumeration algorithm quickly enumerates all k-cliques in a parallel manner for a given network. Then, the cliques, from max-clique down to min-clique, of each order k, are hierarchically embedded into a probabilistic mixture model.
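
As a point of reference for the k-clique substructures mentioned above, the following is a minimal sequential sketch of k-clique enumeration by ordered candidate extension. It is not the parallel enumeration algorithm or the mixture-model embedding of the article; the function name and the adjacency format (a dict of neighbour sets over comparable vertex labels) are assumptions.

```python
def enumerate_k_cliques(adj, k):
    """Yield every k-clique exactly once, as a tuple of vertices in
    increasing order. adj maps each vertex to the set of its neighbours."""

    def extend(clique, candidates):
        if len(clique) == k:
            yield tuple(clique)
            return
        for v in sorted(candidates):
            # Keep only later candidates adjacent to v; they are already
            # adjacent to every vertex chosen before v.
            narrowed = {w for w in candidates if w > v and w in adj[v]}
            yield from extend(clique + [v], narrowed)

    yield from extend([], set(adj))


# A 4-clique on {0, 1, 2, 3} plus one pendant vertex 4 (assumed data).
network = {0: {1, 2, 3}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {0, 1, 2, 4}, 4: {3}}
triangles = list(enumerate_k_cliques(network, 3))
```

Running the generator for each order k, from the largest clique size down to the smallest, would produce the per-order clique sets that the abstract describes as input to the hierarchical embedding.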