Pre-trained language models: What do they know?

Diagram of pre-trained language models' common-sense capabilities and possible domains of application.


Abstract

Large language models (LLMs) have substantially pushed artificial intelligence (AI) research and applications in the last few years. They are currently able to achieve high effectiveness in different natural language processing (NLP) tasks, such as machine translation, named entity recognition, text classification, question answering, or text summarization. Recently, significant attention has been drawn to OpenAI's GPT models' capabilities and extremely accessible interface. LLMs are nowadays routinely used and studied for downstream tasks and specific applications with great success, pushing forward the state of the art in almost all of them. However, they also exhibit impressive inference capabilities when used off the shelf without further training. In this paper, we aim to study the behavior of pre-trained language models (PLMs) in some inference tasks they were not initially trained for. Therefore, we focus our attention on very recent research works related to the inference capabilities of PLMs in some selected tasks such as factual probing and common-sense reasoning. We highlight relevant achievements made by these models, as well as some of their current limitations that open opportunities for further research.
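
As an illustration of the kind of factual probing discussed above, the following minimal sketch queries a pre-trained masked language model with a cloze-style prompt using the Hugging Face transformers library; the model checkpoint and prompt are illustrative assumptions, not taken from the surveyed work.

```python
# Minimal cloze-style factual probing sketch (illustrative; not the surveyed paper's setup).
# Assumes the Hugging Face `transformers` library and the bert-base-uncased checkpoint.
from transformers import pipeline

# A fill-mask pipeline asks a pre-trained model to complete factual statements off the shelf.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# Cloze prompt: the model fills in the missing fact without any task-specific fine-tuning.
for prediction in fill_mask("The capital of France is [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))
```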

This article is categorized under: Fundamental Concepts of Data and Knowledge > Key Design Issues in Data Mining; Technologies > Artificial Intelligence

A comprehensive survey of personal knowledge graphs

Classification of personal knowledge graphs based on their methods of construction.


Abstract

Information that can encapsulate a person's daily life and its different aspects provides insightful knowledge. This knowledge can prove to be more useful than general knowledge for improving personalized tasks. When it comes to storing such knowledge, personal knowledge graphs (PKGs) come in as handy saviors. PKGs are knowledge graphs which store details that are pertinent to a user but not, in general, useful to the rest of humanity. Conversational agents can access these PKGs to answer queries related to the user's day-to-day life, whereas recommender systems can harness the knowledge stored in PKGs to make personalized suggestions. Despite the immense applicability of PKGs, there has not been significant research in this area. We present an extensive review of PKGs. We categorize them according to the domains in which they are most relevant; in particular, we highlight the use of PKGs in medicine, finance, and education and research. We also categorize the different ways of constructing a PKG based on the source of data required for such constructions. Furthermore, we discuss the limitations of PKGs and suggest directions for future work.
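
To make the notion of a PKG concrete, the sketch below builds a toy personal knowledge graph as RDF triples with the rdflib library; the namespace, entities, and relations are hypothetical examples, not drawn from any surveyed system.

```python
# Toy personal knowledge graph sketch (hypothetical entities and relations).
# Assumes the `rdflib` library is installed.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import FOAF, RDF

PKG = Namespace("http://example.org/pkg/")  # hypothetical namespace

g = Graph()
user = PKG["me"]

# Facts that matter only to this user, not to a general-purpose knowledge graph.
g.add((user, RDF.type, FOAF.Person))
g.add((user, PKG["hasDoctor"], PKG["dr_smith"]))
g.add((user, PKG["takesMedication"], Literal("ibuprofen")))
g.add((user, PKG["worksAt"], PKG["acme_corp"]))

# A conversational agent could answer "Who is my doctor?" by querying the graph.
for _, _, doctor in g.triples((user, PKG["hasDoctor"], None)):
    print(doctor)
```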

This article is categorized under: Fundamental Concepts of Data and Knowledge > Human Centricity and User Interaction; Fundamental Concepts of Data and Knowledge > Knowledge Representation; Technologies > Artificial Intelligence

Research on mining software repositories to facilitate refactoring

Overview of approaches to mining software repositories.


Abstract

Software refactoring focuses on improving software quality by applying changes to the internal structure that do not alter the observable behavior. Determining which refactorings should be applied, and presenting developers with the most relevant and optimal ones, is often challenging. Existing literature suggests that one potential source for identifying and recommending required refactorings is past software development and evolution history, which is often archived in software repositories. In this article, we review a selection of existing literature that proposes approaches to facilitate refactoring by exploiting information mined from software repositories. Based on the reviewed papers, existing works leverage software history mining to support the analysis of code smells, refactoring, and the guiding of software changes. First, past history information is used to detect design flaws in source code, commonly referred to as code smells. Moreover, other studies analyze the evolution of code smells to establish how and when they are introduced into the code base and resolved. Second, software repository mining provides useful insights that can be used to predict the need for refactoring and which specific refactoring operations are required. In addition, past history can be used to detect and analyze previously applied refactorings to establish software change facts, for instance, how developers refactor code and the motivation behind it. Finally, change patterns are used to predict further changes that might be required and to recommend a set of files for change during a given modification task. The article further suggests other promising possibilities that can be pursued in this research direction in the future.
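
As a hedged illustration of repository mining for refactoring evidence (not the approach of any specific surveyed work), the sketch below walks a repository's commit history with the PyDriller library and flags commits whose messages mention refactoring; the repository path and keyword heuristic are assumptions.

```python
# Sketch: mine a Git history for self-reported refactoring commits via a keyword heuristic.
# Assumes PyDriller 2.x; the repository path is a placeholder.
from pydriller import Repository

REFAC_KEYWORDS = ("refactor", "restructure", "clean up", "rename")  # assumed heuristic

for commit in Repository("path/to/repo").traverse_commits():
    message = commit.msg.lower()
    if any(keyword in message for keyword in REFAC_KEYWORDS):
        touched = [m.filename for m in commit.modified_files]
        print(commit.hash[:8], commit.author.name, touched[:5])
```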

This article is categorized under: Algorithmic Development > Text Mining; Application Areas > Data Mining Software Tools

A geometric framework for outlier detection in high-dimensional data

A geometric framework exploiting the metric structure of a data set allows one to (1) conceptualize outlier detection on a general level and (2) conduct outlier detection in a principled and canonical way for very different high-dimensional and/or non-tabular data types such as functions (A.1), graphs (B.1), or images (C.1). The framework furthermore distinguishes structural (red) and distributional (blue) outliers, which can be detected, visualized, and quantified (A.2–C.2) with simple and well-established manifold learning and outlier scoring methods such as MDS and LOF (not all graph and image observations can be plotted at once in B.1 and C.1).


Abstract

Outlier or anomaly detection is an important task in data analysis. We discuss the problem from a geometrical perspective and provide a framework which exploits the metric structure of a data set. Our approach rests on the manifold assumption, that is, that the observed, nominally high-dimensional data lie on a much lower dimensional manifold and that this intrinsic structure can be inferred with manifold learning methods. We show that exploiting this structure significantly improves the detection of outlying observations in high dimensional data. We also suggest a novel, mathematically precise and widely applicable distinction between distributional and structural outliers based on the geometry and topology of the data manifold that clarifies conceptual ambiguities prevalent throughout the literature. Our experiments focus on functional data as one class of structured high-dimensional data, but the framework we propose is completely general and we include image and graph data applications. Our results show that the outlier structure of high-dimensional and non-tabular data can be detected and visualized using manifold learning methods and quantified using standard outlier scoring methods applied to the manifold embedding vectors.
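
The following sketch illustrates the general recipe described above (embed with a manifold learning method, then score with a standard outlier detector) using scikit-learn's MDS and LocalOutlierFactor; the synthetic data and parameter choices are illustrative assumptions, not the paper's experimental setup.

```python
# Sketch: manifold embedding + standard outlier scoring (illustrative parameters).
# Assumes scikit-learn and NumPy; the data here are synthetic stand-ins.
import numpy as np
from sklearn.manifold import MDS
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
# Nominally high-dimensional observations, a few of which are pushed away from the bulk.
X = rng.normal(size=(200, 50))
X[:5] += 6.0

# Step 1: infer a low-dimensional embedding (metric MDS on Euclidean distances by default).
embedding = MDS(n_components=2, random_state=0).fit_transform(X)

# Step 2: apply a well-established outlier scorer to the embedding vectors.
lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(embedding)          # -1 marks predicted outliers
scores = -lof.negative_outlier_factor_       # larger = more outlying

print("flagged as outliers:", np.where(labels == -1)[0])
```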

This article is categorized under: Technologies > Structure Discovery and Clustering; Fundamental Concepts of Data and Knowledge > Data Concepts; Technologies > Visualization

Short-term photovoltaic power forecasting with adaptive stochastic configuration network ensemble

Structure diagram of AE-SCN.


Abstract

The volatility and intermittency of solar energy seriously restrict the development of the photovoltaic (PV) industry. Accurate forecasting of short-term PV power generation is essential for the optimal balancing and dispatch of power plants in the smart grid. This article presents a machine learning approach for analyzing the volt-ampere characteristics and influential factors in PV data. A correlation analysis is employed to discover some hidden characteristic variables. Then, an adaptive ensemble method with stochastic configuration networks as base models (AE-SCN) is proposed to construct the PV prediction model, which integrates bagging and adaptive weighted data fusion algorithms. Compared with the original SCN, an SCN ensemble (SCNE), a random vector functional-link network (RVFLN), a linear regression model, a random forest model, and an autoregressive integrated moving average (ARIMA) model, AE-SCN performs favorably in terms of prediction accuracy.
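
Stochastic configuration networks are not available in common machine learning libraries, so the sketch below illustrates only the surrounding idea of a bagged ensemble whose members are fused with error-based adaptive weights, using a scikit-learn MLPRegressor as a stand-in base model; all names, parameters, and the weighting rule are illustrative assumptions, not the AE-SCN algorithm itself.

```python
# Sketch of bagging + adaptive weighted fusion (stand-in base learners, not real SCNs).
# Assumes scikit-learn and NumPy; data, base model, and weighting rule are illustrative.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.utils import resample

def fit_weighted_bagging(X_train, y_train, X_val, y_val, n_models=5, seed=0):
    models, weights = [], []
    for i in range(n_models):
        # Bagging: each base model sees a bootstrap resample of the training data.
        Xb, yb = resample(X_train, y_train, random_state=seed + i)
        model = MLPRegressor(hidden_layer_sizes=(50,), max_iter=500,
                             random_state=seed + i).fit(Xb, yb)
        # Adaptive weight: inverse validation error (one simple possible fusion rule).
        err = np.mean((model.predict(X_val) - y_val) ** 2)
        models.append(model)
        weights.append(1.0 / (err + 1e-12))
    weights = np.asarray(weights) / np.sum(weights)
    return models, weights

def predict_ensemble(models, weights, X):
    # Weighted fusion of the base-model forecasts.
    preds = np.column_stack([m.predict(X) for m in models])
    return preds @ weights
```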

This article is categorized under: Technologies > Prediction

Gaining insights in datasets in the shade of “garbage in, garbage out” rationale: Feature space distribution fitting

Garbage in, garbage out degrades the performance of knowledge discovery, data mining, and machine learning workflows, which require optimal classifiers and sufficient datasets. The article suggests quantifying feature frequency distributions by fitting power law, log-normal, and exponential right-tail distributions.


Abstract

This article emphasizes comprehending the “Garbage In, Garbage Out” (GIGO) rationale and ensuring dataset quality in Machine Learning (ML) applications in order to achieve high and generalizable performance. An initial step should be added to an ML workflow in which researchers evaluate the insights gained by quantitative analysis of the datasets' sample and feature spaces. This study contributes towards that goal by suggesting a technique to quantify datasets in terms of their feature frequency distribution characteristics, providing a unique insight into how frequently the features occur in the available dataset samples. The technique was demonstrated on 11 benign and malign (malware) Android application datasets belonging to six academic Android mobile malware classification studies. The permissions requested by applications, such as CALL_PHONE, compose a relatively high-dimensional binary feature space. The results showed that the distributions fit well into two of the four long right-tail statistical distributions considered: log-normal, exponential, power law, and Poisson. Precisely, log-normal was the most frequently exhibited distribution, except for the two malign datasets, which fit an exponential distribution. This study also explores statistical distribution fit/unfit feature analysis, which enhances insight into the feature space. Finally, the study compiles examples of phenomena in the literature that exhibit these statistical distributions and should be considered when interpreting the fitted distributions. In conclusion, applying well-formed statistical methods provides a clear understanding of the datasets and of intra-class and inter-class differences before proceeding to feature selection and classifier model building. Feature distribution characteristics should be one of the properties analyzed beforehand.
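
As a hedged sketch of this kind of feature-frequency distribution fitting (not the authors' exact procedure), the code below fits log-normal, exponential, and Pareto (as a power-law-type) candidates to a vector of feature frequencies with SciPy and compares them with a Kolmogorov–Smirnov statistic; the synthetic frequency vector is illustrative, not a real dataset.

```python
# Sketch: fit candidate right-tailed distributions to feature frequencies (illustrative).
# Assumes SciPy and NumPy; the frequency vector here is synthetic, not real malware data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Stand-in for "how often each permission/feature occurs across samples".
feature_freqs = rng.lognormal(mean=2.0, sigma=1.0, size=500)

# Note: scipy's heavy-tailed power-law-type candidate is `pareto`; its bounded
# `powerlaw` distribution is a different object than the article's power law.
candidates = {
    "lognorm": stats.lognorm,
    "expon": stats.expon,
    "pareto": stats.pareto,
}

for name, dist in candidates.items():
    params = dist.fit(feature_freqs)                   # maximum-likelihood fit
    ks_stat, p_value = stats.kstest(feature_freqs, name, args=params)
    print(f"{name:8s} KS={ks_stat:.3f} p={p_value:.3f}")
```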

This article is categorized under: Technologies > Data Preprocessing; Technologies > Classification; Technologies > Machine Learning

A novel methodology for Arabic news classification

The proposed Arabic text classification methodology.


Abstract

Automated news classification concerns the assignment of news to one or more predefined categories. Automatically classified news helps search engines mine and categorize the type of news that a user asks for. Most researchers have focused on the classification of English news and have ignored Arabic news due to the complexity of Arabic morphology. This article presents a novel methodology for classifying Arabic news. It relies on feature extraction and the application of machine learning classifiers, namely Naive Bayes (NB), Logistic Regression (LR), Random Forest (RF), eXtreme Gradient Boosting (XGB), K-Nearest Neighbors (KNN), Stochastic Gradient Descent (SGD), Decision Tree (DT), and Multi-Layer Perceptron (MLP). The methodology is applied to the Arabic news dataset provided by Mendeley, and the classification accuracy exceeds 95%.
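
A minimal sketch of the general pipeline described above (extracted text features feeding one of the listed classifiers) using scikit-learn; the tiny Arabic document set, labels, and parameter choices are illustrative assumptions, not the paper's actual methodology or the Mendeley dataset.

```python
# Sketch: TF-IDF features + one of the listed classifiers (Naive Bayes), illustrative only.
# Assumes scikit-learn; the toy Arabic "dataset" below is a stand-in, not the Mendeley data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = [
    "فاز الفريق في المباراة النهائية",           # "The team won the final match" (sports)
    "ارتفعت أسعار النفط في الأسواق العالمية",    # "Oil prices rose in global markets" (economy)
    "سجل اللاعب هدفين في الدوري",               # "The player scored two goals in the league" (sports)
    "انخفضت قيمة العملة مقابل الدولار",          # "The currency fell against the dollar" (economy)
]
labels = ["sports", "economy", "sports", "economy"]

# TF-IDF features feed a Naive Bayes classifier, one of the models listed in the abstract.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(texts, labels)

# Predict the category of a new headline ("Gold prices fell today").
print(model.predict(["هبطت أسعار الذهب اليوم"]))
```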

This article is categorized under: Algorithmic Development > Text Mining; Technologies > Machine Learning; Technologies > Classification