Online learning for streaming data classification in nonstationary environments

Abstract

In this article, we address the classification of nonstationary streaming data. Because the full dataset can never be obtained in a streaming setting, we adopt a classification strategy based on clustering structure. Specifically, the strategy dynamically maintains clustering structures to update the model, and thereby the objective function used for classification. At the same time, incoming samples are monitored in real time to detect the emergence of new classes and the presence of outliers. The strategy can also handle concept drift, where the data distribution changes as new data arrive. For novel instances, we introduce a buffer analysis mechanism that delays their processing, which in turn improves the prediction performance of the model. During model updating, we further introduce a novel renewable strategy for the covariance matrix. Numerical simulations and experiments on datasets show that our method has significant advantages.
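
As a rough illustration of the clustering-based idea (not the authors' algorithm), the Python sketch below maintains one running centroid per class, classifies by nearest centroid, and defers far-away samples to a buffer as potential new classes or outliers; the class name, radius threshold, and buffer size are assumptions made for the example.

```python
# Minimal sketch: nearest-centroid streaming classifier with an outlier buffer.
import numpy as np
from collections import deque


class StreamingCentroidClassifier:
    """Maintain one centroid (and count) per class; buffer far-away samples."""

    def __init__(self, radius=3.0, buffer_size=50):
        self.centroids = {}      # label -> running mean vector
        self.counts = {}         # label -> number of samples absorbed
        self.radius = radius     # distance threshold for the "novel/outlier" flag
        self.buffer = deque(maxlen=buffer_size)  # deferred novel instances

    def predict(self, x):
        if not self.centroids:
            return None, np.inf
        dists = {c: np.linalg.norm(x - m) for c, m in self.centroids.items()}
        label = min(dists, key=dists.get)
        return label, dists[label]

    def partial_fit(self, x, y=None):
        label, dist = self.predict(x)
        if y is None and (label is None or dist > self.radius):
            self.buffer.append(x)          # defer: possible new class or outlier
            return label
        y = y if y is not None else label
        if y in self.centroids:            # incremental mean update (tracks slow drift)
            self.counts[y] += 1
            self.centroids[y] += (x - self.centroids[y]) / self.counts[y]
        else:
            self.centroids[y] = x.astype(float).copy()
            self.counts[y] = 1
        return y


rng = np.random.default_rng(0)
clf = StreamingCentroidClassifier()
for _ in range(200):                       # stream from two Gaussian classes
    y = rng.integers(2)
    x = rng.normal(loc=3 * y, scale=1.0, size=2)
    clf.partial_fit(x, y)
print(clf.predict(np.array([3.1, 2.9])))   # (predicted label, distance to its centroid)
```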

Error‐controlled feature selection for ultrahigh‐dimensional and highly correlated feature space using deep learning

Abstract

Deep learning has been at the center of analytics in recent years due to its impressive empirical success in analyzing complex data objects. Despite this success, most existing tools behave like black-box machines, hence the increasing interest in interpretable, reliable, and robust deep learning models applicable to a broad class of applications. Feature-selected deep learning has emerged as a promising tool in this realm. However, recent developments do not accommodate ultrahigh-dimensional and highly correlated features or high noise levels. In this article, we propose a novel screening and cleaning method, aided by deep learning, for data-adaptive multi-resolutional discovery of highly correlated predictors with a controlled error rate. Extensive empirical evaluations over a wide range of simulated scenarios and several real datasets demonstrate the effectiveness of the proposed method in achieving high power while keeping the false discovery rate at a minimum.
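
For intuition only, the sketch below implements a generic marginal-correlation screening step with a permutation-calibrated threshold as a crude stand-in for error-rate control; it omits the deep-learning cleaning step entirely, and the function name, target FDR level, and simulation setup are assumptions.

```python
# Generic screening sketch with a permutation-calibrated threshold (not the
# authors' procedure): keep features whose estimated FDR stays below a target.
import numpy as np


def screen_features(X, y, target_fdr=0.1, n_perm=20, seed=0):
    rng = np.random.default_rng(seed)
    Xc = (X - X.mean(0)) / X.std(0)
    yc = (y - y.mean()) / y.std()
    stat = np.abs(Xc.T @ yc) / len(y)              # |marginal correlation| per feature

    # Null statistics from permuted responses approximate the noise level.
    null = np.concatenate([
        np.abs(Xc.T @ rng.permutation(yc)) / len(y) for _ in range(n_perm)
    ])

    # Lower the threshold while the estimated false discovery rate stays acceptable.
    thresh = np.inf
    for t in np.sort(stat)[::-1]:
        fdr_hat = (null >= t).mean() * X.shape[1] / max((stat >= t).sum(), 1)
        if fdr_hat > target_fdr:
            break
        thresh = t
    return np.where(stat >= thresh)[0]


rng = np.random.default_rng(1)
n, p = 200, 2000
X = rng.normal(size=(n, p))
y = X[:, :5] @ np.ones(5) + rng.normal(size=n)     # only the first 5 features matter
print(screen_features(X, y))
```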

Marginal clustered multistate models for longitudinal progressive processes with informative cluster size

Abstract

Informative cluster size (ICS) is a phenomenon where the cluster size is related to the outcome. While multistate models can be applied to characterize the unit-level transition process for clustered interval-censored data, there is a research gap in addressing ICS within this framework. We propose two extensions of the multistate model that account for ICS to make marginal inference: one incorporating within-cluster resampling and another constructing cluster-weighted score functions. We evaluate the performance of the proposed methods through simulation studies and apply them to the Veterans Affairs Dental Longitudinal Study (VADLS) to understand the effect of risk factors on periodontal disease progression. ICS occurs frequently in dental data, particularly in the study of periodontal disease, as people with fewer teeth due to the disease are more susceptible to disease progression. According to the simulation results, the mean estimates of the parameters obtained from the proposed methods are close to the true values, whereas methods that ignore ICS can lead to substantial bias. Our proposed methods for the clustered multistate model are able to appropriately take ICS into account when making marginal inference about a typical unit from a randomly sampled cluster.
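
The toy sketch below illustrates the two generic ideas named in the abstract, within-cluster resampling and cluster-size weighting, on the much simpler task of estimating a marginal mean under informative cluster size; it is not the proposed multistate methodology, and the simulated data-generating mechanism is an assumption.

```python
# Toy illustration of handling informative cluster size (ICS) for a marginal mean.
import numpy as np

rng = np.random.default_rng(0)

# Clustered data in which smaller clusters tend to have larger outcomes (ICS).
clusters = []
for _ in range(300):
    n_i = rng.integers(1, 9)
    clusters.append(rng.normal(loc=5.0 - 0.3 * n_i, scale=1.0, size=n_i))

# Naive pooled mean ignores ICS and is pulled toward large clusters.
naive = np.mean(np.concatenate(clusters))

# (a) Within-cluster resampling: repeatedly draw one unit per cluster and average.
wcr = np.mean([
    np.mean([rng.choice(c) for c in clusters]) for _ in range(500)
])

# (b) Cluster-weighted estimator: weight each unit by 1 / cluster size.
weights = np.concatenate([np.full(len(c), 1.0 / len(c)) for c in clusters])
values = np.concatenate(clusters)
weighted = np.sum(weights * values) / np.sum(weights)

print(f"naive={naive:.3f}  resampled={wcr:.3f}  cluster-weighted={weighted:.3f}")
```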

A novel two‐step extrapolation‐insertion risk model based on the Expectile under the Pareto‐type distribution

Abstract

Developing a catastrophe loss model is a challenging problem in the insurance industry. In the context of Pareto-type distributions, measuring risk at the extreme right tail has become a major focus of academic research. The quantile and Expectile of a distribution are useful descriptors of its tail, in the same way as the median and mean describe its central behavior. In this article, a novel two-step extrapolation-insertion method is introduced by modifying the existing far-right tail numerical model using the risk measures of Expectile and Expected Shortfall (ES), and its advantages of lower bias and variance are proved theoretically through asymptotic normality. In addition, another solution for obtaining the ES is proposed based on the fitted extreme distribution, which is demonstrated to have superior unbiasedness properties. Combining these two methods provides numerical upper and lower interval bounds that capture the true quantile-based ES commonly used in insurance. The numerical simulation and the empirical analysis of Danish reinsurance claim data indicate that these methods offer high prediction accuracy in catastrophe risk management applications.
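
For orientation, the sketch below computes a plain empirical tau-Expectile (via the standard asymmetric least-squares fixed point) and an empirical Expected Shortfall on simulated Pareto-type losses; it does not implement the two-step extrapolation-insertion estimator, and the function names and the tail level 0.99 are assumptions.

```python
# Empirical tau-expectile and Expected Shortfall on heavy-tailed losses (illustrative only).
import numpy as np


def expectile(x, tau, n_iter=200):
    """Solve tau*E[(X-e)+] = (1-tau)*E[(e-X)+] by fixed-point iteration."""
    e = np.mean(x)
    for _ in range(n_iter):
        w = np.where(x > e, tau, 1.0 - tau)        # asymmetric least-squares weights
        e = np.sum(w * x) / np.sum(w)
    return e


def expected_shortfall(x, p):
    """Average loss beyond the empirical p-quantile (VaR)."""
    var_p = np.quantile(x, p)
    return x[x >= var_p].mean()


rng = np.random.default_rng(0)
losses = rng.pareto(2.5, size=100_000) + 1.0       # classical Pareto losses (alpha = 2.5)
print("0.99-expectile:", round(expectile(losses, 0.99), 3))
print("0.99-ES:       ", round(expected_shortfall(losses, 0.99), 3))
```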

Bayesian inference for nonprobability samples with nonignorable missingness

Abstract

Nonprobability samples, especially web survey data, have become available in many different fields. However, nonprobability samples suffer from selection bias, which yields biased estimates. Moreover, missingness, especially nonignorable missingness, may also be encountered in nonprobability samples. Thus, it is a challenging task to make inference from nonprobability samples with nonignorable missingness. In this article, we propose a Bayesian approach to infer the population based on nonprobability samples with nonignorable missingness. In our method, separate logistic regression models are employed to estimate the selection probabilities and the response probabilities, and a superpopulation model is used to explain the relationship between the study variable and covariates. Further, Bayesian and approximate Bayesian methods are proposed to estimate the response model parameters and the superpopulation model parameters, respectively. Specifically, the estimating functions for the response model parameters and superpopulation model parameters are used to derive the approximate posterior distribution in superpopulation model estimation. Simulation studies are conducted to investigate the finite-sample performance of the proposed method. Data from the Pew Research Center and the Behavioral Risk Factor Surveillance System are used to show the better performance of our proposed method over other approaches.
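
As a simplified, non-Bayesian analogue of the two weighting ingredients, the sketch below fits one logistic model for selection (using a reference probability sample) and one for response, then forms an inverse-probability-weighted estimate; it treats the missingness as covariate-driven (ignorable) purely for illustration, and all sample sizes and model forms are assumptions.

```python
# Illustrative inverse-probability weighting for a biased nonprobability sample
# with missing outcomes (a crude frequentist stand-in, not the Bayesian method).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Finite population with covariate x and outcome y.
N = 50_000
x = rng.normal(size=N)
y = 2.0 + 1.5 * x + rng.normal(size=N)

# Nonprobability sample: inclusion probability depends on x (selection bias).
sel = rng.random(N) < 1 / (1 + np.exp(-(x - 1.0)))
x_np, y_np = x[sel], y[sel]

# Missingness within the nonprobability sample also depends on x.
obs = rng.random(sel.sum()) < 1 / (1 + np.exp(-(0.5 + 0.8 * x_np)))

# Reference probability sample supplies the population covariate distribution.
x_ref = x[rng.choice(N, size=2000, replace=False)]

# Selection propensities from a logistic fit on the stacked samples.
X_stack = np.r_[x_np, x_ref].reshape(-1, 1)
z_stack = np.r_[np.ones(len(x_np)), np.zeros(len(x_ref))]
p_sel = LogisticRegression().fit(X_stack, z_stack).predict_proba(
    x_np.reshape(-1, 1))[:, 1]

# Response propensities from a logistic fit within the nonprobability sample.
p_obs = LogisticRegression().fit(x_np.reshape(-1, 1), obs).predict_proba(
    x_np.reshape(-1, 1))[:, 1]

w = (1 - p_sel) / p_sel / p_obs                    # combined inverse weights (up to a constant)
est = np.sum(w[obs] * y_np[obs]) / np.sum(w[obs])  # Hajek-type ratio estimate
print(f"naive={y_np[obs].mean():.3f}  weighted={est:.3f}  truth={y.mean():.3f}")
```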

Modeling matrix variate time series via hidden Markov models with skewed emissions

Abstract

Data collected today are increasingly complex and often cannot be analyzed with standard statistical methods. Matrix variate time series data, in which each observation in the series is a matrix, are one such example. Herein, we introduce a set of three hidden Markov models using skewed matrix variate emission distributions for modeling matrix variate time series data. Compared to the hidden Markov model with matrix variate normal emissions, the proposed models offer greater flexibility and are capable of modeling skewness in time series data. Parameter estimation is performed using an expectation-maximization algorithm. We then apply the models to both simulated data and salary data for public Texas universities.
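
The sketch below evaluates the log-likelihood of a matrix variate series under a hidden Markov model with matrix variate normal emissions, the symmetric special case mentioned in the abstract rather than the skewed emissions proposed; it relies on the identity vec(X) ~ N(vec(M), V ⊗ U) together with the forward algorithm, and the dimensions and parameter values are assumptions.

```python
# HMM log-likelihood for a matrix variate series with matrix normal emissions.
import numpy as np
from scipy.stats import multivariate_normal


def matrix_normal_logpdf(X, M, U, V):
    """Log density of MN(M, U, V) via column-stacked vectorization."""
    return multivariate_normal.logpdf(
        X.flatten(order="F"), M.flatten(order="F"), np.kron(V, U))


def hmm_loglik(series, pi, A, params):
    """Forward algorithm in log space; params[k] = (M_k, U_k, V_k)."""
    K = len(pi)
    logB = np.array([[matrix_normal_logpdf(X, *params[k]) for k in range(K)]
                     for X in series])                      # T x K emission log-densities
    alpha = np.log(pi) + logB[0]
    for t in range(1, len(series)):
        trans = alpha[:, None] + np.log(A)                  # (i, j): alpha_i + log A[i, j]
        alpha = np.logaddexp.reduce(trans, axis=0) + logB[t]
    return np.logaddexp.reduce(alpha)


rng = np.random.default_rng(0)
p, q, K = 3, 2, 2
params = [(np.full((p, q), float(k)), np.eye(p), np.eye(q)) for k in range(K)]
series = [rng.normal(loc=t % 2, size=(p, q)) for t in range(20)]
pi = np.array([0.5, 0.5])
A = np.array([[0.9, 0.1], [0.1, 0.9]])
print("log-likelihood:", hmm_loglik(series, pi, A, params))
```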

Rarity updated ensemble with oversampling: An ensemble approach to classification of imbalanced data streams

Abstract

Today's ever-increasing generation of streaming data demands novel data mining approaches tailored to mining dynamic data streams. Data streams are non-static in nature, continuously generated, and endless. They often suffer from class imbalance and undergo temporal drift. To address the classification of consecutive data instances within imbalanced data streams, this research introduces a new ensemble classification algorithm called Rarity Updated Ensemble with Oversampling (RUEO). The RUEO approach is specifically designed to be robust against class imbalance by incorporating an imbalance-specific criterion to assess the efficacy of the base classifiers and by employing an oversampling technique to reduce the imbalance in the training data. The RUEO algorithm was evaluated on a set of 20 data streams and compared against 14 baseline algorithms. The proposed RUEO algorithm achieves an average-accuracy of 0.69 on the real-world data streams, while the chunk-based algorithms AWE, AUE, and KUE achieve average-accuracies of 0.48, 0.65, and 0.66, respectively. The statistical analysis, conducted using the Wilcoxon test, reveals a statistically significant improvement in average-accuracy for the proposed RUEO algorithm when compared to 12 out of the 14 baseline algorithms. The source code and experimental results of this research work will be publicly available at https://github.com/vkiani/RUEO.
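
To make the chunk-based ensemble idea concrete, the sketch below trains one decision tree per chunk on randomly oversampled data and weights ensemble members by their balanced accuracy on the newest chunk; this is a generic illustration, not RUEO's actual criterion or oversampling scheme, and the class name, chunk sizes, and base learner are assumptions.

```python
# Generic chunk-based ensemble with random oversampling and imbalance-aware weights.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import balanced_accuracy_score


class ChunkEnsemble:
    def __init__(self, max_members=10, seed=0):
        self.members = []                 # list of (classifier, weight) pairs
        self.max_members = max_members
        self.rng = np.random.default_rng(seed)

    def _oversample(self, X, y):
        """Resample every class with replacement up to the majority-class count."""
        classes, counts = np.unique(y, return_counts=True)
        n_max = counts.max()
        idx = np.concatenate([
            self.rng.choice(np.where(y == c)[0], size=n_max, replace=True)
            for c in classes
        ])
        return X[idx], y[idx]

    def partial_fit(self, X, y):
        # Re-weight existing members by their balanced accuracy on the new chunk.
        self.members = [(m, balanced_accuracy_score(y, m.predict(X)))
                        for m, _ in self.members]
        Xo, yo = self._oversample(X, y)
        new = DecisionTreeClassifier(max_depth=5).fit(Xo, yo)
        self.members.append((new, 1.0))
        self.members = sorted(self.members, key=lambda t: -t[1])[:self.max_members]

    def predict(self, X):
        votes = np.zeros(len(X))
        for m, w in self.members:
            votes += w * m.predict(X)     # assumes binary labels 0/1 in every chunk
        total = sum(w for _, w in self.members)
        return (votes >= total / 2).astype(int)


rng = np.random.default_rng(1)
ens = ChunkEnsemble()
for _ in range(5):                        # stream of imbalanced chunks (~10% positives)
    X = rng.normal(size=(500, 4))
    y = (X[:, 0] + rng.normal(scale=2.0, size=500) > 3.0).astype(int)
    ens.partial_fit(X, y)
print("positive rate in predictions:", ens.predict(X).mean())
```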

Subsampling under distributional constraints

Abstract

Complex models are frequently employed to describe physical and mechanical phenomena. In this setting, we have an input X in a general space and an output Y = f(X), where f is a very complicated function whose evaluation at each new input is computationally very expensive. We are given two sets of observations of X, S1 and S2, of different sizes, such that only f(S1) is available. We tackle the problem of selecting a subset S3 ⊂ S2 of smaller size on which to run the complex model f, such that the empirical distribution of f(S3) is close to that of f(S1). We suggest three algorithms to solve this problem and show their efficiency using simulated datasets and the Airfoil self-noise data set.
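
One simple way to approach this kind of selection, sketched below, is greedy kernel herding: points of S2 are chosen so that the selected subset matches S1 under a Gaussian-kernel maximum mean discrepancy in input space, as a proxy for matching f(S3) to f(S1) without evaluating f. This is not one of the paper's three algorithms, and the kernel bandwidth and subset size are assumptions.

```python
# Greedy kernel-herding selection of a subset of S2 that mimics the distribution of S1.
import numpy as np


def greedy_mmd_subset(S1, S2, m, bandwidth=1.0):
    k = lambda A, B: np.exp(-np.sum((A[:, None, :] - B[None, :, :]) ** 2, -1)
                            / (2 * bandwidth ** 2))
    target = k(S2, S1).mean(axis=1)       # mean kernel similarity to the S1 cloud
    K22 = k(S2, S2)
    chosen = []
    for _ in range(m):
        if chosen:
            # Attraction to S1 minus repulsion from points already chosen.
            score = target - K22[:, chosen].sum(axis=1) / (len(chosen) + 1)
            score[chosen] = -np.inf       # never pick the same point twice
        else:
            score = target.copy()
        chosen.append(int(np.argmax(score)))
    return np.array(chosen)


rng = np.random.default_rng(0)
S1 = rng.normal(loc=2.0, size=(300, 2))              # distribution we want to match
S2 = rng.normal(loc=0.0, scale=2.0, size=(2000, 2))  # pool to subsample from
idx = greedy_mmd_subset(S1, S2, m=100)
print("selected subset mean:", S2[idx].mean(axis=0))  # concentrates near the S1 cloud around (2, 2)
```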

A deep learning approach for the comparison of handwritten documents using latent feature vectors

Abstract

Forensic questioned document examiners still largely rely on visual assessments and expert judgment to determine the provenance of a handwritten document. Here, we propose a novel approach to objectively compare two handwritten documents using a deep learning algorithm. First, we implement a bootstrapping technique to segment document data into smaller units, as a means to enhance the efficiency of the deep learning process. Next, we use a transfer learning algorithm to systematically extract document features. The unique characteristics of the document data are then represented as latent vectors. Finally, the similarity between two handwritten documents is quantified via the cosine similarity between their latent vectors. We illustrate the use of the proposed method by applying it to a variety of collections of handwritten documents with different attributes, and show that, in most cases, we can accurately classify pairs of documents as written by the same author or by different authors.
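
The sketch below shows only the final comparison step: given patch-level features (the extract_patch_features placeholder stands in for whatever transfer-learned network is used), each document is summarized by a mean latent vector and two documents are compared via cosine similarity; the segmentation, bootstrapping, and actual feature extractor are omitted.

```python
# Document comparison via cosine similarity of aggregated latent vectors (illustrative only).
import numpy as np


def extract_patch_features(patches):
    """Placeholder for a transfer-learned feature extractor (e.g., a CNN backbone)."""
    return np.array([p.flatten() for p in patches], dtype=float)


def document_latent_vector(patches):
    feats = extract_patch_features(patches)
    return feats.mean(axis=0)                      # aggregate patch features into one vector


def cosine_similarity(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


rng = np.random.default_rng(0)
doc_a = [rng.random((32, 32)) for _ in range(20)]   # segmented writing patches
doc_b = [rng.random((32, 32)) for _ in range(20)]
sim = cosine_similarity(document_latent_vector(doc_a), document_latent_vector(doc_b))
print(f"cosine similarity: {sim:.3f}")              # thresholded to call same/different writer
```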

An automated alignment algorithm for identification of the source of footwear impressions with common class characteristics

Abstract

We introduce an algorithmic approach for comparing similar shoeprint images with automated alignment. Our method employs the Iterative Closest Point (ICP) algorithm to attain optimal alignment, further enhancing precision through phase-only correlation. Using diverse metrics to quantify similarity, we train a random forest model to predict the empirical probability that two impressions originate from the same shoe. Experimental evaluations on high-quality two-dimensional shoeprints demonstrate the proposed algorithm's robustness to dissimilarities between impressions from the same shoe, outperforming existing approaches.
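
As a minimal sketch of the alignment step only, the code below runs a bare-bones 2D ICP loop (nearest-neighbour correspondences plus a Kabsch rigid fit) on synthetic point sets; the phase-only correlation refinement and the random forest similarity model are omitted, and the point-set sizes and perturbation are assumptions.

```python
# Bare-bones 2D ICP: nearest-neighbour correspondences + least-squares rigid transform.
import numpy as np
from scipy.spatial import cKDTree


def best_rigid_transform(A, B):
    """Least-squares rotation R and translation t mapping points A onto B (Kabsch)."""
    ca, cb = A.mean(axis=0), B.mean(axis=0)
    H = (A - ca).T @ (B - cb)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, d]) @ U.T
    return R, cb - R @ ca


def icp(source, target, n_iter=30):
    tree = cKDTree(target)
    src = source.copy()
    for _ in range(n_iter):
        _, idx = tree.query(src)          # nearest target point for each source point
        R, t = best_rigid_transform(src, target[idx])
        src = src @ R.T + t
    return src


def mean_nn_dist(A, B):
    return cKDTree(B).query(A)[0].mean()


rng = np.random.default_rng(0)
target = rng.random((400, 2))             # reference impression points
theta = 0.2
rot = np.array([[np.cos(theta), -np.sin(theta)], [np.sin(theta), np.cos(theta)]])
source = target @ rot.T + np.array([0.1, -0.05]) + rng.normal(scale=0.005, size=(400, 2))
aligned = icp(source, target)
print("mean NN distance before:", round(mean_nn_dist(source, target), 4))
print("mean NN distance after: ", round(mean_nn_dist(aligned, target), 4))
```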