Bayesian relative composite quantile regression approach for the ordinal latent regression model with $L_{1/2}$ regularization

Abstract

Ordinal data frequently occur in fields such as knowledge-level assessment, credit rating, clinical disease diagnosis, and psychological evaluation. Classic models, including cumulative logistic regression and probit regression, are often used for such ordinal data. However, these approaches model the conditional mean of the response given a set of predictors, which often yields non-robust estimates. As an appealing alternative, composite quantile regression (CQR) is commonly employed to obtain more robust and relatively efficient results. In this paper, we propose a Bayesian CQR modeling approach for the ordinal latent regression model. To overcome the identifiability problem of the considered model and obtain more robust estimates, we advocate using the Bayesian relative CQR approach to estimate the regression parameters. Additionally, obtaining a parsimonious model that retains only the important covariates is highly desirable in regression modeling, so we incorporate a Bayesian $L_{1/2}$ penalty into the ordinal latent CQR model to conduct parameter estimation and variable selection simultaneously. Finally, the proposed approach is illustrated by Monte Carlo simulations and a real data application, both of which show that it performs well for ordinal regression models.
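For orientation, here is a minimal sketch of the kind of penalized objective involved, assuming the standard CQR check-loss formulation of Zou and Yuan; the Bayesian treatment typically encodes this loss through an asymmetric Laplace working likelihood, and the paper's exact formulation may differ:

```latex
% Composite quantile regression with an L_{1/2} penalty (illustrative form).
% \rho_\tau is the check loss; \tau_1 < \cdots < \tau_K are fixed quantile levels.
\[
\rho_{\tau}(u) = u\,\bigl(\tau - \mathbb{I}(u < 0)\bigr), \qquad
\min_{b_1,\dots,b_K,\;\beta}\ \sum_{k=1}^{K}\sum_{i=1}^{n}
\rho_{\tau_k}\!\bigl(y_i - b_k - x_i^{\top}\beta\bigr)
\;+\; \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert^{1/2}
\]
```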

Transfer learning under the Cox model with interval‐censored data

Abstract

Transfer learning, which focuses on information borrowing to address limited sample sizes, has gained increasing attention in recent years. Our method aims to utilize data from other population groups as a complement to enhance risk factor discernment and failure time prediction in underrepresented subgroups. However, a gap remains in the literature on effective knowledge transfer from source to target for risk assessment with interval-censored data while accommodating population incomparability and privacy constraints. Our objective is to bridge this gap by developing a transfer learning approach under the Cox proportional hazards model. We introduce the tuning-free Trans-Cox-MIC algorithm, which enables adaptable information sharing in both regression coefficients and baseline hazards while ensuring computational efficiency. Our approach accommodates covariate distribution shifts, coefficient variations, and baseline hazard discrepancies. Extensive simulations showcase the method's accuracy, robustness, and efficiency. Application to prostate cancer screening data demonstrates enhanced risk estimation precision and predictive performance in the African American population.
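As a schematic of the information-borrowing idea in our own notation (not the paper's; Trans-Cox-MIC is tuning-free and also shares baseline hazard information, which this sketch omits), a typical transfer estimator fits the target data while shrinking toward source estimates:

```latex
% Cox proportional hazards with coefficient transfer (illustrative).
% \ell_{\mathrm{target}} is the target-data (interval-censored) log-likelihood;
% \hat\beta_{\mathrm{src}} is the estimate obtained from the source population.
\[
\lambda(t \mid x) = \lambda_0(t)\, e^{x^{\top}\beta}, \qquad
\hat{\beta} = \arg\min_{\beta}\ \Bigl\{ -\ell_{\mathrm{target}}(\beta, \lambda_0)
\;+\; \eta\, \bigl\lVert \beta - \hat{\beta}_{\mathrm{src}} \bigr\rVert_1 \Bigr\}
\]
```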

A treeless absolutely random forest with closed‐form estimators of expected proximities

Abstract

We introduce a simple variant of a purely random forest, called an absolutely random forest (ARF), used for clustering. At every node, the split of units is determined by a randomly chosen feature and a random threshold drawn from a uniform distribution whose support, the range of the selected feature in the root node, does not change. This enables closed-form estimators of parameters, such as pairwise proximities, to be obtained without having to grow a forest. The probabilistic structure corresponding to an ARF is called a treeless absolutely random forest (TARF). With high probability, the algorithm splits units whose feature vectors are far apart and keeps together units whose feature vectors are similar; thus, the underlying structure of the data drives the growth of the tree. The expected value of pairwise proximities is obtained for three pathway functions. One, the completely common pathway function, is an indicator of whether a pair of units follows the same path from the root to the leaf node. The properties of TARF-based proximity estimators for clustering and classification are compared with those of other methods on eight real-world datasets and in simulations. Results show substantial gains in performance and computational efficiency that are of particular value for large datasets.
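A minimal sketch of the ARF splitting rule as described above (a hypothetical implementation; the function name and details are ours, and TARF itself replaces the simulation at the end with a closed form):

```python
import numpy as np

def arf_tree_paths(X, depth, rng):
    """One absolutely random tree: at every node a feature is chosen at
    random and a threshold is drawn uniformly from that feature's range in
    the ROOT node (the support never changes as the tree grows). Returns
    each unit's root-to-leaf path as a bit string."""
    n, p = X.shape
    lo, hi = X.min(axis=0), X.max(axis=0)    # root-node ranges, fixed throughout
    paths = [[] for _ in range(n)]
    nodes = [np.arange(n)]
    for _ in range(depth):
        nxt = []
        for idx in nodes:
            if idx.size == 0:
                continue
            j = rng.integers(p)              # random feature
            t = rng.uniform(lo[j], hi[j])    # random threshold on root support
            left, right = idx[X[idx, j] <= t], idx[X[idx, j] > t]
            for i in left:
                paths[i].append("0")
            for i in right:
                paths[i].append("1")
            nxt += [left, right]
        nodes = nxt
    return ["".join(pth) for pth in paths]

# Monte Carlo check of the completely common pathway proximity of units 0 and 1:
# the fraction of trees in which both follow the same root-to-leaf path. TARF's
# point is that this expectation is available in closed form, so the simulation
# is only a sanity check, not the method itself.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))
paths = [arf_tree_paths(X, depth=4, rng=rng) for _ in range(500)]
print(np.mean([p[0] == p[1] for p in paths]))
```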

Bayesian shrinkage models for integration and analysis of multiplatform high‐dimensional genomics data

Abstract

With the increasing availability of biomedical data from multiple platforms for the same patients in clinical research, such as epigenomics, gene expression, and clinical features, there is a growing need for statistical methods that can jointly analyze data from different platforms to provide complementary information for clinical studies. In this paper, we propose a two-stage hierarchical Bayesian model that integrates high-dimensional biomedical data from diverse platforms to select biomarkers associated with clinical outcomes of interest. In the first stage, we use an expectation-maximization (EM)-based approach to learn the regulating mechanism between epigenomics (e.g., gene methylation) and gene expression while considering functional gene annotations. In the second stage, we group genes based on the regulating mechanism learned in the first stage and then apply a group-wise penalty to select genes significantly associated with clinical outcomes while incorporating clinical features. Simulation studies suggest that our model-based data integration method yields fewer false positives in selecting predictive variables than an existing method. Moreover, real data analysis of a glioblastoma (GBM) dataset reveals our method's potential to detect genes associated with GBM survival with higher accuracy than the existing method, and most of the selected biomarkers are crucial in GBM prognosis, as confirmed by the existing literature.
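As a rough illustration of the second-stage selection in our own notation (the paper's hierarchical Bayesian shrinkage prior is more elaborate than this group-lasso-style form), a group-wise penalty over the gene groups $g = 1, \dots, G$ learned in stage one looks generically like:

```latex
% Group-penalized regression (illustrative, group-lasso-style form).
% \beta_g collects the coefficients of genes in group g; p_g is the group size.
\[
\min_{\beta}\ \Bigl\{ -\ell\bigl(\beta \mid \text{outcomes, clinical features}\bigr)
\;+\; \lambda \sum_{g=1}^{G} \sqrt{p_g}\, \lVert \beta_g \rVert_2 \Bigr\}
\]
```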

Randomized multiarm bandits: An improved adaptive data collection method

Abstract

In many scientific experiments, multiarmed bandits (MABs) are used as an adaptive data collection method. However, this adaptive process can induce dependence that renders many commonly used statistical inference methods invalid. An example is the sample mean: although a natural estimator of the mean parameter, it can be biased. Test statistics based on this estimator can then have an inflated type I error rate, and the resulting confidence intervals may have significantly lower coverage probabilities than their nominal values. To address this issue, we propose an alternative approach called randomized multiarm bandits (rMAB), which combines a randomization step with a chosen MAB algorithm; by selecting the randomization probability appropriately, optimal regret can be achieved asymptotically. Numerical evidence shows that the bias of the sample mean based on the rMAB is much smaller than that of other methods, and the resulting test statistics and confidence intervals also perform much better than their competitors.
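A minimal sketch of the randomization idea as we read it (hypothetical code; the paper's choice of randomization probability and of base algorithm may differ). With probability eps(t) an arm is drawn uniformly at random; otherwise the base MAB algorithm, here UCB1, chooses:

```python
import numpy as np

def rmab_ucb(arms, horizon, eps=lambda t: t ** -0.5, seed=0):
    """Randomized MAB sketch: with probability eps(t) pull an arm uniformly
    at random, otherwise follow a base MAB algorithm (UCB1 here). eps(t)
    trades off regret against the bias of the per-arm sample means."""
    rng = np.random.default_rng(seed)
    K = len(arms)
    counts, sums = np.zeros(K), np.zeros(K)
    for t in range(1, horizon + 1):
        if t <= K:
            arm = t - 1                          # play each arm once first
        elif rng.uniform() < eps(t):
            arm = int(rng.integers(K))           # randomization step
        else:
            ucb = sums / counts + np.sqrt(2 * np.log(t) / counts)
            arm = int(np.argmax(ucb))            # base algorithm: UCB1
        counts[arm] += 1
        sums[arm] += arms[arm](rng)
    return sums / counts                         # per-arm sample means

# Two Gaussian arms; purely adaptive collection biases the losing arm's
# sample mean, and the uniform randomization mitigates this.
arms = [lambda g: g.normal(0.5, 1.0), lambda g: g.normal(0.6, 1.0)]
print(rmab_ucb(arms, horizon=2000))
```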

Compositional variable selection in quantile regression for microbiome data with false discovery rate control

Abstract

Advancement in high-throughput sequencing technologies has stimulated intensive research interest in identifying specific microbial taxa associated with disease conditions. Such knowledge is invaluable both for understanding biology and for therapeutic development, as the microbiome is inherently modifiable. Despite the availability of massive data, analysis of microbiome compositional data remains difficult. The fact that the relative abundances of all components of a microbial community sum to one poses challenges for statistical analysis, especially in high-dimensional settings, where a common research theme is to select a small fraction of signals from among many noisy features. Motivated by studies examining the role of the microbiome in host transcriptomics, we propose a novel approach to identify microbial taxa that are associated with host gene expression. Besides accommodating the compositional nature of microbiome data, our method achieves variable selection with false discovery rate (FDR) control and captures heterogeneity due to either heteroscedastic variance or non-location-scale covariate effects displayed in the motivating dataset. We demonstrate the superior performance of our method in extensive numerical simulation studies and then apply it to real-world microbiome data to gain novel biological insights that are missed by traditional mean-based linear regression analysis.
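Schematically, in our notation (the paper's FDR-control machinery is not shown here), compositional covariates are commonly handled by a log-contrast quantile regression with a zero-sum constraint on the coefficients:

```latex
% Log-contrast quantile regression for compositional covariates (illustrative).
% z_{ij} = \log x_{ij} for relative abundances x_{ij}; \rho_\tau is the check loss.
\[
\min_{\beta}\ \sum_{i=1}^{n} \rho_{\tau}\!\Bigl(y_i - \sum_{j=1}^{p} z_{ij}\beta_j\Bigr)
\quad \text{subject to} \quad \sum_{j=1}^{p} \beta_j = 0
\]
```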

Smart data augmentation: One equation is all you need

Abstract

Class imbalance is a common and critical challenge in machine-learning classification problems, often resulting in low prediction accuracy. While numerous methods, especially data augmentation methods, have been proposed to address this issue, a method that works well on one dataset may perform poorly on another; to the best of our knowledge, there is still no single best approach for handling class imbalance that can be applied uniformly. In this paper, we propose an approach named smart data augmentation (SDA), which aims to augment imbalanced data in an optimal way to maximize downstream classification accuracy. The key novelty of SDA is a single equation that yields an augmentation method providing a unified representation of existing sampling methods for handling multi-level class imbalance and allowing easy fine-tuning. Under this framework, SDA can be seen as a generalization of traditional methods, which in turn can be viewed as special cases of SDA. Empirical results on a wide range of datasets demonstrate that SDA can significantly improve the performance of popular classifiers such as random forests, multi-layer perceptrons, and histogram-based gradient boosting.
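The paper's unifying equation is not reproduced in the abstract. Purely as a hypothetical illustration of how a single tunable rule can interpolate between familiar sampling schemes (this construction is ours, not SDA's):

```python
import numpy as np

def resample_targets(class_counts, gamma):
    """Hypothetical interpolation between no resampling (gamma=0) and full
    balancing to the majority class (gamma=1); intermediate gamma gives
    partial balancing. An illustration only, not SDA's actual equation."""
    counts = np.asarray(class_counts, dtype=float)
    return np.round((1 - gamma) * counts + gamma * counts.max()).astype(int)

print(resample_targets([1000, 120, 30], gamma=0.0))   # [1000  120   30]
print(resample_targets([1000, 120, 30], gamma=0.5))   # [1000  560  515]
print(resample_targets([1000, 120, 30], gamma=1.0))   # [1000 1000 1000]
```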

The finite mixture model for the tails of the distribution: Monte Carlo experiment and empirical applications

Abstract

The finite mixture model (FMM) estimates distinct regression coefficients in each of the different groups of the dataset, with the groups determined endogenously by the estimator. Here the analysis is extended beyond the mean by estimating the model in the tails of the conditional distribution of the dependent variable within each group. While the clustering reduces overall heterogeneity, since the model is estimated on groups of similar observations, the analysis in the tails uncovers within-group heterogeneity and/or skewness. Integrating the endogenously determined clustering with quantile regression within each group enhances the finite mixture model and focuses on the tail behavior of the conditional distribution of the dependent variable. A Monte Carlo experiment and two empirical applications conclude the analysis. In the well-known birthweight dataset, the finite mixture model identifies and computes the regression coefficients of different groups, each with its own characteristics, both at the mean and in the tails. In the family expenditure data, the analysis of within- and between-group heterogeneity provides interesting economic insights on price elasticities. The analysis in classes proves to be more efficient than the model estimated without clustering. Extending the finite mixture approach to the tails provides a more accurate investigation of the data, introducing a robust tool to unveil sources of within-group heterogeneity and asymmetry otherwise left undetected, and it improves efficiency and explanatory power relative to the standard OLS-based FMM.
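In schematic form, using our own notation: the FMM first partitions observations via component densities, and quantile regressions are then estimated within each component (the paper's estimation details may differ):

```latex
% Finite mixture of regressions, then within-group quantile regression (illustrative).
% \pi_k are mixing proportions; \rho_\tau is the check loss at quantile level \tau.
\[
f(y_i \mid x_i) = \sum_{k=1}^{K} \pi_k\, f_k\!\bigl(y_i \mid x_i^{\top}\beta_k, \sigma_k\bigr),
\qquad
\hat{\beta}_k(\tau) = \arg\min_{\beta}\ \sum_{i \in \text{group } k}
\rho_{\tau}\!\bigl(y_i - x_i^{\top}\beta\bigr)
\]
```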

eRPCA: Robust Principal Component Analysis for Exponential Family Distributions

Abstract

Robust principal component analysis (RPCA) is a widely used method for recovering low-rank structure from data matrices corrupted by significant and sparse outliers. These corruptions may arise from occlusions, malicious tampering, or other causes of anomalies, and the joint identification of such corruptions with the low-rank background is critical for process monitoring and diagnosis. However, existing RPCA methods and their extensions largely do not account for the underlying probabilistic distribution of the data matrices, which in many applications is known and can be highly non-Gaussian. We thus propose a new method called RPCA for exponential family distributions (eRPCA), which can perform the desired decomposition into low-rank and sparse matrices when such a distribution falls within the exponential family. We present a novel alternating direction method of multipliers optimization algorithm for efficient eRPCA decomposition, under either its natural or canonical parametrization. The effectiveness of eRPCA is then demonstrated in two applications: the first to steel sheet defect detection and the second to crime activity monitoring in the Atlanta metropolitan area.
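Schematically, in our notation and following the usual RPCA template (the paper's exact objective and penalties may differ): the natural parameter matrix of the exponential family likelihood is decomposed into low-rank plus sparse parts,

```latex
% RPCA under an exponential family likelihood (illustrative form).
% A(\cdot) is the log-partition function applied entrywise; natural parameter
% \Theta = L + S, with L low-rank (nuclear norm) and S sparse (entrywise L1).
\[
\min_{L,\,S}\ \sum_{i,j} \Bigl[ A\bigl(L_{ij} + S_{ij}\bigr)
- Y_{ij}\bigl(L_{ij} + S_{ij}\bigr) \Bigr]
\;+\; \lambda_L \lVert L \rVert_{*} \;+\; \lambda_S \lVert S \rVert_{1}
\]
```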

Non‐uniform active learning for Gaussian process models with applications to trajectory informed aerodynamic databases

Abstract

The ability to non-uniformly weight the input space is desirable for many applications and has been explored for space-filling approaches. Increased interest in linking models, such as in a digital twinning framework, increases the need to sample emulators where they are most likely to be evaluated. In particular, we apply non-uniform sampling methods to the construction of aerodynamic databases. This paper combines non-uniform weighting with active learning for Gaussian processes (GPs) to develop a closed-form solution to a non-uniform active learning criterion. We accomplish this by using a kernel density estimator as the weight function. We demonstrate the need for and efficacy of this approach with an atmospheric entry example that accounts for both model uncertainty and the practical state space of the vehicle, as determined by forward modeling within the active learning loop.
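A minimal sketch of density-weighted active learning under our reading (hypothetical code; the paper derives a closed form for its weighted criterion, whereas here the criterion is simply evaluated on a candidate grid, and the simulator and trajectory model are toy stand-ins):

```python
import numpy as np
from scipy.stats import gaussian_kde
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Toy 1-D emulation problem: f plays the role of the expensive simulator.
rng = np.random.default_rng(1)
f = lambda x: np.sin(3 * x) + 0.5 * x
X = rng.uniform(0, 3, size=(5, 1))
y = f(X).ravel()

# States visited by forward-modeled trajectories define where predictions
# matter; a KDE of those states serves as the weight function w(x).
traj_states = rng.normal(1.5, 0.3, size=500)
w = gaussian_kde(traj_states)

cand = np.linspace(0, 3, 200).reshape(-1, 1)
for _ in range(10):                          # active learning loop
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.5)).fit(X, y)
    _, sd = gp.predict(cand, return_std=True)
    score = sd**2 * w(cand.ravel())          # density-weighted variance criterion
    x_new = cand[np.argmax(score)]           # acquire where weighted variance peaks
    X = np.vstack([X, x_new])
    y = np.append(y, f(x_new)[0])
```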