Multinomial Restricted Unfolding

Abstract

For supervised classification, we propose to use restricted multidimensional unfolding in a multinomial logistic framework. Whereas previous research proposed similar models based on squared distances, we propose to use ordinary (i.e., unsquared) Euclidean distances. This change in functional form yields several interpretational advantages for the resulting biplot, a graphical representation of the classification model. First, the conditional probability of any class peaks at the location of the class point in the Euclidean space. Second, the biplot is interpreted in terms of distances to the class points, whereas in the squared-distance model the interpretation is in terms of distances to the decision boundaries. Third, the distance between two class points is an upper bound for the estimated log-odds of choosing one of these classes over the other. For our multinomial restricted unfolding, we develop and test a Majorization-Minimization algorithm that monotonically decreases the negative log-likelihood. With two empirical applications, we illustrate the advantages of the distance model and show how to apply multinomial restricted unfolding in practice, including model selection.
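
A minimal sketch of the distance model described above, in notation chosen here for illustration (the paper's own symbols, and any class-specific constants, may differ): writing \(\mathbf{u}_i\) for the low-dimensional representation of observation \(i\) (a linear restriction of its predictors) and \(\mathbf{z}_k\) for the point of class \(k\), the conditional class probabilities take the form

\[
P(y_i = k \mid \mathbf{u}_i) = \frac{\exp\{-d(\mathbf{u}_i, \mathbf{z}_k)\}}{\sum_{l} \exp\{-d(\mathbf{u}_i, \mathbf{z}_l)\}},
\qquad
d(\mathbf{u}, \mathbf{z}) = \|\mathbf{u} - \mathbf{z}\|_2 .
\]

Under this form the conditional probability of class \(k\) peaks at \(\mathbf{u}_i = \mathbf{z}_k\), and the log-odds bound follows from the reverse triangle inequality:

\[
\left|\log \frac{P(y_i = k \mid \mathbf{u}_i)}{P(y_i = l \mid \mathbf{u}_i)}\right|
= \left| d(\mathbf{u}_i, \mathbf{z}_l) - d(\mathbf{u}_i, \mathbf{z}_k) \right|
\le d(\mathbf{z}_k, \mathbf{z}_l).
\]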

Prediction of Forest Fire Risk for Artillery Military Training using Weighted Support Vector Machine for Imbalanced Data

Abstract

Since the 1953 truce, the Republic of Korea Army (ROKA) has regularly conducted artillery training, posing a risk of wildfires that threaten both the environment and the public perception of national defense. To assess this risk and aid decision-making within the ROKA, we built a predictive model of wildfires triggered by artillery training. To this end, we combined the ROKA dataset with a meteorological database. Given the infrequent occurrence of wildfires (imbalance ratio \(\approx \) 1:24 in our dataset), achieving balanced detection of wildfire occurrences and non-occurrences is challenging. Our approach combines a weighted support vector machine with Gaussian mixture-based oversampling, effectively penalizing misclassification of wildfires. Applied to our dataset, our method outperforms traditional algorithms (G-mean = 0.864, sensitivity = 0.956, specificity = 0.781), indicating balanced detection. This study not only helps reduce wildfires during artillery training but also provides a practical wildfire prediction method for similar climates worldwide.
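
As a rough sketch of the weighting idea named above (notation is ours, not the paper's): a class-weighted soft-margin SVM assigns different misclassification costs \(C_{+}\) and \(C_{-}\) to the minority (wildfire) and majority classes,

\[
\min_{\mathbf{w},\, b,\, \boldsymbol{\xi}} \ \tfrac{1}{2}\|\mathbf{w}\|^2
+ C_{+} \sum_{i:\, y_i = +1} \xi_i
+ C_{-} \sum_{i:\, y_i = -1} \xi_i
\quad \text{s.t.} \quad y_i\big(\mathbf{w}^{\top}\phi(\mathbf{x}_i) + b\big) \ge 1 - \xi_i,\ \ \xi_i \ge 0,
\]

with \(C_{+} > C_{-}\) so that missed wildfires are penalized more heavily. The reported G-mean is the geometric mean of sensitivity and specificity, consistent with the figures quoted: \(\sqrt{0.956 \times 0.781} \approx 0.864\).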

Inferential Tools for Assessing Dependence Across Response Categories in Multinomial Models with Discrete Random Effects

Abstract

We propose a discrete random effects multinomial regression model to deal with estimation and inference issues for categorical and hierarchical data. Random effects are assumed to follow a discrete distribution with an a priori unknown number of support points. For a response with K categories, the model identifies a latent structure at the highest level of grouping, where groups are clustered into subpopulations. The model does not assume independence across the random effects associated with different response categories, which is an improvement over the multinomial semi-parametric multilevel model previously proposed in the literature. Since the category-specific random effects arise from the same subjects, the independence assumption is seldom verified in real data. To evaluate the improvements provided by the proposed model, we reproduce simulation and case studies from the literature, highlighting the strength of the method in properly modelling the real data structure and the advantages of accounting for the dependence structure of the data.
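
To fix ideas, a schematic form of such a model (our notation, with baseline category \(K\)): for unit \(i\) in group \(j\),

\[
\log \frac{P(Y_{ij} = k \mid \mathbf{x}_{ij}, b_{jk})}{P(Y_{ij} = K \mid \mathbf{x}_{ij}, b_{jk})}
= \mathbf{x}_{ij}^{\top}\boldsymbol{\beta}_k + b_{jk},
\qquad k = 1, \dots, K-1,
\]

where the vector of category-specific group effects \(\mathbf{b}_j = (b_{j1}, \dots, b_{j,K-1})\) follows a discrete distribution on an a priori unknown number of support points \(\boldsymbol{\zeta}_1, \dots, \boldsymbol{\zeta}_M\) with masses \(\pi_1, \dots, \pi_M\). Because the support points are multivariate, the effects for different categories are not forced to be independent, and groups sharing the same support point form the subpopulations mentioned above.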

Binary Peacock Algorithm: A Novel Metaheuristic Approach for Feature Selection

Abstract

Binary metaheuristic algorithms are invaluable for solving binary optimization problems. This paper proposes a binary variant of the peacock algorithm (PA) for feature selection. PA, a recent metaheuristic algorithm, is built upon the lekking and mating behaviors of peacocks and peahens. While designing the binary variant, two major shortcomings of PA (lek formation and offspring generation) were identified and addressed. Eight binary variants of PA are also proposed and compared in terms of mean fitness to identify the best variant, called the binary peacock algorithm (bPA). To validate bPA’s performance, experiments are conducted on 34 benchmark datasets and the results are compared with eight well-known binary metaheuristic algorithms. The results show that bPA classifies 30 datasets with the highest accuracy and selects the fewest features in 32 datasets, achieving up to a 99.80% reduction in feature subset size on the dataset with the most features. bPA attained rank 1 in the Friedman rank test over all parameters.
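
The abstract does not spell out the binarization or the fitness function, but binary variants of continuous metaheuristics for feature selection are commonly built from a transfer function and a weighted fitness; a typical illustrative choice (not necessarily the one used in this paper) is

\[
T(v) = \frac{1}{1 + e^{-v}},
\qquad
x_d =
\begin{cases}
1 & \text{if } r < T(v_d),\\
0 & \text{otherwise},
\end{cases}
\qquad
\text{fitness} = \alpha \cdot \text{error} + (1 - \alpha)\,\frac{|S|}{|F|},
\]

where \(v_d\) is the continuous position in dimension \(d\), \(r \sim U(0,1)\), \(|S|\) is the number of selected features, \(|F|\) is the total number of features, and \(\alpha \in (0,1)\) trades classification error against subset size; the eight binary variants compared above may correspond to different choices of this kind.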

Supervised Classification of High-Dimensional Correlated Data: Application to Genomic Data

Abstract

This work addresses the problem of supervised classification for high-dimensional and highly correlated data using correlation blocks and supervised dimension reduction. We propose a method that combines block partitioning based on interval graph modeling with an extension of principal component analysis (PCA) that incorporates conditional class moment estimates in the low-dimensional projection. Block partitioning allows us to handle the high correlation of the data by grouping variables into blocks such that the correlation within a block is maximized and the correlation between variables in different blocks is minimized. The extended PCA allows us to perform supervised low-dimensional projection and clustering. Applied to gene expression data from 445 individuals divided into two groups (diseased and non-diseased) and 719,656 single nucleotide polymorphisms (SNPs), the method shows good clustering and prediction performance. SNPs are a type of genetic variation representing a difference in a single deoxyribonucleic acid (DNA) building block, namely a nucleotide. Previous research has shown that SNPs can be used to identify the correct population origin of an individual and can act in isolation or simultaneously to impact a phenotype. In this regard, studying the contribution of genetics to infectious disease phenotypes is crucial. The classical statistical models currently used in genome-wide association studies (GWAS) have shown their limitations in detecting genes of interest for complex diseases such as asthma or malaria. In this study, we first investigate a linkage disequilibrium (LD) block partition method based on interval graph modeling to handle the high correlation between SNPs. We then use supervised approaches, in particular the approach that extends PCA by incorporating conditional class moment estimates in the low-dimensional projection, to identify the SNPs that determine malaria episodes. Experimental results obtained on the Dielmo-Ndiop project dataset show that the linear discriminant analysis (LDA) approach achieves high accuracy in predicting malaria episodes.
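
As a hedged illustration of the two stages described above (notation is ours): SNPs are first partitioned into LD blocks \(B_1, \dots, B_Q\) chosen so that the pairwise correlation \(|\rho_{jj'}|\) is high for SNPs \(j, j'\) within the same block and low for SNPs in different blocks. Each block \(B_q\) is then reduced by a supervised extension of PCA; schematically, instead of diagonalizing only the marginal covariance of the block, the projection also exploits conditional class moments such as the group means \(\boldsymbol{\mu}_{q,g} = \mathbb{E}[\mathbf{x}_q \mid y = g]\), for example through a between-group matrix

\[
\mathbf{B}_q = \sum_{g} \pi_g\, (\boldsymbol{\mu}_{q,g} - \boldsymbol{\mu}_q)(\boldsymbol{\mu}_{q,g} - \boldsymbol{\mu}_q)^{\top},
\]

so that directions separating the diseased from the non-diseased group are favored. The block-level scores are then fed to a classifier, such as the LDA used in the application on the Dielmo-Ndiop data.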

Soft Label Guided Unsupervised Discriminative Sparse Subspace Feature Selection

Abstract

Feature selection and subspace learning are two primary methods for achieving data dimensionality reduction and enhancing discriminability. However, in unsupervised learning, data label information is unavailable to guide the dimensionality reduction process. To this end, we propose a soft label guided unsupervised discriminative sparse subspace feature selection (UDS\(^2\)FS) model in this paper, which offers two advantages over existing studies. On the one hand, UDS\(^2\)FS aims to find a discriminative subspace that simultaneously maximizes the between-class data scatter and minimizes the within-class scatter. On the other hand, UDS\(^2\)FS estimates the data label information in the learned subspace, which further serves as soft labels to guide the discriminative subspace learning process. Moreover, the \(\ell _{2,0}\)-norm is imposed to achieve row sparsity of the subspace projection matrix, which is parameter-free and more stable than the \(\ell _{2,1}\)-norm. Experimental studies evaluating the performance of UDS\(^2\)FS are performed from three aspects: a synthetic data set to check the iterative optimization process, several toy data sets to visualize the feature selection effect, and benchmark data sets to examine the clustering performance of UDS\(^2\)FS. The results show that UDS\(^2\)FS exhibits competitive performance in joint subspace learning and feature selection compared with related models.
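
A schematic form of the kind of objective described above (illustrative only; the paper's exact formulation may differ): with projection matrix \(\mathbf{W}\) and soft label matrix \(\mathbf{F}\) estimated in the learned subspace,

\[
\min_{\mathbf{W},\, \mathbf{F}} \ \operatorname{tr}\!\big(\mathbf{W}^{\top} \mathbf{S}_w(\mathbf{F})\, \mathbf{W}\big)
- \lambda\, \operatorname{tr}\!\big(\mathbf{W}^{\top} \mathbf{S}_b(\mathbf{F})\, \mathbf{W}\big)
\quad \text{s.t.} \quad \|\mathbf{W}\|_{2,0} \le s, \ \ \mathbf{W}^{\top}\mathbf{W} = \mathbf{I},
\]

where \(\mathbf{S}_w(\mathbf{F})\) and \(\mathbf{S}_b(\mathbf{F})\) are within- and between-class scatter matrices computed from the soft labels, \(\lambda\) is a trade-off weight introduced here for illustration, and \(\|\mathbf{W}\|_{2,0}\) counts the nonzero rows of \(\mathbf{W}\), so the selected features are exactly the rows retained. In this sense the \(\ell _{2,0}\) constraint is parameter-free: \(s\) directly fixes the number of selected features, whereas an \(\ell _{2,1}\) penalty requires tuning a regularization weight.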

Variable Selection for Hidden Markov Models with Continuous Variables and Missing Data

Abstract

We propose a variable selection method for multivariate hidden Markov models with continuous responses that are partially or completely missing at given time occasions. Through this procedure, we achieve dimensionality reduction by selecting the subset of the most informative responses for clustering individuals and simultaneously choosing the optimal number of clusters, which correspond to the latent states. The approach compares different model specifications in terms of the subset of responses assumed to depend on the latent states, and it relies on a greedy search algorithm driven by the Bayesian information criterion, viewed as an approximation of the Bayes factor. A suitable expectation-maximization algorithm is employed to obtain maximum likelihood estimates of the model parameters under the missing-at-random assumption. The proposal is illustrated via Monte Carlo simulation and an application in which development indicators collected over eighteen years are selected and countries are clustered into groups to evaluate their growth over time.
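
To make the model-comparison step concrete (standard relations, stated in our notation): for two candidate specifications \(M_1\) and \(M_2\) with maximized log-likelihoods \(\hat\ell_1, \hat\ell_2\), numbers of free parameters \(p_1, p_2\), and sample size \(n\),

\[
\mathrm{BIC}_m = -2\hat\ell_m + p_m \log n,
\qquad
2 \log B_{12} \approx \mathrm{BIC}_2 - \mathrm{BIC}_1,
\]

so differences in BIC approximate twice the log of the Bayes factor \(B_{12}\) in favor of \(M_1\). In the greedy search, such differences can be used to decide whether to add or drop a response from the subset assumed to depend on the latent states, and to compare numbers of latent states.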

Nonparametric Cognitive Diagnosis When Attributes Are Polytomous

Abstract

Cognitive diagnosis models provide diagnostic information on whether examinees have mastered the skills, called “attributes,” that characterize a given knowledge domain. Distinct proficiency classes are defined in terms of attribute mastery, and examinees are assigned to these classes based on their item responses. Attributes are typically treated as binary. However, polytomous attributes may yield higher precision in the assessment of examinees’ attribute mastery. Karelitz (2004) introduced the ordered-category attribute coding (OCAC) framework to accommodate polytomous attributes. Other approaches to handling polytomous attributes in cognitive diagnosis have been proposed in the literature; however, the heavy parameterization of these models often creates difficulties in model fitting. In this article, a nonparametric method for cognitive diagnosis with polytomous attributes, called the nonparametric polytomous attributes diagnostic classification (NPADC) method, is proposed; it relies on an adaptation of the OCAC framework. The new NPADC method can be used with various cognitive diagnosis models. It does not require large sample sizes, it is computationally efficient, and it is highly effective, as evidenced by the recovery rates of the proficiency classes observed in large-scale simulation studies. The NPADC method is also applied to a real-world data set.
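
The abstract does not state the classification rule, but nonparametric cognitive diagnosis methods of this kind typically assign an examinee to the proficiency class whose ideal response pattern is closest to the observed responses; schematically (our notation),

\[
\hat{\boldsymbol{\alpha}}_i = \arg\min_{\boldsymbol{\alpha} \in \mathcal{A}} \ d\big(\mathbf{y}_i, \boldsymbol{\eta}(\boldsymbol{\alpha})\big),
\]

where \(\mathbf{y}_i\) is examinee \(i\)'s item response vector, \(\mathcal{A}\) is the set of proficiency classes built from the ordered attribute levels of the OCAC framework, \(\boldsymbol{\eta}(\boldsymbol{\alpha})\) is the ideal response pattern implied by attribute pattern \(\boldsymbol{\alpha}\), and \(d\) is a distance such as the Hamming distance. How NPADC constructs \(\mathcal{A}\) and \(\boldsymbol{\eta}\) for polytomous attributes is the paper's contribution; the display is only meant to convey the general nonparametric idea.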