Using cross‐validation methods to select time series models: Promises and pitfalls

Abstract

Vector autoregressive (VAR) modelling is widely employed in psychology for time series analyses of dynamic processes. However, the typically short time series in psychological studies can lead to overfitting of VAR models, impairing their predictive ability in unseen samples. Cross-validation (CV) methods are commonly recommended for assessing the predictive ability of statistical models, but it is unclear how the performance of CV is affected by the characteristics of time series data and of the fitted models. In this simulation study, we examine the ability of two CV methods, namely 10-fold CV and blocked CV, to estimate the prediction errors of three time series models of increasing complexity (person-mean, AR, and VAR), and evaluate how their performance is affected by data characteristics. We then compare these CV methods with the traditional Akaike (AIC) and Bayesian (BIC) information criteria in terms of their accuracy in selecting the most predictive model. We find that CV methods tend to underestimate the prediction errors of simpler models but overestimate those of VAR models, particularly when the number of observations is small. Nonetheless, CV methods, especially blocked CV, generally outperform the AIC and BIC. We conclude with a discussion of the implications of these findings and provide practical guidelines.
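The distinction between the two CV schemes can be sketched as follows. This is a generic illustration in Python, not the authors' implementation; the function names and the choice of 100 time points and 5 folds are assumptions for the example:

```python
import numpy as np

def kfold_indices(n, k, seed=0):
    """Standard k-fold CV: observations are shuffled across folds,
    so training sets interleave with test points in time."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    return np.array_split(idx, k)

def blocked_indices(n, k):
    """Blocked CV: each fold is a contiguous block, preserving the
    temporal ordering of the held-out observations."""
    return np.array_split(np.arange(n), k)

# Example: 100 time points split into 5 folds
folds = blocked_indices(100, 5)
print([int(f[0]) for f in folds])  # block starts: [0, 20, 40, 60, 80]
```

Blocked CV keeps each test fold temporally contiguous, which is why it is often preferred for dependent observations.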

When and how to use set‐exploratory structural equation modelling to test structural models: A tutorial using the R package lavaan

Abstract

Exploratory structural equation modelling (ESEM) is an alternative to the well-known method of confirmatory factor analysis (CFA). ESEM is mainly used to assess the quality of measurement models of common factors, but it can be efficiently extended to test structural models. However, ESEM may not be the best option for some model specifications, especially when structural models are involved, because the full flexibility of ESEM can create technical difficulties in model estimation. Set-ESEM was therefore developed to strike a balance between full ESEM and CFA. In the present paper, we show examples where set-ESEM should be used rather than full ESEM. Rather than relying on a simulation study, we provide two applied examples using real data that are included in the OSF repository. Additionally, we provide the code needed to run set-ESEM in the free R package lavaan to make the paper practical. In the two empirical examples, set-ESEM structural models outperform their CFA-based counterparts in terms of goodness of fit and yield more realistic factor correlations and, hence, path coefficients. In several instances, effects that were non-significant (i.e., attenuated) in the CFA-based structural model become larger and significant in the set-ESEM structural model, suggesting that set-ESEM models may yield more accurate parameter estimates and, hence, lower Type II error rates.

Fast estimation of generalized linear latent variable models for performance and process data with ordinal, continuous, and count observed variables

Abstract

Different data types often occur together in psychological and educational measurement, for example in computer-based assessments that record performance and process data (e.g., response times and the number of actions). Modelling such data requires specific models for each data type and must accommodate complex dependencies between multiple variables. Generalized linear latent variable models are suitable for modelling mixed data simultaneously, but estimation can be computationally demanding. A fast solution is to use Laplace approximations, but existing implementations of joint modelling of mixed data types are limited to ordinal and continuous data. To address this limitation, we derive an efficient estimation method that uses first- or second-order Laplace approximations to simultaneously model ordinal, continuous, and count data. We illustrate the approach with an example and conduct simulations to evaluate the method in terms of estimation efficiency, convergence, and parameter recovery. The results suggest that the second-order Laplace approximation achieves a higher convergence rate and produces accurate yet fast parameter estimates compared with the first-order Laplace approximation, although the time cost increases with model complexity. Additionally, models that account for the dependence of variables from the same stimulus fit the empirical data substantially better than models that disregard this dependence.
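The idea behind a Laplace approximation can be illustrated in one dimension: the log-integrand is expanded to second order around its mode, turning the integral into a Gaussian one. Below is a minimal generic sketch, not the paper's multivariate first-/second-order implementation; the function names are assumptions for the example:

```python
import numpy as np

def laplace_approx(h, h2, x_hat):
    """First-order Laplace approximation to the integral of exp(h(x)),
    expanding h around its mode x_hat:
        int exp(h(x)) dx  ~=  exp(h(x_hat)) * sqrt(2*pi / -h''(x_hat))."""
    return np.exp(h(x_hat)) * np.sqrt(2 * np.pi / -h2(x_hat))

# Sanity check on a standard-normal kernel, whose integral is sqrt(2*pi);
# the approximation is exact because the log-integrand is quadratic.
approx = laplace_approx(h=lambda x: -0.5 * x**2, h2=lambda x: -1.0, x_hat=0.0)
print(approx)  # ~ 2.5066 = sqrt(2*pi)
```

In latent variable models the same expansion is applied to the marginal likelihood integral over the latent variables, which is what makes estimation fast.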

Identifiability and estimability of Bayesian linear and nonlinear crossed random effects models

Abstract

Crossed random effects models (CREMs) are particularly useful in longitudinal data applications because they allow researchers to account for the impact of dynamic group membership on individual outcomes. However, no research has determined what data conditions need to be met to sufficiently identify these models, especially the group effects, in a longitudinal context. This is a significant gap in the current literature as future applications to real data may need to consider these conditions to yield accurate and precise model parameter estimates, specifically for the group effects on individual outcomes. Furthermore, there are no existing CREMs that can model intrinsically nonlinear growth. The goals of this study are to develop a Bayesian piecewise CREM to model intrinsically nonlinear growth and evaluate what data conditions are necessary to empirically identify both intrinsically linear and nonlinear longitudinal CREMs. This study includes an applied example that utilizes the piecewise CREM with real data and three simulation studies to assess the data conditions necessary to estimate linear, quadratic, and piecewise CREMs. Results show that the number of repeated measurements collected on groups impacts the ability to recover the group effects. Additionally, functional form complexity impacts the data collection requirements for estimating longitudinal CREMs.

Statistical inference for agreement between multiple raters on a binary scale

Abstract

Agreement studies often involve more than two raters or repeated measurements. In the presence of two raters, the proportion of agreement and of positive agreement are simple and popular agreement measures for binary scales. These measures were generalized to agreement studies involving more than two raters with statistical inference procedures proposed on an empirical basis. We present two alternatives. The first is a Wald confidence interval using standard errors obtained by the delta method. The second involves Bayesian statistical inference not requiring any specific Bayesian software. These new procedures show better statistical behaviour than the confidence intervals initially proposed. In addition, we provide analytical formulas to determine the minimum number of persons needed for a given number of raters when planning an agreement study. All methods are implemented in the R package simpleagree and the Shiny app simpleagree.
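For intuition, a multi-rater proportion of agreement can be computed as the mean, over subjects, of the share of rater pairs that agree, with a Wald-type interval around it. The sketch below is only a simplified illustration: the function name is an assumption, the subjects-as-independent variance is naive, and the paper's delta-method and Bayesian procedures are not reproduced here:

```python
import numpy as np
from itertools import combinations

def agreement_wald_ci(ratings):
    """Mean pairwise proportion of agreement across raters (columns),
    with a naive 95% Wald interval treating subjects (rows) as i.i.d."""
    ratings = np.asarray(ratings)
    pairs = list(combinations(range(ratings.shape[1]), 2))
    # per-subject agreement = share of rater pairs that give the same rating
    per_subject = np.mean(
        [[ratings[i, a] == ratings[i, b] for a, b in pairs]
         for i in range(ratings.shape[0])], axis=1)
    p = per_subject.mean()
    se = per_subject.std(ddof=1) / np.sqrt(len(per_subject))
    z = 1.959963984540054  # 97.5% standard normal quantile
    return p, (p - z * se, p + z * se)

# 4 subjects rated by 3 raters on a binary scale
ratings = [[1, 1, 1], [0, 0, 1], [1, 1, 1], [0, 0, 0]]
p, ci = agreement_wald_ci(ratings)
print(round(p, 3))  # 0.833
```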

A cluster differences unfolding method for large datasets of preference ratings on an interval scale: Minimizing the mean squared centred residuals

Abstract

Clustering and spatial representation methods are often used in combination to analyse preference ratings when a large number of individuals and/or objects is involved. When analysed under an unfolding model, row-conditional linear transformations are usually most appropriate when the goal is to determine clusters of individuals with similar preferences. However, a significant problem with transformations that include both a slope and an intercept is the occurrence of degenerate solutions. In this paper, we propose a least squares unfolding method that performs clustering of individuals while simultaneously estimating the locations of cluster centres and objects in low-dimensional space. The method is based on minimising the mean squared centred residuals of the preference ratings with respect to the distances between cluster centres and object locations. At the same time, the distances are row-conditionally transformed with optimally estimated slope parameters. The method is computationally efficient for large datasets and does not suffer from the appearance of degenerate solutions. Its performance is analysed in an extensive Monte Carlo experiment. The method is also illustrated on a real dataset, and the results are compared with those obtained using a two-step clustering and unfolding procedure.

Correcting for measurement error under meta‐analysis of z‐transformed correlations

Abstract

This study concerns correction for measurement error in the meta-analysis of Fisher's z-transformed correlations. The disattenuation formula of Spearman (American Journal of Psychology, 15, 1904, 72) is used to correct the individual raw correlations in primary studies, and the corrected raw correlations are then used to obtain the corrected z-transformed correlations. What remains little studied, however, is how best to correct the within-study sampling error variances of corrected z-transformed correlations. We focused on three within-study sampling error variance estimators corrected for measurement error, proposed either in earlier studies or in the current study: (1) the formula given by Hedges (Test validity, Lawrence Erlbaum, 1988), which assumes a linear relationship between corrected and uncorrected z-transformed correlations (linear correction); (2) one derived by the first-order delta method based on the average of corrected z-transformed correlations (stabilized first-order correction); and (3) one derived by the second-order delta method based on the average of corrected z-transformed correlations (stabilized second-order correction). Via a simulation study, we compared the performance of these estimators and of the sampling error variance estimator uncorrected for measurement error in terms of the estimation and inference accuracy of the mean correlation, as well as the homogeneity test of effect sizes. In obtaining the corrected z-transformed correlations and within-study sampling error variances, coefficient alpha was used as a common reliability coefficient estimate. The results showed that, in terms of the estimated mean correlation, sampling error variances with linear correction, the stabilized first-order and second-order corrections, and no correction performed similarly in general. Furthermore, in terms of the homogeneity test, given a relatively large average sample size and normal true scores, the stabilized first-order and second-order corrections had type I error rates that were generally as well controlled as, or better controlled than, those of the other estimators. Overall, the stabilized first-order and second-order corrections are recommended when true scores are normal, reliabilities are acceptable, the number of items per psychological scale is relatively large, and the average sample size is relatively large.
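The correction step described above amounts to disattenuating each raw correlation and then applying Fisher's z transformation. A minimal sketch with illustrative values (the function name and the numbers are assumptions, not the study's simulation settings):

```python
import numpy as np

def corrected_z(r, rel_x, rel_y):
    """Spearman-disattenuate a raw correlation, then Fisher z-transform:
        r_c = r / sqrt(rel_x * rel_y),  z_c = atanh(r_c)."""
    r_c = r / np.sqrt(rel_x * rel_y)
    return r_c, np.arctanh(r_c)

# A raw correlation of .30 with reliabilities .80 and .75
r_c, z_c = corrected_z(r=0.30, rel_x=0.80, rel_y=0.75)
print(round(r_c, 4))  # 0.3873
```

The open question the abstract addresses is how the sampling error variance of `z_c` should be corrected, not the point correction itself.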

Mixtures of t factor analysers with censored responses and external covariates: An application to educational data from Peru

Abstract

Analysing data from educational tests allows governments to make decisions for improving the quality of life of individuals in a society. One of the key responsibilities of statisticians is to develop models that provide decision-makers with pertinent information about the latent process that educational tests seek to represent. Mixtures of t factor analysers (MtFA) have emerged as a powerful device for model-based clustering and classification of high-dimensional data containing one or several groups of observations with fatter tails or anomalous outliers. This paper considers an extension of MtFA for robust clustering of censored data, referred to as the MtFAC model, by incorporating external covariates. The enhanced flexibility of including covariates in MtFAC enables cluster-specific multivariate regression analysis of dependent variables with censored responses arising from upper and/or lower detection limits of experimental equipment. An alternating expectation conditional maximization (AECM) algorithm is developed for maximum likelihood estimation of the proposed model. Two simulation experiments are conducted to examine the effectiveness of the techniques presented. Furthermore, the proposed methodology is applied to Peruvian data from the 2007 Early Grade Reading Assessment, and the results obtained from the analysis provide new insights regarding the reading skills of Peruvian students.

The effective sample size in Bayesian information criterion for level‐specific fixed and random‐effect selection in a two‐level nested model

Abstract

Popular statistical software provides the Bayesian information criterion (BIC) for multi-level models or linear mixed models. However, it has been observed that the combination of statistical literature and software documentation has led to discrepancies in the formulas of the BIC and uncertainties as to the proper use of the BIC in selecting a multi-level model with respect to level-specific fixed and random effects. These discrepancies and uncertainties result from different specifications of sample size in the BIC's penalty term for multi-level models. In this study, we derive the BIC's penalty term for level-specific fixed- and random-effect selection in a two-level nested design. In this new version of the BIC, called BICE1, the penalty term is decomposed into two parts if the random-effect variance–covariance matrix has full rank: (a) a term involving the log of the average sample size per cluster and (b) the total number of parameters times the log of the total number of clusters. Furthermore, we derive another version of the BIC, called BICE2, for the presence of redundant random effects. Via a numerical demonstration, we show that the derived formulae, BICE1 and BICE2, match their empirical values, and, through a simulation study, that BICE (E indicating either E1 or E2) is the best global selection criterion, performing at least as well as BIC with the total sample size and BIC with the number of clusters across various multi-level conditions. In addition, the use of BICE1 is illustrated with a textbook example dataset.
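The ambiguity described above comes down to which effective sample size enters the log term of the penalty. A generic sketch with hypothetical numbers (this shows the two conventional choices only, not the BICE1/BICE2 formulae, which decompose the penalty further):

```python
import numpy as np

def bic(loglik, n_params, n_eff):
    """Generic BIC with an explicit effective sample size in the penalty:
        BIC = -2 * loglik + n_params * log(n_eff)."""
    return -2 * loglik + n_params * np.log(n_eff)

# Hypothetical two-level fit: 50 clusters of 20 observations each
loglik, p = -1234.5, 6
print(bic(loglik, p, n_eff=1000))  # penalty uses the total sample size
print(bic(loglik, p, n_eff=50))    # penalty uses the number of clusters
```

The same fit yields two different BIC values, which is exactly why the choice of effective sample size matters for model selection.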

On generating plausible values for multilevel modelling with large‐scale‐assessment data

Abstract

Large-scale assessments (LSAs) routinely employ latent regressions to generate plausible values (PVs) for unbiased estimation of the relationship between examinees' background variables and performance. To handle the clustering effect common in LSA data, multilevel modelling is a popular choice. However, most LSAs use single-level conditioning methods, resulting in a mismatch between the imputation model and the multilevel analytic model. While some LSAs have implemented special techniques in single-level latent regressions to support random-intercept modelling, these techniques are not expected to support random-slope models. To address this gap, this study proposed two new single-level methods to support random-slope estimation. The existing and proposed methods were compared to the theoretically unbiased multilevel latent regression method in terms of their ability to support multilevel models. The findings indicate that the two existing single-level methods can support random-intercept-only models. The multilevel latent regression method provided mostly adequate estimates but was limited by computational burden and did not have the best performance across all conditions. One of our proposed single-level methods presented an efficient alternative to multilevel latent regression and was able to recover acceptable estimates for all parameters. We provide recommendations for situations where each method can be applied, with some caveats.