stat.AP – Recent Articles

Quantitative knowledge retrieval from large language models

February 13, 2024 cs.CL updates on arXiv.org

Large language models (LLMs) have been extensively studied for their abilities to generate convincing natural language sequences; however, their utility for quantitative information retrieval is less well understood. In this paper, we explore the feasibility of LLMs as a mechanism for quantitative knowledge retrieval to aid data analysis tasks such as elicitation of prior distributions for Bayesian models and imputation of missing data. We present a prompt engineering framework, treating an LLM as an interface to a latent space of scientific literature, comparing responses in different contexts and domains against more established approaches. Implications and challenges of using LLMs as 'experts' are discussed.

Using Mathlink Cubes to Introduce Data Wrangling with Examples in R

February 13, 2024 cs.HC updates on arXiv.org

This paper explores an innovative approach to teaching data wrangling skills to students through hands-on activities before transitioning to coding. Data wrangling, a critical aspect of data analysis, involves cleaning, transforming, and restructuring data. We introduce the use of a physical tool, mathlink cubes, to facilitate a tangible understanding of data sets. This approach helps students grasp the concepts of data wrangling before implementing them in coding languages such as R. We detail a classroom activity that includes hands-on tasks paralleling common data wrangling processes such as filtering, selecting, and mutating, followed by their coding equivalents using R's `dplyr` package.
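The three wrangling steps named above can be sketched in a few lines. This is a plain-Python analogue for illustration only, not the paper's R code and not `dplyr` itself; the cube records and the `pairs` column are hypothetical.

```python
# Hypothetical data: each record stands for one stack of cubes.
cubes = [
    {"color": "red", "count": 4, "student": "A"},
    {"color": "blue", "count": 2, "student": "A"},
    {"color": "red", "count": 1, "student": "B"},
]

# filter: keep only the rows that satisfy a condition (compare dplyr's filter()).
red_only = [row for row in cubes if row["color"] == "red"]

# select: keep only some of the columns (compare dplyr's select()).
color_and_count = [{"color": r["color"], "count": r["count"]} for r in cubes]

# mutate: add a column computed from existing ones (compare dplyr's mutate()).
with_pairs = [{**r, "pairs": r["count"] // 2} for r in cubes]

print(red_only)
print(color_and_count)
print(with_pairs)
```

Each step returns a new list and leaves `cubes` unchanged, which mirrors how the corresponding `dplyr` verbs return a new data frame rather than modifying their input.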

A Rational Analysis of the Speech-to-Song Illusion

February 13, 2024 cs.CL updates on arXiv.org

The speech-to-song illusion is a robust psychological phenomenon whereby a spoken sentence sounds increasingly musical as it is repeated. Despite decades of research, a complete formal account of this transformation is still lacking, and some of its nuanced characteristics, notably that certain phrases appear to transform while others do not, are not well understood. Here we provide a formal account of this phenomenon by recasting it as a statistical inference in which a rational agent attempts to decide whether a sequence of utterances is more likely to have been produced in song or in speech. Using this approach and analyzing song and speech corpora, we further introduce a novel prose-to-lyrics illusion that is purely text-based. In this illusion, simply duplicating written sentences makes them appear more like song lyrics. We provide robust evidence for this new illusion in both human participants and large language models.

Limits of Large Language Models in Debating Humans

February 12, 2024 cs.HC updates on arXiv.org

Large Language Models (LLMs) have shown remarkable promise in their ability to interact proficiently with humans. Consequently, their potential use as artificial confederates and surrogates in sociological experiments involving conversation is an exciting prospect. But how viable is this idea? This paper endeavors to test the limits of current-day LLMs with a pre-registered study integrating real people with LLM agents acting as people. The study focuses on debate-based opinion consensus formation in three environments: humans only, agents and humans, and agents only. Our goal is to understand how LLM agents influence humans, and how capable they are of debating like humans. We find that LLMs can blend in and facilitate human productivity but are less convincing in debate, with their behavior ultimately deviating from that of humans. We elucidate these primary failings and anticipate that LLMs must evolve further before being viable debaters.

CFTM: Continuous time fractional topic model

February 8, 2024 cs.CL updates on arXiv.org

In this paper, we propose the Continuous Time Fractional Topic Model (cFTM), a new method for dynamic topic modeling. This approach incorporates fractional Brownian motion (fBm) to effectively identify positive or negative correlations in topic and word distributions over time, revealing long-term dependency or roughness. Our theoretical analysis shows that the cFTM can capture this long-term dependency or roughness in both topic and word distributions, mirroring the main characteristics of fBm. Moreover, we prove that the parameter estimation process for the cFTM is on par with that of traditional topic models such as LDA. To demonstrate the cFTM's properties, we conduct an empirical study using economic news articles. The results from these tests support the model's ability to identify and track long-term dependency or roughness in topics over time.
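The link between fBm and positive or negative correlation can be made concrete through its covariance function. The sketch below is not the cFTM; it only evaluates the standard fBm covariance, Cov(B_H(s), B_H(t)) = (|s|^(2H) + |t|^(2H) - |t - s|^(2H)) / 2, and the correlation it implies between two adjacent unit increments, which works out to 2^(2H-1) - 1. The function names are hypothetical.

```python
def fbm_cov(s: float, t: float, hurst: float) -> float:
    """Covariance of standard fractional Brownian motion at times s and t."""
    h2 = 2.0 * hurst
    return 0.5 * (abs(s) ** h2 + abs(t) ** h2 - abs(t - s) ** h2)


def adjacent_increment_corr(hurst: float) -> float:
    """Correlation between the increments B(1) - B(0) and B(2) - B(1)."""
    # Cov(B(1), B(2) - B(1)) = Cov(B(1), B(2)) - Var(B(1)).
    # Both unit increments have variance 1, so this covariance is the correlation.
    return fbm_cov(1.0, 2.0, hurst) - fbm_cov(1.0, 1.0, hurst)


for hurst in (0.3, 0.5, 0.7):
    print(f"H={hurst}: corr={adjacent_increment_corr(hurst):+.3f}")
```

A Hurst index above 1/2 gives positively correlated increments (long-term dependency), a value below 1/2 gives negatively correlated increments (roughness), and H = 1/2 recovers ordinary Brownian motion with uncorrelated increments.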

The Use of a Large Language Model for Cyberbullying Detection

February 7, 2024 cs.CL updates on arXiv.org

The dominance of social media has given perpetrators more channels for bullying. Unfortunately, cyberbullying (CB) is the most prevalent phenomenon in today's cyber world, and is a severe threat to the mental and physical health of citizens. This creates a need for a robust system that keeps bullying content off online forums, blogs, and social media platforms in order to manage its impact on society. Several machine learning (ML) algorithms have been proposed for this purpose. However, their performance is not consistent due to high class imbalance and generalisation issues. In recent years, large language models (LLMs) like BERT and RoBERTa have achieved state-of-the-art (SOTA) results in several natural language processing (NLP) tasks. Unfortunately, LLMs have not been applied extensively to CB detection. In our paper, we explored the use of these models for CB detection. We have prepared a new dataset (D2) from existing studies (Formspring and Twitter). Our experimental results for datasets D1 and D2 showed that RoBERTa outperformed other models.

Less than one percent of words would be affected by gender-inclusive language in German press texts

February 7, 2024 cs.CL updates on arXiv.org

Research on gender and language is closely tied to social debates on gender equality and non-discriminatory language use. Psycholinguistic scholars have made significant contributions in this field. However, corpus-based studies that investigate these matters within the context of language use are still rare. In our study, we address the question of how much textual material would actually have to be changed if non-gender-inclusive texts were rewritten to be gender-inclusive. This quantitative measure is an important empirical insight, as a recurring argument against the use of gender-inclusive German is that it supposedly makes written texts too long and complicated. It is also argued that gender-inclusive language has negative effects on language learners. However, such effects are only likely if gender-inclusive texts are very different from those that are not gender-inclusive. In our corpus-linguistic study, we manually annotated German press texts to identify the parts that would have to be changed. Our results show that, on average, less than 1% of all tokens would be affected by gender-inclusive language. This small proportion calls into question whether gender-inclusive German presents a substantial barrier to understanding and learning the language, particularly when we take into account the potential complexities of interpreting masculine generics.

InVA: Integrative Variational Autoencoder for Harmonization of Multi-modal Neuroimaging Data

February 6, 2024 cs.NE updates on arXiv.org

There is significant interest in exploring non-linear associations among multiple images derived from diverse imaging modalities. While there is a growing literature on image-on-image regression to delineate predictive inference of an image based on multiple images, existing approaches have limitations in efficiently borrowing information between multiple imaging modalities in the prediction of an image. Building on the literature of Variational Autoencoders (VAEs), this article proposes a novel approach, referred to as the Integrative Variational Autoencoder (InVA) method, which borrows information from multiple images obtained from different sources to draw predictive inference of an image. The proposed approach captures complex non-linear associations between the outcome image and input images, while allowing rapid computation. Numerical results demonstrate substantial advantages of InVA over VAEs, which typically do not allow borrowing information between input images. The proposed framework offers highly accurate predictive inferences for costly positron emission tomography (PET) from multiple measures of cortical structure in human brain scans readily available from magnetic resonance imaging (MRI).

CFTM: Continuous time fractional topic model

February 6, 2024 cs.CL updates on arXiv.org

Extreme value statistics of nerve transmission delay

February 2, 2024 q-bio.NC updates on arXiv.org

Nerve transmission delay is an important topic in neuroscience. Spike signals fired or received at the dendrites of a neuron travel from the axon to the presynaptic cell. The spike signal triggers a chemical reaction at the synapse, wherein a presynaptic cell transfers neurotransmitters to the postsynaptic cell, which regenerates electrical signals by a chemical reaction process through ion channels and transmits them to neighboring neurons. Describing this complex physiological reaction process as a stochastic process, this study aimed to show that the distribution of the maximum time interval of spike signals follows extreme order statistics. By considering the statistical variance in the time constant of the Leaky Integrate-and-Fire model, which is a deterministic time evolution model of spike signals, we introduced randomness in the time interval of spike signals. When the time constant follows an exponential distribution, the time interval of the spike signal also follows an exponential distribution. In this case, our theory and simulations confirmed that the histogram of the maximum time interval follows the Gumbel distribution, which is one of the three types of extreme value statistics. We also confirmed that the histogram of the maximum time interval follows a Fréchet distribution when the time interval of the spike signal follows a Pareto distribution. These findings confirm that nerve transmission delay can be described using extreme value statistics and could, therefore, be used as a new indicator for transmission delay.
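The exponential-to-Gumbel step can be checked numerically with a small simulation. This is a minimal sketch and not the authors' Leaky Integrate-and-Fire simulation: it assumes independent intervals drawn from an exponential distribution with rate 1, and the function names, sample sizes, and seed are all hypothetical. For such intervals, the maximum of n draws minus ln(n) approaches the standard Gumbel distribution, whose CDF is exp(-exp(-x)).

```python
import math
import random


def centered_exponential_maxima(n: int, trials: int, seed: int = 0) -> list[float]:
    """Return `trials` values of max(X_1..X_n) - ln(n) for X_i ~ Exp(1)."""
    rng = random.Random(seed)
    shift = math.log(n)
    return [max(rng.expovariate(1.0) for _ in range(n)) - shift for _ in range(trials)]


def gumbel_cdf(x: float) -> float:
    """CDF of the standard Gumbel distribution."""
    return math.exp(-math.exp(-x))


maxima = centered_exponential_maxima(n=200, trials=5000)
for x in (-1.0, 0.0, 1.0, 2.0):
    # Fraction of simulated maxima at or below x, compared with the Gumbel CDF.
    empirical = sum(m <= x for m in maxima) / len(maxima)
    print(f"x={x:+.1f}  empirical={empirical:.3f}  gumbel={gumbel_cdf(x):.3f}")
```

The empirical and theoretical columns should agree up to sampling noise. Replacing the exponential draws with heavy-tailed Pareto draws would require a different rescaling of the maxima and leads to the Fréchet case mentioned above.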