A multi-modal approach for identifying schizophrenia using cross-modal attention

This study focuses on how different modalities of human communication can be used to distinguish between healthy controls and subjects with schizophrenia who exhibit strong positive symptoms. We developed a multi-modal schizophrenia classification system using audio, video, and text. Facial action units and vocal tract variables were extracted as low-level features from video and audio respectively, which were then used to compute high-level coordination features that served as the inputs to the audio and video modalities. Context-independent text embeddings extracted from transcriptions of speech were used as the input for the text modality. The multi-modal system is developed by fusing a segment-to-session-level classifier for video and audio modalities with a text model based on a Hierarchical Attention Network (HAN) with cross-modal attention. The proposed multi-modal system outperforms the previous state-of-the-art multi-modal system by 8.53% in the weighted average F1 score.
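The cross-modal attention step mentioned above can be illustrated with a minimal sketch. It assumes plain scaled dot-product attention in which queries come from one modality (for example, text) and keys and values come from another (for example, audio). The function names, dimensions, and single-head form are illustrative assumptions, not the architecture used in the study.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_modal_attention(queries, keys, values):
    """Attend from one modality (queries) over another (keys, values).

    queries: list of query vectors, e.g. text embeddings (hypothetical).
    keys, values: lists of vectors from the other modality (hypothetical).
    Returns one attended vector per query.
    """
    dim = len(keys[0])
    out_dim = len(values[0])
    outputs = []
    for q in queries:
        # Scaled dot-product similarity between this query and every key.
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys
        ]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(out_dim)]
        )
    return outputs
```

A query that aligns strongly with one key pulls the output toward that key's value vector, which is how one modality can select relevant segments of another.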

Real-time Extended Reality Video Transmission Optimization Based on Frame-priority Scheduling

Extended Reality (XR) is an important service in the 5G network and in future 6G networks. In contrast to traditional video-on-demand services, real-time XR video is transmitted frame by frame, requiring low latency and being highly sensitive to network fluctuations. In this paper, we model the quality of experience (QoE) for real-time XR video transmission on a frame-by-frame basis. Based on the proposed QoE model, we formulate an optimization problem that maximizes QoE with constraints on wireless resources and long-term energy consumption. We utilize Lyapunov optimization to transform the original problem into a single-frame optimization problem and then allocate wireless subchannels. We propose an adaptive XR video bitrate algorithm that employs a Long Short-Term Memory (LSTM) based Deep Q-Network (DQN) algorithm for video bitrate selection. Numerical results show that our proposed algorithm outperforms the baseline algorithms, with average QoE improvements of 5.9% to 80.0%.
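The way Lyapunov optimization turns a long-term energy constraint into a per-frame decision can be sketched with the standard drift-plus-penalty rule and a virtual queue. The candidate actions, the trade-off weight `v`, and the energy budget below are hypothetical placeholders. The paper's QoE model, subchannel allocation, and LSTM-based DQN are not reproduced here.

```python
def choose_action(actions, queue, v):
    """Pick the (qoe, energy) pair with the best drift-plus-penalty score.

    The score rewards QoE (weighted by v) and penalizes energy in
    proportion to the current virtual queue backlog.
    """
    return max(actions, key=lambda a: v * a[0] - queue * a[1])


def update_queue(queue, energy, budget):
    """Virtual queue: grows when a frame spends more than the budget."""
    return max(queue + energy - budget, 0.0)


def run(actions, budget, v, frames):
    """Simulate per-frame decisions; return (avg QoE, avg energy, final queue)."""
    queue = 0.0
    total_qoe = 0.0
    total_energy = 0.0
    for _ in range(frames):
        qoe, energy = choose_action(actions, queue, v)
        queue = update_queue(queue, energy, budget)
        total_qoe += qoe
        total_energy += energy
    return total_qoe / frames, total_energy / frames, queue


# Hypothetical example: a low-bitrate and a high-bitrate option per frame.
avg_qoe, avg_energy, backlog = run(
    actions=[(1.0, 1.0), (2.0, 3.0)], budget=2.0, v=1.0, frames=1000
)
```

In this toy setting the rule alternates between the two options, so the average energy settles at the budget while QoE stays above what the low-bitrate option alone would give.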

convSeq: Fast and Scalable Method for Detecting Patterns in Spike Data

Spontaneous neural activity, crucial in memory, learning, and spatial navigation, often manifests itself as repetitive spatiotemporal patterns. Despite their importance, analyzing these patterns in large neural recordings remains challenging due to a lack of efficient and scalable detection methods. Addressing this gap, we introduce convSeq, an unsupervised method that employs backpropagation for optimizing spatiotemporal filters that effectively identify these neural patterns. Our method's performance is validated on various synthetic data and real neural recordings, revealing spike sequences with unprecedented scalability and efficiency. Significantly surpassing existing methods in speed, convSeq sets a new standard for analyzing spontaneous neural activity, potentially advancing our understanding of information processing in neural circuits.
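As a rough illustration of how a spatiotemporal filter can flag a repeated spike sequence, the sketch below slides a fixed filter along the time axis of a binary spike raster and reports the response at each offset. It shows only the detection step with a hand-set filter. In convSeq the filters are optimized with backpropagation, which is not shown, and the raster, filter, and function name here are made up for the example.

```python
def filter_response(raster, kernel):
    """Slide a (neurons x width) kernel over a (neurons x T) spike raster.

    Returns one response value per valid time offset.
    """
    n_neurons = len(raster)
    width = len(kernel[0])
    n_steps = len(raster[0]) - width + 1
    response = []
    for t in range(n_steps):
        total = 0.0
        for i in range(n_neurons):
            for k in range(width):
                total += raster[i][t + k] * kernel[i][k]
        response.append(total)
    return response


# Toy raster: neurons 0, 1, 2 fire in order at t = 4, 5, 6,
# plus one stray spike from neuron 0 at t = 1.
raster = [
    [0, 1, 0, 0, 1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
]
# Filter tuned to the ordered sequence 0 -> 1 -> 2 with one-step lags.
kernel = [
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
]
response = filter_response(raster, kernel)
onset = response.index(max(response))  # time offset of the best match
```

The response peaks where the full sequence occurs, while the stray spike produces only a weak partial match.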

Interpretation of Intracardiac Electrograms Through Textual Representations

Understanding the irregular electrical activity of atrial fibrillation (AFib) has been a key challenge in electrocardiography. For serious cases of AFib, catheter ablations are performed to collect intracardiac electrograms (EGMs). EGMs offer intricately detailed and localized electrical activity of the heart and are an ideal modality for interpretable cardiac studies. Recent advancements in artificial intelligence (AI) have allowed some works to utilize deep learning frameworks to interpret EGMs during AFib. Additionally, language models (LMs) have shown exceptional performance in generalizing to unseen domains, especially in healthcare. In this study, we are the first to leverage pretrained LMs for finetuning of EGM interpolation and AFib classification via masked language modeling. We formulate the EGM as a textual sequence and present competitive performance on AFib classification compared against other representations. Lastly, we provide a comprehensive interpretability study to provide a multi-perspective intuition of the model's behavior, which could greatly benefit clinical use.
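One way to picture the formulation of a signal as a textual sequence for masked language modeling is sketched below. The uniform amplitude binning, the `amp_<k>` token names, the `[MASK]` placeholder, and the masking rate are all assumptions made for illustration. The study's actual tokenization and masking scheme may differ.

```python
import random


def signal_to_tokens(signal, n_bins=8, lo=-1.0, hi=1.0):
    """Quantize each sample into one of n_bins amplitude tokens (assumed scheme)."""
    tokens = []
    for x in signal:
        clipped = min(max(x, lo), hi)
        idx = min(int((clipped - lo) / (hi - lo) * n_bins), n_bins - 1)
        tokens.append(f"amp_{idx}")
    return tokens


def mask_tokens(tokens, mask_rate=0.25, seed=0):
    """Replace a random subset of tokens with [MASK]; return inputs and targets."""
    rng = random.Random(seed)
    n_mask = max(1, round(mask_rate * len(tokens)))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    masked = ["[MASK]" if i in positions else t for i, t in enumerate(tokens)]
    labels = {i: tokens[i] for i in positions}
    return masked, labels


# Hypothetical normalized samples, including two out-of-range values.
tokens = signal_to_tokens([-1.0, -0.5, 0.0, 0.5, 1.0, 2.0, -3.0, 0.25])
masked, labels = mask_tokens(tokens)
```

A language model trained to recover the entries of `labels` from `masked` is performing interpolation over the signal, and the same token sequence can feed a classification head.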

Examining the Influence of Digital Phantom Models in Virtual Imaging Trials for Tomographic Breast Imaging

Purpose: Digital phantoms are one of the key components of virtual imaging trials (VITs) that aim to assess and optimize new medical imaging systems and algorithms. However, these phantoms vary in their voxel resolution, appearance, and structural details. This study aims to examine whether and how variations between digital phantoms influence system optimization with digital breast tomosynthesis (DBT) as a chosen modality. Methods: We selected widely used and open-access digital breast phantoms generated with different methods. For each phantom type, we created an ensemble of DBT images to test acquisition strategies. Human observer localization ROC (LROC) analysis was used to assess observer performance for each case. Noise power spectrum (NPS) was estimated to compare the phantom structural components. Further, we computed several gaze metrics to quantify the gaze pattern when viewing images generated from different phantom types. Results: Our LROC results show that the arc samplings for peak performance were approximately 2.5 degrees and 6 degrees in Bakic and XCAT breast phantoms, respectively, for 3-mm lesion detection tasks. These results indicate that system optimization outcomes from VITs can vary with phantom types and structural frequency components. Additionally, a significant correlation (p = 0.01) between gaze metrics and diagnostic performance suggests that gaze analysis can be used to understand and evaluate task difficulty in VITs.

Gaussian Adaptive Attention is All You Need: Robust Contextual Representations Across Multiple Modalities

We propose the Multi-Head Gaussian Adaptive Attention Mechanism (GAAM), a novel probabilistic attention framework, and the Gaussian Adaptive Transformer (GAT), designed to enhance information aggregation across multiple modalities, including Speech, Text and Vision. GAAM integrates learnable mean and variance into its attention mechanism, implemented in a Multi-Headed framework, enabling it to collectively model any Probability Distribution for dynamic recalibration of feature significance. This method demonstrates significant improvements, especially with highly non-stationary data, surpassing the state-of-the-art attention techniques in model performance (up to approximately +20% in accuracy) by identifying key elements within the feature space. GAAM's compatibility with dot-product-based attention models and relatively low number of parameters showcase its adaptability and potential to boost existing attention frameworks. Empirically, GAAM exhibits superior adaptability and efficacy across a diverse range of tasks, including emotion recognition in speech, image classification, and text classification, thereby establishing its robustness and versatility in handling multi-modal data. Furthermore, we introduce the Importance Factor (IF), a new learning-based metric that enhances the explainability of models trained with GAAM-based methods. Overall, GAAM represents an advancement toward the development of better-performing and more explainable attention models across multiple modalities.
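The idea of attention weights driven by a learnable mean and variance can be pictured with the following single-head, one-dimensional sketch. The abstract does not give the exact formulation, so everything here is an assumption: `mean_offset` and `scale` stand in for the learnable parameters, the weights follow an unnormalized Gaussian density around the shifted mean, and they are normalized to sum to one.

```python
import math


def gaussian_attention(values, mean_offset=0.0, scale=1.0, eps=1e-8):
    """Weight features by a Gaussian centered on a shifted feature mean.

    mean_offset and scale are hypothetical stand-ins for learned parameters.
    Returns (weights, reweighted values).
    """
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    mu = mean + mean_offset          # shifted center of attention
    sigma2 = scale ** 2 * var + eps  # scaled variance, kept positive
    weights = [math.exp(-((v - mu) ** 2) / (2.0 * sigma2)) for v in values]
    total = sum(weights)
    weights = [w / total for w in weights]
    return weights, [w * v for w, v in zip(weights, values)]


features = [0.0, 1.0, 2.0, 10.0]
default_weights, _ = gaussian_attention(features)
shifted_weights, _ = gaussian_attention(features, mean_offset=6.75)
```

With the default parameters the largest weight falls on the feature closest to the mean. Shifting the center moves the emphasis to the outlying feature, which is the kind of recalibration a learned mean would provide.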

ENN: A Neural Network with DCT Adaptive Activation Functions

The expressiveness of neural networks highly depends on the nature of the activation function, although these are usually assumed to be predefined and fixed during the training stage. From a signal processing perspective, in this paper we present the Expressive Neural Network (ENN), a novel model in which the non-linear activation functions are modeled using the Discrete Cosine Transform (DCT) and adapted using backpropagation during training. This parametrization keeps the number of trainable parameters low, is appropriate for gradient-based schemes, and adapts to different learning tasks. This is the first non-linear model for activation functions that relies on a signal processing perspective, providing high flexibility and expressiveness to the network. We contribute insights into the explainability of the network at convergence by recovering the concept of bump, that is, the response of each activation function in the output space. Finally, through exhaustive experiments we show that the model can adapt to classification and regression tasks. ENN outperforms state-of-the-art benchmarks, with an accuracy gap of more than 40% in some scenarios.
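A minimal sketch of an activation function parameterized by cosine coefficients and adapted by gradient descent is given below. It assumes inputs in [-1, 1], a small cosine basis, and a mean squared error objective. The basis definition, learning rate, and function names are illustrative choices, not the parametrization used in ENN.

```python
import math


def basis(x, k):
    """k-th cosine basis function for x in [-1, 1] (assumed form)."""
    return math.cos(math.pi * k * (x + 1.0) / 2.0)


def activation(x, coeffs):
    """Activation as a weighted sum of cosine basis functions."""
    return sum(c * basis(x, k) for k, c in enumerate(coeffs))


def fit(coeffs, samples, lr=0.1, epochs=500):
    """Adapt the coefficients by gradient descent on mean squared error."""
    coeffs = list(coeffs)
    for _ in range(epochs):
        grads = [0.0] * len(coeffs)
        for x, y in samples:
            err = activation(x, coeffs) - y
            for k in range(len(coeffs)):
                # d(err^2)/d(c_k) = 2 * err * basis_k(x)
                grads[k] += 2.0 * err * basis(x, k) / len(samples)
        coeffs = [c - lr * g for c, g in zip(coeffs, grads)]
    return coeffs


# Hypothetical target shape generated from known coefficients.
target = [0.5, -1.0, 0.25]
grid = [-1.0 + 2.0 * i / 20 for i in range(21)]
samples = [(x, activation(x, target)) for x in grid]
learned = fit([0.0, 0.0, 0.0], samples)
```

Because the output is linear in the coefficients, the gradient with respect to each coefficient is just the corresponding basis value times the error, which keeps the parameter count low and the update simple.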