GazeGPT: Augmenting Human Capabilities using Gaze-contingent Contextual AI for Smart Eyewear

Multimodal large language models (LMMs) excel in world knowledge and problem-solving abilities. Through the use of a world-facing camera and contextual AI, emerging smart accessories aim to provide a seamless interface between humans and LMMs. Yet, these wearable computing systems lack an understanding of the user's attention. We introduce GazeGPT as a new user interaction paradigm for contextual AI. GazeGPT uses eye tracking to help the LMM understand which object in the world-facing camera view a user is paying attention to. Using extensive user evaluations, we show that this gaze-contingent mechanism is faster and more accurate than alternative pointing mechanisms; that it augments human capabilities by significantly improving users' accuracy in a dog-breed classification task; and that it is consistently ranked as more natural than head- or body-driven selection mechanisms for contextual AI. Moreover, we prototype a variety of application scenarios suggesting that GazeGPT could be of significant value to users as part of future AI-driven personal assistants.
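
As a rough illustration of the interaction pattern (not the authors' implementation; the crop size and the commented-out LMM query function are assumptions), gaze coordinates can be used to crop the attended region of the world-facing camera frame before querying a multimodal model:

```python
# Minimal sketch of a gaze-contingent crop for a contextual-AI query.
# Crop size, file name, and the query function are illustrative.
from PIL import Image

def gaze_crop(frame: Image.Image, gaze_xy: tuple[int, int],
              half_size: int = 224) -> Image.Image:
    """Crop a square patch centered on the gaze point, clamped to the frame."""
    x, y = gaze_xy
    left = max(0, min(x - half_size, frame.width - 2 * half_size))
    top = max(0, min(y - half_size, frame.height - 2 * half_size))
    return frame.crop((left, top, left + 2 * half_size, top + 2 * half_size))

frame = Image.open("world_camera_frame.jpg")   # placeholder frame
patch = gaze_crop(frame, gaze_xy=(640, 360))
# answer = query_lmm(image=patch, prompt="What dog breed is this?")  # hypothetical
```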

Do Object Detection Localization Errors Affect Human Performance and Trust?

Bounding boxes are often used to communicate automatic object detection results to humans, aiding them in a multitude of tasks. We investigate the relationship between bounding box localization errors and human task performance. We use observer performance studies on a visual multi-object counting task to measure both human trust and performance with different levels of bounding box accuracy. The results show that localization errors have no significant impact on human accuracy or trust in the system. Recall and precision errors, however, impact both human performance and trust, suggesting that optimizing algorithms based on the F1 score is more beneficial in human-computer tasks. Lastly, the paper proposes replacing bounding boxes with center dots in multi-object counting tasks, showing improved performance and better resilience to localization inaccuracy.
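
To make the metric trade-off concrete, here is a small worked example (with invented counts) of the precision, recall, and F1 quantities the study relates to human performance:

```python
# Worked example of the detection metrics discussed above for one
# multi-object counting scene. The counts are invented for illustration.
tp = 18  # detections matched to a true object
fp = 2   # spurious detections (precision errors)
fn = 4   # missed objects (recall errors)

precision = tp / (tp + fp)                            # 0.900
recall = tp / (tp + fn)                               # ~0.818
f1 = 2 * precision * recall / (precision + recall)    # ~0.857
print(f"precision={precision:.3f} recall={recall:.3f} F1={f1:.3f}")
```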

Sigma-lognormal modeling of speech

Human movement studies and analyses have been fundamental in many scientific domains, ranging from neuroscience to education, pattern recognition to robotics, health care to sports, and beyond. Previous speech motor models were proposed to understand how speech movement is produced and how the resulting speech varies when some parameters are changed. However, the inverse approach, in which the muscular response parameters and the subject's age are derived from real continuous speech, is not possible with such models. In the handwriting field, by contrast, the kinematic theory of rapid human movements and its associated Sigma-lognormal model have been applied successfully to obtain the muscular response parameters. This work presents a speech-kinematics-based model that can be used to study, analyze, and reconstruct complex speech kinematics in a simplified manner. A method based on the kinematic theory of rapid human movements and its associated Sigma-lognormal model is applied to describe and parameterize the asymptotic impulse response of the neuromuscular networks involved in speech as a response to a neuromotor command. The method used to carry out transformations from formants to a movement observation is also presented. Experiments carried out with the (English) VTR TIMIT database and the (German) Saarbrücken Voice Database, including people of different ages, with and without laryngeal pathologies, corroborate the link between the extracted parameters and aging, on the one hand, and the proportion between the first and second formants required in applying the kinematic theory of rapid human movements, on the other. The results should drive innovative developments in the modeling and understanding of speech kinematics.
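
For reference, the lognormal impulse response at the core of the kinematic theory has the following standard form; the notation follows the common Sigma-lognormal literature and is not necessarily this paper's exact parameterization:

```latex
% Lognormal impulse response of a neuromuscular system to a command
% issued at time t_0, with log-time delay \mu and log-response time
% \sigma; a movement's velocity profile is a sum of N such strokes,
% each scaled by a command amplitude D_i.
\[
  \Lambda(t; t_0, \mu, \sigma) =
    \frac{1}{\sigma \sqrt{2\pi}\,(t - t_0)}
    \exp\!\left(-\frac{\bigl(\ln(t - t_0) - \mu\bigr)^2}{2\sigma^2}\right),
  \qquad
  \lvert \vec{v}(t) \rvert = \sum_{i=1}^{N} D_i\,
    \Lambda(t; t_{0i}, \mu_i, \sigma_i)
\]
```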

Detection of Auditory Brainstem Response Peaks Using Image Processing Techniques in Infants with Normal Hearing Sensitivity

Introduction: The auditory brainstem response (ABR) is measured to assess the integrity of the peripheral auditory nervous system at the brainstem level in children with normal hearing. The auditory evoked potential (AEP) is generated using acoustic stimuli. Interpreting these waves requires competence to avoid misdiagnosing hearing problems; automating ABR test labeling with computer vision may reduce human error. Method: The ABR test results of 26 children aged 1 to 20 months with normal hearing in both ears were used. A new approach is suggested for automatically calculating the peaks of waves recorded at different intensities (in decibels). The procedure entails acquiring wave images from an Audera device using the Color Thresholder method, segmenting each wave as a single wave image using the Image Region Analyzer application, converting all wave images into waveforms using image processing (IP) techniques, and finally calculating the latency of the peaks of each wave for use by an audiologist in diagnosis. Findings: Image processing techniques were able to detect waves 1, 3, and 5 in the diagnostic field with accuracies of 0.82, 0.98, and 0.98, respectively, and precisions of 0.32, 0.97, and 0.87, respectively. The thresholding stage also performed well, correctly detecting 82.7% of the ABR waves. Conclusion: Our findings indicate that the audiology test battery can be made more accurate, quick, and error-free by using technology to automatically detect and label ABR waves.
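
A minimal sketch of the final pipeline step, locating peak latencies on a waveform already extracted from the device image, might look as follows; the sampling window, prominence threshold, and synthetic test wave are assumptions, not the paper's settings:

```python
# Peak-latency extraction on a 1-D ABR waveform recovered from an image.
import numpy as np
from scipy.signal import find_peaks

def peak_latencies_ms(wave: np.ndarray, duration_ms: float = 12.0) -> np.ndarray:
    """Return latencies (ms) of prominent positive peaks in an ABR wave."""
    peaks, _ = find_peaks(wave, prominence=0.1 * np.ptp(wave))
    t = np.linspace(0.0, duration_ms, wave.size)
    return t[peaks]

# Usage on a synthetic wave with bumps near 1.6 ms and 5.6 ms; a real
# wave would come from the image segmentation steps described above.
t = np.linspace(0, 12, 1200)
wave = np.exp(-((t - 5.6) ** 2) / 0.1) + 0.4 * np.exp(-((t - 1.6) ** 2) / 0.05)
print(peak_latencies_ms(wave))
```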

SNP-S3: Shared Network Pre-training and Significant Semantic Strengthening for Various Video-Text Tasks

We present a framework for learning cross-modal video representations by directly pre-training on raw data to facilitate various downstream video-text tasks. Our main contributions lie in the pre-training framework and proxy tasks. First, motivated by the shortcomings of the two mainstream pixel-level pre-training architectures (limited applicability or low efficiency), we propose Shared Network Pre-training (SNP). By employing one shared BERT-type network to refine textual and cross-modal features simultaneously, SNP is lightweight and can support various downstream applications. Second, based on the intuition that people always pay attention to several "significant words" when understanding a sentence, we propose the Significant Semantic Strengthening (S3) strategy, which includes a novel masking and matching proxy task to improve pre-training performance. Experiments conducted on three downstream video-text tasks and six datasets demonstrate that we establish a new state of the art in pixel-level video-text pre-training and achieve a satisfactory balance between pre-training efficiency and fine-tuning performance. The codebase is available at https://github.com/alipay/Ant-Multi-Modal-Framework/tree/main/prj/snps3_vtp.
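
As an illustration of the kind of masking step a significant-word proxy task builds on, here is a hedged sketch; how significant words are actually scored and selected is the paper's contribution, so this simply takes a given set as input:

```python
# Illustrative significant-word masking: mask semantically important
# tokens with higher probability than the uniform masking used in
# standard masked language modeling. All names here are hypothetical.
import random

def mask_significant(tokens: list[str], significant: set[str],
                     mask_token: str = "[MASK]", p: float = 0.8) -> list[str]:
    """Mask significant tokens with probability p, leaving others intact."""
    return [mask_token if tok in significant and random.random() < p else tok
            for tok in tokens]

caption = "a brown dog catches a frisbee on the beach".split()
print(mask_significant(caption, significant={"dog", "frisbee", "beach"}))
```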

ReAlnet: Achieving More Human Brain-Like Vision via Human Neural Representational Alignment

Despite the remarkable strides made in artificial intelligence, current object recognition models still lag behind in emulating the mechanisms of visual information processing in human brains. Recent studies have highlighted the potential of using neural data to mimic brain processing; however, these often rely on invasive neural recordings from non-human subjects, leaving a critical gap in our understanding of human visual perception and in the development of more human brain-like vision models. Addressing this gap, we present, for the first time, "Re(presentational)Al(ignment)net", a vision model aligned with human brain activity based on non-invasive EEG recordings, demonstrating significantly higher similarity to human brain representations. Our innovative image-to-brain multi-layer encoding alignment framework not only optimizes multiple layers of the model, marking a substantial leap in neural alignment, but also enables the model to efficiently learn and mimic the human brain's visual representational patterns across object categories and different neural data modalities. Furthermore, we discover that alignment with human brain representations improves the model's adversarial robustness. Our findings suggest that ReAlnet sets a new precedent in the field, bridging the gap between artificial and human vision and paving the way for more brain-like artificial intelligence systems.
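
A generic sketch of a multi-layer representational-alignment objective is shown below; the per-layer linear projections, EEG target vectors, and loss weighting are assumptions for illustration, and the paper's actual image-to-brain encoding framework is more elaborate:

```python
# Penalize dissimilarity between projected model-layer features and
# per-image EEG-derived target vectors, summed over several layers.
import torch
import torch.nn.functional as F

def alignment_loss(layer_feats: list[torch.Tensor],
                   eeg_targets: list[torch.Tensor],
                   projections: list[torch.nn.Linear]) -> torch.Tensor:
    """Mean cosine dissimilarity between projected features and EEG targets."""
    losses = []
    for feats, target, proj in zip(layer_feats, eeg_targets, projections):
        pred = proj(feats.flatten(start_dim=1))  # map layer features to EEG space
        losses.append(1.0 - F.cosine_similarity(pred, target, dim=1).mean())
    return torch.stack(losses).mean()

# total = task_loss + lambda_align * alignment_loss(...)  # weighting assumed
```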

Evaluation in Neural Style Transfer: A Review

The field of Neural Style Transfer (NST) has witnessed remarkable progress in the past few years, with approaches being able to synthesize artistic and photorealistic images and videos of exceptional quality. To evaluate such results, a diverse landscape of evaluation methods and metrics is used, including authors' opinions based on side-by-side comparisons, human evaluation studies that quantify the subjective judgements of participants, and a multitude of quantitative computational metrics which objectively assess the different aspects of an algorithm's performance. However, there is no consensus regarding the most suitable and effective evaluation procedure that can guarantee the reliability of the results. In this review, we provide an in-depth analysis of existing evaluation techniques, identify the inconsistencies and limitations of current evaluation methods, and give recommendations for standardized evaluation practices. We believe that the development of a robust evaluation framework will not only enable more meaningful and fairer comparisons among NST methods but will also enhance the comprehension and interpretation of research findings in the field.
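
As one concrete instance of the quantitative computational metrics the review surveys, SSIM between a stylized result and its content image is sometimes used as a (debated) proxy for content preservation; the file names below are placeholders, and the images must share the same shape:

```python
# Structural similarity (SSIM) as an example NST evaluation metric.
from skimage.io import imread
from skimage.metrics import structural_similarity as ssim

content = imread("content.png")
stylized = imread("stylized.png")
score = ssim(content, stylized, channel_axis=-1)  # 1.0 = identical structure
print(f"content-preservation SSIM: {score:.3f}")
```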

Efficient Gesture Recognition on Spiking Convolutional Networks Through Sensor Fusion of Event-Based and Depth Data

As intelligent systems become increasingly important in our daily lives, new ways of interaction are needed. Classical user interfaces pose issues for the physically impaired and are often impractical or inconvenient. Gesture recognition is an alternative, but it is often not reactive enough when conventional cameras are used. This work proposes a spiking convolutional neural network that processes event and depth data for gesture recognition. The network is simulated using the open-source neuromorphic computing framework LAVA for offline training and evaluation on an embedded system. Three open-source datasets are used for the evaluation; since these do not reflect the bi-modality applied here, a new dataset with synchronized event and depth data was recorded. The results show that temporal encoding of depth information is viable and that fusing the two modalities, even when they are encoded differently, benefits network performance and generalization capabilities.
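
A minimal sketch of temporal (latency) coding for depth frames, where nearer pixels fire earlier within the frame window, is given below; the time window and depth range are assumptions, and the paper's actual encoding and LAVA network definition are not reproduced:

```python
# Latency coding of a depth frame into per-pixel spike times.
import numpy as np

def depth_to_spike_times(depth: np.ndarray, t_window: float = 10.0,
                         d_max: float = 4.0) -> np.ndarray:
    """Map depth (m) to spike times in [0, t_window] ms; invalid pixels never fire."""
    norm = np.clip(depth / d_max, 0.0, 1.0)
    times = norm * t_window          # near -> early spike, far -> late spike
    times[depth <= 0] = np.inf       # no depth return = no spike
    return times

depth_frame = np.random.uniform(0.0, 4.0, size=(64, 64))  # stand-in sensor frame
spike_times = depth_to_spike_times(depth_frame)
```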

A Tournament of Transformation Models: B-Spline-based vs. Mesh-based Multi-Objective Deformable Image Registration

The transformation model is an essential component of any deformable image registration approach. It provides a representation of physical deformations between images, thereby defining the range and realism of registrations that can be found. Two types of transformation models have emerged as popular choices: B-spline models and mesh models. Although both models have been investigated in detail, a direct comparison has not yet been made, since the models are optimized using very different optimization methods in practice. B-spline models are predominantly optimized using gradient-descent methods, while mesh models are typically optimized using finite-element-method solvers or evolutionary algorithms. Multi-objective optimization methods, which aim to find a diverse set of high-quality trade-off registrations, are increasingly acknowledged to be important in deformable image registration. Since these methods search for a diverse set of registrations, they can provide a more complete picture of the capabilities of different transformation models, making them suitable for a comparison of models. In this work, we conduct the first direct comparison between B-spline and mesh transformation models, by optimizing both models with the same state-of-the-art multi-objective optimization method, the Multi-Objective Real-Valued Gene-pool Optimal Mixing Evolutionary Algorithm (MO-RV-GOMEA); the combination with B-spline transformation models is, moreover, novel. We experimentally compare both models on two registration problems, both based on pelvic CT scans of cervical cancer patients and featuring large deformations. Our results, on three cervical cancer patients, indicate that the choice of transformation model can have a profound impact on the diversity and quality of achieved registration outcomes.
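
For context, B-spline transformation models are commonly written as a free-form deformation over a regular control-point grid; a generic 2D form, standard in the literature and not necessarily this paper's exact parameterization, is:

```latex
% Cubic B-spline free-form deformation (2D), with control-point
% displacements \phi on a regular grid, (i, j) the grid cell containing
% x, (u, v) its local coordinates within that cell, and B_0..B_3 the
% cubic B-spline basis functions.
\[
  T(\mathbf{x}) \;=\; \mathbf{x} \;+\;
    \sum_{l=0}^{3}\sum_{m=0}^{3} B_l(u)\, B_m(v)\,
    \boldsymbol{\phi}_{i+l,\,j+m}
\]
```

A mesh model, by contrast, represents the deformation directly through the displacements of the vertices of a triangular or tetrahedral mesh.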

Evaluating the Utility of Conformal Prediction Sets for AI-Advised Image Labeling

As deep neural networks are more commonly deployed in high-stakes domains, their lack of interpretability makes uncertainty quantification challenging. We investigate the effects of presenting conformal prediction sets (a method for generating valid confidence sets in distribution-free uncertainty quantification) as a way to express uncertainty in AI-advised decision-making. Through a large online experiment, we compare the utility of conformal prediction sets to displays of Top-1 and Top-k predictions for AI-advised image labeling. We find that the utility of prediction sets for accuracy varies with the difficulty of the task: while they result in accuracy on par with or less than Top-1 and Top-k displays for easy images, prediction sets excel at assisting humans in labeling out-of-distribution (OOD) images, especially when the set size is small. Our results empirically pinpoint the practical challenges of conformal prediction sets and provide implications for how to incorporate them into real-world decision-making.
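
For readers unfamiliar with how such sets are produced, below is a minimal sketch of the standard split conformal recipe for classification. It assumes held-out calibration softmax outputs and labels are available; this is the textbook method, not necessarily the exact variant used in the experiment:

```python
# Split conformal prediction for classification with the simple
# 1 - softmax(true class) nonconformity score.
import numpy as np

def calibrate(cal_probs: np.ndarray, cal_labels: np.ndarray,
              alpha: float = 0.1) -> float:
    """Return the score threshold giving ~(1 - alpha) marginal coverage."""
    n = len(cal_labels)
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]   # nonconformity scores
    q_level = np.ceil((n + 1) * (1 - alpha)) / n         # finite-sample correction
    return np.quantile(scores, min(q_level, 1.0), method="higher")

def prediction_set(probs: np.ndarray, qhat: float) -> np.ndarray:
    """Classes whose nonconformity score falls below the threshold."""
    return np.flatnonzero(1.0 - probs <= qhat)

# qhat = calibrate(cal_probs, cal_labels)
# labels = prediction_set(model_probs_for_one_image, qhat)
```

Smaller sets at the same coverage level indicate a more confident model, which matches the study's observation that small sets are the most useful to human labelers.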