Supervised Deep Sparse Coding Networks for Image Classification

In this paper, we propose a novel deep sparse coding network (SCN) capable of efficiently adapting its own regularization parameters for a given application. The network is trained end-to-end with a supervised task-driven learning algorithm via error backpropagation. During training, the network learns both the dictionaries and the regularization parameters of each sparse coding layer, so that the reconstructive dictionaries are smoothly transformed into increasingly discriminative representations. In addition, the adaptive regularization gives the network more flexibility to adjust sparsity levels. Furthermore, we devise a sparse coding layer that utilizes a “skinny” dictionary. These skinny dictionaries, which are key to computational efficiency, compress the high-dimensional sparse codes into lower-dimensional structures. The adaptivity and discriminability of our 15-layer SCN are demonstrated on six benchmark datasets, namely Cifar-10, Cifar-100, STL-10, SVHN, MNIST, and ImageNet, most of which are considered difficult for sparse coding models. Experimental results show that our architecture substantially outperforms traditional one-layer sparse coding architectures while using far fewer parameters. Moreover, our multilayer architecture combines the benefits of depth with sparse coding's characteristic ability to operate on smaller datasets. In such data-constrained scenarios, our technique demonstrates highly competitive performance compared with deep neural networks.
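As a concrete illustration of the idea, the following is a minimal PyTorch sketch (our own simplification, not the authors' exact layer) of a sparse coding layer with a learnable regularization parameter and a "skinny" dictionary that compresses the code before the next layer; the variable names and the unrolled-ISTA solver are assumptions made for readability.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseCodingLayer(nn.Module):
    def __init__(self, in_dim, code_dim, out_dim, n_iter=3):
        super().__init__()
        self.D = nn.Parameter(0.01 * torch.randn(in_dim, code_dim))   # reconstructive dictionary
        self.W = nn.Parameter(0.01 * torch.randn(code_dim, out_dim))  # "skinny" dictionary (out_dim < code_dim)
        self.log_lam = nn.Parameter(torch.zeros(1))                   # learnable regularization parameter
        self.n_iter = n_iter

    def forward(self, x):                          # x: (batch, in_dim)
        lam = F.softplus(self.log_lam)             # keep lambda positive; updated by backpropagation
        step = 1.0 / (self.D.norm() ** 2 + 1e-6)   # crude step size from a Lipschitz bound
        z = x.new_zeros(x.size(0), self.D.size(1))
        for _ in range(self.n_iter):               # a few unrolled ISTA iterations
            u = z - step * (z @ self.D.T - x) @ self.D
            z = torch.sign(u) * torch.clamp(u.abs() - step * lam, min=0.0)  # soft thresholding
        return z @ self.W                          # compress the sparse code to a lower dimension

Because the soft threshold depends on the learnable lambda, gradients from the task loss adjust each layer's sparsity level alongside its dictionaries.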

Neural Multimodal Cooperative Learning Toward Micro-Video Understanding

The prevailing characteristics of micro-videos limit the descriptive power of each individual modality. Several pioneering efforts have proposed micro-video representations, but they are limited to implicitly exploring the consistency between modalities and ignore their complementarity. In this paper, we focus on how to explicitly separate the consistent features and the complementary features from the mixed information and harness their combination to improve the expressiveness of each modality. Toward this end, we present a neural multimodal cooperative learning (NMCL) model that splits the consistent component and the complementary component with a novel relation-aware attention mechanism. Specifically, the computed attention score measures the correlation between the features extracted from different modalities. Then, a threshold is learned for each modality to distinguish the consistent features from the complementary ones according to the score. Thereafter, we integrate the consistent parts to enhance the representations and supplement the complementary ones to reinforce the information in each modality. To address redundant information, which may cause overfitting and is hard to distinguish, we devise an attention network that dynamically captures the features closely related to the category and outputs a discriminative representation for prediction. Experimental results on a real-world micro-video dataset show that NMCL outperforms the state-of-the-art methods. Further studies verify the effectiveness and the cooperative effects brought by the attention mechanism.
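A rough sketch of the splitting step, under our own assumed simplifications (per-dimension correlation scores and a soft learnable threshold; the paper's relation-aware attention may be formulated differently):

import torch
import torch.nn as nn

class RelationSplit(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.proj_a = nn.Linear(dim, dim)
        self.proj_b = nn.Linear(dim, dim)
        self.threshold = nn.Parameter(torch.tensor(0.5))   # learned threshold for this modality

    def forward(self, feat_a, feat_b):                     # (batch, dim) features from two modalities
        score = torch.sigmoid(self.proj_a(feat_a) * self.proj_b(feat_b))  # correlation per dimension
        mask = torch.sigmoid(10.0 * (score - self.threshold))             # soft, differentiable threshold
        consistent = mask * feat_a               # parts that agree with the other modality
        complementary = (1.0 - mask) * feat_a    # modality-specific parts
        return consistent, complementary

The consistent part can then be fused across modalities to enhance the shared representation, while the complementary part supplements each modality's own information.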

Cross-View Gait Recognition by Discriminative Feature Learning

Recently, deep learning-based cross-view gait recognition has become popular owing to the strong capacity of convolutional neural networks (CNNs). Current deep learning methods often rely on loss functions widely used in face recognition, e.g., contrastive loss and triplet loss, which suffer from the difficulty of hard negative mining. In this paper, a robust, effective, and gait-related loss function, called angle center loss (ACL), is proposed to learn discriminative gait features. The proposed loss function is robust to different local parts and temporal window sizes. Different from center loss, which learns a single center for each identity, the proposed loss function learns multiple sub-centers for each angle of the same identity. Only the largest distance between the anchor feature and the corresponding cross-view sub-centers is penalized, which achieves better intra-subject compactness. We also propose to extract discriminative spatial–temporal features using local feature extractors and a temporal attention model. A simplified spatial transformer network is proposed to localize suitable horizontal parts of the human body. Local gait features for each horizontal part are extracted and then concatenated to form the descriptor. We introduce long short-term memory (LSTM) units as the temporal attention model to learn an attention score for each frame, focusing more on discriminative frames and less on low-quality frames. The temporal attention model shows better performance than temporal average pooling or gait energy images (GEI). By combining the three aspects, we achieve state-of-the-art results on several cross-view gait recognition benchmarks.
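As a hedged illustration of the loss (shapes and names are our assumptions, not the paper's exact formulation): each identity keeps one sub-center per viewing angle, and only the farthest cross-view sub-center is penalized.

import torch
import torch.nn as nn

class AngleCenterLoss(nn.Module):
    def __init__(self, n_ids, n_angles, feat_dim):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(n_ids, n_angles, feat_dim))  # one sub-center per angle per identity

    def forward(self, feats, ids):              # feats: (batch, feat_dim), ids: (batch,) identity indices
        centers = self.centers[ids]             # (batch, n_angles, feat_dim)
        dists = (feats.unsqueeze(1) - centers).pow(2).sum(-1)   # distance to every sub-center
        return dists.max(dim=1).values.mean()   # penalize only the largest cross-view distance

Penalizing only the largest distance pulls the anchor toward its worst-matched view, which is what yields the improved intra-subject compactness described above.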

Accurate Pedestrian Detection by Human Pose Regression

Pedestrian detection with high detection and localization accuracy is increasingly important for many practical applications. Due to the flexible structure of the human body, it is hard to train a template-based pedestrian detector that achieves a high detection rate and good localization accuracy simultaneously. In this paper, we utilize human pose estimation to improve the detection and localization accuracy of pedestrian detection. We design two kinds of pose-indexed features that considerably improve the discriminability of the detector. In addition to employing a two-stage pipeline to carry out these two tasks, we also unify pose estimation and pedestrian detection into a cascaded decision forest in which they can cooperate sufficiently. To prevent irregular positive examples, such as truncated ones, from misleading the pedestrian detection and the pose regression, we clean the positive training data by realigning the bounding boxes and rejecting incorrect positive samples. Experimental results on the Caltech test dataset demonstrate the effectiveness of our proposed method. Our detector achieves a log-average miss rate (MR−2) of 11.1%, outperforming all existing detectors that do not use a convolutional neural network (CNN). Moreover, our method can be combined with CNN-based detectors to further improve detection and localization performance. By collaborating with a recent CNN-based method, our detector achieves 5.5% MR−2 on the Caltech test dataset, outperforming the state-of-the-art methods.
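To make the notion of a pose-indexed feature concrete, here is a hypothetical NumPy sketch (the actual features in the paper are computed inside a cascaded decision forest, so this is only an assumption-level illustration): image features are sampled at regressed joint locations, so the descriptor adapts to the flexible structure of the body.

import numpy as np

def pose_indexed_feature(feature_map, joints):
    """feature_map: (H, W, C) feature array; joints: (K, 2) regressed (row, col) joint positions."""
    h, w, _ = feature_map.shape
    rows = np.clip(joints[:, 0].astype(int), 0, h - 1)
    cols = np.clip(joints[:, 1].astype(int), 0, w - 1)
    return feature_map[rows, cols].reshape(-1)   # concatenate the per-joint feature vectors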

Unsupervised Monocular Depth Estimation From Light Field Image

Learning-based depth estimation from light fields has made significant progress in recent years. However, most existing approaches operate under the supervised framework, which requires vast quantities of ground-truth depth data for training. Furthermore, accurate depth maps for light fields are rarely available, except in a few synthetic datasets. In this paper, we exploit the multi-orientation epipolar geometry of the light field and propose an unsupervised monocular depth estimation network. It predicts depth from the central view of the light field without any ground-truth information. Inspired by the inherent depth cues and geometric constraints of the light field, we introduce three novel unsupervised loss functions: a photometric loss, a defocus loss, and a symmetry loss. We have evaluated our method on a public 4D light field synthetic dataset. As the first unsupervised method published on the 4D Light Field Benchmark website, our method achieves satisfactory performance on most error metrics. Comparison experiments with two state-of-the-art unsupervised methods demonstrate the superiority of our method. We also demonstrate the effectiveness and generality of our method on real-world light-field images.
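As an example of how such losses can be written without ground truth, the sketch below implements a simple photometric term: the predicted disparity warps a neighboring sub-aperture view onto the central view, and the reconstruction error is penalized. The shift-based warping model and the function signature are assumptions, not the paper's exact geometry, and the defocus and symmetry terms are omitted.

import torch
import torch.nn.functional as F

def photometric_loss(center_view, side_view, disparity, du, dv):
    """center_view, side_view: (B, C, H, W); disparity: (B, 1, H, W); (du, dv): angular offset of the side view."""
    _, _, h, w = center_view.shape
    ys, xs = torch.meshgrid(torch.arange(h, device=center_view.device),
                            torch.arange(w, device=center_view.device), indexing="ij")
    grid_x = (xs.float() + du * disparity[:, 0]) / (w - 1) * 2 - 1   # normalized x coordinates
    grid_y = (ys.float() + dv * disparity[:, 0]) / (h - 1) * 2 - 1   # normalized y coordinates
    grid = torch.stack([grid_x, grid_y], dim=-1)                     # (B, H, W, 2)
    warped = F.grid_sample(side_view, grid, align_corners=True)      # side view warped onto the central view
    return (warped - center_view).abs().mean()                       # L1 photometric error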

Heterogeneous Multireference Alignment for Images With Application to 2D Classification in Single Particle Reconstruction

Motivated by the task of 2D classification in single particle reconstruction by cryo-electron microscopy (cryo-EM), we consider the problem of heterogeneous multireference alignment of images. In this problem, the goal is to estimate a (typically small) set of target images from a (typically large) collection of observations. Each observation is a rotated, noisy version of one of the target images. For each individual observation, neither the rotation nor the identity of the underlying target image is known. As the noise level in cryo-EM data is high, clustering the observations and estimating individual rotations is challenging. We propose a framework to estimate the target images directly from the observations, completely bypassing the need to cluster or register the images. The framework consists of two steps. First, we estimate rotation-invariant features of the images, such as the bispectrum. These features can be estimated to any desired accuracy, at any noise level, provided sufficiently many observations are collected. Then, we estimate the images from the invariant features. Numerical experiments on synthetic cryo-EM datasets demonstrate the effectiveness of the method. Finally, we outline the future developments required to apply this method to experimental data.
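The following one-dimensional toy example conveys the invariant-feature step; cyclic shifts stand in for image rotations, which is a deliberate simplification of the 2D setting. Because the power spectrum and bispectrum are shift-invariant, averaging them over many noisy, shifted observations yields stable estimates of the invariants (mixed over the underlying signals in the heterogeneous case) without clustering or alignment.

import numpy as np

def invariant_features(observations):
    """observations: (n, L) array of noisy, cyclically shifted copies of unknown signals."""
    F = np.fft.fft(observations, axis=1)                      # DFT of each observation
    power = np.mean(np.abs(F) ** 2, axis=0)                   # shift-invariant power spectrum
    L = F.shape[1]
    i, j = np.meshgrid(np.arange(L), np.arange(L), indexing="ij")
    bispec = np.mean(F[:, i] * F[:, j] * np.conj(F[:, (i + j) % L]), axis=0)  # shift-invariant bispectrum
    return power, bispec                                      # estimates sharpen as n grows, at any noise level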

Optical-Flow Based Nonlinear Weighted Prediction for SDR and Backward Compatible HDR Video Coding

Tone Mapping Operators (TMOs) designed for videos can be classified into two categories. In the first approach, TMOs are temporally filtered to reduce temporal artifacts and provide Standard Dynamic Range (SDR) content with improved temporal consistency; however, this does not improve the SDR coding Rate-Distortion (RD) performance. The second approach is to design the TMO with the goal of optimizing the SDR coding RD performance. This second category of methods may, however, produce SDR videos that alter the artistic intent of the original HDR content. In this paper, we combine the benefits of the two approaches by introducing new Weighted Prediction (WP) methods inside the HEVC SDR codec. As a first step, we demonstrate the benefit of WP methods compared with TMOs optimized for RD performance. We then present the newly introduced WP algorithm and WP modes. The WP algorithm performs global motion compensation between frames using optical flow, and the new modes are based on nonlinear functions, in contrast with the linear functions used in the literature. The contribution of each novelty is first studied independently, and then all of them are put in competition to maximize the RD performance. Tests were performed for HDR backward-compatible compression as well as for SDR-only compression. In both cases, the proposed WP methods improve the RD performance while maintaining the SDR temporal coherence.
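To illustrate the flavor of a nonlinear WP mode, the toy sketch below fits a global power-law mapping from motion-compensated reference luma to current-frame luma; the model choice and interface are our assumptions and merely stand in for the nonlinear functions used as WP modes.

import numpy as np

def fit_nonlinear_wp(ref_luma, cur_luma):
    """ref_luma, cur_luma: motion-compensated and current luma samples, normalized to [0, 1]."""
    eps = 1e-6
    x = np.log(ref_luma.ravel() + eps)
    y = np.log(cur_luma.ravel() + eps)
    gamma, log_a = np.polyfit(x, y, 1)                         # least-squares fit of y = gamma * x + log(a)
    return lambda l: np.exp(log_a) * np.power(l + eps, gamma)  # nonlinear predictor: cur ≈ a * ref**gamma

In a codec, the fitted parameters would be signaled to the decoder so that the same nonlinear prediction can be reproduced there.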

Learning No-Reference Quality Assessment of Multiply and Singly Distorted Images With Big Data

Previous research on no-reference (NR) quality assessment of multiply-distorted images focused mainly on three distortion types (white noise, Gaussian blur, and JPEG compression), whereas in practice images can be contaminated by many other common distortions introduced at various stages of processing. Although MUSIQUE (MUltiply- and Singly-distorted Image QUality Estimator) [Zhang et al., TIP 2018] is a successful NR algorithm, it is still limited to those three distortion types. In this paper, we extend MUSIQUE to MUSIQUE-II to blindly assess the quality of images corrupted by five distortion types (white noise, Gaussian blur, JPEG compression, JPEG2000 compression, and contrast change) and their combinations. The proposed MUSIQUE-II algorithm builds upon the classification and parameter-estimation framework of its predecessor by using more advanced models and a more comprehensive set of distortion-sensitive features. Specifically, MUSIQUE-II relies on a three-layer classification model to identify 19 distortion types. To predict the five distortion parameter values, MUSIQUE-II extracts an additional 14 contrast features and employs a multi-layer probability-weighting rule. Finally, MUSIQUE-II employs a new most-apparent-distortion strategy to adaptively combine five quality scores based on the outputs of the three classification models. Experimental results on three multiply-distorted and six singly-distorted image quality databases show that MUSIQUE-II yields not only a substantial improvement in quality-prediction performance compared with its predecessor, but also highly competitive performance relative to other state-of-the-art FR/NR IQA algorithms.
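A minimal sketch of the final fusion idea, under the assumption of a simple probability-weighting rule (the exact rule, score definitions, and most-apparent-distortion strategy in MUSIQUE-II may differ):

import numpy as np

def fuse_quality_scores(scores, class_probs):
    """scores: (k,) quality estimates, one per hypothesized distortion; class_probs: (k,) classifier probabilities."""
    w = class_probs / (class_probs.sum() + 1e-12)   # normalize the classification outputs
    return float(np.dot(w, scores))                 # probability-weighted overall quality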

High-Order Feature Learning for Multi-Atlas Based Label Fusion: Application to Brain Segmentation With MRI

Multi-atlas based segmentation methods have shown their effectiveness for the segmentation of brain regions-of-interest (ROIs) by propagating labels from multiple atlases to a target image based on the similarity between patches in the target image and the atlas images. Most existing multi-atlas based methods use image intensity features to calculate the similarity between a pair of image patches for label fusion. However, low-level image intensity features alone cannot adequately characterize the complex appearance patterns (e.g., the high-order relationships between voxels within a patch) of brain magnetic resonance (MR) images. To address this issue, this paper develops a high-order feature learning framework for multi-atlas based label fusion, in which high-order features of image patches are extracted and fused for segmenting ROIs of structural brain MR images. Specifically, an unsupervised feature learning method (i.e., the means-covariances restricted Boltzmann machine, mcRBM) is employed to learn high-order features (i.e., mean and covariance features) of patches in brain MR images. Then, a group-fused sparsity dictionary learning method is proposed to jointly calculate the voting weights for label fusion, based on the learned high-order features and the original image intensity features. The proposed method is compared with several state-of-the-art label fusion methods on the ADNI, NIREP, and LONI-LPBA40 datasets. The Dice ratios achieved by our method are 88.30%, 88.83%, 79.54%, and 81.02% for the left and right hippocampus on ADNI and on the NIREP and LONI-LPBA40 datasets, respectively, while the best Dice ratios yielded by the other methods are 86.51%, 87.39%, 78.48%, and 79.65%, respectively.
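For context, the baseline idea that the group-fused sparsity dictionary learning replaces is similarity-weighted patch voting; a generic sketch (not the proposed weight-learning method) follows.

import numpy as np

def fuse_labels(target_patch, atlas_patches, atlas_labels, sigma=0.5):
    """target_patch: (d,) feature vector; atlas_patches: (m, d); atlas_labels: (m,) integer labels."""
    d2 = np.sum((atlas_patches - target_patch) ** 2, axis=1)
    w = np.exp(-d2 / (2 * sigma ** 2))                       # voting weight per atlas patch
    labels = np.unique(atlas_labels)
    votes = [w[atlas_labels == lab].sum() for lab in labels]
    return labels[int(np.argmax(votes))]                     # weighted majority label

In the proposed framework, the feature vectors would concatenate the mcRBM mean and covariance features with the intensity features, and the voting weights would be computed jointly rather than independently per patch.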

Deep Retinal Image Segmentation With Regularization Under Geometric Priors

Vessel segmentation of retinal images is a key diagnostic capability in ophthalmology. The problem faces several challenges, including low contrast, variable vessel size and thickness, and the presence of interfering pathology such as micro-aneurysms and hemorrhages. Early approaches to this problem employed hand-crafted filters to capture vessel structures, followed by morphological post-processing. More recently, deep learning techniques have been employed with significantly enhanced segmentation accuracy. We propose a novel domain-enriched deep network that consists of two components: 1) a representation network that learns geometric features specific to retinal images, and 2) a custom-designed, computationally efficient residual task network that uses the features obtained from the representation layer to perform pixel-level segmentation. The representation and task networks are jointly learned for any given training set. To obtain physically meaningful and practically effective representation filters, we propose two new constraints inspired by the expected prior structure of these filters: 1) an orientation constraint that promotes geometric diversity of curvilinear features, and 2) a data-adaptive noise regularizer that penalizes false positives. Multi-scale extensions are developed to enable accurate detection of thin vessels. Experiments performed on three challenging benchmark databases under a variety of training scenarios show that the proposed prior-guided deep network outperforms state-of-the-art alternatives as measured by common evaluation metrics, while being more economical in network size and inference time.
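A hedged sketch of a filter-diversity regularizer in the spirit of the orientation constraint: it penalizes pairwise correlation between the learned representation filters so that they cover diverse orientations. The exact constraint in the paper may be formulated differently, and the noise regularizer is not shown.

import torch

def orientation_diversity_penalty(filters):
    """filters: (n_filters, 1, k, k) convolution kernels of the representation layer."""
    f = filters.flatten(1)
    f = f / (f.norm(dim=1, keepdim=True) + 1e-8)              # unit-normalize each filter
    gram = f @ f.T                                            # pairwise cosine similarities
    off_diag = gram - torch.eye(gram.size(0), device=gram.device)
    return off_diag.pow(2).sum()                              # small when filters are mutually dissimilar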