ME R-CNN: Multi-Expert R-CNN for Object Detection

We introduce the Multi-Expert Region-based Convolutional Neural Network (ME R-CNN), which is equipped with multiple experts (ME), each trained to process a certain type of region of interest (RoI). This architecture better captures the appearance variations of RoIs caused by different shapes, poses, and viewing angles. To direct each RoI to the appropriate expert, we devise a novel “learnable” network, which we call the expert assignment network (EAN). EAN automatically learns the optimal RoI-expert relationship without any supervision of expert assignment. Because the major components of ME R-CNN, ME and EAN, affect each other while tied to a shared network, both alternating and naive end-to-end optimization are likely to fail. To address this problem, we introduce a practical training strategy tailored to optimize ME, EAN, and the shared network in an end-to-end fashion. We show that both architectures provide a considerable performance increase over the baselines on the PASCAL VOC 07 and 12 and MS COCO datasets.
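
To make the routing idea concrete, here is a minimal PyTorch sketch (not the authors' implementation; the expert count, feature size, and hard argmax routing are illustrative assumptions) of an expert assignment network dispatching each RoI feature to one per-expert classification head:

    import torch
    import torch.nn as nn

    class MEHead(nn.Module):
        """Toy multi-expert head: the EAN scores each RoI and routes it
        to one expert sub-network (hard argmax routing for simplicity)."""
        def __init__(self, feat_dim=1024, num_experts=3, num_classes=21):
            super().__init__()
            self.ean = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(),
                                     nn.Linear(256, num_experts))
            self.experts = nn.ModuleList(
                [nn.Linear(feat_dim, num_classes) for _ in range(num_experts)])

        def forward(self, roi_feats):                    # (N, feat_dim)
            assign = self.ean(roi_feats).argmax(dim=1)   # expert index per RoI
            scores = torch.stack([e(roi_feats) for e in self.experts], dim=1)
            return scores[torch.arange(roi_feats.size(0)), assign]

    head = MEHead()
    print(head(torch.randn(8, 1024)).shape)              # torch.Size([8, 21])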

Cross-View Gait Recognition by Discriminative Feature Learning

Recently, deep learning-based cross-view gait recognition has become popular owing to the strong capacity of convolutional neural networks (CNNs). Current deep learning methods often rely on loss functions widely used in face recognition, e.g., contrastive loss and triplet loss, which suffer from the difficulty of hard negative mining. In this paper, a robust, effective, and gait-related loss function, called angle center loss (ACL), is proposed to learn discriminative gait features. The proposed loss function is robust to different local parts and temporal window sizes. Unlike center loss, which learns a single center for each identity, the proposed loss function learns multiple sub-centers for each angle of the same identity. Only the largest distance between the anchor feature and the corresponding cross-view sub-centers is penalized, which achieves better intra-subject compactness. We also propose to extract discriminative spatial–temporal features with local feature extractors and a temporal attention model. A simplified spatial transformer network is proposed to localize suitable horizontal parts of the human body. Local gait features for each horizontal part are extracted and then concatenated as the descriptor. We introduce long short-term memory (LSTM) units as the temporal attention model to learn an attention score for each frame, focusing more on discriminative frames and less on low-quality ones. The temporal attention model outperforms temporal average pooling and gait energy images (GEI). By combining the three aspects, we achieve state-of-the-art results on several cross-view gait recognition benchmarks.
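
A hedged PyTorch sketch of the sub-center idea follows (our own toy rendering, not the released code; the identity/angle counts and the plain max-over-sub-centers penalty are assumptions for illustration):

    import torch
    import torch.nn as nn

    class AngleCenterLoss(nn.Module):
        """Illustrative angle center loss: one learnable sub-center per
        (identity, view angle); only the largest anchor-to-sub-center
        distance is penalized, tightening intra-subject compactness."""
        def __init__(self, num_ids, num_angles, feat_dim):
            super().__init__()
            self.centers = nn.Parameter(torch.randn(num_ids, num_angles, feat_dim))

        def forward(self, feats, ids):            # feats: (N, D), ids: (N,)
            # distance from each anchor to all sub-centers of its identity
            d = torch.cdist(feats.unsqueeze(1), self.centers[ids]).squeeze(1)
            return d.max(dim=1).values.mean()     # penalize the farthest one

    criterion = AngleCenterLoss(num_ids=74, num_angles=11, feat_dim=256)
    loss = criterion(torch.randn(32, 256), torch.randint(0, 74, (32,)))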

Collective Affinity Learning for Partial Cross-Modal Hashing

In the past decade, various unsupervised hashing methods have been developed for cross-modal retrieval. However, in real-world applications the data are often incomplete, i.e., every modality may suffer from missing samples. Most existing works assume that every object appears in both modalities, so they may not work well on partial multi-modal data. To address this problem, we propose a novel Collective Affinity Learning Method (CALM), which collectively and adaptively learns an anchor graph for generating binary codes on partial multi-modal data. In CALM, we first construct modality-specific bipartite graphs collectively and derive a probabilistic model to infer complete data-to-anchor affinities for each modality. Theoretical analysis reveals its ability to recover missing adjacency information. Moreover, a robust model is proposed to fuse these modality-specific affinities by adaptively learning a unified anchor graph. The neighborhood information from the learned anchor graph then acts as feedback that guides the preceding affinity reconstruction procedure. To solve the formulated optimization problem, we further develop an effective algorithm with linear time complexity and fast convergence. Finally, Anchor Graph Hashing (AGH) is conducted on the fused affinities for cross-modal retrieval. Experimental results on benchmark datasets show that our proposed CALM consistently outperforms existing methods.
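
The modality-specific building block, a data-to-anchor bipartite graph, can be sketched as follows (standard anchor-graph construction in NumPy; the Gaussian kernel, bandwidth, and neighbor count are illustrative conventions, not CALM's learned affinities):

    import numpy as np

    def anchor_affinities(X, anchors, s=3, sigma=1.0):
        """Standard anchor-graph construction (the per-modality building
        block that CALM fuses): Gaussian affinity to the s nearest
        anchors, row-normalized so each row sums to one."""
        d2 = ((X[:, None, :] - anchors[None, :, :]) ** 2).sum(-1)  # (n, m)
        Z = np.zeros_like(d2)
        nn_idx = np.argsort(d2, axis=1)[:, :s]                     # s nearest anchors
        rows = np.arange(X.shape[0])[:, None]
        Z[rows, nn_idx] = np.exp(-d2[rows, nn_idx] / (2 * sigma**2))
        return Z / Z.sum(axis=1, keepdims=True)                    # bipartite graph

    Z = anchor_affinities(np.random.rand(100, 8), np.random.rand(10, 8))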

Deep Tone Mapping Operator for High Dynamic Range Images

A computationally fast tone mapping operator (TMO) that can quickly adapt to a wide spectrum of high dynamic range (HDR) content is quintessential for visualization on varied low dynamic range (LDR) output devices such as movie screens or standard displays. Existing TMOs can successfully tone-map only a limited range of HDR content and require extensive parameter tuning to yield the best subjective-quality output. In this paper, we address this problem by proposing a fast, parameter-free, and scene-adaptable deep tone mapping operator (DeepTMO) that yields a high-resolution, high-subjective-quality tone-mapped output. Based on a conditional generative adversarial network (cGAN), DeepTMO not only learns to adapt to vast scenic content (e.g., outdoor, indoor, human, structures) but also tackles HDR-related scene-specific challenges such as contrast and brightness, while preserving fine-grained details. We explore four possible combinations of generator-discriminator architectural designs to specifically address prominent issues in HDR-related deep-learning frameworks, such as blurring, tiling patterns, and saturation artifacts. After examining the influence of scales, loss functions, and normalization layers under a cGAN setting, we adopt a multi-scale model for our task. To further leverage the large-scale availability of unlabeled HDR data, we train our network by generating targets using an objective HDR quality metric, namely the Tone Mapping Image Quality Index (TMQI). We demonstrate results both quantitatively and qualitatively, and show that DeepTMO generates high-resolution, high-quality output images over a large spectrum of real-world scenes. Finally, we evaluate the perceived quality of our results with a pair-wise subjective study, which confirms the versatility of our method.
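
The unsupervised target-generation step can be sketched as follows (a NumPy caricature under stated assumptions: tmo_bank and tmqi_score are hypothetical stand-ins for existing TMO implementations and a TMQI scorer, and the dummy metric below is not the real TMQI):

    import numpy as np

    def best_tmqi_target(hdr, tmo_bank, tmqi_score):
        """Pick, per HDR image, the candidate LDR output whose TMQI score
        is highest and use it as the cGAN training target."""
        candidates = [tmo(hdr) for tmo in tmo_bank]
        scores = [tmqi_score(hdr, ldr) for ldr in candidates]
        return candidates[int(np.argmax(scores))]

    # toy stand-ins: a gamma TMO and a log TMO, scored by a dummy metric
    tmos = [lambda h: h ** 0.45, lambda h: np.log1p(h) / np.log1p(h.max())]
    dummy_tmqi = lambda h, l: -abs(l.mean() - 0.5)   # placeholder, not real TMQI
    target = best_tmqi_target(np.random.rand(64, 64) * 100, tmos, dummy_tmqi)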

Few-Shot Deep Adversarial Learning for Video-Based Person Re-Identification

Video-based person re-identification (re-ID) refers to matching people across camera views from arbitrary unaligned video footage. Existing methods rely on supervision signals to optimise a projected space under which the distances between inter-/intra-video pairs are maximised/minimised. However, this demands exhaustively labelling people across camera views, which prevents these methods from scaling to large camera networks. Moreover, learning view-invariant video representations is not explicitly addressed, so features from different views exhibit different distributions. Thus, matching videos for person re-ID demands flexible models that capture the dynamics of time-series observations and learn view-invariant representations with access to limited labelled training samples. In this paper, we propose a novel few-shot deep learning approach to video-based person re-ID that learns comparable representations which are discriminative and view-invariant. The proposed method is built on variational recurrent neural networks (VRNNs) and trained adversarially to produce latent variables with temporal dependencies that are highly discriminative yet view-invariant for matching persons. Through extensive experiments on three benchmark datasets, we empirically demonstrate the capability of our method to create view-invariant temporal features and the state-of-the-art performance it achieves.
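
For intuition, a minimal variational recurrent cell is sketched below in PyTorch (a generic VRNN-style cell, not the paper's model; the dimensions and GRU update are assumptions). It shows how the prior and posterior over the latent z both depend on the recurrent state, so the latents carry temporal dependencies:

    import torch
    import torch.nn as nn

    class TinyVRNNCell(nn.Module):
        """Minimal variational recurrent cell in the spirit of a VRNN:
        prior and posterior over z condition on the recurrent state h."""
        def __init__(self, x_dim=128, z_dim=32, h_dim=64):
            super().__init__()
            self.prior = nn.Linear(h_dim, 2 * z_dim)
            self.post = nn.Linear(h_dim + x_dim, 2 * z_dim)
            self.rnn = nn.GRUCell(x_dim + z_dim, h_dim)

        def forward(self, x, h):                         # one frame of a track
            mu_p, logv_p = self.prior(h).chunk(2, -1)
            mu_q, logv_q = self.post(torch.cat([h, x], -1)).chunk(2, -1)
            z = mu_q + torch.randn_like(mu_q) * (0.5 * logv_q).exp()  # reparam.
            h = self.rnn(torch.cat([x, z], -1), h)
            return z, h, (mu_p, logv_p, mu_q, logv_q)    # KL term uses these

    cell = TinyVRNNCell()
    z, h, stats = cell(torch.randn(4, 128), torch.zeros(4, 64))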

Edge-Sensitive Human Cutout With Hierarchical Granularity and Loopy Matting Guidance

Human parsing and matting play important roles in various applications, such as dress collocation, clothing recommendation, and image editing. In this paper, we propose a lightweight hybrid model that unifies the fully-supervised hierarchical-granularity parsing task and the unsupervised matting one. Our model comprises two parts: an extensible hierarchical semantic segmentation block using a CNN, and a matting module composed of guided filters. Given a human image, stage 1 of the segmentation block first obtains a primitive segmentation map that separates the human from the background. The primitive segmentation is then fed into stage 2 together with the original image to give a rough segmentation of the human body. This procedure is repeated in stage 3 to acquire a refined segmentation. The matting module takes the above estimated segmentation maps as input and produces the matting map in a fully unsupervised manner. The obtained matting map is in turn fed back to the CNN in the first block to refine the semantic segmentation results.
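
A possible rendering of the matting module and its feedback loop, assuming the classic gray-scale guided filter (He et al.) and using simple thresholding as a stand-in for re-running the CNN stages:

    import numpy as np
    from scipy.ndimage import uniform_filter

    def guided_filter(I, p, r=8, eps=1e-3):
        """Classic gray-scale guided filter: turns a hard segmentation map
        p into a soft, edge-sensitive matte guided by the image I."""
        mean = lambda x: uniform_filter(x, size=2 * r + 1)
        mI, mp = mean(I), mean(p)
        a = (mean(I * p) - mI * mp) / (mean(I * I) - mI * mI + eps)
        b = mp - a * mI
        return mean(a) * I + mean(b)

    # feedback loop: matte from one pass refines the next segmentation
    img = np.random.rand(128, 128)
    seg = (img > 0.5).astype(float)           # stand-in for the CNN output
    for _ in range(3):                        # the "loopy" matting guidance
        matte = guided_filter(img, seg)
        seg = (matte > 0.5).astype(float)     # stand-in for re-running the CNN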

Perceptual Evaluation for Multi-Exposure Image Fusion of Dynamic Scenes

A common approach to high dynamic range (HDR) imaging is to capture multiple images of different exposures followed by multi-exposure image fusion (MEF) in either the radiance or the intensity domain. A predominant problem of this approach is the introduction of ghosting artifacts in dynamic scenes with camera and object motion. While many MEF methods (often referred to as deghosting algorithms) have been proposed to reduce ghosting artifacts and improve visual quality, little work has been dedicated to the perceptual evaluation of their deghosting results. Here we first construct a database that contains 20 multi-exposure sequences of dynamic scenes and their corresponding fused images produced by nine MEF algorithms. We then carry out a subjective experiment to evaluate fused image quality and find that none of the existing objective quality models for MEF provides accurate quality predictions. Motivated by this, we develop an objective quality model for MEF of dynamic scenes. Specifically, we divide the test image into static and dynamic regions, measure the structural similarity between the image and the corresponding sequence in the two regions separately, and combine the quality measurements of the two regions into an overall quality score. Experimental results show that the proposed method significantly outperforms the state-of-the-art. In addition, we demonstrate the promise of the proposed model for parameter tuning of MEF methods. The subjective database and the MATLAB code of the proposed model are publicly available at https://github.com/h4nwei/MEF-SSIMd.
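
A skeleton of the two-region evaluation might look like the following (NumPy sketch; region_score and the equal-weight blend are assumptions standing in for the MEF structural-similarity measure and the paper's combination rule):

    import numpy as np

    def mef_quality(fused, seq, dyn_mask, region_score, w=0.5):
        """Two-region evaluation skeleton: score static and dynamic pixels
        separately against the exposure sequence, then blend."""
        static = region_score(fused, seq, ~dyn_mask)
        dynamic = region_score(fused, seq, dyn_mask)
        return w * static + (1 - w) * dynamic

    # toy stand-in: agreement with the per-pixel median of the sequence
    score = lambda f, s, m: 1 - np.abs(f - np.median(s, 0))[m].mean()
    seq = np.random.rand(3, 64, 64)                 # three exposures
    fused, mask = seq.mean(0), np.zeros((64, 64), bool)
    mask[20:40, 20:40] = True                       # assumed motion region
    print(mef_quality(fused, seq, mask, score))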

Homologous Component Analysis for Domain Adaptation

Domain adaptation approaches based on the covariate shift assumption usually utilize a single common transformation to align marginal distributions while preserving conditional distributions. However, one common transformation may lose useful information, such as variances and neighborhood relationships, in both the source and target domains. To address this problem, we propose a novel method called homologous component analysis (HCA), in which we seek two different but homologous transformations that align distributions with side information while preserving conditional distributions. As it is hard to find a closed-form solution to the corresponding optimization problem, we solve it by means of the alternating direction method of multipliers (ADMM) on Stiefel manifolds. We also provide a generalization error bound for domain adaptation in the semi-supervised case and show that two transformations can decrease this upper bound more than a single common transformation does. Extensive experiments on synthetic and real data show the effectiveness of the proposed method in classification accuracy compared with state-of-the-art methods, and numerical evidence on the chordal and Frobenius distances shows that the resulting optimal transformations are indeed different.
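
As a rough illustration of "two homologous transformations on Stiefel manifolds", the sketch below aligns projected domain means with naive alternating projected-gradient steps and an SVD retraction (a simplified stand-in for the paper's ADMM solver and side-information objective; every quantity here is a toy assumption):

    import numpy as np

    def stiefel_retract(W):
        """Project a matrix onto the Stiefel manifold (orthonormal
        columns) via its SVD, a standard retraction."""
        U, _, Vt = np.linalg.svd(W, full_matrices=False)
        return U @ Vt

    # toy version of the core idea: two different transformations Ws, Wt
    # (one per domain) chosen so that the projected domain means align
    rng = np.random.default_rng(0)
    Xs, Xt = rng.normal(0, 1, (100, 20)), rng.normal(1, 2, (80, 20))
    Ws = stiefel_retract(rng.normal(size=(20, 5)))
    Wt = stiefel_retract(rng.normal(size=(20, 5)))
    for _ in range(50):                              # naive alternating steps
        g = (Xs @ Ws).mean(0) - (Xt @ Wt).mean(0)    # mean discrepancy
        Ws = stiefel_retract(Ws - 0.1 * np.outer(Xs.mean(0), g))
        Wt = stiefel_retract(Wt + 0.1 * np.outer(Xt.mean(0), g))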

Deep Cascade Model-Based Face Recognition: When Deep-Layered Learning Meets Small Data

Sparse representation-based classification (SRC), nuclear-norm matrix regression (NMR), and deep learning (DL) have achieved great success in face recognition (FR). However, each still has intrinsic limitations. SRC- and NMR-based coding methods are one-step models, so the latent discriminative information in the coding error vector cannot be fully exploited. DL, as a multi-step model, can learn powerful representations, but it relies on large-scale data and computational resources to train numerous parameters with complicated back-propagation. Training deep neural networks from scratch on small-scale data is almost infeasible. Therefore, to develop efficient algorithms specifically adapted to small-scale data, we propose to derive deep models of SRC and NMR. Specifically, in this paper we propose an end-to-end deep cascade model (DCM) based on SRC and NMR with hierarchical learning, nonlinear transformation, and a multi-layer structure for corrupted face recognition. The contributions include four aspects. First, an end-to-end deep cascade model for small-scale data without back-propagation is proposed. Second, a multi-level pyramid structure is integrated for local feature representation. Third, to introduce nonlinear transformation into layer-wise learning, softmax vector coding of the errors with class discrimination is proposed. Fourth, existing representation methods can be easily integrated into our DCM framework. Experiments on a number of small-scale benchmark FR datasets demonstrate the superiority of the proposed model over state-of-the-art counterparts. Additionally, they consolidate the perspective that deep-layered learning does not have to be a convolutional neural network with back-propagation optimization. The demo code is available at https://github.com/liuji93/DCM
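
One layer of such a cascade can be caricatured as follows (a simplified SRC-style coding step plus a softmax vector coding nonlinearity; the plain least-squares code, random dictionary, and temperature beta are illustrative assumptions, not the paper's exact formulation):

    import numpy as np

    def class_errors(x, D, labels):
        """Per-class reconstruction error of sample x under dictionary D
        (columns are training faces), a simplified SRC-style coding step."""
        a, *_ = np.linalg.lstsq(D, x, rcond=None)
        errs = []
        for c in np.unique(labels):
            mask = labels == c
            errs.append(np.linalg.norm(x - D[:, mask] @ a[mask]))
        return np.array(errs)

    def softmax_vector_coding(errs, beta=5.0):
        """Nonlinear layer-wise transform: smaller error -> larger weight;
        the resulting vector feeds the next cascade layer."""
        e = np.exp(-beta * (errs - errs.min()))
        return e / e.sum()

    D = np.random.rand(64, 30)                  # 30 training faces, 3 classes
    labels = np.repeat(np.arange(3), 10)
    p = softmax_vector_coding(class_errors(np.random.rand(64), D, labels))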

Tunable VVC Frame Partitioning Based on Lightweight Machine Learning

The block partition structure is a critical module of a video coding scheme, responsible for a significant share of its compression performance. During the exploration of the future video coding standard, named Versatile Video Coding (VVC), a new Quad Tree Binary Tree (QTBT) block partition structure was introduced. In addition to the QT block partitioning defined in the High Efficiency Video Coding (HEVC) standard, new horizontal and vertical BT partitions are enabled, which drastically increases the encoding time compared to HEVC. In this paper, we propose a lightweight and tunable QTBT partitioning scheme based on a Machine Learning (ML) approach. The proposed solution uses Random Forest classifiers to determine, for each coding block, the most probable partition modes. To minimize the encoding loss induced by misclassification, risk intervals for classifier decisions are introduced. By varying the size of the risk intervals, a tunable trade-off between encoding complexity reduction and coding loss is achieved. Implemented in the JEM-7.0 software, the proposed solution offers encoding complexity reductions ranging from 30% to 70% on average for only 0.7% to 3.0% Bjøntegaard Delta Rate (BD-BR) increase in the Random Access (RA) coding configuration, with very slight overhead induced by the Random Forest classifiers. The proposed solution is also efficient at reducing the complexity of the Multi-Type Tree (MTT) partitioning scheme in the VTM-5.0 software, with complexity reductions ranging from 25% to 61% on average for only 0.4% to 2.2% BD-BR increase.
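
The risk-interval mechanism can be sketched with scikit-learn (toy features and labels; the feature set, interval width, and binary split/no-split decision are illustrative assumptions, whereas the real scheme covers all QTBT/MTT modes):

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    # toy per-block features (e.g., variance, gradients) and split labels
    X = np.random.rand(500, 6)
    y = np.random.randint(0, 2, 500)             # 0: no split, 1: BT split
    clf = RandomForestClassifier(n_estimators=20).fit(X, y)

    def modes_to_test(features, risk=0.2):
        """Risk-interval decision: near-certain blocks take the classifier's
        single choice; uncertain ones fall back to full RDO over both modes.
        Widening `risk` trades coding loss against complexity reduction."""
        p_split = clf.predict_proba(features.reshape(1, -1))[0, 1]
        if abs(p_split - 0.5) < risk:            # inside the risk interval
            return ["no_split", "bt_split"]      # let RDO decide
        return ["bt_split"] if p_split > 0.5 else ["no_split"]

    print(modes_to_test(np.random.rand(6)))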