Successful training of deep neural networks with noisy labels is an essential capability as most real-world datasets contain some amount of mislabeled data. Left unmitigated, label noise can sharply degrade typical supervised learning approaches. In this paper, we present robust temporal ensembling (RTE), which combines robust loss with semi-supervised regularization methods to achieve noise-robust learning. We demonstrate that RTE achieves state-of-the-art performance across the CIFAR-10, CIFAR-100, ImageNet, WebVision, and Food-101N datasets, while forgoing the recent trend of label filtering and/or fixing. Finally, we show that RTE also retains competitive corruption robustness to unforeseen input noise using CIFAR-10-C, obtaining a mean corruption error (mCE) of 13.50% even in the presence of an 80% noise ratio, versus 26.9% mCE with standard methods on clean data.
In recent years, facial makeup transfer has attracted growing attention due to its efficiency and flexibility in transferring makeup styles between different faces. Although recent works have achieved realistic results, most of them fail to handle heavy makeup styles with multiple colors and subtle details. Hence we propose a novel GAN model to handle heavy makeup transfer, while maintaining the robustness to different poses and expressions. Firstly, a Makeup Multi-Extraction Network is introduced to learn region-wise makeup features from multiple layers. Then, a key transferring module called Detailed Region-Adaptive Normalization is proposed to fuse different levels of makeup styles in an adaptive way, making great improvement to the quality of heavy makeup transfer. With the outputs from the two components, Makeup Transfer Network is used to perform makeup transfer. To evaluate the efficacy of our proposed method, we collected a new makeup dataset containing a wide range of heavy styles. Experiments show that our method achieves state-of-the-art results both on light and heavy makeup styles, and is robust to different poses and expressions.
We propose Generative Probabilistic Image Colorization, a diffusion-based generative process that trains a sequence of probabilistic models to reverse each step of noise corruption. Given a line-drawing image as input, our method suggests multiple candidate colorized images. Therefore, our method accounts for the ill-posed nature of the colorization problem. We conducted comprehensive experiments investigating the colorization of line-drawing images, report the influence of a score-based MCMC approach that corrects the marginal distribution of estimated samples, and further compare different combinations of models and the similarity of their generated images. Despite using only a relatively small training dataset, we experimentally develop a method to generate multiple diverse colorization candidates which avoids mode collapse and does not require any additional constraints, losses, or re-training with alternative training conditions. Our proposed approach performed well not only on color-conditional image generation tasks using biased initial values, but also on some practical image completion and inpainting tasks.
Most recent methods used for crowd counting are based on the convolutional neural network (CNN), which has a strong ability to extract local features. But CNN inherently fails in modeling the global context due to the limited receptive fields. However, the transformer can model the global context easily. In this paper, we propose a simple approach called CCTrans to simplify the design pipeline. Specifically, we utilize a pyramid vision transformer backbone to capture the global crowd information, a pyramid feature aggregation (PFA) model to combine low-level and high-level features, an efficient regression head with multi-scale dilated convolution (MDC) to predict density maps. Besides, we tailor the loss functions for our pipeline. Without bells and whistles, extensive experiments demonstrate that our method achieves new state-of-the-art results on several benchmarks both in weakly and fully-supervised crowd counting. Moreover, we currently rank No.1 on the leaderboard of NWPU-Crowd. Our code will be made available.
Modulating the spectral dimension of light has numerous applications in computational imaging. While there are many techniques for achieving this, there are few, if any, for implementing a spatially-varying and programmable spectral filter. This paper provides an optical design for implementing such a capability. Our key insight is that spatially-varying spectral modulation can be implemented using a liquid crystal spatial light modulator since it provides an array of liquid crystal cells, each of which can be purposed to act as a programmable spectral filter array. Relying on this insight, we provide an optical schematic and an associated lab prototype for realizing the capability, as well as address the associated challenges at implementation using optical and computational innovations. We show a number of unique operating points with our prototype including single- and multi-image hyperspectral imaging, as well as its application in material identification.
A deep hashing model typically has two main learning objectives: to make the learned binary hash codes discriminative and to minimize a quantization error. With further constraints such as bit balance and code orthogonality, it is not uncommon for existing models to employ a large number (>4) of losses. This leads to difficulties in model training and subsequently impedes their effectiveness. In this work, we propose a novel deep hashing model with only a single learning objective. Specifically, we show that maximizing the cosine similarity between the continuous codes and their corresponding binary orthogonal codes can ensure both hash code discriminativeness and quantization error minimization. Further, with this learning objective, code balancing can be achieved by simply using a Batch Normalization (BN) layer and multi-label classification is also straightforward with label smoothing. The result is an one-loss deep hashing model that removes all the hassles of tuning the weights of various losses. Importantly, extensive experiments show that our model is highly effective, outperforming the state-of-the-art multi-loss hashing models on three large-scale instance retrieval benchmarks, often by significant margins. Code is available at
Coordinate-based Multilayer Perceptron (MLP) networks, despite being capable of learning neural implicit representations, are not performant for internal image synthesis applications. Convolutional Neural Networks (CNNs) are typically used instead for a variety of internal generative tasks, at the cost of a larger model. We propose Neural Knitwork, an architecture for neural implicit representation learning of natural images that achieves image synthesis by optimizing the distribution of image patches in an adversarial manner and by enforcing consistency between the patch predictions. To the best of our knowledge, this is the first implementation of a coordinate-based MLP tailored for synthesis tasks such as image inpainting, super-resolution, and denoising. We demonstrate the utility of the proposed technique by training on these three tasks. The results show that modeling natural images using patches, rather than pixels, produces results of higher fidelity. The resulting model requires 80% fewer parameters than alternative CNN-based solutions while achieving comparable performance and training time.
Vision transformers have become one of the most important models for computer vision tasks. While they outperform earlier convolutional networks, the complexity quadratic to $N$ is one of the major drawbacks when using traditional self-attention algorithms. Here we propose the UFO-ViT(Unit Force Operated Vision Trnasformer), novel method to reduce the computations of self-attention by eliminating some non-linearity. Modifying few of lines from self-attention, UFO-ViT achieves linear complexity without the degradation of performance. The proposed models outperform most transformer-based models on image classification and dense prediction tasks through most capacity regime.
The infrared small-dim target detection is one of the key techniques in the infrared search and tracking system. Since the local regions which similar to infrared small-dim targets spread over the whole background, exploring the interaction information amongst image features in large-range dependencies to mine the difference between the target and background is crucial for robust detection. However, existing deep learning-based methods are limited by the locality of convolutional neural networks, which impairs the ability to capture large-range dependencies. To this end, we propose a new infrared small-dim target detection method with the transformer. We adopt the self-attention mechanism of the transformer to learn the interaction information of image features in a larger range. Additionally, we design a feature enhancement module to learn more features of small-dim targets. After that, we adopt a decoder with the U-Net-like skip connection operation to get the detection result. Extensive experiments on two public datasets show the obvious superiority of the proposed method over state-of-the-art methods.
This paper proposes a fusion method of modalities extracted from video through a three-stream network with spatio-temporal and temporal convolutions for fine-grained action classification in sport. It is applied to TTStroke-21 dataset which consists of untrimmed videos of table tennis games. The goal is to detect and classify table tennis strokes in the videos, the first step of a bigger scheme aiming at giving feedback to the players for improving their performance. The three modalities are raw RGB data, the computed optical flow and the estimated pose of the player. The network consists of three branches with attention blocks. Features are fused at the latest stage of the network using bilinear layers. Compared to previous approaches, the use of three modalities allows faster convergence and better performances on both tasks: classification of strokes with known temporal boundaries and joint segmentation and classification. The pose is also further investigated in order to offer richer feedback to the athletes.