# arXiv Computer Vision Paper Digest: September 23, 2021

Homography augmented momentum contrastive learning for SAR image retrieval

Deep learning-based image retrieval has been emphasized in computer vision. Representation embeddings extracted by deep neural networks (DNNs) not only aim to capture the semantic information of an image but can also support large-scale image retrieval tasks. In this work, we propose a deep learning-based image retrieval approach that uses homography transformation augmented contrastive learning to perform large-scale synthetic aperture radar (SAR) image search tasks. Moreover, we propose a training method for the DNNs, induced by contrastive learning, that does not require any labeling procedure. This may make large-scale datasets tractable with relative ease. Finally, we verify the performance of the proposed method by conducting experiments on polarimetric SAR image datasets.
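
To illustrate the augmentation named in the abstract above: a homography is a 3×3 projective transform applied to pixel coordinates in homogeneous form. The following is a minimal sketch of warping points under such a transform. The function names and matrix values are hypothetical, and the abstract does not say how the paper samples its homographies.

```python
def apply_homography(H, point):
    """Map a 2D point through a 3x3 homography H (nested lists)."""
    x, y = point
    u = H[0][0] * x + H[0][1] * y + H[0][2]
    v = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    # Divide by the homogeneous coordinate to return to pixel coordinates.
    return (u / w, v / w)


def warp_corners(H, width, height):
    """Warp the four corner pixels of a width-by-height image."""
    corners = [(0, 0), (width - 1, 0), (width - 1, height - 1), (0, height - 1)]
    return [apply_homography(H, c) for c in corners]


# Illustrative matrix with a small perspective term in the last row.
H_example = [[1.0, 0.0, 0.0],
             [0.0, 1.0, 0.0],
             [0.5, 0.0, 1.0]]
warped = warp_corners(H_example, 4, 4)
```

In a contrastive setup, an image and its warped copy would typically be treated as a positive pair, which is why no labels are needed.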

CondNet: Conditional Classifier for Scene Segmentation

The fully convolutional network (FCN) has achieved tremendous success in dense visual recognition tasks, such as scene segmentation. The last layer of an FCN is typically a global classifier (a 1×1 convolution) that assigns each pixel a semantic label. We empirically show that this global classifier, which ignores intra-class distinctions, may lead to sub-optimal results.
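
The "global classifier" described above can be read as one linear classifier shared by every pixel, which is what a 1×1 convolution computes. The following is a minimal pure-Python sketch of that reading. The function name and the toy feature values are hypothetical and not taken from the paper.

```python
def classify_pixels(features, weights, biases):
    """Apply one shared linear classifier (a 1x1 convolution) at every pixel.

    features: H x W x C nested lists; weights: K x C; biases: length K.
    Returns an H x W grid of predicted class indices.
    """
    labels = []
    for row in features:
        label_row = []
        for pixel in row:
            # The same K weight vectors score every pixel in the image.
            scores = [
                sum(w * x for w, x in zip(class_weights, pixel)) + b
                for class_weights, b in zip(weights, biases)
            ]
            label_row.append(max(range(len(scores)), key=scores.__getitem__))
        labels.append(label_row)
    return labels


# Toy 2x2 feature map with 2 channels and 2 classes.
features = [[[1.0, 0.0], [0.0, 1.0]],
            [[0.9, 0.2], [0.1, 0.8]]]
weights = [[1.0, -1.0], [-1.0, 1.0]]
biases = [0.0, 0.0]
labels = classify_pixels(features, weights, biases)
```

Because the weights do not depend on the input image, two pixels of the same class that look different are still scored by the same template, which is the limitation the abstract points to.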

SemCal: Semantic LiDAR-Camera Calibration using Neural Mutual Information Estimator

This paper proposes SemCal: an automatic, targetless, extrinsic calibration algorithm for a LiDAR and camera system using semantic information. We leverage a neural information estimator to estimate the mutual information (MI) of semantic information extracted from each sensor measurement, facilitating semantic-level data association. By using a matrix exponential formulation of the $se(3)$ transformation and a kernel-based sampling method to sample from the camera measurement based on projected LiDAR points, we can formulate the LiDAR-camera calibration problem as a novel differentiable objective function that supports gradient-based optimization methods. We also introduce a semantic-based initial calibration method using 2D MI-based image registration and a Perspective-n-Point (PnP) solver. To evaluate performance, we demonstrate the robustness of our method and quantitatively analyze its accuracy using a synthetic dataset. We also evaluate our algorithm qualitatively on an urban benchmark dataset (KITTI360) and an off-road benchmark dataset (RELLIS-3D), using both hand-annotated ground truth labels and labels predicted by state-of-the-art deep learning models, showing improvement over recent comparable calibration approaches.
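
For readers unfamiliar with the matrix exponential formulation mentioned above, here is a minimal sketch of the standard closed-form exponential map from a twist in $se(3)$ to a rotation and translation. It is a textbook formula, not the paper's code. The helper names are hypothetical, and the paper's exact parameterization may differ.

```python
import math


def hat(w):
    """Skew-symmetric matrix of a 3-vector."""
    x, y, z = w
    return [[0.0, -z, y], [z, 0.0, -x], [-y, x, 0.0]]


def matmul3(a, b):
    """Product of two 3x3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def se3_exp(omega, v):
    """Map a twist (omega, v) to a rotation matrix R and a translation t."""
    theta = math.sqrt(sum(c * c for c in omega))
    K = hat(omega)
    K2 = matmul3(K, K)
    if theta < 1e-8:
        # Leading terms of the series expansions near theta = 0.
        a, b, c = 1.0, 0.5, 1.0 / 6.0
    else:
        a = math.sin(theta) / theta
        b = (1.0 - math.cos(theta)) / theta ** 2
        c = (theta - math.sin(theta)) / theta ** 3
    eye = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
    # Rodrigues formula for R, and the matching matrix V for the translation.
    R = [[eye[i][j] + a * K[i][j] + b * K2[i][j] for j in range(3)]
         for i in range(3)]
    V = [[eye[i][j] + b * K[i][j] + c * K2[i][j] for j in range(3)]
         for i in range(3)]
    t = [sum(V[i][j] * v[j] for j in range(3)) for i in range(3)]
    return R, t


# A quarter turn about the z axis with no translation component.
R, t = se3_exp((0.0, 0.0, math.pi / 2), (0.0, 0.0, 0.0))
```

Because every step is a smooth function of the six twist parameters, an objective built on top of it can be optimized with gradient-based methods, which matches the motivation given in the abstract.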

Skeleton-Graph: Long-Term 3D Motion Prediction From 2D Observations Using Deep Spatio-Temporal Graph CNNs

Several applications, such as autonomous driving, augmented reality, and virtual reality, require precise prediction of the 3D human pose. Recently, a new problem was introduced in the field: predicting 3D human poses from observed 2D poses. We propose Skeleton-Graph, a deep spatio-temporal graph CNN model that predicts future 3D skeleton poses in a single pass from the 2D ones. Unlike prior works, Skeleton-Graph focuses on modeling the interaction between the skeleton joints by exploiting their spatial configuration. This is achieved by formulating the problem as a graph structure while learning a suitable graph adjacency kernel. By design, Skeleton-Graph predicts future 3D poses without diverging over the long term, unlike prior works. We also introduce a new metric that measures the divergence of predictions over the long term. Our results show an FDE improvement of at least 27% and an ADE improvement of 4% on both the GTA-IM and PROX datasets in comparison with prior works. In addition, our long-term motion predictions show 88% and 93% less divergence than prior works on the GTA-IM and PROX datasets, respectively.
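
ADE and FDE above are commonly read as the average and final displacement errors. The following is a minimal sketch of one usual way to compute them for skeleton sequences, assuming the per-frame error is the mean Euclidean distance over joints. The paper's exact definitions, and its new divergence metric, may differ and are not reproduced here.

```python
import math


def ade_fde(predicted, target):
    """Return (ADE, FDE) for two pose sequences.

    predicted, target: T frames, each a list of J joints given as (x, y, z).
    """
    per_frame = []
    for pred_frame, true_frame in zip(predicted, target):
        # Mean joint-wise Euclidean distance for this frame.
        dists = [math.dist(p, q) for p, q in zip(pred_frame, true_frame)]
        per_frame.append(sum(dists) / len(dists))
    ade = sum(per_frame) / len(per_frame)  # averaged over all frames
    fde = per_frame[-1]                    # error at the final frame only
    return ade, fde


# Toy example: one joint, two frames.
target = [[(0.0, 0.0, 0.0)], [(0.0, 0.0, 0.0)]]
predicted = [[(1.0, 0.0, 0.0)], [(0.0, 3.0, 4.0)]]
ade, fde = ade_fde(predicted, target)
```

A prediction that drifts over time shows up as an FDE much larger than the ADE, as in the toy example.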

Oriented Object Detection in Aerial Images Based on Area Ratio of Parallelogram

Rotated object detection in aerial images is challenging because objects appear in arbitrary orientations and are usually densely packed. Although considerable progress has been made, existing regression-based rotation detectors still suffer from discontinuous boundaries, a problem caused directly by angular periodicity or corner ordering. In this paper, we propose a simple and effective framework to address these challenges. Instead of directly regressing the five parameters (coordinates of the central point, width, height, and rotation angle) or the four vertices, we use the area ratio of parallelogram (ARP) to accurately describe a multi-oriented object. Specifically, we regress the coordinates of the center point, the height and width of the minimum circumscribed rectangle of the oriented object, and three area ratios $\lambda_1$, $\lambda_2$, and $\lambda_3$. This may facilitate offset learning and avoid the issues of angular periodicity and label point ordering for oriented objects. To further remedy the confusion issue for nearly horizontal objects, we employ the area ratio between the object and its horizontal bounding box (minimum circumscribed rectangle) to guide the selection of horizontal or oriented detection for each object. We also propose a rotation efficient IoU loss (R-EIoU) to connect the horizontal bounding box with the three area ratios and improve the accuracy of the rotated bounding box. Experimental results on three remote sensing datasets (HRSC2016, DOTA, and UCAS-AOD) and one scene text dataset (ICDAR2015) show that our method achieves superior detection performance compared with many state-of-the-art approaches. The code and model will be released when the paper is published.
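
The area ratio between an object and its horizontal bounding box, mentioned above, can be illustrated with a short sketch. It converts the five-parameter form to four vertices and divides the polygon area by the area of the axis-aligned box around it. The function names are hypothetical. The three ratios $\lambda_1$, $\lambda_2$, and $\lambda_3$ are not defined in the abstract and are not reproduced here.

```python
import math


def rotated_box_vertices(cx, cy, w, h, angle):
    """Four corners of a w-by-h box centred at (cx, cy), rotated by angle (radians)."""
    c, s = math.cos(angle), math.sin(angle)
    corners = [(-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2)]
    return [(cx + x * c - y * s, cy + x * s + y * c) for x, y in corners]


def polygon_area(points):
    """Area of a simple polygon via the shoelace formula."""
    total = 0.0
    for i in range(len(points)):
        x1, y1 = points[i]
        x2, y2 = points[(i + 1) % len(points)]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def horizontal_area_ratio(points):
    """Polygon area divided by the area of its axis-aligned bounding box."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    hbb_area = (max(xs) - min(xs)) * (max(ys) - min(ys))
    return polygon_area(points) / hbb_area


upright = horizontal_area_ratio(rotated_box_vertices(0.0, 0.0, 4.0, 2.0, 0.0))
tilted = horizontal_area_ratio(rotated_box_vertices(0.0, 0.0, 2.0, 2.0, math.pi / 4))
```

A ratio close to 1 indicates a nearly horizontal object, so a threshold on this value is one plausible way to choose between horizontal and oriented detection. The abstract does not give the paper's actual selection rule.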

3D Point Cloud Completion with Geometric-Aware Adversarial Augmentation

KDFNet: Learning Keypoint Distance Field for 6D Object Pose Estimation

We present KDFNet, a novel method for 6D object pose estimation from RGB images. To handle occlusion, many recent works have proposed to localize 2D keypoints through pixel-wise voting and solve a Perspective-n-Point (PnP) problem for pose estimation, which achieves leading performance. However, such a voting process is direction-based and cannot handle long and thin objects, where the direction intersections cannot be robustly found. To address this problem, we propose a novel continuous representation called Keypoint Distance Field (KDF) for projected 2D keypoint locations. Formulated as a 2D array, each element of the KDF stores the 2D Euclidean distance between the corresponding image pixel and a specified projected 2D keypoint. We use a fully convolutional neural network to regress the KDF for each keypoint. Using this KDF encoding of projected object keypoint locations, we propose a distance-based voting scheme that localizes the keypoints by calculating circle intersections in a RANSAC fashion. We validate the design choices of our framework through extensive ablation experiments. Our proposed method achieves state-of-the-art performance on the Occlusion LINEMOD dataset, with an average ADD(-S) accuracy of 50.3%, and on the mug subset of the TOD dataset, with an average ADD accuracy of 75.72%. Extensive experiments and visualizations demonstrate that the proposed method is able to robustly estimate the 6D pose in challenging scenarios including occlusion.
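
The KDF representation and the circle-intersection step described above can be sketched in a few lines. This is an illustration with a ground-truth distance field, not the paper's network or its full RANSAC loop. The function names are hypothetical.

```python
import math


def keypoint_distance_field(height, width, keypoint):
    """2D array whose element [y][x] is the distance from pixel (x, y) to the keypoint."""
    kx, ky = keypoint
    return [[math.hypot(x - kx, y - ky) for x in range(width)]
            for y in range(height)]


def circle_intersections(c0, r0, c1, r1):
    """Intersection points of two circles, or an empty list if there are none."""
    (x0, y0), (x1, y1) = c0, c1
    d = math.hypot(x1 - x0, y1 - y0)
    if d == 0 or d > r0 + r1 or d < abs(r0 - r1):
        return []
    # Distance from c0 to the chord joining the two intersection points.
    a = (r0 ** 2 - r1 ** 2 + d ** 2) / (2 * d)
    half_chord = math.sqrt(max(r0 ** 2 - a ** 2, 0.0))
    mx = x0 + a * (x1 - x0) / d
    my = y0 + a * (y1 - y0) / d
    ox = half_chord * (y1 - y0) / d
    oy = half_chord * (x1 - x0) / d
    return [(mx + ox, my - oy), (mx - ox, my + oy)]


# Two pixels and their stored distances yield two keypoint hypotheses.
kdf = keypoint_distance_field(8, 8, (3, 4))
candidates = circle_intersections((0, 0), kdf[0][0], (6, 0), kdf[0][6])
```

Each pixel pair produces up to two hypotheses, so a RANSAC-style procedure would score hypotheses against the distances stored at other pixels and keep the best-supported one.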

Survey on Semantic Stereo Matching / Semantic Depth Estimation

Stereo matching is one of the most widely used techniques for inferring depth from stereo images, owing to its robustness and speed. It has become a major research topic because it finds applications in autonomous driving, robotic navigation, 3D reconstruction, and many other fields. Finding pixel correspondences in non-textured, occluded, and reflective areas is the major challenge in stereo matching. Recent developments have shown that semantic cues from image segmentation can be used to improve the results of stereo matching. Many deep neural network architectures have been proposed to leverage the advantages of semantic segmentation in stereo matching. This paper aims to compare the state-of-the-art networks in terms of both accuracy and speed, which are of higher importance in real-time applications.

StereOBJ-1M: Large-scale Stereo Image Dataset for 6D Object Pose Estimation

We present a large-scale stereo RGB image object pose estimation dataset named StereOBJ-1M. The dataset is designed to address challenging cases such as object transparency, translucency, and specular reflection, in addition to the common challenges of occlusion, symmetry, and variations in illumination and environments. In order to collect data of sufficient scale for modern deep learning models, we propose a novel method for efficiently annotating pose data in a multi-view fashion that allows data capture in complex and flexible environments. Fully annotated with 6D object poses, our dataset contains over 396K frames and over 1.5M annotations of 18 objects recorded in 183 scenes constructed in 11 different environments. The 18 objects include 8 symmetric objects, 7 transparent objects, and 8 reflective objects. We benchmark two state-of-the-art pose estimation frameworks on StereOBJ-1M as baselines for future work. We also propose a novel object-level pose optimization method for computing 6D pose from keypoint predictions in multiple images.

Bayesian Confidence Calibration for Epistemic Uncertainty Modelling

Modern neural networks have been found to be miscalibrated, i.e., their predicted confidence scores do not reflect the observed accuracy or precision. Recent work has introduced methods for post-hoc confidence calibration for classification as well as for object detection to address this issue. Especially in safety-critical applications, it is crucial to obtain a reliable self-assessment of a model. But what if the calibration method itself is uncertain, e.g., due to an insufficient knowledge base?
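
One common way to quantify the gap between predicted confidence and observed accuracy is the expected calibration error (ECE). The following is a minimal sketch with equal-width bins. Whether the paper reports this particular metric is not stated in the excerpt above, and the function name and bin count are assumptions.

```python
def expected_calibration_error(confidences, correct, num_bins=10):
    """Weighted average of |mean confidence - accuracy| over confidence bins.

    confidences: predicted confidence per sample, each in [0, 1].
    correct: 1 if the prediction was right, 0 otherwise.
    """
    bins = [[] for _ in range(num_bins)]
    for conf, hit in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(conf * num_bins), num_bins - 1)
        bins[idx].append((conf, hit))
    n = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += len(members) / n * abs(avg_conf - accuracy)
    return ece


# Overconfident toy model: 90% confidence but only half the predictions are right.
overconfident = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 0, 0])
```

With few samples per bin the estimate itself becomes noisy, which relates to the question the abstract raises about uncertainty in the calibration method.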