arXiv Vision Paper Digest, September 22, 2021

BabelCalib: A Universal Approach to Calibrating Central Cameras

Existing calibration methods occasionally fail for large field-of-view cameras due to the non-linearity of the underlying problem and the lack of good initial values for all parameters of the used camera model. This might occur because a simpler projection model is assumed in an initial step, or a poor initial guess for the internal parameters is pre-defined. Many of the difficulties of general camera calibration lie in the use of a forward projection model. We side-step these challenges by first proposing a solver to calibrate the parameters in terms of a back-projection model and then regressing the parameters for a target forward model. These steps are incorporated in a robust estimation framework to cope with outlying detections. Extensive experiments demonstrate that our approach is very reliable and returns the most accurate calibration parameters as measured on the downstream task of absolute pose estimation on test sets. The code is released at

FUTURE-AI: Guiding Principles and Consensus Recommendations for Trustworthy Artificial Intelligence in Future Medical Imaging

The recent advancements in artificial intelligence (AI), combined with the extensive amount of data generated by today’s clinical systems, have led to the development of imaging AI solutions across the whole value chain of medical imaging, including image reconstruction, medical image segmentation, image-based diagnosis and treatment planning. Notwithstanding the successes and future potential of AI in medical imaging, many stakeholders are concerned about the potential risks and ethical implications of imaging AI solutions, which are perceived as complex, opaque, and difficult to comprehend, utilise, and trust in critical clinical applications. Despite these concerns and risks, there are currently no concrete guidelines and best practices for guiding future AI developments in medical imaging towards increased trust, safety and adoption. To bridge this gap, this paper introduces a careful selection of guiding principles drawn from the accumulated experiences, consensus, and best practices of five large European projects on AI in Health Imaging. These guiding principles are named FUTURE-AI, and their building blocks are (i) Fairness, (ii) Universality, (iii) Traceability, (iv) Usability, (v) Robustness and (vi) Explainability. In a step-by-step approach, these guidelines are further translated into a framework of concrete recommendations for specifying, developing, evaluating, and deploying technically, clinically and ethically trustworthy AI solutions into clinical practice.

Advancing Self-supervised Monocular Depth Learning with Sparse LiDAR

Self-supervised monocular depth prediction provides a cost-effective solution to obtain the 3D location of each pixel. However, existing approaches usually yield unsatisfactory accuracy, and accuracy is critical for autonomous robots. In this paper, we propose a novel two-stage network to advance self-supervised monocular dense depth learning by leveraging low-cost sparse (e.g. 4-beam) LiDAR. Unlike existing methods that use sparse LiDAR mainly through time-consuming iterative post-processing, our model fuses monocular image features and sparse LiDAR features to predict initial depth maps. Then, an efficient feed-forward refinement network is designed to correct the errors in these initial depth maps in pseudo-3D space with real-time performance. Extensive experiments show that our proposed model significantly outperforms all state-of-the-art self-supervised methods, as well as sparse-LiDAR-based methods, on both self-supervised monocular depth prediction and completion tasks. With its accurate dense depth prediction, our model outperforms the state-of-the-art sparse-LiDAR-based method (Pseudo-LiDAR++) by more than 68% on the downstream task of monocular 3D object detection on the KITTI Leaderboard.

Superquadric Object Representation for Optimization-based Semantic SLAM

Introducing semantically meaningful objects to visual Simultaneous Localization And Mapping (SLAM) has the potential to improve both the accuracy and reliability of pose estimates, especially in challenging scenarios with significant view-point and appearance changes. However, how semantic objects should be represented for efficient inclusion in optimization-based SLAM frameworks is still an open question. Superquadrics (SQs) are an efficient and compact object representation, able to represent most common object types to a high degree, and are typically retrieved from 3D point-cloud data. However, accurate 3D point-cloud data might not be available in all applications. Recent advancements in machine learning have enabled robust object recognition and semantic mask measurements from camera images under many different appearance conditions. We propose a pipeline to leverage such semantic mask measurements to fit SQ parameters to multi-view camera observations using a multi-stage initialization and optimization procedure. We demonstrate the system’s ability to retrieve randomly generated SQ parameters from multi-view mask observations in preliminary simulation experiments and evaluate different initialization stages and cost functions.
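
For readers unfamiliar with superquadrics, the sketch below evaluates the standard inside-outside function of a superquadric in its own canonical frame. It only illustrates why the representation is compact (three scales and two shape exponents); the abstract does not give the paper's parameterization, pose handling, or cost functions, so the function name and arguments here are hypothetical.

```python
def superquadric_inside_outside(point, scales, eps1, eps2):
    """Standard superquadric inside-outside function in the canonical frame.

    Returns a value < 1 for points inside the shape, 1 on the surface,
    and > 1 outside. `scales` are the semi-axis lengths (a1, a2, a3);
    `eps1` and `eps2` are the shape exponents (both 1 gives an ellipsoid,
    values near 0 give a box-like shape).
    All names are illustrative assumptions, not the paper's API.
    """
    x, y, z = point
    a1, a2, a3 = scales
    # Absolute values keep the fractional powers real for negative coordinates.
    xy_term = (abs(x / a1) ** (2.0 / eps2) + abs(y / a2) ** (2.0 / eps2)) ** (eps2 / eps1)
    z_term = abs(z / a3) ** (2.0 / eps1)
    return xy_term + z_term


# The same point lies outside a unit sphere but inside a box-like superquadric.
corner = (0.9, 0.9, 0.9)
sphere_value = superquadric_inside_outside(corner, (1.0, 1.0, 1.0), 1.0, 1.0)
box_value = superquadric_inside_outside(corner, (1.0, 1.0, 1.0), 0.1, 0.1)
```

A fitting procedure would typically compare such a function, or its projection into the image, against observations; how the paper does this with semantic masks is not stated in the abstract.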

Real-Time Trash Detection for Modern Societies using CCTV to Identifying Trash by utilizing Deep Convolutional Neural Network

This work aims to protect the environment from trash pollution, especially in societies, and to support strict action against people caught red-handed throwing trash. As modern societies develop, they need modern solutions to keep the environment clean. The evolution of artificial intelligence (AI), especially deep learning, gives an excellent opportunity to develop real-time trash detection using CCTV cameras. This project covers real-time trash detection using a deep Convolutional Neural Network (CNN) model. The model detects eight classes: masks, tissue papers, shoppers, boxes, automobile parts, pampers, bottles, and juice boxes. After detecting trash, the camera records a ten-second video of the person who threw it. The challenging part of this work was preparing a complex custom dataset, which was very time-consuming. The dataset consists of more than 2100 images. The dataset was labeled, and the CNN model was created and trained. Detection time, accuracy, and mean average precision (mAP) are used to benchmark both models’ performance. In the experimental phase, the improved CNN model was superior in both mAP and accuracy. The model is used on a CCTV camera to detect trash in real time.

R2D: Learning Shadow Removal to Enhance Fine-Context Shadow Detection

Current shadow detection methods perform poorly when detecting shadow regions that are small, unclear or have blurry edges. To tackle this problem, we propose a new method called Restore to Detect (R2D), where we show that when a deep neural network is trained for restoration (shadow removal), it learns meaningful features to delineate the shadow masks as well. To make use of this complementary nature of the shadow detection and removal tasks, we train an auxiliary network for shadow removal and propose a complementary feature learning block (CFL) to learn meaningful features from the shadow removal network and fuse them into the shadow detection network. For the detection network in R2D, we propose a Fine Context-aware Shadow Detection Network (FCSD-Net), where we constrain the receptive field size and focus on low-level features to learn fine context features better. Experimental results on three public shadow detection datasets (ISTD, SBU and UCF) show that our proposed method R2D improves shadow detection performance while detecting fine context better than other recent methods.

Parameter Decoupling Strategy for Semi-supervised 3D Left Atrium Segmentation

Consistency training has proven to be an advanced semi-supervised framework and has achieved promising results in medical image segmentation tasks by enforcing invariance of the predictions over different views of the inputs. However, with the iterative updating of model parameters, the models tend to reach a coupled state and eventually lose the ability to exploit unlabeled data. To address this issue, we present a novel semi-supervised segmentation model based on a parameter decoupling strategy to encourage consistent predictions from diverse views. Specifically, we first adopt a two-branch network to simultaneously produce predictions for each image. During the training process, we decouple the parameters of the two prediction branches by quadratic cosine distance to construct different views in latent space. Based on this, the feature extractor is constrained to encourage the consistency of probability maps generated by classifiers under diversified features. In the overall training process, the parameters of the feature extractor and classifiers are updated alternately by a consistency regularization operation and a decoupling operation to gradually improve the generalization performance of the model. Our method achieves competitive results compared with state-of-the-art semi-supervised methods on the Atrial Segmentation Challenge dataset, demonstrating the effectiveness of our framework. Code is available at
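
The abstract does not define "quadratic cosine distance". One plausible reading, assumed here purely for illustration, is the squared cosine similarity between the flattened parameter vectors of the two branches, used as a penalty that is smallest when the branches are orthogonal. The function below is a hypothetical sketch of that reading and not the authors' loss.

```python
import math


def squared_cosine_penalty(params_a, params_b, eps=1e-12):
    """Squared cosine similarity between two flattened parameter vectors.

    Assumed interpretation of a decoupling penalty: the value is near 1 when
    the two branches are parallel or anti-parallel (coupled) and 0 when they
    are orthogonal (decoupled). `eps` guards against division by zero.
    """
    dot = sum(a * b for a, b in zip(params_a, params_b))
    norm_a = math.sqrt(sum(a * a for a in params_a))
    norm_b = math.sqrt(sum(b * b for b in params_b))
    cosine = dot / (norm_a * norm_b + eps)
    return cosine * cosine


# Toy weight vectors standing in for the two branch classifiers.
coupled = squared_cosine_penalty([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
decoupled = squared_cosine_penalty([1.0, 0.0], [0.0, 1.0])
```

Squaring the cosine penalizes anti-parallel weights as much as parallel ones, which is one reason a squared form might be preferred over the plain cosine; whether the paper argues this way is not stated in the abstract.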

Audio-Visual Speech Recognition is Worth 32$\times$32$\times$8 Voxels

Audio-visual automatic speech recognition (AV-ASR) introduces the video modality into the speech recognition process, often by relying on information conveyed by the motion of the speaker’s mouth. The use of the video signal requires extracting visual features, which are then combined with the acoustic features to build an AV-ASR system [1]. This is traditionally done with some form of 3D convolutional network (e.g. VGG) as widely used in the computer vision community. Recently, image transformers [2] have been introduced to extract visual features useful for image classification tasks. In this work, we propose to replace the 3D convolutional visual front-end with a video transformer front-end. We train our systems on a large-scale dataset composed of YouTube videos and evaluate performance on the publicly available LRS3-TED set, as well as on a large set of YouTube videos. On a lip-reading task, the transformer-based front-end shows superior performance compared to a strong convolutional baseline. On an AV-ASR task, the transformer front-end performs as well as (or better than) the convolutional baseline. Fine-tuning our model on the LRS3-TED training set matches previous state of the art. Thus, we experimentally show the viability of the convolution-free model for AV-ASR.
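
To make the title concrete, the sketch below counts how many 32×32×8 video patches (tubelets) tile a clip, assuming 32×32 is the spatial size and 8 the number of frames per patch. The clip dimensions in the example are hypothetical; the abstract does not state the input resolution or how patches are embedded.

```python
def tubelet_grid(frames, height, width, patch=(8, 32, 32)):
    """Return the (time, rows, cols) grid of non-overlapping tubelets.

    `patch` is (frames, height, width) per tubelet; the default follows the
    32x32x8 figure in the title under the assumed axis order. Raises if the
    clip is not evenly divisible, since no padding rule is given in the source.
    """
    pt, ph, pw = patch
    if frames % pt or height % ph or width % pw:
        raise ValueError("clip dimensions must be multiples of the patch size")
    return frames // pt, height // ph, width // pw


# Hypothetical clip: 16 frames of 128x128 pixels.
grid = tubelet_grid(16, 128, 128)
num_tokens = grid[0] * grid[1] * grid[2]
voxels_per_token = 8 * 32 * 32
```

Each tubelet would then be flattened and linearly projected to form one transformer input token, as in image transformers; that projection step is omitted here.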

Modeling Annotation Uncertainty with Gaussian Heatmaps in Landmark Localization

In landmark localization, ambiguities in defining the exact position of a landmark may cause large observer variability, which results in uncertain annotations. To model the annotation ambiguities of the training dataset, we propose to learn anisotropic Gaussian parameters modeling the shape of the target heatmap during optimization. Furthermore, our method models the prediction uncertainty of individual samples by fitting anisotropic Gaussian functions to the predicted heatmaps during inference. Besides state-of-the-art results, our experiments on datasets of hand radiographs and lateral cephalograms also show that Gaussian functions are correlated with both localization accuracy and observer variability. As a final experiment, we show the importance of integrating the uncertainty into decision making by measuring the influence of the predicted location uncertainty on the classification of anatomical abnormalities in lateral cephalograms.
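
As an illustration of what an anisotropic Gaussian target heatmap looks like, the sketch below renders one from a center, two standard deviations, and a rotation angle. This parameterization and the function name are assumptions made for the example; the abstract does not specify how the paper parameterizes or learns the Gaussian.

```python
import math


def anisotropic_gaussian_heatmap(width, height, center, sigmas, theta):
    """Render an unnormalized anisotropic 2D Gaussian as a list of rows.

    `center` is (cx, cy) in pixels, `sigmas` is (sigma_u, sigma_v) along the
    Gaussian's own axes, and `theta` rotates those axes (radians).
    The peak value at the center is 1.0.
    """
    cx, cy = center
    sigma_u, sigma_v = sigmas
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    heatmap = []
    for y in range(height):
        row = []
        for x in range(width):
            dx, dy = x - cx, y - cy
            # Rotate the offset into the Gaussian's principal-axis frame.
            u = cos_t * dx + sin_t * dy
            v = -sin_t * dx + cos_t * dy
            row.append(math.exp(-0.5 * ((u / sigma_u) ** 2 + (v / sigma_v) ** 2)))
        heatmap.append(row)
    return heatmap


# Wide along x, narrow along y: annotation uncertainty is larger horizontally.
wide_x = anisotropic_gaussian_heatmap(9, 9, (4, 4), (3.0, 1.0), 0.0)
# Rotating by 90 degrees moves the wide axis to y.
wide_y = anisotropic_gaussian_heatmap(9, 9, (4, 4), (3.0, 1.0), math.pi / 2)
```

A larger sigma along one axis spreads the target over more pixels in that direction, which is how a heatmap can encode direction-dependent annotation uncertainty.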

Dyadformer: A Multi-modal Transformer for Long-Range Modeling of Dyadic Interactions

Personality computing has become an emerging topic in computer vision, due to its wide range of applications. However, most works on the topic have focused on analyzing the individual, even when applied to interaction scenarios, and for short periods of time. To address these limitations, we present the Dyadformer, a novel multi-modal multi-subject Transformer architecture to model individual and interpersonal features in dyadic interactions using variable time windows, thus allowing the capture of long-term interdependencies. Our proposed cross-subject layer allows the network to explicitly model interactions among subjects through attentional operations. This proof-of-concept approach shows how multi-modality and joint modeling of both interactants for longer periods of time help to predict individual attributes. With Dyadformer, we improve state-of-the-art self-reported personality inference results on individual subjects on the UDIVA v0.5 dataset.