Seminar No. 6

Akihiro Sugimoto
Deeply Supervised 3D Recurrent FCN for Salient Object Detection in Videos
National Institute of Informatics (NII), Japan
Monday 2017-09-25 at 11:00
CIIRC Seminar Room B-670, floor 6 of Building B

Abstract
This talk presents a novel end-to-end 3D fully convolutional network for salient object detection in videos. The proposed network uses 3D filters in the spatio-temporal domain to directly learn both spatial and temporal information, producing 3D deep features, and transfers these features to pixel-level saliency prediction, outputting saliency voxels. In the network, we combine per-layer refinement with deep supervision to detect salient object boundaries efficiently and accurately. The refinement module recurrently enhances the feature map by incorporating contextual information. Applying deeply-supervised learning to the hidden layers, on the other hand, improves the detail of the intermediate saliency voxels, so that the saliency voxel is progressively refined. Extensive experiments on publicly available benchmark datasets confirm that our network outperforms state-of-the-art methods. The proposed saliency model also works effectively for video object segmentation.
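
As a rough illustration of what "3D filters in the spatio-temporal domain" means, the sketch below (in PyTorch; the block and its sizes are hypothetical, not the authors' architecture) convolves jointly over frames, height, and width, so each output voxel aggregates information across both space and time:

    import torch
    import torch.nn as nn

    class SpatioTemporalBlock(nn.Module):
        """Illustrative 3D convolution block: the kernel spans time as well as space."""
        def __init__(self, in_ch, out_ch):
            super().__init__()
            # kernel_size=3 means a (3, 3, 3) kernel over (frames, height, width)
            self.conv = nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1)
            self.relu = nn.ReLU(inplace=True)

        def forward(self, x):
            # x: (batch, channels, frames, height, width)
            return self.relu(self.conv(x))

    # A clip of 8 RGB frames at 112x112 yields per-voxel features
    clip = torch.randn(1, 3, 8, 112, 112)
    features = SpatioTemporalBlock(3, 16)(clip)
    print(features.shape)  # torch.Size([1, 16, 8, 112, 112])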

Seminar No. 5

Viktor Larsson
Building Polynomial Solvers for Computer Vision Applications
Lund University, Sweden
Thursday 2017-08-31 at 11:00
CIIRC Seminar Room B-670, floor 6 of Building B

Abstract
In the first part of the talk, I will give a brief overview of how polynomial equation systems are typically solved in Computer Vision. These equation systems often arise from minimal problems, which are fundamental building blocks in most Structure-from-Motion pipelines.
In the second part, I will present two recent papers on methods for constructing polynomial solvers. The first paper is about automatically generating the so-called elimination templates. The second paper extends the method to also handle saturated ideals, which makes it possible to add constraints that certain polynomials must be non-zero. Both papers are joint work with Kalle Åström and Magnus Oskarsson.
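
To give a flavour of saturation (a toy example using SymPy and the classical Rabinowitsch trick; this is not the construction from the paper): adding the auxiliary equation 1 - t*f to a system forces f to be non-zero on every solution, which is exactly the kind of constraint a saturated ideal encodes.

    from sympy import symbols, groebner

    x, y, t = symbols('x y t')

    # V(x*y, x**2 - x) has a spurious component at x = 0
    polys = [x*y, x**2 - x]

    # Rabinowitsch trick: 1 - t*x is solvable only when x != 0,
    # so the combined system keeps just the x = 1, y = 0 solution
    sat = groebner(polys + [1 - t*x], t, x, y, order='lex')
    print(sat)  # basis contains t - 1, x - 1, y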

References
[1] Larsson V., Åström K., Oskarsson M.: Efficient Solvers for Minimal Problems by Syzygy-Based Reduction, CVPR 2017. [http://www.maths.lth.se/matematiklth/personal/viktorl/papers/larsson2017efficient.pdf]
[2] Larsson V., Åström K., Oskarsson M.: Polynomial Solvers for Saturated Ideals, ICCV 2017.

Seminar No. 4

Tomas Mikolov
Neural Networks for Natural Language Processing
Facebook AI Research
Friday 2017-08-25 at 11:00
CIIRC Seminar Room B-670, floor 6 of Building B

Abstract
Artificial neural networks are currently very successful in various machine learning tasks that involve natural language. In this talk, I will describe recurrent neural network language models, as well as their most frequent applications to speech recognition and machine translation. I will also talk about distributed word representations, their interesting properties, and efficient ways to compute them and to use them in tasks such as text classification. Finally, I will describe our latest efforts to create a novel dataset that could be used to develop machines that can truly communicate with human users in natural language.
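
As a hedged illustration of distributed word representations (using the Gensim library, version 4 or later; the toy corpus below is made up, and meaningful analogies such as king - man + woman ≈ queen require training on a large corpus):

    from gensim.models import Word2Vec

    # Tiny made-up corpus; useful vectors need millions of sentences
    sentences = [
        ["the", "king", "rules", "the", "kingdom"],
        ["the", "queen", "rules", "the", "kingdom"],
        ["a", "man", "and", "a", "woman"],
    ]

    # sg=1 selects the skip-gram architecture popularized by word2vec
    model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

    # The additive property of the vectors: vec(king) - vec(man) + vec(woman)
    print(model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))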

Short bio:
Tomáš Mikolov has been a research scientist in the Facebook AI Research group since 2014. Previously, he was a member of the Google Brain team, where he developed and implemented efficient algorithms for computing distributed representations of words (the word2vec project). He obtained his PhD from the Brno University of Technology in 2012 for his work on recurrent neural network-based language models (RNNLM). His long-term research goal is to develop intelligent machines that people can communicate with and use to accomplish complex tasks.

Seminar No. 3

Workshop on learnable representations for geometric matching
Monday 2017-08-21, 14:00-18:00
CIIRC Seminar Room B-670, floor 6 of Building B

  • 14:00-15:00 Torsten Sattler (ETH Zurich) Hard Matching Problems in 3D Vision
  • 15:00-15:20 Coffee break, discussion
  • 15:20-16:20 Eric Brachmann (TU Dresden) Scene Coordinate Regression: From Random Forests to End-to-End Learning
  • 16:20-16:40 Coffee break, discussion
  • 16:40-17:40 Ignacio Rocco (Inria) Convolutional neural network architecture for geometric matching
  • 17:40-18:00 Discussion
Speaker: Torsten Sattler, ETH Zurich
Title: Hard Matching Problems in 3D Vision
Abstract: Estimating correspondences, i.e., data association, is a fundamental step of every 3D Computer Vision pipeline. For example, 2D-3D matches between pixels in an image and points in a 3D scene model are required for camera pose computation and thus for visual localization. Existing approaches to correspondence estimation, e.g., those based on local image descriptors such as SIFT, have been shown to work well over a range of viewing conditions. Still, existing solutions are rather limited in challenging scenes.
This talk will focus on data association in challenging scenarios. We first discuss the impact of day-night changes on visual localization, demonstrating that state-of-the-art algorithms perform severely worse than in the day-day scenario typically considered in the literature. Next, we discuss ongoing work that aims to boost the performance of local descriptors in this setting via a dense-sparse feature detection and matching pipeline. A key idea in this work is to use pre-trained convolutional neural networks to obtain descriptors that contain mid-level semantic information, in contrast to the low-level information utilized by SIFT. Based on the intuition that semantic information provides a higher form of invariance, the second part of the talk considers exploiting semantic (image) segmentations in the context of visual localization and visual SLAM.
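
A minimal sketch of the dense-descriptor idea (my own illustration in PyTorch/torchvision, not the speaker's pipeline; the layer cut-off and image sizes are arbitrary): features from an intermediate layer of a pre-trained network serve as per-location descriptors that can be compared across images.

    import torch
    import torch.nn.functional as F
    import torchvision.models as models

    # Mid-level features of a pre-trained VGG16 (up to pool3) as dense descriptors
    # (the weights argument requires torchvision >= 0.13)
    vgg = models.vgg16(weights=models.VGG16_Weights.DEFAULT).features[:17].eval()

    def dense_descriptors(img):
        # img: (1, 3, H, W) -> L2-normalised descriptor per spatial location
        with torch.no_grad():
            f = vgg(img)              # (1, 256, H/8, W/8)
        return F.normalize(f, dim=1)

    a = dense_descriptors(torch.randn(1, 3, 224, 224))
    b = dense_descriptors(torch.randn(1, 3, 224, 224))

    # Cosine similarity between every location in image A and every one in image B
    sim = torch.einsum('bcij,bckl->bijkl', a, b)
    print(sim.shape)  # torch.Size([1, 28, 28, 28, 28])
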
Speaker: Eric Brachmann, TU Dresden
Title: Scene Coordinate Regression: From Random Forests to End-to-End Learning
Abstract: For decades, the estimation of accurate 6D camera poses has relied on hand-crafted sparse feature pipelines and geometric processing. Motivated by recent successes, some authors have asked whether camera localization can be cast as a learning problem. Despite some success, the accuracy of unconstrained CNN architectures trained for this task is still inferior to that of traditional approaches. In this talk, we discuss an alternative line of research, which combines geometric processing with constrained machine learning in the form of scene coordinate regression. We discuss how random forests or CNNs can be trained to substitute for sparse feature detection and matching. Furthermore, we show how to train camera localization pipelines end-to-end using a novel, differentiable formulation of RANSAC. We will close the talk with some thoughts about open problems in learning camera localization.
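
The core idea behind a differentiable RANSAC can be sketched in a few lines (this is only my illustration of the principle, not the formulation from the paper): replace the hard argmax over pose hypotheses with a softmax distribution, so that an expected task loss becomes differentiable with respect to the hypothesis scores.

    import torch

    def expected_pose_loss(scores, losses, alpha=10.0):
        """Soft hypothesis selection: a distribution over hypotheses
        instead of RANSAC's hard winner-takes-all choice."""
        probs = torch.softmax(alpha * scores, dim=0)
        return (probs * losses).sum()  # expected task loss

    scores = torch.tensor([0.2, 1.5, 0.7], requires_grad=True)  # e.g. inlier counts
    losses = torch.tensor([4.0, 0.5, 2.0])                      # e.g. pose errors
    expected_pose_loss(scores, losses).backward()
    print(scores.grad)  # gradients flow through the selection step
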
Speaker: Ignacio Rocco, Inria Paris
Title: Convolutional neural network architecture for geometric matching
Abstract: We address the problem of determining correspondences between two images in agreement with a geometric model, such as an affine or thin-plate spline transformation, and estimating its parameters. The contributions of this work are three-fold. First, we propose a convolutional neural network architecture for geometric matching. The architecture is based on three main components that mimic the standard steps of feature extraction, matching, and simultaneous inlier detection and model parameter estimation, while being trainable end-to-end. Second, we demonstrate that the network parameters can be trained from synthetically generated imagery without the need for manual annotation, and that our matching layer significantly increases generalization to previously unseen images. Finally, we show that the same model can perform both instance-level and category-level matching, giving state-of-the-art results on the challenging Proposal Flow dataset.
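
The matching step described above can be pictured as a correlation layer; here is a short sketch (in PyTorch; sizes are arbitrary and the details of the actual architecture are omitted) that computes all pairwise similarities between the normalised feature maps of the two images:

    import torch
    import torch.nn.functional as F

    def matching_layer(feat_a, feat_b):
        """All pairwise cosine similarities between two dense feature maps."""
        b, c, h, w = feat_a.shape
        fa = F.normalize(feat_a, dim=1).view(b, c, h * w)
        fb = F.normalize(feat_b, dim=1).view(b, c, h * w)
        corr = torch.bmm(fa.transpose(1, 2), fb)  # (b, h*w, h*w)
        # one similarity map over image B for every location in image A
        return corr.view(b, h, w, h, w)

    corr = matching_layer(torch.randn(2, 128, 15, 15), torch.randn(2, 128, 15, 15))
    print(corr.shape)  # torch.Size([2, 15, 15, 15, 15])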

Seminar No. 2

Joe Kileel
Using Computational Algebra for Computer Vision
Princeton
Wednesday 2017-06-07 at 15:00
CIIRC Seminar Room B-670, floor 6 of Building B

Abstract
Scene reconstruction is a fundamental task in computer vision: given multiple images from different angles, create a 3D model of a world scene. Nowadays, self-driving cars need to perform 3D reconstruction in real time to navigate their surroundings. Large-scale photo-tourism is another popular application. In this talk, we will explain how key subroutines in reconstruction algorithms amount to solving polynomial systems with special geometric structure. We will answer a question of Sameer Agarwal (Google Research) about recovering the motion of two calibrated cameras. Next, we will quantify the “algebraic complexity” of polynomial systems arising from three calibrated cameras. In terms of multi-view geometry, we deal with essential matrices and trifocal tensors. The first part applies tools such as resultants from algebra, while the second part will offer an introduction to numerical homotopy continuation methods. Those wondering “if algebraic geometry is good for anything practical” are especially encouraged to attend.
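
For readers who want a concrete taste of the “special geometric structure”: a real 3x3 matrix E is an essential matrix exactly when it satisfies the classical Demazure constraints, which in LaTeX notation read

    \det E = 0, \qquad
    2\, E E^{\top} E - \operatorname{tr}\!\left(E E^{\top}\right) E = 0,

i.e. one cubic from the determinant and nine more from the matrix equation; these ten cubics cut out the essential variety studied in [1].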

References
[1] G. Fløystad, J. Kileel, G. Ottaviani: “The Chow form of the essential variety in computer vision,” J. Symbolic Comput., to appear. [https://arxiv.org/pdf/1604.04372]
[2] J. Kileel: “Minimal problems for the calibrated trifocal variety,” SIAM J. Appl. Algebra Geom., to appear. [https://arxiv.org/pdf/1611.05947]

Seminar No. 1

Torsten Sattler
Camera Localization
ETH Zurich
Thursday 2017-05-11 at 11:00
CIIRC Lecture Hall A1001 of Building A (Jugoslavskych partyzanu 3)

Abstract
Estimating the position and orientation of a camera in a scene from images is an essential part of many (3D) Computer Vision and Robotics algorithms, such as Structure-from-Motion, Simultaneous Localization and Mapping (SLAM), and visual localization. Camera localization has applications in navigation for autonomous vehicles/robots, Augmented and Virtual Reality, and 3D mapping. Furthermore, there are strong relations to camera calibration and visual place recognition. In this talk, I will give an overview of past and current efforts towards robust, efficient, and accurate camera localization. I will begin the talk by showing that classical localization approaches have not been made obsolete by deep learning. Following a local feature-based approach, the talk will discuss how to adapt such methods for real-time visual localization on mobile devices with limited computational capabilities, as well as approaches that scale to large (city-scale) scenes, including the challenges encountered at large scale. The final part of the talk will discuss open problems in the areas of camera localization and 3D mapping, both in terms of problems we are currently working on and interesting long-term goals.
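
For context, the classical feature-based localization step that such pipelines build on can be sketched with OpenCV (a minimal sketch with synthetic data; the matches, calibration matrix, and threshold are placeholders): given 2D-3D matches, a PnP solver inside a RANSAC loop recovers the camera pose.

    import numpy as np
    import cv2

    # Placeholder 2D-3D matches: image pixels <-> points in the 3D scene model
    pts3d = np.random.rand(100, 3).astype(np.float32)
    pts2d = np.random.rand(100, 2).astype(np.float32)
    K = np.array([[800, 0, 320], [0, 800, 240], [0, 0, 1]], dtype=np.float32)

    # Robust absolute pose: PnP inside RANSAC
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        pts3d, pts2d, K, distCoeffs=None, reprojectionError=8.0)
    if ok:
        R, _ = cv2.Rodrigues(rvec)        # camera orientation
        center = (-R.T @ tvec).ravel()    # camera position in scene coordinates
        print(center)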

Short bio:
Torsten Sattler received a PhD in Computer Science from RWTH Aachen University, Germany, in 2013 under the supervision of Prof. Bastian Leibe and Prof. Leif Kobbelt. In December 2013, he joined the Computer Vision and Geometry Group of Prof. Marc Pollefeys at ETH Zurich, Switzerland, where he is currently a senior researcher and Marc Pollefeys’ deputy while Prof. Pollefeys is on leave from ETH. His research interests include (large-scale) image-based localization using Structure-from-Motion point clouds, real-time localization and SLAM on mobile devices and for robotics, 3D mapping, Augmented & Virtual Reality, (multi-view) stereo, image retrieval and efficient spatial verification, and camera calibration and pose estimation. Torsten has worked on dense sensing for self-driving cars as part of the V-Charge project. He is currently involved in enabling semantic SLAM and re-localization for gardening robots (as part of an EU Horizon 2020 project, where he leads the efforts on a work package), in research for Google’s Tango project, where he leads CVG’s research efforts, and in work on self-driving cars.