###### List of past seminars (follow links for details):

*2017-12-07* Federica Arrigoni **Synchronization Problems in Computer Vision**

*2017-11-23* Antonín Šulc **Lightfield Analysis for non-Lambertian Scenes**

*2017-10-30* Jana Košecká **Semantic Understanding for Robot Perception**

*2017-10-19* Mircea Cimpoi **Deep Filter Banks for Texture Recognition**

*2017-10-18* Wolfgang Förstner **Evaluation of Estimation Results within Structure from Motion Problems**

*2017-10-17* Ludovic Magerand **Projective Structure-from-Motion and Rolling Shutter Pose Estimation**

*2017-09-25* Akihiro Sugimoto **Deeply Supervised 3D Recurrent FCN for Salient Object Detection in Videos**

*2017-08-31* Viktor Larsson **Building Polynomial Solvers for Computer Vision Applications**

*2017-08-25* Tomas Mikolov **Neural Networks for Natural Language Processing**

*2017-08-21* Torsten Sattler, Eric Brachmann, Ignacio Rocco **Workshop on learnable representations for geometric matching**

*2017-06-07* Joe Kileel **Using Computational Algebra for Computer Vision**

*2017-05-11* Torsten Sattler **Camera Localization**

#### Seminar R4I No. 1

**Synchronization Problems in Computer Vision**

Federica Arrigoni, University of Udine, Italy

Thursday 2017-12-07 at 11:00

CIIRC Seminar Room B-670, floor 6 of Building B

**Abstract**

Consider a network of nodes where each node is characterized by an unknown state, and suppose that pairs of nodes can measure the ratio (or difference) between their states. The goal of “synchronization” is to infer the unknown states from the pairwise measures. Typically, states are represented by elements of a group, such as the Symmetric Group or the Special Euclidean Group. The former can represent local labels of a set of features, which refer to the multi-view matching application, whereas the latter can represent camera reference frames, in which case we are in the context of structure from motion, or local coordinates where 3D points are represented, in which case we are dealing with multiple point-set registration. A related problem is that of “bearing-only network localization” where each node is located at a fixed (unknown) position in 3-space and pairs of nodes can measure the direction of the line joining their locations. We are interested in global techniques where all the measures are considered at once, as opposed to incremental approaches that grow a solution by adding pieces iteratively.
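As a minimal illustration of the global formulation (a sketch, not the speaker's method), consider the simplest group, the real numbers under addition. Each measurement is then a difference `z_ij ≈ x_i - x_j`, and all measurements can be stacked into one linear least-squares problem. The function name `synchronize_differences`, the toy graph, and the choice to fix node 0 at zero are assumptions made for this example.

```python
import numpy as np

def synchronize_differences(n_nodes, edges, measurements):
    """Estimate states x from pairwise differences z_ij ~ x_i - x_j.

    All measurements are used at once (global approach). The states are
    only defined up to a common shift, so node 0 is pinned to zero.
    """
    # One row per measurement: +1 at node i, -1 at node j.
    A = np.zeros((len(edges), n_nodes))
    for row, (i, j) in enumerate(edges):
        A[row, i] = 1.0
        A[row, j] = -1.0
    z = np.asarray(measurements, dtype=float)
    # Dropping the column of node 0 fixes the gauge x_0 = 0.
    x_rest, *_ = np.linalg.lstsq(A[:, 1:], z, rcond=None)
    return np.concatenate(([0.0], x_rest))

# Toy example: four nodes, five noise-free pairwise measurements.
true_x = np.array([0.0, 1.5, -2.0, 4.0])
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
z = [true_x[i] - true_x[j] for i, j in edges]
estimate = synchronize_differences(4, edges, z)
```

For non-commutative groups such as rotations or rigid motions, the linear system is typically replaced by spectral or optimization-based formulations, but the structure (one constraint per edge, solved jointly) is the same.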

#### Seminar No. 11

**Lightfield Analysis for non-Lambertian Scenes**

Antonín Šulc, University of Konstanz, Germany

Thursday 2017-11-23 at 11:00

CIIRC Seminar Room B-670, floor 6 of Building B

**Abstract**

In most natural scenes, we can see objects composed of non-Lambertian materials, whose appearance changes if we change viewpoint. In many computer vision tasks, we consider these as something undesirable and treat them as outliers. However, if we can record data from multiple dense viewpoints, such as with a light-field camera, we have a chance to not only deal with them but also extract additional information about the scene.

In this talk, I will show the capabilities of the light-field paradigm on various problems. Key ideas are a linear algorithm for structure from motion to generate refocusable panoramas and depth estimation for multi-layered objects which are semitransparent or partially reflective. Using these, I will show that we can decompose such scenes and further perform a robust volumetric reconstruction. Finally, I will consider decomposition of light fields into reflectance, natural illumination and geometry, a problem known as inverse rendering.

#### Seminar No. 10

**Semantic Understanding for Robot Perception**

Jana Košecká, George Mason University, Fairfax, VA, USA

Monday 2017-10-30 at 16:00

Czech Technical University, Karlovo namesti, G-205

**Abstract**

Advancements in robotic navigation, mapping, object search and recognition rest to a large extent on robust, efficient and scalable semantic understanding of the surrounding environment. In recent years we have developed several approaches for capturing the geometry and semantics of the environment from video, RGB-D data, or just a single RGB image, focusing on indoor and outdoor environments relevant for robotics applications.

I will demonstrate our work on detailed semantic parsing and 3D structure recovery using deep convolutional neural networks (CNNs), and on object detection and object pose recovery from a single RGB image. The applicability of the presented techniques to autonomous driving, service robotics, mapping and augmented reality applications will be discussed.

#### Seminar No. 9

**Deep Filter Banks for Texture Recognition**

Mircea Cimpoi, CIIRC, Czech Technical University, Prague

Thursday 2017-10-19 at 11:00

CIIRC Seminar Room A-303, floor 3 of Building A

**Abstract**

This talk will be about texture and material recognition from images, and revisiting classical texture representations in the context of deep learning. The results were presented in CVPR 2015 and IJCV 2016. Visual textures are ubiquitous and play an important role in image understanding because they convey significant semantics of images, and because texture representations that pool local image descriptors in an order-less manner have had a tremendous impact in various practical applications. In the talk, we will revisit classic texture representations, including bag-of-visual-words and the Fisher vectors, in the context of deep learning and show that these have excellent efficiency and generalization properties if the convolutional layers of a deep model are used as filter banks. We obtain in this manner state-of-the-art performance in numerous datasets well beyond textures, an efficient method to apply deep features to image regions, as well as benefit in transferring features from one domain to another.

**References**

[1] Cimpoi, M., Maji, S., Kokkinos, I., Vedaldi, A., Deep Filter Banks for Texture Recognition, Description, and Segmentation, IJCV (2016) 118:65

[2] Cimpoi, M., Maji, S., and Vedaldi, A., Deep Filter Banks for Texture Recognition and Segmentation, CVPR (2015)

#### Seminar No. 8

**Evaluation of Estimation Results within Structure from Motion Problems**

Wolfgang Förstner, University of Bonn, Germany

Wednesday 2017-10-18 at 11:00

CIIRC Seminar Room B-670, floor 6 of Building B

**Abstract**

Parameter estimation is the core of many geometric problems within structure from motion. The number of parameters ranges from a few, e.g., for pose estimation or triangulation, to huge numbers, such as in bundle adjustment or surface reconstruction. The ability for planning, self-diagnosis, and evaluation is critical for successful project management. Uncertainty of observed and estimated quantities needs to be available, faithful, and realistic. The talk presents methods (1) to critically check the faithfulness of the result of estimation procedures, (2) to evaluate suboptimal estimation procedures, and (3) to evaluate and compare competing procedures w.r.t. their precision in the presence of rank deficiencies. Evaluating bundle adjustment results is taken as one example problem.

**References**

[1] T. Dickscheid, T. Läbe, and W. Förstner, Benchmarking Automatic Bundle Adjustment Results, in 21st Congress of the International Society for Photogrammetry and Remote Sensing (ISPRS), Beijing, China, 2008, p. 7–12, Part B3a.

[2] W. Förstner and K. Khoshelham, Efficient and Accurate Registration of Point Clouds with Plane to Plane Correspondences, in 3rd International Workshop on Recovering 6D Object Pose, 2017.

[3] W. Förstner and B. P. Wrobel, Photogrammetric Computer Vision — Statistics, Geometry, Orientation and Reconstruction, Springer, 2016.

[4] T. Läbe, T. Dickscheid, and W. Förstner, On the Quality of Automatic Relative Orientation Procedures, in 21st Congress of the International Society for Photogrammetry and Remote Sensing (ISPRS), Beijing, China, 2008, p. 37–42 Part B3b-1.

#### Seminar No. 7

**Projective Structure-from-Motion and Rolling Shutter Pose Estimation**

Ludovic Magerand, CIIRC, Czech Technical University, Prague

Tuesday 2017-10-17 at 11:00

CIIRC Seminar Room A-303, floor 3 of Building A

**Abstract**

This talk is divided into two parts. The first is a presentation of an ICCV’17 paper about a practical solution to the Projective Structure from Motion (PSfM) problem that deals efficiently with missing data (up to 98%), outliers and, for the first time, large-scale 3D reconstruction scenarios. This is achieved by embedding the projective depths into the projective parameters of the points and views to improve computational speed. To do so and to ensure a valid reconstruction, an extension of the linear constraints from the Generalized Projective Reconstruction Theorem is used. With an incremental approach, views and points are added robustly to an initial solvable sub-problem until completion of the underlying factorization.

The second part of the talk will present my PhD thesis “Dynamic pose estimation with CMOS cameras using sequential acquisition”. CMOS cameras are cheap and can acquire images at a very high frame rate thanks to an acquisition mode called Rolling Shutter, which exposes the scan lines sequentially. This makes them very interesting in the context of very high-speed robotics, but it comes with what was long seen as a drawback: when an object (or the camera itself) moves in the scene, distortions appear in the image. These rolling shutter effects actually contain information on the motion and can become another advantage for high-speed robotics by extending the usual pose estimation to also estimate the motion parameters. Two methods achieving this will be presented: the first assumes a non-uniform motion model and the second a projection model suitable for polynomial optimization.

#### Seminar No. 6

**Deeply Supervised 3D Recurrent FCN for Salient Object Detection in Videos**

Akihiro Sugimoto, National Institute of Informatics (NII), Japan

Monday 2017-09-25 at 11:00

CIIRC Seminar Room B-670, floor 6 of Building B

**Abstract**

This talk presents a novel end-to-end 3D fully convolutional network for salient object detection in videos. The proposed network uses 3D filters in the spatio-temporal domain to directly learn both spatial and temporal information and obtain 3D deep features, and transfers the 3D deep features to pixel-level saliency prediction, outputting saliency voxels. In the network, we combine refinement at each layer with deep supervision to efficiently and accurately detect salient object boundaries. The refinement module recurrently enhances the feature map by learning contextual information. Applying deeply-supervised learning to hidden layers, on the other hand, improves the details of the intermediate saliency voxels, so that the saliency voxels are progressively refined. Extensive experiments using publicly available benchmark datasets confirm that our network outperforms state-of-the-art methods. The proposed saliency model also works effectively for video object segmentation.

#### Seminar No. 5

**Building Polynomial Solvers for Computer Vision Applications**

Viktor Larsson, Lund University, Sweden

Thursday 2017-08-31 at 11:00

CIIRC Seminar Room B-670, floor 6 of Building B

**Abstract**

In the first part of the talk, I will give a brief overview of how polynomial equation systems are typically solved in Computer Vision. These equation systems often come from minimal problems, which are fundamental building blocks in most Structure-from-Motion pipelines.

In the second part, I will present two recent papers on methods for constructing polynomial solvers. The first paper is about automatically generating so-called elimination templates. The second paper extends the method to also handle saturated ideals. This essentially allows us to add constraints that some polynomials should be non-zero. Both papers are joint work with Kalle Åström and Magnus Oskarsson.
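For readers unfamiliar with the topic, the following sketch shows the basic idea behind eigenvalue-based polynomial solving in the simplest, univariate case. It is an illustration under stated assumptions, not the method of the papers: multiplication by `x` modulo a monic polynomial is a linear map, and the eigenvalues of its matrix (the companion matrix) are the roots. Solvers for minimal problems generalize this to multivariate systems, where building the multiplication matrix requires an elimination step. The function name `roots_via_multiplication_matrix` is hypothetical.

```python
import numpy as np

def roots_via_multiplication_matrix(coeffs):
    """Roots of the monic polynomial x^n + a_{n-1} x^{n-1} + ... + a_0.

    coeffs = [a_0, a_1, ..., a_{n-1}]. The matrix M represents
    multiplication by x in the basis {1, x, ..., x^{n-1}} modulo the
    polynomial; its eigenvalues are the roots.
    """
    n = len(coeffs)
    M = np.zeros((n, n))
    # x * x^k = x^(k+1) for k < n-1: ones on the subdiagonal.
    M[1:, :-1] = np.eye(n - 1)
    # x * x^(n-1) = -(a_0 + a_1 x + ... + a_{n-1} x^(n-1)) modulo the polynomial.
    M[:, -1] = -np.asarray(coeffs, dtype=float)
    return np.linalg.eigvals(M)

# Example: (x - 1)(x - 2)(x - 3) = x^3 - 6x^2 + 11x - 6.
roots = roots_via_multiplication_matrix([-6.0, 11.0, -6.0])
sorted_roots = np.sort(np.real(roots))
```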

**References**

[1] Larsson, V., Åström, K., Oskarsson, M., Efficient Solvers for Minimal Problems by Syzygy-Based Reduction, CVPR (2017) [http://www.maths.lth.se/matematiklth/personal/viktorl/papers/larsson2017efficient.pdf]

[2] Larsson, V., Åström, K., Oskarsson, M., Polynomial Solvers for Saturated Ideals, ICCV (2017)

#### Seminar No. 4

**Neural Networks for Natural Language Processing**

Tomas Mikolov, Facebook AI Research

Friday 2017-08-25 at 11:00

CIIRC Seminar Room B-670, floor 6 of Building B

**Abstract**

Artificial neural networks are currently very successful in various machine learning tasks that involve natural language. In this talk, I will describe recurrent neural network language models, as well as their most frequent applications to speech recognition and machine translation. I will also talk about distributed word representations, their interesting properties, and efficient ways to compute them and use them in tasks such as text classification. Finally, I will describe our latest efforts to create a novel dataset that could be used to develop machines that can truly communicate with human users in natural language.

**Short bio**:

Tomáš Mikolov has been a research scientist in the Facebook AI Research group since 2014. Previously, he was a member of the Google Brain team, where he developed and implemented efficient algorithms for computing distributed representations of words (the word2vec project). He obtained his PhD from the Brno University of Technology in 2012 for his work on recurrent neural network-based language models (RNNLM). His long-term research goal is to develop intelligent machines that people can communicate with and use to accomplish complex tasks.

#### Seminar No. 3

**Workshop on learnable representations for geometric matching**

Monday 2017-08-21, 14:00-18:00

CIIRC Seminar Room B-670, floor 6 of Building B

- 14:00-15:00 Torsten Sattler (ETH Zurich) Hard Matching Problems in 3D Vision
- 15:00-15:20 Coffee break, discussion
- 15:20-16:20 Eric Brachmann (TU Dresden) Scene Coordinate Regression: From Random Forests to End-to-End Learning
- 16:20-16:40 Coffee break, discussion
- 16:40-17:40 Ignacio Rocco (Inria) Convolutional neural network architecture for geometric matching
- 17:40-18:00 Discussion

Speaker: Torsten Sattler, ETH Zurich

Title: Hard Matching Problems in 3D Vision

Abstract: Estimating correspondences, i.e., data association, is a fundamental step of every 3D Computer Vision pipeline. For example, 2D-3D matches between pixels in an image and 3D points in a 3D scene model are required for camera pose computation and thus for visual localization. Existing approaches for correspondence estimation, e.g., based on local image descriptors such as SIFT, have been shown to work well for a range of viewing conditions. Still, existing solutions are rather limited in challenging scenes. This talk will focus on data association in challenging scenarios. We first discuss the impact of day-night changes on visual localization, demonstrating that state-of-the-art algorithms perform considerably worse than in the day-day scenario typically considered in the literature. Next, we discuss ongoing work aiming at boosting the performance of local descriptors in this scenario via a dense-sparse feature detection and matching pipeline. A key idea in this work is to use pre-trained convolutional neural networks to obtain descriptors that contain mid-level semantic information, in contrast to the low-level information utilized by SIFT. Based on the intuition that semantic information provides a higher form of invariance, the second part of the talk considers exploiting semantic (image) segmentations in the context of visual localization and visual SLAM.

Speaker: Eric Brachmann, TU Dresden

Title: Scene Coordinate Regression: From Random Forests to End-to-End Learning

Abstract: For decades, estimation of accurate 6D camera poses relied on hand-crafted sparse feature pipelines and geometric processing. Motivated by recent successes, some authors ask the question whether camera localization can be cast as a learning problem. Despite some success, the accuracy of unconstrained CNN architectures trained for this task is still inferior compared to traditional approaches. In this talk, we discuss an alternative line of research, which tries to combine geometric processing with constrained machine learning in the form of scene coordinate regression. We discuss how random forests or CNNs can be trained to substitute sparse feature detection and matching. Furthermore, we show how to train camera localization pipelines end-to-end using a novel, differentiable formulation of RANSAC. We will close the talk with some thoughts about open problems in learning camera localization.

Speaker: Ignacio Rocco, Inria Paris

Title: Convolutional neural network architecture for geometric matching

Abstract: We address the problem of determining correspondences between two images in agreement with a geometric model such as an affine or thin-plate spline transformation, and estimating its parameters. The contributions of this work are three-fold. First, we propose a convolutional neural network architecture for geometric matching. The architecture is based on three main components that mimic the standard steps of feature extraction, matching and simultaneous inlier detection and model parameter estimation, while being trainable end-to-end. Second, we demonstrate that the network parameters can be trained from synthetically generated imagery without the need for manual annotation and that our matching layer significantly increases generalization capabilities to never-before-seen images. Finally, we show that the same model can perform both instance-level and category-level matching, giving state-of-the-art results on the challenging Proposal Flow dataset.

#### Seminar No. 2

**Using Computational Algebra for Computer Vision**

Joe Kileel, Princeton

Wednesday 2017-06-07 at 15:00

CIIRC Seminar Room B-670, floor 6 of Building B

**Abstract**

Scene reconstruction is a fundamental task in computer vision: given multiple images from different angles, create a 3D model of a world scene. Nowadays self-driving cars need to do 3D reconstruction in real-time, to navigate their surroundings. Large-scale photo-tourism is also a popular application. In this talk, we will explain how key subroutines in reconstruction algorithms amount to solving polynomial systems, with special geometric structure. We will answer a question of Sameer Agarwal (Google Research) about recovering the motion of two calibrated cameras. Next, we will quantify the “algebraic complexity” of polynomial systems arising from three calibrated cameras. In terms of multi-view geometry, we deal with essential matrices and trifocal tensors. The first part applies tools like resultants from algebra, while the second part will offer an introduction to numerical homotopy continuation methods. Those wondering “if algebraic geometry is good for anything practical” are especially encouraged to attend.
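As background for the two-camera case, the sketch below (an illustration, not material from the talk) builds an essential matrix `E = [t]_x R` from a relative rotation `R` and translation `t` and checks two standard properties: the epipolar constraint `x2^T E x1 = 0` for corresponding calibrated image points, and the singular-value pattern `(s, s, 0)`. These polynomial conditions are the kind of structure the abstract refers to. The numeric values of `R`, `t` and the 3D point are arbitrary choices for the example.

```python
import numpy as np

def skew(t):
    """Cross-product matrix [t]_x, so that skew(t) @ v == np.cross(t, v)."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

def essential_matrix(R, t):
    """Essential matrix for the relative pose X2 = R @ X1 + t."""
    return skew(t) @ R

# Arbitrary relative pose: rotation about the y-axis plus a translation.
angle = 0.3
R = np.array([[np.cos(angle), 0.0, np.sin(angle)],
              [0.0, 1.0, 0.0],
              [-np.sin(angle), 0.0, np.cos(angle)]])
t = np.array([1.0, 0.2, 0.1])
E = essential_matrix(R, t)

# One 3D point seen by both calibrated cameras (normalized image coordinates).
X1 = np.array([0.5, -0.3, 4.0])
X2 = R @ X1 + t
x1 = X1 / X1[2]
x2 = X2 / X2[2]

residual = x2 @ E @ x1                                # epipolar constraint, ~0
singular_values = np.linalg.svd(E, compute_uv=False)  # two equal values and a zero
```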

**References**

[1] G. Floystad, J. Kileel, G. Ottaviani: “The Chow form of the essential variety in computer vision,” J. Symbolic Comput., to appear. [https://arxiv.org/pdf/1604.04372]

[2] J. Kileel: “Minimal problems for the calibrated trifocal variety,” SIAM Appl. Alg. Geom., to appear. [https://arxiv.org/pdf/1611.05947]

#### Seminar No. 1

**Camera Localization**

Torsten Sattler, ETH Zurich

Thursday 2017-05-11 at 11:00

CIIRC Lecture Hall A1001 of Building A (Jugoslavskych partyzanu 3)

**Abstract**

Estimating the position and orientation of a camera in a scene based on images is an essential part of many (3D) Computer Vision and Robotics algorithms such as Structure-from-Motion, Simultaneous Localization and Mapping (SLAM), and visual localization. Camera localization has applications in navigation for autonomous vehicles/robots, Augmented and Virtual Reality, and 3D mapping. Furthermore, there are strong relations to camera calibration and visual place recognition. In this talk, I will give an overview of past and current efforts on robust, efficient, and accurate camera localization. I will begin the talk by showing that classical localization approaches haven’t been made obsolete by deep learning. Following a local feature-based approach, the talk will discuss how to adapt such methods for real-time visual localization on mobile devices with limited computational capabilities and approaches that scale to large (city-scale) scenes, including the challenges encountered at large scale. The final part of the talk will discuss open problems in the areas of camera localization and 3D mapping, both in terms of problems we are currently working on as well as interesting long-term goals.

**Short bio**:

Torsten Sattler received a PhD in Computer Science from RWTH Aachen University, Germany, in 2013 under the supervision of Prof. Bastian Leibe and Prof. Leif Kobbelt. In December 2013, he joined the Computer Vision and Geometry Group of Prof. Marc Pollefeys at ETH Zurich, Switzerland, where he is currently a senior researcher and Marc Pollefeys’ deputy while Prof. Pollefeys is on leave from ETH. His research interests include (large-scale) image-based localization using Structure-from-Motion point clouds, real-time localization and SLAM on mobile devices and for robotics, 3D mapping, Augmented & Virtual Reality, (multi-view) stereo, image retrieval and efficient spatial verification, camera calibration and pose estimation. Torsten has worked on dense sensing for self-driving cars as part of the V-Charge project. He is currently involved in enabling semantic SLAM and re-localization for gardening robots (as part of an EU Horizon 2020 project where he leads the efforts on a work package), research for Google’s Tango project, where he leads CVG’s research efforts, and in work on self-driving cars.