###### List of past seminars (follow links for details):

*2019-01-11* Guillem Alenya **Challenges in robotics for human environments and specially for textile manipulation**

*2019-01-11* Kimotishi Yamazaki **Vision-Based Cloth Manipulation by Autonomous Robots**

*2018-10-25* David Fouhey **Understanding how to get to places and do things** *(video)*

*2018-10-24* Yuriy Kaminskyi **Semantic segmentation for indoor localization**

*2018-08-16* Viktor Larsson **Orthographic-Perspective Epipolar Geometry, Optimal Trilateration and Non-Linear Variable Projection for Time-of-Arrival**

*2018-07-24* Torsten Sattler **Challenges in Long-Term Visual Localization**

*2018-05-25* Alexei Efros **Self-supervision, Meta-supervision, Curiosity: Making Computers Study Harder** *(video)*

*2018-05-22* Di Meng **Cameras behind glass – polynomial constraints on image projection**

*2018-02-23* Oles Dobosevych **On Ukrainian characters MNIST dataset**

*2018-02-22* F. Bach, P.-Y. Massé, N. Mansard, J. Šivic, T. Werner *AIME@CZ – Czech Workshop on Applied Mathematics in Engineering*

*2017-12-07* Federica Arrigoni **Synchronization Problems in Computer Vision**

*2017-11-23* Antonín Šulc **Lightfield Analysis for non-Lambertian Scenes**

*2017-10-30* Jana Košecká **Semantic Understanding for Robot Perception**

*2017-10-19* Mircea Cimpoi **Deep Filter Banks for Texture Recognition**

*2017-10-18* Wolfgang Förstner **Evaluation of Estimation Results within Structure from Motion Problems**

*2017-10-17* Ludovic Magerand **Projective Structure-from-Motion and Rolling Shutter Pose Estimation**

*2017-09-25* Akihiro Sugimoto **Deeply Supervised 3D Recurrent FCN for Salient Object Detection in Videos**

*2017-08-31* Viktor Larsson **Building Polynomial Solvers for Computer Vision Applications**

*2017-08-25* Tomas Mikolov **Neural Networks for Natural Language Processing**

*2017-08-21* Torsten Sattler, Eric Brachmann, Ignacio Rocco *Workshop on learnable representations for geometric matching*

*2017-06-07* Joe Kileel **Using Computational Algebra for Computer Vision**

*2017-05-11* Torsten Sattler **Camera Localization**

#### Seminar R4I No. 3

Guillem Alenya (Universitat Politècnica de Catalunya, Spain): **Challenges in robotics for human environments and specially for textile manipulation**

Friday 2019-01-11 at 10:15, CIIRC Seminar Room B-633, floor 6 of Building B

**Abstract**

The Perception and Manipulation at IRI (Institute of Robotics) group focuses on enhancing the perception, learning, and planning capabilities of robots to achieve higher degrees of autonomy and user-friendliness during everyday manipulation tasks. Some topics addressed are the geometric interpretation of perceptual information, construction of 3D object models, action selection and planning, reinforcement learning, and teaching by demonstration. We will discuss challenges and current developments primarily in the inclusion of robots in everyday environments, and in the manipulation of textiles.

#### Seminar R4I No. 2

Kimotishi Yamazaki (Shinshu University, Faculty of Engineering, Nagano, Japan): **Vision-Based Cloth Manipulation by Autonomous Robots**

Friday 2019-01-11 at 9:30, CIIRC Seminar Room B-633, floor 6 of Building B

**Abstract**

In this talk, we will introduce topics in the manipulation of cloth by autonomous robots. Cloth is a deformable object, and its shape changes drastically when it is manipulated. We will mainly explain the sensor information processing, knowledge representation, and recognition methods needed to manipulate such objects successfully.

#### Seminar No. 19

David Fouhey (University of Michigan, MI, USA): **Understanding how to get to places and do things**

Thursday 2018-10-25 at 11:00, CIIRC Seminar Room B-670, floor 6 of Building B

**Video**

https://www.youtube.com/watch?v=99Mep3Cw9-Q

**Abstract**

What does it mean to understand an image or video? One common answer in computer vision has been that understanding means naming things: this part of the image corresponds to a refrigerator and that to a person, for instance. While important, this ability is not enough: humans can effortlessly reason about the rich world that images depict and what they can do in it. For example, if a friend shows you the way to their kitchen for you to get something, they won’t worry that you’ll get lost walking back (navigation) or that you’d have trouble figuring out how to open their refrigerator or cabinets. While both are an ordinary feat for humans (or even a dog or cat), they are currently far beyond the abilities of computers.

In my talk, I’ll discuss my efforts towards bridging this gap. In the first part, I’ll discuss the task of navigation, getting from one place to another. In particular, our goal is to take a single demonstration of a path and retrace it, either forwards or backwards, under noisy actuation and a changing environment. Rather than build an explicit model of the world, we learn a network that attends to a sequence of memories in order to make decisions. In the second part, I will discuss how to scalably gather data of humans interacting with the world, resulting in a new dataset of human interactions, VLOG, as well as what we can learn from this data.

**Bio**

David Fouhey is starting as an assistant professor at the University of Michigan in January 2019 and is currently a visitor at INRIA Paris. His research interests include computer vision and machine learning, with a particular focus on scene understanding. He received a Ph.D. in robotics in 2016 from Carnegie Mellon University where he was supported by NSF and NDSEG fellowships, and was then a postdoctoral fellow at UC Berkeley. He has spent time at the University of Oxford’s Visual Geometry Group and at Microsoft Research. More information is here: http://web.eecs.umich.edu/~fouhey/

#### Seminar No. 18

Yuriy Kaminskyi (Ukrainian Catholic University, Lviv, Ukraine): **Semantic segmentation for indoor localization**

Wednesday 2018-10-24 at 16:00, CIIRC IMPACT Room B-641, floor 6 of Building B

**Abstract**

The seminar will be a progress report on the ongoing indoor localization and navigation project. It will briefly cover the problem and its motivation. The main goal of the talk is to show different approaches to segmentation (both instance and semantic) and approaches that may help to improve the existing solutions. The talk will also cover different segmentation methods and present their results on the InLoc dataset.

#### Seminar No. 17

Viktor Larsson (Lund University, Sweden): **Orthographic-Perspective Epipolar Geometry, Optimal Trilateration and Non-Linear Variable Projection for Time-of-Arrival**

Thursday 2018-08-16 at 11:00, CIIRC Seminar Room B-670, floor 6 of Building B

**Abstract**

In this three-part talk, I will briefly discuss some current work in progress. The first topic relates to the epipolar geometry of one perspective camera and one orthographic camera, with applications in RADAR-to-Camera calibration. The second part is about position estimation using distances to known 3D points. Finally, I will discuss applying the variable projection method to the non-separable time-of-arrival problem. Preliminary experiments show greatly improved convergence compared to both joint and alternating optimization methods.

#### Seminar No. 16

Torsten Sattler (ETH Zurich, CH): **Challenges in Long-Term Visual Localization**

Tuesday 2018-07-24 at 11:00, CIIRC Seminar Room B-670, floor 6 of Building B

**Abstract**

Visual localization is the problem of estimating the position and orientation from which an image was taken with respect to a 3D model of a known scene. This problem has important applications, including autonomous vehicles (including self-driving cars and other robots) and augmented / mixed / virtual reality. While multiple solutions to the visual localization problem exist both in the Robotics and Computer Vision communities for accurate camera pose estimation, they typically assume that the scene does not change over time. However, this assumption is often invalid in practice, both in indoor and outdoor environments. This talk thus briefly discusses the challenges encountered when trying to localize images over a longer period of time. Next, we show how a combination of 3D scene geometry and higher-level scene understanding can help to enable visual localization in conditions where both classical and recently proposed learning-based approaches struggle.

**Bio**

Torsten Sattler received a PhD in Computer Science from RWTH Aachen University, Germany, in 2013 under the supervision of Prof. Bastian Leibe and Prof. Leif Kobbelt. In December 2013, he joined the Computer Vision and Geometry Group of Prof. Marc Pollefeys at ETH Zurich, Switzerland, where he is currently a senior researcher and Marc Pollefeys’ deputy while Prof. Pollefeys is on leave from ETH. His research interests include (large-scale) image-based localization using Structure-from-Motion point clouds, real-time localization and SLAM on mobile devices and for robotics, 3D mapping, Augmented & Virtual Reality, (multi-view) stereo, image retrieval and efficient spatial verification, camera calibration and pose estimation. Torsten has worked on dense sensing for self-driving cars as part of the V-Charge project. He is currently involved in enabling semantic SLAM and re-localization for gardening robots (as part of an EU Horizon 2020 project where he leads the efforts on a work package), research for Google’s Tango project, where he leads CVG’s research efforts, and in work on self-driving cars.

#### Seminar No. 15

Alexei Efros (UC Berkeley, CA, USA): **Self-supervision, Meta-supervision, Curiosity: Making Computers Study Harder**

Friday 2018-05-25 at 11:00, CIIRC Seminar Room A-1001, floor 10 of Building A

**Video**

https://www.youtube.com/watch?v=_V-WpE8cmpc

**Abstract**

Computer vision has made impressive gains through the use of deep learning models, trained with large-scale labeled data. However, labels require expertise and curation and are expensive to collect. Even worse, direct semantic supervision often leads the learning algorithms to “cheat” and take shortcuts, instead of actually doing the work. In this talk, I will briefly summarize several of my group’s efforts to combat this using self-supervision, meta-supervision, and curiosity, all ways of using the data as its own supervision. These lead to practical applications in image synthesis (such as pix2pix and cycleGAN), image forensics, audio-visual source separation, etc.

**Bio**

Alexei Efros is a professor of Electrical Engineering and Computer Sciences at UC Berkeley. Before 2013, he spent nine years on the faculty of Carnegie Mellon University, and he has also been affiliated with École Normale Supérieure/INRIA and the University of Oxford. His research is in the area of computer vision and computer graphics, especially at the intersection of the two. He is particularly interested in using data-driven techniques to tackle problems where large quantities of unlabeled visual data are readily available. Efros received his PhD in 2003 from UC Berkeley. He is a recipient of the Sloan Fellowship (2008), Guggenheim Fellowship (2008), SIGGRAPH Significant New Researcher Award (2010), 3 Helmholtz Test-of-Time Prizes (1999, 2003, 2005), and the ACM Prize in Computing (2016).

**Web**

https://www.ciirc.cvut.cz/alexei-efros-na-ciirc/

#### Seminar No. 14

Di Meng (University of Burgundy, France): **Cameras behind glass – polynomial constraints on image projection**

Tuesday 2018-05-22 at 11:00, CIIRC Seminar Room B-670, floor 6 of Building B

**Abstract**

Advanced Driver Assistance Systems (ADAS) are used in autonomous and self-driving cars. Cameras in these systems are commonly used for functions such as pedestrian detection, guideboard detection, or obstacle avoidance. The cameras are calibrated, but they are mounted inside the car behind the windshield. The windshield refracts light rays, which can cause disparity when mapping points in space onto pixels in the image. A small disparity at the pixel level translates into large errors several meters away. We therefore model the camera together with three windshield shapes to obtain a more precise camera calibration for automotive applications.

The talk is divided into two main sections. The first section presents the optical model of three types of glass slab: a planar slab, a slab with non-parallel surfaces, and a spherical slab. Polynomial projection equations are formulated for each. The second section takes the camera parameters into account and describes how to map points in space onto pixels in the image with the windshield in between.

#### Seminar No. 13

Oles Dobosevych (Ukrainian Catholic University, Lviv, Ukraine): **On Ukrainian characters MNIST dataset**

Friday 2018-02-23 at 14:00, CIIRC Seminar Room B-671, floor 6 of Building B

**Abstract**

The Modified National Institute of Standards and Technology dataset (MNIST dataset) of handwritten digits is the best-known dataset and is widely used as a benchmark for validating various ideas in Machine Learning. We present a newly created dataset of 32 handwritten Ukrainian letters, which is divided into 72 different style subclasses, with 2000 examples in each class. We also suggest a recognition model for these symbols and explain why approaches that work well for the MNIST dataset do not succeed in our case. Finally, we discuss several real-world applications of our model that can help to save paper, time and money.

#### Seminar No. 12

**AIME@CZ – Czech Workshop on Applied Mathematics in Engineering**

Organized by Didier Henrion and Tomáš Pajdla

**Thursday 2018-02-22, 9:45-18:00**

CIIRC Seminar Room **B-670 (9:45-13:30) and B-671 (14:30-16:00)**, floor 6 of Building B

- 9:45-10:30 Francis Bach (Inria/ENS Paris, FR) **Linearly-convergent stochastic gradient algorithms**
- 11:00-11:45 Pierre-Yves Massé **Online Optimisation of Time Varying Systems**
- 12:15-13:00 Nicolas Mansard (LAAS-CNRS Univ. Toulouse, FR) **Why we need a memory of movement**
- 14:30-15:00 Josef Šivic (Inria/ENS Paris, FR and CIIRC CTU Prague, CZ) **Joint Discovery of Object States and Manipulation Actions**
- 15:15-15:45 Tomáš Werner (CTU Prague, CZ) **Solving LP Relaxations of Some Hard Problems Is Hard**
- 16:30-18:00 *Demos and visit of CIIRC*

Title: Linearly-convergent stochastic gradient algorithms

Speaker: Francis Bach

Title: Online Optimisation of Time Varying Systems

Speaker: Pierre-Yves Massé

Abstract: Dynamical systems are a wide ranging framework which may model time varying settings, spanning from engineering (e.g., cars) to machine learning (e.g., recurrent neural networks), for instance. The correct behaviour of these systems is often dependent on the choice of a parameter (e.g., the gear ratio or the wheel in the case of cars, or the weights in the case of neural networks) which the user has to choose. Finding the best possible parameter is called optimising, or training, the system.

Many real life issues require this training to occur online, with immediate processing of the inputs received by the system (e.g., the returns about the surroundings of the sensors of a car, or the successive frames of a video fed to a neural network). We present a proof of convergence for classical online optimisation algorithms used to train these systems, such as the “Real Time Recurrent Learning” (RTRL) or “Truncated Backpropagation Through Time” (TBTT) algorithms. These algorithms avoid time consuming computations by storing information about the past, in the form of a time dependent tensor. However, the memory required to do so may be huge, preventing their use on even moderately large systems. The “No Back Track” (NBT) algorithm, and its implementation friendly “Unbiased Online Recurrent Optimisation” (UORO) variant, are general principle algorithms which approximate the aforementioned tensor by a random, rank-one, unbiased tensor, thus decisively reducing the storage costs but preserving the crucial unbiasedness property allowing convergence. We prove that, with arbitrarily large probability, the NBT algorithm converges to the same local optimum as the RTRL or TBTT algorithms. We might conclude by quickly presenting the “Learning the Learning Rate” (LLR) algorithm, which adapts online the step size of a gradient descent, by conducting a gradient descent on this very step size. It thus reduces the sensitivity of the descent to the numerical choice of the step size, which is a well documented practical implementation issue.

Title: Why we need a memory of movement

Speaker: Nicolas Mansard

Title: Joint Discovery of Object States and Manipulation Actions

Speaker: Josef Šivic (Inria/ENS Paris, FR and CIIRC CTU Prague, CZ)

Abstract: Many human activities involve object manipulations aiming to modify the object state. Examples of common state changes include full/empty bottle, open/closed door, and attached/detached car wheel. In this work, we seek to automatically discover the states of objects and the associated manipulation actions. Given a set of videos for a particular task, we propose a joint model that learns to identify object states and to localize state-modifying actions. Our model is formulated as a discriminative clustering cost with constraints. We assume a consistent temporal order for the changes in object states and manipulation actions, and introduce new optimization techniques to learn model parameters without additional supervision. We demonstrate successful discovery of manipulation actions and corresponding object states on a new dataset of videos depicting real-life object manipulations. We show that our joint formulation results in an improvement of object state discovery by action recognition and vice versa. Joint work with Jean-Baptiste Alayrac, Ivan Laptev and Simon Lacoste-Julien.

Title: Solving LP Relaxations of Some Hard Problems Is Hard

Speaker: Tomáš Werner (CTU Prague, CZ)

Abstract: I will present our result that solving linear programming (LP) relaxations of a number of classical NP-hard combinatorial optimization problems (set cover/packing, facility location, maximum satisfiability, maximum independent set, multiway cut, 3-D matching, weighted CSP) is as hard as solving the general LP problem. Precisely, these LP relaxations are LP-complete under (nearly) linear-time reductions, assuming sparse encoding of instances. In polyhedral terms, this means that every polytope is a scaled coordinate projection of the optimal set of each LP relaxation, computable in (nearly) linear time. For some of the LP relaxations (exact cover, 3-D matching, weighted CSP), a stronger result holds: every polytope is a scaled coordinate projection of their feasible set, which implies that the corresponding reduction is approximation-preserving. Besides, the considered LP relaxations are P-complete under log-space reductions, therefore also hard to parallelize. These results pose a fundamental limitation on designing very efficient algorithms to compute exact or even approximate solutions to the LP relaxations, because finding such an algorithm might improve the complexity of the best known general LP solvers, which is unlikely. Joint work with Daniel Prusa.

#### Seminar R4I No. 1

Federica Arrigoni (University of Udine, Italy): **Synchronization Problems in Computer Vision**

Thursday 2017-12-07 at 11:00, CIIRC Seminar Room B-670, floor 6 of Building B

**Abstract**

Consider a network of nodes where each node is characterized by an unknown state, and suppose that pairs of nodes can measure the ratio (or difference) between their states. The goal of “synchronization” is to infer the unknown states from the pairwise measures. Typically, states are represented by elements of a group, such as the Symmetric Group or the Special Euclidean Group. The former can represent local labels of a set of features, which refer to the multi-view matching application, whereas the latter can represent camera reference frames, in which case we are in the context of structure from motion, or local coordinates where 3D points are represented, in which case we are dealing with multiple point-set registration. A related problem is that of “bearing-only network localization” where each node is located at a fixed (unknown) position in 3-space and pairs of nodes can measure the direction of the line joining their locations. We are interested in global techniques where all the measures are considered at once, as opposed to incremental approaches that grow a solution by adding pieces iteratively.
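As a rough illustration of the synchronization idea described in the abstract, the following minimal sketch treats the simplest possible case: states are real numbers (the additive group of reals) and each measurement is a difference between two states. This is not the speaker's method. The function name `synchronize`, the measurement format, and the choice of a Gauss-Seidel least-squares solver are all assumptions made for this example. It is global in the sense that every measurement enters the same least-squares problem.

```python
def synchronize(n, measurements, iters=500):
    """Estimate scalar states x_0..x_{n-1} from pairwise differences.

    measurements: list of (i, j, z) with z approximately x_i - x_j.
    States are only defined up to a common offset, so x_0 is fixed to 0.
    Hypothetical helper for illustration; assumes a connected graph.
    """
    # For each node, store (neighbour, offset) so that x_node ~ x_neighbour + offset.
    nbrs = [[] for _ in range(n)]
    for i, j, z in measurements:
        nbrs[i].append((j, z))
        nbrs[j].append((i, -z))

    x = [0.0] * n
    # Gauss-Seidel sweeps on the least-squares normal equations:
    # each free node moves to the mean of its neighbours' predictions.
    for _ in range(iters):
        for i in range(1, n):  # node 0 stays fixed at 0
            if nbrs[i]:
                x[i] = sum(x[j] + z for j, z in nbrs[i]) / len(nbrs[i])
    return x


# Consistent measurements generated from the states 0, 1, 3, 6.
edges = [(1, 0, 1.0), (2, 1, 2.0), (3, 2, 3.0), (3, 0, 6.0), (2, 0, 3.0)]
print([round(v, 3) for v in synchronize(4, edges)])
```

With noisy measurements the same routine returns the least-squares compromise between all of them. For rotations or rigid motions, the averaging step would have to be replaced by an operation suited to the corresponding group.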

#### Seminar No. 11

Antonín Šulc (University of Konstanz, Germany): **Lightfield Analysis for non-Lambertian Scenes**

Thursday 2017-11-23 at 11:00, CIIRC Seminar Room B-670, floor 6 of Building B

**Abstract**

In most natural scenes, we can see objects composed of non-Lambertian materials, whose appearance changes if we change viewpoint. In many computer vision tasks, we consider these as something undesirable and treat them as outliers. However, if we can record data from multiple dense viewpoints, such as with a light-field camera, we have a chance to not only deal with them but also extract additional information about the scene.

In this talk, I will show the capabilities of the light-field paradigm on various problems. Key ideas are a linear algorithm for structure from motion to generate refocusable panoramas and depth estimation for multi-layered objects which are semitransparent or partially reflective. Using these, I will show that we can decompose such scenes and further perform a robust volumetric reconstruction. Finally, I will consider decomposition of light fields into reflectance, natural illumination and geometry, a problem known as inverse rendering.

#### Seminar No. 10

Jana Košecká (George Mason University, Fairfax, VA, USA): **Semantic Understanding for Robot Perception**

Monday 2017-10-30 at 16:00, Czech Technical University, Karlovo namesti, G-205

**Abstract**

Advancements in robotic navigation, mapping, object search and recognition rest to a large extent on robust, efficient and scalable semantic understanding of the surrounding environment. In recent years we have developed several approaches for capturing the geometry and semantics of the environment from video, RGB-D data, or simply a single RGB image, focusing on indoor and outdoor environments relevant for robotics applications.

I will demonstrate our work on detailed semantic parsing and 3D structure recovery using deep convolutional neural networks (CNNs), and on object detection and object pose recovery from a single RGB image. The applicability of the presented techniques for autonomous driving, service robotics, mapping and augmented reality applications will be discussed.

#### Seminar No. 9

Mircea Cimpoi (CIIRC, Czech Technical University, Prague): **Deep Filter Banks for Texture Recognition**

Thursday 2017-10-19 at 11:00, CIIRC Seminar Room A-303, floor 3 of Building A

**Abstract**

This talk will be about texture and material recognition from images, and revisiting classical texture representations in the context of deep learning. The results were presented at CVPR 2015 and in IJCV 2016. Visual textures are ubiquitous and play an important role in image understanding because they convey significant semantics of images, and because texture representations that pool local image descriptors in an order-less manner have had a tremendous impact in various practical applications. In the talk, we will revisit classic texture representations, including bag-of-visual-words and the Fisher vectors, in the context of deep learning and show that these have excellent efficiency and generalization properties if the convolutional layers of a deep model are used as filter banks. In this manner we obtain state-of-the-art performance on numerous datasets well beyond textures, an efficient method to apply deep features to image regions, as well as benefits in transferring features from one domain to another.

**References**

[1] Cimpoi, M., Maji, S., Kokkinos, I., Vedaldi, A., Deep Filter Banks for Texture Recognition, Description, and Segmentation, IJCV (2016) 118:65

[2] Cimpoi, M., Maji, S., and Vedaldi, A., Deep Filter Banks for Texture Recognition and Segmentation, CVPR (2015)

#### Seminar No. 8

Wolfgang Förstner (University of Bonn, Germany): **Evaluation of Estimation Results within Structure from Motion Problems**

Wednesday 2017-10-18 at 11:00, CIIRC Seminar Room B-670, floor 6 of Building B

**Abstract**

Parameter estimation is the core of many geometric problems within structure from motion. The number of parameters ranges from a few, e.g., for pose estimation or triangulation, to huge numbers, such as in bundle adjustment or surface reconstruction. The ability for planning, self-diagnosis, and evaluation is critical for successful project management. Uncertainty of observed and estimated quantities needs to be available, faithful, and realistic. The talk presents methods (1) to critically check the faithfulness of the result of estimation procedures, (2) to evaluate suboptimal estimation procedures, and (3) to evaluate and compare competing procedures w.r.t. their precision in the presence of rank deficiencies. Evaluating bundle adjustment results is taken as one example problem.

**References**

[1] T. Dickscheid, T. Läbe, and W. Förstner, Benchmarking Automatic Bundle Adjustment Results, in 21st Congress of the International Society for Photogrammetry and Remote Sensing (ISPRS), Beijing, China, 2008, p. 7–12, Part B3a.

[2] W. Förstner and K. Khoshelham, Efficient and Accurate Registration of Point Clouds with Plane to Plane Correspondences, in 3rd International Workshop on Recovering 6D Object Pose, 2017.

[3] W. Förstner and B. P. Wrobel, Photogrammetric Computer Vision — Statistics, Geometry, Orientation and Reconstruction, Springer, 2016.

[4] T. Läbe, T. Dickscheid, and W. Förstner, On the Quality of Automatic Relative Orientation Procedures, in 21st Congress of the International Society for Photogrammetry and Remote Sensing (ISPRS), Beijing, China, 2008, p. 37–42 Part B3b-1.

#### Seminar No. 7

Ludovic Magerand (CIIRC, Czech Technical University, Prague): **Projective Structure-from-Motion and Rolling Shutter Pose Estimation**

Tuesday 2017-10-17 at 11:00, CIIRC Seminar Room A-303, floor 3 of Building A

**Abstract**

This talk is divided into two parts. The first part presents an ICCV’17 paper about a practical solution to the Projective Structure from Motion (PSfM) problem that deals efficiently with missing data (up to 98%), outliers and, for the first time, large scale 3D reconstruction scenarios. This is achieved by embedding the projective depths into the projective parameters of the points and views to improve computational speed. To do so and to ensure a valid reconstruction, an extension of the linear constraints from the Generalized Projective Reconstruction Theorem is used. With an incremental approach, views and points are added robustly to an initial solvable sub-problem until completion of the underlying factorization.

The second part of the talk will present my PhD thesis “Dynamic pose estimation with CMOS cameras using sequential acquisition”. CMOS cameras are cheap and can acquire images at very high frame rates thanks to an acquisition mode called Rolling Shutter, which exposes the scan-lines sequentially. This makes them very interesting in the context of very high-speed robotics, but it comes with what was long seen as a drawback: when an object (or the camera itself) moves in the scene, distortions appear in the image. These rolling shutter effects actually contain information on the motion and can become another advantage for high-speed robotics by extending the usual pose estimation to also estimate the motion parameters. Two methods achieving this will be presented: the first assumes a non-uniform motion model, and the second uses a projection model suitable for polynomial optimization.

#### Seminar No. 6

Akihiro Sugimoto (National Institute of Informatics (NII), Japan): **Deeply Supervised 3D Recurrent FCN for Salient Object Detection in Videos**

Monday 2017-09-25 at 11:00, CIIRC Seminar Room B-670, floor 6 of Building B

**Abstract**

This talk presents a novel end-to-end 3D fully convolutional network for salient object detection in videos. The proposed network uses 3D filters in the spatio-temporal domain to directly learn both spatial and temporal information and obtain 3D deep features, and transfers the 3D deep features to pixel-level saliency prediction, outputting saliency voxels. In the network, we combine refinement at each layer with deep supervision to efficiently and accurately detect salient object boundaries. The refinement module recurrently enhances the feature map with learned contextual information. Applying deeply-supervised learning to hidden layers, on the other hand, improves details of the intermediate saliency voxel, and thus the saliency voxel is progressively refined to become finer and finer. Intensive experiments using publicly available benchmark datasets confirm that our network outperforms state-of-the-art methods. The proposed saliency model also works effectively for video object segmentation.

#### Seminar No. 5

Viktor Larsson (Lund University, Sweden): **Building Polynomial Solvers for Computer Vision Applications**

Thursday 2017-08-31 at 11:00, CIIRC Seminar Room B-670, floor 6 of Building B

**Abstract**

In the first part of the talk, I will give a brief overview of how polynomial equation systems are typically solved in Computer Vision. These equation systems often come from minimal problems, which are fundamental building blocks in most Structure-from-Motion pipelines.

In the second part, I will present two recent papers on methods for constructing polynomial solvers. The first paper is about automatically generating the so-called elimination templates. The second paper extends the method to also handle saturated ideals. This essentially allows us to add constraints that some polynomials should be non-zero. Both papers are joint work with Kalle Åström and Magnus Oskarsson.

**References**

[1] Larsson V., Åström K., Oskarsson M., Efficient Solvers for Minimal Problems by Syzygy-Based Reduction, CVPR, 2017. [http://www.maths.lth.se/matematiklth/personal/viktorl/papers/larsson2017efficient.pdf]

[2] Larsson V., Åström K., Oskarsson M., Polynomial Solvers for Saturated Ideals, ICCV, 2017.

#### Seminar No. 4

Tomas Mikolov (Facebook AI Research): **Neural Networks for Natural Language Processing**

Friday 2017-08-25 at 11:00, CIIRC Seminar Room B-670, floor 6 of Building B

**Abstract**

Artificial neural networks are currently very successful in various machine learning tasks that involve natural language. In this talk, I will describe recurrent neural network language models, as well as their most frequent applications to speech recognition and machine translation. I will also talk about distributed word representations, their interesting properties, and efficient ways to compute them and use them in tasks such as text classification. Finally, I will describe our latest efforts to create a novel dataset that could be used to develop machines that can truly communicate with human users in natural language.

**Short bio**:

Tomáš Mikolov has been a research scientist at the Facebook AI Research group since 2014. Previously, he was a member of the Google Brain team, where he developed and implemented efficient algorithms for computing distributed representations of words (the word2vec project). He obtained his PhD from the Brno University of Technology in 2012 for his work on recurrent neural network-based language models (RNNLM). His long term research goal is to develop intelligent machines that people can communicate with and use to accomplish complex tasks.

#### Seminar No. 3

**Workshop on learnable representations for geometric matching**

Monday 2017-08-21, 14:00-18:00

CIIRC Seminar Room B-670, floor 6 of Building B

- 14:00-15:00 Torsten Sattler (ETH Zurich) Hard Matching Problems in 3D Vision
- 15:00-15:20 Coffee break, discussion
- 15:20-16:20 Eric Brachmann (TU Dresden) Scene Coordinate Regression: From Random Forests to End-to-End Learning
- 16:20-16:40 Coffee break, discussion
- 16:40-17:40 Ignacio Rocco (Inria) Convolutional neural network architecture for geometric matching
- 17:40-18:00 Discussion

Speaker: Torsten Sattler, ETH Zurich

Title: Hard Matching Problems in 3D Vision

Abstract: Estimating correspondences, i.e., data association, is a fundamental step of each 3D Computer Vision pipeline. For example, 2D-3D matches between pixels in an image and 3D points in a 3D scene model are required for camera pose computation and thus for visual localization. Existing approaches for correspondence estimation, e.g., based on local image descriptors such as SIFT, have been shown to work well for a range of viewing conditions. Still, existing solutions are rather limited in challenging scenes. This talk will focus on data association in challenging scenarios. We first discuss the impact of day-night changes on visual localization, demonstrating that state-of-the-art algorithms perform considerably worse than in the day-day scenario typically considered in the literature. Next, we discuss ongoing work aiming at boosting the performance of local descriptors in this scenario via a dense-sparse feature detection and matching pipeline. A key idea in this work is to use pre-trained convolutional neural networks to obtain descriptors that contain mid-level semantic information compared to the low-level information utilized by SIFT. Based on the intuition that semantic information provides a higher form of invariance, the second part of the talk considers exploiting semantic (image) segmentations in the context of visual localization and visual SLAM.

Speaker: Eric Brachmann, TU Dresden

Title: Scene Coordinate Regression: From Random Forests to End-to-End Learning

Abstract: For decades, estimation of accurate 6D camera poses relied on hand-crafted sparse feature pipelines and geometric processing. Motivated by recent successes, some authors ask whether camera localization can be cast as a learning problem. Despite some success, the accuracy of unconstrained CNN architectures trained for this task is still inferior to traditional approaches. In this talk, we discuss an alternative line of research, which tries to combine geometric processing with constrained machine learning in the form of scene coordinate regression. We discuss how random forests or CNNs can be trained to substitute sparse feature detection and matching. Furthermore, we show how to train camera localization pipelines end-to-end using a novel, differentiable formulation of RANSAC. We will close the talk with some thoughts about open problems in learning camera localization.

Speaker: Ignacio Rocco, Inria Paris

Title: Convolutional neural network architecture for geometric matching

Abstract: We address the problem of determining correspondences between two images in agreement with a geometric model such as an affine or thin-plate spline transformation, and estimating its parameters. The contributions of this work are three-fold. First, we propose a convolutional neural network architecture for geometric matching. The architecture is based on three main components that mimic the standard steps of feature extraction, matching and simultaneous inlier detection and model parameter estimation, while being trainable end-to-end. Second, we demonstrate that the network parameters can be trained from synthetically generated imagery without the need for manual annotation and that our matching layer significantly increases generalization capabilities to never-before-seen images. Finally, we show that the same model can perform both instance-level and category-level matching, giving state-of-the-art results on the challenging Proposal Flow dataset.

#### Seminar No. 2

Joe Kileel (Princeton)

**Using Computational Algebra for Computer Vision**

Wednesday 2017-06-07 at 15:00

CIIRC Seminar Room B-670, floor 6 of Building B

**Abstract**

Scene reconstruction is a fundamental task in computer vision: given multiple images from different angles, create a 3D model of a world scene. Nowadays self-driving cars need to do 3D reconstruction in real time, to navigate their surroundings. Large-scale photo-tourism is also a popular application. In this talk, we will explain how key subroutines in reconstruction algorithms amount to solving polynomial systems, with special geometric structure. We will answer a question of Sameer Agarwal (Google Research) about recovering the motion of two calibrated cameras. Next, we will quantify the “algebraic complexity” of polynomial systems arising from three calibrated cameras. In terms of multi-view geometry, we deal with essential matrices and trifocal tensors. The first part applies tools like resultants from algebra, while the second part will offer an introduction to numerical homotopy continuation methods. Those wondering “if algebraic geometry is good for anything practical” are especially encouraged to attend.

**References**

[1] G. Floystad, J. Kileel, G. Ottaviani: “The Chow form of the essential variety in computer vision,” J. Symbolic Comput., to appear. [https://arxiv.org/pdf/1604.04372]

[2] J. Kileel: “Minimal problems for the calibrated trifocal variety,” SIAM Appl. Alg. Geom., to appear. [https://arxiv.org/pdf/1611.05947]

#### Seminar No. 1

Torsten Sattler (ETH Zurich)

**Camera Localization**

Thursday 2017-05-11 at 11:00

CIIRC Lecture Hall A1001 of Building A (Jugoslavskych partyzanu 3)

**Abstract**

Estimating the position and orientation of a camera in a scene based on images is an essential part of many (3D) Computer Vision and Robotics algorithms such as Structure-from-Motion, Simultaneous Localization and Mapping (SLAM), and visual localization. Camera localization has applications in navigation for autonomous vehicles/robots, Augmented and Virtual Reality, and 3D mapping. Furthermore, there are strong relations to camera calibration and visual place recognition. In this talk, I will give an overview of past and current efforts on robust, efficient, and accurate camera localization. I will begin the talk by showing that classical localization approaches haven’t been made obsolete by deep learning. Following a local feature-based approach, the talk will discuss how to adapt such methods for real-time visual localization on mobile devices with limited computational capabilities and approaches that scale to large (city-scale) scenes, including the challenges encountered at large scale. The final part of the talk will discuss open problems in the areas of camera localization and 3D mapping, both in terms of problems we are currently working on as well as interesting long-term goals.

**Short bio**:

Torsten Sattler received a PhD in Computer Science from RWTH Aachen University, Germany, in 2013 under the supervision of Prof. Bastian Leibe and Prof. Leif Kobbelt. In December 2013, he joined the Computer Vision and Geometry Group of Prof. Marc Pollefeys at ETH Zurich, Switzerland, where he is currently a senior researcher and Marc Pollefeys’ deputy while Prof. Pollefeys is on leave from ETH. His research interests include (large-scale) image-based localization using Structure-from-Motion point clouds, real-time localization and SLAM on mobile devices and for robotics, 3D mapping, Augmented & Virtual Reality, (multi-view) stereo, image retrieval and efficient spatial verification, camera calibration and pose estimation. Torsten has worked on dense sensing for self-driving cars as part of the V-Charge project. He is currently involved in enabling semantic SLAM and re-localization for gardening robots (as part of an EU Horizon 2020 project where he leads the efforts on a work package), research for Google’s Tango project, where he leads CVG’s research efforts, and in work on self-driving cars.