What Is Visual SLAM? (Hint: It’s Not Just a Tech in Roomba Vacuums)

April 16, 2020 by Nicholas St. John

What are the origins of visual SLAM, and what are some other applications for this technology beyond floor cleaning?

One of the honorees at CES this year was one of iRobot's many Roombas (specifically, the s9+ Vacuum & Clean Base Automatic Dirt Disposal). A unique technological feat of Roomba vacuums is their use of vSLAM, or visual simultaneous localization and mapping.

According to iRobot, this technology captures 230,400 data points per second using optical sensors. This enables the roving vacuum to create a map of its surroundings, including its own position in that environment, and chart "where it is, where it's been, and where it needs to clean."



The Roomba s9+ uses iRobot's patented vSLAM technology. Image used courtesy of iRobot

But what are the origins of visual SLAM, and what are some other applications for this technology beyond floor cleaning?


What Is Visual SLAM?

Generally, SLAM is a technology in which sensors are used to map a device’s surrounding area while simultaneously locating the device within that area. Sonar and laser imaging are two examples of sensing methods used for SLAM.

Unlike LiDAR, which uses an array of lasers to map an area, visual SLAM uses a single camera to collect data points and create a map. Makhubela et al., who conducted a review on visual SLAM, explain that the single vision sensor can be a monocular, stereo vision, omnidirectional, or Red Green Blue Depth (RGBD) camera.

There is no single algorithm for performing visual SLAM. According to AIA, the world's largest machine vision trade association, the technology uses 3D vision for location mapping when both the location of the sensor and its broader surroundings are unknown.


3D Face Reconstructions and Drone Vision

While Makhubela et al. believe this technology is still in its infancy, visual SLAM has already made its way into a few interesting use cases.

One exciting development for visual SLAM comes out of Carnegie Mellon's Robotics Institute, which developed a two-step method for creating 3D face reconstructions from smartphone video. The first step uses visual SLAM to triangulate points on the surface of the face while also using that information to identify the camera's position. Then, researchers use deep learning algorithms to fill in the gaps of the person's profile and facial landmarks (eyes, ears, and nose).



Researchers say this method could build avatars for gaming or create customized surgical masks or respirators. Image (modified) used courtesy of Carnegie Mellon University

Another application for visual SLAM is Dragonfly, software created by Accuware. Accuware has its own patented method of visual SLAM intended for 3D location in robots and drones, touted as having 5-cm accuracy in its location mapping. One drawback, however, is that the software requires a minimum of 16 GB of RAM, much of which goes toward the processing engine that turns camera data into a map.

Accuware has expressed that they see a future for visual SLAM in autonomous vehicles and in autonomous robots and drones for delivery as well as search and rescue.



A quick history lesson on the two most popular visual SLAM iterations, MonoSLAM and PTAM: MonoSLAM, a real-time single-camera SLAM system, was the first implementation of vSLAM, created by researchers Davison et al. in 2007.

Since then, PTAM, or parallel tracking and mapping (introduced by Klein and Murray and later surveyed by Taketomi et al.), has expanded on the technology. While the basic premise of both MonoSLAM and PTAM is similar, they differ in some important ways.


How Do MonoSLAM and PTAM Work?

First, both MonoSLAM and PTAM must initialize a map. In MonoSLAM, this is achieved by using a known object as the first data point. This point allows the device to calibrate and scale its measurements based upon the object’s known parameters. PTAM, on the other hand, initializes its map using something called “the five-point algorithm,” a process that estimates the relative camera motion between two views from five matched image points, from which an initial map can be built.
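To make the idea of scaling from a known object concrete, here is a minimal sketch using the pinhole camera model. It is a simplification for illustration, not MonoSLAM's actual initialization routine; the function name and all numbers (focal length, target width, pixel width) are hypothetical.

```python
def depth_from_known_width(focal_px: float, real_width_m: float, width_px: float) -> float:
    """Distance to an object of known physical width under the pinhole model.

    Similar triangles give: width_px / focal_px = real_width_m / depth.
    """
    if width_px <= 0:
        raise ValueError("width_px must be positive")
    return focal_px * real_width_m / width_px


# Hypothetical values: a 0.21 m wide target appears 126 px wide
# to a camera whose focal length is 600 px.
depth_m = depth_from_known_width(600.0, 0.21, 126.0)
print(f"estimated distance to target: {depth_m:.2f} m")
```

Because the object's real size is known, the result is in meters. A single camera with no such reference can only recover the scene up to an unknown scale factor.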

Next, both visual SLAM technologies perform tracking and localization—and this is where the real magic happens. In MonoSLAM, the technology uses a mathematical process called an Extended Kalman Filter to estimate camera motion and find 3D coordinates of “feature points,” which are 3D structures and objects recorded on the map.
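For readers unfamiliar with the Extended Kalman Filter, the following toy sketch shows its predict/update cycle. It is far simpler than MonoSLAM, which tracks a full 3D camera pose plus every feature point. Here, as an assumption for illustration, the state is just a camera's position along a straight line, and the measurement is the bearing angle to one landmark at a known location. All names and noise values are hypothetical.

```python
import math


def ekf_step(p, P, u, z, landmark, Q, R):
    """One predict/update cycle for a scalar camera position p with variance P."""
    lx, ly = landmark
    # Predict: move by the commanded displacement u; uncertainty grows by Q.
    p_pred = p + u
    P_pred = P + Q
    # Nonlinear measurement model: bearing from the camera to the landmark.
    dx = lx - p_pred
    z_pred = math.atan2(ly, dx)
    # Jacobian of the bearing with respect to p, evaluated at the prediction.
    H = ly / (dx * dx + ly * ly)
    # Update: blend prediction and measurement using the Kalman gain.
    S = H * P_pred * H + R
    K = P_pred * H / S
    p_new = p_pred + K * (z - z_pred)
    P_new = (1.0 - K * H) * P_pred
    return p_new, P_new


landmark = (5.0, 2.0)          # known landmark position (hypothetical)
true_p = 0.0                   # actual camera position
est_p, est_var = 1.0, 1.0      # deliberately wrong initial guess
for _ in range(30):
    true_p += 0.1
    # Noise-free bearing measurement, to keep the example deterministic.
    z = math.atan2(landmark[1], landmark[0] - true_p)
    est_p, est_var = ekf_step(est_p, est_var, 0.1, z, landmark, Q=0.01, R=1e-4)

print(f"true position: {true_p:.3f}, estimate: {est_p:.3f}")
```

The "extended" part is the Jacobian `H`: because the bearing is a nonlinear function of position, the filter linearizes it around the current prediction at every step.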



Map created by Accuware's Dragonfly. Image used courtesy of Accuware

PTAM estimates the camera position by matching feature points between the current map and the most recent input image from the camera. It then computes the 3D positions of the points using triangulation and optimizes them using a bundling algorithm, usually called bundle adjustment. AIA describes how the bundling algorithm utilizes Monte Carlo analysis to find an average location out of multiple data points. More commonly, bundle adjustment is described as a nonlinear least-squares problem that minimizes reprojection error across many camera views.

As Taketomi et al. explain, PTAM, as well as many of the later implementations of visual SLAM, optimizes the camera location and the map of the surroundings using relocalization and global map optimization.
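The triangulation step can be illustrated with a short sketch. It uses the midpoint method, which is one simple way to triangulate and not necessarily the one PTAM uses: given two camera centers and the viewing rays toward the same feature, it finds the point halfway between the rays where they pass closest to each other. The function names and coordinates are hypothetical.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def triangulate_midpoint(c1, d1, c2, d2):
    """Midpoint of the shortest segment between rays c1 + s*d1 and c2 + t*d2."""
    w0 = [p - q for p, q in zip(c1, c2)]
    a, b, c = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    d, e = dot(d1, w0), dot(d2, w0)
    denom = a * c - b * b
    if abs(denom) < 1e-12:
        raise ValueError("rays are parallel; no unique triangulation")
    # Ray parameters that minimize the distance between the two rays.
    s = (b * e - c * d) / denom
    t = (a * e - b * d) / denom
    p1 = [ci + s * di for ci, di in zip(c1, d1)]
    p2 = [ci + t * di for ci, di in zip(c2, d2)]
    return [(u + v) / 2.0 for u, v in zip(p1, p2)]


# Two camera centers one unit apart, both looking at the same feature.
feature = [1.0, 2.0, 5.0]
cam1, cam2 = [0.0, 0.0, 0.0], [1.0, 0.0, 0.0]
ray1 = [f - c for f, c in zip(feature, cam1)]
ray2 = [f - c for f, c in zip(feature, cam2)]
point = triangulate_midpoint(cam1, ray1, cam2, ray2)
print(point)
```

With perfect rays the two lines intersect exactly at the feature. With noisy image measurements they generally do not, which is why a later optimization pass over many views is needed to refine both the points and the camera poses.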


Challenges of Visual SLAM: Motion and Light

While visual SLAM shows promise in robotics, research shows that the technology has several major issues.

A big one is its limited ability to deal with a dynamic environment. Visual SLAM must operate in real time. But with only a single camera, visual SLAM does not afford a 360-degree view, Makhubela et al. explain. This means that the system must work at an exceptionally high speed to catch environmental changes and cover the entire viewing area in a short amount of time.

Makhubela et al. assert that these dynamic limitations have given rise to collaborative SLAM (CoSLAM), which uses multiple cameras to perform visual SLAM. CoSLAM fixes the issue of a restricted viewing area but increases the processing burden. As a result, a more powerful computer is required to keep the system operating in real time.

Light variance is another issue Makhubela et al. cite. Namely, reflective surfaces and light changes between indoor and outdoor environments can obstruct data points. In other words, a house of mirrors is not a place where visual SLAM will comfortably maneuver.


What's Your Take on SLAM?

While visual SLAM shows promise in a number of fields, especially in drone and robotic design, it still has the issues of dynamic motion and light to overcome. But we want to hear about your experience with SLAM technologies in general.

What are the virtues and limitations of this technology? Share your thoughts in the comments below.