
Simultaneous localization & mapping: a Visual SLAM tutorial

From self-driving cars to augmented reality, Visual SLAM algorithms simultaneously build 3D maps while tracking the location and orientation of the camera. This article provides an overview of the concept and of the currently used systems.


SLAM stands for Simultaneous Localization And Mapping. As the name suggests, a SLAM system performs two main functions: localization, i.e. estimating, exactly or roughly (depending on the accuracy of the algorithm), where the vehicle is in an indoor or outdoor area, and mapping, i.e. building a 2D/3D model of the scene while navigating through it. Visual SLAM applications have increased drastically as many new datasets have become available in the cloud and as hardware capability and computational power have grown. Applications of visual SLAM include 3D scanning, augmented reality, and autonomous vehicles, among many others. This article provides an introduction to the core concepts underlying current SLAM systems.


Visual SLAM

You may find it easy to recognize and identify everything in your surrounding environment, including estimating depth. In a tricky picture it can be hard to tell which building stands in front of the other, yet in real life you will know immediately, because you have a pair of eyes and do not receive your input from a single 2D image.

Extracting information from a 3D scene, challenging as it may be, is required in many robotics applications. Self-driving cars, for instance, require a comprehensive real-time understanding of their surroundings. An indoor robot needs to know both what something is and where it is located just to achieve a simple fetching task.

Convolutional Neural Networks (CNNs) gave the field a huge push in semantically analyzing a scene in 2D. Fusing this with SLAM, an autonomous vehicle can locate itself and recognize objects simultaneously; a task like "move the chair behind the nearest desk" could be accomplished accurately. However, scaled sensors such as stereo or RGB-D cameras only provide reliable measurements within a limited range, and errors occur if you move the setup from an indoor to an outdoor area.

Real-time monocular Simultaneous Localization and Mapping (SLAM) and 3D reconstruction have become a hot topic due to their low hardware cost and robustness; they have also proved very reliable despite their low complexity.


Visual SLAM is divided into 2 main categories:

  • Feature-based SLAM
  • Direct SLAM

Both start from the input images. Direct SLAM uses the raw images themselves for all later processing, while feature-based SLAM first extracts and matches features from the raw images using techniques such as SIFT and SURF. These techniques are beyond the scope of this article, but resources for further reading are provided at the end.

After extracting the main features, feature-based SLAM uses them to abstract the raw image down to its key observations only, and uses this abstraction to perform two main tasks back and forth: tracking and mapping (localization and mapping). Direct SLAM performs the same two processes on the raw images themselves.
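The matching step in feature-based SLAM can be sketched as a nearest-neighbour search over descriptors plus Lowe's ratio test, which keeps only unambiguous matches. Below is a minimal NumPy sketch with toy descriptors; real systems use SIFT/SURF/ORB descriptors and optimized matchers.

```python
import numpy as np

def match_descriptors(desc_a, desc_b, ratio=0.8):
    """Nearest-neighbour matching with Lowe's ratio test.

    desc_a, desc_b: (N, D) and (M, D) float descriptor arrays.
    Returns (index_in_a, index_in_b) pairs whose best match is
    clearly better than the second-best candidate.
    """
    matches = []
    for i, d in enumerate(desc_a):
        dists = np.linalg.norm(desc_b - d, axis=1)  # distance to every candidate
        order = np.argsort(dists)
        best, second = order[0], order[1]
        if dists[best] < ratio * dists[second]:     # keep unambiguous matches only
            matches.append((i, int(best)))
    return matches

# Toy example: the second descriptor set is a noisy permutation of the first.
a = np.array([[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]], dtype=float)
b = a[[2, 0, 1]] + 0.05
print(match_descriptors(a, b))  # each a-descriptor finds its permuted twin
```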


Inherent scale ambiguity, one of the major challenges in SLAM algorithms, is also one of the major benefits of monocular SLAM. Scale ambiguity is the problem of the undefined scale of the world: if you walk around with a camera over your head, the world seems small inside the house and that suddenly changes when you step outside. This gives monocular SLAM the power to switch between environments of different scale, whereas stereo cameras only work in their predefined environments (limited depth range and a special setup).
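The scale ambiguity can be seen directly in the pinhole projection equation: scaling the whole scene and the camera translation by the same factor leaves every pixel unchanged, so a single camera can never recover absolute scale. A small NumPy sketch (toy intrinsics assumed):

```python
import numpy as np

K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])  # toy intrinsics
R = np.eye(3)                     # camera rotation
t = np.array([0.1, 0.0, 0.0])     # camera translation
X = np.array([1.0, 2.0, 5.0])     # a 3-D point in the world

def project(K, R, t, X):
    p = K @ (R @ X + t)           # pinhole projection
    return p[:2] / p[2]           # normalise to pixel coordinates

s = 10.0                          # scale the whole world by s ...
u1 = project(K, R, t, X)
u2 = project(K, R, s * t, s * X)  # ... and the translation with it
print(u1, u2)                     # identical pixels: scale is unobservable
```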


Feature-based methods

Feature-based methods usually split the task into two subtasks: acquiring a set of key-point features, then constructing the scene geometry as a function of the acquired features.
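The second subtask, recovering geometry from matched key points, typically boils down to triangulation: given two camera poses and a matched pixel pair, the 3D point is recovered. A minimal linear (DLT) triangulation sketch with toy cameras:

```python
import numpy as np

def triangulate(P1, P2, u1, u2):
    """Linear (DLT) triangulation of one point from two views.
    P1, P2: 3x4 projection matrices; u1, u2: matched pixel coordinates."""
    A = np.stack([
        u1[0] * P1[2] - P1[0],
        u1[1] * P1[2] - P1[1],
        u2[0] * P2[2] - P2[0],
        u2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)    # nullspace of A is the homogeneous point
    X = Vt[-1]
    return X[:3] / X[3]            # back to inhomogeneous coordinates

# Two toy cameras: identity pose and a 1-unit sideways shift.
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0], [0]])])
X_true = np.array([0.5, 0.2, 4.0])
u1 = P1 @ np.append(X_true, 1); u1 = u1[:2] / u1[2]
u2 = P2 @ np.append(X_true, 1); u2 = u2[:2] / u2[2]
print(triangulate(P1, P2, u1, u2))  # recovers X_true up to numerical noise
```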


Query: but what is a key point?

Think of key points as landmarks or points of interest in an image: not necessarily of interest to you, but to the vision system. The key characteristic of a key point is that it can be detected again regardless of the operations performed on the image, whether it is rotated, shifted, or distorted (see the figure for an example). In other words, key points are invariant to such distortions.

[Figure: example of key points detected in an image]


When using key points, however, the information contained in straight or curved edges, which make up a large part of the image especially in man-made environments, is discarded. Several approaches have been proposed in the past to remedy this by including edge-based or even region-based features; yet, since estimation in the resulting high-dimensional feature space is tedious, they are rarely used in practice. To obtain dense reconstructions, the estimated camera poses can be used to subsequently reconstruct dense maps using multi-view stereo.


Direct methods

Direct methods simply avoid this hassle: they operate directly on the image intensities. Since they use all the information in the image, i.e. all pixel intensities, they skip feature detection entirely and do not discard the edge information discussed above. Besides higher accuracy and robustness, particularly in environments with few key points, this provides substantially more information about the geometry of the environment, which can be very valuable for robotics or augmented reality applications.
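The core idea of direct methods, minimizing a photometric error over raw pixel intensities, can be illustrated with a toy 1D example: estimating the shift between two intensity profiles by brute-force search. Real direct methods optimize a full 6-DoF camera pose with iterative solvers such as Gauss-Newton, but the error term is the same in spirit.

```python
import numpy as np

def photometric_shift(ref, cur, max_shift=5):
    """Estimate the integer shift between two 1-D intensity profiles by
    minimising the photometric error sum((ref - shift(cur))^2)."""
    best_shift, best_err = 0, np.inf
    for s in range(-max_shift, max_shift + 1):
        shifted = np.roll(cur, s)
        err = np.sum((ref - shifted) ** 2)  # photometric (intensity) error
        if err < best_err:
            best_shift, best_err = s, err
    return best_shift

rng = np.random.default_rng(0)
ref = rng.random(100)               # a random intensity profile
cur = np.roll(ref, -3)              # the same profile shifted by 3 pixels
print(photometric_shift(ref, cur))  # -> 3
```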

While direct image alignment is well established for RGB-D or stereo sensors, monocular direct VO algorithms have been proposed only recently. Accurate and fully dense depth maps can be computed using a variational formulation, which however is computationally demanding and requires a state-of-the-art GPU to run in real time.

A semi-dense depth filtering formulation was later proposed which significantly reduces the computational complexity, allowing real-time operation on a CPU and even on a modern smartphone. All these approaches, however, are pure visual odometries: they only track the motion of the camera locally and do not build a consistent, global map of the environment.

[Figure: overview of the visual SLAM pipeline]


The input is a video feed, say a 30 fps RGB frame sequence. Three separate processes take place here: key-frame selection, 2D semantic segmentation, and 3D reconstruction with semantic optimization.

Key frames are selected from the sequence of frames as references, and the consecutive frames are used to refine the depth and its variance. Semantic segmentation then takes place in 2D: a standard segmentation process using a convolutional neural network classifies the selected key frames, e.g. a street-view image is segmented into pedestrians, cars, sidewalk, buildings, etc. After that, a 3D map is reconstructed by putting the key frames together with their semantic information.
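The fusion of 2D semantic labels into the 3D map can be sketched, very crudely, as accumulating per-frame class probabilities for each 3D point and taking the most likely class. The class names and numbers below are illustrative, not from any particular system.

```python
import numpy as np

# Hypothetical setup: each 3-D point is observed in several key frames;
# each frame contributes a per-class probability vector from the 2-D CNN.
CLASSES = ["building", "car", "pedestrian", "sidewalk"]

def fuse_labels(per_frame_probs):
    """Fuse per-frame class probabilities for one 3-D point by
    accumulating them and taking the most likely class."""
    total = np.sum(per_frame_probs, axis=0)
    return CLASSES[int(np.argmax(total))]

# Two frames weakly favour "car"; one frame strongly favours "building",
# and the confident observation wins after accumulation.
obs = np.array([
    [0.2, 0.5, 0.2, 0.1],
    [0.1, 0.6, 0.2, 0.1],
    [0.9, 0.05, 0.03, 0.02],
])
print(fuse_labels(obs))
```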


Now the 3D reconstruction takes place by stacking the key frames together, with consideration given to the pose graph (which frame was taken where). This stacking can occur in real time today. Simultaneously, the semantic segmentation process classifies the frames, and the two are fused into a 3D model: a building classified in a 2D image is now a classified building in the 3D environment.

The depth information of each key frame is iteratively refined using its consecutive frames. This yields a locally optimal depth estimate for each key frame and a correspondence between labelled pixels and voxels in the 3D point cloud.
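This iterative refinement can be sketched as a simple Gaussian depth filter: each new depth measurement, with its variance, is fused with the current estimate by inverse-variance weighting. This is a simplification of the depth filters used in semi-dense SLAM (which typically work in inverse depth and handle outliers), but the update rule is the core of it.

```python
def fuse_depth(mu, var, z, var_z):
    """Fuse the current depth estimate (mu, var) with a new measurement
    (z, var_z) by inverse-variance weighting (product of Gaussians)."""
    new_var = (var * var_z) / (var + var_z)
    new_mu = (var_z * mu + var * z) / (var + var_z)
    return new_mu, new_var

# A key frame starts with an uncertain depth; consecutive frames refine it.
mu, var = 5.0, 4.0                  # initial guess: 5 m, high variance
for z in [4.1, 3.9, 4.0, 4.05]:     # measurements from later frames
    mu, var = fuse_depth(mu, var, z, var_z=0.25)
print(mu, var)                      # estimate converges toward ~4 m,
                                    # variance shrinks with each fusion
```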

To obtain a globally optimal 3D semantic segmentation, the information from neighboring 3D points is exploited, involving their distance, color similarity, and semantic labels. This process updates each 3D point's state and creates a globally consistent 3D map.


2D semantic segmentation

A deep CNN, as usually known, consists of one or more convolution/max-pooling layer pairs followed by one or more fully connected layers. It mainly works on the features in the image, deciding which features stand out to form what. Here, that pair is replaced by dilated convolutions and atrous spatial pyramid pooling (ASPP). For inference, a softmax layer is used to obtain the final probabilistic score map (look up the softmax layer for more on probabilistic outputs).
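The final step can be sketched in a few lines: a per-pixel softmax over the H × W × M logit volume turns the network output into a probabilistic score map (toy sizes assumed below):

```python
import numpy as np

def softmax_score_map(logits):
    """Per-pixel softmax over an H x W x M logit volume, returning an
    H x W x M probability map (each pixel's class scores sum to 1)."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

H, W, M = 4, 6, 3                          # toy image size and class count
logits = np.random.default_rng(1).normal(size=(H, W, M))
probs = softmax_score_map(logits)
labels = probs.argmax(axis=-1)             # per-pixel class decision
print(probs.shape, labels.shape)           # (4, 6, 3) (4, 6)
```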



Semi-Dense SLAM Mapping

LSD-SLAM is a real-time, semi-dense 3D mapping method. The 3D environment is reconstructed as a pose graph of key frames with associated semi-dense depth maps.

Query: what is a pose graph?

Well, in most SLAM algorithms a landmark graph is a common structure: you can get thousands of landmark points from a frame sequence, and optimizing the reconstructed environment over all of them would be very time-consuming and computationally expensive. Instead, a pose graph is used, which only stores information about the pose of the camera in each frame.

A key frame selected from the image frames contains four pieces of information: the image intensity, the depth map, the variance of the depth map (from previous frames), and a semantic score map. (Remember: the semantic score map has size H × W × M (H: height, W: width, M: number of classes) and comes directly from the deep CNN.)
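Those four pieces of information map naturally onto a small data structure. The field names below are illustrative, not taken from any particular implementation:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class KeyFrame:
    """One key frame, holding the four items described above."""
    intensity: np.ndarray   # H x W image intensities
    depth: np.ndarray       # H x W semi-dense depth map
    depth_var: np.ndarray   # H x W depth variance (from previous frames)
    scores: np.ndarray      # H x W x M semantic score map from the CNN

H, W, M = 480, 640, 5
kf = KeyFrame(
    intensity=np.zeros((H, W)),
    depth=np.full((H, W), np.nan),     # NaN where no depth is estimated yet
    depth_var=np.full((H, W), np.inf),
    scores=np.zeros((H, W, M)),
)
print(kf.scores.shape)                 # H x W x M, as in the text
```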

Ok, now picture driving towards a certain building: the building keeps scaling up, getting bigger with each frame. So a scale-drift-aware image alignment is carried out on these stacked key frames with their refined depth maps, which is used to align two differently scaled key frames.
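The scale component of such an alignment can be illustrated very crudely: if two key frames observe the same surface at different scales, the relative scale shows up as a constant ratio between corresponding depths. A full Sim(3) alignment also estimates rotation and translation; this sketch estimates only the scale.

```python
import numpy as np

def relative_scale(depth_a, depth_b):
    """Estimate the scale factor between two depth maps of the same
    surface (a crude stand-in for the scale component of a Sim(3)
    alignment): the median ratio of corresponding valid depths."""
    valid = (depth_a > 0) & (depth_b > 0)
    return float(np.median(depth_b[valid] / depth_a[valid]))

rng = np.random.default_rng(2)
d1 = rng.uniform(2.0, 10.0, size=(20, 20))  # depths seen from key frame A
d2 = 1.5 * d1                               # same scene, 1.5x the scale
print(relative_scale(d1, d2))               # -> 1.5
```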



[Figure: example results of the visual SLAM pipeline]