With the development and increasing availability of 3D devices and data, it is only natural that interest in 3D imaging and video would grow. While watching movies in 3D, (re)creating 3D models from scenes and using 3D data to estimate distance (for smart, autonomous cars, for example) are relatively recent possibilities, most of the algorithms and approaches in 3D imaging and video acquisition are quite old, or are at least straightforward extensions and improvements of relatively old proposals. It is mostly the increase in hardware performance that makes lifelike 3D rendering and Augmented Reality (AR) and Virtual Reality (VR) applications feasible in a reasonable time, often even reaching real-time performance.
Depth Map and Time-of-Flight algorithms
One of the first approaches to generating 3D data of a scene, lidar, an acronym for Light Detection And Ranging, is a technology that has existed since the 1960s, created very soon after the invention of the laser. The idea behind the lidar method is to scan the scene by illuminating it with lasers and then analyze the scattered rays, very similarly to how a radar (RAdio Detection And Ranging) works, which is why the name was initially a combination of the words light and radar and was only later given the meaning of the acronym above. By illuminating the scene and measuring the time needed for the light signal to reach the object, reflect and return to the receiver, also known as the time-of-flight (ToF) of the signal, an estimate of the distance between the object and the light source is calculated. If this estimation is dense enough and is done from various locations and/or angles, a fine representation of the 3D scene is possible.
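The round-trip measurement above reduces to a one-line formula: the light covers the distance twice, so the range is half the product of the speed of light and the measured time. A minimal sketch (the function name is illustrative, not from any particular device's API):

```python
SPEED_OF_LIGHT = 299_792_458.0  # m/s

def tof_distance(round_trip_seconds: float) -> float:
    """Distance to the object from the measured round-trip time.
    The signal travels out and back, hence the division by two."""
    return SPEED_OF_LIGHT * round_trip_seconds / 2.0

# A pulse returning after roughly 6.67 nanoseconds corresponds to about 1 m.
print(tof_distance(6.67e-9))
```

Note how short these times are: centimeter-level precision requires timing resolution on the order of tens of picoseconds, which is why practical ToF sensors often measure phase shifts of modulated light rather than raw pulse times.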
This approach is still used in many devices and serves as an input for many algorithms. Examples of devices based on this approach are ToF cameras, which usually have both infrared and visible spectrum sensors. The visible spectrum sensor provides the user with a common camera output (RGB color images), while the infrared sensor creates depth maps that work more or less the same way lidar does; in other words, it calculates how far away each part of the scene is and maps that distance to a specific grayscale value. Usually, if a pixel is closer to the camera it has brighter values, and if it is more distant it has darker values. All ToF cameras have a range in which distance is detected; values outside that range are saturated towards the minimum and maximum values, respectively. If the object is too close to the camera, it will appear all white, while if it is too far away it will appear entirely black.
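This closer-is-brighter mapping with saturation at the range limits can be sketched as follows (the function name is an assumption, not a vendor API; the range values are illustrative, roughly matching a short-range camera):

```python
import numpy as np

def depth_to_gray(depth_m, near=0.15, far=1.0):
    """Map distances in meters to 8-bit grayscale: closer is brighter.
    Distances outside [near, far] saturate to pure white / pure black."""
    d = np.clip(np.asarray(depth_m, dtype=float), near, far)
    # Linear mapping: `near` becomes 255 (white), `far` becomes 0 (black).
    return np.round(255 * (far - d) / (far - near)).astype(np.uint8)

# 0.10 m saturates to white, 2.0 m saturates to black, midpoints are gray.
print(depth_to_gray([0.10, 0.15, 0.575, 1.0, 2.0]))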
Depending on the usage of the camera, this range can be wider or narrower, and shifted toward closer or farther distances. SoftKinetic’s DepthSense, currently owned by Sony, for example, is a small camera intended mainly for teleconferencing and AR/VR applications, which assumes the user is seated in front of his/her computer. On the other hand, Microsoft’s Kinect is at least three times bigger than the DepthSense and is designed for objects that are more distant from the camera, because its main objective is to provide Xbox players with interactive gameplay involving movement of parts of the body or the entire body. The player is usually expected to be at least a meter away from the TV, which is why its minimum distance is 0.7 m, as opposed to the DepthSense (version DS325), which has a range from 0.15 m to 1 m. One can overcome this limitation, of course, by using a set of cameras jointly in order to get a good depth map over a variety of ranges.
Usefulness in Robotics
These kinds of cameras (or similar lidar-based sensors) are also used in robotics and autonomous cars, since they are usually the basis of the distance measurements that signal to the car or robot how far it is from an obstacle and whether an impact is close or even imminent. The field in robotics where ToF cameras are possibly most useful is Simultaneous Localization and Mapping (SLAM), where, as the name implies, a map of the surrounding area is generated while at the same time the location of the robot (or the autonomous car) within that map is estimated. These maps can be 2D (a top-down view with information only for the X and Y axes and no data for the height), but 3D maps are continually becoming more attractive to researchers as small, commercial, embedded devices, for example single-board computers (SBCs) such as the Raspberry Pi, or systems-on-a-chip (SoCs) and single-board microcontrollers such as the ESP32 and Arduino, become more powerful and available, and as cloud computing and Internet of Things (IoT) services become progressively cheaper. Depth map ToF cameras play a pivotal role in the generation of these 3D maps, as well as in the localization, since they inherently carry information about the distance to the points on the depth map they are generating.
Multiple View Geometry
Another way in which a 3D representation is obtained from 2D projections of a scene is by utilizing epipolar geometry in the scenario of multiple view geometry and (usually) binocular vision. Binocular vision is the most common type of vision in animals, especially in mammals, birds and reptiles, varying in intensity based on the cyclodisparity, the field of view, binocular summation, eye dominance and other sensory parameters of the rods and cones of the eye. For this reason, many solutions in the field of multiple view geometry use two cameras for the map generation, or simulate two cameras by using a single camera and comparing two images taken at different times and positions relative to the scene, assuming that the scene has no or few dynamic objects that would drastically change the patterns being tracked in the two corresponding images.
When multiple views are available for reconstruction, one of the fundamental problems that needs to be addressed is called triangulation, which in the simplest terms means deriving the relation between three points: the first being the original 3D point that needs to be estimated, the second the projection of that point onto the first camera plane, and the third the projection of the same point onto the second camera plane. If more than two cameras (or more than two views from a single camera) are used, these relations can be written down multiple times between different pairs.
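A standard way to solve for the 3D point is linear triangulation: each projection contributes two linear constraints on the homogeneous 3D point, and the stacked system is solved by SVD. A minimal numpy sketch, with two toy projection matrices whose values are purely illustrative:

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    """Linear triangulation: given 3x4 projection matrices P1, P2 and a
    pixel correspondence x1 <-> x2, solve A X = 0 for the homogeneous
    3D point X via SVD and dehomogenize."""
    A = np.vstack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]

# Two toy cameras: identity pose, and a camera shifted 1 unit along X.
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
X_true = np.array([0.5, 0.2, 4.0])
x1 = P1 @ np.append(X_true, 1); x1 = x1[:2] / x1[2]
x2 = P2 @ np.append(X_true, 1); x2 = x2[:2] / x2[2]
print(triangulate(P1, P2, x1, x2))  # recovers [0.5, 0.2, 4.0]
```

With noisy correspondences the four equations are no longer exactly consistent, and the SVD solution becomes a least-squares estimate; this is why the point appears as a relation to be satisfied rather than a quantity computed directly.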
Pattern extraction, matching and calculation of correspondences
It is important to note that here we assumed we know which 3D point corresponds to which projected point in both views. Depending on the usage, these points can be known a priori, or estimated. For example, if we are using these algorithms to scan some object and generate its 3D map, we can physically measure a given set of points and select them in both views to obtain the camera and projective matrices, using one of a multitude of possible algorithms, such as the Direct Linear Transform (DLT). On the other hand, if we want to use it as an automatic algorithm, for, let’s say, autonomous cars, then we can generate the corresponding points in both views (often simply called correspondences) with some feature extraction and matching algorithm, such as SURF, SIFT, FAST, the Harris corner detector and many others. When the set of corresponding points is generated, the camera and projective matrices can be estimated by setting up an epipolar geometry relation between them; a famous algorithm that does this is the Gold Standard Algorithm. When these matrices are known, we can estimate the 3D points and generate the 3D mesh of the object or scene being recreated. Based on whether parallel lines are preserved during the image creation process (meaning the matrices are considered affine transforms) or not (the matrices are considered projective transforms), a corresponding variation of these algorithms should be used: affine algorithms in the first case and projective algorithms in the latter.
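The DLT mentioned above can be sketched for the calibration scenario where a set of physically measured 3D points and their selected pixel positions are known: each correspondence yields two rows of a homogeneous linear system in the twelve entries of the camera matrix, solved by SVD. A minimal numpy sketch under those assumptions (noise-free toy data, no point normalization, which a practical implementation would add):

```python
import numpy as np

def dlt_camera_matrix(X3d, x2d):
    """Direct Linear Transform: estimate the 3x4 camera matrix P from
    n >= 6 known 3D points X3d (n x 3) and their pixel projections
    x2d (n x 2), by solving the homogeneous system A p = 0 via SVD."""
    rows = []
    for (X, Y, Z), (u, v) in zip(X3d, x2d):
        Xh = [X, Y, Z, 1.0]
        rows.append([0, 0, 0, 0] + [-w for w in Xh] + [v * w for w in Xh])
        rows.append(Xh + [0, 0, 0, 0] + [-u * w for w in Xh])
    _, _, Vt = np.linalg.svd(np.asarray(rows))
    return Vt[-1].reshape(3, 4)

# Toy check: project known points with a known P, then recover P up to scale.
P_true = np.hstack([np.eye(3), np.array([[0.1], [-0.2], [2.0]])])
X3d = np.random.default_rng(0).uniform(-1, 1, (8, 3)) + [0, 0, 5]
xh = np.hstack([X3d, np.ones((8, 1))]) @ P_true.T
x2d = xh[:, :2] / xh[:, 2:]
P_est = dlt_camera_matrix(X3d, x2d)
P_est /= P_est[-1, -1]  # fix the arbitrary scale for comparison
print(np.allclose(P_est, P_true / P_true[-1, -1], atol=1e-6))  # True
```

Since P is only defined up to scale, the comparison normalizes both matrices by their last entry; with noisy correspondences, the Gold Standard Algorithm refines such a linear estimate by minimizing the geometric reprojection error.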
Another possible automated approach is to use Convolutional Neural Networks (CNNs) in all their varieties, such as “vanilla” CNNs, CNN autoencoders, Generative Adversarial Networks (GANs), etc., which essentially learn how to map two scenes (or, to some extent, even one) to a 3D map, doing most of the extraction, matching, estimation and, generally, all the preprocessing work automatically in the background.
Structured Light and Ray Tracing
Keeping in mind how the relationship between known points is set using triangulation, we can now combine the lidar-based algorithms with the stereo imaging algorithms to get what is known as structured light. In this scenario, we can reimagine the two-camera setting in which we saw how a 3D point is projected onto the camera planes, but instead of the second camera we now have an emitter, or a projector of a sort. This emitter, as was the case with lidar, sends signals that the camera captures, but in the case of structured light, the light is created with a specific, known pattern, usually stripe-like bars or a checkerboard grid with different colors. Since the pattern is known before its reflection from the object, and because the real-world positions of the camera and the emitter are known, we can triangulate the real coordinates of the object with high precision and thus reconstruct it very accurately. The light emitted doesn’t have to be in the visible spectrum, as is the case with many ToF cameras, which can themselves be seen, in a sense, as a mix between structured light, depth from focus and depth from stereo. For example, the previously mentioned Kinect camera uses infrared light with a speckle pattern, effectively using structured light, among other things, to estimate the depth.
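For a rectified camera-emitter pair with a known baseline, this triangulation collapses to the classic relation Z = f·b/d: depth is the focal length times the baseline divided by the observed shift (disparity) of the pattern in the image. A minimal sketch with illustrative parameter values:

```python
def depth_from_disparity(disparity_px, focal_px=500.0, baseline_m=0.1):
    """Rectified stereo / structured-light triangulation, Z = f * b / d:
    f is the focal length in pixels, b the camera-emitter baseline in
    meters, d the observed pattern shift in pixels (values illustrative)."""
    return focal_px * baseline_m / disparity_px

print(depth_from_disparity(25.0))  # a 25 px pattern shift -> 2.0 m
```

The inverse relationship explains why such systems lose precision at long range: a distant surface moves the pattern by only a fraction of a pixel, so small measurement errors translate into large depth errors.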
It is interesting how much the approaches discussed thus far start to look alike as we dig deeper into the way they work. Even algorithms that live purely in the digital domain, like the ray tracing rendering algorithm, have a similar basis. In this algorithm, the reflection off objects is simulated by building each pixel of the image through extending a ray from the image plane into the 3D space, where its reflection is simulated backwards toward the virtual light source, thus creating highly detailed, photorealistic scenes.
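At the core of such a renderer is an intersection test between each extended ray and the scene geometry. A minimal sketch for the simplest primitive, a sphere, solving the quadratic equation for the ray parameter t (all names are illustrative):

```python
import math

def ray_sphere_hit(origin, direction, center, radius):
    """Smallest positive t with |origin + t*direction - center| = radius,
    or None if the ray misses the sphere. `direction` is assumed to be
    a unit vector, so the quadratic's leading coefficient is 1."""
    oc = [o - c for o, c in zip(origin, center)]
    b = 2 * sum(d * v for d, v in zip(direction, oc))
    c = sum(v * v for v in oc) - radius * radius
    disc = b * b - 4 * c
    if disc < 0:
        return None  # the ray misses the sphere entirely
    t = (-b - math.sqrt(disc)) / 2  # nearest of the two intersections
    return t if t > 0 else None

# A ray cast from the origin along +Z hits a unit sphere centered 5 away.
print(ray_sphere_hit((0, 0, 0), (0, 0, 1), (0, 0, 5), 1.0))  # 4.0
```

From the hit point, a full ray tracer would recurse: spawn reflection and shadow rays toward the light sources and accumulate their contributions into the pixel color.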
3D mapping in Medicine
One of the fields that has seen the most widespread use of 3D imaging is medicine, mostly because it has vast amounts of 2D image databases that can be used in both the development and the analysis of 3D data, but also because there is always interest in advances in medical technology from which everybody can benefit. Other reasons for this popularity, aside from the availability of CT, PET and MR scanning devices that can provide the developer with the needed data, include the tremendous usefulness to diagnosis of creating 3D models of internal organs, bones and vessels that doctors can analyze afterwards. Autonomous robotic arms assisting doctors, as well as remotely controlled robotic arms, are becoming more available in hospitals, and all of them depend very closely on the accuracy of the computer vision and 3D modeling programs.
Specifically, when generating 3D maps of bones and organs, a number of slices are generated for the given object and concatenated by some geometry processing algorithm, such as marching cubes (or, alternatively, marching tetrahedra), which can generate a 3D mesh based on the input slices by essentially joining the slices together and calculating their interrelation.
The Digital Imaging and Communications in Medicine (DICOM) file format is also a good, standard way of transferring medical image data, along with metadata that, apart from a lot of other useful information, contains details of the image creation process, such as the distance of the body from the x-ray source and the camera, the angles at which the cameras and the x-ray source are positioned, etc. This information can be used directly to increase the accuracy of the 3D models that are created.
It should be noted, though, that this is not the only approach used in medical imaging. Epipolar geometry has found its use in estimating 3D models of vessels in angiography, where a contrast agent that absorbs x-rays is channeled into the patient’s blood, making the vessels take on a fairly different grayscale value from the rest of the body. Once the images are generated, procedures similar to those previously explained generate a 3D model of the veins and arteries, based on two (or more) images of the vessels taken from different angles and distances.