## 14. Multi-Frame Stereo

### Using the City category of KITTI's Raw Data

So far we have explained a new approach to stereo vision, the gaze_line-depth model, which obtains 3D information from the left and right images of a stereo camera. Here we consider what can be obtained from left and right images that are continuous in time, that is, from a stereo video camera. We will restrict ourselves to footage from a car driving slowly through a city, rather than arbitrary video, and use 2011_09_26_drive_0091 from the City category of KITTI's Raw Data. KITTI's data is distributed under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 License; 340 color stereo image pairs were taken from this vast data set, and any reuse of these images must follow that license. Converted into a movie, the 340 stereo pairs look like the following.

### Detection of rotational movement information between frames

Since the footage comes from a car driving slowly through the city, most of the change between images can be attributed to the movement of the car. On this premise, we rotate and translate the 3D shape obtained from the stereo pair at one time so that it overlaps the 3D shape obtained from the stereo pair at the next time. This also serves as a performance test of the stereo vision processing, because the two shapes cannot be made to overlap unless the 3D information obtained by the processing is reasonably accurate. For the 3D information to be correct, the focal length in pixels and the distance between the cameras must be correct; if they are not, no rotation and translation will make the shapes overlap. Since the focal length and the camera separation are obtained by calibration, this also evaluates the calibration process.

Let us explain with an example. The following movie alternates between frame 82 and frame 83.

You can see that the car is simply moving forward. The next video shows the result of rotating and translating the 3D shape of frame 82 so that it overlaps the 3D shape of frame 83. From this rigid-motion information we can read off how the car moved and turned. The program we wrote this time found that the car moved 695 mm forward, 6 mm to the left, and 11 mm upward, while rotating 0.16 degrees to the right horizontally, 0.028 degrees downward vertically, and 0.074 degrees clockwise about the viewing axis. We cannot say exactly how accurate these values are, but when the transformed 3D image of frame 82 and the 3D image of frame 83 are displayed alternately, the buildings do not change at all and only the people appear to move. The bicycle in the center moves the most.
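To make the reported angles concrete, here is one way the three rotation angles could be assembled into a rotation matrix like the `RRR` used in the next section. This is only a sketch: the axis assignment (pan about the vertical axis, tilt about the horizontal axis, roll about the viewing axis) and the composition order are assumptions, since the text does not specify them.

```cpp
#include <cmath>

// Multiply two 3x3 matrices: C = A * B.
void matmul3(const double A[3][3], const double B[3][3], double C[3][3]) {
    for (int i = 0; i < 3; ++i)
        for (int j = 0; j < 3; ++j) {
            C[i][j] = 0;
            for (int k = 0; k < 3; ++k) C[i][j] += A[i][k] * B[k][j];
        }
}

// Build a rotation matrix from the three angles (in degrees) reported in the text:
// horizontal pan, vertical tilt, and roll. The order Rroll * Rtilt * Rpan is one
// possible convention, not necessarily the one the article's program uses.
void make_rotation(double pan_deg, double tilt_deg, double roll_deg, double R[3][3]) {
    const double d = std::acos(-1.0) / 180.0;  // degrees -> radians
    double a = pan_deg * d, b = tilt_deg * d, c = roll_deg * d;
    double Rpan[3][3]  = {{ std::cos(a), 0, std::sin(a)},
                          { 0,           1, 0          },
                          {-std::sin(a), 0, std::cos(a)}};
    double Rtilt[3][3] = {{1, 0,           0           },
                          {0, std::cos(b), -std::sin(b)},
                          {0, std::sin(b),  std::cos(b)}};
    double Rroll[3][3] = {{std::cos(c), -std::sin(c), 0},
                          {std::sin(c),  std::cos(c), 0},
                          {0,            0,           1}};
    double tmp[3][3];
    matmul3(Rtilt, Rpan, tmp);
    matmul3(Rroll, tmp, R);
}
```

For the angles above (0.16, 0.028, 0.074 degrees) the result is very close to the identity matrix, which matches the small inter-frame motion of a slowly moving car.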

The following movie is a three-dimensional display of frames 0 to 30 as seen from the car.

When the same 3D images are displayed in the coordinates of frame 0, you can see that the buildings hardly move. By back-calculating from the information on how the car moved and rotated between frames, we can reproduce how the 3D shape seen from the moving car would look from the position before the motion and rotation. Chaining this across multiple frames lets everything be displayed in frame-0 coordinates. As the frames advance, errors in the per-frame motion estimates should accumulate and the apparent deviation should grow, so the fact that the buildings barely move indicates that the inter-frame motion is detected fairly accurately. The following shows the same 3D images viewed from 15 m forward. Only the person walking by the right wall, waving a hand, appears to move; the surrounding buildings appear to stand still, although the resolution changes. In reality the car was moving forward, so the buildings and the people should have moved backward relative to it.

### Detection method of rotational movement information

Techniques for recovering the camera trajectory and the shape of the stationary scene from multiple images taken by the same camera are known as SLAM (Simultaneous Localization and Mapping), and many methods have been studied. Most of them extract feature points that are robust and easy to match, and recover three-dimensional information from the image positions of corresponding feature points. Here, no such image processing is performed; the three-dimensional shape itself is rotated. With `RRR` the rotation matrix and `TTT` the translation vector, we search for the optimal values of both. The rotation and translation of the three-dimensional shape are performed by the following function.

```cpp
void trans(int w, int h, int d, int &rw, int &rh, int &rd) {
    // gaze_line position (w, h) and depth d -> 3D point (x, y, z)
    double x = -focal*half_base/d;
    double y =  w*half_base/d;
    double z = -h*half_base/d;
    rot_and_mv(RRR, TTT, x, y, z);   // rotate by RRR, then translate by TTT
    // 3D point back to gaze_line-depth coordinates
    double D = -focal*half_base/x;
    double W =  y*D/half_base;
    double H = -z*D/half_base;
    // round to the nearest integer
    rw = (int)(W+0.5);
    rh = (int)(H+0.5);
    rd = (int)(D+0.5);
}
```
Here, rot_and_mv(RRR, TTT, x, y, z) is a function that overwrites (x, y, z) with the result of rotating the vector (x, y, z) by RRR and then translating it by TTT. (w, h) is the position of the gaze_line and d is the depth at that position; focal is the focal length in pixels and half_base is half the distance between the cameras. The last three lines round to the nearest integer. As this code shows, the representation of 3D space is again the gaze_line-depth model. We first considered a point cloud, but abandoned it because deciding whether two points are close was too heavy a load. Whether two shapes overlap is evaluated by the degree of agreement of the RGB values, representing brightness and color, at matching three-dimensional positions in gaze_line-depth space. The optimum is searched in steps of 0.001 degrees of rotation and 1 mm of translation, using the same method of varying the parameters one at a time while looking for the minimum that was used when making the cameras parallel.

### Correction using 3D information of previous frame

There are two problems with the stereo vision of the gaze_line-depth model. One is that the result is odd at discontinuities in the foreground, and the other is that when the distance between the cameras is large, the 3D information of the background is given priority. First look at the following.

This shows the three-dimensional information of frame 1 from various angles; you can see that the stone pavement between the rails is depressed. This is thought to happen because there is no prior information. If frame 0 is rotated and overlapped onto frame 1 while the car is moving forward, frame 0 contains more information about the foreground. Using this information, the dent disappears. This method cannot be used when the car is moving backward, which means that forward and backward motion are not symmetrical with respect to acquiring 3D information. That is strange.
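One way this correction could look, as a sketch only: assume the depth maps are flat integer arrays indexed by gaze_line position, that the previous frame's map has already been rotated and moved into the current frame's coordinates (e.g. with trans() from the previous section), and that larger d means nearer, since distance is proportional to 1/d in the model. The function name, the array layout, and the threshold are all hypothetical.

```cpp
#include <vector>

// Where the transformed previous frame saw something clearly nearer (larger d)
// than the current estimate, adopt the previous value, on the premise that a
// forward-moving car saw the foreground better in the earlier frame.
// prev[i] == 0 marks positions with no previous information.
void fill_front(std::vector<int> &depth, const std::vector<int> &prev,
                int w, int h, int threshold) {
    for (int i = 0; i < w * h; ++i) {
        if (prev[i] > 0 && prev[i] - depth[i] > threshold)
            depth[i] = prev[i];
    }
}
```

With a rule like this the depressed pavement would be pulled forward to the depth the previous frame measured, while positions where the two frames roughly agree are left untouched.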