## 10. Stereo Rectification

### What is Stereo Rectification

Stereo rectification is called stereo parallelization processing in Japanese. Given a left image taken by the left camera and a right image taken by the right camera, it is the process of obtaining the virtual left image that would be taken by translating the right camera (not the left camera) to the left without changing its orientation (angle). At this stage the right image is simply the original right image. The virtual left image is obtained by transforming the original left image. It is remarkable that stereo parallelization is possible at all, because in general an image taken from an arbitrary position and angle cannot be obtained unless all the three-dimensional information of the target is known. Stereo parallelization is an indispensable step in stereo vision, and since the very purpose of stereo vision is to obtain the three-dimensional information of the target, it matters a great deal that parallelization can be done without that information. As will be seen later, the right image also has to be transformed so that the left image corresponds to a pure leftward translation of the camera.

### Coordinate system transformation

Consider a transformation between two 3D coordinate systems \(OXYZ\) and \(oxyz\). The two are on an equal footing, but here \(OXYZ\) is taken as the world (global) coordinate system and \(oxyz\) as the camera coordinate system attached to the camera. \(O\) and \(o\) are their respective origins. Vectors themselves do not depend on the coordinate system. Write \(\vec{Oo}=\vec{t}\). Then a point \(P\) in space can be expressed as \begin{align*} \vec{OP}=\vec{Oo}+\vec{oP}=\vec{t}+\vec{oP}=\vec{t}+x\vec{e_1}+y\vec{e_2}+z\vec{e_3} ,\end{align*} where \(\vec{e_1}\), \(\vec{e_2}\), and \(\vec{e_3}\) are the basis vectors of the coordinate system \(oxyz\) and \((x, y, z)\) are the coordinates of \(P\) in \(oxyz\). Let the coordinates of \(P\) in the coordinate system \(OXYZ\) be \((X, Y, Z)\). Expressing every vector by its components in \(OXYZ\), we have \begin{align*} \vec{OP}=\begin{pmatrix} X \\ Y \\ Z \end{pmatrix} .\end{align*} Also let \begin{align*} \vec{t}=\begin{pmatrix} t_1 \\ t_2 \\ t_3 \end{pmatrix}, \vec{e_1}=\begin{pmatrix} e_{11} \\ e_{12} \\ e_{13} \end{pmatrix}, \vec{e_2}=\begin{pmatrix} e_{21} \\ e_{22} \\ e_{23} \end{pmatrix}, \vec{e_3}=\begin{pmatrix} e_{31} \\ e_{32} \\ e_{33} \end{pmatrix} .\end{align*} Then \(\vec{OP}=\vec{t}+x\vec{e_1}+y\vec{e_2}+z\vec{e_3}\) becomes, in components, \begin{align*} \begin{pmatrix} X \\ Y \\ Z \end{pmatrix}= \begin{pmatrix} t_1 \\ t_2 \\ t_3 \end{pmatrix}+ \begin{pmatrix} e_{11} \\ e_{12} \\ e_{13} \end{pmatrix}x+ \begin{pmatrix} e_{21} \\ e_{22} \\ e_{23} \end{pmatrix}y+ \begin{pmatrix} e_{31} \\ e_{32} \\ e_{33} \end{pmatrix}z= \begin{pmatrix} e_{11} & e_{21} & e_{31} \\ e_{12} & e_{22} & e_{32} \\ e_{13} & e_{23} & e_{33} \end{pmatrix} \begin{pmatrix} x \\ y \\ z \end{pmatrix}+ \begin{pmatrix} t_1 \\ t_2 \\ t_3 \end{pmatrix} .\end{align*} Introducing the notation \begin{align*} X=\begin{pmatrix} X \\ Y \\ Z \end{pmatrix}, x=\begin{pmatrix} x \\ y \\ z \end{pmatrix}, R^{-1}=\begin{pmatrix} e_{11} & e_{21} & e_{31} \\ e_{12} & e_{22} & e_{32} \\ e_{13} & e_{23} & e_{33} \end{pmatrix}, t=\begin{pmatrix} t_1 \\ t_2 \\ t_3 \end{pmatrix} ,\end{align*} this is written as \begin{align*} X=R^{-1}x+t .\end{align*} Let \(R\) be the inverse of \(R^{-1}\). Multiplying both sides from the left by \(R\) gives \begin{align*} RX=x+Rt ,\end{align*} and setting \(T=-Rt\) gives \begin{align} x=RX+T \label{eq:1} .\end{align} Note that \(X\) is the coordinate of the point \(P\) in the coordinate system \(OXYZ\), and \(x\) is the coordinate of the same point \(P\) in the coordinate system \(oxyz\). Equation \eqref{eq:1} says: rotate \(X\) by \(R\) and translate it by \(T\), all within the coordinate system \(OXYZ\); the coordinates in \(OXYZ\) of the resulting position are equal to the coordinates of \(P\) in the coordinate system \(oxyz\). This is the coordinate transformation formula.
Written out in components, \eqref{eq:1} is \begin{align*} \begin{pmatrix} x \\ y \\ z \end{pmatrix}= \begin{pmatrix} r_{11} & r_{12} & r_{13} \\ r_{21} & r_{22} & r_{23} \\ r_{31} & r_{32} & r_{33} \end{pmatrix} \begin{pmatrix} X \\ Y \\ Z \end{pmatrix}+ \begin{pmatrix} T_1 \\ T_2 \\ T_3 \end{pmatrix} ,\end{align*} which can also be expressed as \begin{align*} \begin{pmatrix} x \\ y \\ z \end{pmatrix}= \begin{pmatrix} r_{11} & r_{12} & r_{13} & T_1 \\ r_{21} & r_{22} & r_{23} & T_2 \\ r_{31} & r_{32} & r_{33} & T_3 \end{pmatrix} \begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix} .\end{align*} Write this as \begin{align} x=(R|T)\tilde{X} \label{eq:2} .\end{align} \(\tilde{X}\) is \(X\) with a \(1\) appended, and is called the homogeneous coordinate representation. Homogeneous coordinates are often said to exist for handling points at infinity. Since no point at infinity appears in this discussion, we use only \(\tilde{X}\), whose last component is \(1\). With homogeneous coordinates, the transformation of the coordinate system can be expressed as a single matrix product.
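
The two forms \eqref{eq:1} and \eqref{eq:2} can be checked numerically. The following is a minimal sketch using NumPy; the 30 degree rotation, the vector \(t\), and the point \(X\) are made-up values chosen only for illustration.

```python
import numpy as np

# Hypothetical pose: the camera axes are the world axes rotated 30 degrees about Z.
theta = np.deg2rad(30.0)
c, s = np.cos(theta), np.sin(theta)

# Columns of R_inv are the camera basis vectors e1, e2, e3 in world components.
R_inv = np.array([[c,   -s,  0.0],
                  [s,    c,  0.0],
                  [0.0, 0.0, 1.0]])
t = np.array([1.0, 2.0, 3.0])          # camera origin o, in world coordinates

R = np.linalg.inv(R_inv)               # R is the inverse of R^{-1}
T = -R @ t                             # T = -R t

X = np.array([4.0, 5.0, 6.0])          # point P in world coordinates
x = R @ X + T                          # equation (1): x = R X + T

RT = np.hstack([R, T.reshape(3, 1)])   # the 3x4 matrix (R|T)
X_tilde = np.append(X, 1.0)            # homogeneous coordinates of P
x_from_homogeneous = RT @ X_tilde      # equation (2): x = (R|T) X~

X_back = R_inv @ x + t                 # going back: X = R^{-1} x + t

print(x, x_from_homogeneous, X_back)
```

Both routes give the same camera coordinates, and applying \(X=R^{-1}x+t\) recovers the original world coordinates.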

It would be natural to also consider how the coordinate system itself rotates and translates, but that is quite troublesome, because one first has to settle in which coordinate system the rotation matrix and the translation vector are expressed. For the time being, we will not consider rotating or translating the coordinate system itself. Since \begin{align*} \vec{e_1}=\begin{pmatrix} e_{11} \\ e_{12} \\ e_{13} \end{pmatrix}, \vec{e_2}=\begin{pmatrix} e_{21} \\ e_{22} \\ e_{23} \end{pmatrix}, \vec{e_3}=\begin{pmatrix} e_{31} \\ e_{32} \\ e_{33} \end{pmatrix} \end{align*} are orthonormal vectors, we have \begin{align*} \begin{pmatrix} (\vec{e_1})^T \\ (\vec{e_2})^T \\ (\vec{e_3})^T \end{pmatrix} \begin{pmatrix} \vec{e_1} & \vec{e_2} & \vec{e_3} \end{pmatrix} =\begin{pmatrix} e_{11} & e_{12} & e_{13} \\ e_{21} & e_{22} & e_{23} \\ e_{31} & e_{32} & e_{33} \end{pmatrix} \begin{pmatrix} e_{11} & e_{21} & e_{31} \\ e_{12} & e_{22} & e_{32} \\ e_{13} & e_{23} & e_{33} \end{pmatrix} =\begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix} .\end{align*} That is, \begin{align*} R=\begin{pmatrix} e_{11} & e_{12} & e_{13} \\ e_{21} & e_{22} & e_{23} \\ e_{31} & e_{32} & e_{33} \end{pmatrix} =(R^{-1})^T \end{align*}\begin{align*} R^{-1}=\begin{pmatrix} e_{11} & e_{21} & e_{31} \\ e_{12} & e_{22} & e_{32} \\ e_{13} & e_{23} & e_{33} \end{pmatrix} =R^T .\end{align*}
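
As a quick numerical check of \(R^{-1}=R^T\), the sketch below stacks an orthonormal basis as rows (giving \(R\)) and as columns (giving \(R^{-1}\)). The basis, a 20 degree tilt about the world \(X\) axis, and the helper name `rotation_from_basis` are hypothetical choices for illustration.

```python
import numpy as np

def rotation_from_basis(e1, e2, e3):
    """Return R, whose rows are the camera basis vectors in world components."""
    return np.vstack([e1, e2, e3])

# Hypothetical orthonormal basis: camera tilted 20 degrees about the world X axis.
a = np.deg2rad(20.0)
e1 = np.array([1.0, 0.0, 0.0])
e2 = np.array([0.0, np.cos(a), np.sin(a)])
e3 = np.array([0.0, -np.sin(a), np.cos(a)])

R = rotation_from_basis(e1, e2, e3)       # rows are e1, e2, e3
R_inv = np.column_stack([e1, e2, e3])     # columns are e1, e2, e3

product = R @ R_inv                       # should be the identity matrix
print(np.round(product, 12))
```

Because the inverse is just the transpose, code that handles rotations can use `R.T` instead of a general matrix inversion.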

### Camera and perspective transformation

Consider a pinhole camera, in which light from an object always passes through a pinhole. Place the origin \(o\) of the 3D coordinate system \(oxyz\) at this pinhole. Next, consider the plane perpendicular to the \(oz\) axis that intersects it at the point \(f\), \((0, 0, f)\), and call it the imaging surface. Here \(f\) is positive, so the imaging surface lies between the pinhole and the object. \(f\) is also called the focal length. In an actual pinhole camera, the imaging surface is on the opposite side of the pinhole from the object, so this is a virtual arrangement that makes the equations easier to read. Place the two-dimensional pixel coordinates \(fuv\) on the imaging surface with the point \(f\) as the origin, so that the \(fu\) axis is parallel to the \(ox\) axis and the \(fv\) axis is parallel to the \(oy\) axis. The 3D coordinates \(oxyz\) with the 2D pixel coordinates \(fuv\) arranged in this way are called camera coordinates.

When light emitted from the point \((x, y, z)\) on the object passes through the point \((u, v)\) on the imaging surface, the proportion \(x:z=u:f\) gives \(u=fx/z\). If the pixel coordinates are measured in pixels and the pixel size is \(p\), then \(u=fx/{pz}\). Similarly, \(v=fy/{pz}\). These are written as \begin{align*} s \begin{pmatrix} u \\ v \\ 1 \end{pmatrix}= \begin{pmatrix} f/p & 0 & 0 \\ 0 & f/p & 0 \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} x \\ y \\ z \end{pmatrix} .\end{align*} In components this is \( su=fx/p, sv=fy/p, s=z \); substituting the third equation \(s=z\) turns the first into \(u=fx/{pz}\) and the second into \(v=fy/{pz}\). More generally, \begin{align*} s \begin{pmatrix} u \\ v \\ 1 \end{pmatrix}= \begin{pmatrix} f_x/p & 0 & c_x \\ 0 & f_y/p & c_y \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} x \\ y \\ z \end{pmatrix} \end{align*} gives \( su=f_xx/p+c_xz, sv=f_yy/p+c_yz, s=z \), and substituting the third equation \(s=z\) gives \(u=f_xx/{pz}+c_x\) and \(v=f_yy/{pz}+c_y\). Here \((c_x, c_y)\) is the pixel coordinate of the image center (the intersection of the \(oz\) axis and the imaging surface). It covers the case where the origin of the pixel coordinates is shifted away from the \(oz\) axis. Writing \(f_x/p\) and \(f_y/p\) anew as \(f_x\) and \(f_y\) gives \begin{align} s \begin{pmatrix} u \\ v \\ 1 \end{pmatrix}= \begin{pmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} x \\ y \\ z \end{pmatrix} \label{eq:3} ,\end{align} in which \(f_x\) and \(f_y\) are focal lengths in units of the pixel size. There are two focal lengths, \(f_x\) and \(f_y\), because the values in the \(x\) and \(y\) directions may differ (for example, when the pixels are not square). \(c_x\) and \(c_y\) are center positions in units of the pixel size from the start. This transformation from the camera coordinate system to the pixel coordinate system of the imaging surface is called perspective transformation.
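
To make the role of \(s\) and of the pixel size \(p\) concrete, here is a small NumPy sketch of \eqref{eq:3}. The focal length of 4 mm, pixel size of 0.002 mm, image center \((320, 240)\), and the object point are all made-up values for illustration.

```python
import numpy as np

# Hypothetical camera: 4 mm focal length, 0.002 mm square pixels.
f_mm = 4.0
p_mm = 0.002
f_x = f_y = f_mm / p_mm          # focal length in units of the pixel size
c_x, c_y = 320.0, 240.0          # assumed image center in pixels

# Camera matrix of equation (3).
A = np.array([[f_x, 0.0, c_x],
              [0.0, f_y, c_y],
              [0.0, 0.0, 1.0]])

x_cam = np.array([0.1, -0.05, 2.0])   # point (x, y, z) in camera coordinates

s_m = A @ x_cam                  # the vector (s*u, s*v, s)
s = s_m[2]                       # third row: s = z
u, v = s_m[0] / s, s_m[1] / s    # divide by s to get pixel coordinates

print(s, u, v)
```

With these numbers \(f_x=2000\), so \(u=2000\cdot 0.1/2+320=420\) and \(v=2000\cdot(-0.05)/2+240=190\), up to floating-point rounding.
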
The vector \begin{align*} \tilde{m}= \begin{pmatrix} u \\ v \\ 1 \end{pmatrix} \end{align*} is the homogeneous coordinate representation of the pixel coordinates \((u, v)\). With homogeneous coordinates, the perspective transformation can also be expressed as a matrix product. The factor \(s\) in the equations above is needed because homogeneous coordinates may be multiplied by any nonzero real number and still represent the same point. Equation \eqref{eq:3} can be written as \begin{align} s\tilde{m}=Ax \label{eq:4} .\end{align} \(A\) is called the camera matrix. Combined with \eqref{eq:1} of the previous section, this becomes \begin{align} s\tilde{m}=A(RX+T) \label{eq:5} ,\end{align} and combined with \eqref{eq:2} of the previous section, it becomes \begin{align*} s\tilde{m}=A(R|T)\tilde{X} .\end{align*} These formulas are called the basic camera equations. They represent the transformation from the global coordinate system to the pixel coordinate system, that is, how the real world is photographed.
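
The basic camera equation can be sketched as a small function. This is only an illustration under assumed values: the camera matrix, the 10 degree rotation about the \(Y\) axis, the translation, and the helper name `project` are not from the text.

```python
import numpy as np

def project(A, R, T, X):
    """Return pixel coordinates (u, v) and the scale s from s*m~ = A (R|T) X~."""
    RT = np.hstack([R, T.reshape(3, 1)])      # 3x4 matrix (R|T)
    s_m = A @ RT @ np.append(X, 1.0)          # the vector (s*u, s*v, s)
    return s_m[:2] / s_m[2], s_m[2]

# Hypothetical intrinsics and pose.
A = np.array([[800.0,   0.0, 320.0],
              [  0.0, 800.0, 240.0],
              [  0.0,   0.0,   1.0]])
a = np.deg2rad(10.0)
R = np.array([[ np.cos(a), 0.0, np.sin(a)],
              [ 0.0,       1.0, 0.0      ],
              [-np.sin(a), 0.0, np.cos(a)]])
T = np.array([0.1, 0.0, 0.5])

X = np.array([0.2, -0.1, 3.0])                # world point
m, s = project(A, R, T, X)

# The same result via equation (5): s*m~ = A (R X + T).
x_cam = R @ X + T
m_via_eq5 = (A @ x_cam)[:2] / x_cam[2]

print(m, s, m_via_eq5)
```

The scale \(s\) returned here equals the depth \(z\) of the point in camera coordinates, which is why it is positive for points in front of the camera.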

### Relationship between cameras

Writing the transformation \eqref{eq:1} from the world coordinate system to the camera coordinate system once for the left camera and once for the right camera gives \begin{align} x_l=R_lX+T_l \label{eq:6} \end{align}\begin{align} x_r=R_rX+T_r \label{eq:7} .\end{align} Multiply both sides of \eqref{eq:6} from the left by a matrix \(R\) to get \begin{align*} Rx_l=RR_lX+RT_l .\end{align*} If \(R\) is chosen so that \(R_r=RR_l\), then using \eqref{eq:7} this can be transformed to \begin{align*} Rx_l=R_rX+RT_l=x_r-T_r+RT_l \end{align*}\begin{align*} x_r=Rx_l+T_r-RT_l .\end{align*} Setting \(T_r-RT_l=-T\) gives \begin{align} x_r=Rx_l-T \label{eq:8} .\end{align} This says that rotating \(x_l\) by \(R\) and translating it by \(-T\) yields \(x_r\), provided that the relations \begin{align*} R_r=RR_l \end{align*}\begin{align*} T_r=RT_l-T \end{align*} hold. Note that \(R\) and \(T\) here act on the coordinates of a point as seen from the cameras; they are not a rotation and translation applied to the camera itself. These relations and \eqref{eq:8} express the relative position of the two cameras. For later use, rewrite \eqref{eq:8} as \begin{align} x_l=R^{-1}(x_r+T) \label{eq:9} .\end{align}
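
The relations \(R_r=RR_l\) and \(T_r=RT_l-T\) can be solved for \(R\) and \(T\) and checked against \eqref{eq:8} and \eqref{eq:9}. The sketch below does this with NumPy; the two camera poses and the helper names `rot_x` and `rot_y` are hypothetical.

```python
import numpy as np

def rot_x(deg):
    """Rotation matrix about the X axis by the given angle in degrees."""
    a = np.deg2rad(deg)
    return np.array([[1.0, 0.0, 0.0],
                     [0.0, np.cos(a), -np.sin(a)],
                     [0.0, np.sin(a),  np.cos(a)]])

def rot_y(deg):
    """Rotation matrix about the Y axis by the given angle in degrees."""
    a = np.deg2rad(deg)
    return np.array([[ np.cos(a), 0.0, np.sin(a)],
                     [ 0.0,       1.0, 0.0      ],
                     [-np.sin(a), 0.0, np.cos(a)]])

# Hypothetical extrinsics of the two cameras (world -> camera).
R_l, T_l = rot_y(5.0), np.array([0.02, 0.0, 0.1])
R_r, T_r = rot_y(-3.0) @ rot_x(1.0), np.array([-0.12, 0.01, 0.1])

R = R_r @ R_l.T          # from R_r = R R_l, using R_l^{-1} = R_l^T
T = R @ T_l - T_r        # from T_r = R T_l - T

X = np.array([0.3, -0.2, 2.5])   # a world point
x_l = R_l @ X + T_l              # equation (6)
x_r = R_r @ X + T_r              # equation (7)

x_r_from_left = R @ x_l - T      # equation (8)
x_l_from_right = R.T @ (x_r + T) # equation (9)

print(x_r, x_r_from_left)
```

The world point \(X\) drops out: once \(R\) and \(T\) are known, the coordinates in one camera follow from the coordinates in the other.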

### Stereo parallel processing

Split the basic camera equation \eqref{eq:5} into \eqref{eq:4} and \eqref{eq:1}, and write them as \begin{align} s_l\tilde{m_l}=A_lx_l \label{eq:10} \end{align}\begin{align*} x_l=R_lX+T_l \end{align*} for the left camera and as \begin{align*} s_r\tilde{m_r}=A_rx_r \end{align*}\begin{align} x_r=R_rX+T_r \label{eq:11} \end{align} for the right camera. Multiplying both sides of \eqref{eq:10} from the left by \(A_l^{-1}\) gives \begin{align*} s_lA_l^{-1}\tilde{m_l}=x_l ,\end{align*} and using \eqref{eq:9} this becomes \begin{align*} s_lA_l^{-1}\tilde{m_l}=R^{-1}(x_r+T) .\end{align*} Using \eqref{eq:11}, it can be transformed to \begin{align*} s_lA_l^{-1}\tilde{m_l}=R^{-1}(R_rX+T_r+T) .\end{align*} Multiplying both sides from the left by \(R\) gives \begin{align*} s_lRA_l^{-1}\tilde{m_l}=R_rX+T_r+T .\end{align*} Multiplying both sides further from the left by \(A_r\) gives \begin{align} s_lA_rRA_l^{-1}\tilde{m_l}=A_r(R_rX+T_r+T) \label{eq:12} .\end{align} Defining \(\tilde{\dot{m_l}}\), transformed from \(\tilde{m_l}\), by \begin{align} \tilde{\dot{m_l}}=A_rRA_l^{-1}\tilde{m_l} \label{eq:13} ,\end{align} this reads \begin{align*} s_l\tilde{\dot{m_l}}=A_r(R_rX+T_r+T) .\end{align*} Comparing this with the basic equation \begin{align} s_r\tilde{m_r}=A_r(R_rX+T_r) \label{eq:14} \end{align} of the right camera shows that the left image transformed by \eqref{eq:13} is the projection that the right camera itself would produce if the object were translated by \(T\).
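
The claim about \eqref{eq:13} can be checked numerically: transform a left pixel with \(A_rRA_l^{-1}\) and compare it with the right-camera projection of the object translated by \(T\). This is a minimal NumPy sketch; the camera matrices, the poses, and the helper names are made-up for illustration.

```python
import numpy as np

def rot_x(deg):
    """Rotation matrix about the X axis by the given angle in degrees."""
    a = np.deg2rad(deg)
    return np.array([[1.0, 0.0, 0.0],
                     [0.0, np.cos(a), -np.sin(a)],
                     [0.0, np.sin(a),  np.cos(a)]])

def rot_y(deg):
    """Rotation matrix about the Y axis by the given angle in degrees."""
    a = np.deg2rad(deg)
    return np.array([[ np.cos(a), 0.0, np.sin(a)],
                     [ 0.0,       1.0, 0.0      ],
                     [-np.sin(a), 0.0, np.cos(a)]])

def dehomogenize(q):
    """Convert (s*u, s*v, s) to pixel coordinates (u, v)."""
    return q[:2] / q[2]

# Hypothetical intrinsics and extrinsics of the two cameras.
A_l = np.array([[800.0, 0.0, 320.0], [0.0, 810.0, 240.0], [0.0, 0.0, 1.0]])
A_r = np.array([[790.0, 0.0, 330.0], [0.0, 795.0, 235.0], [0.0, 0.0, 1.0]])
R_l, T_l = rot_y(5.0), np.zeros(3)
R_r, T_r = rot_y(-3.0) @ rot_x(1.0), np.array([-0.12, 0.01, 0.0])

R = R_r @ R_l.T                  # relative rotation, R_r = R R_l
T = R @ T_l - T_r                # relative translation, T_r = R T_l - T

X = np.array([0.3, -0.2, 2.5])                     # a world point
m_l = dehomogenize(A_l @ (R_l @ X + T_l))          # left pixel, equation (10)

H = A_r @ R @ np.linalg.inv(A_l)                   # matrix of equation (13)
m_l_dot = dehomogenize(H @ np.append(m_l, 1.0))    # transformed left pixel

# Right camera looking at the object translated by T.
m_shifted = dehomogenize(A_r @ (R_r @ X + T_r + T))

print(m_l_dot, m_shifted)
```

The transformed pixel is compared after dividing by its third component, because the result of \eqref{eq:13} generally has a third component different from \(1\).
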
Furthermore, to make this translation \(T\) point along the \(ox\) axis of the right camera, consider a rotation matrix \(L\) satisfying \begin{align*} LT=c\begin{pmatrix} e_{r11} \\ e_{r12} \\ e_{r13} \end{pmatrix}=b ,\end{align*} where \((e_{r11}, e_{r12}, e_{r13})^T\) denotes the unit vector along the \(ox\) axis of the (rotated) right camera; expressed in that camera's own coordinates it is simply \((1, 0, 0)^T\). Multiplying both sides of \eqref{eq:12} and \eqref{eq:14} from the left by \(A_rLA_r^{-1}\) gives \begin{align*} s_lA_rLRA_l^{-1}\tilde{m_l}=A_r(LR_rX+LT_r+b) \end{align*}\begin{align*} s_rA_rLA_r^{-1}\tilde{m_r}=A_r(LR_rX+LT_r) ,\end{align*} which completes the stereo parallelization. Both images have to be transformed, by \begin{align} \tilde{\ddot{m_l}}=A_rLRA_l^{-1}\tilde{m_l} \label{eq:15} \end{align}\begin{align} \tilde{\ddot{m_r}}=A_rLA_r^{-1}\tilde{m_r} \label{eq:16} .\end{align} The transformed right image \(\tilde{\ddot{m_r}}\) is an image taken by a camera, and the transformed left image \(\tilde{\ddot{m_l}}\) is the image taken after that camera has been translated by exactly \(-c\) in the direction of the \(ox\) axis. This is the first point at which we consider translating the camera itself: the situation in which the object appears translated by \(c\) in the direction of the \(ox\) axis is interpreted as the result of translating the camera by \(-c\) in the direction of the \(ox\) axis. Note also that \(L\) retains one arbitrary degree of freedom, a rotation about the \(ox\) axis. Next, the OpenCV initUndistortRectifyMap function, which performs this image transformation, is explained.
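
The whole procedure can be tried on synthetic data. The sketch below builds one possible \(L\) (the free rotation about the \(ox\) axis is fixed here by an arbitrary choice involving the \(oz\) axis), forms the matrices of \eqref{eq:15} and \eqref{eq:16}, and prints the transformed pixels of a few world points. All camera parameters, points, and helper names are hypothetical, and \(b\) is taken as \(c\,(1,0,0)^T\) in the camera's own coordinates.

```python
import numpy as np

def rot_x(deg):
    a = np.deg2rad(deg)
    return np.array([[1.0, 0.0, 0.0],
                     [0.0, np.cos(a), -np.sin(a)],
                     [0.0, np.sin(a),  np.cos(a)]])

def rot_y(deg):
    a = np.deg2rad(deg)
    return np.array([[ np.cos(a), 0.0, np.sin(a)],
                     [ 0.0,       1.0, 0.0      ],
                     [-np.sin(a), 0.0, np.cos(a)]])

def rotation_aligning_to_x(T):
    """Return a rotation L with L @ T = (|T|, 0, 0); assumes T is not along z."""
    r1 = T / np.linalg.norm(T)                 # new x axis along T
    r2 = np.cross([0.0, 0.0, 1.0], r1)         # one choice among many
    r2 /= np.linalg.norm(r2)
    r3 = np.cross(r1, r2)
    return np.vstack([r1, r2, r3])

def project(A, R, T, X):
    q = A @ (R @ X + T)
    return q[:2] / q[2]

def apply_h(H, m):
    q = H @ np.append(m, 1.0)
    return q[:2] / q[2]

# Hypothetical intrinsics and extrinsics.
A_l = np.array([[800.0, 0.0, 320.0], [0.0, 810.0, 240.0], [0.0, 0.0, 1.0]])
A_r = np.array([[790.0, 0.0, 330.0], [0.0, 795.0, 235.0], [0.0, 0.0, 1.0]])
R_l, T_l = rot_y(5.0), np.zeros(3)
R_r, T_r = rot_y(-3.0) @ rot_x(1.0), np.array([-0.12, 0.01, 0.0])

R = R_r @ R_l.T
T = R @ T_l - T_r
L = rotation_aligning_to_x(T)

H_l = A_r @ L @ R @ np.linalg.inv(A_l)     # equation (15)
H_r = A_r @ L @ np.linalg.inv(A_r)         # equation (16)

points = [np.array([0.3, -0.2, 2.5]),
          np.array([-0.4, 0.1, 3.0]),
          np.array([0.0, 0.3, 4.0])]
left = np.array([apply_h(H_l, project(A_l, R_l, T_l, X)) for X in points])
right = np.array([apply_h(H_r, project(A_r, R_r, T_r, X)) for X in points])

for ml, mr in zip(left, right):
    print("left", ml, "right", mr, "disparity", ml[0] - mr[0])
```

After the transformation each point has the same \(v\) coordinate in both images, and only \(u\) differs. That horizontal difference (the disparity) is what stereo matching then searches for along a single image row.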

### initUndistortRectifyMap function

The arguments of OpenCV's initUndistortRectifyMap function are:

- InputArray cameraMatrix
- InputArray distCoeffs
- InputArray R
- InputArray newCameraMatrix
- Size size
- int m1type
- OutputArray map1
- OutputArray map2