3D Geometry

Homogeneous Coordinates

The homogeneous representation of an image point $\mathbf{x} = [u, v]$ is denoted as $\homo{x} = [u, v, 1]$. A line in the image passing through $\homo{x}_1$ and $\homo{x}_2$ is $\homo{l} = \homo{x}_1 \times \homo{x}_2$. Homogeneous coordinates are useful because:

Various computer vision and graphics operations can be expressed concisely as matrix multiplications.
Points at infinity can be expressed in the same format, by setting the last element to 0.

Plücker Coordinates, Barycentric Coordinates

TODO

Types of Transforms

Table shows the types of transforms on 2D homogeneous coordinates. Each transform preserves all the quantities in its row and below it.

Transform	DoF	Preserves
Translation	2	absolute orientation
Rotation	1	distances
Similarity	4	angles
Affine	6	parallel lines
Projective	8	ratios of distances

Single View Geometry

If the transformation from world to camera coordinates is $^cT_w = \begin{bmatrix}R & t \\ 0 & 1 \end{bmatrix}$, the pinhole camera projection equation is

\[\homo{x} = K [R | t] ^w\homo{X}\]

where $\homo{X}$ is the homogeneous representation of the 3D point in world coordinates, and

\[K = \begin{bmatrix} f_x & 0 & c_x\\ 0 & f_y & c_y\\ 0 & 0 & 1 \end{bmatrix}\]

is the ideal pinhole camera intrinsics matrix. $P = K[R | t]$ is called the camera projection matrix.

Why don’t we represent 3D points in homogeneous coordinates when multiplying with $K$? Note that at this stage, the 3D point is already in the camera coordinate system. So scaling the non-homogeneous coordinates has the effect of moving along the ray connecting the 3D point to the camera center. And all points on that ray project to the same location on the image.

Dissecting the Camera Projection Matrix

Let’s denote the columns of P as $P = [P_1 | P_2 | P_3 | P_4]$. Now, $P_1 = P \begin{bmatrix}1 \\ 0 \\ 0 \\ 0\end{bmatrix}$. Hence $P_1$ is the image of vanishing point in the world coordinate X direction. Similarly for $P_2$ and $P_3$. $P_4$ is the image of the world origin.

Let’s denote the rows of P as $P = \begin{bmatrix}P_1^T \\ P_2^T \\ P_3^T\end{bmatrix}$. Since $P_1^T = \begin{bmatrix}1 & 0 & 0\end{bmatrix}P$, $P_1^T$ is the plane passing through the camera center and the image line $[1, 0, 0]$. To see this, consider a projected point $\homo{x} = P \homo{X}$. $\homo{x}$ lies on the image line $\homo{l}$ iff

\[\begin{align*} \homo{l}^T\homo{x} &= 0\\ \iff \homo{l}^TP\homo{X} &= 0\\ \iff (P^T\homo{l})^T\homo{X} &= 0 \end{align*}\]

Thus the plane corresponding to image line $\homo{l}$ is $P^T\homo{l}$.

Representing lines with Plücker coordinates

TODO

Camera Intrinsics and Pose from Vanishing Points

Find 3 sets of imaged parallel lines, and intersect each set to get 3 vanishing points $\mathbf{v}_i$. Can use Hough voting for efficiency.
$\homo{v}_i \approx K[R \vert \mathbf{t}] \begin{bmatrix}\mathbf{e}_i \\ 0 \end{bmatrix} = KR\mathbf{e}_i$, where $\mathbf{e}_1 = [1~0~0]^T$, and so forth.
The equation $\mathbf{e}_i^T \mathbf{e}_j = \mathbf{v}_i^T K^{-T} R^T R K \mathbf{v}_j = \mathbf{v}_i^T K^{-T} K \mathbf{v}_j = 0$ gives an equation containing focal length and principal point.
Solve the set of equations to get camera intrinisics.
Now, $\homo{v}_i \approx KR\mathbf{e}_i = K\mathbf{r}_i$ (where $\mathbf{r}_i$ is the i’th column of $R$) can be solved s.t. $\norm{\mathbf{r}_i} = 1$ by SVD. This gives camera rotation $R$.

2-view Geometry

Homography from 2D correspondences

Use DLT, basically $\homo{x} \times H \homo{x}’ = 0$. Each correspondence gives 2 LI equations. Since $H$ has 8 DoFs, we need at least 4 correspondences. See Hartley and Zisserman, 4.1.
Refine the initial estimate using nonlinear optimization on geometric distance (HZ 4.2.2), possibly with the Sampson error (HZ 4.2.6).

The Fundamental Matrix F (Song)

$\homo{x}’^T F \homo{x} = 0$

$F$ does not depend on the scene geometry. It can be calculated directly from pixel correspondences between two images.
$F$ is a homogeneous entity, and has rank 2, so it has $9-2=7$ degrees of freedom.
It maps points in one image to their epipolar lines in the other image.
$\homo{x}’^T F = \homo{l}^T$, the epipolar line of $\homo{x}’$. Since the epipole $\homo{e}$ is on this line, $\homo{x}’^T F \homo{e} = 0 \forall \homo{x}’$. This implies $F \homo{e} = 0$. Thus, the epipole $\homo{e}$ is the right null vector of $F$. Similarly, $\homo{e}’$ is the left null vector of $F$.
Because $F$ is determined up to a scale, and it has a nonzero null space, it has $9-2=7$ degrees of freedom.
Given $F$, epipoles can be estimated by either solving $F\homo{e} = 0$ and $F^T \homo{e}’ = 0$, or computing epipolar lines and shown above and intersecting them.

Estimating F (8-Point Algorithm)

Construct 1 linear equation per correspondence from $\homo{x}’^T F \homo{x} = 0$. This gives the system $A\mathbf{f} = 0$, which we solve s.t. $\norm{\mathbf{f}} = 1$ using SVD. $\mathbf{f}$ has 9 elements, but we only need 8 rows in $A$ (and hence 8 correspondences) because we impose the $\norm{\mathbf{f}} = 1$ constraint.
$F$ actually has only 7 DoFs. So we impose an additional constraint, that it has rank 2, again using SVD. Let $\hat{F} = U diag(r,s,t) V^T$. Then set $F = U diag(r, s, 0) V^T$.

The Essential Matrix E

Normalized image coordinates, $\hat{\homo{x}} = K^{-1}\homo{x}$. The essential matrix works on normalized image coordinates:

\[\begin{align*} \hat{\homo{x}}'^T E \hat{\homo{x}} &= 0\\ \implies E &= K'^{-1} F K \end{align*}\]

Another definition is $E = [\mathbf{t}]_{\times}R$. Since both $\mathbf{t}$ and $R$ have 3 DoFs, and $E$ is determined up to a scale, $E$ has $3 + 3 - 1 = 5$ degrees of freedom.

Calculating Camera Motion

3D correspondences

Use Procrustes Analysis, maybe wrapped in a RANSAC loop. Need minimum 3 correspondences.

2D correspondences (Visual Odometry)

Compute $F$.
Compute $E$ using the camera calibrations.
SVD $E = U diag(1, 1, 0) V^T$. The two cameras then are $P = [I \vert 0]$ and $P’ = [UWV^T \vert \pm \mathbf{u}_3 ]$ or $P’ = [UW^TV^T \vert \pm \mathbf{u}_3]$.
Disambiguate between these 4 choices by checking that the reconstructed points are in front of both cameras. To do this, project the points in the camera coordinate systems and check that the $Z$ coordinate is positive.

Reconstructing 3D Points

Get initial estimate using DLT using the fact that $\homo{x} \times P\homo{X} = 0$ and $\homo{x}’ \times P’\homo{X} = 0$. This needs 1 correspondence per 3D point. See Hartley and Zisserman, 12.2.
Refine the initial estimate using non-linear optimization of the projection error. See Hartley and Zisserman, 12.4.

Calculating Camera Motion + Reconstructing 3D Points (Structure from Motion - SfM)

TODO

Multi-view Geometry

TODO

Panoramas

The simplest panorama stitching can be done by modeling the camera motion between images as a 2D planar homography. This model holds true for two cases:

Planar scene. w.l.o.g. we can choose the world coordinates such that the scene is on the XY plane. The camera projection equation becomes:
\[\begin{align*} \homo{x} &= P \homo{X}\\ &= \begin{bmatrix} P1 & P2 & P3 & P4\end{bmatrix} \begin{bmatrix} X \\\ Y \\\ 0 \\\ 1 \end{bmatrix}\\ &= \begin{bmatrix} P1 & P2 & P4\end{bmatrix} \begin{bmatrix} X \\\ Y \\\ 1 \end{bmatrix} \end{align*}\]
which is a homography. Since the transformations from all world points to image points are homographies, the camera motions between image frames are also homographies.
In-place rotation. Suppose the camera moves by a rotation $R$ between two images. The image locations of a 3D point $\homo{X}$ are w.l.o.g.
\[\begin{align*} \homo{x}_1 &= K[I | 0]\homo{X}\\ &= KX \end{align*}\]
and
\[\begin{align*} \homo{x}_2 &= K[R | 0]\homo{X}\\ &= KRX\\ &= KRK^{-1}KX\\ &= KRK^{-1}\homo{x}_1 \end{align*}\]
Thus, the relationship between $\homo{x}_1$ and $\homo{x}_2$ is a homography $H=KRK^{-1}$. This is why smartphone panorama apps advise you to rotate in place while capturing a panorama.

(page source)