Understanding cv2.recoverPose's coordinate frame transformations


I am having problems understanding the functionality of cv2.recoverPose(points1, points2). As far as I understand the documentation, this function should return the rotation matrix from camera1 to camera2 as well as the translation from camera1 to camera2.

I am using an artificial set of points, points_3d, as well as the transformation between two cameras, T_1_0, to recover the camera transformation in Python. To get the points on the cameras, I calculate the projections of the 3D points onto the camera sensors:

import cv2
import numpy as np

def calc_projection(K_c, transform, pt_3d):
    p_projected = np.hstack((K_c, np.zeros((3,1)))) @ np.linalg.inv(np.vstack((transform, [0,0,0,1]))) @ np.vstack((pt_3d.reshape(3,1), 1))
    p_projected = p_projected[:2,:] / p_projected[2,:]
    p_projected = p_projected.ravel()
    return p_projected

points_3d = np.random.rand(100, 3)

K_c = np.eye(3)

T_0 = np.hstack((np.eye(3), np.zeros((3,1))))
rot_vec = np.array([0.2, 0.1, 0.3])
R_1_0, _ = cv2.Rodrigues(np.array(rot_vec))
t_0_10 = np.array([0.2, 0.4, 0.1])
T_1_0 = np.hstack((R_1_0, t_0_10.reshape(3,1)))
points1 = []
points2 = []

for pt_3d in points_3d:
    points1.append(calc_projection(K_c, T_0, pt_3d))
    points2.append(calc_projection(K_c, T_1_0, pt_3d))

points1 = np.array(points1)
points2 = np.array(points2)

E, mask = cv2.findEssentialMat(points1, points2, method=cv2.RANSAC)
inliers1 = points1[mask.ravel() == 1]  # mask is N x 1; select the inlier rows
inliers2 = points2[mask.ravel() == 1]
_, R, t, _ = cv2.recoverPose(E, inliers1, inliers2)

r, _ = cv2.Rodrigues(R)
assert np.allclose(r, rot_vec)
assert np.allclose(t, t_0_10)

I would expect the result to be equal to T_1_0 (as per the assertions), but the result is:

r = [[0.20329041]
 [0.15711541]
 [0.37188371]]
t = [[0.50969714]
 [0.79593836]
 [0.32663581]]

What am I missing here? Why is it not working as expected? Am I doing something wrong or what is the expected behavior here?

Edit

The formula I've used for the projection wrongly introduces the inverse of the transformation. It should be the following instead:

p_projected = np.hstack((K_c, np.zeros((3,1)))) @ np.vstack((transform, [0,0,0,1])) @ np.vstack((pt_3d.reshape(3,1), 1))
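For completeness, the corrected helper then looks like this (a sketch, using the same names as above):

def calc_projection(K_c, transform, pt_3d):
    # transform = [R|t] already maps points from the world frame into the camera frame,
    # so it must not be inverted before projecting
    P = np.hstack((K_c, np.zeros((3, 1)))) @ np.vstack((transform, [0, 0, 0, 1]))
    p_projected = P @ np.vstack((pt_3d.reshape(3, 1), 1))
    p_projected = p_projected[:2, :] / p_projected[2, :]
    return p_projected.ravel()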

Side Note

solvePnP also solves the problem that I've been trying to solve here (recovering the camera pose from the 3D points and their 2D projections):

_, r, t = cv2.solvePnP(points_3d, points2, K_c, None)
assert np.allclose(r, rot_vec.reshape(3, 1), rtol=0.15)
assert np.allclose(t, t_0_10.reshape(3, 1), rtol=0.15)
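As a quick sanity check that solvePnP uses the same world(= camera1)-to-camera2 convention, its rvec can be converted back to a matrix and compared with R_1_0 (a sketch, with a deliberately loose tolerance):

R_pnp, _ = cv2.Rodrigues(r)
assert np.allclose(R_pnp, R_1_0, atol=0.05)  # solvePnP's pose transforms world points into camera2's frame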

But I still don't know why cv2.recoverPose does not work. According to the documentation, it should also return the rotation and translation...

There are 2 answers

Answer by patwis (Best Answer)

TL;DR

Most important findings:

  • The cv2.recoverPose(points1, points2,...) function returns the rotation from camera1 to camera2 if points1 are found on camera1 and points2 are found on camera2.

  • The returned translation vector is the translation of the same camera1-to-camera2 transform, i.e. the t in X2 = R @ X1 + t, so it is expressed in camera2's coordinate frame (see the sketch after this list).

  • The translation vector is only recovered up to scale; the missing factor can't be determined without additional information.

  • Passing the cameraMatrix to findEssentialMat (and to recoverPose) is important.
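A minimal sketch of the first three points, reusing only the ground-truth pose from the question (R_1_0, t_0_10):

# X2 = R_1_0 @ X1 + t_0_10 maps a point from camera1's frame into camera2's frame;
# recoverPose's (R, t) follow the same convention, with t only defined up to scale.
X1 = np.array([[1.0], [2.0], [5.0]])                  # an arbitrary point in camera1's frame
X2 = R_1_0 @ X1 + t_0_10.reshape(3, 1)                # the same point in camera2's frame

# Camera2's own center, expressed in camera1's frame, is a different quantity:
C2_in_cam1 = -R_1_0.T @ t_0_10.reshape(3, 1)

# The translation is only recovered up to scale, i.e. as a unit vector:
t_unit = t_0_10 / np.linalg.norm(t_0_10)
print(t_unit)  # [0.43643578 0.87287156 0.21821789] -- matches the recovered t below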

Description

I am now able to use cv2.recoverPose and recover the right rotation and translation. Here is the updated code:

E, mask = cv2.findEssentialMat(points1, points2, cameraMatrix=K_c)
inliers1 = points1[mask.ravel() == 1]  # mask is N x 1; select the inlier rows
inliers2 = points2[mask.ravel() == 1]
_, R, t, _ = cv2.recoverPose(E, inliers1, inliers2, cameraMatrix=K_c)

r, _ = cv2.Rodrigues(R)

The results are

# As expected
r = [[0.2]
 [0.1]
 [0.3]]

# expected: [0.2, 0.4, 0.1], but is:
t = [[0.43643578]
 [0.87287156]
 [0.21821789]]

But! The documentation says that the translation can only be recovered up to a factor. So in this case, the following assertion holds and the behavior is as expected:

factors = t.ravel() / t_0_10 
assert np.allclose(factors, factors[0])
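An equivalent check that avoids dividing by components of t_0_10 that could be close to zero is to compare unit vectors directly (a sketch):

t_unit_recovered = t.ravel() / np.linalg.norm(t)
t_unit_expected = t_0_10 / np.linalg.norm(t_0_10)
assert np.allclose(t_unit_recovered, t_unit_expected)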

Comment: What's funny is that if I use K_c = np.eye(3), it sometimes works and sometimes doesn't. But if I use the intrinsics of a real camera, the assertion always holds...

K_c = np.array([
    [485.0, 0.0, 320.0],
    [0.0, 485.0, 240.0],
    [0.0,0.0,1.0]
])

Answer by Rakshit Kothari

This has been tricky to get right. There are so many conflicting explanations that I felt it best to create a dummy script. This helped me understand exactly what was going on. The CV2 documentation is not descriptive enough and doesn't help.

TLDR

Let 3D points in the world coordinate system (CS) or camera 1's CS be defined as X1. In a two-camera system, we can only define the cameras with respect to each other. For simplicity, we generally assume camera 1's CS to be the world CS.

Let 3D points in camera 2's coordinate system be defined as X2.

Then, X2 = R @ X1 + t. Here, R and t are recovered from cv2.recoverPose.

If you want to get camera 2's center and orientation in camera 1's CS, the center is t_cam2 = -inv(R) @ t (equivalently -R.T @ t) and the orientation is R_cam2 = R.T.
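A minimal numeric check of these relations (a sketch; R_rel and t_rel are made-up stand-ins for the camera1-to-camera2 pose):

import numpy as np
from scipy.spatial.transform import Rotation as R

R_rel = R.from_euler('y', -20, degrees=True).as_matrix()   # hypothetical camera1 -> camera2 rotation
t_rel = np.array([[50.0], [0.0], [0.0]])                   # hypothetical camera1 -> camera2 translation

X1 = np.array([[100.0], [-50.0], [1000.0]])                # a point in camera 1's CS
X2 = R_rel @ X1 + t_rel                                    # the same point in camera 2's CS

# Camera 2's center in camera 1's CS is the point that maps to camera 2's origin:
C2 = -R_rel.T @ t_rel
assert np.allclose(R_rel @ C2 + t_rel, 0)

# Camera 2's orientation (its axes expressed in camera 1's CS) is R_rel.T,
# e.g. its principal ray (camera z-axis) in camera 1's CS:
principal_ray = R_rel.T @ np.array([0.0, 0.0, 1.0])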

Detailed script

Please feel free to use my script and convince yourself

import numpy as np
import cv2
from scipy.spatial.transform import Rotation as R
import matplotlib.pyplot as plt 

if __name__ == '__main__':
    # Generating a random point cloud within specified spatial limits
    num_points = 100  # Reduced number of points for clarity
    points_3d = np.random.rand(num_points, 3)  # Random points in 3D
    # Scale and shift points to fit within the specified range
    points_3d[:, 0] = points_3d[:, 0] * 1000 - 500  # X: -500 to 500 mm
    points_3d[:, 1] = points_3d[:, 1] * 1000 - 500  # Y: -500 to 500 mm
    points_3d[:, 2] = points_3d[:, 2] * 4500 + 500   # Z: 500 to 5000 mm

    # Defining camera intrinsic parameters
    fov_x, fov_y = 80, 45  # Field of view in degrees for X and Y axes
    resolution_x, resolution_y = 1920, 1080  # Camera resolution
    # Focal lengths derived from field of view and sensor resolution
    fx = resolution_x / (2 * np.tan(np.deg2rad(fov_x) / 2))
    fy = resolution_y / (2 * np.tan(np.deg2rad(fov_y) / 2))
    # Assuming the principal point is at the center of the image
    cx, cy = resolution_x / 2, resolution_y / 2
    intrinsic_matrix = np.array([
        [fx, 0, cx],
        [0, fy, cy],
        [0, 0, 1]  # Homogeneous component
    ])

    # Camera 1 (Origin) - no rotation or translation
    R1_identity = np.eye(3)  # Identity rotation matrix
    t1_zero = np.zeros((3, 1))  # Zero translation

    # Camera 2 - positioned 50mm to the right and rotated -20 degrees around Y-axis
    t2_translation = np.array([[50], [0], [0]])  # Translation vector
    R2_rotation = R.from_euler('y', -20, degrees=True).as_matrix()  # Rotation matrix

    # Constructing projection matrices without considering intrinsic parameters for simplicity
    P1_projection = np.dot(np.eye(3), np.hstack((R1_identity, t1_zero)))
    P2_projection = np.dot(np.eye(3), np.hstack((R2_rotation, t2_translation)))

    # Projecting 3D points to each camera's coordinate system
    points_3d_homogeneous = np.hstack((points_3d, np.ones((num_points, 1))))  # Homogeneous coordinates
    points_in_cam1 = (P1_projection @ points_3d_homogeneous.T).T
    points_in_cam2 = (P2_projection @ points_3d_homogeneous.T).T

    # Converting homogeneous coordinates to 2D image points
    points_in_cam1_2d = points_in_cam1[:, :2] / points_in_cam1[:, 2, np.newaxis]
    points_in_cam2_2d = points_in_cam2[:, :2] / points_in_cam2[:, 2, np.newaxis]

    # Estimating the essential matrix from point correspondences
    E_matrix, inliers = cv2.findEssentialMat(
        points_in_cam1_2d, points_in_cam2_2d,
        focal=1.0, pp=(0., 0.), method=cv2.LMEDS, prob=0.999, threshold=1.0
    )

    # Decomposing the essential matrix to find relative rotation and translation
    _, recovered_R_matrix, recovered_t_vector, _ = cv2.recoverPose(E_matrix, points_in_cam1_2d, points_in_cam2_2d)

    # Converting rotation matrices to Euler angles for easier interpretation
    recovered_rotation_euler = R.from_matrix(recovered_R_matrix).as_euler('xyz', degrees=True)
    groundtruth_rotation_euler = R.from_matrix(R2_rotation).as_euler('xyz', degrees=True)
    
    print('Recovered Rotation (Euler angles):', recovered_rotation_euler)
    print('Ground Truth Rotation (Euler angles):', groundtruth_rotation_euler)
    
    # Calculating Camera 2's position relative to Camera 1
    camera2_position = - np.linalg.inv(R2_rotation) @ t2_translation
    camera2_orientation_in_world = R2_rotation.T  # Orientation relative to the world
    
    # Projection matrix of Camera 2 with intrinsic parameters
    P2_with_intrinsic = np.dot(intrinsic_matrix, np.hstack((R2_rotation, t2_translation)))
    M_orientation = P2_with_intrinsic[:, :3]  # Orientation component of the projection matrix
    principal_ray_direction = M_orientation[2, :]  # Camera 2's principal ray
    
    # Testing which rotation gives the correct principal ray direction
    incorrect_principal_ray = R2_rotation @ np.array([0, 0, 1])
    correct_principal_ray = R2_rotation.T @ np.array([0, 0, 1])
    
    print('Camera 2 Principal Ray Direction:', principal_ray_direction)
    print('Incorrect Principal Ray (Using R2):', incorrect_principal_ray)
    print('Correct Principal Ray (Using R2.T):', correct_principal_ray)
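
Since the translation can only be recovered up to scale, a natural final check (a sketch reusing the script's variables) is to compare directions rather than magnitudes:

    # recoverPose returns a unit-length translation; compare directions only
    t_gt_direction = (t2_translation / np.linalg.norm(t2_translation)).ravel()
    t_recovered_direction = recovered_t_vector.ravel() / np.linalg.norm(recovered_t_vector)
    print('Ground Truth Translation Direction:', t_gt_direction)
    print('Recovered Translation Direction:', t_recovered_direction)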