# Superaccurate Camera Calibration via Inverse Rendering

###### Abstract

The most prevalent routine for camera calibration is based on the detection of well-defined feature points on a purpose-made calibration artifact. These could be checkerboard saddle points, circles, rings or triangles, often printed on a planar structure. The feature points are first detected and then used in a nonlinear optimization to estimate the internal camera parameters. We propose a new method for camera calibration using the principle of inverse rendering. Instead of relying solely on detected feature points, we use an estimate of the internal parameters and the pose of the calibration object to implicitly render a non-photorealistic equivalent of the optical features. This enables us to compute pixel-wise differences in the image domain without interpolation artifacts. We can then improve our estimate of the internal parameters by minimizing pixel-wise least-squares differences. In this way, our model optimizes a meaningful metric in image space under the assumption of normally distributed noise, which is characteristic of camera sensors. We demonstrate using synthetic and real camera images that our method improves the accuracy of estimated camera parameters compared with current state-of-the-art calibration routines. Our method also estimates these parameters more robustly in the presence of noise and in situations where the number of calibration images is limited.

Further author information: (Send correspondence to M.H.)

M.H. E-mail:

## 1 Introduction

Accurate camera calibration is essential for the success of many optical metrology techniques such as pose estimation, white light scanning, depth from defocus, passive and photometric stereo, and more. To obtain sub-pixel accuracy, it can be necessary to use high-order lens distortion models, but this necessitates a large number of observations to properly constrain the model and avoid local minima during optimization.

A very commonly used camera calibration routine is that of Zhang[19]. This is based on detection of feature points, an approximate analytic solution, and a nonlinear optimization of the reprojection error to estimate the internal parameters, including lens distortion. Oftentimes, checkerboard corners are detected using Harris’ corner detector[11], followed by sub-pixel saddle-point detection, such as that of Förstner and Gülch[8], which is implemented in OpenCV’s cornerSubPix() routine. This standard technique can be improved, for example, by more robust and precise sub-pixel corner detectors[9, 3] or by use of a pattern different from the prevalent checkerboard[6, 10]. A different line of work aims at reducing perspective and lens-dependent bias in sub-pixel estimates[13, 15]. In the work of Datta[5], reprojection errors are reduced significantly by iteratively rectifying images to a frontoparallel view and re-estimating saddle points. Nevertheless, such techniques still depend on how accurately, and with how little bias, the corners/features were detected in the first place. Perspective and lens distortion cannot be accounted for directly at detection time, as their parameters are known only after calibration. Instead, the common approach is to make the detector largely invariant to such effects. However, for larger features such as circles, it is questionable whether these can be detected in an unbiased way without prior knowledge of lens parameters. In addition, the distribution of the localization error is unknown, so least-squares optimization may not be optimal.
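To make the quantity being minimized concrete, the following sketch projects 3D points through a pinhole camera with radial and tangential (Brown-Conrady) distortion and evaluates the reprojection error. The function names and the specific three-term radial model are our own illustrative choices, not prescribed by Zhang's paper.

```python
import numpy as np

def project(P_world, R, t, K, dist):
    """Project 3D points with a pinhole model plus Brown-Conrady distortion.

    dist = (k1, k2, k3, p1, p2): radial (k) and tangential (p) coefficients.
    """
    k1, k2, k3, p1, p2 = dist
    Pc = P_world @ R.T + t                             # world -> camera frame
    x, y = Pc[:, 0] / Pc[:, 2], Pc[:, 1] / Pc[:, 2]    # normalized coordinates
    r2 = x**2 + y**2
    radial = 1 + k1 * r2 + k2 * r2**2 + k3 * r2**3
    xd = x * radial + 2 * p1 * x * y + p2 * (r2 + 2 * x**2)
    yd = y * radial + p1 * (r2 + 2 * y**2) + 2 * p2 * x * y
    uv1 = np.column_stack([xd, yd, np.ones_like(xd)]) @ K.T
    return uv1[:, :2]

def reprojection_rmse(observed_uv, P_world, R, t, K, dist):
    # Root-mean-square reprojection error: the quantity minimized by
    # feature-based calibration routines such as Zhang's method.
    diff = observed_uv - project(P_world, R, t, K, dist)
    return np.sqrt(np.mean(np.sum(diff**2, axis=1)))
```

Feature-based calibration minimizes this error over the intrinsics, distortion coefficients, and per-image poses, given detected corner locations as `observed_uv`.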

In this paper, instead of relying solely on the sub-pixel accuracy of points in the image, we render an image of the calibration object given the current estimate of calibration parameters and the pose of the object. This non-photorealistic rendering of the texture of the calibration object can be compared with the observed image, which lets us compute pixel-wise differences in the image domain without interpolation. Because we are comparing differences in pixel intensities, we can model the errors as normally distributed, which closely matches the noise characteristics usually seen in camera images. This process is iterated in an optimization routine so that we directly minimize the squared difference between the observed pixels and our rendered equivalent.

To ensure convergence of the optimization, the error must be differentiable with respect to camera parameters, object pose, and image coordinates. We ensure this by rendering slightly smoothed versions of the calibration object features.
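As one way to realize such a smoothed rendering, a checkerboard texture can be evaluated by passing the underlying square wave through a soft threshold. The tanh kernel and `sharpness` parameter below are illustrative choices for this sketch, not necessarily the smoothing used in our implementation.

```python
import numpy as np

def smooth_checker(u, v, period=1.0, sharpness=20.0):
    """Differentiable checkerboard intensity at board coordinates (u, v).

    A hard checkerboard is sign(sin(pi*u/period) * sin(pi*v/period)).
    Passing the product through tanh instead yields a smooth transition
    whose width is controlled by `sharpness`, keeping the rendering (and
    hence the pixel-wise objective) differentiable in u and v.
    """
    s = np.sin(np.pi * u / period) * np.sin(np.pi * v / period)
    return 0.5 * (1.0 + np.tanh(sharpness * s))
```

Far from square boundaries the function saturates at 0 or 1; near boundaries it ramps smoothly through 0.5, so the gradient with respect to the back-projected board coordinates is well defined everywhere.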

## 2 Related Work

We use a texture for our implicit rendering. This bears some resemblance to the version of texture-based camera calibration[18] in which a known pattern is employed. We thus inherit some of the robustness and accuracy benefits that this method gains by not relying exclusively on feature extraction. Our optimization strategy is, however, simpler and more easily applied in practice than their rank minimization problem with nonlinear constraints.

The work by Rehder *et al.*[16] is more closely related to ours. They argue that an initial selection of feature points (like corners) is an inadequate abstraction. As in our work, they use a standard calibration technique for initialization. With this calibration, they implicitly render the calibration target into selected pixels to get a more direct error formulation based on image intensities. This is then used to further refine different calibration parameters through optimization. Their approach results in little difference from the initial calibration values in terms of intrinsic parameters. Instead, they focus on the use of their technique for estimating line delay in rolling shutter cameras and for inferring exposure time from motion blur. Rehder *et al.* select pixels for rendering where they find large image gradients in the calibration image. Our approach differs in two respects: we use all the pixels onto which the target is projected, and we minimize a different objective function.
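For reference, such gradient-based pixel selection can be sketched as follows. The quantile-based threshold rule is our own illustrative choice and is not taken from Rehder *et al.*

```python
import numpy as np

def high_gradient_mask(img, frac=0.1):
    """Select the fraction `frac` of pixels with the largest gradient
    magnitude, mimicking gradient-based pixel selection schemes."""
    gy, gx = np.gradient(img.astype(float))   # central differences per axis
    mag = np.hypot(gx, gy)                    # gradient magnitude
    thresh = np.quantile(mag, 1.0 - frac)
    return mag >= thresh
```

On a calibration image, this keeps pixels near the black/white transitions of the target, which is where intensity differences carry most information about the projection.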

In more recent work, Rehder and Siegwart[17] extend their direct formulation of camera calibration[16] to include calibration of inertial measurement units (IMUs). In this work, the authors introduce blurring into their renderings to simulate imperfect focusing and motion blur. We also use blurring, and the objective function of this later work is more similar to ours. However, they still select only a subset of pixels for rendering based on image gradients, and, again, they did well in estimating exposure time from motion blur but did not otherwise improve results over the baseline approach.

In terms of improved image features, Ha *et al.*[10] proposed replacing the traditional checkerboard with a triangular tiling of the plane (a deltille grid). They describe a method for detecting this pattern and checkerboards in an image and introduce a method for computing the sub-pixel location of corner points for deltille grids or checkerboards. This is based on resampling of pixel intensities around a saddle point and fitting a polynomial surface to these. We consider this approach state-of-the-art in camera calibration based on detection of interest points, and we therefore use it for performance comparison.
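The polynomial-fit step of such saddle-point refinement can be illustrated with the following minimal sketch. Ha *et al.* fit to resampled intensities, whereas here, for simplicity, we fit a quadratic surface directly to a pixel patch; the window size and fitting details are illustrative assumptions.

```python
import numpy as np

def saddle_subpixel(patch):
    """Sub-pixel saddle localization by fitting a quadratic surface.

    Fits z = a*x^2 + b*x*y + c*y^2 + d*x + e*y + f to the patch by
    least squares and returns the critical point of the fitted surface,
    relative to the patch center.
    """
    h, w = patch.shape
    y, x = np.mgrid[0:h, 0:w].astype(float)
    x -= (w - 1) / 2.0
    y -= (h - 1) / 2.0
    A = np.column_stack([x.ravel()**2, (x * y).ravel(), y.ravel()**2,
                         x.ravel(), y.ravel(), np.ones(h * w)])
    a, b, c, d, e, f = np.linalg.lstsq(A, patch.ravel(), rcond=None)[0]
    # The gradient of the quadratic vanishes at the saddle point:
    # [2a b; b 2c] [x; y] = -[d; e]
    return np.linalg.solve(np.array([[2 * a, b], [b, 2 * c]]),
                           -np.array([d, e]))
```

Because a saddle point is a critical point of the intensity surface, solving the 2x2 linear system above yields its location at sub-pixel precision.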

## 3 Method

Our method builds on top of an existing camera calibration method. This is used as a starting guess for the camera matrix *K*, the distortion coefficients *k*, and the pose of each calibration object *j*, *j* = 1, …, *n*. We use this guess to render images of the calibration objects, which we compare with images captured by the camera. Based on this comparison, the optimizer updates the camera calibration until the result is satisfactory. An outline of our method is shown in Figure 1.
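Schematically, the pixel-wise objective can be set up for a standard nonlinear least-squares solver as follows. Here `render_fn` and `unpack_fn` are hypothetical placeholders for the renderer and the parameter layout; they are not part of a published API.

```python
import numpy as np

def pixelwise_residuals(params, observed_images, render_fn, unpack_fn):
    """Stack pixel-wise intensity differences over all calibration images.

    `unpack_fn` splits the flat parameter vector into intrinsics,
    distortion coefficients, and per-image poses; `render_fn` produces
    the smoothed non-photorealistic rendering of the calibration target
    for one image, along with a mask of the pixels it covers. Both are
    placeholders for an actual implementation.
    """
    K, dist, poses = unpack_fn(params)
    res = []
    for img, pose in zip(observed_images, poses):
        rendered, mask = render_fn(K, dist, pose, img.shape)
        # Only pixels onto which the target projects contribute.
        res.append((img[mask] - rendered[mask]).ravel())
    return np.concatenate(res)

# With SciPy, the refinement is then a standard trust-region solve, e.g.:
# from scipy.optimize import least_squares
# refined = least_squares(pixelwise_residuals, x0,
#                         args=(images, render_fn, unpack_fn))
```

Minimizing the sum of squares of these residuals corresponds to maximum-likelihood estimation under the assumed Gaussian pixel noise.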