Images and Image Formation
Images
When we capture an image we sample light intensity in space (and time), and quantize the result
Fidelity depends on sampling density and quantization strength (Spatial resolution and radiometric resolution)
Shading
Bidirectional Reflection Distribution Function (BRDF)
the proportion of the incoming light is reflected along outgoing ray
$$ \rho(\theta_{i}, \phi_{i}, \theta_{i}, \phi_{i}) $$
Diffuse (Lambertian) Reflection
- Light is reflected equally in all directions
- Dull, matte surfaces like chalk or latex paint
- Microfacets scatter incoming light randomly
- Effect is that light is reflected equally in all directions
- Common assumption in computer vision
- Brightness of the surface depends on the incidence of illumination
Lambert’s law
Lambertian surface BRDF:
$$ B = \rho (N \cdot S) = \rho ||S|| cos\theta $$
- B: radiosity (total power leaving the surface per unit area)
- ρ: albedo (fraction of incident irradiance reflected by the surface)
- N: unit normal
- S: source vector (magnitude proportional to intensity of the source)
Specular Reflection
- Radiation arriving along a source direction leaves along the specular direction (source direction reflected about normal) in a lobe of directions
- Specular: reflected energy fall
Phong Model
Ambient + Diffuse + Specular
Color
Color Perception
Rods and cones act as filters on the spectrum
- To get the output of a filter, multiply its response curve by the spectrum, integrate over all wavelengths
- Each cone yields one number
- All colors are reduced to three real numbers
- How can we represent an entire spectrum with 3 numbers?
- We can’t! Most of the information is lost
- As a result, two different spectra may appear indistinguishable
- such spectra are known as metamers
Color Spaces: RGB
Any color:
$$ C = r \times R + g \times G + b \times B $$
- Easy for devices
- Non-perceptual
- Strongly correlated channels
- Where is hue and saturation? We
Normalised RGB
How to cope with changing illumination?
- E.g., (r,g,b) with doubled illumination to (2r,2g,2b)
- To make many vision tasks easier, a common trick is to normalise for illumination
$$ R = \frac{r}{r+g+b} \quad G = \frac{g}{r+g+b} \quad B = \frac{b}{r+g+b} $$
$$ R = \frac{2r}{2r+2g+2b} \quad G = \frac{2g}{2r+2g+2b} \quad B = \frac{2b}{2r+2g+2b} $$
For normalised RGB, common to use only two channels, e.g., (r,g), because r+g+b=1.
- Shadowed and unshadowed RGB areas nearly the same
- Normalised white and grey all the same
- Normalized black is unpredictable due to division by small number
Color Space: HSV
Hue, Saturation, Value (Intensity)
Cameras
Pinhole Model
- Captures pencil of rays – all rays through a single point
- The point is called Center of Projection (COP)
- The image is formed on the Image Plane
- Effective focal length f is distance from COP to Image Plane
3D to 2D Projection
- Points -> Points
- But projection of points on the focal plane is undefined
- Lines -> Lines
- But lines through the focal point (visual rays) project to a point
- Planes -> Planes
- But planes through the focal point project to lines
- Many-to-one
- All points along the same visual ray map to the same point in an image
- An infinite number of 3D world scenes correspond to any single image
- We (our brains and computer vision algorithms) need to use prior knowledge about likely scenes to disambiguate
Ideal Camera Models
Modelling Projection
Given a point P=(x,y,z) in the world, where does it appear in the image plane?
- z: z distance from pinhole to object
- f: z distance from pinhole to plane
- y: y distance from pinhole to object
- y': y distance from pinhole to plane
$$ \frac{f}{-y'} = \frac{z}{y} \quad \text{ to } \quad y' = \frac{-fy}{z} $$
hence: $$ (x,y,z) \rightarrow (-f \frac{x}{z}, -f \frac{y}{z}) $$
Projection from world to imaging plane
- All points along one ray project to the same point on the plane
- The projection of the point at O is undefined
- The image is inverted vertically and horizontally (Kepler)
Projective Geometry: Some Linear Algebra
!!important!! see slides
Filtering and Edges
Low-Level Image Processing Overview
Point Operators
- Point operators are the simplest kind of image processing transform
- Each pixel's output value depends only on the corresponding pixel's input value
- Examples: Brightness adjustment by adding a constant to each pixel's value
- Examples: Histogram equalization, contrast, color correction, etc.
Image Filtering
- Goal
- Remove unwanted sources of variation
- Keep the information relevant for whatever task we need to solve
- Approach
- Modify the pixels in an image based on a function of the local neighbourhood around each pixel
Fourier Transforms
Fourier analysis decomposes any 2D luminance image into the sum of a set of sinusoidal gratings that differ in spatial frequency, orientation, amplitude and phase. (Modify an image holistically by manipulating its power spectra.)
Image Filtering
Images as Functions
- Interpret an image as a function f from R2 to R1 or R3
- Consider f(x,y) gives the intensity at position (x,y)
- Function is defined over a rectangle, with a finite range:$$ f: [a,b] \times [c,d] \rightarrow [0,255] $$
- Color image: $$ f(x,y) = [r(x,y), g(x,y), b(x,y)] $$
Images as Discrete Functions
- Images are usually digital (discrete):
- Sample the 2D space on a regular grid
Filtering as Functions of Functions
$$ f[n,m] \rightarrow \text{System S} \rightarrow g[n,m] $$
or $$ g[n,m] = S{f[n,m]} $$
Image Filtering
Compute function of local neighbourhood at each position
$$ h[m,n] = \sum_{k,l} f[k,l] I[m+k,n+l] $$
- h: output image
- f: filter
- I: input image
- 2d coords=k, l
- 2d coords=m, n
Linear Filtering
- Different Choices of filter f achieve different effects
- Let’s replace each pixel with a weighted average of its neighborhood
- The weights are called the filter kernel
Box filter (Mean Filter)
$$
\frac{1}{9}
\begin{bmatrix}
1 & 1 & 1 \
1 & 1 & 1 \
1 & 1 & 1 \
\end{bmatrix}
$$
- Replaces each pixel with an average of its neighborhood
- Achieve smoothing effect (remove sharp features)
Non-linear Filtering
Median filter
- Replace each pixel by the median of its neighbours
- A 'rank' filter based on ordering of gray levels
- Outlier robustness. Better at removing salt and pepper noise
Properties of Convolution and Correlation
2D Correlation
h=filter2(f,I)
or h=imfilter(I,f)
$$ h[m,n] = \sum_{k,l} f[k,l] I[m+k,n+l] $$
- Key properties:
- Linearity:
imfilter(I, f1 + f2) = imfilter(I,f1) + imfilter(I,f2)
- Shift invariance: same behaviour regardless of pixel location i
mfilter(I,shift(f)) = shift(imfilter(I,f))
- Any linear, shift-invariant operator can be represented as a convolution
2D Convolution
h=conv2(f,I)
$$ h[m,n] = \sum_{k,l} f[k,l] I[m-k,n-l] $$
conv2(I,f)
is the same as filter2(rot90(f,2),I)
- Correlation and convolution are identical when the filter is symmetric
- Key properties:
- Commutative: a * b = b * a
- Conceptually no difference between filter and signal
- But particular filtering implementations might break this equality, e.g., image edges
- Associative: a * (b * c) = (a * b) * c
- Often apply several filters one after another: (((a * b1) * b2) * b3)
- This is equivalent to applying one filter: a * (b1 * b2 * b3)
- Correlation is not associative (rotation effect)
- Distributes over addition: a * (b + c) = (a * b) + (a * c)
- Scalars factor out: ka * b = a * kb = k (a * b)
- Identity: unit impulse e = [0, 0, 1, 0, 0], a * e = a
Gaussian Kernel Filtering
$$ G_{\sigma} = \frac{1}{2\pi\sigma^2} e^{-\frac{x^2 + y^2}{2\sigma^2}} $$
Standard deviation $\sigma$: determines extent of smoothing
- Remove high-frequency components from the image (low-pass filter)
- Convolution with self is another Gaussian
- So can smooth with small-s kernel, repeat, and get same result as larger-s kernel would have
- Convolving two times with Gaussian kernel with std. dev. $\sigma$ is same as convolving once with kernel with std. dev. $\sigma \sqrt{2}$
- Separable kernel
- Factors into product of two 1D Gaussians
Separability
The 2D Gaussian can be expressed as the product of two functions, one a function of x and the other a function of y
$$ G_{\sigma}(x,y) = \frac{1}{2\pi\sigma^2} \exp\left(-\frac{x^2 + y^2}{2\sigma^2}\right) = \left( \frac{1}{\sqrt{2\pi\sigma}} \exp\left(-\frac{x^2}{2\sigma^2}\right) \right) \left( \frac{1}{\sqrt{2\pi\sigma}} \exp\left(-\frac{y^2}{2\sigma^2}\right) \right) $$
- Separability means that a 2D convolution can be reduced to two 1D convolutions (one among rows and one among columns)
- (Recall associative property of convolution)
- What is the complexity of filtering an n*n image with an m*m kernel? $O(n^2m^2)$
- What if the kernel is separable? $O(n^2m)$
Dealing with Edges
What about missing pixel values?
- the filter window falls off the edge of the image
- need to extrapolate
- methods (MATLAB):
- clip filter (black):
imfilter(f, g, 0)
- wrap around:
imfilter(f, g, ‘circular’)
- copy edge:
imfilter(f, g, ‘replicate’)
- reflect across edge:
imfilter(f, g, ‘symmetric’)
Edges Detection
Origin of Edges
- surface normal discontinuity
- depth discontinuity
- surface color discontinuity
- illumination discontinuity
Edge Detection
An edge is a place of rapid change in the image intensity function
Derivatives with convolution
For 2D function f(x,y) and discrete data, we can approximate using finite differences:
$$ \frac{\partial f(x, y)}{\partial x} \approx \frac{f(x + 1, y) - f(x, y)}{1} $$
It shows changes with respect to x, so it draws the vertical lines in the image.
On the contrary, use respect to y to draw the horizontal lines:
$$ \frac{\partial f(x, y)}{\partial y}$$
Image Gradient
The gradient of an image with direction $\theta$:
$$ \nabla f = \left[ \frac{\partial f}{\partial x}, \frac{\partial f}{\partial y} \right] $$
The gradient points in the direction of most rapid increase in intensity
- $\nabla f = \left[ \frac{\partial f}{\partial x}, 0 \right]$: left to right, black to white
- $\nabla f = \left[ 0, \frac{\partial f}{\partial y} \right]$: top to bottom, black to white
- The gradient direction is given by: $$ \theta = \tan^{-1} \left( \frac{\partial f}{\partial y} / \frac{\partial f}{\partial x} \right) $$
- The edge strength is given by the gradient magnitude: $$ |\nabla f| = \sqrt{\left(\frac{\partial f}{\partial x}\right)^2 + \left(\frac{\partial f}{\partial y}\right)^2} $$
With Noise
Due to the noise, it is hard to find the edges by derivatives, so we need to smooth it first, e.g. use kernel g
To find edges, look for peaks in $\frac{d}{dx}(f \times g)$
Differentiation is convolution, and convolution is associative:
$$ \frac{d}{dx} (f \times g) = f \times \frac{d}{dx} g $$
This saves one operation: Precompute the differentiated filter
Smoothing vs Derivative Filters
- Smoothing filters
- Gaussian: remove “high-frequency” components; “low-pass” filter
- Can the values of a smoothing filter be negative? No
- What should the values sum to?
- One: constant regions are not affected by the filter
- Derivative filters
- Derivatives of Gaussian
- Can the values of a derivative filter be negative?
- What should the values sum to?
- Zero: no response in constant regions
Canny Edge Detector
Filtering, followed by some post processing:
- Compute x and y gradient images
- Find magnitude and orientation of gradient
- Non-maximum suppression:
- Thin wide “ridges” down to single pixel width
- Linking and thresholding (hysteresis):
- Use a high threshold to start edge curves and a low threshold to continue them
- Two thresholds - high and low
- Grad. mag. > high threshold: strong edge
- Grad. mag. < low threshold: noise
- low threshold < Grad. mag. < high threshold: weak edge
- 'Follow' edges starting from strong edge pixels
- Continue them into weak edges
Line Fitting
Difficulty
- Extra edge points (clutter)
- Which point go with which line, if any?
- Only some parts of each line are detected, some are missing
- Noise in measured edge position, orientation
- How to fit the true underlying edge?
Least Squares Line Fitting
Data: $(x_1,y_1), \text{...}, (x_n,y_n)$
Line equation: $y_i = mx_i + b$
Find $(m, b)$ to minimize
$$ E = \sum_{i=1}^{n} (y_i - mx_i - b)^2 $$
- Good
- Clearly specified objective
- Optimization is easy
- Bad
- Sensitive to outliers
- Bad matches, extra points
- Doesn’t allow you to get multiple good fits
- Detecting multiple objects, lines, etc
Random Sample Consensus (RANSAC)
- Outline:
- Choose a small subset of points ($s$) uniformly at random
- Fit a model to the subset
- Find all points that are “close” to the model (threshold $t$), and reject the rest as outliers
- Do this many times ($N$) and choose the best model
- Algorithm:
- Sample (randomly) the number of points $s$ required to fit the model
- Solve for model parameters using samples
- Score by the fraction of inliers within a preset threshold of the model
- Repeat 1-3 until the best model is found with high confidence
- Key parameters:
- More complex models need more $s$ to fit
- Choose $t$ to separate “noise” and true “outliers”
- Higher outlier ratio. OR complex model requiring more samples $s$
- Less likely to randomly chose an inlying point set, more $N$ required to find a good answer
- Good:
- Robust to outliers
- Applicable to a large number of objective function parameters
- Bad:
- Computational time grows quickly with fraction of outliers and number of parameters
- Not good for getting multiple fits
- Various parameters to tune
- Common applications:
- Line fitting
- (Use least squares fit as a subroutine)
- Computing a homography (e.g., image stitching)
- Estimating fundamental matrix (relating two views)
Interest Points Descriptors Matching
Interest Points
Pipeline for Matching
- Detection: Find a set of distinctive key points
- Description: Extract feature descriptor around each interest point as vector
- Matching: Compute distance between feature vectors to find correspondence
Characteristics of Good Keypoints
- Repeatability: The same keypoint can be found in several images despite geometric and photometric transformations
- Saliency: Each keypoint is distinctive
- Compactness and efficiency: Many fewer keypoints than image pixels
- Locality: A keypoint occupies a relatively small area of the image; robust to clutter and occlusion
Interest Point Detection - Harris
Corners
We should easily recognize the point by looking through a small window. Shifting a window in any direction should give a large change in intensity
Harris Detector
Eigenvalues Corner Detector
$$ H = \sum_{x,y} w(x, y) \begin{bmatrix} I_{xx} & I_{xy} \ I_{yx} & I_{yy} \end{bmatrix} $$
2 x 2 matrix of image derivatives smoothed by Gaussian weights
- First compute $I_x$, $I_y$, and $I_xI_y$ as 3 images; then apply Gaussian to each
- OR, first apply the Gaussian and the compute the derivatives.
To Compute the Eigenvalues: See Slides.
Interpreting the Eigenvalues:
- Flat: $\lambda_1$ and $\lambda_2$ are small; $E$ is almost constant in all directions
- Edge: $\lambda_1 >> \lambda_2$ or $\lambda_2 >> \lambda_1$
- Corner: $\lambda_1$ and $\lambda_2$ are small, $\lambda_1 \sim \lambda_2$; $E$ increases in all directions
Corner Response Function Detector
$$ R = \lambda_1\lambda_2 - \alpha(\lambda_1 + \lambda_2)^2 = \det(M) - \alpha \cdot \text{trace}(M)^2 $$
Where
$$
\det (\begin{bmatrix}
a & b \
c & d
\end{bmatrix}) = ad - bc
$$
$$
\text{trace}(A) = \sum_{i=1}^{n}a_{ii}
$$
Harris Detector: Steps
- Compute derivatives at each pixel and smooth with Gaussian (or smooth first then derivatives)
- Compute Harris matrix H in a window around each pixel
- Compute corner response function R
- Threshold R
- Find local maxima of response function (non-maximum supression)
Properties of the Harris Corner Detector
- Intensity Invariant: Partial
- Translation invariant: Yes
- Rotation invariant: Yes
- Scale invariant: No
Interest Point Detection - SIFT
Convolve the image with a 'blob filter' at multiple scales and look for extrema of filter response in the resulting scale space
- Ideal Filter for Blob Detection:
- Laplacian (second derivative) of Gaussian Scale of Gaussian controls scale of Blob that filter responds to
- In practice:
- Approximate Laplacian-of-Gaussian (LoG) by Difference-of-Gaussian (DoG)
- Blur image with σ Gaussian kernel
- Blur image with kσ Gaussian kernel
- Subtract 2. from 1.
Post Processing
- Procedure so far:
- Returns (lots) of keypoints that are:
- Blobs: Local maxima of LoG/DoG in both space and scale.
- But some may be low-contrast, others may lie on edges.
- Post-processing:
- Prune low-contrast points:
- Remove points with DoG response values below a threshold.
- Prune points along edges:
- Filter based on the Harris ('cornerness') function at each detected blob
Descriptor - Patch
Patches with similar content should have similar descriptors
The simplest way to describe the neighborhood around an interest point is to write down the list of intensities to form a feature vector. But this is very sensitive to all transformations: Small shifts, rotations, illumination, etc.
From Feature Detection to Feature Description
- Detection should be covariant:
- features(transform(image)) = transform(features(image))
- Description should be invariant
- features(transform(image)) = features(image)
Descriptor - SIFT
- Full version
- Divide the 16x16 patch into a 4x4 grid of cells
- Compute an orientation histogram for each cell
- 16 cells * 8 orientations = 128 dimensional descriptor
- Additional Processing
- Gaussian weighted magnitudes
- Trilinear interpolation so each gradient contributes to neighboring bins and orientations
- Threshold gradients to avoid domination by outlying gradients
- Normalize 128-dim vector to magnitude 1
- Orientation Normalization
- Compute orientation histogram
- Select dominant orientation $\theta$
- Normalize: rotate to fixed orientation
Properties of SIFT
- Extraordinarily robust matching technique
- Can handle changes in viewpoint
- Up to about 30 degree out of plane rotation
- Can handle significant changes in illumination
- Sometimes even day vs. night (below)
- Fast and efficient - can run in real time
- Various code available
Matching and Alignment
How to Align Two Images
- Detect and describe keypoints in each image
- Find matching keypoints
- Find the transformation that brings the keypoints into alignment
Basic Keypoint Matching
RANSAC Matching
- Randomly select a seed group of matches
- Compute transformation for seed group (least squares)
- Find inliers to this transformation: SSD(x,Hx’) < threshold.
- Keep the transformation with the largest number of inliers
- Re-compute least-squares estimate of transformation on all of the inliers
Recognition
Representations
Histogram of Gradients (HoG)
- Merits:
- Relatively simple and fast (convolution and binning)
- Based on gradients: Invariant to global illumination change (unlike pixels!)
- Due to quantization of gradients, robust to small changes in viewpoint/scale/pose
- If background clutter has random orientations, it simply adds a constant to the histogram => Some robustness to clutter
- Issues:
- Number of cells proportional to size of image.
- => Different sized images have different sized dense HoG descriptors
- => Every image has to be resampled to fixed size before HoG to get fixed length vector for downstream machine learning classifier.
- NOT robust to translation, major change of pose, viewpoint, etc.
- (=> Not comparing like-with-like on a per-cell basis )