Introduction

Interactive graph cuts is a general purpose interactive segmentation technique used for N-dimensional images as well as video sequences that was proposed by Boykov and Jolly. The user marks certain pixels as 'object' or 'background' to provide hard constraints for segmentation. This means that after the Segmentation takes place the pixels that were marked as 'object' or 'background' will have to belong to the object or the background. Additional soft constraints are imposed through the use of an energy function that incorporates both boundary and region information. At the end an optimal solution is obtained that finds the best balance among all segmentations that satisfy the constraints.

Probability models and energy function

In more detail the pixels(seeds) that were marked by the user as background or object serve a double function. Firstly, they are used to indicate which pixels will absolutely have to belong to the object or the background after the segmentation takes place. Secondly they provide us with clues of what the user considers to be object and what background. However to obtain optimal results additional information has to be incorporated. This is achieved through the introduction of an energy function that combines region and boundary information. The energy function is defined as follows \[ E(A) = \lambda R(A) +B(A)\] \[R(A) = \sum_{p \in P}R_p(A_p)\] \[B(A) = \sum_{{p,q} \in N}B_{p,q} \cdot \delta (A_p, A_q)\] \[ \delta(A_p, A_q) = \begin{cases} 1 & if A_p \neq A_q\\ 0 & otherwise \end{cases} \] $R(A)$ is called region term and it shows how well the color of a pixel fits into a known region model. To construct the region model we use the color information that was attained from the seed pixels. In most cases a GMM is used to model the color distributions, since it has been shown\cite{grabCut} to better match the distributions. However a normalized color histogram may be used as well.

$B(A)$ is called boundary term and is used to model the boundary information. For every pair of neighbouring pixels a score is computed, e.g based on their intensity values, that shows how similar the pixels are. When the pixels are pretty different with each other, then the cost is close to zero and when they are similar the cost is pretty big. For the computation of the scores the following ad hock function is used:

\[B_{p,q} \propto exp(-\frac{(I_p - I_q)^2}{2 \sigma^2}) \cdot \frac{1}{dist(p,q)}\] where $I_p$ and $I_q$ denote the intensity values of neighbouring pixels p and q and $dist(p,q)$ is their corresponding distance. For a 4-neighbourhood system $dist(p,q)$ is equal to 1 and for a 8-neighbourhood system it is equal to 1 or $\sqrt{2}$ depending on the edge.

Graph theory

In order to use graph cuts for the segmentation a flow network is defined that corresponds to the image.

A flow network (also known as a transportation network) is a directed graph where each edge has a capacity and each edge receives a flow.

In the case of graph cuts for images the network has as many nodes as the pixels in the image. The source and the sink are used to represent the object and the background, whereas the edges define the neighbourhood relationships of the pixels. To fully define the network the weights of the edges need also be computed.

This is done through the use of the region term and boundary term. For every edge connecting the terminals with the pixels a weight is computed based on the corresponding region model. Instead of using the probabilities derived from the region model as they are, we first apply the negative log likelihood on them and then set the weights as follows: \[R_p("obj") = -lnPr(I_p|O)\] \[R_p("bkg") = -lnPr(I_p|B)\] For the edges that connect the neighbouring pixels with each other we use the boundary term.

Segmentation technique

After the graph has been defined and the weights have been computed a max flow algorithm is used to find the minimum cut in the graph. According to the max flow minimum cut theorem by finding the maximum amount of flow passing from the source to the sink, the minimum cut is found as well, which actually corresponds to the saturated edges of the network.

Minimum cut is particularly suited for the segmentation of the image since not only does it define a segmentation but it also finds the minimum of the energy function. This is of particular importance because based on the way the weights are allocated It only makes sense to compute the minimum of the energy function. In other words Pixels that belong together in terms of intensity and boundary properties are assigned bigger weights whereas pixels which significantly differ are assigned smaller weights.

Finally after the minimum cut has been computed a binary vector is returned that defines the segmentation.
$A$ = ($A_1$, . . . , $A_p$, . . . , $A_{|P|}$)