Scene Completion using Millions of Photographs
This project contains the results and implementation details of a method for completing images using millions of photographs as presented in James Hays' and Alexei Efros' SIGGRAPH 2007 paper "Scene Completion Using Millions of Photographs". This project was part of the course Processing Digital Camera Images (winter term 2011/12) at Technical University Berlin. The implementation was done by Christian Thurow and Christopher Leiste over the course of one semester. Our final presentation in class can be found here.
The main goal is to replace image information in a given query image in the area defined by a mask. Be it for the sake of missing image information or creating new versions of the input, the masked area is to be covered by another "appropriate" image.
The figure below shows one of the results we produced, for more please refer to the results section below.
Input Image | Input Mask | Our result |
The following figure shows the proposed algorithm from a high level view. In order to find suitable candidates for insertion, we look up similar images in a large pre-processed image database, but use only descriptors of the images to compare them, rather than the pixel information themselves. Due to the massive amount of data being processed, pre-calculation of the almost 1.5M images is inevitable.
Figure 2: High Level Overview of AlgorithmOnce a good match for the input image has been found, the paper suggests that the best fit is being calculated. This is done by altering the scale, rotation and translation of the similar image and comparing it with the input image at the mask's position. After that the image is aligned properly we can then use a graph cut algorithm to improve the mask in a way that it falls into areas for which the gradients of the two images to be combined are as similar as possible. This will later result in a smooth transition when blending the images. Once the improved mask is available, the input image is merged with the candidate using Poisson blending.
From Hays et al. method we've made several improvements:
- Poisson: border pixel from input or output? ->adaptively!
- changed graph cut weighting inside mask
- GIST on color
- only GIST near mask
- distributed implementation (retrieval and search separate)
- suggestion: weighting the GISTs differently
With these improvements, we were able to achieve the following results:
Hays result | Our result |
The software was coded in C++ and extensively uses OpenCV to do image processing. To be able to run a whole set of images in a row, we added a batch script that doesn't rely on user interaction (we call this the retrieval framework). Further, as external libraries we've used: AMD, UMFPack, Eigen and graphcut - see the Dependencies folder for details on versions, etc.
Figure 1: 6x6 tile Layout of GIST DescriptorSearching a database of millions of images is quite an effort. On the one hand, hundreds of gigabytes need to be read
which already takes a lot of time, depending on the medium they are saved on. Also comparing images to one another
is time-consuming. On the other hand, when looking for images, we are not interested in single pixels being similar or different.
This is the reason we use GIST to describe images for us.
Every image's GIST descriptor is pre-calculated. This means that the image is divided into tiles and for each tile
edges of different frequencies and angular orientations are being detected and represented by a number.
When comparing two images, those numbers can be compared component-by-component and
afterwards summed up to give us a single value describing the similarity between two images. For the images in our database,
the GIST descriptors are not going to change over time. This is why it is reasonable to pre-calculate them to be able to just read them
again when they are needed.
Our GIST algorithm uses only greyscale information about the images. It doesn't care about colors, as we found that it did not impact the results in a significant positive manner.
Building the big database took about 49 hours on a modern CPU(from 2012) with 4 cores. However, browsing the database for similarities with an input image takes only about 7 minutes.
The original algorithm, proposed in the paper, suggests to calculate the best fit in the Lab color space by comparing rotated, scaled and translated versions of similar image to the query image. Although it may be a simple task, this is very computationally expensive. Depending on the dimensionality of the problem, this would add a considerably large amount to the overall processing time. We found that despite not having implemented this feature, the results are comparable to the original implementation of the papers algorithm.
Figure 2: left: Input Mask; right: Refined Mask after Graph CutWhen a candidate for blending has been found, we improve the input mask using a graph cut algorithm. The input mask already covers the areas of the input image that the user specified. But the edges of the mask are not necessarily the best seams to stitch the images. In case the gradients of both images are very different to each other, two very semantically different areas may be stitched together. In order to reduce that risk, we allow the mask to leave from its original path and find its place some pixel lengths further if that helps to make the subsequent blending appear more plausible.
The graph we are building gets one node for each pixel that will be in the final image. Each pixel node is connected by edges to its four immediate neighbors. The weights of these edges later decide where the best cut will be. This is where we put the difference of both images' gradients combined with a cost function that assumes very low values on the original path of the mask and quickly rises for growing distance. To create a flow inside the graph, we also need sources and sinks for it. Those are placed inside the mask and on the borders of the input image so that the flow definitely goes through the border of the original mask. We experimented with the weighting function and found that an extension to the original implementation may be helpful in some cases, where we allow not only adding pixels to the mask, but also cutting out pixels from the mask if it helps minimizing the cutting costs. For calculating the minimal cut, we utilized a library implementing the max-flow algorithm.
Calculating the best cut takes about 1 minute for 15 results.
In order to get convincing results, we use Poisson blending for the actual joining of the images. From our experience, once a similar image is found, this step adds the most improvement to the final result. To solve the Poisson equations we make use of the UMFPACK library. For that we adapted the implementation found here and made several improvements to it.
The runtime of the Poisson blending mainly depends on the size of the final mask. 15 images are usually blended within 2 to 4 minutes.
On this page, we present our results. We picked some of the good and bad results by hand to use them for an evaluation. You also find many examples of the output of the test set already used in the original implementation of Scene Completion Algorithm.
In many cases, the algorithm gave us beautiful results that were sometimes hard to distinguish from real photographs. Especially with natural landscapes, the seamless blending combined with the optimized mask really does its job.
When there are problems merging images, it has mostly to do with semantic errors. Big letters in the sky, a cow on a roof, a creek of water evolving into a stone wall – the human eye knows that those thing cannot exist. But as long as no semantic validation of the results is happening, those mistakes cannot be avoided. The GIST descriptor is already able to reduce the most serious flaws by only allowing images that have similar distributions of their edges all over the canvas. But the final check if the result really makes sense still remains unsolved.
To see the whole test set and results of it, see the Documentation/img/results folder.
Other methods were proposed to solve the problem of scene completion. Find below a brief purely visual comparison of the most prominent methods available. All methods are compared by using the same input and mask, as shown in the following figure. We picked the scene completion methods of Criminisi et al., Wilczowiak et al. as well as Microsofts Smart Erase tool. We found that the most striking results are those produced by the scene completion using millions of photographs.
input + mask | Criminisi et al. | MS Smart Erase |
Wilczkowiak et al. | Hays et al. | Our result |
In O-Notation.
O(p) , p - pixels in image
O(i) , i - number of images
O(mn^2) , n - nodes, m - edges in graph, n = p, m = 2p -> O(2*p^3) , p - pixels in image
An Experimental Comparison of Min-Cut/Max-Flow Algorithms for Energy Minimization in Vision (Yuri Boykov and Vladimir Kolmogorov) http://www.csd.uwo.ca/~yuri/Papers/pami04.pdf
complexity of gaussian elimination: O(n^3) , n - matrix dimension our matrix dimension is: 3*(pixels in mask) -> O(3cp^3) , c = (mask pixels) / (image pixels) <- constant, p - image pixels http://en.wikipedia.org/wiki/Gaussian_elimination