Deep Image Prior

Posted by Luming on December 27, 2017

During my internship at Cornell, I organized and wrote a paper by myself for the first time. Although I believed the idea I proposed was easy to follow, I found it extremely hard to describe it to someone else concisely. That’s why the introduction and abstract I wrote were quite terrible. That’s also when I started to realize that research is more than just doing experiments. Besides the fact that I am not a native English speaker, I think this reveals another truth: I often fail to grasp the main point and the big picture when doing research. If I don’t address this issue, I will no doubt always be an immature researcher.

So, to practice my academic English writing and to improve my ability to summarize research work, I have decided to write summaries of the good papers I read.

Today, let me start with the paper I read last week, ‘Deep Image Prior’. Honestly, at first glance I didn’t like this paper, because the word ‘Deep’ in its title made me think it was another low-quality paper about deep neural networks. But it proved me wrong with its beautiful mathematical formulation and concise idea.

Deep convolutional neural networks have achieved great success on a variety of computer vision tasks, and people attribute this success to the prior knowledge that CNNs learn from large-scale image datasets. However, this paper shows that the prior might also come from the CNN architecture itself. To support this idea, the paper conducts experiments on several computer vision tasks, including super-resolution, denoising and inpainting.

At the beginning, the paper casts the whole problem as a very concise mathematical formula. In image restoration, the goal is to recover the original image $x$ from a corrupted image $x_0$: $\min\limits_{x} E(x;x_0)+R(x)$. $E(x;x_0)$ is a data term measuring how well $x$ agrees with the information in $x_0$. $R(x)$ is the so-called image prior, measuring how much $x$ looks like a normal or even good image. Previously, people spent a lot of time designing the prior term. The authors instead assume that every image produced by the neural network already matches the prior, that is, it is a good image. In detail, define $f$ as a deep ConvNet with parameters $\theta$, and let $z$ be a fixed input. The optimization problem then becomes $\min\limits_{\theta} E(f_\theta(z);x_0)$, and the restored image is $f_\theta(z)$ at the optimized $\theta$. Here the CNN’s parameters are randomly initialized, and the input $z$ is random noise that stays fixed during optimization. So if this method produces good results, it supports the assumption that the network itself carries an image prior.
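To make the re-parameterized problem concrete, here is a minimal sketch of my own, not the authors’ code. It makes several assumptions: the signal is 1-D rather than an image, the data term is the mean squared error $E(x;x_0)=\frac{1}{n}\|x-x_0\|^2$, and a single linear convolution stands in for the deep ConvNet $f_\theta$. Such a toy model does not reproduce the prior studied in the paper. It only shows the mechanics: $z$ is drawn once and kept fixed, $\theta$ is the only thing being optimized, and $x_0$ is the only data used. The function name `fit_to_corrupted` and all hyperparameters are hypothetical.

```python
import numpy as np

def fit_to_corrupted(x0, channels=4, kernel=5, steps=200, lr=0.02, seed=0):
    """Minimise mean((f_theta(z) - x0)**2) over theta by gradient descent.

    Toy stand-in for the paper's setup: f_theta is ONE linear 1-D
    convolution (an assumption), not a deep ConvNet.
    """
    rng = np.random.default_rng(seed)
    n = x0.shape[0]

    # z: random noise input, drawn once and never updated.
    z = rng.standard_normal((channels, n + kernel - 1))
    # windows[c, i, k] = z[c, i + k], so a convolution becomes one einsum.
    windows = np.stack([z[:, k:k + n] for k in range(kernel)], axis=-1)

    # theta: randomly initialised filter weights, the only free variables.
    theta = 0.1 * rng.standard_normal((channels, kernel))

    losses = []
    for _ in range(steps):
        x = np.einsum("cik,ck->i", windows, theta)      # x = f_theta(z)
        residual = x - x0
        losses.append(float(np.mean(residual ** 2)))    # E(f_theta(z); x0)
        # Analytic gradient of the mean squared error w.r.t. theta.
        grad = (2.0 / n) * np.einsum("cik,i->ck", windows, residual)
        theta -= lr * grad

    return np.einsum("cik,ck->i", windows, theta), losses


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    clean = np.sin(np.linspace(0.0, 4.0 * np.pi, 64))
    x0 = clean + 0.3 * rng.standard_normal(64)          # corrupted observation
    restored, losses = fit_to_corrupted(x0)
    print("data term at first and last step:", losses[0], losses[-1])
```

Note that the clean signal never enters `fit_to_corrupted`; only the corrupted $x_0$ does. In the paper, a deep ConvNet takes the place of the single filter bank, and that architecture is what the authors argue supplies the prior.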

When looking at the experiments, I was quite surprised by the strong performance, even compared to state-of-the-art methods trained on large-scale datasets. It should be noted that this neural network is fitted to one single image, without even having the ground truth. The authors also set up a website to show their demos and released their code on GitHub. I really like this kind of behavior, which I think is good for the whole research community, especially after following all those terrible papers without any code or implementation details.

Here I only pick some of the results that impressed me most. In the first example, (a) is the original, clean image and (b) is the corrupted image $x_0$. (c) is the previous state-of-the-art result, from a model trained on numerous images. (d) is the proposed method, which uses only the corrupted image.

The second example is an inpainting task. (b) is the previous state of the art, and (c) and (d) are the proposed method. It really impressed me that it could recover the circular curve of the table and the contents of the door.

Besides surprise, I also have some thoughts. Honestly, this paper’s result is easy to understand from the viewpoint of CNN optimization. The so-called prior is actually the regular patterns in images, which are easy for a CNN to learn because they appear often and therefore receive more gradient. The so-called noise is irregular and thus hard to learn. But this paper provides a novel way to look at this phenomenon, especially through the formula at the beginning. It suggests that a CNN could really act the way a human being does when viewing images. However, could this method also generalize to LSTMs handling language? Maybe the answer is no. This paper again reminds me that, compared to the success of neural nets in computer vision, a huge difficulty we still need to overcome is inventing a network architecture that handles language efficiently, just as the CNN does for vision. There is still a long way to go.