In conventional deep learning, training updates the network weights based on the input data. Image style transfer (IST) does the opposite: the network parameters are frozen, and it is the input image that gets updated by gradient descent.
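To make that concrete, here is a minimal, standalone sketch of "optimize the pixels, not the weights" (assuming TF2 eager mode; it is not part of the pipeline below, which uses the Keras backend API, and its loss is only a dummy placeholder):
# Standalone sketch, not part of this notebook's pipeline (dummy loss for illustration only).
import tensorflow as tf
from tensorflow.keras.applications import vgg19

model = vgg19.VGG19(include_top=False, weights="imagenet")
model.trainable = False                                    # the network parameters stay fixed

image = tf.Variable(tf.random.uniform((1, 300, 300, 3)))   # the image itself is the trainable quantity

with tf.GradientTape() as tape:
    features = model(image)
    loss = tf.reduce_mean(tf.square(features))             # placeholder loss, just to have something to minimize

grad = tape.gradient(loss, image)                          # d(loss)/d(image), not d(loss)/d(weights)
image.assign_sub(0.01 * grad)                              # one gradient step applied to the pixels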
1. We take the content and style images and resize them to the same shape.
# Imports
import numpy as np
from PIL import Image
import requests
from io import BytesIO
import tensorflow as tf
import keras.backend as K
from keras.models import Model
from keras.preprocessing.image import load_img, img_to_array, array_to_img
# load_img loads an image into PIL format; img_to_array converts a PIL Image instance to a NumPy array.
from keras.applications import vgg19
from scipy.optimize import fmin_l_bfgs_b
# Hyperparams
ITERATIONS = 2
CHANNELS = 3
IMAGE_WIDTH = 300
IMAGE_HEIGHT = 300
CONTENT_WEIGHT = 0.02
STYLE_WEIGHT = 4.5
TOTAL_VARIATION_WEIGHT = 0.9950
TOTAL_VARIATION_LOSS_FACTOR = 1.25
content_image_path = "input.png"
style_image_path = "style.png"
output_image_path = "output.png"
combined_image_path = "combined.png"
content_url_path = "http://y0.ifengimg.com/e42a62d6e58775df/2015/0813/re_55cc3b76ddef0.jpg"
style_url_path = "http://5b0988e595225.cdn.sohucs.com/images/20181020/9b226e94fad640a9b711c99e5985853c.jpeg"
#Input visualization
# PIL.Image.open reads channels in RGB order, whereas OpenCV's cv2.imread returns BGR order; an image loaded with cv2.imread and displayed as RGB therefore looks bluer.
input_image = Image.open(BytesIO(requests.get(content_url_path).content))
input_image = input_image.resize((IMAGE_WIDTH, IMAGE_HEIGHT))
input_image.save(content_image_path)
input_image
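(As a hypothetical aside, since OpenCV is not used anywhere in this notebook: if the image were loaded with cv2, the channel order would have to be flipped before treating it as RGB.)
# Hypothetical aside only: cv2 is not part of this pipeline.
import cv2
bgr = cv2.imread("input.png")   # OpenCV returns an H x W x 3 array in BGR order
rgb = bgr[:, :, ::-1]           # reverse the channel axis: BGR -> RGB
# equivalently: rgb = cv2.cvtColor(bgr, cv2.COLOR_BGR2RGB)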
# Style visualization
style_image = Image.open(BytesIO(requests.get(style_url_path).content))
style_image = style_image.resize((IMAGE_WIDTH, IMAGE_HEIGHT))
style_image.save(style_image_path)
style_image
def preprocess_image(image):
    # For a local file one could instead do:
    # img = load_img(image_path, target_size=(IMAGE_HEIGHT, IMAGE_WIDTH))  # loads an image into PIL format
    # Here the image was already downloaded from the web and resized above, so it arrives as a PIL image.
    img = img_to_array(image)  # converts the PIL Image instance to a NumPy array
    img = np.expand_dims(img, axis=0)  # (IMAGE_HEIGHT, IMAGE_WIDTH, 3) -> (1, IMAGE_HEIGHT, IMAGE_WIDTH, 3)
    img = vgg19.preprocess_input(img)
    # The channels are converted from RGB to BGR, then each channel is zero-centered
    # with respect to the ImageNet dataset, without scaling.
    return img
input_image_array = preprocess_image(input_image)
style_image_array = preprocess_image(style_image)
2. We load a pre-trained CNN (VGG19).
# Build the model
input_image = K.variable(input_image_array)
style_image = K.variable(style_image_array)
combination_image = K.placeholder((1, IMAGE_HEIGHT, IMAGE_WIDTH, 3))
# input_tensor has shape (3, IMAGE_HEIGHT, IMAGE_WIDTH, 3), i.e. (3, 300, 300, 3): content, style and combination stacked along the batch axis.
input_tensor = K.concatenate([input_image, style_image, combination_image], axis=0)
model = vgg19.VGG19(input_tensor=input_tensor, include_top=False)
model.summary()
3. Then we frame the task as an optimization problem in which we minimize (see the combined objective after this list):
- content loss (distance between the input and output images - we strive to preserve the content)
- style loss (distance between the style and output images - we strive to apply a new style)
- total variation loss (regularization - spatial smoothness to denoise the output image)
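Putting the three terms together with the weights defined in the hyperparameters above (the symbols α, β, γ are just shorthand here for CONTENT_WEIGHT, STYLE_WEIGHT and TOTAL_VARIATION_WEIGHT), the objective minimized over the generated image x is:

$$\mathcal{L}_{total}(p, a, x) = \alpha\,\mathcal{L}_{content}(p, x) + \beta\,\mathcal{L}_{style}(a, x) + \gamma\,\mathcal{L}_{TV}(x)$$

where p is the content image and a is the style image. The total variation term is a regularizer commonly added on top of the content + style formulation of Gatys et al.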
The goal of the content loss is to ensure that the generated image x still preserves the "global" content of the content image p. To achieve this, the content loss at a chosen layer L is defined as the squared error between the feature representations of p and x at that layer:

$$\mathcal{L}_{content}(p, x, L) = \frac{1}{2}\sum_{i,j}\left(F^{L}_{ij} - P^{L}_{ij}\right)^{2}$$

F and P are matrices with N rows and M columns; F contains the features of x at layer L, and P contains the features of p at layer L.
def content_loss(content, combination):
    return K.sum(K.square(combination - content))
layers = dict([(layer.name, layer.output) for layer in model.layers])
content_layer = "block2_conv2"
layer_features = layers[content_layer]  # (3, 150, 150, 128)
content_image_features = layer_features[0, :, :, :]    # content image features, (150, 150, 128)
combination_features = layer_features[2, :, :, :]      # combination image features, (150, 150, 128)
loss = K.variable(0.)
loss = loss + CONTENT_WEIGHT*content_loss(content_image_features, combination_features)
The style loss has to preserve the style characteristics of the style image a. Rather than comparing feature representations directly, the authors compare the Gram matrices of selected layers (the Gram matrix is a common tool in texture synthesis), defined as:

$$G^{L}_{ij} = \sum_{k} F^{L}_{ik} F^{L}_{jk}$$

The Gram matrix is a square matrix containing the dot products between every pair of vectorized filters in layer L, so it can be seen as an (un-normalized) correlation matrix of the filters of layer L.

The style loss for a given layer L can then be defined as:

$$E_{L} = \frac{1}{4 N^{2} M^{2}} \sum_{i,j}\left(G^{L}_{ij} - A^{L}_{ij}\right)^{2}$$

where A is the Gram matrix of the style image a and G is the Gram matrix of the generated image x. N is the number of filters in layer L (CHANNELS in the code below) and M is the number of spatial elements in the feature maps of layer L (height times width).

In most convolutional networks such as VGG, the receptive field of the ascending layers grows larger and larger. As the receptive field grows, larger-scale features of the input image are preserved. For that reason we should pick several layers for style transfer, combining local and global style qualities. To blend these layers smoothly, we assign each layer a weight w_L and define the overall style loss as:

$$\mathcal{L}_{style}(a, x) = \sum_{L} w_{L}\, E_{L}$$
Concretely, the Gram matrix is obtained by taking a layer's filtered feature maps, flattening them and multiplying the resulting matrix by its own transpose; each entry is the correlation between the responses of two filters. For example, suppose one filter in a layer specializes in detecting pointed, spire-like shapes, another detects black, a third detects round shapes, and a fourth detects golden yellow. If we compute the Gram matrix of a van Gogh painting, which correlations come out large? "Pointed" and "black" always appear together, so their correlation is high; likewise "round" and "golden yellow" appear together, so their correlation is high as well. During style transfer we then look for the same kind of matches in the content image, rendering pointed things black and round things golden yellow. If we accept the assumption that "the artistic style of an image is the way its basic shapes and colours are combined", it is only natural that the Gram matrix can represent artistic style.
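To make the "filter correlation" reading concrete, here is a tiny NumPy illustration with made-up numbers (it is not part of the pipeline; the real computation is done symbolically by gram_matrix below):
# Toy illustration of a Gram matrix (made-up values, not part of the pipeline).
import numpy as np

# Pretend a layer has 3 filters and a 2x2 spatial grid: feature maps of shape (2, 2, 3).
feature_maps = np.array([[[1., 0., 2.], [0., 1., 0.]],
                         [[1., 0., 2.], [0., 1., 0.]]])

# One row per filter, each row the flattened responses of that filter: shape (3, 4).
F = feature_maps.reshape(-1, 3).T

# Gram matrix: dot products between every pair of flattened filter responses.
G = F @ F.T
print(G)
# G[0, 2] is the largest off-diagonal entry because filters 0 and 2 fire at the same
# locations, i.e. their responses are correlated -- exactly the signal the style loss matches.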
For the style representation, the first convolution of each block is used to compute the loss (in the original paper); the author argues this yields smoother texture features: using only low-level feature maps gives images that are detailed but rough, while higher layers contain more content information and lose some texture information, yet their textures are smoother.
def gram_matrix(x):
    # K.batch_flatten flattens an nD tensor to 2D by collapsing all dimensions after the first.
    features = K.batch_flatten(K.permute_dimensions(x, (2, 0, 1)))  # (300, 300, 64) -> (64, 300, 300) -> (64, 90000)
    gram = K.dot(features, K.transpose(features))  # (64, 64)
    return gram
def compute_style_loss(style, combination):
    style = gram_matrix(style)              # (channels, channels), e.g. (64, 64) for block1
    combination = gram_matrix(combination)  # (channels, channels)
    size = IMAGE_HEIGHT * IMAGE_WIDTH
    return K.sum(K.square(style - combination)) / (4. * (CHANNELS ** 2) * (size ** 2))
style_layers = ["block1_conv2", "block2_conv2", "block3_conv3", "block4_conv3", "block5_conv3"]
# Output shapes: (3, 300, 300, 64), (3, 150, 150, 128), (3, 75, 75, 256), (3, 37, 37, 512), (3, 18, 18, 512)
LAYER_WEIGHT = (STYLE_WEIGHT / len(style_layers))
for layer_name in style_layers:
    style_features = layers[layer_name][1, :, :, :]        # style image features, e.g. (300, 300, 64) for block1_conv2
    combination_features = layers[layer_name][2, :, :, :]  # combination image features
    style_loss = compute_style_loss(style_features, combination_features)
    loss = loss + LAYER_WEIGHT * style_loss
The total variation loss makes the generated image locally smooth (a spatial regularizer that denoises the output):

$$\mathcal{L}_{TV}(x) = \sum_{i,j}\left[\left(x_{i,j} - x_{i+1,j}\right)^{2} + \left(x_{i,j} - x_{i,j+1}\right)^{2}\right]^{1.25}$$

where the exponent 1.25 is TOTAL_VARIATION_LOSS_FACTOR.
def total_variation_loss(x):
    a = K.square(x[:, :IMAGE_HEIGHT-1, :IMAGE_WIDTH-1, :] - x[:, 1:, :IMAGE_WIDTH-1, :])
    b = K.square(x[:, :IMAGE_HEIGHT-1, :IMAGE_WIDTH-1, :] - x[:, :IMAGE_HEIGHT-1, 1:, :])
    return K.sum(K.pow(a + b, TOTAL_VARIATION_LOSS_FACTOR))
loss = loss+TOTAL_VARIATION_WEIGHT*total_variation_loss(combination_image)
4. Finally, we set up the gradients and optimize with the L-BFGS algorithm.
To start modifying the generated image so that it minimizes the loss function, we have to define two more functions with SciPy and the Keras backend: one that computes the total loss and one that computes the gradients. Their results are fed into a SciPy optimization routine as the objective function and the gradient function respectively. Here we use the L-BFGS algorithm (limited-memory BFGS).

For the content image and the style image we extract the feature representations used to build P and A (one per selected style layer), and all style layers are given the same weight. In practice, results usually become convincing after more than 500 L-BFGS iterations (note that this notebook only runs ITERATIONS = 2).
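As a reminder of the SciPy interface (a toy problem, unrelated to the style-transfer graph): fmin_l_bfgs_b takes an objective callable, a flat starting vector, and a gradient callable via fprime, and returns the updated vector, the final loss and an info dict.
# Toy example of the scipy interface used below: minimize f(v) = sum((v - 3)^2), gradient 2 * (v - 3).
import numpy as np
from scipy.optimize import fmin_l_bfgs_b

def f(v):
    return np.sum((v - 3.0) ** 2)

def grad_f(v):
    return 2.0 * (v - 3.0)

v0 = np.zeros(4)
v_opt, f_min, info = fmin_l_bfgs_b(f, v0, fprime=grad_f, maxfun=20)
print(v_opt)   # close to [3. 3. 3. 3.]
print(f_min)   # close to 0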
outputs = [loss]
# with tf.GradientTape() as gtape:
# #grads = gtape.gradient(african_elephant_output, last_conv_layer.output)
# outputs = outputs + gtape.gradient(loss, combination_image)
outputs = outputs + K.gradients(loss, combination_image)
def evaluate_loss_and_gradients(x):
    x = x.reshape((1, IMAGE_HEIGHT, IMAGE_WIDTH, CHANNELS))
    outs = K.function([combination_image], outputs)([x])
    loss = outs[0]
    gradients = outs[1].flatten().astype("float64")  # SciPy expects a flat float64 gradient vector
    return loss, gradients
class Evaluator:
    # fmin_l_bfgs_b queries the loss and the gradients through two separate callables,
    # so both are computed in one pass here and the gradients are cached for the second call.
    def loss(self, x):
        loss, gradients = evaluate_loss_and_gradients(x)
        self._gradients = gradients
        return loss

    def gradients(self, x):
        return self._gradients
evaluator = Evaluator()
x = np.random.uniform(0, 255, (1, IMAGE_HEIGHT, IMAGE_WIDTH, 3)) - 128.  # random initialization of the output image
for i in range(ITERATIONS):
    # fmin_l_bfgs_b is a SciPy function: the first argument is the objective (loss) function,
    # the second is the current input vector, fprime computes the gradient of the objective,
    # and maxfun caps the number of function evaluations. The first return value is the updated x
    # (fed back in on the next iteration), the second is the loss value.
    x, loss, info = fmin_l_bfgs_b(evaluator.loss, x.flatten(), fprime=evaluator.gradients, maxfun=20)
    print("Iteration %d completed with loss %d" % (i, loss))
def deprocess_image(img):
    # Inverse of preprocess_image / vgg19.preprocess_input
    img = img.reshape((IMAGE_HEIGHT, IMAGE_WIDTH, CHANNELS))
    print(img.shape)
    # Add back the ImageNet per-channel means (BGR order)
    img[:, :, 0] += 103.939
    img[:, :, 1] += 116.779
    img[:, :, 2] += 123.68
    # BGR -> RGB
    img = img[:, :, ::-1]
    img = np.clip(img, 0, 255).astype('uint8')
    img = array_to_img(img)  # equivalently Image.fromarray(img)
    return img
output_image = deprocess_image(x)
output_image.save(output_image_path)
output_image
# Visualizing combined results
combined = Image.new("RGB", (IMAGE_WIDTH*3, IMAGE_HEIGHT))
x_offset = 0
for image in map(Image.open, [content_image_path, style_image_path, output_image_path]):
    combined.paste(image, (x_offset, 0))
    x_offset = x_offset + IMAGE_WIDTH
combined.save(combined_image_path)
combined
https://github.com/dingtom/python/blob/master/Style_Transfer.ipynb