The original article is titled "Saving and loading a large number of images (data) into a single HDF5 file". It explains how to store a large number of images in a single HDF5 file and how to read the data back in batches for training a network. The original uses both the h5py and the tables modules in Python; this post only uses h5py. In addition, the original code handles color images in two layouts, one for TensorFlow and one for Theano; here only the TensorFlow case is covered, i.e. images are represented as [batch, image_height, image_width, image_depth].
Introduction
In deep learning we usually train networks on huge amounts of data or images; ImageNet alone contains millions of images. With a dataset that large, reading each image individually from disk, preprocessing it, and then feeding it to the network for training, validation, or testing is far too inefficient. It is much faster to pack all the images into a single file and process that instead. Several data models and libraries support this workflow, such as HDF5 and TFRecord. This post shows how to store a large number of images in a single HDF5 file and how to read them back in batches. The method works no matter how large the dataset is, even when it does not fit into memory. HDF5 also provides tools for managing, manipulating, visualizing, compressing, and storing data.
This post uses the training set of the Dogs vs. Cats competition on Kaggle.
List images and their labels
First, a word about the dataset. The Dogs vs. Cats training set contains 25,000 images, half cats and half dogs, with file names of the form dog.5199.jpg or cat.123.jpg. The code below collects all file names (it does not load the image contents) and assigns each image a label: label=0 for a cat and label=1 for a dog. It then shuffles the dataset and splits it into training (60%), validation (20%), and test (20%) sets.
"""
List images and label them
"""
from random import shuffle
import glob
# shuffle the addresses before saving
shuffle_data = True
# address to where you want to save the hdf5 file
hdf5_path = 'Cat vs Dog/dataset.hdf5'
# path to the training images
cat_dog_train_path = 'Cat vs Dog/train/*.jpg'
# read addresses and labels from the 'train' folder
# use glob to collect all file names matching the pattern
# each element of addrs is then a string like 'Cat vs Dog/train/dog.5199.jpg'
addrs = glob.glob(cat_dog_train_path)
# assign each file name its label
labels = [0 if 'cat' in addr else 1 for addr in addrs] # 0 = Cat, 1 = Dog
# to shuffle data
if shuffle_data:
    c = list(zip(addrs, labels))
    shuffle(c)
    addrs, labels = zip(*c)
# Divide the data into 60% train, 20% validation, and 20% test
train_addrs = addrs[0:int(0.6*len(addrs))]
train_labels = labels[0:int(0.6*len(labels))]
val_addrs = addrs[int(0.6*len(addrs)):int(0.8*len(addrs))]
val_labels = labels[int(0.6*len(addrs)):int(0.8*len(addrs))]
test_addrs = addrs[int(0.8*len(addrs)):]
test_labels = labels[int(0.8*len(labels)):]
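As a quick sanity check (my addition, not in the original), print the size of each split; with 25,000 images this should give 15,000 / 5,000 / 5,000:
# the three splits should together cover the whole dataset
print(len(train_addrs), len(val_addrs), len(test_addrs))
assert len(train_addrs) + len(val_addrs) + len(test_addrs) == len(addrs)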
Create an HDF5 file
Two libraries can handle the HDF5 format, h5py and tables. The original article covers both; here only h5py is used. The first step is to create the HDF5 file. To store the images, we define a matrix of shape (number of data, image_height, image_width, image_depth) for each of the training, validation, and test sets. Likewise, for the labels we define a matrix of shape (number of data,) for each set. Finally, we compute the per-pixel mean over the training set and store it in a matrix of shape (image_height, image_width, image_depth). Keep an eye on the data types when creating these matrices.
In h5py, matrices are created with create_dataset. The dtype can be specified directly with a numpy type, and the size (shape) must be given at creation time. The code is as follows:
"""
Creating an HDF5 file
"""
import numpy as np
import h5py
train_shape = (len(train_addrs), 224, 224, 3)
val_shape = (len(val_addrs), 224, 224, 3)
test_shape = (len(test_addrs), 224, 224, 3)
# open an hdf5 file and create the datasets
hdf5_file = h5py.File(hdf5_path, mode='w')
# create the following four datasets; no values are written yet
# note: pixel values are 0-255, so store them as uint8
# (the original used int8, which would overflow for values above 127)
hdf5_file.create_dataset("train_img", train_shape, np.uint8)
hdf5_file.create_dataset("val_img", val_shape, np.uint8)
hdf5_file.create_dataset("test_img", test_shape, np.uint8)
hdf5_file.create_dataset("train_mean", train_shape[1:], np.float32)
# create the label datasets and assign their values
hdf5_file.create_dataset("train_labels", (len(train_addrs),), np.int8)
# the [...] index is required to write into the dataset
hdf5_file["train_labels"][...] = train_labels
hdf5_file.create_dataset("val_labels", (len(val_addrs),), np.int8)
hdf5_file["val_labels"][...] = val_labels
hdf5_file.create_dataset("test_labels", (len(test_addrs),), np.int8)
hdf5_file["test_labels"][...] = test_labels
Next, the images are read one by one, preprocessed, and written into hdf5_file. The code consists of three loops, handling the training, validation, and test sets respectively. The only preprocessing applied here is a resize with OpenCV.
"""
Load images and save them
"""
import cv2
# a numpy array to save the mean of the images
# shape := (image_height, image_width, image_depth)
mean = np.zeros(train_shape[1:], np.float32)
# loop over train addresses
for i in range(len(train_addrs)):
    # print how many images are saved every 1000 images
    if i % 1000 == 0 and i > 1:
        print('Train data: {}/{}'.format(i, len(train_addrs)))
    # read an image and resize to (224, 224)
    # cv2 loads images as BGR; convert to RGB
    addr = train_addrs[i]
    img = cv2.imread(addr)
    img = cv2.resize(img, (224, 224), interpolation=cv2.INTER_CUBIC)
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    # add any image pre-processing here
    # save the image and accumulate the running mean
    # img.shape := (224, 224, 3)
    hdf5_file["train_img"][i, ...] = img
    mean += img / float(len(train_labels))
# loop over validation addresses
for i in range(len(val_addrs)):
    # print how many images are saved every 1000 images
    if i % 1000 == 0 and i > 1:
        print('Validation data: {}/{}'.format(i, len(val_addrs)))
    # read an image and resize to (224, 224)
    # cv2 loads images as BGR; convert to RGB
    addr = val_addrs[i]
    img = cv2.imread(addr)
    img = cv2.resize(img, (224, 224), interpolation=cv2.INTER_CUBIC)
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    # add any image pre-processing here
    # save the image
    hdf5_file["val_img"][i, ...] = img
# loop over test addresses
for i in range(len(test_addrs)):
    # print how many images are saved every 1000 images
    if i % 1000 == 0 and i > 1:
        print('Test data: {}/{}'.format(i, len(test_addrs)))
    # read an image and resize to (224, 224)
    # cv2 loads images as BGR; convert to RGB
    addr = test_addrs[i]
    img = cv2.imread(addr)
    img = cv2.resize(img, (224, 224), interpolation=cv2.INTER_CUBIC)
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    # add any image pre-processing here
    # save the image
    hdf5_file["test_img"][i, ...] = img
# save the mean and close the hdf5 file
hdf5_file["train_mean"][...] = mean
hdf5_file.close()
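Before reading the data back for real, it is worth checking what was written. A small sketch (my addition) that reopens the file read-only and prints the shape and dtype of every dataset:
# quick check of the file contents
with h5py.File(hdf5_path, "r") as f:
    for name in ("train_img", "val_img", "test_img", "train_labels",
                 "val_labels", "test_labels", "train_mean"):
        print(name, f[name].shape, f[name].dtype)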
Read the HDF5 file
Now let's check that the data was stored correctly in the HDF5 file. We load the images in batches of an arbitrary size and show the first image of the first five batches. A variable subtract_mean indicates whether the training mean should be subtracted before an image is displayed. In h5py, a dataset is accessed by name as in a dictionary (hdf5_file["arrayname"]), and its size can be queried via .shape just like a numpy array.
"""
Open the HDF5 file for reading
"""
hdf5_path = 'Cat vs Dog/dataset.hdf5'
subtract_mean = False
# open the hdf5 file
hdf5_file = h5py.File(hdf5_path, "r")
# subtract the training mean
if subtract_mean:
    # train_mean has shape (224, 224, 3); add a leading axis so it
    # broadcasts against a (batch, 224, 224, 3) block of images
    mm = hdf5_file["train_mean"][...]
    mm = mm[np.newaxis, ...]
# Total number of samples
data_num = hdf5_file["train_img"].shape[0]
Next, read the images batch by batch.
batch_size = 64
nb_class = 2
from random import shuffle
from math import ceil
import matplotlib.pyplot as plt
# create list of batches to shuffle the data
batches_list = list(range(int(ceil(float(data_num) / batch_size))))
# shuffle the batch order
shuffle(batches_list)
# loop over batches
for n, i in enumerate(batches_list):
    i_s = i * batch_size  # index of the first image in this batch
    i_e = min([(i + 1) * batch_size, data_num])  # index of the last image in this batch
    # read batch images and remove training mean
    images = hdf5_file["train_img"][i_s:i_e, ...]
    if subtract_mean:
        # images are stored as uint8 while mm is float32; convert before
        # subtracting, otherwise the in-place subtraction raises a TypeError
        images = images.astype(np.float32) - mm
    # read labels and convert to one hot encoding
    # note: the last batch can be smaller than batch_size,
    # so size the one-hot matrix by the actual number of labels
    labels = hdf5_file["train_labels"][i_s:i_e]
    labels_one_hot = np.zeros((len(labels), nb_class))
    labels_one_hot[np.arange(len(labels)), labels] = 1
    print(n+1, '/', len(batches_list))
    print(labels[0], labels_one_hot[0, :])
    plt.imshow(images[0])
    plt.show()
    if n == 4:  # break after 5 batches
        break
hdf5_file.close()
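For actual training, the same logic is more convenient as a generator that yields one batch at a time. A minimal sketch under the assumptions above; the function name batch_generator is mine, not from the original article:
import numpy as np
import h5py
from math import ceil
from random import shuffle

def batch_generator(hdf5_path, batch_size=64, nb_class=2):
    # yield shuffled (images, one-hot labels) batches from the HDF5 file
    with h5py.File(hdf5_path, "r") as f:
        data_num = f["train_img"].shape[0]
        batches = list(range(int(ceil(float(data_num) / batch_size))))
        shuffle(batches)
        for i in batches:
            i_s, i_e = i * batch_size, min((i + 1) * batch_size, data_num)
            images = f["train_img"][i_s:i_e, ...]
            labels = f["train_labels"][i_s:i_e]
            one_hot = np.zeros((len(labels), nb_class))
            one_hot[np.arange(len(labels)), labels] = 1
            yield images, one_hot

# usage: iterate over one epoch of training batches
# for images, labels_one_hot in batch_generator(hdf5_path):
#     ...  # feed the batch to the network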
Summary
"""
写模式打开hdf5文件:
"""
hdf5_file = h5py.File(hdf5_path, mode='w')
"""
建立矩阵
"""
hdf5_file.create_dataset("train_img", train_shape, np.int8)
"""
对矩阵赋值
"""
hdf5_file["train_img"][i, ...] = img
hdf5_file["train_mean"][...] = mean
"""
读模式打开hdf5文件
"""
hdf5_file = h5py.File(hdf5_path, "r")
"""
读取矩阵中内容
"""
images = hdf5_file["train_img"][i, ...]
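One final note (my addition): h5py files also support the with statement, which guarantees the file is closed even if an error occurs:
with h5py.File(hdf5_path, "r") as hdf5_file:
    images = hdf5_file["train_img"][0:10, ...]
# the file is closed automatically when the block exits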