The original article is titled "Saving and loading a large number of images (data) into a single HDF5 file". It explains how to store a large number of images in a single HDF5 file and how to read the data back in batches for training a network. The original uses both the h5py and the tables modules in Python; this post only uses h5py. In addition, the original code handles color images in two layouts, one for TensorFlow and one for Theano; here only the TensorFlow case is covered, i.e. images are represented as [batch, image_height, image_width, image_depth].
Introduction
In deep learning we usually train networks on huge amounts of data or images; ImageNet alone contains millions of images. With a dataset that large, reading each image individually from disk, preprocessing it, and then feeding it to the network for training, validation, or testing is far too inefficient. It is much faster to pack all the images into a single file and process that instead. Several data models and libraries support this workflow, such as HDF5 and TFRecord. This post shows how to store a large number of images in a single HDF5 file and how to read them back in batches. The method works no matter how large the dataset is, even when it does not fit into memory. HDF5 also provides tools for managing, manipulating, visualizing, compressing, and storing data.
This post uses the training set of the Dogs vs. Cats competition on Kaggle.
List images and their labels
First, a word about the dataset. The Dogs vs. Cats training set contains 25,000 images, half cats and half dogs, with file names of the form dog.5199.jpg or cat.123.jpg. The code below collects all file names (it does not load the image contents) and assigns each image a label: label=0 for a cat and label=1 for a dog. It then shuffles the dataset and splits it into training (60%), validation (20%), and test (20%) sets.
"""
List images and label them
"""
from random import shuffle
import glob
# shuffle the addresses before saving
shuffle_data = True
# address to where you want to save the hdf5 file
hdf5_path = 'Cat vs Dog/dataset.hdf5'
# path to the training images
cat_dog_train_path = 'Cat vs Dog/train/*.jpg'
# read addresses and labels from the 'train' folder
# use glob to collect all file names matching the pattern
# each element of addrs is then a string like 'Cat vs Dog/train/dog.5199.jpg'
addrs = glob.glob(cat_dog_train_path)
# assign each file name its label
labels = [0 if 'cat' in addr else 1 for addr in addrs] # 0 = Cat, 1 = Dog
# to shuffle data
if shuffle_data:
    c = list(zip(addrs, labels))
    shuffle(c)
    addrs, labels = zip(*c)
# Divide the data into 60% train, 20% validation, and 20% test
train_addrs = addrs[0:int(0.6*len(addrs))]
train_labels = labels[0:int(0.6*len(labels))]
val_addrs = addrs[int(0.6*len(addrs)):int(0.8*len(addrs))]
val_labels = labels[int(0.6*len(addrs)):int(0.8*len(addrs))]
test_addrs = addrs[int(0.8*len(addrs)):]
test_labels = labels[int(0.8*len(labels)):]
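As a quick sanity check (my addition, not in the original), print the size of each split; with 25,000 images this should give 15,000 / 5,000 / 5,000:
# the three splits should together cover the whole dataset
print(len(train_addrs), len(val_addrs), len(test_addrs))
assert len(train_addrs) + len(val_addrs) + len(test_addrs) == len(addrs)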
Create an HDF5 file
Two libraries can handle the HDF5 format, h5py and tables. The original article covers both; here only h5py is used. The first step is to create the HDF5 file. To store the images, we define a matrix of shape (number of data, image_height, image_width, image_depth) for each of the training, validation, and test sets. Likewise, for the labels we define a matrix of shape (number of data,) for each set. Finally, we compute the per-pixel mean over the training set and store it in a matrix of shape (image_height, image_width, image_depth). Keep an eye on the data types when creating these matrices.
In h5py, matrices are created with create_dataset. The dtype can be specified directly with a numpy type, and the size (shape) must be given at creation time. The code is as follows:
"""
Creating an HDF5 file
"""
import numpy as np
import h5py
train_shape = (len(train_addrs), 224, 224, 3)
val_shape = (len(val_addrs), 224, 224, 3)
test_shape = (len(test_addrs), 224, 224, 3)
# open an hdf5 file and create the datasets
hdf5_file = h5py.File(hdf5_path, mode='w')
# create the following four datasets; no values are written yet
# note: pixel values are 0-255, so store them as uint8
# (the original used int8, which would overflow for values above 127)
hdf5_file.create_dataset("train_img", train_shape, np.uint8)
hdf5_file.create_dataset("val_img", val_shape, np.uint8)
hdf5_file.create_dataset("test_img", test_shape, np.uint8)
hdf5_file.create_dataset("train_mean", train_shape[1:], np.float32)
# create the label datasets and assign their values
hdf5_file.create_dataset("train_labels", (len(train_addrs),), np.int8)
# the [...] index is required to write into the dataset
hdf5_file["train_labels"][...] = train_labels
hdf5_file.create_dataset("val_labels", (len(val_addrs),), np.int8)
hdf5_file["val_labels"][...] = val_labels
hdf5_file.create_dataset("test_labels", (len(test_addrs),), np.int8)
hdf5_file["test_labels"][...] = test_labels
Next, the images are read one by one, preprocessed, and written into hdf5_file. The code consists of three loops, handling the training, validation, and test sets respectively. The only preprocessing applied here is a resize with OpenCV.
"""
Load images and save them
"""
import cv2
# a numpy array to save the mean of the images
# shape := (image_height, image_width, image_depth)
mean = np.zeros(train_shape[1:], np.float32)
# loop over train addresses
for i in range(len(train_addrs)):
    # print how many images are saved every 1000 images
    if i % 1000 == 0 and i > 1:
        print('Train data: {}/{}'.format(i, len(train_addrs)))
    # read an image and resize to (224, 224)
    # cv2 loads images as BGR; convert to RGB
    addr = train_addrs[i]
    img = cv2.imread(addr)
    img = cv2.resize(img, (224, 224), interpolation=cv2.INTER_CUBIC)
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    # add any image pre-processing here
    # save the image and accumulate the running mean
    # img.shape := (224, 224, 3)
    hdf5_file["train_img"][i, ...] = img
    mean += img / float(len(train_labels))
# loop over validation addresses
for i in range(len(val_addrs)):
    # print how many images are saved every 1000 images
    if i % 1000 == 0 and i > 1:
        print('Validation data: {}/{}'.format(i, len(val_addrs)))
    # read an image and resize to (224, 224)
    # cv2 loads images as BGR; convert to RGB
    addr = val_addrs[i]
    img = cv2.imread(addr)
    img = cv2.resize(img, (224, 224), interpolation=cv2.INTER_CUBIC)
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    # add any image pre-processing here
    # save the image
    hdf5_file["val_img"][i, ...] = img
# loop over test addresses
for i in range(len(test_addrs)):
    # print how many images are saved every 1000 images
    if i % 1000 == 0 and i > 1:
        print('Test data: {}/{}'.format(i, len(test_addrs)))
    # read an image and resize to (224, 224)
    # cv2 loads images as BGR; convert to RGB
    addr = test_addrs[i]
    img = cv2.imread(addr)
    img = cv2.resize(img, (224, 224), interpolation=cv2.INTER_CUBIC)
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    # add any image pre-processing here
    # save the image
    hdf5_file["test_img"][i, ...] = img
# save the mean and close the hdf5 file
hdf5_file["train_mean"][...] = mean
hdf5_file.close()
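Before reading the data back for real, it is worth checking what was written. A small sketch (my addition) that reopens the file read-only and prints the shape and dtype of every dataset:
# quick check of the file contents
with h5py.File(hdf5_path, "r") as f:
    for name in ("train_img", "val_img", "test_img", "train_labels",
                 "val_labels", "test_labels", "train_mean"):
        print(name, f[name].shape, f[name].dtype)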
Read the HDF5 file
Now let's check that the data was stored correctly in the HDF5 file. We load the images in batches of an arbitrary size and show the first image of the first five batches. A variable subtract_mean indicates whether the training mean should be subtracted before an image is displayed. In h5py, a dataset is accessed by name as in a dictionary (hdf5_file["arrayname"]), and its size can be queried via .shape just like a numpy array.
"""
Open the HDF5 file for reading
"""
hdf5_path = 'Cat vs Dog/dataset.hdf5'
subtract_mean = False
# open the hdf5 file
hdf5_file = h5py.File(hdf5_path, "r")
# subtract the training mean
if subtract_mean:
    # train_mean has shape (224, 224, 3); add a leading axis so it
    # broadcasts against a (batch, 224, 224, 3) block of images
    mm = hdf5_file["train_mean"][...]
    mm = mm[np.newaxis, ...]
# Total number of samples
data_num = hdf5_file["train_img"].shape[0]
Next, read the images batch by batch.
batch_size = 64
nb_class = 2
from random import shuffle
from math import ceil
import matplotlib.pyplot as plt
# create list of batches to shuffle the data
batches_list = list(range(int(ceil(float(data_num) / batch_size))))
# shuffle the batch order
shuffle(batches_list)
# loop over batches
for n, i in enumerate(batches_list):
    i_s = i * batch_size  # index of the first image in this batch
    i_e = min([(i + 1) * batch_size, data_num])  # index of the last image in this batch
    # read batch images and remove training mean
    images = hdf5_file["train_img"][i_s:i_e, ...]
    if subtract_mean:
        # images are stored as uint8 while mm is float32; convert before
        # subtracting, otherwise the in-place subtraction raises a TypeError
        images = images.astype(np.float32) - mm
    # read labels and convert to one hot encoding
    # note: the last batch can be smaller than batch_size,
    # so size the one-hot matrix by the actual number of labels
    labels = hdf5_file["train_labels"][i_s:i_e]
    labels_one_hot = np.zeros((len(labels), nb_class))
    labels_one_hot[np.arange(len(labels)), labels] = 1
    print(n+1, '/', len(batches_list))
    print(labels[0], labels_one_hot[0, :])
    plt.imshow(images[0])
    plt.show()
    if n == 4:  # break after 5 batches
        break
hdf5_file.close()
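For actual training, the same logic is more convenient as a generator that yields one batch at a time. A minimal sketch under the assumptions above; the function name batch_generator is mine, not from the original article:
import numpy as np
import h5py
from math import ceil
from random import shuffle

def batch_generator(hdf5_path, batch_size=64, nb_class=2):
    # yield shuffled (images, one-hot labels) batches from the HDF5 file
    with h5py.File(hdf5_path, "r") as f:
        data_num = f["train_img"].shape[0]
        batches = list(range(int(ceil(float(data_num) / batch_size))))
        shuffle(batches)
        for i in batches:
            i_s, i_e = i * batch_size, min((i + 1) * batch_size, data_num)
            images = f["train_img"][i_s:i_e, ...]
            labels = f["train_labels"][i_s:i_e]
            one_hot = np.zeros((len(labels), nb_class))
            one_hot[np.arange(len(labels)), labels] = 1
            yield images, one_hot

# usage: iterate over one epoch of training batches
# for images, labels_one_hot in batch_generator(hdf5_path):
#     ...  # feed the batch to the network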
Summary
"""
写模式打开hdf5文件:
"""
hdf5_file = h5py.File(hdf5_path, mode='w')
"""
建立矩阵
"""
hdf5_file.create_dataset("train_img", train_shape, np.int8)
"""
对矩阵赋值
"""
hdf5_file["train_img"][i, ...] = img
hdf5_file["train_mean"][...] = mean
"""
读模式打开hdf5文件
"""
hdf5_file = h5py.File(hdf5_path, "r")
"""
读取矩阵中内容
"""
images = hdf5_file["train_img"][i, ...]
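One final note (my addition): h5py files also support the with statement, which guarantees the file is closed even if an error occurs:
with h5py.File(hdf5_path, "r") as hdf5_file:
    images = hdf5_file["train_img"][0:10, ...]
# the file is closed automatically when the block exits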