fastai Official Deep Learning Course Code Notes: Lesson 3-4

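The last part of this lesson gives a brief introduction to using fastai for natural language processing. It covers two tasks: predicting the words that follow a sentence (language modeling) and classifying text.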

Course notebook: https://www.kaggle.com/hortonhearsafoo/fast-ai-v3-lesson-3-imdb

# Preprocessing
# Import the fastai library
%reload_ext autoreload
%autoreload 2
%matplotlib inline

from fastai.text import *

# Download the dataset; the download path has been changed so that it lands in the current folder
path = untar_data(URLs.IMDB_SAMPLE, dest="./")
path.ls()

[WindowsPath('imdb_sample/data_save.pkl'),
 WindowsPath('imdb_sample/texts.csv')]

# Read the downloaded file. Each row holds a label (positive or negative), the text, and a flag marking whether the row belongs to the validation set
df = pd.read_csv(path/'texts.csv')
df.head()

# Look at the text of the second row
df['text'][1]

'This is a extremely well-made film. The acting, script and camera-work are all first-rate. The music is good, too, though it is mostly early in the film, when things are still relatively cheery. There are no really superstars in the cast, though several faces will be familiar. The entire cast does an excellent job with the script.<br /><br />But it is hard to watch, because there is no good end to a situation like the one presented. It is now fashionable to blame the British for setting Hindus and Muslims against each other, and then cruelly separating them into two countries. There is some merit in this view, but it\'s also true that no one forced Hindus and Muslims in the region to mistreat each other as they did around the time of partition. It seems more likely that the British simply saw the tensions between the religions and were clever enough to exploit them to their own ends.<br /><br />The result is that there is much cruelty and inhumanity in the situation and this is very unpleasant to remember and to see on the screen. But it is never painted as a black-and-white case. There is baseness and nobility on both sides, and also the hope for change in the younger generation.<br /><br />There is redemption of a sort, in the end, when Puro has to make a hard choice between a man who has ruined her life, but also truly loved her, and her family which has disowned her, then later come looking for her. But by that point, she has no option that is without great pain for her.<br /><br />This film carries the message that both Muslims and Hindus have their grave faults, and also that both can be dignified and caring people. The reality of partition makes that realisation all the more wrenching, since there can never be real reconciliation across the India/Pakistan border. In that sense, it is similar to "Mr & Mrs Iyer".<br /><br />In the end, we were glad to have seen the film, even though the resolution was heartbreaking. 
If the UK and US could deal with their own histories of racism with this kind of frankness, they would certainly be better off.'

# Build the training DataBunch from the csv file
data_lm = TextDataBunch.from_csv(path, 'texts.csv', num_workers=0)
# data_lm

# Save the DataBunch we created so that it can be loaded directly next time
data_lm.save()

# Load the previously saved DataBunch; recent fastai versions changed the loading function, so the call is updated here
data = load_data(path)
# data = TextDataBunch.load(path)
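# Show the data; it has already been preprocessed (cleaned) automatically
# Preprocessing mainly covers tokenization, handling HTML tags, lowercasing, handling special symbols, and so on
# Some tokens start with xx. During preprocessing fastai counts every token
# and builds a vocabulary from the most frequent ones (the top 60000 by default); each token must also appear at least a minimum number of times
# Words outside the vocabulary are marked as unknown (xxunk); fastai also marks information such as the beginning and end of each text (xxbos, xxeos)
# These special markers are encoded into the dataset as tokens that start with xx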
data.show_batch()
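
The vocabulary rule described in the comments above can be sketched in plain Python. This is a minimal illustration, not fastai's implementation: the function name `build_vocab`, the toy corpus, the shortened list of special tokens, and the value `min_freq=2` are all assumptions made for the example.

```python
from collections import Counter

# Special tokens placed at the front of the vocabulary (a subset of fastai's list).
SPECIAL_TOKENS = ["xxunk", "xxpad", "xxbos"]

def build_vocab(token_lists, max_vocab=60000, min_freq=2):
    """Return itos (index -> token): special tokens first, then frequent tokens."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    itos = list(SPECIAL_TOKENS)
    for tok, freq in counts.most_common(max_vocab):
        # Keep a token only if it is frequent enough and not already listed.
        if freq >= min_freq and tok not in itos:
            itos.append(tok)
    return itos

# Hypothetical, already tokenized toy corpus.
docs = [["xxbos", "the", "movie", "was", "great"],
        ["xxbos", "the", "movie", "was", "boring"]]
itos = build_vocab(docs)
stoi = {tok: i for i, tok in enumerate(itos)}

# Words outside the vocabulary fall back to the id of xxunk.
ids = [stoi.get(tok, stoi["xxunk"]) for tok in ["the", "movie", "was", "awesome"]]
print(itos)
print(ids)
```

In this toy run "great" and "boring" each appear once, so they are dropped from the vocabulary, and the unseen word "awesome" is encoded with the id of `xxunk`.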

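# The DataBunch also assigns an integer id to every token; the list vocab.itos maps an id to its token
# vocab.stoi provides the reverse lookup, from a token to its id
# The first ten tokens of the list are shown below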
data.vocab.itos[:10]

['xxunk',
 'xxpad',
 'xxbos',
 'xxeos',
 'xxfld',
 'xxmaj',
 'xxup',
 'xxrep',
 'xxwrep',
 'the']

# Show a sample that has already been tokenized and cleaned
data.train_ds[0][0]

Text xxbos xxmaj just do n't bother . i thought i would see a movie with great xxunk and action . 
 
  xxmaj but it grows boring and terribly predictable after the interesting start . xxmaj in the middle of the film you have a little social drama and all tension is lost because it slows down the speed . xxmaj towards the end the it gets better but not really great . i think the director took this movie just too serious . xxmaj in such a kind of a movie even if u do n't care about the plot at least you want some nice action . i nearly dozed off in the middle / main part of it . xxmaj rating 3 / 10 . 
 
  xxunk .

# Show the data that is actually used for training, already converted to integer ids
data.train_ds[0][0].data[:10]

array([  2,   5,  58,  60,  37, 946,  11,  19, 212,  19], dtype=int64)

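# The DataBunch above was built with the default method and parameters
# To configure the dataset flexibly, we can build it step by step with the data block API
data = (TextList.from_csv(path, 'texts.csv', cols='text')
                .split_from_df(col=2)  # split the data; col is the index of the validation-flag column, i.e. is_valid
                .label_from_df(cols=0)  # label the data from the given label column
                .databunch(num_workers=0))  # build the DataBunch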

# Define the batch size, i.e. the number of samples per training batch; memory (or GPU memory) usage is high, so a smaller value is recommended
bs=48

# Download the full dataset
path = untar_data(URLs.IMDB)
path.ls()

# List the files of the training set
(path/'train').ls()

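# Build the language-model dataset with fastai's data block API.
# The loader shuffles the texts, concatenates them into one stream, and cuts that stream into batches.
# The labels are ignored; instead, the words that follow each position serve as the training target.
# The goal is a predictive model: given a piece of text, predict the words that come next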
data_lm = (TextList.from_folder(path)
           #Inputs: all the text files in path
            .filter_by_folder(include=['train', 'test', 'unsup']) 
           #We may have other temp folders that contain text files so we only keep what's in train, test and unsup
            .random_split_by_pct(0.1)
           #We randomly split and keep 10% (10,000 reviews) for validation
            .label_for_lm()           
           #We want to do a language model so we label accordingly (the targets are the following words)
            .databunch(bs=bs))
# Save the DataBunch we built
data_lm.save('tmp_lm')
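
The language-model labelling described above can be illustrated with a small sketch: the token ids of all texts form one long stream, the stream is cut into fixed-length sequences, and each target is the input shifted by one position. This is a simplified illustration under those assumptions, not fastai's actual loader; `make_lm_pairs`, the sequence length `bptt=3`, and the id stream are hypothetical.

```python
def make_lm_pairs(token_ids, bptt):
    """Split one long id stream into (input, target) pairs; target = input shifted by one."""
    pairs = []
    for start in range(0, len(token_ids) - 1, bptt):
        x = token_ids[start:start + bptt]
        y = token_ids[start + 1:start + 1 + bptt]
        x = x[:len(y)]  # trim the final input so input and target have equal length
        pairs.append((x, y))
    return pairs

# Hypothetical stream of concatenated, numericalized reviews.
stream = [2, 5, 58, 60, 37, 946, 11]
pairs = make_lm_pairs(stream, bptt=3)
for x, y in pairs:
    print(x, "->", y)
```

Every position in the input is paired with the id of the token that follows it, which is why no sentiment label is needed at this stage.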

# Load the DataBunch, using the newer function
# data_lm = TextLMDataBunch.load(path, 'tmp_lm', bs=bs)
data_lm = load_data(path, 'tmp_lm', bs=bs)
# Show a batch from the DataBunch
data_lm.show_batch()

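# Define the learner. By default it is an RNN; we start from a model
# pretrained on Wikipedia text. drop_mult=0.3 scales all of the model's dropout probabilities by 0.3
# Note: this call matches the fastai release used in the course kernel; newer fastai v1 releases are believed to
# expect the architecture as the second argument, e.g. language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3)
learn = language_model_learner(data_lm, pretrained_model=URLs.WT103_1, drop_mult=0.3)
# Find a suitable learning rate
learn.lr_find()
# Plot loss against learning rate, skipping the last 15 recorded points
learn.recorder.plot(skip_end=15)
# Train with the chosen learning rate. moms sets the momentum range, not the learning rate:
# while the learning rate rises, momentum falls from 0.8 to 0.7; while the learning rate falls, momentum rises from 0.7 back to 0.8
learn.fit_one_cycle(1, 1e-2, moms=(0.8,0.7))
# Save the trained model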
learn.save('fit_head')
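
The interplay between learning rate and momentum in `fit_one_cycle(1, 1e-2, moms=(0.8,0.7))` can be sketched as a simplified schedule. The cosine interpolation and the values `pct_start=0.3`, `div_factor=25.0`, and the final divisor `div_factor * 1e4` are assumptions chosen for illustration; the function names are hypothetical and this is not fastai's scheduler code.

```python
import math

def cos_anneal(start, end, pct):
    """Cosine interpolation from start (pct=0) to end (pct=1)."""
    return end + (start - end) / 2 * (math.cos(math.pi * pct) + 1)

def one_cycle(step, total_steps, lr_max=1e-2, moms=(0.8, 0.7),
              pct_start=0.3, div_factor=25.0):
    """Return (lr, momentum) at a given step of a simplified one-cycle schedule."""
    warm = int(total_steps * pct_start)
    if step < warm:
        # Phase 1: learning rate rises, momentum falls.
        pct = step / warm
        return (cos_anneal(lr_max / div_factor, lr_max, pct),
                cos_anneal(moms[0], moms[1], pct))
    # Phase 2: learning rate falls, momentum rises back.
    pct = (step - warm) / (total_steps - warm)
    return (cos_anneal(lr_max, lr_max / (div_factor * 1e4), pct),
            cos_anneal(moms[1], moms[0], pct))

for step in (0, 15, 30, 65, 100):
    lr, mom = one_cycle(step, total_steps=100)
    print(f"step {step:3d}: lr={lr:.6f} momentum={mom:.3f}")
```

The printout shows the two quantities moving in opposite directions: momentum is lowest exactly where the learning rate peaks.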

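# Load the model
learn.load('fit_head');
# After fine-tuning the head, we can go on to retrain all of the model's parameters
# Unfreeze all parameters
learn.unfreeze()
# Retrain the whole model; the course kernel comments this out because the training time didn't fit in a single Kernel session
# learn.fit_one_cycle(10, 1e-3, moms=(0.8,0.7))
# Save the further-trained model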
learn.save('fine_tuned')

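# Load the model
learn.load('fine_tuned');
# Define the input text
TEXT = "i liked this movie because"
# Parameters: number of words to predict and number of sentences to generate
N_WORDS = 40
N_SENTENCES = 2
# Print all generated results. The sampling temperature is 0.75; the higher the value, the more random the output. predict returns the generated text as a string
print("\n".join(learn.predict(TEXT, N_WORDS, temperature=0.75) for _ in range(N_SENTENCES)))
# Finally, save the encoder (the part of the language model below its output layer) so that the classifier trained next can load it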
learn.save_encoder('fine_tuned_enc')
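
The effect of the `temperature` argument used in the prediction call above can be shown with a common formulation of temperature sampling: divide the scores by the temperature before the softmax, then sample from the result. This is a sketch of the general technique and may differ in detail from what fastai does internally; the function names and the three scores are hypothetical.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_next(logits, temperature, rng):
    """Draw the index of the next word from the temperature-adjusted distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.1]  # hypothetical scores for three candidate words
sharp = softmax_with_temperature(logits, 0.75)
flat = softmax_with_temperature(logits, 2.0)
print([round(p, 3) for p in sharp])
print([round(p, 3) for p in flat])
print(sample_next(logits, 0.75, random.Random(0)))
```

A temperature below 1 concentrates probability on the top-scoring word, while a temperature above 1 flattens the distribution, which is why larger values give more random text.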

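# Next, train a classifier for the reviews
# First download the labelled review data; the labels take two values, positive and negative
path = untar_data(URLs.IMDB)
# Build the DataBunch by hand, passing in the vocabulary of the language-model DataBunch saved earlier so that words get the same ids
data_clas = (TextList.from_folder(path, vocab=data_lm.vocab)
             #grab all the text files in path
             # Use the 'test' folder as the validation set
             .split_by_folder(valid='test')
             #split by train and valid folder (that only keeps 'train' and 'test' so no need to filter)
             # Assign labels from the folder names
             .label_from_folder(classes=['neg', 'pos'])
             #label them all with their folders
             .databunch(bs=bs))
# Save the DataBunch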
data_clas.save('tmp_clas')

# Load the DataBunch saved earlier; the method in the original notebook is outdated, so the newer one is used
# data_clas = TextClasDataBunch.load(path, 'tmp_clas', bs=bs)
data_clas = load_data(path, 'tmp_clas', bs=bs)
# Show the data
data_clas.show_batch()

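# Initialize the RNN classifier
# Note: newer fastai v1 releases are believed to expect the architecture as the second argument, e.g. text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5)
learn = text_classifier_learner(data_clas, drop_mult=0.5)
# Load the encoder saved earlier
learn.load_encoder('fine_tuned_enc')
# Freeze every layer group except the last one
learn.freeze()
# Find a learning rate
learn.lr_find()
# Plot the learning-rate curve
learn.recorder.plot()
# Train for one epoch with the chosen learning rate
learn.fit_one_cycle(1, 2e-2, moms=(0.8,0.7))
# Save the model after the first round of training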
learn.save('first')

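# Load the model from the first round
learn.load('first');
# Freeze every layer group except the last two
learn.freeze_to(-2)
# Train for one epoch with a range of learning rates: slice(lo, hi) gives the earliest layer group the rate lo and the last group the rate hi
learn.fit_one_cycle(1, slice(1e-2/(2.6**4),1e-2), moms=(0.8,0.7))
# Save the model after the second round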
learn.save('second')
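
How `slice(1e-2/(2.6**4), 1e-2)` turns into one learning rate per layer group can be sketched as geometric spacing between the two ends. The helper below is written for illustration and the number of layer groups (5) is an assumption; treat it as a sketch of the idea rather than fastai's code.

```python
def even_mults(start, stop, n):
    """Return n values from start to stop, evenly spaced on a log scale."""
    if n == 1:
        return [stop]
    step = (stop / start) ** (1 / (n - 1))
    return [start * step ** i for i in range(n)]

hi = 1e-2
lo = hi / (2.6 ** 4)
n_groups = 5  # assumed number of layer groups, for illustration only
lrs = even_mults(lo, hi, n_groups)
for i, lr in enumerate(lrs):
    print(f"layer group {i}: lr = {lr:.6f}")
```

Under the assumption of five groups, dividing by `2.6**4` means each group trains with a rate 2.6 times smaller than the group after it, so the earliest layers change the least.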

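# Load the model from the second round
learn.load('second');
# Freeze every layer group except the last three
learn.freeze_to(-3)
# Adjust the learning rates and train for one more epoch
learn.fit_one_cycle(1, slice(5e-3/(2.6**4),5e-3), moms=(0.8,0.7))
# Save the model after the third round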
learn.save('third')
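
The gradual unfreezing used in these training rounds (`freeze()`, then `freeze_to(-2)`, then `freeze_to(-3)`, and finally `unfreeze()`) can be pictured as a cut point in the list of layer groups: everything before the cut is frozen, everything from the cut onward is trainable. The sketch below only models that bookkeeping; `trainable_flags` is a hypothetical helper, the group count of 5 is assumed, and special handling such as batch-norm layers is ignored.

```python
def trainable_flags(n_groups, n):
    """Return one flag per layer group: groups before index n are frozen."""
    cut = n if n >= 0 else n_groups + n
    return [i >= cut for i in range(n_groups)]

n_groups = 5  # assumed number of layer groups, for illustration only
# freeze() is modelled as freeze_to(-1) and unfreeze() as freeze_to(0).
stages = {"freeze()": -1, "freeze_to(-2)": -2, "freeze_to(-3)": -3, "unfreeze()": 0}
for name, n in stages.items():
    flags = trainable_flags(n_groups, n)
    print(f"{name:14s}", ["train" if f else "frozen" for f in flags])
```

Each round opens one more group from the end of the network, so the layers closest to the output adapt first and the pretrained early layers are disturbed last.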

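# Load the model from the third round
learn.load('third');
# Unfreeze all parameters
learn.unfreeze()
# Adjust the learning rates and train for two epochs; this time all parameters are retrained
learn.fit_one_cycle(2, slice(1e-3/(2.6**4),1e-3), moms=(0.8,0.7))
# Use the trained model to classify an input sentence; predict returns the predicted label, its class index, and the probability of each class (ordered as in classes: neg, pos)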
learn.predict("I really loved that movie, it was awesome!")
