Learning Torchtext, Part 01

First, for an introductory overview of Torchtext I referred to this link.

Then I came across this tutorial on GitHub, so let's work through it.

1.0 Introduction

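Import the relevant libraries. These imports use the older torchtext API; as far as I know, in torchtext 0.9 to 0.11 the same classes live under torchtext.legacy.data, and they were removed in 0.12.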

import pandas as pd
import numpy as np
import torch
from torchtext.data import Field
from torchtext.data import TabularDataset
from torchtext.data import Iterator, BucketIterator

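The dataset used in this tutorial appears to be a toxic comment detection dataset, which happens to be what I worked on for an undergraduate course project…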

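1.1 Declaring the Fields

We want comment_text to be lowercased and tokenized by splitting on whitespace, with a little preprocessing. We tell the Field about these requirements.
The labels, on the other hand, have already been converted to binary values, so we pass use_vocab=False to tell the Field that they need no further processing.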

tokenize = lambda x: x.split()
TEXT = Field(sequential=True, tokenize=tokenize, lower=True)
LABEL = Field(sequential=False, use_vocab=False)
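
To make the effect of these settings concrete, here is a small plain-Python sketch of what a whitespace tokenizer followed by lowercasing does to one comment. It only roughly mirrors the Field's preprocessing and does not call torchtext; the preprocess helper and the sample sentence are made up for this example.

```python
def tokenize(text):
    """Whitespace tokenizer, same idea as the lambda passed to Field above."""
    return text.split()


def preprocess(text, lower=True):
    """Hypothetical helper: split on whitespace, then lowercase each token."""
    tokens = tokenize(text.rstrip("\n"))
    if lower:
        tokens = [tok.lower() for tok in tokens]
    return tokens


sample = "You are  SO Wrong, Sir!\n"
tokens = preprocess(sample)
print(tokens)
```

Note that punctuation stays attached to the words ("wrong,"), which is the price of splitting on whitespace only.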

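1.2 Creating the Datasets

We use the TabularDataset class to read the data, because it is in csv format (as of now, TabularDataset handles csv, tsv, and json files).
For the training and validation sets we need to process the labels. The fields we pass in must be in the same order as the columns. For columns we don't use, we pass in a tuple whose second element is None.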

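# Fields are matched to CSV columns by position, so this order must match the file.
# The label order below follows the one used in the rest of this post
# (assumed to be the column order of train.csv).
tv_datafields = [("id", None),  # not needed, so the field is None
                 ("comment_text", TEXT),
                 ("toxic", LABEL),
                 ("severe_toxic", LABEL),
                 ("obscene", LABEL),
                 ("threat", LABEL),
                 ("insult", LABEL),
                 ("identity_hate", LABEL)]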

trn, vld = TabularDataset.splits(
    path="data",
    train='train.csv',
    validation="valid.csv",
    format='csv',
    skip_header=True,
    fields=tv_datafields)
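
The positional matching between fields and CSV columns can be pictured with the standard csv module. This is only a sketch of the idea, not how TabularDataset is implemented; the in-memory CSV text, the string placeholders for the Field objects, and the read_rows helper are all invented for illustration.

```python
import csv
import io

# Tiny in-memory stand-in for train.csv (made-up rows, three columns only).
CSV_TEXT = """id,comment_text,toxic
a1,Hello there,0
b2,You are awful,1
"""

# (name, field) pairs in column order; None means "skip this column".
# The strings stand in for the TEXT and LABEL Field objects.
fields = [("id", None), ("comment_text", "TEXT"), ("toxic", "LABEL")]


def read_rows(text, fields, skip_header=True):
    """Hypothetical helper: pair each column with the field at the same position."""
    reader = csv.reader(io.StringIO(text))
    if skip_header:
        next(reader)  # drop the header row, like skip_header=True
    rows = []
    for record in reader:
        row = {}
        for (name, field), value in zip(fields, record):
            if field is not None:  # columns mapped to None are ignored
                row[name] = value
        rows.append(row)
    return rows


rows = read_rows(CSV_TEXT, fields)
print(rows)
```

Because the pairing is purely positional, swapping two entries in the list would silently attach the wrong name to a column, which is why the order matters.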

For the test set, we don't need any labels.

tst_datafields = [("id", None),
                  ("comment_text", TEXT)]
tst = TabularDataset(
    path="data/test.csv",
    format='csv',
    skip_header=True,
    fields=tst_datafields
)

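To convert the text field from words into numeric (index) form, it needs to know what the whole vocabulary is. For this we use TEXT.build_vocab, passing in the dataset to build the vocabulary on (here the training set):

TEXT.build_vocab(trn)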
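
Conceptually, building a vocabulary means counting the tokens in the training set and assigning each one an integer index. The sketch below shows that idea in plain Python. It is not torchtext's implementation: the build_vocab and numericalize helpers, the placement of the special tokens, the tie-breaking rule, and the toy data are assumptions made for illustration.

```python
from collections import Counter

UNK, PAD = "<unk>", "<pad>"  # special tokens chosen for this sketch


def build_vocab(tokenized_texts, min_freq=1):
    """Map each token seen at least min_freq times to an integer index."""
    counts = Counter(tok for text in tokenized_texts for tok in text)
    itos = [UNK, PAD]  # index-to-string list; specials come first
    # Most frequent first; ties broken alphabetically for determinism.
    for tok, freq in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if freq >= min_freq:
            itos.append(tok)
    stoi = {tok: idx for idx, tok in enumerate(itos)}  # string-to-index
    return itos, stoi


def numericalize(tokens, stoi):
    """Replace tokens with indices; unknown tokens fall back to <unk>."""
    return [stoi.get(tok, stoi[UNK]) for tok in tokens]


train_texts = [["you", "are", "wrong"], ["you", "are", "kind"]]
itos, stoi = build_vocab(train_texts)
ids = numericalize(["you", "are", "rude"], stoi)
print(itos)
print(ids)
```

The word "rude" never appears in the toy training data, so it is mapped to the index of the unknown token. This is also why the vocabulary is normally built on the training set only.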

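1.3 Creating the Iterators

During training we will use a special kind of iterator called BucketIterator.
When we pass data into a neural network, we want the sequences padded to the same length so that we can process them in batches.

e.g.:
[3, 15, 2, 7], [4, 1], [5, 5, 6, 8, 1]
would be padded to
[[3, 15, 2, 7, 0], [4, 1, 0, 0, 0], [5, 5, 6, 8, 1]]

If the sequence lengths differ a lot, padding wastes a large amount of memory and time. BucketIterator groups sequences of similar length into each batch to minimize padding.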
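
The saving from bucketing can be checked with a few lines of plain Python. The sketch below pads batches with 0, as in the example above, and counts the padding cells with and without sorting by length first. It is a simplification: BucketIterator also shuffles, which is left out here, and the pad_batch, make_batches, and count_padding helpers are hypothetical.

```python
def pad_batch(seqs, pad=0):
    """Right-pad every sequence to the length of the longest one."""
    width = max(len(s) for s in seqs)
    return [s + [pad] * (width - len(s)) for s in seqs]


def make_batches(seqs, batch_size):
    """Cut the list into consecutive batches and pad each batch."""
    return [pad_batch(seqs[i:i + batch_size])
            for i in range(0, len(seqs), batch_size)]


def count_padding(batches, seqs):
    """Number of padded cells = total cells minus real tokens."""
    cells = sum(len(row) for batch in batches for row in batch)
    return cells - sum(len(s) for s in seqs)


seqs = [[3, 15, 2, 7], [4, 1], [5, 5, 6, 8, 1], [9], [2, 2, 2, 2], [7, 7]]

naive = make_batches(seqs, batch_size=2)
bucketed = make_batches(sorted(seqs, key=len), batch_size=2)

print(pad_batch(seqs[:3]))
print(count_padding(naive, seqs), count_padding(bucketed, seqs))
```

On this toy data the batches taken in the original order need 8 padding cells, while the length-sorted batches need only 4.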

train_iter, val_iter = BucketIterator.splits(
    (trn, vld),
    batch_sizes=(64, 64),  # one batch size per dataset (train, validation)
    device=torch.device('cuda'),  # gpu
    sort_key=lambda x: len(x.comment_text),
    sort_within_batch=False,
    repeat=False
)

For the test set, we don't want the data to be shuffled, so we use a standard Iterator.

test_iter = Iterator(
    tst,
    batch_size=64,
    device=torch.device('cuda'),
    train=False,  # evaluation mode: iterate in dataset order, no shuffling
    sort=False,
    sort_within_batch=False,
    repeat=False
)

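1.4 Wrapping the Iterators

Currently, the iterator returns objects of a custom type called torchtext.data.Batch. This makes code reuse difficult (since we need to modify the code every time the column names change) and makes torchtext hard to use with other libraries in some use cases (such as torchsample and fastai).
The author wrote a simple wrapper here to make the batches easier to use.
Specifically, we convert each batch into a tuple of the form (x, y), where x is the independent variable (the model input) and y is the dependent variable (the supervision signal, i.e. the labels).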

class BatchWrapper:
    def __init__(self, dl, x_var, y_vars):
        self.dl, self.x_var, self.y_vars = dl, x_var, y_vars  # we pass in the list of attributes for x and y

    def __iter__(self):
        for batch in self.dl:
            x = getattr(batch, self.x_var)  # we assume only one input in this wrapper

            if self.y_vars is not None:  # we will concatenate y into a single tensor
                y = torch.cat([getattr(batch, feat).unsqueeze(1) for feat in self.y_vars], dim=1).float()
            else:
                y = torch.zeros((1))

            yield (x, y)

    def __len__(self):
        return len(self.dl)


train_dl = BatchWrapper(train_iter, "comment_text",
                        ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"])
valid_dl = BatchWrapper(val_iter, "comment_text",
                        ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"])
test_dl = BatchWrapper(test_iter, "comment_text", None)

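1.5 Building the Text Classifier

The model is an LSTM. Despite the class name, the encoder as written is unidirectional, since bidirectional=True is not passed to nn.LSTM.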

import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.autograd import Variable
import tqdm


class SimpleBiLSTMBaseline(nn.Module):
    def __init__(self, hidden_dim, emb_dim=300,
                 spatial_dropout=0.05, recurrent_dropout=0.1, num_linear=1):
        super().__init__()  # don't forget to call this!
        self.embedding = nn.Embedding(len(TEXT.vocab), emb_dim)
        self.encoder = nn.LSTM(emb_dim, hidden_dim, num_layers=1, dropout=recurrent_dropout)
        self.linear_layers = []
        for _ in range(num_linear - 1):
            self.linear_layers.append(nn.Linear(hidden_dim, hidden_dim))
        self.linear_layers = nn.ModuleList(self.linear_layers)
        self.predictor = nn.Linear(hidden_dim, 6)

    def forward(self, seq):
        hdn, _ = self.encoder(self.embedding(seq))
        feature = hdn[-1, :, :]
        for layer in self.linear_layers:
            feature = layer(feature)
        preds = self.predictor(feature)
        return preds


em_sz = 100  # embedding size
nh = 500  # LSTM hidden size
nl = 3  # unused below
model = SimpleBiLSTMBaseline(nh, emb_dim=em_sz)
model.cuda()

1.6 Training

opt = optim.Adam(model.parameters(), lr=1e-2)
loss_func = nn.BCEWithLogitsLoss()
epochs = 2

for epoch in range(1, epochs + 1):
    running_loss = 0.0
    running_corrects = 0
    model.train()  # turn on training mode
    for x, y in tqdm.tqdm(train_dl):  # thanks to our wrapper, we can intuitively iterate over our data!
        opt.zero_grad()

        preds = model(x)
        loss = loss_func(preds, y)
        loss.backward()
        opt.step()

        # loss.item() replaces the old loss.data[0];
        # x is (seq_len, batch), so size(1) is the batch size
        running_loss += loss.item() * x.size(1)

    epoch_loss = running_loss / len(trn)

    # calculate the validation loss for this epoch
    val_loss = 0.0
    model.eval()  # turn on evaluation mode
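    with torch.no_grad():  # no gradients needed for validation
        for x, y in valid_dl:
            preds = model(x)
            loss = loss_func(preds, y)
            # x is (seq_len, batch), so size(1) is the batch size
            val_loss += loss.item() * x.size(1)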

    val_loss /= len(vld)
    print('Epoch: {}, Training Loss: {:.4f}, Validation Loss: {:.4f}'.format(epoch, epoch_loss, val_loss))
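
nn.BCEWithLogitsLoss applies a sigmoid to the raw model outputs and then computes binary cross-entropy, averaged over all elements by default. The plain-Python sketch below computes the same quantity by hand using the numerically stable form max(x, 0) - x*y + log(1 + exp(-|x|)) and checks it against the textbook definition. The bce_with_logits and naive_bce helpers and the sample numbers are made up for illustration.

```python
import math


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over a flat list of logits and 0/1 targets.

    Uses the stable form max(x, 0) - x*y + log(1 + exp(-|x|)), which equals
    -[y*log(sigmoid(x)) + (1-y)*log(1-sigmoid(x))] without overflowing.
    """
    total = 0.0
    for x, y in zip(logits, targets):
        total += max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))
    return total / len(logits)


def naive_bce(logits, targets):
    """Textbook definition, fine for small logits, used here as a check."""
    total = 0.0
    for x, y in zip(logits, targets):
        p = 1.0 / (1.0 + math.exp(-x))
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(logits)


logits = [2.0, -1.0, 0.0, 3.5, -2.5, 0.5]   # one row of six label scores
targets = [1.0, 0.0, 0.0, 1.0, 0.0, 1.0]

print(bce_with_logits(logits, targets))
print(naive_bce(logits, targets))
```

Working on logits directly is the reason the model above has no sigmoid in its forward pass: the loss takes care of it, and the sigmoid is only applied explicitly at prediction time.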

1.7 Prediction

test_preds = []
model.eval()  # make sure dropout is off
with torch.no_grad():
    for x, y in tqdm.tqdm(test_dl):
        preds = model(x)
        # the model lives on the GPU, so move the outputs back to the CPU first
        preds = preds.cpu().numpy()
        # the actual outputs of the model are logits, so we pass them through the sigmoid function
        preds = 1 / (1 + np.exp(-preds))
        test_preds.append(preds)
# each batch is (batch_size, 6), so stack the batches row-wise
test_preds = np.vstack(test_preds)
df = pd.read_csv("data/test.csv")
for i, col in enumerate(["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]):
    df[col] = test_preds[:, i]

# if you want to write the submission file to disk, uncomment and run the below code
# df.drop("comment_text", axis=1).to_csv("submission.csv", index=False)
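
Each batch of predictions has one row per comment and one column per label, so the per-batch results have to be stacked row-wise before the columns are copied into the DataFrame. The sketch below illustrates this with plain lists and a numerically stable sigmoid. The sigmoid and collect_predictions helpers and the logit values are hypothetical.

```python
import math

LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]


def sigmoid(x):
    """Stable logistic function: never exponentiates a large positive number."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def collect_predictions(batches):
    """Turn (batch_size x n_labels) logit batches into one column per label."""
    # Flatten the batches row by row, applying the sigmoid to every logit.
    rows = [[sigmoid(v) for v in row] for batch in batches for row in batch]
    return {name: [row[i] for row in rows] for i, name in enumerate(LABELS)}


# Two made-up batches of logits with different sizes (2 rows, then 1 row).
batches = [
    [[0.0, -2.0, 1.0, -3.0, 0.5, -1.0],
     [2.0, 0.0, -1.0, -2.0, 1.5, -0.5]],
    [[-4.0, -4.0, -4.0, -4.0, -4.0, -4.0]],
]
columns = collect_predictions(batches)
print(columns["toxic"])
```

The last batch is smaller than the others, which is the usual situation at the end of a dataset and the reason the batches cannot simply be joined side by side.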