Preface
In the previous article in this series (Hugging face 模型微调系列1—— 实战transfomers文本分类finetune), we learned how to train a text-classification task with a pretrained model from Hugging Face. In this post we use Hugging Face's AutoModelForTokenClassification API to build a named entity recognition (NER) model. Installing the transformers package and downloading models from Hugging Face were already covered in detail in that first article, so I will not repeat them here and will go straight to the hands-on code.
Hands-on section
Data preprocessing
A few sample lines from the dataset:
{"text": "科技全方位资讯智能,快捷的汽车生活需要有三屏一云爱你", "entity_list": []}
{"text": "对,输给一个女人,的成绩。失望", "entity_list": []}
{"text": "今天下午起来看到外面的太阳。。。。我第一反应竟然是强烈的想回家泪想我们一起在嘉鱼个时候了。。。。有好多好多的话想对你说李巾凡想要瘦瘦瘦成李帆我是想切开云朵的心", "entity_list": [{"entity_index": {"begin": 38, "end": 39}, "entity_type": "LOC", "entity": "嘉"}, {"entity_index": {"begin": 59, "end": 62}, "entity_type": "PER", "entity": "李巾凡"}, {"entity_index": {"begin": 68, "end": 70}, "entity_type": "PER", "entity": "李帆"}]}
{"text": "今年拜年不短信,就在微博拜大年寻龙记", "entity_list": []}
{"text": "浑身酸疼,两腿无力,眼神呆滞,怎么了", "entity_list": []}
Entity recognition can be framed as predicting a class for every input character. The labeling scheme I use here assigns each character to one of the following 9 classes:
Non-entity character: 0
GPE-start:1
GPE-end:2
LOC-start:3
LOC-end:4
ORG-start:5
ORG-end:6
PER-start:7
PER-end:8
Only the first and last characters of each entity are labeled. This is admittedly not the best labeling scheme; a more common approach (e.g. BIO tagging) also labels the characters inside each entity. For reference when reading the label sequences below, the sketch that follows spells out the id-to-tag mapping implied by this scheme.
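A minimal sketch; the id2label name is my own and does not appear in the original training code:

# 0 means "non-entity"; each entity type then gets two ids:
# start = 2*k + 1 and end = 2*k + 2, in the order GPE, LOC, ORG, PER.
id2label = {0: "O"}  # "O" = non-entity character
for k, ent in enumerate(["GPE", "LOC", "ORG", "PER"]):
    id2label[2 * k + 1] = ent + "-start"
    id2label[2 * k + 2] = ent + "-end"
print(id2label)
# {0: 'O', 1: 'GPE-start', 2: 'GPE-end', 3: 'LOC-start', 4: 'LOC-end',
#  5: 'ORG-start', 6: 'ORG-end', 7: 'PER-start', 8: 'PER-end'}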
For example, take the following sample:
{"text": "今天下午起来看到外面的太阳。。。。我第一反应竟然是强烈的想回家泪想我们一起在嘉鱼个时候了。。。。有好多好多的话想对你说李巾凡想要瘦瘦瘦成李帆我是想切开云朵的心", "entity_list": [{"entity_index": {"begin": 38, "end": 39}, "entity_type": "LOC", "entity": "嘉"}, {"entity_index": {"begin": 59, "end": 62}, "entity_type": "PER", "entity": "李巾凡"}, {"entity_index": {"begin": 68, "end": 70}, "entity_type": "PER", "entity": "李帆"}]}
After processing, it becomes:
data: [CLS]今天下午起来看到外面的太阳。。。。我第一反应竟然是强烈的想回家泪想我们一起在嘉鱼个时候了。。。。有好多好多的话想对你说李巾凡想要瘦瘦瘦成李帆我是想切开云朵的心 [SEP]
The corresponding labels are:
label: 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 7, 0, 8, 0, 0, 0,0, 0, 0, 7, 8, 0, 0, 0, 0, 0, 0, 0, 0, 0
The [CLS] and [SEP] tokens in data are added automatically by the transformers tokenizer. In label, the 7, 0, 8 marks the person entity 李巾凡, and the 7, 8 marks the person entity 李帆.
The following code converts the raw data into the format above so it can be fed to the model for training.
labeldic = {'GPE': 0, 'LOC': 2, 'ORG': 4, 'PER': 6}

def get_train_data(file, labeldic):
    # Read one dict per line and collect the texts and their entity lists.
    data = []
    label = []
    with open(file, "r", encoding="utf-8") as f:
        for line in f.readlines():
            linedic = eval(line)  # each line is a dict literal; json.loads(line) also works
            data.append(linedic["text"])
            label.append(linedic["entity_list"])
    labelpast = process_label(data, label, labeldic)
    return data, labelpast

def process_label(data, label, labeldic):
    # Turn each entity list into a per-character label sequence:
    # start character -> 1 + offset, end character -> 2 + offset (offset from labeldic).
    # Note: a one-character entity keeps only the end label, since the second assignment overwrites the first.
    process_labeled = []
    for i, j in zip(data, label):
        la = [0] * len(i)
        if len(j) > 0:
            for l in j:
                la[l["entity_index"]["begin"]] = 1 + labeldic[l["entity_type"]]
                la[l["entity_index"]["end"] - 1] = 2 + labeldic[l["entity_type"]]
        process_labeled.append(la)
    return process_labeled

def padding_label(label, maxlen=100):
    # Prepend a 0 for the [CLS] token, then truncate or pad the sequence to maxlen.
    label.insert(0, 0)
    if len(label) > maxlen:
        return label[:maxlen]
    else:
        label += [0] * (maxlen - len(label))
        return label

data = get_train_data("./weibo/weibo_ner_train.txt", labeldic)
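As a quick sanity check (this snippet is not part of the original code and assumes the weibo training file exists at the path above), you can inspect the first sample and its label sequence:

texts, labels = data
print(texts[0])   # raw text of the first sample
print(labels[0])  # per-character labels before the [CLS] shift and padding
# Each non-zero value should sit on the first or last character of an entity in texts[0].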
Loading the pretrained model
Here we load the pretrained model. Since the task is to predict a label between 0 and 8 for every token, num_labels is set to 9. The AutoModelForTokenClassification interface makes it easy to run a classification head over every input character.
from transformers import AutoTokenizer
from transformers import AutoModelForTokenClassification

# Load the tokenizer and the model from the same local RoBERTa directory.
tokenizer = AutoTokenizer.from_pretrained("./robert")
model = AutoModelForTokenClassification.from_pretrained("./robert", num_labels=9)
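Optionally, num_labels can be complemented with the id2label mapping sketched earlier (and its inverse), so that the saved model config carries readable tag names. id2label/label2id are standard from_pretrained config arguments, while the id2label dict itself is my own addition:

# Optional: store human-readable tag names in the model config.
label2id = {v: k for k, v in id2label.items()}
model = AutoModelForTokenClassification.from_pretrained(
    "./robert",
    num_labels=9,
    id2label=id2label,
    label2id=label2id,
)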
Defining the optimizer and learning rate
Below we wrap the tokenized inputs and padded labels into a DataLoader, then define the optimizer as AdamW with a learning rate of 1e-4.
import torch
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler

batch_size = 16

# Tokenize the texts (data[0]) and pad the label sequences (data[1]) to the same max_length.
train_data = tokenizer(data[0], padding="max_length", max_length=100, truncation=True, return_tensors="pt")
train_label = [padding_label(i) for i in data[1]]

train = TensorDataset(train_data["input_ids"], train_data["attention_mask"], torch.tensor(train_label))
train_sampler = RandomSampler(train)
train_dataloader = DataLoader(train, sampler=train_sampler, batch_size=batch_size)

from torch.optim import AdamW
optimizer = AdamW(model.parameters(), lr=1e-4)

from transformers import get_scheduler
num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
# Linear learning-rate decay over all training steps, with no warmup.
lr_scheduler = get_scheduler(
    name="linear", optimizer=optimizer, num_warmup_steps=0, num_training_steps=num_training_steps
)

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)
Model training
This is the fine-tuning loop; the loss is printed every 10 steps. Over 3 epochs the loss does decrease steadily.
for epoch in range(num_epochs):
    total_loss = 0
    model.train()
    for step, batch in enumerate(train_dataloader):
        if step % 10 == 0 and not step == 0:
            # Running average loss per sample so far in this epoch.
            print("step: ", step, " loss:", total_loss / (step * batch_size))
        b_input_ids = batch[0].to(device)
        b_input_mask = batch[1].to(device)
        b_labels = batch[2].to(device)
        model.zero_grad()
        outputs = model(b_input_ids,
                        token_type_ids=None,
                        attention_mask=b_input_mask,
                        labels=b_labels)
        loss = outputs.loss
        total_loss += loss.item()
        loss.backward()
        # Clip gradients to avoid exploding gradients.
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        lr_scheduler.step()
    avg_train_loss = total_loss / len(train_dataloader)
    print("avg_loss:", avg_train_loss)
# step: 10 loss: 0.008040577522478998
# step: 20 loss: 0.007003763725515455
# step: 30 loss: 0.006395801079149047
# step: 40 loss: 0.0060769430099753665
# step: 50 loss: 0.005745032941922546
# step: 60 loss: 0.005538870729894067
# step: 70 loss: 0.005380560013665153
# step: 80 loss: 0.005348111099738162
# avg_loss: 0.08806420908157908
# step: 10 loss: 0.0032662286947015675
# step: 20 loss: 0.0034820106608094647
# step: 30 loss: 0.0031873046692150334
# step: 40 loss: 0.003313490017899312
# step: 50 loss: 0.0034354829916264863
# step: 60 loss: 0.0034220680904885133
# step: 70 loss: 0.003371356403139154
# step: 80 loss: 0.0033751878301700343
# avg_loss: 0.05340275425615525
# step: 10 loss: 0.0028469588607549666
# step: 20 loss: 0.0025143598119029774
# step: 30 loss: 0.0025101488562844073
# step: 40 loss: 0.002548581719020149
# step: 50 loss: 0.002388586633023806
# step: 60 loss: 0.002467507263160466
# step: 70 loss: 0.002398934058146551
# step: 80 loss: 0.0024104547520437335
# avg_loss: 0.03700044267231512
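Before moving on to prediction, it can be useful to persist the fine-tuned weights, for example for the TorchServe deployment mentioned at the end of this post. A minimal sketch; the output directory name is my own choice:

# Save the fine-tuned model and tokenizer so they can be reloaded later.
save_dir = "./robert-ner-finetuned"  # assumed output path, use any directory you like
model.save_pretrained(save_dir)
tokenizer.save_pretrained(save_dir)
# Reload later with:
# model = AutoModelForTokenClassification.from_pretrained(save_dir)
# tokenizer = AutoTokenizer.from_pretrained(save_dir)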
Model prediction
With fine-tuning done, we now use the model to extract entities. Feeding in "倩萍小姐胡小亭转闲置还有天就开奖啦小亭会专程去深圳为你们寄奖品大家加油转发哦", the model does successfully pull out entities such as 深圳 (location) and 倩萍 (person).
import numpy as np

def get_entity(model, sen):
    # Predict a label for every token of sen and decode start/end pairs back into entity strings.
    res = {}
    test = tokenizer(sen, return_tensors="pt", padding="max_length", max_length=100)
    model.eval()
    with torch.no_grad():
        outputs = model(test["input_ids"].to(device),
                        token_type_ids=None,
                        attention_mask=test["attention_mask"].to(device))
    # Predicted label id for every token position (position 0 is [CLS]).
    pred_flat = torch.argmax(outputs.logits, dim=2).squeeze().cpu().numpy()
    print(pred_flat)
    labeldic = {'GPE': 0, 'LOC': 2, 'ORG': 4, 'PER': 6}
    try:
        for i, j in labeldic.items():
            res[i] = []
            if (j + 1) in pred_flat:
                for ins in np.where(pred_flat == (j + 1))[0]:
                    # Look for the matching end label within the next 10 positions.
                    for ine in range(ins, ins + 10):
                        if pred_flat[ine] == j + 2:
                            # Token position k maps to character sen[k - 1] because of [CLS].
                            res[i].append(sen[ins - 1:ine])
                            break
                    else:
                        # No end label found nearby: keep a single-character entity.
                        res[i].append(sen[ins - 1])
    except:
        # The inner search can run past the end of pred_flat; such cases are simply skipped.
        pass
    return res
get_entity(model,"倩萍小姐胡小亭转闲置还有天就开奖啦小亭会专程去深圳为你们寄奖品大家加油转发哦")
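If you also want a character-level view of the predictions, a small helper along the following lines works; it is not part of the original code and reuses the id2label dict sketched earlier:

def show_tags(model, sen):
    # Print the predicted tag for every character; output position 0 is [CLS],
    # so character i of sen lines up with prediction i + 1.
    enc = tokenizer(sen, return_tensors="pt")
    model.eval()
    with torch.no_grad():
        logits = model(enc["input_ids"].to(device),
                       attention_mask=enc["attention_mask"].to(device)).logits
    pred = torch.argmax(logits, dim=2).squeeze().tolist()
    for ch, p in zip(sen, pred[1:]):
        if p != 0:
            print(ch, id2label[p])

show_tags(model, "倩萍小姐胡小亭转闲置还有天就开奖啦小亭会专程去深圳为你们寄奖品大家加油转发哦")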
Conclusion
Here we framed entity recognition as a per-character classification problem and used Hugging Face's built-in AutoModelForTokenClassification interface to train an NER model. This setup does not model the dependencies between the labels of neighbouring characters, so some accuracy is inevitably lost; adding a CRF layer on top is a natural follow-up optimization. In the next article I will show how to deploy a model fine-tuned with Hugging Face using TorchServe.