Converting the model to int8
Quantizing with AutoGPTQ:
https://github.com/QwenLM/Qwen/issues/464
https://github.com/AutoGPTQ/AutoGPTQ/issues/133
After a lot of fiddling, the code that finally works:
from transformers import AutoTokenizer, TextGenerationPipeline
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

model_path = "Qwen-14B-datayes"
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=True, trust_remote_code=True)

quantize_config = BaseQuantizeConfig(
    bits=8,          # quantize model to 8-bit
    group_size=128,  # it is recommended to set the value to 128
    desc_act=False,  # set to False to significantly speed up inference, but perplexity may be slightly worse
)

model = AutoGPTQForCausalLM.from_pretrained(model_path, quantize_config, trust_remote_code=True)

# calibration examples for GPTQ (tokenized text)
examples = [
    tokenizer(
        "auto-gptq is an easy-to-use model quantization library with user-friendly apis, based on GPTQ algorithm."
    )
]

quantized_model_dir = "Qwen-14B-datayes-int8"
model.quantize(examples)

# save quantized model
model.save_quantized(quantized_model_dir, use_safetensors=True)
tokenizer.save_pretrained(quantized_model_dir)
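The single English sentence above is just the minimal calibration input from the AutoGPTQ README; GPTQ calibration generally works better with text closer to the target domain. A minimal sketch of building a slightly larger calibration set (the Chinese prompts below are hypothetical examples, not from the original setup):

# hypothetical: replace the single toy sentence with a few domain-style prompts
calibration_texts = [
    "auto-gptq is an easy-to-use model quantization library with user-friendly apis, based on GPTQ algorithm.",
    "请分析最近一个季度新能源板块的整体表现。",  # hypothetical domain-style prompt
    "简要总结宁德时代的主营业务和核心竞争力。",  # hypothetical domain-style prompt
]
examples = [tokenizer(t) for t in calibration_texts]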
Once quantization is done, the model also has to run inference:
You need to copy all of the .py files from the original Qwen model directory into the quantized model directory (see the sketch below), and then iterate based on whatever error messages come up.
At the moment the quantized model can only be loaded with AutoGPTQForCausalLM.from_quantized.
Also note that model_basename is the file name of the saved quantized weights, without the extension; otherwise the file won't be found, which is a painful pitfall.
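A quick way to copy those remote-code files over (the directory names match the ones used above; the exact file list is whatever *.py Qwen ships alongside its weights):

import glob, shutil

src_dir = "Qwen-14B-datayes"       # original model directory with Qwen's remote-code files
dst_dir = "Qwen-14B-datayes-int8"  # quantized output directory

# copy every .py file (modeling, tokenization, config code) next to the quantized weights
for py_file in glob.glob(f"{src_dir}/*.py"):
    shutil.copy(py_file, dst_dir)

# Qwen's tokenizer also relies on its .tiktoken asset; copy it if save_pretrained did not already
for extra in glob.glob(f"{src_dir}/*.tiktoken"):
    shutil.copy(extra, dst_dir)

With those files in place, loading and inference look like this: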
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM  # note: this assumes a similar interface exists
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

torch.cuda.set_device(1)  # use GPU 1

quantized_model_dir = "Qwen-14B-datayes-int8"
model_basename = "gptq_model-8bit-128g"  # saved weight file name, without the extension

tokenizer = AutoTokenizer.from_pretrained(quantized_model_dir, use_fast=True, trust_remote_code=True)
model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir, model_basename=model_basename,
                                           device_map="auto", use_safetensors=True, trust_remote_code=True)

# prepare input
input_text = "分析一下宁德时代的投资价值"  # "Analyze the investment value of CATL"
input_ids = tokenizer(input_text, return_tensors='pt').input_ids.cuda()
output = model.generate(inputs=input_ids, temperature=0.7, max_new_tokens=256, min_new_tokens=100)
print(tokenizer.decode(output[0]))
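The decode above prints the prompt and any special tokens along with the answer. If you only want the generated continuation, one option (plain transformers/tokenizer usage, nothing specific to this setup):

# slice off the prompt tokens and drop special tokens when decoding
generated = output[0][input_ids.shape[1]:]
print(tokenizer.decode(generated, skip_special_tokens=True))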