Loading a large model with Transformers and generating text with streaming output
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

# `args` is an argparse namespace providing: model, tokenizer, streaming, max_tokens, eos_token
model = AutoModelForCausalLM.from_pretrained(
    args.model, device_map="auto", torch_dtype="auto", trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(
    args.tokenizer or args.model, trust_remote_code=True
)

while True:
    prompt = input('Enter your text ("end" to quit): ')
    if prompt == 'end':
        break
    inputs = tokenizer(prompt, return_tensors="pt")
    # TextStreamer prints tokens to stdout as they are generated
    streamer = TextStreamer(tokenizer) if args.streaming else None
    outputs = model.generate(
        inputs.input_ids.cuda(),
        max_new_tokens=args.max_tokens,
        streamer=streamer,
        eos_token_id=tokenizer.convert_tokens_to_ids(args.eos_token),
        do_sample=True,
        repetition_penalty=1.3,
        no_repeat_ngram_size=5,
        temperature=0.7,
        top_k=40,
        top_p=0.8,
    )
    # Without a streamer, decode and print the full output at once
    if streamer is None:
        print(tokenizer.decode(outputs[0], skip_special_tokens=True))
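The loop above assumes an `args` namespace produced by argparse; a minimal sketch of the assumed command-line flags (the flag names and defaults are assumptions, not part of the original script):
import argparse

# Hypothetical CLI flags matching the attributes used above (args.model, args.tokenizer, ...)
parser = argparse.ArgumentParser()
parser.add_argument("--model", required=True, help="model path or Hugging Face Hub id")
parser.add_argument("--tokenizer", default=None, help="tokenizer path; falls back to --model")
parser.add_argument("--streaming", action="store_true", help="print tokens as they are generated")
parser.add_argument("--max-tokens", type=int, default=512, help="maximum number of new tokens")
parser.add_argument("--eos-token", default="<|im_end|>", help="token that ends generation")
args = parser.parse_args()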
Loading a large model with Transformers and chatting with streaming output
- This version keeps a simple conversation history
import os
import platform
import sys

from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

os_name = platform.system()
clear_command = 'cls' if os_name == 'Windows' else 'clear'

model_path = r'/01-aiYi-34B-Chat-4bits'
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False, trust_remote_code=True)
# Since transformers 4.35.0, GPT-Q/AWQ models can be loaded with AutoModelForCausalLM.
# device_map="auto" already places the weights on the GPU, so no extra .cuda() call is needed.
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",
    torch_dtype='auto',
    trust_remote_code=True
).eval()
streamer = TextStreamer(tokenizer)
history = []

print("01-ai Yi-34B-Chat-4bits quantized model. Type a message to chat, 'clear' to reset the history, 'stop' to exit.\n")
while True:
    print("Enter your prompt, then press Ctrl+D (Ctrl+Z on Windows) to finish input:")
    content = sys.stdin.readlines()
    content = ' '.join(content).strip()
    if content == "stop":
        break
    if content == "clear":
        history = []
        os.system(clear_command)
        print("01-ai Yi-34B-Chat-4bits quantized model. Type a message to chat, 'clear' to reset the history, 'stop' to exit.")
        continue
    content = content if content else 'hi'
    history.append({"role": "user", "content": content})
    # Build the ChatML prompt from the full conversation history
    input_ids = tokenizer.apply_chat_template(conversation=history, tokenize=True,
                                              add_generation_prompt=True, return_tensors='pt')
    output_ids = model.generate(input_ids.to('cuda'), streamer=streamer, max_new_tokens=512, do_sample=True,
                                eos_token_id=tokenizer.convert_tokens_to_ids("<|im_end|>"),
                                bos_token_id=tokenizer.convert_tokens_to_ids("<|im_start|>"),
                                repetition_penalty=1.3, no_repeat_ngram_size=5,
                                temperature=0.7, top_k=40, top_p=0.8)
    # Keep only the newly generated tokens (drop the prompt part)
    response = tokenizer.decode(output_ids[0][input_ids.shape[1]:], skip_special_tokens=True)
    print('*' * 30)
    print('Model response: {}'.format(response))
    history.append({"role": "assistant", "content": response})
- For more robust handling of conversational context, generate a standalone question (see the sketches following the prompt examples below).
References:
(1) How the condense LLM works in LangChain
In short, the model is given the follow-up question together with the chat history and asked to rewrite it as a self-contained standalone question.
prompt = f"""
Given the following conversation and a follow up question, rephrase the follow up question to be a standalone question, in its original language.
Chat History:
Human: What did the president say about Ketanji Brown Jackson
Assistant: The President said that he nominated Circuit Court of Appeals Judge Ketanji Brown Jackson to serve on the United States Supreme Court. He described her as one of our nation's top legal minds and mentioned that she comes from a family of public school educators and police officers. He also highlighted that she has received broad support from various groups, including the Fraternal Order of Police and former judges appointed by Democrats and Republicans.
Follow Up Input: Did he mention who she succeeded
Standalone question:
"""
(2) How to construct the prompt for a standalone question? (PHP example)
$instruction = "Generate a standalone question which is based on the new question plus the chat history. Just create the standalone question without commentary. New question: " . $question;
$chatHistory[] = ["role" => "user", "content" => $instruction];
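The same idea in Python, reusing the chat-style `model`, `tokenizer`, and `history` from the earlier example (a minimal sketch; `new_question` and the generation settings are illustrative assumptions):
instruction = ("Generate a standalone question which is based on the new question plus the "
               "chat history. Just create the standalone question without commentary. "
               "New question: " + new_question)
# Append the instruction as the latest user turn, then let the chat model answer it
condense_messages = history + [{"role": "user", "content": instruction}]
input_ids = tokenizer.apply_chat_template(conversation=condense_messages, tokenize=True,
                                          add_generation_prompt=True, return_tensors='pt')
output_ids = model.generate(input_ids.to(model.device), max_new_tokens=128, do_sample=False,
                            eos_token_id=tokenizer.convert_tokens_to_ids("<|im_end|>"))
standalone_question = tokenizer.decode(output_ids[0][input_ids.shape[1]:], skip_special_tokens=True)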
Distributed deployment of large models
- GPU memory is usually limited, while large models have far more parameters than a single card can hold, so the model must be spread across multiple GPUs.
- Use Hugging Face's accelerate library.
- References:
(1) https://huggingface.co/blog/accelerate-large-models
(2) Deploying large language models with PyTorch under limited resources (using ChatGLM-6B as an example). Note that when calling load_checkpoint_and_dispatch, classes that contain residual connections must not be split across devices; pass them via no_split_module_classes.
from accelerate import init_empty_weights, load_checkpoint_and_dispatch
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
# Build the model skeleton on the meta device without allocating real weights
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config, trust_remote_code=True)
# Shard the checkpoint across GPUs/CPU/disk; keep residual-connection modules whole
model = load_checkpoint_and_dispatch(model, model_path,
                                     device_map='auto',
                                     offload_folder="offload",
                                     offload_state_dict=True,
                                     dtype="float16",
                                     no_split_module_classes=["LlamaDecoderLayer"])
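Once dispatched, the model is used like any other transformers model; a brief usage sketch (the prompt text is illustrative):
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
# Put the inputs on the device that holds the first model shard
inputs = tokenizer("Hello, please introduce yourself.", return_tensors="pt").to("cuda:0")
# accelerate's dispatch hooks move activations between devices during the forward pass
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))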