End-to-End Neural Pipeline for Goal-Oriented Dialogue Systems using GPT-2
Donghoon Ham, Jeong-Gwan Lee, Youngsoo Jang, Kee-Eung Kim. KAIST. ACL 2020
Highlights
- It is trained to follow the traditional dialogue management pipeline, making the monolithic neural model more interpretable and easier to integrate with external systems
- It is trained in an end-to-end fashion with simple gradient descent
- Leverages GPT-2, a powerful pre-trained language model
Introduction
Traditional goal-oriented dialogue systems mostly adopt a pipelined modular architecture (a code sketch of these module interfaces follows the list below):
- Natural Language Understanding (NLU) module that first recognizes and comprehends the user's intent and extracts values for slots
  - Input: user utterance U_t
  - Output: (I_t, Z_t), where I_t refers to the intent and Z_t refers to the slot-value pairs
- Dialogue State Tracking (DST) module that tracks the values of slots
  - Input: I_t, Z_t (N-best list)
  - Output: dialogue state S_t
- Dialogue policy (POL) module that decides the system action
  - Input: dialogue state S_t
  - Output: system action A_t
- Natural language generation (NLG) module that generates the utterance that corresponds to the system action
  - Input: system action A_t
  - Output: system response R_t
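Below is a minimal Python sketch of these module interfaces with toy rule-based stubs; all function names, types, and rules here are illustrative assumptions, not the paper's code:

```python
from typing import Dict, List, Tuple

SlotValues = Dict[str, str]             # e.g. {"food": "italian"}
DialogueState = Dict[str, SlotValues]   # per-domain slot-value pairs

def nlu(utterance: str) -> List[Tuple[str, SlotValues]]:
    """Toy NLU: return an N-best list of (intent, slot-value pairs) hypotheses."""
    if "italian" in utterance.lower():
        return [("inform", {"food": "italian"})]
    return [("request", {})]

def dst(state: DialogueState, nbest: List[Tuple[str, SlotValues]]) -> DialogueState:
    """Track slot values by merging the top NLU hypothesis into the state."""
    _, slots = nbest[0]
    state.setdefault("restaurant", {}).update(slots)
    return state

def policy(state: DialogueState) -> Tuple[str, str]:
    """Decide the system action from the tracked state (toy rule)."""
    return ("inform", "name") if state.get("restaurant") else ("request", "food")

def nlg(action: Tuple[str, str]) -> str:
    """Realize the system action as an utterance (delexicalized placeholder)."""
    if action[0] == "inform":
        return "[restaurant_name] serves italian food."
    return "What kind of food would you like?"

state: DialogueState = {}
state = dst(state, nlu("I want italian food in the centre"))
print(nlg(policy(state)))   # -> [restaurant_name] serves italian food.
```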
End-to-end methods instead build the dialogue system as a single model that takes the natural language context as input and generates the natural language response as output
Dataset
MultiWOZ dataset
Evaluated with ConvLab, the dialogue system platform used in DSTC8
Each dialogue consists of ‘Goal’, ‘Database’ and ‘Dialogue turns’.
- Goal is defined by the domain and the slots. The slots are divided into informable, requestable and book slots (an example goal in this format appears after this list).
- Informable slots represent user constraints
- Requestable slots hold additional information that the user wants to obtain
- Book slots are used to reserve a place recommended by the system
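As a concrete illustration, a user goal might look like the following; the "info"/"reqt"/"book" field names follow MultiWOZ's JSON layout as I recall it, so treat them as approximate:

```python
# Hypothetical user goal in the style of MultiWOZ's goal annotation.
goal = {
    "restaurant": {
        # Informable slots: the user's constraints
        "info": {"food": "italian", "area": "centre", "pricerange": "cheap"},
        # Requestable slots: information the user wants back from the system
        "reqt": ["phone", "address"],
        # Book slots: details for reserving the place the system recommends
        "book": {"people": "4", "day": "saturday", "time": "18:00"},
    }
}
```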
End-to-end neural dialogue model
- Predict the current domain and the corresponding dialogue state conditioned on the dialogue history
- Predict the system action with delexicalized tokens conditioned on the dialogue history and dialogue state
- If the system action (e.g. ‘inform’, ‘book’) needs external information from the database, the query module retrieves the candidates and returns one of them
- Update the current system action when the query result is empty
- Generate the system response with delexicalized tokens conditioned on the dialogue history, dialogue state, and system action
- Replace the delexicalized tokens in the system response with values from the query result (this decoding loop is sketched in code below)
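A sketch of one decoding turn under this scheme; every helper here (generate, needs_db, query_db, revise_action, lexicalize) is a hypothetical name standing in for the corresponding step, not the paper's code:

```python
def system_turn(model, history):
    """One system turn: state -> action -> DB query -> delexicalized response."""
    # 1. Predict the current domain and dialogue state from the dialogue history.
    state = generate(model, history + ["<ds>"], stop="<sa>")   # hypothetical helper

    # 2. Predict the delexicalized system action given history + state.
    action = generate(model, history + ["<ds>"] + state + ["<sa>"], stop="<sys>")

    # 3. If the action (e.g. 'inform', 'book') needs external information,
    #    query the database and keep one candidate record.
    result = query_db(state) if needs_db(action) else None

    # 4. If the query came back empty, revise the system action accordingly.
    if needs_db(action) and not result:
        action = revise_action(action)

    # 5. Generate the delexicalized response, then fill placeholders such as
    #    [restaurant_name] with values from the chosen DB record.
    response = generate(model, history + ["<ds>"] + state + ["<sa>"] + action + ["<sys>"])
    return lexicalize(response, result)
```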
Input
In the MultiWOZ dataset, the ‘metadata’ is treated as the dialogue state and the ‘dialogue act’ is treated as the system action
Delimiter tokens:
- <usr> : marks the start of a user utterance
- <sys> : marks the start of a system response
- <ds> : marks the start of the dialogue state
- <sa> : marks the start of the system action
Special tokens:
- domain and slot names
- <nm> ("not mentioned") and <dc> ("don't care") for slot values
Input embedding = Token embedding + Speaker embedding + Positional embedding
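A sketch of how the input sequence and speaker embedding can be realized with HuggingFace's GPT-2; the serialization of the state and action below is my assumption. In HF GPT-2, the token_type_ids argument plays the role of the speaker embedding (it is embedded with the same matrix as tokens), and positional embeddings are added internally:

```python
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.add_special_tokens(
    {"additional_special_tokens": ["<usr>", "<sys>", "<ds>", "<sa>", "<nm>", "<dc>"]}
)
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.resize_token_embeddings(len(tokenizer))

text = ("<usr> i need a cheap italian place "
        "<ds> restaurant food italian pricerange cheap "
        "<sa> inform name <sys> [restaurant_name] is a nice place")
input_ids = tokenizer.encode(text, return_tensors="pt")

# Speaker embedding via token_type_ids: tag each position with the id of its
# speaker's delimiter (<usr> for user tokens; <sys> for the state, action, and
# response, which the system produces). Input = token + speaker + positional.
usr_id = tokenizer.convert_tokens_to_ids("<usr>")
sys_id = tokenizer.convert_tokens_to_ids("<sys>")
ds_id = tokenizer.convert_tokens_to_ids("<ds>")

token_type_ids = input_ids.clone()
speaker = usr_id
for i, tok in enumerate(input_ids[0].tolist()):
    if tok == usr_id:
        speaker = usr_id
    elif tok in (ds_id, sys_id):
        speaker = sys_id
    token_type_ids[0, i] = speaker

out = model(input_ids, token_type_ids=token_type_ids)
```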
Training Objective
The objective function is the weighted sum of the language modeling (LM) and next-utterance classification (NC) objectives: L = λ_LM · L_LM + λ_NC · L_NC
- For LM, the model maximizes the log-likelihood of the gold target sequence (dialogue state + system action + system response) token by token, given the dialogue history: L_LM = Σ_i log P(w_i | w_<i)
- For NC, the model needs to distinguish the gold response (gold dialogue state + gold system action + gold system response) from a distractor (gold dialogue state + gold system action + fake system response), given the dialogue history
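A minimal sketch of the combined loss. The paper follows the GPT-2 double-heads style of TransferTransfo; here I assume a simple linear classifier over the final hidden state as the NC head, and the weights lm_coef/mc_coef are illustrative, not the paper's values:

```python
import torch
import torch.nn.functional as F
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
cls_head = torch.nn.Linear(model.config.n_embd, 1)  # NC head: scores a sequence

gold = "history <ds> gold state <sa> gold action <sys> gold response"
fake = "history <ds> gold state <sa> gold action <sys> fake response"

scores, lm_loss = [], None
for text in (gold, fake):
    ids = tokenizer.encode(text, return_tensors="pt")
    out = model(ids, labels=ids, output_hidden_states=True)
    if lm_loss is None:          # LM loss on the gold sequence only
        lm_loss = out.loss       # (a real setup masks history tokens with -100)
    scores.append(cls_head(out.hidden_states[-1][:, -1]))  # score at last token

mc_logits = torch.cat(scores, dim=-1)                     # (1, 2)
mc_loss = F.cross_entropy(mc_logits, torch.tensor([0]))   # gold is candidate 0

lm_coef, mc_coef = 2.0, 1.0      # illustrative weights for the weighted sum
loss = lm_coef * lm_loss + mc_coef * mc_loss
loss.backward()
```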
Results
On DSTC8 Track 1 (end-to-end multi-domain task completion), this model ranked first in the human evaluation.