Holl-E

Background Aware Movie Conversations Dataset

What is Holl-E?

Holl-E is a background-aware movie conversations dataset consisting of ~9K chats, each associated with at least three background resources. Every alternate utterance is explicitly generated by copying and/or modifying sentences from unstructured background knowledge such as plots, comments, and reviews about the movie. For details, see the Holl-E paper (EMNLP 2018).


Getting Started

Download the dataset of ~9K chats with at least three background resources per chat.
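Once downloaded, each chat can be inspected directly. The snippet below is a minimal sketch assuming a JSON release; the file name and field names ("train_data.json", "chat", "documents", "spans") are hypothetical placeholders, so consult the released files for the actual schema.

```python
# Minimal sketch of inspecting a Holl-E record. File and field names
# ("train_data.json", "chat", "documents", "spans") are hypothetical;
# check the released dataset for the actual schema.
import json

with open("train_data.json") as f:
    chats = json.load(f)

example = chats[0]
print(example["chat"])       # alternating utterances of the two speakers
print(example["documents"])  # background resources: plot, review, comments, fact table
print(example["spans"])      # span-level annotation for the grounded responses
```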


Models

Our baselines consist of (1) pure generation-based models which ignore the background knowledge, (2) generation-based models which learn to copy information from the background knowledge when required, and (3) span-prediction-based models which predict the appropriate response span in the background knowledge. We also release different splits of the dataset to test different capabilities of new architectures, along with a multi-reference evaluation for our test set. Please see the code repository for more details.
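As an illustration of the second family, the sketch below shows the core of a pointer-generator style copy mechanism (the idea behind the GTTP baseline): the output distribution mixes generating from the vocabulary with copying tokens from the background document. Shapes and names are illustrative, not the repository's API.

```python
# Sketch of a pointer-generator output layer: mix a vocabulary
# distribution with a copy distribution over the background tokens.
import torch

def final_distribution(p_gen, vocab_dist, attn_dist, src_ids):
    """p_gen:      (batch, 1) probability of generating vs. copying
       vocab_dist: (batch, vocab_size) softmax over the vocabulary
       attn_dist:  (batch, src_len) attention over background tokens
       src_ids:    (batch, src_len) long tensor of background token ids"""
    gen = p_gen * vocab_dist
    copy = torch.zeros_like(vocab_dist)
    # scatter the copy probability mass onto the vocabulary ids
    # that the attention points at in the background document
    copy.scatter_add_(1, src_ids, (1.0 - p_gen) * attn_dist)
    return gen + copy
```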

Have Questions?

Ask us questions at nikitamoghe29@gmail.com.

Results: Response Generation (trained on oracle)

Holl-E evaluates the ability of new architectures to produce responses grounded in knowledge. For a given conversation context, the task is to generate a response by copying and/or modifying contiguous segments from the background knowledge. The resources consist of the plot of the movie, a review of the movie, some comments about the movie, or a fact table. We provide both the span-level annotation and the actual utterance for every alternate response. In the oracle setup, only the correct background resource is provided with the chat. We use ROUGE-L and BLEU-4 as the performance measures; other evaluation measures and human evaluation are discussed in the respective papers of the different models.
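For reference, the snippet below is a rough sketch of the two reported metrics, with ROUGE-L computed as a longest-common-subsequence F1 and BLEU-4 taken from NLTK. The official numbers come from the evaluation scripts in the code repository, which may differ in tokenization and smoothing.

```python
# Rough sketch of ROUGE-L (LCS-based F1) and BLEU-4 (via NLTK).
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def lcs_len(a, b):
    # dynamic-programming longest common subsequence length
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def rouge_l_f1(hyp, ref):
    lcs = lcs_len(hyp, ref)
    p, r = lcs / len(hyp), lcs / len(ref)
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)

hyp = "the movie was praised for its screenplay".split()
ref = "the movie was widely praised for its tight screenplay".split()
print(rouge_l_f1(hyp, ref))
print(sentence_bleu([ref], hyp, weights=(0.25,) * 4,
                    smoothing_function=SmoothingFunction().method1))
```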

Rank | Date | Model | Affiliation | ROUGE-L | BLEU-4
1 | Nov 21, 2019 | GLKS (Ren et al., '19) | Shandong University + University of Amsterdam | 38.69 | NA
2 | Jun 16, 2019 | CaKe (Zhang et al., '19) | University of Amsterdam | 37.48 | 26.02
3 | Aug 18, 2019 | RefNet (Meng et al., '19) | Shandong University + University of Amsterdam | 37.11 | 27
4 | May 28, 2020 | SSS BERT (Moghe et al., '20) | IIT Madras | 35.2 | 22.78
5 | Sep 18, 2018 | GTTP (Moghe et al., '18) | Baseline | 25.67 | 13.92

Results: Response Generation (trained on mixed-long)

Here, we report the test performance of various models on the mixed-long version of the dataset. In this setup, all the background resources are provided with the chat in full. This setup measures the ability of a system to first retrieve the correct background resource and then generate a response.
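The sketch below illustrates this retrieve-then-generate setup with a simple TF-IDF scorer that picks the background resource closest to the dialogue context; it is only an illustration of the setup, not the retrieval method of any model in the table.

```python
# Sketch of the retrieval step in a retrieve-then-generate pipeline:
# score each background resource against the dialogue context and
# hand only the best one to the response generator.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def pick_resource(context, resources):
    vec = TfidfVectorizer().fit(resources + [context])
    scores = cosine_similarity(vec.transform([context]), vec.transform(resources))
    return resources[scores.argmax()]

resources = ["<plot of the movie>", "<review of the movie>", "<comments>"]
print(pick_resource("what did people think of the acting?", resources))
```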

Rank | Date | Model | Affiliation | ROUGE-L | BLEU-4
1 | Nov 21, 2019 | GLKS (Ren et al., '19) | Shandong University + University of Amsterdam | 30.36 | NA
2 | Aug 18, 2019 | RefNet (Meng et al., '19) | Shandong University + University of Amsterdam | 29.64 | 17.19
3 | Sep 18, 2018 | GTTP (Moghe et al., '18) | Baseline | 17.35 | 7.51

Results: Response Generation (trained on mixed-short)

Here, we report the test performance of various models on the mixed-short version of the dataset. In this setup, all the background resources are provided, but truncated so that the total number of tokens is limited to 256.
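A minimal sketch of this preprocessing step, assuming whitespace tokenization (the released split defines the exact tokenization and how the resources are combined):

```python
# Clip the concatenated background resources to a 256-token budget,
# as in the mixed-short setup. Whitespace tokenization is an assumption.
MAX_TOKENS = 256

def build_mixed_short(resources, budget=MAX_TOKENS):
    tokens = []
    for doc in resources:
        for tok in doc.split():
            if len(tokens) == budget:
                return " ".join(tokens)
            tokens.append(tok)
    return " ".join(tokens)
```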

Rank | Date | Model | Affiliation | ROUGE-L | BLEU-4
1 | Nov 21, 2019 | GLKS (Ren et al., '19) | Shandong University + University of Amsterdam | 39.63 | NA
2 | Aug 18, 2019 | RefNet (Meng et al., '19) | Shandong University + University of Amsterdam | 36.17 | 29.38
3 | Jun 16, 2019 | CaKe (Zhang et al., '19) | University of Amsterdam | 36.01 | 26.17
4 | Mar 25, 2019 | AKGCM (Liu et al., '20) | Baidu Inc | 34.72 | 30.84
5 | Sep 18, 2018 | GTTP (Moghe et al., '18) | Baseline | 25.13 | 11.05