Holl-E

Background Aware Movie Conversations Dataset

What is Holl-E?

Holl-E is a background-aware movie conversations dataset consisting of ~9K chats, each associated with at least three background resources. Every alternate utterance is explicitly generated by copying and/or modifying sentences from unstructured background knowledge such as plots, comments, and reviews about the movie. For details, see the Holl-E paper (EMNLP 2018).


Getting Started

Download the dataset of ~9K chats with at least three background resources per chat.
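Once downloaded, each chat can be inspected directly. The snippet below is a minimal sketch assuming a JSON release; the file name and field names ("train_data.json", "chat", "documents", "spans") are hypothetical placeholders, so consult the released files for the actual schema.

```python
# Minimal sketch of inspecting a Holl-E record. File and field names
# ("train_data.json", "chat", "documents", "spans") are hypothetical;
# check the released dataset for the actual schema.
import json

with open("train_data.json") as f:
    chats = json.load(f)

example = chats[0]
print(example["chat"])       # alternating utterances of the two speakers
print(example["documents"])  # background resources: plot, review, comments, fact table
print(example["spans"])      # span-level annotation for the grounded responses
```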


Models

Our baselines consist of (1) pure generation-based models which ignore the background knowledge, (2) generation-based models which learn to copy information from the background knowledge when required, and (3) span-prediction-based models which predict the appropriate response span in the background knowledge. We also release different splits of the dataset to test different capabilities of new architectures, along with a multi-reference evaluation for our test set. Please see the code repository for more details.
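As an illustration of the second family, the sketch below shows the core of a pointer-generator style copy mechanism (the idea behind the GTTP baseline): the output distribution mixes generating from the vocabulary with copying tokens from the background document. Shapes and names are illustrative, not the repository's API.

```python
# Sketch of a pointer-generator output layer: mix a vocabulary
# distribution with a copy distribution over the background tokens.
import torch

def final_distribution(p_gen, vocab_dist, attn_dist, src_ids):
    """p_gen:      (batch, 1) probability of generating vs. copying
       vocab_dist: (batch, vocab_size) softmax over the vocabulary
       attn_dist:  (batch, src_len) attention over background tokens
       src_ids:    (batch, src_len) long tensor of background token ids"""
    gen = p_gen * vocab_dist
    copy = torch.zeros_like(vocab_dist)
    # scatter the copy probability mass onto the vocabulary ids
    # that the attention points at in the background document
    copy.scatter_add_(1, src_ids, (1.0 - p_gen) * attn_dist)
    return gen + copy
```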

Have Questions?

Ask us questions at nikitamoghe29@gmail.com.

Results: Response Generation (trained on oracle)

Holl-E evaluates the ability of new architectures to produce responses grounded in knowledge. For a given conversation context, the task is to generate a response by copying and/or modifying contiguous segments from the background knowledge. The resources consist of the plot of the movie, a review of the movie, some comments about the movie, or a fact table. We provide both the span-level annotation and the actual utterance for every alternate response. In the oracle setup, only the correct background resource is provided with the chat. We use ROUGE-L and BLEU-4 as the performance measures; other evaluation measures and human evaluation are discussed in the respective papers of the different models.
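For reference, the snippet below is a rough sketch of the two reported metrics, with ROUGE-L computed as a longest-common-subsequence F1 and BLEU-4 taken from NLTK. The official numbers come from the evaluation scripts in the code repository, which may differ in tokenization and smoothing.

```python
# Rough sketch of ROUGE-L (LCS-based F1) and BLEU-4 (via NLTK).
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def lcs_len(a, b):
    # dynamic-programming longest common subsequence length
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def rouge_l_f1(hyp, ref):
    lcs = lcs_len(hyp, ref)
    p, r = lcs / len(hyp), lcs / len(ref)
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)

hyp = "the movie was praised for its screenplay".split()
ref = "the movie was widely praised for its tight screenplay".split()
print(rouge_l_f1(hyp, ref))
print(sentence_bleu([ref], hyp, weights=(0.25,) * 4,
                    smoothing_function=SmoothingFunction().method1))
```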

Rank | Date | Model | Affiliation | ROUGE-L | BLEU-4
1 | Nov 21, 2019 | GLKS (Ren et al., '19) | Shandong University + University of Amsterdam | 38.69 | NA
2 | Jun 16, 2019 | CaKe (Zhang et al., '19) | University of Amsterdam | 37.48 | 26.02
3 | Aug 18, 2019 | RefNet (Meng et al., '19) | Shandong University + University of Amsterdam | 37.11 | 27
4 | May 28, 2020 | SSS BERT (Moghe et al., '20) | IIT Madras | 35.2 | 22.78
5 | Sep 18, 2018 | GTTP (Moghe et al., '18) | Baseline | 25.67 | 13.92

Results: Response Generation (trained on mixed-long)

Here, we report the test performance of various models on the mixed-long version of the dataset. In this setup, all the background resources are provided with the chat in full. This setup measures the ability of a system to first retrieve the correct background resource and then generate a response.
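The sketch below illustrates this retrieve-then-generate setup with a simple TF-IDF scorer that picks the background resource closest to the dialogue context; it is only an illustration of the setup, not the retrieval method of any model in the table.

```python
# Sketch of the retrieval step in a retrieve-then-generate pipeline:
# score each background resource against the dialogue context and
# hand only the best one to the response generator.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def pick_resource(context, resources):
    vec = TfidfVectorizer().fit(resources + [context])
    scores = cosine_similarity(vec.transform([context]), vec.transform(resources))
    return resources[scores.argmax()]

resources = ["<plot of the movie>", "<review of the movie>", "<comments>"]
print(pick_resource("what did people think of the acting?", resources))
```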

Rank | Date | Model | Affiliation | ROUGE-L | BLEU-4
1 | Nov 21, 2019 | GLKS (Ren et al., '19) | Shandong University + University of Amsterdam | 30.36 | NA
2 | Aug 18, 2019 | RefNet (Meng et al., '19) | Shandong University + University of Amsterdam | 29.64 | 17.19
3 | Sep 18, 2018 | GTTP (Moghe et al., '18) | Baseline | 17.35 | 7.51

Results: Response Generation (trained on mixed-short)

Here, we report the test performance of various models on the mixed-short version of the dataset. In this setup, all the background resources are provided, but truncated so that the total number of tokens is limited to 256.
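A minimal sketch of this preprocessing step, assuming whitespace tokenization (the released split defines the exact tokenization and how the resources are combined):

```python
# Clip the concatenated background resources to a 256-token budget,
# as in the mixed-short setup. Whitespace tokenization is an assumption.
MAX_TOKENS = 256

def build_mixed_short(resources, budget=MAX_TOKENS):
    tokens = []
    for doc in resources:
        for tok in doc.split():
            if len(tokens) == budget:
                return " ".join(tokens)
            tokens.append(tok)
    return " ".join(tokens)
```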

Rank | Date | Model | Affiliation | ROUGE-L | BLEU-4
1 | Nov 21, 2019 | GLKS (Ren et al., '19) | Shandong University + University of Amsterdam | 39.63 | NA
2 | Aug 18, 2019 | RefNet (Meng et al., '19) | Shandong University + University of Amsterdam | 36.17 | 29.38
3 | Jun 16, 2019 | CaKe (Zhang et al., '19) | University of Amsterdam | 36.01 | 26.17
4 | Mar 25, 2019 | AKGCM (Liu et al., '20) | Baidu Inc | 34.72 | 30.84
5 | Sep 18, 2018 | GTTP (Moghe et al., '18) | Baseline | 25.13 | 11.05