Holl-E is a background-aware movie conversation dataset consisting of ~9K chats, each associated with at least three background resources. Every alternate utterance is explicitly generated by copying and/or modifying sentences from unstructured background knowledge such as plots, comments, and reviews about the movie. Holl-E paper (EMNLP 2018)
Download the dataset of 9k chats with at least three background resources per chat.
Our baselines consist of (1) pure generation-based models that ignore the background knowledge, (2) generation-based models that learn to copy information from the background knowledge when required, and (3) span-prediction-based models that predict the appropriate response span in the background knowledge. We also provide different splits of the dataset to test different capabilities of new architectures, as well as a multi-reference evaluation for our test set. Please see the code repository for more details.
Ask us questions at nikitamoghe29@gmail.com.
Holl-E evaluates the ability of new architectures to produce responses grounded in knowledge. For a given conversation context, the task is to generate responses by copying and/or modifying contiguous segments from the background knowledge. The resources consist of the plot of the movie, a review of the movie, some comments about the movie, or a fact table. We provide both the span-level annotation and the actual utterance for every alternate response. We use ROUGE-L and BLEU-4 as the performance measures. Other evaluation measures and human evaluation are discussed in the respective papers of the different models.
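The two reported metrics can be sketched as follows. This is a minimal illustration, assuming NLTK is installed: ROUGE-L is computed here as a simple sentence-level LCS F-measure, and BLEU-4 via NLTK's `sentence_bleu`. The official leaderboard numbers may use different implementations and tokenization, so treat this only as a reference for what the metrics measure.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def lcs_len(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l_f1(reference, hypothesis):
    """Sentence-level ROUGE-L F1 based on whitespace tokens."""
    ref, hyp = reference.split(), hypothesis.split()
    lcs = lcs_len(ref, hyp)
    if lcs == 0:
        return 0.0
    p, r = lcs / len(hyp), lcs / len(ref)
    return 2 * p * r / (p + r)

def bleu_4(reference, hypothesis):
    """BLEU-4 with smoothing so that short responses do not score exactly zero."""
    return sentence_bleu([reference.split()], hypothesis.split(),
                         weights=(0.25, 0.25, 0.25, 0.25),
                         smoothing_function=SmoothingFunction().method1)

ref = "the movie has a brilliant plot twist at the end"
hyp = "the movie has a great plot twist at the end"
print(f"ROUGE-L F1: {rouge_l_f1(ref, hyp):.3f}")
print(f"BLEU-4:     {bleu_4(ref, hyp):.3f}")
```

Both scores lie in [0, 1]; leaderboard entries below report them scaled to percentages.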
Rank | Date | Model | ROUGE-L | BLEU-4 |
---|---|---|---|---|
1 | Nov 21, 2019 | GLKS, Shandong University + University of Amsterdam (Ren et al., '19) | 38.69 | NA |
2 | Jun 16, 2019 | CaKe, University of Amsterdam (Zhang et al., '19) | 37.48 | 26.02 |
3 | Aug 18, 2019 | RefNet, Shandong University + University of Amsterdam (Meng et al., '19) | 37.11 | 27.00 |
4 | May 28, 2020 | SSS BERT, IIT Madras (Moghe et al., '20) | 35.20 | 22.78 |
5 | Sep 18, 2018 | GTTP, Baseline (Moghe et al., '18) | 25.67 | 13.92 |
Here, we report the test performance of various models on the mixed-long version of the dataset. In this setup, all the background resources are provided as-is with the chat. This setup measures the ability of the system to first retrieve the correct background resource and then generate a response.
Rank | Date | Model | ROUGE-L | BLEU-4 |
---|---|---|---|---|
1 | Nov 21, 2019 | GLKS, Shandong University + University of Amsterdam (Ren et al., '19) | 30.36 | NA |
2 | Aug 18, 2019 | RefNet, Shandong University + University of Amsterdam (Meng et al., '19) | 29.64 | 17.19 |
3 | Sep 18, 2018 | GTTP, Baseline (Moghe et al., '18) | 17.35 | 7.51 |
Here, we report the test performance of various models on the mixed-short version of the dataset. In this setup, all the background resources are provided, but truncated so that the total number of tokens is limited to 256.
Rank | Date | Model | ROUGE-L | BLEU-4 |
---|---|---|---|---|
1 | Nov 21, 2019 | GLKS, Shandong University + University of Amsterdam (Ren et al., '19) | 39.63 | NA |
2 | Aug 18, 2019 | RefNet, Shandong University + University of Amsterdam (Meng et al., '19) | 36.17 | 29.38 |
3 | Jun 16, 2019 | CaKe, University of Amsterdam (Zhang et al., '19) | 36.01 | 26.17 |
4 | Mar 25, 2019 | AKGCM, Baidu Inc (Liu et al., '20) | 34.72 | 30.84 |
5 | Sep 18, 2018 | GTTP, Baseline (Moghe et al., '18) | 25.13 | 11.05 |