CANARD: Question-in-Context Rewriting
CANARD is a crowdsourced dataset for question rewriting in conversational contexts. It contains 40,527 question-rewrite pairs, each pairing a question from a conversation with a context-independent rewrite. The dataset is designed to evaluate how well models resolve coreference and ellipsis, the two most common reasons conversational questions are hard to answer in isolation.
CANARD is built on top of QuAC, a conversational reading comprehension dataset based on Wikipedia articles.
Dataset Statistics
| Split | Examples |
|---|---|
| Train | 31,526 |
| Dev | 3,430 |
| Test | 5,571 |
| Reference (dual-annotated) | 100 pairs |
Data Format
Each JSON entry contains:
| Field | Description |
|---|---|
| History | Previous dialog utterances |
| Question | Original question from conversation |
| Rewrite | Context-independent rewrite |
| QuAC_dialog_id | Source dialog identifier |
| Question_no | Question number within dialog |
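
The fields above can be turned into source/target pairs for a rewriting model. The following is a minimal sketch, not the authors' preprocessing. It assumes each split file is a JSON array of entries, that History is a list of strings, and that Question_no is an integer. The file name `train.json`, the ` ||| ` separator, and the helper names are hypothetical, and the sample entry is made up for illustration rather than taken from the dataset.

```python
import json
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical delimiter between turns; any token absent from the text works.
SEPARATOR = " ||| "

# Made-up entry for illustration only; it is NOT taken from CANARD.
SAMPLE_ENTRY = {
    "History": [
        "Example Article",
        "Early life",
        "Where was she born?",
        "In a small town.",
    ],
    "Question": "Did she stay there?",
    "Rewrite": "Did she stay in the small town where she was born?",
    "QuAC_dialog_id": "example_dialog_0",
    "Question_no": 2,
}


def load_entries(path: str) -> List[Dict]:
    """Read one split file, assuming it holds a JSON array of entries."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def to_pair(entry: Dict, sep: str = SEPARATOR) -> Tuple[str, str]:
    """Join the history and the question into a source string.

    The target is the context-independent rewrite.
    """
    turns = list(entry["History"]) + [entry["Question"]]
    return sep.join(turns), entry["Rewrite"]


def group_by_dialog(entries: List[Dict]) -> Dict[str, List[Dict]]:
    """Group entries by source dialog, ordered by question number."""
    groups: Dict[str, List[Dict]] = defaultdict(list)
    for entry in entries:
        groups[entry["QuAC_dialog_id"]].append(entry)
    for dialog in groups.values():
        dialog.sort(key=lambda e: e["Question_no"])
    return dict(groups)


if __name__ == "__main__":
    source, target = to_pair(SAMPLE_ENTRY)
    print(source)
    print(target)
    # With a downloaded split (path is an assumption):
    # pairs = [to_pair(e) for e in load_entries("train.json")]
```

Grouping by QuAC_dialog_id keeps all questions from one conversation together, which is useful when evaluating per dialog or when checking that no conversation is shared between custom splits.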
Resources
- Paper: Can You Unpack That? Learning to Rewrite Questions-in-Context — EMNLP 2019
- Code & data: github.com/aagohary/canard
- License: CC BY-SA 4.0
Citation
Ahmed Elgohary, Denis Peskov, Jordan Boyd-Graber. Can You Unpack That? Learning to Rewrite Questions-in-Context. EMNLP 2019.
Contact
Ahmed Elgohary — elgohary@cs.umd.edu