CANARD: Question-in-Context Rewriting

CANARD is a crowdsourced dataset for question rewriting in conversational contexts. It contains 40,527 question-rewrite pairs, each matching a question asked mid-conversation with a rewrite that can be understood without the preceding dialog. The pairs are designed to evaluate how well models handle coreference resolution and ellipsis, the two most common reasons conversational questions are hard to answer in isolation.

CANARD is built on top of QuAC, a conversational reading comprehension dataset based on Wikipedia articles.

Dataset Statistics

Split                        Examples
Train                        31,526
Dev                          3,430
Test                         5,571
Reference (dual-annotated)   100 pairs

Data Format

Each JSON entry contains:

Field            Description
History          Previous dialog utterances
Question         Original question from conversation
Rewrite          Context-independent rewrite
QuAC_dialog_id   Source dialog identifier
Question_no      Question number within dialog
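
The fields above can be read with a few lines of Python. The following is a minimal sketch, not part of the dataset release. It assumes a split is a JSON array of objects with the listed fields and that History is a list of strings. The two sample records, the helper names (build_source, group_by_dialog), and the " ||| " separator are hypothetical and chosen only for illustration.

```python
import json
from collections import defaultdict

# Hypothetical records for illustration only; they are NOT taken from CANARD.
# Assumption: History is a list of strings holding the earlier utterances.
SAMPLE_JSON = """
[
  {"History": ["Ada Lovelace", "Early life",
               "Where was she born?", "She was born in London."],
   "Question": "Who taught her?",
   "Rewrite": "Who taught Ada Lovelace?",
   "QuAC_dialog_id": "dialog_0",
   "Question_no": 2},
  {"History": ["Ada Lovelace", "Early life"],
   "Question": "Where was she born?",
   "Rewrite": "Where was Ada Lovelace born?",
   "QuAC_dialog_id": "dialog_0",
   "Question_no": 1}
]
"""


def build_source(entry, sep=" ||| "):
    """Join the history and the in-context question into one input string."""
    return sep.join(entry["History"] + [entry["Question"]])


def group_by_dialog(entries):
    """Map each dialog id to its entries, ordered by Question_no."""
    dialogs = defaultdict(list)
    for entry in entries:
        dialogs[entry["QuAC_dialog_id"]].append(entry)
    for turns in dialogs.values():
        turns.sort(key=lambda e: e["Question_no"])
    return dict(dialogs)


entries = json.loads(SAMPLE_JSON)
dialogs = group_by_dialog(entries)

# (source, target) pairs in the shape a rewriting model would train on.
pairs = [(build_source(e), e["Rewrite"]) for e in dialogs["dialog_0"]]
for source, target in pairs:
    print(source, "->", target)
```

To use a real split, replace json.loads(SAMPLE_JSON) with json.load on the opened file, and check the History type first, since the list-of-strings layout is an assumption here.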

Resources

Citation

Ahmed Elgohary, Denis Peskov, Jordan Boyd-Graber. Can You Unpack That? Learning to Rewrite Questions-in-Context. EMNLP 2019.

Contact

Ahmed Elgohary — elgohary@cs.umd.edu