| Title | Data | Other Resources | Venue | Topic | Code |
|---|---|---|---|---|---|
| Reverse Question Answering: Can an LLM Write a Question so Hard (or Bad) that it Can’t Answer? | - | - | NAACL 2025 | - | - |
| GRACE: A Granular Benchmark for Evaluating Model Calibration against Human Calibration | - | - | ACL 2025 | calibration benchmark | yysung/advcalibration |
| No Questions are Stupid, but some are Poorly Posed: Understanding Poorly-Posed Information-Seeking Questions | - | - | ACL 2025 | question quality | nehasrikn/poorly-posed-questions |
| Which of These Best Describes Multiple Choice Evaluation with LLMs? A) Forced B) Flawed C) Fixable D) All of the Above | - | - | ACL 2025 | evaluation methods | - |
| ADVSCORE: A Metric for the Evaluation and Creation of Adversarial Benchmarks | - | - | NAACL 2025 | adversarialness metric | - |
| Do great minds think alike? Investigating Human-AI Complementarity in Question Answering with CAIMIRA | - | Quizbowl collection | EMNLP 2024 | complementarity | maharshi95/neural-irt |
| You Make me Feel like a Natural Question: Training QA Systems on Transformed Trivia Questions | - | - | EMNLP 2024 | naturalized QA | Pinafore/qb2nq |
| PEDANTS (Precise Evaluations of Diverse Answer Nominee Text for Skinflints): Use Evaluation Metrics Wisely—Efficient Evaluation Analysis and Benchmarking for Open-Domain Question Answering | - | - | EMNLP Findings 2024 | evaluation | zli12321/PEDANTS-LLM-Evaluation |
| Automatic Explicitation to Bridge the Background Knowledge Gap in Translation and its Evaluation with Multilingual QA | - | - | EMNLP 2023 | translation and QA | - |
| Learning to Explain Selectively: A Case Study on Question Answering | Data | - | EMNLP 2022 | explanations | - |
| SimQA: Detecting Simultaneous MT Errors through Word-by-Word Question Answering | - | - | EMNLP 2022 | simultaneous MT QA | SimQA code |
| Cheater’s Bowl: Human vs. Computer Search Strategies for Open-Domain QA | Data | - | EMNLP Findings 2022 | - | Code |
| Re-Examining Calibration: The Case of Question Answering | - | - | EMNLP Findings 2022 | calibration | NoviScl/calibrateQA |
| Evaluation Examples Are Not Equally Informative: How Should That Change NLP Leaderboards? | Data | - | ACL 2021 | leaderboard analysis | leaderboard.pedro.ai |
| Distantly-Supervised Dense Retrieval Enables Open-Domain Question Answering without Evidence Annotation | - | - | EMNLP 2021 | dense retrieval | henryzhao5852/DistDR |
| Evaluation Paradigms in Question Answering | - | - | EMNLP 2021 | paradigm framing | - |
| Toward Deconfounding the Influence of Subject’s Demographic Characteristics in Question Answering | - | - | EMNLP 2021 | fairness | - |
| What’s in a Name? Answer Equivalence For Open-Domain Question Answering | - | - | EMNLP 2021 | answer equivalence | - |
| Multi-Step Reasoning Over Unstructured Text with Beam Dense Retrieval | - | - | NAACL 2021 | multistep retrieval | - |
| Complex Factoid Question Answering with a Free-Text Knowledge Graph | - | - | WWW 2020 | free-text KG QA | henryzhao5852/DELFT |
| Meta Answering for Machine Reading | - | - | arXiv 2020 | machine reading | - |
| Quizbowl: The Case for Incremental Question Answering | - | - | arXiv 2020 | incremental QA | QANTA site |
| What Question Answering can Learn from Trivia Nerds | - | - | ACL 2020 | perspective paper | - |
| Mitigating Noisy Inputs for Question Answering | - | - | Interspeech 2019 | noisy QA inputs | - |
| Can You Unpack That? Learning to Rewrite Questions-in-Context | Data | - | EMNLP 2019 | question rewriting | aagohary/canard |
| What AI can do for me: Evaluating Machine Learning Interpretations in Cooperative Play | - | - | IUI 2019 | interpretability in play | - |
| Trick Me If You Can: Human-in-the-loop Generation of Adversarial Question Answering Examples | Data | - | TACL 2019 | adversarial QA | Eric-Wallace/trickme-interface |