Efficient Multimodal Question Answering (EMM-QA)

Workshop room: ASEM Ballroom 201
How to get there: From the registration desk on the ground floor, follow the hallway to the Auditorium, then take the escalator inside the Auditorium to the second floor.

EMM-QA is an ICML 2026 workshop focused on question answering systems that must balance accuracy, efficiency, and adaptability across multiple input modalities. The workshop brings together researchers from academia and industry working on knowledge-intensive multimodal systems that operate under practical resource constraints.

Rather than focusing only on larger models, the workshop emphasizes methods that make multimodal question answering usable in real settings, including retrieval-augmented systems, compact models, efficient inference, and human-in-the-loop evaluation.

ANNOUNCEMENTS
📌 The exact Contributed Paper Spotlight Schedule is now available.

📌 The list of Accepted Papers and Shared Challenge Papers is now available! ⚠️ Virtual poster presentations are also available here under the ▶️ icon!

📌 Check the Accepted Papers, Poster Sessions and Board Assignments section at the bottom of this page to find your poster ID, session, and board number in Hall A.

📌 Workshop Schedule is now available.

📌 Call for Papers

📌 Computer Teams

📌 Join the community on Discord.

Scope

The workshop is centered on efficient multimodal question answering. It also welcomes closely related work on multimodal retrieval, reasoning, evaluation, benchmarking, and efficient inference when those contributions are clearly connected to question answering or other knowledge-intensive multimodal tasks.

Like the previous iteration of EfficientQA, which focused on text-only question answering, this workshop will host a human-computer question answering competition. If you’d like to take part in it (it should be fun!), you can either play as a team or write questions.

Workshop Format

The workshop is planned as a one-day event combining:

Contributed papers
Poster presentations
Invited keynotes
Shared-task highlights
A live human-computer question answering event

The workshop will also serve as the venue where we announce the winning systems from the QANTA 2026 computer competition.

Schedule

The workshop takes place on July 11th.
All talk sessions (invited talks, spotlights, challenge talks, awards, etc.) will take place in ASEM Ballroom 201 at COEX.
All poster sessions will take place separately in Hall A, at poster boards 1612–1617 and 1700–1711, outside the workshop room area at COEX. You can review the Hall A plan here.

Time	Activity	Duration
08:00‑08:10	Welcome & Workshop Overview	10 min
08:10‑08:50	🟦 Naman Goyal & Jenny Ni: Multimodal Robustness Under Distribution Shift	40 min
08:50‑09:00	Q&A	10 min
09:00‑09:15	☕ Coffee Break	15 min
09:15‑09:55	🟦 Sewon Min: PIXELRAG: Web Screenshots Beat Text for Retrieval-Augmented Generation	40 min
09:55‑10:05	Q&A	10 min
10:05‑10:50	🟨 Contributed Paper Spotlights I	45 min
10:05‑10:15	🟨 Mirage Probes: How Vision Models Fake Visual Understanding — Daniel Ben-Levi et al.	10 min
10:16‑10:26	🟨 VLMs Trace Without Tracking: Diagnosing Failures in Visual Path Following — Hyesoo Hong et al.	10 min
10:27‑10:37	🟨 DistortBench: Benchmarking Vision Language Models on Image Distortion Identification — Divyanshu Goyal et al.	10 min
10:38‑10:48	🟨 CCDiff: Inverse Canonical Correlation Analysis for Discovering Visual Differences in Natural Language (Virtual) — Neelesh Bisht et al.	10 min
10:50‑11:50	🟧 Workshop Posters I	60 min
11:50‑12:50	Lunch	60 min
12:50‑13:20	🤖 Live AI QA Competition	30 min
13:20‑14:00	🟦 Mrinmaya Sachan: Behavioral Shortcuts and Cross-Modal Alignment Failures of Multimodal Large Language Models	40 min
14:00‑14:10	Q&A	10 min
14:10‑14:50	🟦 Robin Jia: The Golden Age of Adversarial Evaluation	40 min
14:50‑15:00	Q&A	10 min
15:00‑15:15	☕ Coffee Break	15 min
15:15‑15:35	Shared Challenge Introduction & Results Overview	20 min
15:35‑15:55	🟨 Contributed Paper Spotlights II	20 min
15:35‑15:45	🟨 Stop Thinking, Start Looking: Efficient Post-Training for Multimodal Document Question Answering via Reasoning-Free Alignment — Harikrishnan Puthan Madathil et al.	10 min
15:45‑15:55	🟨 FAGER: Factually Grounded Evaluation and Refinement of Text-to-Image Models (Virtual) — Youngsun Lim et al.	10 min
15:55‑16:05	🏆 Challenge Awards	10 min
16:05‑16:10	Closing Remarks	5 min
16:10‑17:00	🟧 Workshop Posters II + Shared Challenge Posters	50 min

Legend

🟦 Invited Talks
🟨 Contributed Paper Spotlights / Best Challenge Team Talk
🟧 Poster Sessions

Confirmed Keynote Speakers

Sewon Min (UC Berkeley EECS & Allen Institute for AI)
Mrinmaya Sachan (ETH Zürich)
Robin Jia (University of Southern California)
Naman Goyal (Google DeepMind) & Jenny Ni (Google)

Organizers

Jordan Boyd-Graber, University of Maryland
Martin Fajčík, Brno University of Technology
George Jojo Boateng, ETH Zürich / Kwame AI
Ikuya Yamada, Studio Ousia / Tohoku University / Nagoya University / RIKEN
Chen Zhao, NYU Shanghai

Contact

Questions about the workshop can be sent to emm-qa-organizers@googlegroups.com, or you can join the Discord.

Accepted Papers, Poster Sessions and Board Assignments

🗺️ The plan of Hall A is here.

Poster ID	Session	Board	Topic	Paper	Authors
P01	1	1612	Multimodal Retrieval QA	Scaling Down: Multi-Hop Information Retrieval in Resource-Constrained Environments	Nikolay Staroverov
P02	1	1612	Multimodal Retrieval QA	Latent Abstraction for Retrieval-Augmented Generation	Ha-Lan Nguyen; Nguyen A Minh; Dung D. Le
P03	1	1613	Multimodal Retrieval QA	Salient Knowledge Pathways: Sparse Cross-Modal Routing for Efficient Knowledge-Intensive Multimodal Question Answering	Noor Islam S. Mohammad
P04	1	1613	Multimodal Retrieval QA	Visual Evidence Collapse as the Missing Signal in Adaptive Multimodal Retrieval: A Position	Göktuğ Aslanoğlu
P05	1	1614	Multimodal Retrieval QA	Look Before You Zoom: Adaptive Routing for the Resolution-Context Trade-off in Visual RAG	Oanh N Tran; Thanh Quoc Hung Le; Oscar Chew; Kuan-Hao Huang; Khoa Doan
P06	1	1614	Multimodal Retrieval QA	PixelRAG: Retrieval and Reading in Pixel Space over Millions of Web Screenshots	Yichuan Wang; Zhifei Li; Zirui Wang; Paul Teiletche; Lesheng Jin; Matei Zaharia; Joseph E Gonzalez; Sewon Min
P07	1	1615	Multimodal Retrieval QA	Read or Look? Textification Safety Boundaries for Efficient Multimodal Document QA	Pan Xie
P08	1	1615	Multimodal Retrieval QA	What Do Chart Question Answering Benchmarks Measure? Task-Specific Visual Dependence in Efficient Multimodal QA	Mary Le
P09	1	1616	Visual Token Efficiency	One Token per Multimodal Evidence: Latent Memory for Resource-Constrained QA	Zhi Zheng; Ziqiao Meng; Hao Luan; Wei Liu; Wee Sun Lee
P10	1	1616	Visual Token Efficiency	A Picture is Worth a Thousand Tokens: A Practitioner’s Guide to Visual Token Compression	Vipul Joshi; Vikash Sharma; Amir Raza; Peyush Jain; Mayank Jauhari; Anurag Tripathi
P11	1	1617	Visual Token Efficiency	Efficient Visual Token Compression for 3D Question Answering	Changwoo Baek; Kyeongbo Kong
P12	1	1617	Visual Token Efficiency	LARE: Low-Attention Region Encoding for Text–Image Retrieval	Muhammad K Khan; Abdulmalik Alquwayfili; Faisal AlMeshal; Jumanah Almajnouni; Leena Alotaibi; Huda A Alamri; Raied Aljadaany; Faisal alhajari; Alreem Almuhrij; Mohammed Alkhrashi; Abdullah Aldwyish
P13	1	1700	Data-Efficient QA	Stop Thinking, Start Looking: Efficient Post-Training for Multimodal Document Question Answering via Reasoning-Free Alignment	Harikrishnan Puthan Madathil; Goutham Vignesh; Ganesh Parab; Saisubramaniam Gopalakrishnan; Vishal Vaddina; Varun V; Rohit Agrawal
P14	1	1700	Data-Efficient QA	Data-Efficient Curation for Multimodal Reasoning under Fixed Training Protocols	Yosub Shin; Michael Buriek; Boris Sobolev; Pavel Bushuyeu; Vikas Kumar; Samuel Watson; Haoyang Xu; Igor Molybog
P15	1	1701	Confidence and Abstention	Confidence-Aware Incremental Multimodal QA with Early Exit Reasoning	Kaustubha V; Shravan K; Harini S
P16	1	1701	Confidence and Abstention	Calibrated Confidence Is Hard to Beat: A Negative Result on Evidence-Accumulation Buzzing for Zero-Shot Incremental Question Answering	Arihant Jain
P17	1	1702	Confidence and Abstention	Cheap When Confident: The Per-Model Limits of Selective Answering in Multimodal Geometry QA	Vipra Bindal
P18	1	1702	Confidence and Abstention	Context-Aware Conformal Prediction for VLM-Based Driving Scene QA	Yang-Jun Joo; Younggun Kim; Sung-Yeon Park; Mohamed Abdel-Aty
P19	1	1703	Medical QA	Answer, Clarify, or Abstain: Fine-Grained Selective Prediction for Medical VLMs	Duy A Nguyen; Tuan T Nguyen; Minh Do; Khoa Doan
P20	1	1703	Medical QA	Efficient Multimodal Clinical Question Answering for Pulmonary Embolism Risk Assessment	Xiangyuan Xue; Yang Yu; Yan Gao; Junyan Wang; Bin Chen; Lingyan Ruan; Ting Dang; Hong Jia
P21	1	1704	Medical QA	Iterative Multimodal Retrieval-Augmented Generation for Medical Question Answering	Xupeng Chen; Binbin Shi; Chenqian Le; Jiaqi Zhang; Kewen Wang; Ran Gong; Jinhan Zhang; Chihang Wang
P22	1	1704	Medical QA	Pocket-Dentist: Efficient Multimodal QA for On-Device Dental Image Understanding	Kai Bian; Xucheng Guo; Lingyan Ruan; Bin Chen; Yiran Shen; Ting Dang; Hong Jia
P23	1	1705	Missing Modality QA	Recovering a Missing Modality in Multimodal QA via Low-Rank Completion	Jing Wang; Huan Xu; Jie Shen
P24	2	1612	VLM Robustness Benchmarks	DistortBench: Benchmarking Vision Language Models on Image Distortion Identification	Divyanshu Goyal; Akhil Eppa; Vanya Bannihatti Kumar
P25	2	1612	VLM Robustness Benchmarks	AMVICC: A Novel Benchmark for Cross-Modal Failure Mode Profiling for VLMs and IGMs	Pranay Goel; Aahana Basappa; Anusri Karra; Anish Karra; Kevin Zhu
P26	2	1613	VLM Robustness Benchmarks	VLMs Trace Without Tracking: Diagnosing Failures in Visual Path Following	Hyesoo Hong; Min Soo Kim; Wonje Jeung; Sangyeon Yoon; Dongjae Jeon; Albert No
P27	2	1613	VLM Failure Analysis	Pop-Up Distractions Reveal Bag-of-Events Behavior in Video Large Language Models	Oscar Chew; Serhii Honcharenko; Qian-Hui Chen; Patricia Lu; Dishant P Zaveri; Khoa Doan; Kuan-Hao Huang
P28	2	1614	VLM Failure Analysis	Detection-Guided Attention Steering for Vision Language Models	Alan W Zhang; Rui Pan; Mike Wong; Ravi Netravali
P29	2	1614	VLM Failure Analysis	Mirage Probes: Selective Construction of False Images in Vision Language Model Inference	Daniel Ben-Levi; Judah A Goldfeder; Weiliang Zhao; Raz Lapid; Amit LeVi; Allen Roush; Ravid Shwartz-Ziv; Hod Lipson
P30	2	1615	Temporal Multimodal Understanding	WorldMemArena: Evaluating Multimodal Agent Memory Through Action–World Interaction	Chengzhi Liu; Yuzhe Yang; Sophia X Pu; Yepeng Liu; Lin Long; Yichen Guo; Nuo Chen; Zhaotian Weng; Yiheng Zhong; Angxiao Zong; Elena Kochkina; Simerjot Kaur; Charese Smiley; Xiaomo Liu; James Zou; Sheng Liu; Yuheng Bu; Songyou Peng; Xin (Eric) Wang
P31	2	1615	Temporal Multimodal Understanding	Multimodal Hidden Markov Models for Persistent Emotional State Tracking	Anamika Ragu; Aneesh Jonelagadda
P32	2	1616	Temporal Multimodal Understanding	Augmenting Video Models with Pose Tokens for Human-Centric Understanding	Isabella Zhu; Stella Su; David Dai; Paul Pu Liang
P33	2	1616	Generative Media Detection	Position: Efficient Dual-View AI-Generated Video Detection Needs a Visual-Language Dual View	Dylan Xinming Hou; Juntian Zhang; Xu Gu; Yichen Wu; Nils Lukas; Gus Xia; Xiuying Chen; Yuhan Liu
P34	2	1617	Text-to-Image Evaluation	FAGER: Factually Grounded Evaluation and Refinement of Text-to-Image Models	Youngsun Lim; Cusuh Ham; Pin-Yu Chen; Deepti Ghadiyaram
P35	2	1617	Text-to-Image Evaluation	CCDiff: Inverse Canonical Correlation Analysis for Discovering Visual Differences in Natural Language	Neelesh Bisht; Xingjian Li; Zihan Li; Bo Jiang; Runmin Jiang; Mostofa Rafid Uddin; Yang Liu; Min Xu
P36	2	1700	Multimodal Generation Tools	MMDiff: Multimodal Model Diffing for Feature Discovery and Control	Lachin Naghashyar; Hunar Batra; Ashkan Khakzar; Phil Torr; Ronald Clark; Christian Schroeder de Witt; Constantin Venhoff
P37	2	1700	Multimodal Generation Tools	Any2Poster: Any-Source Poster Generation Across Modalities and Domains	Amogh Vinaykumar; Suozhi Huang; Aiden Y Li; Shilong Liu
P38	2	1701	Multimodal Generation Tools	From Synopsis to Storyboard: Enhancing Prompt Expressiveness for Multi-shot Video Generation	Xu Gu; Yuhan Liu; Wei Ji; Ruihua Song; Roger Zimmermann
P39	2	1701	Reasoning Evaluation	Auditable Step Verification for Vision-Language Reasoning	Shervin Ghasemlou
P40	2	1702	Reasoning Evaluation	Does Verbose Chain-of-Thought Really Help? A Dual-Validator Replication and Mechanistic Analysis	Wenlong Wang; Fergal Reid
P41	2	1702	Reasoning Evaluation	The Shape of Overthinking: Backtracking Bursts in Long Reasoning Traces	Navid Rezazadeh; Arash Gholamidavoodi
P42	2	1703	Reasoning Evaluation	Can LLMs Reconstruct Why an Answer Is Correct? A Step-Level Evaluation of Answer-Conditioned Reasoning	Lei Sun
P43	2	1703	Reasoning Evaluation	Coherence Under Commitment: Probing Generalization and Vacuous Memorization in LLM Logical Reasoning	Noor Islam S. Mohammad; Mahmudul Hasan
P44	2	1704	Model Alignment	Co-LoRA: Collaborative Model Personalization on Heterogeneous Multi-Modal Clients	Minhyuk Seo; Taeheon Kim; Hankook Lee; Jonghyun Choi; Tinne Tuytelaars
P45	2	1704	Model Alignment	SOTAlign: Semi-Supervised Alignment of Unimodal Vision and Language Models via Optimal Transport	Simon Roschmann; Paul Krzakala; Sonia Mazelet; Quentin Bouniot; Zeynep Akata

Sponsors/Acknowledgements

This workshop is partially supported by the Horizon EU programme through the ELOQUENCE project, grant no. 101135916.