48k Grounded Explanations over 8k Edits

Multimodal disinformation, from deepfakes to simple edits that deceive, is an important societal problem. Yet the vast majority of media edits, such as a filtered vacation photo, are harmless. What separates such an edit from a harmful one that spreads disinformation is intent. Recognizing and describing this intent is a major challenge for today's AI systems.

We present the task of Edited Media Understanding, which requires models to answer open-ended questions that capture the intent and implications of an image edit. We introduce a dataset for this task, EMU, with 48k question-answer pairs written in rich natural language. The dataset serves as a testbed for the utility of artificial intelligence models in battling visual misinformation.

Paper

> Read on arXiv

Authors

This work was done by a team of researchers from the Allen Institute for AI, the University of Washington, Stanford University, and the University of Michigan.

Jeff Da, Allen Institute for AI
Max Forbes, Allen Institute for AI + University of Washington
Rowan Zellers, Allen Institute for AI + University of Washington
Anthony Zheng, University of Michigan
Jena Hwang, Allen Institute for AI
Antoine Bosselut, Allen Institute for AI + Stanford University
Yejin Choi, Allen Institute for AI + University of Washington