Black Swan Suite

Dataset Splits:

BlackSwanSuite has three task variants: multiple-choice (MCQ), yes/no (Y/N) and generative.
We provide validation and test splits, which appear as splits on each variant's Hugging Face dataset card.

Validation Set: The Validation subset is available for development work, where ground truth labels are provided.
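Because the validation labels are public, MCQ and Y/N predictions can be scored locally before submitting anything. The snippet below is a minimal sketch of that check, not the official scoring code: the example IDs, the label strings, and the case-insensitive matching are illustrative assumptions, and the leaderboard may normalise answers differently.

```python
from typing import Mapping


def accuracy(predictions: Mapping[str, str], labels: Mapping[str, str]) -> float:
    """Fraction of labelled examples whose prediction matches the ground truth.

    Examples with no prediction count as wrong. Matching ignores case and
    surrounding whitespace (an assumption, not the official rule).
    """
    if not labels:
        raise ValueError("labels must not be empty")
    correct = sum(
        1
        for example_id, gold in labels.items()
        if predictions.get(example_id, "").strip().lower() == gold.strip().lower()
    )
    return correct / len(labels)


# Toy data with hypothetical IDs and answers, standing in for a validation split.
val_labels = {"q1": "A", "q2": "C", "q3": "yes", "q4": "no"}
model_preds = {"q1": "A", "q2": "B", "q3": "Yes"}  # q4 is missing, so it counts as wrong

score = accuracy(model_preds, val_labels)
print(f"validation accuracy: {score:.2f}")
```

Keying both mappings by example ID keeps the score independent of row order and makes missing predictions visible as errors rather than silently dropping them.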

Test Set: The Test subset is available for evaluation. Ground truth labels are not provided, to prevent misuse of the dataset. Please submit to the public leaderboard to evaluate your model's performance on MCQ and Y/N variants. For the generative variant, please send us an email for an LLM Match score. We may take a few days to respond, so please be patient.

🤗 Access Data | Leaderboard

Note: To submit to the leaderboard, log in, then go to Participate, select a team, and accept the licence. The Submit tab then appears, with an example submission format.

The commonsense reasoning capabilities of vision-language models (VLMs), especially in abductive reasoning and defeasible reasoning, remain poorly understood. Most benchmarks focus on typical visual scenarios, making it difficult to discern whether model performance stems from keen perception and reasoning skills, or reliance on pure statistical recall. We argue that by focusing on atypical events in videos, clearer insights can be gained on the core capabilities of VLMs. Explaining and understanding such out-of-distribution events requires models to extend beyond basic pattern recognition and regurgitation of their prior knowledge. To this end, we introduce BlackSwanSuite, a benchmark for evaluating VLMs' ability to reason about unexpected events through abductive and defeasible tasks. Our tasks artificially limit the amount of visual information provided to models while questioning them about hidden unexpected events, or provide new visual information that could change an existing hypothesis about the event. We curate a comprehensive benchmark suite comprising over 3,800 MCQ, 4,900 generative and 6,700 yes/no tasks, spanning 1,655 videos. After extensively evaluating various state-of-the-art VLMs, including GPT-4o and Gemini 1.5 Pro, as well as open-source VLMs such as LLaVA-Video, we find significant performance gaps of up to 32% from humans on these tasks. Our findings reveal key limitations in current VLMs, emphasizing the need for enhanced model architectures and training strategies.

BibTeX

@inproceedings{chinchure2025black,
  title={Black Swan: Abductive and Defeasible Video Reasoning in Unpredictable Events},
  author={Chinchure, Aditya and Ravi, Sahithya and Ng, Raymond and Shwartz, Vered and Li, Boyang and Sigal, Leonid},
  booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference},
  pages={24201--24210},
  year={2025}
}
