CVPR 2026 Denver, Colorado SpookyBench · time-only benchmark
Vision-Language Models · Temporal Reasoning · Benchmark

Time Blindness.

Why Video-Language Models can't see what humans can.

¹KAUST    ²VILA Lab, MBZUAI    ³DocPanel Technologies  ·  *Equal contribution  ·  Corresponding authors
Humans: 98% (6 annotators · perceptibility 4.7 / 5)
Every Video-VLM: 0% (25+ models · 2 B–78 B · open + closed)
Come say hi at CVPR 2026 👋
June 3–7, 2026 · Denver, Colorado  —  find us at Poster Session 5 & the Exhibit Hall (ExHall F), the morning of June 7.
Overview Video
A two-minute walkthrough of SpookyBench and the time-blindness phenomenon — why every Video-VLM scores 0% on stimuli humans read at 98%.
01 The Question

If a video carries information only in how frames change over time — and each frame is pure noise — can today's Video-VLMs see it?

Nature already does this. Fireflies communicate through flash sequences, EEG pathologies emerge from temporal patterns, and Morse code carries language without spatial form. Yet modern video understanding pipelines extract frame features first and treat time as an afterthought.

02 The Spatial-Reliance Gap
[Figure: Spatial reliance of VLMs]
Frame sampling discards time. Visual encoders bias toward spatial features (objects, layouts); temporal features (motion, causality) stay under-represented — opening the coherence gap we exploit.
03 SpookyBench: Time-Only Stimuli

A synthetic benchmark where each frame is pure noise and the content emerges only from how the noise moves.

  • 451 videos · 960×540 · avg 7.1 s · ~333 frames
  • 3 categories: Text (46.6%), Object Images (40.8%), Dynamic Scenes (12.6%)
  • Opposing-motion noise for text/images; threshold-gated motion for depth videos
  • A released generator makes the dataset indefinitely extendable
04 See It For Yourself: Play the Stimuli

Each clip below is pure noise frame-by-frame. Play it and the content snaps into view; pause it and every frame is static. This is the effect VLMs cannot perceive.

Text · 46.6%
Words encoded as opposing-motion noise. 210 videos.
Shapes · object silhouettes
Geometric forms revealed by co-moving pixels.
Object Images · 40.8%
SAM2 silhouettes of generated objects. 184 videos.
Dynamic Scenes · 12.6%
Depth maps from LaSOT / OTB2015. 57 videos.
Each frame is noise. Content emerges only when played — content-mask animation for text/objects, threshold-gated motion for depth scenes.
05 How the Signal Hides in Time
[Figure: Noise merging diagram]
Opposing motion is the trick. A content mask drives foreground noise up and background noise down. The human visual system groups co-moving pixels (Gestalt common fate) and the content emerges — only while the video plays. Pause it, and every frame looks like static.
[Figure: SNR threshold · step function]

Detection is a step function. Below ~2.5 dB SNR, accuracy is ~0%; above it, accuracy jumps to ≥85%. Comparable thresholds: images at 6.0 dB, dynamic scenes at 9.0 dB.

Common fate. Humans bind pixels that move together into a single object — a temporal grouping mechanism current VLMs simply do not have.
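The opposing-motion construction described above can be sketched in a few lines. This is a minimal, pure-Python illustration and not the released generator: the function name `opposing_motion_frames`, the binary noise, the one-pixel-per-frame vertical scroll, and the wrap-around at the borders are all assumptions made for the example.

```python
import random

def opposing_motion_frames(mask, n_frames, seed=0):
    """Render binary-noise frames from a 0/1 content mask (list of rows).

    Foreground pixels (mask == 1) show a noise texture that scrolls up by one
    pixel per frame; background pixels show an independent texture that scrolls
    down. Both textures wrap around vertically. Illustrative assumptions only.
    """
    rng = random.Random(seed)
    h, w = len(mask), len(mask[0])
    fg = [[rng.randint(0, 1) for _ in range(w)] for _ in range(h)]
    bg = [[rng.randint(0, 1) for _ in range(w)] for _ in range(h)]
    frames = []
    for t in range(n_frames):
        frame = [
            [fg[(y + t) % h][x] if mask[y][x] else bg[(y - t) % h][x]
             for x in range(w)]
            for y in range(h)
        ]
        frames.append(frame)
    return frames

# Hypothetical 8x8 mask with a 4x4 square of "content" in the middle.
mask = [[1 if 2 <= y < 6 and 2 <= x < 6 else 0 for x in range(8)] for y in range(8)]
frames = opposing_motion_frames(mask, n_frames=5)
```

Any single frame here is just a grid of random bits. The mask shows up only in the relationship between consecutive frames: a foreground pixel repeats the value that sat one row below it in the previous frame, while a background pixel repeats the value one row above.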
06 Frames Carry Nothing
Category          Basic SNR   Perceptual SNR   Temporal Coherence   Motion Contrast
Images              −46.95        −47.28              8.00                7.17
Dynamic Scenes      −48.95        −63.43             21.91               −3.18
Text                −39.27        −49.18              7.84                8.26

Read: Dynamic Scenes have the highest temporal coherence (21.91) yet the lowest, slightly negative motion contrast (−3.18), the regime current VLMs cannot exploit.

Basic SNR: Motion energy ÷ frame variance.
Perceptual SNR: Weighted to human contrast sensitivity.
Temporal Coherence: Motion vectors aligned over time.
Motion Contrast: Foreground vs background motion.
07 The 0% Wall

Humans read these stimuli at 98%. Every Video-VLM we evaluated — open or closed, 2 B to 78 B parameters, with or without chain-of-thought — collapses to a flat 0%.

Model                          Direct Prompt   Chain-of-Thought   Params
Human Performance              98.0 ± 0.6

Open-Source Video-VLMs
VideoLLaMA3-7B                 0.0             0.0                7B
VideoLLaMA3-2B                 0.0             0.0                2B
TimeChat-7B                    0.0             0.0                7B
MiniGPT4-Video                 0.0             0.0                7B
MovieChat                      0.0             0.0                7B
Video-ChatGPT-7B               0.0             0.0                7B
VideoGPT-plus-Phi3-mini-4k     0.0             0.0                7B
VILA1.5-13B                    0.0             0.0                13B
ShareGPT4Video-8B              0.0             0.0                8B
VideoLLaMA2-7B                 0.0             0.0                7B
Video-LLaVA                    0.0             0.0                7B
LLaVA-NeXT-Video               0.0             0.0                8B
InternVL2-40B                  0.0             0.0                40B
InternVL2-8B                   0.0             0.0                8B
InternVL2.5-78B                0.0             0.0                78B
InternVL2.5-8B                 0.0             0.0                8B
InternVideo2.5-Chat-8B         0.0             0.0                8B
InternVideo2-Chat-8B           0.0             0.0                8B
Qwen2-VL-2B-Instruct           0.0             0.0                2B
Qwen2-VL-7B-Instruct           0.0             0.0                7B
Qwen2-VL-72B-Instruct          0.0             0.0                72B
Qwen2.5-VL-3B-Instruct         0.0             0.0                3B
Qwen2.5-VL-7B-Instruct         0.0             0.0                7B
Qwen2.5-VL-72B-Instruct        0.0             0.0                72B

Closed-Source Frontier
Gemini 1.5 Pro                 0.0             0.0
Gemini 2.0 Flash               0.0             0.0
GPT-4o                         0.0             0.0

Accuracy (%) on SpookyBench. Failure is invariant to model size, family, and prompting strategy — not a single model exceeds chance.

08 It's Architectural: Not Data, Not Prompting
  • Fine-tuning fails. InternVL2.5-8B & Qwen2-VL-7B on 400 videos × 30 epochs → still 0%.
  • VJEPA-2 & DINOv3 can't even overfit "is there content?" — loss saturates at chance.
  • Explicit motion cues unlock it. Overlaying Farneback motion boundaries lifts Qwen2-VL-7B → 51.5% and GPT-4o → 59.1%.
Diagnosis: Today's VLMs lack a mechanism for frame differencing / temporal integration. The information is computationally extractable — it just never reaches the language model.
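To make "computationally extractable" concrete, the sketch below recovers a content mask from noise frames using nothing but comparisons between consecutive frames. It is a toy illustration under stated assumptions, not the paper's analysis: the helper `toy_stimulus`, the one-pixel vertical scroll, the wrap-around borders, and the voting rule in `decode_mask` are all hypothetical.

```python
import random

def toy_stimulus(mask, n_frames, seed=0):
    """Assumed toy stimulus: foreground noise scrolls up, background noise down."""
    rng = random.Random(seed)
    h, w = len(mask), len(mask[0])
    fg = [[rng.randint(0, 1) for _ in range(w)] for _ in range(h)]
    bg = [[rng.randint(0, 1) for _ in range(w)] for _ in range(h)]
    return [
        [[fg[(y + t) % h][x] if mask[y][x] else bg[(y - t) % h][x]
          for x in range(w)] for y in range(h)]
        for t in range(n_frames)
    ]

def decode_mask(frames):
    """Label each pixel by which motion direction explains it better over time.

    For every consecutive frame pair, a pixel votes "up" if it repeats the value
    that was one row below it, and "down" if it repeats the value one row above.
    Pixels with more up votes than down votes are labeled foreground (1).
    """
    h, w = len(frames[0]), len(frames[0][0])
    up = [[0] * w for _ in range(h)]
    down = [[0] * w for _ in range(h)]
    for prev, cur in zip(frames, frames[1:]):
        for y in range(h):
            for x in range(w):
                up[y][x] += cur[y][x] == prev[(y + 1) % h][x]
                down[y][x] += cur[y][x] == prev[(y - 1) % h][x]
    return [[1 if up[y][x] > down[y][x] else 0 for x in range(w)] for y in range(h)]

# Hypothetical 32x32 mask containing a 16x16 square.
size = 32
mask = [[1 if 8 <= y < 24 and 8 <= x < 24 else 0 for x in range(size)] for y in range(size)]
frames = toy_stimulus(mask, n_frames=40)
decoded = decode_mask(frames)
agreement = sum(decoded[y][x] == mask[y][x]
                for y in range(size) for x in range(size)) / size**2
print(f"pixel agreement with the true mask: {agreement:.3f}")
```

In this toy setup the only ambiguous pixels are those on the horizontal edge where the two regions meet, because the neighbor their motion would copy from belongs to the other region. Everything else is decided by simple frame-to-frame comparison accumulated over time, which is the kind of operation the diagnosis says never reaches the language model.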
09 What Unlocks It

Pre-compute motion boundaries with classical optical flow, overlay them on the noisy frames, and the same models suddenly score in the 50s.

Model          Frames only   + motion overlay
Qwen2-VL-7B    0.0           51.5
GPT-4o         0.0           59.1

This confirms the failure is architectural: temporal information is computationally extractable, but VLMs can't reach it without spatial hand-holding. Dynamic Scenes stay hard (≤3.5%).
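The overlay step can be pictured as follows. This sketch assumes a per-pixel vertical motion sign has already been estimated (the page uses Farneback optical flow for that; the estimation itself is not reproduced here) and shows only one simple way to turn such a map into boundaries and paint them onto a frame. The function names, the neighbor-comparison boundary rule, and the highlight value are illustrative assumptions.

```python
def motion_boundaries(direction):
    """Mark pixels whose motion sign differs from a right or lower neighbor.

    `direction` is a list of rows holding +1 (moving up) or -1 (moving down).
    Returns a same-sized 0/1 map with 1 on boundary pixels.
    """
    h, w = len(direction), len(direction[0])
    edges = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            right = x + 1 < w and direction[y][x] != direction[y][x + 1]
            below = y + 1 < h and direction[y][x] != direction[y + 1][x]
            if right or below:
                edges[y][x] = 1
    return edges

def overlay(frame, edges, highlight=255):
    """Return a copy of `frame` with boundary pixels set to `highlight`."""
    return [
        [highlight if e else v for v, e in zip(frame_row, edge_row)]
        for frame_row, edge_row in zip(frame, edges)
    ]

# Hypothetical 4x4 example: a 2x2 block moving up inside a field moving down.
direction = [
    [-1, -1, -1, -1],
    [-1, +1, +1, -1],
    [-1, +1, +1, -1],
    [-1, -1, -1, -1],
]
noisy_frame = [[0, 128, 0, 128] for _ in range(4)]
edges = motion_boundaries(direction)
marked = overlay(noisy_frame, edges)
```

The point of the design is that the overlay converts a temporal cue into a spatial one: after this step the outline is visible in a single frame, which is exactly the kind of evidence a frame-based encoder can use.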

10 Take-Aways
  • "Time-blindness" is a robust, architecture-level failure of every modern Video-VLM.
  • Future models need dedicated temporal pathways — distributed timing, motion-energy estimation, motion-based figure-ground.
  • SpookyBench + generator released to catalyze this direction.
Citation
@inproceedings{upadhyay2026timeblindness,
  title         = {Time Blindness: Why Video-Language Models Can't See What Humans Can},
  author        = {Upadhyay, Ujjwal and Ranjan, Mukul and Shen, Zhiqiang and Elhoseiny, Mohamed},
  booktitle     = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year          = {2026},
  eprint        = {2505.24867},
  archivePrefix = {arXiv}
}