Time Blindness: Why Video-Language Models Can't See What Humans Can

Overview Video

A two-minute walkthrough of SpookyBench and the time-blindness phenomenon — why every Video-VLM scores 0% on stimuli humans read at 98%.

01The Question

If a video carries information only in how frames change over time — and each frame is pure noise — can today's Video-VLMs see it?

Nature already does this. Fireflies communicate through flash sequences, EEG pathologies emerge from temporal patterns, Morse code carries language without spatial form. Yet modern video understanding pipelines extract frame features first and treat time as an afterthought.

02The Spatial-Reliance Gap

Frame sampling discards time. Visual encoders bias toward spatial features (objects, layouts); temporal features (motion, causality) stay under-represented — opening the coherence gap we exploit.

03SpookyBench — Time-Only Stimuli

A synthetic benchmark where each frame is pure noise and the content emerges only from how the noise moves.

451 videos · 960×540 · avg 7.1 s · ~333 frames
3 categories: Text (46.6%), Object Images (40.8%), Dynamic Scenes (12.6%)
Opposing-motion noise for text/images; threshold-gated motion for depth videos
Generator releases an indefinitely extendable dataset

04See It For Yourself — Play the Stimuli

Each clip below is pure noise frame-by-frame. Play it and the content snaps into view; pause it and every frame is static. This is the effect VLMs cannot perceive.

Text · 46.6%

Words encoded as opposing-motion noise. 210 videos.

Shapes · object silhouettes

Geometric forms revealed by co-moving pixels.

Object Images · 40.8%

SAM2 silhouettes of generated objects. 184 videos.

Dynamic Scenes · 12.6%

Depth maps from LaSOT / OTB2015. 57 videos.

Each frame is noise. Content emerges only when played — content-mask animation for text/objects, threshold-gated motion for depth scenes.

05How the Signal Hides in Time

Opposing motion is the trick. A content mask drives foreground noise up and background noise down. The human visual system groups co-moving pixels (Gestalt common fate) and the content emerges — only while the video plays. Pause it, and every frame looks like static.

SNR threshold · step function

Detection is a step function. Below ~2.5 dB SNR, accuracy is ~0%; above it, accuracy jumps to ≥85%. Comparable thresholds: images at 6.0 dB, dynamic scenes at 9.0 dB.

Common fate. Humans bind pixels that move together into a single object — a temporal grouping mechanism current VLMs simply do not have.

06Frames Carry Nothing

Category	Basic SNR	Perceptual	Temp. Coh.	Motion C.
Images	−46.95	−47.28	8.00	7.17
Dynamic Scenes	−48.95	−63.43	21.91	−3.18
Text	−39.27	−49.18	7.84	8.26

Read: Dynamic Scenes have the highest temporal coherence (21.9) yet near-zero motion contrast — the regime current VLMs cannot exploit.

Basic SNR

Motion energy ÷ frame variance.

Perceptual SNR

Weighted to human contrast sensitivity.

Temporal Coherence

Motion vectors aligned over time.

Motion Contrast

Foreground vs background motion.

07The 0% Wall

Humans read these stimuli at 98%. Every Video-VLM we evaluated — open or closed, 2 B to 78 B parameters, with or without chain-of-thought — collapses to a flat 0%.

Model	Direct Prompt	Chain‑of‑Thought	Params
Human Performance	98.0 ± 0.6	—	—
Open-Source Video-VLMs
VideoLLaMA3-7B	0.0	0.0	7B
VideoLLaMA3-2B	0.0	0.0	2B
TimeChat-7B	0.0	0.0	7B
MiniGPT4-Video	0.0	0.0	7B
MovieChat	0.0	0.0	7B
Video-ChatGPT-7B	0.0	0.0	7B
VideoGPT-plus-Phi3-mini-4k	0.0	0.0	7B
VILA1.5-13B	0.0	0.0	13B
ShareGPT4Video-8B	0.0	0.0	8B
VideoLLaMA2-7B	0.0	0.0	7B
Video-LLaVA	0.0	0.0	7B
LLaVA-NeXT-Video	0.0	0.0	8B
InternVL2-40B	0.0	0.0	40B
InternVL2-8B	0.0	0.0	8B
InternVL2.5-78B	0.0	0.0	78B
InternVL2.5-8B	0.0	0.0	8B
InternVideo2.5-Chat-8B	0.0	0.0	8B
InternVideo2-Chat-8B	0.0	0.0	8B
Qwen2-VL-2B-Instruct	0.0	0.0	2B
Qwen2-VL-7B-Instruct	0.0	0.0	7B
Qwen2-VL-72B-Instruct	0.0	0.0	72B
Qwen2.5-VL-3B-Instruct	0.0	0.0	3B
Qwen2.5-VL-7B-Instruct	0.0	0.0	7B
Qwen2.5-VL-72B-Instruct	0.0	0.0	72B
Qwen3-VL-8B-Instruct	0.0	0.0	8B
Closed-Source Frontier
Gemini 2.5 Pro	0.0	0.0	—
Gemini 1.5 Pro	0.0	0.0	—
Gemini 2.0 Flash	0.0	0.0	—
GPT-4o	0.0	0.0	—

Accuracy (%) on SpookyBench. Failure is invariant to model size, family, and prompting strategy — not a single model exceeds chance.

08It's Architectural — Not Data, Not Prompting

Fine-tuning fails. InternVL2.5-8B & Qwen2-VL-7B on 400 videos × 30 epochs → still 0%.
VJEPA-2 & DINOv3 can't even overfit "is there content?" — loss saturates at chance.
Explicit motion cues unlock it. Overlaying Farneback motion boundaries lifts Qwen2-VL-7B → 51.5% and GPT-4o → 59.1%.

Diagnosis: Today's VLMs lack a mechanism for frame differencing / temporal integration. The information is computationally extractable — it just never reaches the language model.

09What Unlocks It

Pre-compute motion boundaries with classical optical flow, overlay them on the noisy frames, and the same models suddenly score in the 50s.

Qwen2-VL-7B

Frames only

0.0

+ motion overlay

51.5

GPT-4o

Frames only

0.0

+ motion overlay

59.1

Confirms the failure is architectural: temporal info is computationally extractable but VLMs can't reach it without spatial hand-holding. Dynamic Scenes stay hard (≤3.5%).

10Take-Aways

"Time-blindness" is a robust, architecture-level failure of every modern Video-VLM.
Future models need dedicated temporal pathways — distributed timing, motion-energy estimation, motion-based figure-ground.
SpookyBench + generator released to catalyze this direction.

❞Citation

@inproceedings{upadhyay2026timeblindness, title = {Time Blindness: Why Video-Language Models Can't See What Humans Can}, author = {Upadhyay, Ujjwal and Ranjan, Mukul and Shen, Zhiqiang and Elhoseiny, Mohamed}, booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)}, year = {2026}, eprint = {2505.24867}, archivePrefix = {arXiv} }