Vision–Language Models are increasingly used as judges of synthetic driving videos, but the best open VLMs score below 40% on edge-case driving questions — and Qwen3-VL alone reaches only 21%. I built SynthDriveEval, a benchmark of 1,597 synthetic clips and 7,371 expert-annotated questions, plus a training-free tool-augmented agent that grounds the VLM in optical flow (RAFT), instance segmentation (SAM 3), and a frequency-domain generation detector. The agent reaches 60.1% overall accuracy — nearly tripling its own backbone and beating the strongest baseline by +20 pp.
21% → 60.1%
Qwen3-VL alone vs. our agent
+20 pp
over the strongest base VLM
1,597 / 7,371
videos / annotated questions
Why this matters
A growing body of research proposes Vision–Language Models as components of autonomous-driving pipelines — from VLM-RL[1], which turns a pre-trained VLM into a reward signal for safe-driving reinforcement learning, to DriveLLM-V[2], an explainable end-to-end framework that fuses vision and language to support interpretable driving decisions. In parallel, simulation itself is shifting from physics engines to generative world models — systems like GEM[3] and NVIDIA's Cosmos family that synthesise realistic driving rollouts conditioned on scene context and prompts. This matters precisely because the hard cases — running a red light, aggressive overtakes, failure to stop — are the ones we cannot safely film on real roads, so we have to generate them.
These two trajectories collide on a specific question: if we use a VLM to audit what a generative simulator produces — to screen synthetic datasets, rank world models, or provide feedback during training — can we trust its judgement? “Good driving” is temporal, context-dependent, and hard to reduce to explicit metrics (a form of Polanyi's paradox for driving behaviour). My thesis answers the question empirically: no, not reliably — off-the-shelf VLMs exhibit systematic biases on synthetic driving footage. And it proposes a way to fix it without retraining.
The benchmark
SynthDriveEval is built on two NVIDIA Cosmos generators with very different characters. Cosmos-Drive-Dreams[4] produces high-fidelity clips conditioned on HD maps and natural-language prompts — perfect for staging traffic violations that would be unsafe or impossible to film. I rewrite each prompt across 8 weather and time-of-day variants, then manually filter the outputs because the generator silently rewrites road layouts when it can't satisfy the prompt.
Cosmos-Drive-Dreams pipeline. HD maps and textual prompts describing specific traffic violations are fed to the generator; prompt rewriting diversifies weather and time-of-day. A manual curation step filters out clips that silently suppress the requested violation.
Cosmos-Predict1[5], by contrast, is a lower-fidelity general-purpose video generator conditioned on a single image plus a text prompt. Its weaker fidelity surfaces morphing, warping, and pop-in artifacts — exactly what we need to stress-test artifact perception. The conditioning frames are real dashcam shots[6], weather-augmented with Qwen-Image-Edit[7], then manually pruned of cartoon-like outputs.
Cosmos-Predict1 pipeline. A real dashcam frame is weather-augmented via Qwen-Image-Edit and passed as conditioning to the generator. The model's lower fidelity surfaces morphing, warping, and pop-in artifacts — well suited to probing artifact detection.
Each curated clip is paired with one or more questions drawn from a six-category taxonomy designed to probe complementary VLM capabilities — from binary “is this real?” calls to fine-grained spatio-temporal reasoning.
Reality detection
Artifact recognition
Safety assessment
Traffic-law compliance
Spatio-temporal
Visual understanding
How VLMs fail on driving video
Across six open-source VLMs and GPT-5.4-mini, accuracy ranges from 10% to 39% — but the failure profile is more interesting than the numbers. The same biases recur across models:
Always-Real bias. Across the benchmark, every model answers “Real” on the vast majority of reality-check items — bare Qwen3-VL does so on more than 96% of them — even though every clip in SynthDriveEval is generated.
Compliance hallucinated from static cues. On clips where the ego car visibly crosses a red light, VLMs will confidently state that the ego car “stopped because the traffic light was red” — inferring compliance from the presence of the traffic light rather than observing what actually happened.
Midpoint-anchored Likert collapse (GPT-5.4-mini). When asked to rate the safety of a video on a scale from 1 to 3 (1 being unsafe and 3 totally safe), GPT-5.4-mini never emits a “1” across 1,061 items: its output distribution is {2 : 964, 3 : 97}, so every ground-truth-1 clip is automatically wrong. The realism Likert shows the same collapse: the model never under-rates.
Perception–labelling disconnect on overtakes. For GPT-5.4-mini, 91.7% of ground-truth-Right overtake failures are answered “Left”, even when the model correctly describes ego-relative positions in its free-text reasoning. The error lives at the labelling step, not in perception.
A training-free tool-augmented agent
These failures point to an imbalance in the data these models were trained on. The driving corpora were very likely overwhelmingly compliant: cars stop at red lights, overtakes happen on the left, drivers obey stop signs. With vanishingly few counter-examples during training, a VLM ends up learning scene statistics rather than behaviour, and at inference time it defaults to whatever the training distribution says. Fine-tuning on targeted negatives would plausibly close the gap — but collecting and annotating enough edge-case footage to train on is slow and expensive.
I take a training-free route instead. I pick Qwen3-VL[8] as the backbone — a widely-used reference open VLM that natively supports tool-calling — and wrap it in an agent harness with three tools, each targeting one of the failure modes. The VLM keeps its general semantic capability; the tools inject the evidence it was lacking.
1. Optical flow with RAFT[9] — for motion grounding
Original clip
RAFT optical flow
Many driving questions hinge on motion cues the VLM doesn't track. RAFT computes per-pixel optical flow that Qwen3-VL can actually reason over.
2. SAM 3[10] crop-and-zoom — for localized inspection
SAM 3 crop-and-zoom tool. Given a question about a clip's artifacts, the agent requests segmentation masks for a chosen concept class; each instance is cropped and upsampled so that the target object dominates the VLM's view. Subtle texture and boundary inconsistencies that are invisible to it at native scale become discriminable after zoom.
Many artifacts are subtle at native scale for a VLM: texture drift on a distant vehicle, a pedestrian outline morphing across frames, garbled sign text. SAM 3 lets the agent request open-vocabulary instance masks (“cars”, “pedestrians”, “traffic signs”); each instance is cropped and upsampled so the target object dominates the VLM's view.
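SAM 3's API aside, the crop-and-zoom step itself is simple. Here is a sketch that takes a boolean instance mask from any segmenter and produces the upsampled crop; the padding fraction and 448-px output size are illustrative choices, not the thesis's exact parameters:

```python
import numpy as np
from PIL import Image

def crop_and_zoom(frame: np.ndarray, mask: np.ndarray,
                  pad: float = 0.15, out_size: int = 448) -> Image.Image:
    """Crop an instance's bounding box (plus padding) and upsample it
    so the object dominates the VLM's input resolution.

    frame: (H, W, 3) uint8 image; mask: (H, W) boolean instance mask.
    """
    ys, xs = np.nonzero(mask)
    y0, y1 = ys.min(), ys.max() + 1
    x0, x1 = xs.min(), xs.max() + 1
    # Pad the box so boundary artifacts around the object stay visible.
    dy, dx = int((y1 - y0) * pad), int((x1 - x0) * pad)
    y0, y1 = max(0, y0 - dy), min(frame.shape[0], y1 + dy)
    x0, x1 = max(0, x0 - dx), min(frame.shape[1], x1 + dx)
    crop = Image.fromarray(frame[y0:y1, x0:x1])
    # Upsample the longer side to out_size, preserving aspect ratio.
    scale = out_size / max(crop.size)
    return crop.resize((round(crop.width * scale), round(crop.height * scale)),
                       Image.LANCZOS)
```

Each zoomed crop goes back to the VLM as a separate image, one per instance of the requested concept.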
3. FFT directional anisotropy — for provenance
Optical flow and segmentation give no signal on “is this real or generated?”. So I built a frequency-domain detector: for each frame, the tool computes the 2D FFT magnitude spectrum, defines a high-frequency annular ring (30–90% of Nyquist), sweeps 72 angular bins, and returns the coefficient of variation of energy across bins — a directional anisotropy score. Real footage shows smooth elliptical falloff (~0.040); AI-generated clips show bright cross-shaped streaks and directional concentration (~0.055). A 0.048 threshold, calibrated on 36 validation videos, separates the two regimes.
Real footage — smooth, isotropic falloff
Generated clip — bright cross-shaped streaks
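The detector is compact enough to sketch in full. The annulus bounds, 72 angular bins, and coefficient-of-variation score follow the description above; averaging per-frame scores into a clip-level score before thresholding at 0.048 is my assumption:

```python
import numpy as np

def anisotropy_score(frame: np.ndarray) -> float:
    """Directional anisotropy of a frame's high-frequency spectrum.

    frame: 2D grayscale array. Returns the coefficient of variation of
    spectral energy across 72 angular bins, restricted to an annulus
    spanning 30-90% of the Nyquist radius.
    """
    mag = np.abs(np.fft.fftshift(np.fft.fft2(frame)))

    h, w = frame.shape
    yy, xx = np.indices((h, w))
    dy, dx = yy - h // 2, xx - w // 2
    r = np.hypot(dy, dx)
    theta = np.arctan2(dy, dx)  # in [-pi, pi]

    nyquist = min(h, w) / 2
    ring = (r >= 0.3 * nyquist) & (r <= 0.9 * nyquist)

    # Sweep 72 angular bins and sum spectral energy per bin.
    bins = np.floor((theta[ring] + np.pi) / (2 * np.pi) * 72).astype(int) % 72
    energy = np.bincount(bins, weights=mag[ring], minlength=72)

    return float(energy.std() / energy.mean())

def clip_is_generated(frames, threshold: float = 0.048) -> bool:
    """Mean per-frame anisotropy vs. the calibrated threshold."""
    return float(np.mean([anisotropy_score(f) for f in frames])) > threshold
```

Isotropic falloff yields near-uniform energy across bins (low coefficient of variation); directional streaks concentrate energy in a few bins and push the score up.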
Design lesson — numeric beats visual. My first attempt fed the spectrum video to the VLM and asked it to “identify anomalous patterns”. Accuracy on the binary reality question was 22% — worse than the bare baseline, because Qwen3-VL hallucinates “smooth natural falloff” even on clearly anisotropic spectra. Once I let the tool compute the anisotropy and return a short textual verdict (real / generated, with confidence), accuracy jumped to 74.2%. Qwen3-VL, it turns out, reads numbers and follows threshold rules far better than it interprets FFT visualisations.
Results
Across all six categories, the tool-augmented agent reaches 60.1% overall accuracy, almost tripling its own Qwen3-VL baseline (21%) and beating the strongest base VLM by over 20 pp. The largest gains come exactly where the tools target failure modes: +54 pp on reality detection (FFT) and +43 pp on artifact recognition (SAM crops).
Overall accuracy on SynthDriveEval. The tool-augmented agent outperforms every base VLM by more than 20 pp.
Accuracy vs. wall-clock inference time (log scale). The agent (top right) trades latency for a large accuracy gain over every base VLM.
A counter-intuitive finding. Adding subsidiary “hint” questions before the main one helps base VLMs by 12–14 pp — but regresses the tool-augmented agent by 9 pp. Hints and tools compete: once the agent has grounded evidence from its tools, hint sub-answers add noise rather than guidance.
Release
The full benchmark is publicly available on Hugging Face. The agent code is currently a private repository and will be released alongside the paper. The arXiv preprint will be linked here as soon as it's online.