Vision–Language Models are increasingly used as judges of synthetic driving videos, but the best open VLMs score below 40% on edge-case driving questions — and Qwen3-VL alone reaches only 21%. I built SynthDriveEval, a benchmark of 1,597 synthetic clips and 7,371 expert-annotated questions, plus a training-free tool-augmented agent that grounds the VLM in optical flow (RAFT), instance segmentation (SAM 3), and a frequency-domain generation detector. The agent reaches 60.1% overall accuracy — nearly tripling its own backbone and beating the strongest baseline by +20 pp.
21% → 60.1%
Qwen3-VL alone vs. our agent
+20 pp
over the strongest base VLM
1,597 / 7,371
videos / annotated questions
Why this matters
A growing body of research proposes Vision–Language Models as components of autonomous-driving pipelines — from VLM-RL[1], which turns a pre-trained VLM into a reward signal for safe-driving reinforcement learning, to DriveLLM-V[2], an explainable end-to-end framework that fuses vision and language to support interpretable driving decisions. In parallel, simulation itself is shifting from physics engines to generative world models — systems like GEM[3] and NVIDIA's Cosmos family that synthesise realistic driving rollouts conditioned on scene context and prompts. This matters precisely because the hard cases — running a red light, aggressive overtakes, failure to stop — are the ones we cannot safely film on real roads, so we have to generate them.
These two trajectories collide on a specific question: if we use a VLM to audit what a generative simulator produces — to screen synthetic datasets, rank world models, or provide feedback during training — can we trust its judgement? “Good driving” is temporal, context-dependent, and hard to reduce to explicit metrics (a form of Polanyi's paradox for driving behaviour). My thesis answers the question empirically: no, not reliably — off-the-shelf VLMs exhibit systematic biases on synthetic driving footage. And it proposes a way to fix it without retraining.
The benchmark
SynthDriveEval is built on two NVIDIA Cosmos generators with very different characters. Cosmos-Drive-Dreams[4] produces high-fidelity clips conditioned on HD maps and natural-language prompts — perfect for staging traffic violations that would be unsafe or impossible to film. I rewrite each prompt across 8 weather and time-of-day variants, then manually filter the outputs because the generator silently rewrites road layouts when it can't satisfy the prompt.
Cosmos-Drive-Dreams pipeline. HD maps and textual prompts describing specific traffic violations are fed to the generator; prompt rewriting diversifies weather and time-of-day. A manual curation step filters out clips that silently suppress the requested violation.
Cosmos-Predict1[5], by contrast, is a lower-fidelity general-purpose video generator conditioned on a single image plus a text prompt. Its weaker fidelity surfaces morphing, warping, and pop-in artifacts — exactly what we need to stress-test artifact perception. The conditioning frames are real dashcam shots[6], weather-augmented with Qwen-Image-Edit[7], then manually pruned of cartoon-like outputs.
Cosmos-Predict1 pipeline. A real dashcam frame is weather-augmented via Qwen-Image-Edit and passed as conditioning to the generator. The model's lower fidelity surfaces morphing, warping, and pop-in artifacts — well suited to probing artifact detection.
Each curated clip is paired with one or more questions drawn from a six-category taxonomy designed to probe complementary VLM capabilities — from binary “is this real?” calls to fine-grained spatio-temporal reasoning.
Reality detection
Artifact recognition
Safety assessment
Traffic-law compliance
Spatio-temporal
Visual understanding
How VLMs fail on driving video
Across six open-source VLMs and GPT-5.4-mini, accuracy ranges from 10% to 39% — but the failure profile is more interesting than the numbers. The same biases recur across models:
Always-Real bias. Across the benchmark, every model answers “Real” on the vast majority of reality-check items — bare Qwen3-VL does so on more than 96% of them — even though every clip in SynthDriveEval is generated.
Compliance hallucinated from static cues. On clips where the ego car visibly crosses a red light, VLMs will confidently state that the ego car “stopped because the traffic light was red” — inferring compliance from the presence of the traffic light rather than observing what actually happened.
Midpoint-anchored Likert collapse (GPT-5.4-mini). When asked to rate the safety of a video on a scale from 1 to 3 (1 being unsafe and 3 totally safe), GPT-5.4-mini never emits a “1” across 1,061 items: its output distribution is {2 : 964, 3 : 97}, so every ground-truth-1 clip is automatically wrong. The realism Likert shows the same collapse: the model never under-rates.
Perception–labelling disconnect on overtakes. For GPT-5.4-mini, 91.7% of ground-truth-Right overtake failures are answered “Left”, even when the model correctly describes ego-relative positions in its free-text reasoning. The error lives at the labelling step, not in perception.
A training-free tool-augmented agent
These failures point to an imbalance in the data these models were trained on. The driving corpora were very likely overwhelmingly compliant: cars stop at red lights, overtakes happen on the left, drivers obey stop signs. With vanishingly few counter-examples during training, a VLM ends up learning scene statistics rather than behaviour, and at inference time it defaults to whatever the training distribution says. Fine-tuning on targeted negatives would plausibly close the gap — but collecting and annotating enough edge-case footage to train on is slow and expensive.
I take a training-free route instead. I pick Qwen3-VL[8] as the backbone — a widely-used reference open VLM that natively supports tool-calling — and wrap it in an agent harness with three tools, each targeting one of the failure modes. The VLM keeps its general semantic capability; the tools inject the evidence it was lacking.
1. Optical flow with RAFT[9] — for motion grounding
Original clip
RAFT optical flow
Many driving questions hinge on motion cues the VLM doesn't track. RAFT computes per-pixel optical flow that Qwen3-VL can actually reason over.
2. SAM 3[10] crop-and-zoom — for localized inspection
SAM 3 crop-and-zoom tool. Given a question about a clip's artifacts, the agent requests segmentation masks for a chosen concept class; each instance is cropped and upsampled so that the target object dominates the VLM's view. Subtle texture and boundary inconsistencies that are invisible to it at native scale become discriminable after zoom.
Many artifacts are subtle at native scale for a VLM: texture drift on a distant vehicle, a pedestrian outline morphing across frames, garbled sign text. SAM 3 lets the agent request open-vocabulary instance masks (“cars”, “pedestrians”, “traffic signs”); each instance is cropped and upsampled so the target object dominates the VLM's view.
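SAM 3's API aside, the crop-and-zoom step itself is simple. Here is a sketch that takes a boolean instance mask from any segmenter and produces the upsampled crop; the padding fraction and 448-px output size are illustrative choices, not the thesis's exact parameters:

```python
import numpy as np
from PIL import Image

def crop_and_zoom(frame: np.ndarray, mask: np.ndarray,
                  pad: float = 0.15, out_size: int = 448) -> Image.Image:
    """Crop an instance's bounding box (plus padding) and upsample it
    so the object dominates the VLM's input resolution.

    frame: (H, W, 3) uint8 image; mask: (H, W) boolean instance mask.
    """
    ys, xs = np.nonzero(mask)
    y0, y1 = ys.min(), ys.max() + 1
    x0, x1 = xs.min(), xs.max() + 1
    # Pad the box so boundary artifacts around the object stay visible.
    dy, dx = int((y1 - y0) * pad), int((x1 - x0) * pad)
    y0, y1 = max(0, y0 - dy), min(frame.shape[0], y1 + dy)
    x0, x1 = max(0, x0 - dx), min(frame.shape[1], x1 + dx)
    crop = Image.fromarray(frame[y0:y1, x0:x1])
    # Upsample the longer side to out_size, preserving aspect ratio.
    scale = out_size / max(crop.size)
    return crop.resize((round(crop.width * scale), round(crop.height * scale)),
                       Image.LANCZOS)
```

Each zoomed crop goes back to the VLM as a separate image, one per instance of the requested concept.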
3. FFT directional anisotropy — for provenance
Optical flow and segmentation give no signal on “is this real or generated?”. So I built a frequency-domain detector: for each frame, the tool computes the 2D FFT magnitude spectrum, defines a high-frequency annular ring (30–90% of Nyquist), sweeps 72 angular bins, and returns the coefficient of variation of energy across bins — a directional anisotropy score. Real footage shows smooth elliptical falloff (~0.040); AI-generated clips show bright cross-shaped streaks and directional concentration (~0.055). A 0.048 threshold, calibrated on 36 validation videos, separates the two regimes.
Real footage — smooth, isotropic falloff
Generated clip — bright cross-shaped streaks
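The detector is compact enough to sketch in full. The annulus bounds, 72 angular bins, and coefficient-of-variation score follow the description above; averaging per-frame scores into a clip-level score before thresholding at 0.048 is my assumption:

```python
import numpy as np

def anisotropy_score(frame: np.ndarray) -> float:
    """Directional anisotropy of a frame's high-frequency spectrum.

    frame: 2D grayscale array. Returns the coefficient of variation of
    spectral energy across 72 angular bins, restricted to an annulus
    spanning 30-90% of the Nyquist radius.
    """
    mag = np.abs(np.fft.fftshift(np.fft.fft2(frame)))

    h, w = frame.shape
    yy, xx = np.indices((h, w))
    dy, dx = yy - h // 2, xx - w // 2
    r = np.hypot(dy, dx)
    theta = np.arctan2(dy, dx)  # in [-pi, pi]

    nyquist = min(h, w) / 2
    ring = (r >= 0.3 * nyquist) & (r <= 0.9 * nyquist)

    # Sweep 72 angular bins and sum spectral energy per bin.
    bins = np.floor((theta[ring] + np.pi) / (2 * np.pi) * 72).astype(int) % 72
    energy = np.bincount(bins, weights=mag[ring], minlength=72)

    return float(energy.std() / energy.mean())

def clip_is_generated(frames, threshold: float = 0.048) -> bool:
    """Mean per-frame anisotropy vs. the calibrated threshold."""
    return float(np.mean([anisotropy_score(f) for f in frames])) > threshold
```

Isotropic falloff yields near-uniform energy across bins (low coefficient of variation); directional streaks concentrate energy in a few bins and push the score up.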
Design lesson — numeric beats visual. My first attempt fed the spectrum video to the VLM and asked it to “identify anomalous patterns”. Accuracy on the binary reality question was 22% — worse than the bare baseline, because Qwen3-VL hallucinates “smooth natural falloff” even on clearly anisotropic spectra. Once I let the tool compute the anisotropy and return a short textual verdict (real / generated, with confidence), accuracy jumped to 74.2%. Qwen3-VL, it turns out, reads numbers and follows threshold rules far better than it interprets FFT visualisations.
Results
Across all six categories, the tool-augmented agent reaches 60.1% overall accuracy, almost tripling its own Qwen3-VL baseline (21%) and beating the strongest base VLM by over 20 pp. The largest gains come exactly where the tools target failure modes: +54 pp on reality detection (FFT) and +43 pp on artifact recognition (SAM crops).
Overall accuracy on SynthDriveEval. The tool-augmented agent outperforms every base VLM by more than 20 pp.
Accuracy vs. wall-clock inference time (log scale). The agent (top right) trades latency for a large accuracy gain over every base VLM.
A counter-intuitive finding. Adding subsidiary “hint” questions before the main one helps base VLMs by 12–14 pp — but regresses the tool-augmented agent by 9 pp. Hints and tools compete: once the agent has grounded evidence from its tools, hint sub-answers add noise rather than guidance.
Release
The full benchmark is publicly available on Hugging Face. The agent code is currently a private repository and will be released alongside the paper. The arXiv preprint will be linked here as soon as it's online.