Vision models

Vision models analyze and reason about images and videos. Upload a photo and ask questions about it, extract data from charts, identify objects, or get detailed descriptions of complex scenes.

Unlike the image captioning collection, which focuses on generating descriptions, this collection is about reasoning — asking questions, analyzing documents, understanding diagrams, and having conversations about what's in an image.

Models we recommend

Best for speed: Gemini 3 Flash

Gemini 3 Flash is 3x faster than previous models while matching their intelligence. It handles images, video, and audio — ask it about any of them. The best default for most visual reasoning tasks where you need fast answers.
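
Getting an answer is a single API call. Here's a minimal sketch, assuming the platform's Python client follows the common replicate.run pattern; the input field names ("image", "prompt") are assumptions and vary by model schema:

```python
# Minimal sketch: visual Q&A with a hosted vision model.
# Assumes a Replicate-style Python client; the "image" and "prompt"
# input names are assumptions; check the model's input schema.
import replicate

output = replicate.run(
    "google/gemini-3-flash",
    input={
        "image": open("photo.jpg", "rb"),
        "prompt": "What landmark is shown in this photo?",
    },
)
# Output may be a string or an iterator of chunks, depending on the model.
print(output)
```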

Best for depth: Claude 4.5 Sonnet

Claude 4.5 Sonnet excels at nuanced visual analysis. It picks up on subtle details in composition, can reason about complex scenes, and writes thorough descriptions. Particularly strong at analyzing UI screenshots, code, and technical diagrams.

Most capable: GPT-5.4

GPT-5.4 is the most powerful model for complex visual reasoning — charts, spreadsheets, multi-page documents, and technical drawings. Features a 1 million token context window and configurable reasoning depth. Use it when simpler models aren't getting the job done.
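
Reasoning depth is usually exposed as a model input. Here's a hedged sketch of dialing it up for a dense technical drawing; the "reasoning_effort" parameter name is an assumption, so check the model's input schema for the real one:

```python
# Hypothetical sketch: trade latency for deeper reasoning on a hard task.
# The "reasoning_effort" input name is an assumption, not a confirmed API.
import replicate

output = replicate.run(
    "openai/gpt-5.4",
    input={
        "image": open("blueprint.png", "rb"),
        "prompt": "List every dimension called out on this technical drawing.",
        "reasoning_effort": "high",  # assumed parameter name
    },
)
print(output)
```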

Best all-around: GPT-5

GPT-5 handles images well across a wide range of tasks — from casual "what's in this photo?" to structured data extraction. Configurable reasoning effort lets you balance speed and depth.

Budget pick: GPT-4o Mini

GPT-4o Mini is a solid choice for lighter visual tasks where cost matters. Good for basic image understanding, simple Q&A, and high-volume processing.

Open source: Moondream 2B

Moondream 2B is a lightweight vision model you can self-host. Great for basic captioning and visual Q&A when you need to run on your own hardware.

What you can do

  • Visual Q&A: Ask questions about photos, screenshots, or diagrams and get natural language answers.
  • Document analysis: Extract information from charts, tables, receipts, and forms (see the extraction sketch after this list).
  • UI understanding: Analyze app screenshots, wireframes, and design mockups.
  • Image reasoning: Understand spatial relationships, count objects, compare images, and follow visual instructions.
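
For structured extraction, the usual trick is to ask for machine-readable output and parse it. A sketch under the same assumptions as above (Replicate-style client, assumed "image"/"prompt" input names):

```python
# Sketch: pull structured fields out of a receipt photo.
# Assumes a Replicate-style client; input names are assumptions.
import json
import replicate

output = replicate.run(
    "anthropic/claude-4.5-sonnet",
    input={
        "image": open("receipt.jpg", "rb"),
        "prompt": (
            "Extract the merchant, date, line items, and total from this "
            "receipt. Respond with JSON only, no prose."
        ),
    },
)
text = "".join(output)  # join in case the model streams chunks
print(json.loads(text)["total"])
```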

Looking for simple image captions or alt text? Check out our image captioning collection →

Frequently asked questions

Which model should I start with?

google/gemini-3-flash is the best default — it's fast, handles images and video, and produces accurate answers. For more nuanced analysis, use anthropic/claude-4.5-sonnet or openai/gpt-5.

Which models are the fastest?

google/gemini-3-flash is 3x faster than previous generation models. openai/gpt-4o-mini is also fast and cheap for lighter tasks.

What can I use vision models for?

  • Ask questions about photos ("What breed is this dog?", "How many people are in this image?")
  • Analyze charts, graphs, and data visualizations
  • Extract text and data from documents, receipts, and forms
  • Understand UI screenshots and wireframes
  • Compare images and spot differences
  • Reason about spatial relationships and object positions

What's the difference between this collection and Caption Images?

Caption Images focuses on generating descriptions — you give it an image and get text back. Vision models are for interactive reasoning — you have a conversation about an image, ask follow-up questions, or give complex instructions about what to analyze.

Which model handles the most complex visual tasks?

openai/gpt-5.4 is the most capable — it has a 1 million token context window, configurable reasoning depth, and excels at multi-page documents, spreadsheets, and complex technical drawings. Use it when other models aren't detailed enough.

Can I analyze video with these models?

google/gemini-3-flash natively supports video input. Other models can analyze individual frames extracted from video.
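
If your model is frames-only, sampling frames yourself is straightforward. A sketch using OpenCV that grabs roughly one frame per second:

```python
# Sketch: sample ~1 frame per second from a video for frames-only models.
import cv2

cap = cv2.VideoCapture("clip.mp4")
fps = int(cap.get(cv2.CAP_PROP_FPS)) or 30  # fall back if FPS is unknown
frames = []
index = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    if index % fps == 0:  # keep one frame per second of video
        path = f"frame_{index:05d}.jpg"
        cv2.imwrite(path, frame)
        frames.append(path)
    index += 1
cap.release()
# Each saved frame can now be sent to any image-capable model.
```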

Can I self-host a vision model?

lucataco/moondream2 is a lightweight open-source option for basic visual Q&A and captioning. It's not as capable as the official models but runs on consumer hardware.
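
Here's a sketch of running it locally with Hugging Face transformers. The encode_image/answer_question methods follow one revision of the moondream2 model card and may have changed since; treat them as assumptions and check the current card:

```python
# Sketch: self-hosted visual Q&A with Moondream 2B via transformers.
# Method names (encode_image, answer_question) are assumptions taken
# from an earlier moondream2 model card; newer revisions may differ.
from transformers import AutoModelForCausalLM, AutoTokenizer
from PIL import Image

model = AutoModelForCausalLM.from_pretrained(
    "vikhyatk/moondream2", trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained("vikhyatk/moondream2")

image = Image.open("photo.jpg")
encoded = model.encode_image(image)
print(model.answer_question(encoded, "What is in this image?", tokenizer))
```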

Can I use these models commercially?

Yes — official models from OpenAI, Anthropic, and Google all support commercial use. Check each model's license page for specifics.