Vision models analyze and reason about images and videos. Upload a photo and ask questions about it, extract data from charts, identify objects, or get detailed descriptions of complex scenes.
Unlike the image captioning collection, which focuses on generating descriptions, this collection is about reasoning — asking questions, analyzing documents, understanding diagrams, and having conversations about what's in an image.
Gemini 3 Flash is 3x faster than previous models while matching their intelligence. It handles images, video, and audio — ask it about any of them. The best default for most visual reasoning tasks where you need fast answers.
Claude 4.5 Sonnet excels at nuanced visual analysis. It picks up on subtle details in composition, can reason about complex scenes, and writes thorough descriptions. Particularly strong at analyzing UI screenshots, code, and technical diagrams.
GPT-5.4 is the most powerful model for complex visual reasoning — charts, spreadsheets, multi-page documents, and technical drawings. Features a 1 million token context window and configurable reasoning depth. Use it when simpler models aren't getting the job done.
GPT-5 handles images well across a wide range of tasks — from casual "what's in this photo?" to structured data extraction. Configurable reasoning effort lets you balance speed and depth.
GPT-4o Mini is a solid choice for lighter visual tasks where cost matters. Good for basic image understanding, simple Q&A, and high-volume processing.
Moondream 2B is a lightweight vision model you can self-host. Great for basic captioning and visual Q&A when you need to run on your own hardware.
Looking for simple image captions or alt text? Check out our image captioning collection →
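Most of these models take an image plus a text prompt. A minimal sketch of packaging a local image for a question — assuming an OpenAI-style chat payload with a base64 data URI in place of a public image URL; the exact input schema varies by model, so check each model's API page:

```python
import base64


def image_to_data_uri(image_bytes: bytes, mime: str = "image/png") -> str:
    """Base64-encode raw image bytes as a data URI, usable wherever an image URL is expected."""
    return f"data:{mime};base64," + base64.b64encode(image_bytes).decode("ascii")


def vision_message(question: str, image_uri: str) -> dict:
    """Build a chat-style user message that pairs a text question with an image."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": question},
            {"type": "image_url", "image_url": {"url": image_uri}},
        ],
    }


# Hypothetical usage with a local file:
# uri = image_to_data_uri(open("photo.png", "rb").read())
# payload = {"messages": [vision_message("What's in this photo?", uri)]}
```

Follow-up questions reuse the same structure: append the model's reply and your next question to the `messages` list and send it again.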
Featured models

openai/gpt-5.4
OpenAI's most capable frontier model for complex professional work, coding, and multi-step reasoning.
Updated 1 month, 1 week ago
39K runs

google/gemini-3-flash
Google's most intelligent model built for speed with frontier intelligence, superior search, and grounding
Updated 2 months, 3 weeks ago
1.3M runs

openai/gpt-5
OpenAI's new model excelling at coding, writing, and reasoning.
Updated 2 months, 3 weeks ago
1.7M runs

anthropic/claude-4.5-sonnet
Claude Sonnet 4.5 is the best coding model to date, with significant improvements across the entire development lifecycle
Updated 6 months, 2 weeks ago
1.1M runs

openai/gpt-4o-mini
Low latency, low cost version of OpenAI's GPT-4o model
Updated 8 months ago
37.5M runs
Frequently asked questions
Which model should I use?
google/gemini-3-flash is the best default — it's fast, handles images and video, and produces accurate answers. For more nuanced analysis, use anthropic/claude-4.5-sonnet or openai/gpt-5.

Which model is fastest?
google/gemini-3-flash is 3x faster than previous-generation models. openai/gpt-4o-mini is also fast and cheap for lighter tasks.

How is this different from image captioning?
The image captioning collection focuses on generating descriptions — you give a model an image and get text back. Vision models are for interactive reasoning — you have a conversation about an image, ask follow-up questions, or give complex instructions about what to analyze.

Which model is the most capable?
openai/gpt-5.4 is the most capable — it has a 1 million token context window, configurable reasoning depth, and excels at multi-page documents, spreadsheets, and complex technical drawings. Use it when other models aren't detailed enough.

Can these models analyze video?
google/gemini-3-flash natively supports video input. Other models can analyze individual frames extracted from video.

Is there an open-source option?
lucataco/moondream2 is a lightweight open-source option for basic visual Q&A and captioning. It's not as capable as the official models, but it runs on consumer hardware.

Can I use these models commercially?
Yes: official models from OpenAI, Anthropic, and Google all support commercial use. Check each model's license page for specifics.
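For models without native video support, a common workaround is to sample evenly spaced frames and send them as individual images. A small sketch of choosing the timestamps — the extraction itself would use a tool like ffmpeg or OpenCV; `frame_timestamps` is a hypothetical helper, not part of any model's API:

```python
def frame_timestamps(duration_s: float, n_frames: int = 8) -> list[float]:
    """Pick n evenly spaced timestamps (in seconds) across a video.

    Each timestamp sits at the center of its segment, so the very first
    and last instants of the video are skipped.
    """
    if n_frames <= 0 or duration_s <= 0:
        return []
    step = duration_s / n_frames
    return [round(step * (i + 0.5), 3) for i in range(n_frames)]
```

Each timestamp can then be passed to an extraction command (e.g. ffmpeg's `-ss` seek option) and the resulting frames sent to the model as separate images in one prompt.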
Recommended Models

Google’s hybrid “thinking” AI model optimized for speed and cost-efficiency
Updated 2 months ago
3.9M runs

openai/gpt-4o
OpenAI's high-intelligence chat model
Updated 2 months, 3 weeks ago
637.5K runs

openai/gpt-4.1-mini
Fast, affordable version of GPT-4.1
Updated 2 months, 3 weeks ago
2.1M runs

Claude Sonnet 4 is a significant upgrade to 3.7, delivering superior coding and reasoning while responding more precisely to your instructions
Updated 10 months ago
2.9M runs

lucataco/qwen2.5-omni-7b
Qwen2.5-Omni is an end-to-end multimodal model designed to perceive diverse modalities, including text, images, audio, and video, while simultaneously generating text and natural speech responses in a streaming manner.
Updated 1 year ago
32.4K runs

The most intelligent Claude model and the first hybrid reasoning model on the market (claude-3-7-sonnet-20250219)
Updated 1 year, 1 month ago
4.1M runs

lucataco/qwen2-vl-7b-instruct
Latest model in the Qwen family for chatting about videos and images
Updated 1 year, 3 months ago
416.6K runs

lucataco/ollama-llama3.2-vision-90b
Ollama Llama 3.2 Vision 90B
Updated 1 year, 4 months ago
4.4K runs

lucataco/ollama-llama3.2-vision-11b
Ollama Llama 3.2 Vision 11B
Updated 1 year, 4 months ago
5.1K runs

lucataco/moondream2
moondream2 is a small vision language model designed to run efficiently on edge devices
Updated 1 year, 8 months ago
10.9M runs

yorickvp/llava-13b
Visual instruction tuning towards large language and vision models with GPT-4 level capabilities
Updated 1 year, 9 months ago
34.9M runs

daanelson/minigpt-4
A model which generates text in response to an input image and prompt.
Updated 1 year, 11 months ago
1.8M runs

yorickvp/llava-v1.6-vicuna-13b
LLaVA v1.6: Large Language and Vision Assistant (Vicuna-13B)
Updated 2 years, 2 months ago
3.8M runs

yorickvp/llava-v1.6-mistral-7b
LLaVA v1.6: Large Language and Vision Assistant (Mistral-7B)
Updated 2 years, 2 months ago
5M runs

zsxkib/uform-gen
🖼️ Super fast 1.5B Image Captioning/VQA Multimodal LLM (Image-to-Text) 🖋️
Updated 2 years, 2 months ago
2.4K runs

adirik/kosmos-g
Kosmos-G: Generating Images in Context with Multimodal Large Language Models
Updated 2 years, 4 months ago
4.5K runs

cjwbw/cogvlm
Powerful open-source visual language model
Updated 2 years, 4 months ago
1.5M runs

lucataco/bakllava
BakLLaVA-1 is a Mistral 7B base augmented with the LLaVA 1.5 architecture
Updated 2 years, 5 months ago
39.9K runs

lucataco/qwen-vl-chat
A multimodal LLM-based AI assistant, which is trained with alignment techniques. Qwen-VL-Chat supports more flexible interaction, such as multi-round question answering, and creative capabilities.
Updated 2 years, 6 months ago
826K runs

adirik/owlvit-base-patch32
Zero-shot / open vocabulary object detection
Updated 2 years, 6 months ago
25.1K runs

cjwbw/internlm-xcomposer
Advanced text-image comprehension and composition based on InternLM
Updated 2 years, 6 months ago
164.4K runs