These models generate text descriptions and captions from images. Use them for alt text, image search, content indexing, training data preparation, and accessibility.
Claude Sonnet 4.5 produces the most detailed, nuanced image descriptions. It understands composition, style, mood, and context, capturing not just what's in the image but why it matters. Great for detailed alt text, editorial descriptions, and cases where caption quality matters more than speed.
GPT-5 combines strong visual understanding with excellent instruction following. Tell it exactly what kind of caption you want — short and punchy, detailed and technical, or structured with specific fields — and it delivers. Supports configurable reasoning effort so you can trade depth for speed.
Gemini 3 Flash is built for speed. It processes images quickly while still producing accurate, useful descriptions. Also understands video, so you can caption frames from video content. A great default for high-volume captioning workflows.
GPT-5.4 is the most capable model for analyzing charts, diagrams, documents, technical drawings, and complex visual scenes. Use it when you need more than a caption — when you need the model to reason about what it sees. Features a 1 million token context window.
GPT-5 Nano is the cheapest official option that still handles images well. Good for bulk captioning where you need accurate descriptions at scale without high costs.
Moondream 2B is a lightweight vision model you can self-host. It's fast and cheap, and produces decent captions for most common images. A good choice if you need to run captioning on your own hardware.
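All of these models run behind the same API, so producing a caption takes only a few lines of code. The sketch below uses the Python client; the input field names ("prompt", "image_input") are assumptions that vary from model to model, so check the API schema on each model's page.

```python
# Minimal captioning sketch using the replicate Python client
# (pip install replicate; set the REPLICATE_API_TOKEN environment variable).
# The input fields below are assumptions -- each model documents its own schema.
import replicate

output = replicate.run(
    "google/gemini-3-flash",
    input={
        "prompt": "Write one sentence of alt text for this image.",
        "image_input": ["https://example.com/photo.jpg"],
    },
)
# Language models typically stream output as a list of text chunks.
print("".join(output))
```

Swap in anthropic/claude-4.5-sonnet for richer editorial descriptions, or openai/gpt-5-nano to cut costs on large batches.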
Looking for interactive visual reasoning? Check out our vision models collection →
Featured models

openai/gpt-5.4
OpenAI's most capable frontier model for complex professional work, coding, and multi-step reasoning.
Updated 1 month, 2 weeks ago
41.3K runs

google/gemini-3-flash
Google's most intelligent model built for speed with frontier intelligence, superior search, and grounding
Updated 2 months, 3 weeks ago
1.3M runs

openai/gpt-5
OpenAI's new model excelling at coding, writing, and reasoning.
Updated 2 months, 4 weeks ago
1.7M runs

openai/gpt-5-nano
Fastest, most cost-effective GPT-5 model from OpenAI
Updated 2 months, 4 weeks ago
10.1M runs

anthropic/claude-4.5-sonnet
Claude Sonnet 4.5 is the best coding model to date, with significant improvements across the entire development lifecycle
Updated 6 months, 3 weeks ago
1.1M runs

lucataco/moondream2
moondream2 is a small vision language model designed to run efficiently on edge devices
Updated 1 year, 8 months ago
11M runs
Frequently asked questions
Which model should I use for image captioning?
For most use cases, google/gemini-3-flash is the best default: it's fast, accurate, and handles a wide range of images. For the most detailed and nuanced descriptions, use anthropic/claude-4.5-sonnet. For the cheapest option that still works well, use openai/gpt-5-nano.
What's the fastest option?
google/gemini-3-flash and openai/gpt-5-nano are both built for speed. For a self-hostable option, lucataco/moondream2 is lightweight and quick.
Can these models answer questions about images?
Yes: all the recommended models support visual question answering. Upload an image and ask "What's in this photo?", "How many people are there?", or "Describe the lighting in this scene." openai/gpt-5 is particularly good at following specific instructions about what kind of answer you want.
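As a sketch, visual question answering is the same call with a question as the prompt. Again, "prompt" and "image_input" are assumed field names; each model's page documents its actual inputs.

```python
import replicate

# Ask a specific question about an image and constrain the answer format.
# Input field names are assumptions -- check the model's schema.
answer = replicate.run(
    "openai/gpt-5",
    input={
        "prompt": "How many people are in this photo, and what is the "
                  "lighting like? Answer in two short sentences.",
        "image_input": ["https://example.com/scene.jpg"],
    },
)
print("".join(answer))
```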
Which model is best for documents, charts, and diagrams?
openai/gpt-5.4 is the most capable model for this: it can reason about charts, extract data from tables, interpret technical drawings, and analyze multi-page documents with its 1 million token context window.
Can I caption images in bulk?
Yes: all these models work via API, so you can process images programmatically. For bulk captioning of training datasets, fofr/deprecated-batch-image-captioning processes ZIP archives using GPT, Claude, or Gemini.
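For smaller jobs, a plain loop over a folder of images works too. This sketch assumes the Python client accepts local file handles for image inputs and that the model exposes "prompt" and "image_input" fields; adjust to the schema of whichever model you pick.

```python
import pathlib
import replicate

# Caption every JPEG in a local folder and collect the results.
captions = {}
for path in sorted(pathlib.Path("dataset").glob("*.jpg")):
    with open(path, "rb") as image:
        output = replicate.run(
            "openai/gpt-5-nano",  # cheap option for bulk work
            input={
                "prompt": "Caption this image in one sentence.",
                "image_input": [image],  # assumed field name; varies by model
            },
        )
    captions[path.name] = "".join(output)

for name, caption in captions.items():
    print(f"{name}: {caption}")
```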
Is there an open-source model I can self-host?
lucataco/moondream2 is a lightweight open-source vision model that you can run on your own hardware. It's not as capable as the official models, but it works well for basic captioning and tagging.
Can I use these models commercially?
Yes: most official models (GPT-5, Claude, Gemini) support commercial use. Check the license on each model page for specifics.
Recommended models

lucataco/qwen2-vl-7b-instruct
Latest model in the Qwen family for chatting about videos and images
Updated 1 year, 4 months ago
419K runs

lucataco/ollama-llama3.2-vision-90b
Ollama Llama 3.2 Vision 90B
Updated 1 year, 4 months ago
4.4K runs

lucataco/ollama-llama3.2-vision-11b
Ollama Llama 3.2 Vision 11B
Updated 1 year, 4 months ago
5.2K runs

lucataco/smolvlm-instruct
SmolVLM-Instruct by HuggingFaceTB
Updated 1 year, 4 months ago
8.3K runs

lucataco/llama-3-vision-alpha
Projection module trained to add vision capabilities to Llama 3 using SigLIP
Updated 1 year, 5 months ago
6.8K runs

zsxkib/molmo-7b
allenai/Molmo-7B-D-0924: answers questions about images and captions them
Updated 1 year, 6 months ago
1.3M runs

zsxkib/idefics3
Idefics3-8B-Llama3: answers questions about images and captions them
Updated 1 year, 8 months ago
2.7K runs

fofr/deprecated-batch-image-captioning
A wrapper model for captioning multiple images using GPT, Claude, or Gemini; useful for LoRA training
Updated 1 year, 8 months ago
1.6K runs

yorickvp/llava-13b
Visual instruction tuning towards large language and vision models with GPT-4 level capabilities
Updated 1 year, 9 months ago
35.1M runs

lucataco/florence-2-base
Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks
Updated 1 year, 9 months ago
132.6K runs

lucataco/sdxl-clip-interrogator
CLIP Interrogator for SDXL optimizes text prompts to match a given image
Updated 1 year, 11 months ago
848.8K runs

daanelson/minigpt-4
A model which generates text in response to an input image and prompt.
Updated 1 year, 11 months ago
1.8M runs

zsxkib/blip-3
BLIP-3 / XGen-MM ({blip3,xgen-mm}-phi3-mini-base-r-v1): answers questions about images
Updated 1 year, 11 months ago
1.3M runs

zsxkib/uform-gen
🖼️ Super fast 1.5B Image Captioning/VQA Multimodal LLM (Image-to-Text) 🖋️
Updated 2 years, 2 months ago
2.4K runs

andreasjansson/blip-2
Answers questions about images
Updated 2 years, 5 months ago
31.6M runs

lucataco/fuyu-8b
Fuyu-8B is a multi-modal text and image transformer trained by Adept AI
Updated 2 years, 6 months ago
14.6K runs

lucataco/qwen-vl-chat
A multimodal LLM-based AI assistant, which is trained with alignment techniques. Qwen-VL-Chat supports more flexible interaction, such as multi-round question answering, and creative capabilities.
Updated 2 years, 6 months ago
826K runs

pharmapsychotic/clip-interrogator
The CLIP Interrogator is a prompt engineering tool that combines OpenAI's CLIP and Salesforce's BLIP to optimize text prompts to match a given image. Use the resulting prompts with text-to-image models like Stable Diffusion to create cool art!
Updated 2 years, 7 months ago
4.9M runs

nohamoamary/image-captioning-with-visual-attention
Image captioning with visual attention, trained on the Flickr8k dataset
Updated 2 years, 11 months ago
11.3K runs

salesforce/blip
Generate image captions
Updated 3 years, 6 months ago
172.6M runs

rmokady/clip_prefix_caption
Simple image captioning model using CLIP and GPT-2
Updated 3 years, 6 months ago
1.7M runs

methexis-inc/img2prompt
Get an approximate text prompt, with style, matching an image. Optimized for Stable Diffusion (CLIP ViT-L/14).
Updated 3 years, 7 months ago
2.7M runs

j-min/clip-caption-reward
Fine-grained Image Captioning with CLIP Reward
Updated 3 years, 10 months ago
296.1K runs