# Dataset Preparation Guide: Text-to-Image Training (AI Toolkit)
This trainer endpoint fine-tunes a text-to-image diffusion model using ai-toolkit in the background. To achieve good results, your dataset must follow the structure described below.
## 1. Dataset Format (Required)

Your dataset must be a single folder containing images, where each image has a matching caption file.

### Folder Structure

```
dataset/
├── image_001.jpg
├── image_001.txt
├── image_002.png
├── image_002.txt
├── image_003.webp
├── image_003.txt
```
### Rules

- Every image must have a `.txt` caption file
- Caption files must share the exact same base filename as the image
- Images without captions will be ignored
- Caption files without images will be ignored
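The pairing rules above are easy to verify locally before uploading. The following is a minimal sketch (not part of ai-toolkit itself) that matches images to captions by base filename and reports the orphans the trainer would ignore:

```python
from pathlib import Path

IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".webp"}

def pair_dataset(root):
    """Return (paired, orphan_images, orphan_captions) for a flat dataset folder."""
    root = Path(root)
    images = {p.stem: p for p in root.iterdir() if p.suffix.lower() in IMAGE_EXTS}
    captions = {p.stem: p for p in root.iterdir() if p.suffix.lower() == ".txt"}
    paired = sorted(set(images) & set(captions))
    orphan_images = sorted(set(images) - set(captions))    # ignored by the trainer
    orphan_captions = sorted(set(captions) - set(images))  # also ignored
    return paired, orphan_images, orphan_captions
```

Running this before upload catches mismatched base filenames (e.g. `image_01.jpg` next to `image_001.txt`) early.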
## 2. Caption Files (`.txt`)

Each `.txt` file contains the text description of the image, similar to a prompt used at inference time.

### Example

`image_001.txt`:

```
a photo of <my_concept> wearing a black hoodie, studio lighting, high detail
```
### Caption Guidelines

- Describe what you want the model to learn
- Be clear and concise
- Natural language works best
- Multi-line captions are allowed, but are treated as plain text
## 3. Trigger Words (Strongly Recommended)

If you are training a:

- person
- character
- product
- specific visual concept

use a unique trigger word in every caption.

### Example

```
a portrait photo of <my_concept>, 35mm lens, shallow depth of field
```

Later, you can prompt the trained model with:

```
a cinematic portrait of <my_concept>
```
### Trigger Word Rules

- Must be unique
- Should not already exist in the base model's vocabulary
- Must be used consistently in all captions
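The consistency rule above can also be checked mechanically. A small sketch (again, a pre-upload helper, not an ai-toolkit API) that lists caption files missing your trigger word:

```python
from pathlib import Path

def missing_trigger(root, trigger):
    """List caption files that do not contain the trigger word."""
    return sorted(
        p.name for p in Path(root).glob("*.txt")
        if trigger not in p.read_text(encoding="utf-8")
    )
```

An empty result means every caption mentions the trigger; any filenames returned are captions you should fix before training.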
## 4. Image Requirements

ai-toolkit automatically handles resizing and bucketing, so manual preprocessing is not required.

### Recommended

- Minimum resolution: 512×512
- Formats: `.jpg`, `.png`, `.webp`
- High-quality images with varied:
  - angles
  - lighting
  - backgrounds
  - poses or expressions

### Avoid

- Duplicates
- Watermarks or logos
- Text overlays
- Very low-resolution or blurry images
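To spot images below the recommended 512×512 without installing anything, you can read dimensions straight from the file header. This sketch covers PNG only (width and height sit at fixed offsets in the IHDR chunk); in practice a library such as Pillow (`Image.open(path).size`) handles all supported formats:

```python
import struct

def png_size(path):
    """Read (width, height) from a PNG header using only the stdlib."""
    with open(path, "rb") as f:
        header = f.read(24)
    if header[:8] != b"\x89PNG\r\n\x1a\n":
        raise ValueError("not a PNG file")
    # IHDR payload starts at byte 16: width and height as big-endian uint32
    width, height = struct.unpack(">II", header[16:24])
    return width, height
```

Flag any image where both dimensions fall under the recommended minimum and consider replacing it.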
## 5. Dataset Size Recommendations
| Use Case | Recommended Images |
|---|---|
| Person / Character | 15–40 |
| Product | 20–50 |
| Style / Aesthetic | 30–100 |
| General Concept | 50+ |
Quality and diversity matter more than raw quantity.
## 6. What Not to Include
- Nested folders inside the dataset
- Missing or mismatched caption files
- Reused or copy-pasted captions
- Copyrighted material you do not own the rights to
- NSFW or disallowed content
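Reused captions are the easiest of these problems to detect automatically. A short sketch (a local helper, assumed to run over your flat dataset folder) that groups caption files sharing identical text:

```python
from collections import Counter
from pathlib import Path

def duplicated_captions(root):
    """List caption files whose text is identical to another caption (likely copy-pasted)."""
    texts = {
        p.name: p.read_text(encoding="utf-8").strip()
        for p in Path(root).glob("*.txt")
    }
    counts = Counter(texts.values())
    return sorted(name for name, text in texts.items() if counts[text] > 1)
```

Any filenames returned share their exact caption with at least one other file; rewrite them so each image is described individually.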
## 7. Uploading the Dataset

Once your dataset folder is ready:

1. Upload the dataset folder to the trainer endpoint
2. Specify your trigger word (if applicable)
3. Start training

The trainer will validate the dataset before launching the training job.
## 8. Minimal Example Dataset

```
my_dataset/
├── img1.jpg
├── img1.txt → a photo of <my_concept> smiling, outdoor lighting
├── img2.jpg
├── img2.txt → side profile of <my_concept>, soft light, 85mm lens
├── img3.jpg
├── img3.txt → studio portrait of <my_concept>, neutral background
```
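If it helps to see the caption side of this layout produced programmatically, here is a sketch that writes the three example caption files (`<my_concept>` is a placeholder trigger word; the matching `.jpg` files would sit alongside with the same base filenames):

```python
from pathlib import Path

EXAMPLE_CAPTIONS = {
    "img1.txt": "a photo of <my_concept> smiling, outdoor lighting",
    "img2.txt": "side profile of <my_concept>, soft light, 85mm lens",
    "img3.txt": "studio portrait of <my_concept>, neutral background",
}

def write_captions(root):
    """Write the example caption files into a flat dataset folder."""
    root = Path(root)
    root.mkdir(parents=True, exist_ok=True)
    for name, text in EXAMPLE_CAPTIONS.items():
        (root / name).write_text(text + "\n", encoding="utf-8")
```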
## Need Help?
If you are unsure whether your dataset is correctly structured or want feedback on captions, reach out before starting training — fixing dataset issues early saves time and compute.