This article was originally published on Best GPU for AI. The full version with interactive tools, FAQ, and live pricing is on the original site.
The right GPU for Kohya_ss depends on what you are training. LoRA fine-tuning for Stable Diffusion XL or Flux.1, DreamBooth for character consistency, and full model fine-tuning all have different hardware demands. This guide breaks down what you actually need for each scenario.
Quick answer: For LoRA training, 16GB VRAM hits the sweet spot. The RTX 4090 is the fastest consumer training GPU. The RTX 4060 Ti 16GB is the best budget pick for VRAM headroom. The RTX 3060 12GB works for light LoRA but gets tight fast.
VRAM needs by training task
| Training task | Minimum VRAM | Recommended VRAM | Notes |
|---|---|---|---|
| SD 1.5 LoRA | 6GB | 8GB | Any modern GPU works |
| SD XL LoRA | 10GB | 12GB+ | 8GB requires aggressive gradient checkpointing |
| Flux.1 LoRA | 16GB | 24GB | Flux is memory-hungry |
| DreamBooth SD XL | 16GB | 24GB | Higher batch sizes need more VRAM |
| DreamBooth Flux.1 | 24GB | 32GB | Very demanding |
| Full model fine-tune | 24GB+ | 40GB+ | Rarely done on consumer hardware |
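The table can also be read as a lookup: given a card's VRAM, each task is either comfortable, tight, or out of reach. The sketch below encodes the table's numbers as-is; the function names and the three labels are illustrative choices, not anything Kohya_ss provides.

```python
# VRAM figures in GB, copied from the table above: (minimum, recommended).
# "24GB+" and "40GB+" are stored as their lower bounds.
VRAM_NEEDS_GB = {
    "SD 1.5 LoRA": (6, 8),
    "SD XL LoRA": (10, 12),
    "Flux.1 LoRA": (16, 24),
    "DreamBooth SD XL": (16, 24),
    "DreamBooth Flux.1": (24, 32),
    "Full model fine-tune": (24, 40),
}


def classify_fit(task, gpu_vram_gb):
    """Return 'comfortable', 'tight', or 'insufficient' for one task."""
    minimum, recommended = VRAM_NEEDS_GB[task]
    if gpu_vram_gb >= recommended:
        return "comfortable"
    if gpu_vram_gb >= minimum:
        return "tight"
    return "insufficient"


def fit_report(gpu_vram_gb):
    """Classify every task in the table for a GPU with the given VRAM."""
    return {task: classify_fit(task, gpu_vram_gb) for task in VRAM_NEEDS_GB}


if __name__ == "__main__":
    for task, verdict in fit_report(16).items():
        print(f"{task}: {verdict}")
```

For a 16GB card this reports SD XL LoRA as comfortable, Flux.1 LoRA and DreamBooth SD XL as tight, and DreamBooth Flux.1 as insufficient, which matches the recommendations in the rest of this guide.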
GPU recommendations by scenario
Scenario 1: Flux.1 LoRA training (most demanding popular task)
Flux.1 LoRA training in Kohya_ss is the current standard for high-quality character and style training. It needs 16GB minimum and runs best with 24GB.
Top pick: RTX 4090 — 24GB GDDR6X trains Flux.1 LoRAs comfortably. A typical 1500-step run completes in 20-30 minutes. Fast enough to iterate quickly.
Value pick: RTX 4070 Ti Super — 16GB is tight for Flux.1 but works with gradient checkpointing enabled. Training takes 50-80% longer than the 4090.
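To put the "50-80% longer" figure in minutes, here is a small arithmetic sketch. It only combines the two ranges quoted above (20-30 minutes on the 4090, 50-80% longer on the 4070 Ti Super); the function name is made up for illustration, and the result is an estimate rather than a benchmark.

```python
def slower_range(baseline_minutes, percent_longer):
    """Scale a (low, high) time range by a (low, high) 'percent longer' range."""
    low_min, high_min = baseline_minutes
    low_pct, high_pct = percent_longer
    return (low_min * (100 + low_pct) / 100, high_min * (100 + high_pct) / 100)


# RTX 4090: 20-30 minutes for a 1500-step Flux.1 LoRA run (from the text above).
# RTX 4070 Ti Super: 50-80% longer (from the text above).
estimate = slower_range((20, 30), (50, 80))
print(estimate)  # (30.0, 54.0)
```

So a run that takes 20-30 minutes on the 4090 would land at roughly 30-54 minutes on the 4070 Ti Super.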
Scenario 2: SD XL LoRA training (most common task)
SD XL LoRA is forgiving. 12GB VRAM handles it, and 16GB gives comfortable headroom for higher resolution or larger batch sizes.
Top pick: RTX 4090 — Fastest training times, excellent VRAM headroom.
Value pick: RTX 4060 Ti 16GB — 16GB GDDR6 is exactly right for SD XL LoRA. Significantly cheaper than the 4090. Training is slower but perfectly usable for personal projects.
Scenario 3: Budget SD 1.5 / SD 2.1 training
For older model training, the requirements drop significantly. A 12GB GPU handles everything comfortably.
Budget pick: RTX 3060 12GB — 12GB at the lowest price point. Trains SD 1.5 LoRAs without issues. Struggles with Flux.1 and higher-VRAM tasks, but for basic character LoRA work it gets the job done.
Training time comparison (SD XL LoRA, 1500 steps, batch 1)
| GPU | VRAM | Approx. training time | Relative speed |
|---|---|---|---|
| RTX 4090 | 24GB | ~12 min | 1x (baseline) |
| RTX 4070 Ti Super | 16GB | ~20 min | 0.6x |
| RTX 4060 Ti 16GB | 16GB | ~30 min | 0.4x |
| RTX 3060 12GB | 12GB | ~55 min | 0.22x |
Estimates at 1024x1024 resolution with latent caching enabled. Flux.1 LoRA times are 2-3x longer across all GPUs.
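The "Relative speed" column is just the baseline time divided by each GPU's time, and the Flux.1 note is a 2-3x multiplier on the same numbers. A minimal sketch of that arithmetic, using only the figures from the table (the function names are illustrative):

```python
# Approximate SD XL LoRA training times in minutes, from the table above.
SDXL_MINUTES = {
    "RTX 4090": 12,
    "RTX 4070 Ti Super": 20,
    "RTX 4060 Ti 16GB": 30,
    "RTX 3060 12GB": 55,
}
BASELINE_GPU = "RTX 4090"
FLUX_MULTIPLIER = (2, 3)  # "2-3x longer" per the note above


def relative_speed(gpu):
    """Speed relative to the baseline GPU (1.0 = same speed as the 4090)."""
    return round(SDXL_MINUTES[BASELINE_GPU] / SDXL_MINUTES[gpu], 2)


def flux_minutes_range(gpu):
    """Rough (low, high) Flux.1 LoRA time in minutes for the same run."""
    low, high = FLUX_MULTIPLIER
    return (SDXL_MINUTES[gpu] * low, SDXL_MINUTES[gpu] * high)


if __name__ == "__main__":
    for gpu in SDXL_MINUTES:
        print(gpu, relative_speed(gpu), flux_minutes_range(gpu))
```

These are derived from rough estimates, so treat the outputs as ballpark figures; real times depend on dataset size, optimizer, and network rank.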
What about the RTX 5090?
The RTX 5090 trains faster than the 4090, roughly 1.5-2x depending on the task. For pure training throughput, it is the fastest consumer option. But the 4090 is already fast enough that the extra speed rarely justifies paying $400+ more for personal training work.
See also: Best GPU for LoRA training, Best GPU for fine-tuning, and Best GPU for DreamBooth.
Which GPU should YOU buy?
- Training Flux.1 LoRAs regularly? RTX 4090 (24GB) is the minimum comfortable option. The 4060 Ti 16GB works but is slow.
- Mostly SD XL LoRA training? RTX 4060 Ti 16GB is the best value — 16GB VRAM, much cheaper than the 4090.
- On a tight budget running SD 1.5? RTX 3060 12GB works. Expect slower training times.
- Professional training pipeline with iteration speed critical? RTX 4090 or RTX 5090 (if budget allows).
- Training infrequently? Cloud GPUs are worth considering — RunPod hourly rates often beat buying hardware for light use.
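For the cloud-versus-buy question in the last bullet, a break-even estimate is a reasonable starting point. The sketch below is deliberately simple: the price and hourly rate in the example are hypothetical placeholders, not quotes from RunPod or any retailer, and it ignores electricity, resale value, and storage fees.

```python
def break_even_hours(gpu_price_usd, cloud_rate_usd_per_hour):
    """Hours of cloud rental that would cost as much as buying the GPU."""
    if cloud_rate_usd_per_hour <= 0:
        raise ValueError("cloud rate must be positive")
    return gpu_price_usd / cloud_rate_usd_per_hour


def cheaper_option(gpu_price_usd, cloud_rate_usd_per_hour, expected_hours):
    """Return 'cloud' or 'buy' for the expected total training hours."""
    cloud_cost = cloud_rate_usd_per_hour * expected_hours
    return "cloud" if cloud_cost < gpu_price_usd else "buy"


if __name__ == "__main__":
    # Hypothetical numbers for illustration only: substitute current prices.
    price, rate = 1600, 0.5
    print(break_even_hours(price, rate))
    print(cheaper_option(price, rate, expected_hours=100))
```

With these placeholder numbers the break-even point is 3200 hours of training, which illustrates why occasional users often come out ahead renting.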
Common mistakes to avoid
- Buying an 8GB GPU to save money, then hitting constant out-of-memory errors in Kohya_ss: 16GB is the practical minimum for Flux.1 and DreamBooth work
- Running Flux.1 LoRA training on 12GB and expecting a smooth experience: it technically runs, but barely
- Ignoring gradient checkpointing settings: enabling gradient checkpointing on a 16GB GPU can be the difference between a training run completing and failing
- Using a slow HDD for dataset storage: Kohya_ss reads training images repeatedly, and slow storage adds meaningful time across thousands of steps
- Skipping latent caching: caching latents before training dramatically speeds up LoRA runs and is often overlooked by beginners
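Two of the mistakes above come down to settings: gradient checkpointing and latent caching. The sketch below shows one way to decide which to turn on, based on how a card's VRAM compares with the recommended figure for the task. The flag names `--cache_latents` and `--gradient_checkpointing` are the ones I understand the underlying sd-scripts trainers to accept; treat them as an assumption and confirm against `--help` on your installed version. The helper function itself is hypothetical.

```python
def memory_saving_flags(gpu_vram_gb, task_recommended_gb):
    """Suggest training flags given available and recommended VRAM (GB)."""
    # Latent caching speeds up runs regardless of VRAM, so always suggest it.
    flags = ["--cache_latents"]
    # Below the recommended VRAM, trade some speed for lower memory use.
    if gpu_vram_gb < task_recommended_gb:
        flags.append("--gradient_checkpointing")
    return flags


if __name__ == "__main__":
    # 16GB card on a Flux.1 LoRA run (24GB recommended in the VRAM table).
    print(" ".join(memory_saving_flags(16, 24)))
    # 24GB card on the same task.
    print(" ".join(memory_saving_flags(24, 24)))
```

In the Kohya_ss GUI the same options appear as checkboxes, so the flags matter mainly if you launch training from the command line.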
Final verdict
| Use case | Best pick | Budget pick |
|---|---|---|
| Flux.1 LoRA | RTX 4090 | RTX 4070 Ti Super |
| SD XL LoRA | RTX 4090 | RTX 4060 Ti 16GB |
| SD 1.5 LoRA | RTX 4060 Ti 16GB | RTX 3060 12GB |
| DreamBooth | RTX 4090 | RTX 4060 Ti 16GB |
For most Kohya_ss users, the RTX 4060 Ti 16GB offers the best balance of VRAM capacity and price. If you train Flux.1 seriously, save up for the RTX 4090.
In Kohya_ss, VRAM capacity determines what you can run. Training speed determines how fast you can iterate. Both matter.











