How to Think About LLM Pricing
LLM pricing looks simple on the surface — dollars per million tokens — but the real cost of using AI for coding depends on factors most developers overlook. This guide explains how pricing actually works and how to estimate what you'll really pay.
The Basics: Tokens, Input, and Output
Every LLM provider charges based on tokens — roughly 0.75 words per token in English, or about 3–4 characters. Code can be more variable. You're charged separately for:
- Input tokens: Everything you send to the model — your prompt, the code context, conversation history, system instructions.
- Output tokens: Everything the model generates — code completions, explanations, function calls.
Most providers charge 3–5x more for output than input, because generation is more computationally expensive. A typical coding interaction might have 5,000 input tokens (your code + prompt) and 2,000 output tokens (the model's response).
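To make the arithmetic concrete, here is a minimal sketch of the per-request calculation. The function name `query_cost` is illustrative, and the prices plugged in are the Sonnet 5 rates quoted later in this guide, used here only as an example.

```python
def query_cost(input_tokens: int, output_tokens: int,
               input_price_per_m: float, output_price_per_m: float) -> float:
    """Dollar cost of one request, given prices in dollars per 1M tokens."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000


# The "typical" interaction above: 5,000 input + 2,000 output tokens,
# priced at $3.00 (input) and $15.00 (output) per million tokens.
cost = query_cost(5_000, 2_000, 3.00, 15.00)
print(f"${cost:.3f} per query")  # $0.045 per query
```

Note that the 2,000 output tokens account for two thirds of the bill here, even though they are less than a third of the tokens.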
Current Pricing Landscape (July 2026)
| Model | Input (per 1M tokens) | Output (per 1M tokens) | Relative Cost vs Cheapest |
|---|---|---|---|
| Gemini 2.5 Flash | ~$0.15 | ~$0.60 | 1x (baseline) |
| Yi-Lightning | ~$0.14 | ~$0.43 | ~1x |
| DeepSeek-V3 | ~$0.27 | ~$0.40 | ~1.5x |
| Qwen3-Coder | ~$0.50 | ~$2.00 | ~3x |
| Claude Haiku 4.5 | $0.80 | $4.00 | ~6x |
| GPT-5 Mini | ~$1.50 | ~$6.00 | ~10x |
| Claude Sonnet 5 | $3.00 | $15.00 | ~20x |
| GPT-5 | ~$15.00 | ~$60.00 | ~100x |
| Claude Opus 4.8 | $15.00 | $75.00 | ~110x |
Prices are approximate and subject to change. Check official provider pricing pages for current rates.
What Actually Drives Your Costs
1. Context Window Utilization
The biggest hidden cost driver is how much context you send with each request. A 200K token context window is powerful, but if you're sending 50K tokens of codebase context with every query, your input costs multiply fast:
- Light context (2K input per query): $0.006/query on Sonnet 5
- Medium context (10K input per query): $0.03/query on Sonnet 5
- Full-repo context (100K input per query): $0.30/query on Sonnet 5
At 500 queries per day with full-repo context, that's $150/day in input costs alone, or $4,500 over a 30-day month. The same usage with targeted 5K context drops to about $225/month.
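The figures above can be reproduced with a few lines of Python. This is a rough sketch: the 30-day month is an assumption implied by the $4,500 figure, the function name is illustrative, and only input tokens are counted.

```python
SONNET_INPUT_PRICE_PER_M = 3.00  # dollars per 1M input tokens, as quoted above


def monthly_input_cost(context_tokens: int,
                       queries_per_day: int = 500,
                       days_per_month: int = 30,
                       price_per_m: float = SONNET_INPUT_PRICE_PER_M) -> float:
    """Input-side spend per month for a fixed context size per query."""
    per_query = context_tokens * price_per_m / 1_000_000
    return per_query * queries_per_day * days_per_month


for label, tokens in [("light", 2_000), ("targeted", 5_000),
                      ("medium", 10_000), ("full-repo", 100_000)]:
    print(f"{label:>9}: ${monthly_input_cost(tokens):,.0f}/month")
```

Cost scales linearly with context size, so cutting context from 100K to 5K tokens cuts the input bill by the same factor of 20.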
2. Prompt Caching
Both Anthropic and OpenAI support prompt caching, which can reduce input costs by up to 90% for repeated context. If you frequently send the same codebase context or system instructions, caching is essential:
- Claude: Cache read costs $0.30/M tokens (vs $3.00 for base input on Sonnet)
- GPT: Cached input is 50% cheaper than uncached
💡 Key insight: Prompt caching is the biggest lever on input costs for AI coding. If you send similar context repeatedly without it, you pay full price for tokens that could cost as little as a tenth as much. Output tokens are billed the same either way.
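How much caching saves depends on the share of input tokens that are actually served from cache. The sketch below blends the two Sonnet rates quoted above by hit rate. It is a simplification: it assumes cache reads are the only cached cost and ignores any extra charge a provider may apply when writing to the cache.

```python
def effective_input_price(base_price_per_m: float,
                          cached_price_per_m: float,
                          hit_rate: float) -> float:
    """Blended input price when `hit_rate` of input tokens are cache reads."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return (1 - hit_rate) * base_price_per_m + hit_rate * cached_price_per_m


# Base input at $3.00/M and cache reads at $0.30/M, as quoted above.
for hit_rate in (0.0, 0.5, 0.7, 0.9):
    price = effective_input_price(3.00, 0.30, hit_rate)
    print(f"hit rate {hit_rate:.0%}: ${price:.2f}/M input "
          f"({3.00 / price:.1f}x cheaper than uncached)")
```

Under these assumptions the savings only approach the full 10x when nearly all input is cached, which is why prompt structure (stable context first, changing content last) matters as much as turning caching on.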
3. Output Length Control
Longer outputs cost more, and each output token is typically billed at 3–5x the input rate. Strategies to control output costs:
- Set reasonable `max_tokens` limits based on your actual needs
- Ask for concise responses when detailed explanations aren't needed
- Use models with efficient tokenization for code (Claude and GPT are both good here)
4. Usage Patterns
Your actual cost depends heavily on how you use AI:
- Autocomplete-style: Many small requests (100–500 tokens each), high volume. Use a cheap, fast model.
- Chat-style: Conversational debugging with longer context. Mid-tier models work well.
- Deep analysis: Few requests but very large context (full files/repos). Premium models justified for accuracy.
- Batch processing: Documentation, tests, migrations. Use the cheapest capable model — DeepSeek-V3 or Gemini Flash.
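The four patterns above map naturally onto a small routing table. The sketch below is purely illustrative: the `Task` enum, the tier labels, and `pick_tier` are hypothetical names, and which concrete model sits behind each tier is left to you.

```python
from enum import Enum


class Task(Enum):
    AUTOCOMPLETE = "autocomplete"    # many small, high-volume requests
    CHAT = "chat"                    # conversational debugging
    DEEP_ANALYSIS = "deep_analysis"  # few requests, very large context
    BATCH = "batch"                  # docs, tests, migrations


# Hypothetical tier labels; substitute the models you actually use.
ROUTES = {
    Task.AUTOCOMPLETE: "cheap-fast",
    Task.CHAT: "mid-tier",
    Task.DEEP_ANALYSIS: "premium",
    Task.BATCH: "cheapest-capable",
}


def pick_tier(task: Task) -> str:
    """Return the model tier suggested for a usage pattern."""
    return ROUTES[task]


print(pick_tier(Task.AUTOCOMPLETE))  # cheap-fast
print(pick_tier(Task.DEEP_ANALYSIS))  # premium
```

Keeping the mapping in one dictionary makes it easy to change a tier assignment later without touching the calling code.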
Real Monthly Cost Estimates
Based on a developer making ~500 AI-assisted coding interactions per day (typical for heavy AI tool users):
| Stack | Monthly Input Cost | Monthly Output Cost | Total (approx.) |
|---|---|---|---|
| DeepSeek-V3 only | ~$15 | ~$9 | ~$24/mo |
| Gemini 2.5 Flash only | ~$8 | ~$13 | ~$21/mo |
| GPT-5 Mini only | ~$83 | ~$132 | ~$215/mo |
| Claude Sonnet 5 only | ~$165 | ~$330 | ~$495/mo |
| Sonnet 5 + DeepSeek mix (30/70) | ~$60 | ~$105 | ~$165/mo |
| Sonnet 5 + caching | ~$61 | ~$330 | ~$391/mo |
Assumptions: 500 queries/day, avg 5K input + 2K output per query, 22 working days/month (11,000 queries, or 55M input and 22M output tokens per month), at the list prices in the pricing table above. Caching assumes a 70% cache hit rate on input, with cache reads at $0.30/M and no extra charge for cache writes. Mix assumes 70% of queries go to DeepSeek-V3 and 30% to Sonnet 5.
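If you want to plug in your own numbers, the stated assumptions translate into a short estimator. This is a sketch, not a billing tool: the `Price` class and `monthly_cost` function are illustrative names, the prices are the approximate list prices from the pricing table, and caching is modeled as a simple blend of base and cache-read rates with no cache-write charge.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    input_per_m: float   # dollars per 1M input tokens
    output_per_m: float  # dollars per 1M output tokens


# Approximate list prices from the pricing table.
PRICES = {
    "deepseek-v3": Price(0.27, 0.40),
    "claude-sonnet-5": Price(3.00, 15.00),
}

QUERIES_PER_MONTH = 500 * 22  # 500 queries/day, 22 working days
INPUT_TOKENS = 5_000          # average input tokens per query
OUTPUT_TOKENS = 2_000         # average output tokens per query


def monthly_cost(price: Price, share: float = 1.0,
                 cache_hit_rate: float = 0.0,
                 cached_input_per_m: float = 0.0) -> float:
    """Monthly spend for the `share` of queries routed to one model."""
    queries = QUERIES_PER_MONTH * share
    input_m = queries * INPUT_TOKENS / 1_000_000
    output_m = queries * OUTPUT_TOKENS / 1_000_000
    input_price = ((1 - cache_hit_rate) * price.input_per_m
                   + cache_hit_rate * cached_input_per_m)
    return input_m * input_price + output_m * price.output_per_m


sonnet = PRICES["claude-sonnet-5"]
deepseek = PRICES["deepseek-v3"]

sonnet_only = monthly_cost(sonnet)
mix = monthly_cost(sonnet, share=0.3) + monthly_cost(deepseek, share=0.7)
cached = monthly_cost(sonnet, cache_hit_rate=0.7, cached_input_per_m=0.30)

print(f"Sonnet only:       ${sonnet_only:,.2f}")
print(f"30/70 mix:         ${mix:,.2f}")
print(f"Sonnet + caching:  ${cached:,.2f}")
```

Changing `INPUT_TOKENS`, `OUTPUT_TOKENS`, or the query volume shows quickly which assumption your bill is most sensitive to.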
Practical Cost-Saving Strategies
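- Route by task complexity: Send simple completions to DeepSeek-V3 or Haiku, complex reasoning to Sonnet or Opus. At the list prices above, routing 70% of queries to the cheaper model cuts a Sonnet-only bill by roughly half (Haiku) to two-thirds (DeepSeek-V3).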
- Enable prompt caching: If you use Claude or GPT via API, structure your prompts so repeated context (system instructions, codebase context) is cacheable.
- Trim context: Don't send your entire codebase when the relevant file is 500 lines. Be selective about what context the model actually needs.
- Use free tiers: Gemini 2.5 Flash has a generous free tier. GitHub Copilot includes a certain number of completions in the subscription.
- Monitor usage: Set up usage alerts. It's easy to accidentally leave a long-context session running and rack up unexpected costs.
- Consider subscriptions: For heavy users, flat-rate subscriptions (Copilot, Claude Code) may be cheaper than per-token API pricing — do the math for your usage level.
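"Doing the math" on a subscription mostly means finding the break-even query volume. The sketch below assumes a hypothetical $20/month flat rate (a placeholder, not a quoted price for any product) and compares it against per-token pricing at the Sonnet 5 rates used earlier.

```python
def breakeven_queries_per_month(subscription_price: float,
                                input_tokens: int, output_tokens: int,
                                input_price_per_m: float,
                                output_price_per_m: float) -> float:
    """Monthly query count at which API spend equals the flat rate."""
    per_query = (input_tokens * input_price_per_m
                 + output_tokens * output_price_per_m) / 1_000_000
    return subscription_price / per_query


# Hypothetical $20/month plan vs. 5K input + 2K output per query
# at $3.00 / $15.00 per million tokens.
queries = breakeven_queries_per_month(20.00, 5_000, 2_000, 3.00, 15.00)
print(f"Break-even at about {queries:.0f} queries/month")
```

Above the break-even volume the flat rate wins on price alone, provided the plan's usage limits actually cover that volume.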