AI coding assistants have become indispensable for serious development work. Claude Code, Gemini CLI, Codex, and Kimi Code each offer powerful capabilities for building software through natural language interaction. They also share a common characteristic that catches users off guard: token costs accumulate faster than expected.
A developer running Claude Code for a full workday can easily burn through hundreds of thousands of tokens. At current pricing, that translates to $15-30 daily. Scale that to a team of five engineers over a month, and you are looking at roughly $1,500-3,000 in API costs alone. These tools deliver genuine productivity gains, but unmanaged usage erodes those benefits quickly.
This guide covers practical, battle-tested techniques for reducing token consumption without sacrificing output quality. Every strategy here has been validated across the four major CLI coding tools. Apply them consistently, and you will cut your AI coding costs by 60% or more.
Understanding Token Economics
Before diving into optimization techniques, you need to understand how tokens work and where your money actually goes.
How Tokens Are Counted
Tokens represent pieces of words processed by language models. A token might be a complete word (“function”), part of a word (“ embed”), or punctuation. As a rough rule:
- 100 tokens equals approximately 75 words in English
- Code typically consumes more tokens than prose due to syntax and indentation
- A 500-line JavaScript file might contain 3,000-5,000 tokens
Both input (what you send) and output (what the AI returns) count against your quota. Input tokens are cheaper than output tokens on most platforms, but input volume usually dominates because you are sending context repeatedly.
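A rough sense of these ratios is easy to get without calling any API. The sketch below uses the common four-characters-per-token heuristic; real counts vary by tokenizer and model (a real tokenizer such as OpenAI's tiktoken gives exact numbers), so treat this as a budgeting estimate, not ground truth:

```python
# Heuristic token estimator: ~4 characters per token for English text.
# Real counts vary by tokenizer and model; use a real tokenizer (e.g.
# tiktoken) when precision matters. This is a budgeting tool only.
def estimate_tokens(text: str) -> int:
    # Never report zero: even an empty prompt carries framing tokens.
    return max(1, round(len(text) / 4))

prose = "One hundred tokens is roughly seventy-five English words."
print(estimate_tokens(prose))  # a rough estimate; code usually scores higher per line
```

Running this over a prompt before sending it is often enough to catch a context window that has quietly grown tenfold.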
Where Tokens Get Wasted
Through analysis of dozens of development sessions, I have identified the primary token drains:
Bloated context windows. Sending entire files when only specific functions matter. Including package lock files, build artifacts, or generated code in prompts.
Redundant explanations. Asking the AI to “explain your reasoning” or “walk me through this step by step” multiplies response length without adding executable value.
Iterative refinement loops. Making five small requests to polish code instead of one well-crafted prompt that gets it right initially.
Unbounded file operations. Commands like “analyze this codebase” or “find all bugs” that trigger massive file reads.
Conversational overhead. Treating CLI tools like chatbots rather than execution engines, accumulating context that grows with every exchange.
Understanding these patterns is the foundation of efficient usage. The techniques below target each waste source specifically.
Context Management Fundamentals
The most impactful way to reduce token usage is controlling what context you send to the AI. These tools read files, directories, and project structures automatically. Left unchecked, they consume enormous token volumes.
Audit What Gets Sent
Start by understanding your baseline. Most CLI tools provide visibility into their context gathering:
Claude Code: Use /verbose or check the thinking output to see which files were read. The tool shows file counts and approximate token usage per interaction.
Gemini CLI: Add --verbose to commands to see context inclusion. Review the file list before confirming operations.
Codex: Enable debug logging with CODEX_DEBUG=1 to trace file reads and token counts.
Kimi Code: Use the context inspection commands to review what files are included in the working set.
Run a typical task and note the total tokens consumed. You cannot optimize what you do not measure.
Curate Your Working Context
All four tools allow explicit control over which files matter for a given task. Master these mechanisms.
Claude Code:
/claude what files are in context?
/claude add src/components/Button.tsx
/claude remove node_modules
/claude clear context
Use explicit add commands to build minimal working sets. Clear context between unrelated tasks to prevent accumulation.
Gemini CLI:
# Include only specific files
gemini code --include "src/*.ts" --include "tests/*.test.ts"
# Exclude generated files
gemini code --exclude "*.min.js" --exclude "dist/**"
The --include and --exclude flags accept glob patterns. Create a .geminiignore file for persistent exclusions:
node_modules/
dist/
*.log
package-lock.json
yarn.lock
.vscode/
.idea/
Codex:
# Use --files to specify exactly what matters
codex --files src/auth.ts,src/middleware.ts "add JWT validation"
# Or use a file list
codex --files-list important-files.txt "refactor error handling"
Kimi Code:
# Specify context explicitly
kimi --context src/api/ --context src/types/ "implement user endpoints"
# Exclude paths
kimi --exclude node_modules --exclude "*.test.ts" "fix TypeScript errors"
Create Context Profiles
For recurring tasks, define standard context sets rather than rebuilding them each time.
Create a context-profiles/ directory in your project:
context-profiles/
├── backend-api.txt
├── frontend-components.txt
├── database-migrations.txt
└── deployment-scripts.txt
Each file lists relevant paths:
# backend-api.txt
src/routes/
src/controllers/
src/middleware/
src/models/
src/utils/validation.ts
tests/api/
Then invoke with context awareness:
# Codex example
codex --files-list context-profiles/backend-api.txt "add pagination to user list"
# Gemini with custom ignore for specific tasks
gemini code --ignore-file .gemini-backend --prompt "optimize query performance"
This approach eliminates the token waste of repeatedly including irrelevant files.
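For tools without a files-list flag, a small wrapper can expand a profile into explicit paths before invoking the CLI. A minimal sketch, assuming the profile format shown above (one path or directory per line, `#` for comments):

```python
# Expand a context profile (one path or directory per line, '#' comments)
# into an explicit file list. Directory entries are expanded recursively.
import os

def expand_profile(profile_path):
    files = []
    with open(profile_path) as f:
        for line in f:
            entry = line.split("#", 1)[0].strip()  # drop comments and blanks
            if not entry:
                continue
            if os.path.isdir(entry):
                for root, _dirs, names in os.walk(entry):
                    files.extend(os.path.join(root, n) for n in sorted(names))
            else:
                files.append(entry)
    return files
```

The resulting list can be joined into whatever flag your tool accepts, for example `",".join(expand_profile("context-profiles/backend-api.txt"))` as the argument to the Codex `--files` invocation shown earlier.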
Prompt Engineering for Token Efficiency
How you phrase requests dramatically impacts token consumption. Small changes in prompt structure can reduce response length by 50% or more.

Be Specific About Output Format
Vague requests produce verbose responses. Explicit constraints yield concise, useful output.
Inefficient:
Review this code and tell me what you think.
Efficient:
List exactly 3 specific bugs in this function. Format: line number | issue | suggested fix.
The first invites a comprehensive essay. The second constrains the response to structured, actionable data. Apply this pattern consistently:
| Instead Of | Use |
|---|---|
| "Explain how this works" | "List the 3 main inputs and 2 outputs" |
| "Improve this code" | "Reduce cyclomatic complexity below 10 without changing behavior" |
| "What are the issues?" | "Identify security vulnerabilities only; ignore style issues" |
| "Write tests for this" | "Generate 5 test cases: 2 happy path, 2 edge cases, 1 error case" |
Request Structured Output
All four tools support structured output formats that reduce token waste from conversational filler.
JSON mode for data extraction:
Analyze this API response handling code. Return a JSON object with:
- "error_patterns": array of error handling patterns found
- "missing_cases": array of unhandled error scenarios
- "refactor_priority": number 1-10
Markdown tables for comparisons:
Compare these three authentication approaches. Output as a markdown table with columns: Approach, Pros, Cons, Best For.
Bullet constraints:
Summarize the key changes needed in exactly 5 bullet points, max 10 words each.
Eliminate Redundant Prefaces
AI coding assistants often prepend explanations before showing code. Stop this behavior explicitly:
Show only the modified code. No explanations before or after.
Or for gradual disclosure:
Provide the implementation first. I will ask for explanation only if needed.
This simple constraint cuts 30-50% of response tokens on implementation tasks.
Use Follow-Up Constraints
When iterating, prevent the AI from resending complete context:
Show only what changed since the previous version. Do not repeat unchanged code.
All four tools support diff-style responses:
Output as a unified diff format.
Diffs are token-efficient because they only show modified lines with context.
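The savings are easy to demonstrate with Python's standard difflib, which produces the same unified format: only changed lines plus a few header lines travel, no matter how large the file is:

```python
# Unified diff of two versions of a function: only the changed lines plus
# short headers are emitted, regardless of the full file's size.
import difflib

before = ["def greet(name):", "    return 'Hello ' + name"]
after = ["def greet(name: str) -> str:", "    return f'Hello {name}'"]

diff = list(difflib.unified_diff(before, after,
                                 fromfile="a/greet.py", tofile="b/greet.py",
                                 lineterm=""))
print("\n".join(diff))
```

For a two-line change in a 500-line file, the diff is still under ten lines; asking for unified-diff output turns that ratio directly into token savings.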
Tool-Specific Optimization Techniques
Each CLI tool has unique features and quirks. Understanding them unlocks additional savings.
Claude Code Optimization
Claude Code excels at large-context analysis but can become expensive without discipline.
Use the Context7 pattern:
Instead of sending entire files, create summary documents that capture essential structure:
/project-docs/
├── api-surface.md # Exported functions and types
├── data-models.md # Database schema and interfaces
├── dependencies.md # External service integrations
└── architecture.md # High-level component relationships
Reference these summaries rather than source code:
Using the API surface documented in /project-docs/api-surface.md, implement the new endpoint described below...
This trades one-time summary generation for ongoing per-task savings.
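Generating those summaries can itself be partly automated. A naive sketch that pulls exported declarations out of TypeScript sources with a regex; it misses multi-line signatures and re-exports, and a real implementation would use the TypeScript compiler API, but it is enough to seed an api-surface.md file:

```python
# Extract exported declarations from TypeScript source with a regex.
# Naive by design: single-line declarations only. Good enough to seed
# an api-surface.md summary that later prompts can reference.
import re

EXPORT_RE = re.compile(
    r"^export\s+(?:async\s+)?(?:function|const|class|interface|type)\s+\w+[^\n{=]*"
)

def summarize_exports(source):
    return [m.group(0).strip()
            for line in source.splitlines()
            if (m := EXPORT_RE.match(line))]
```

Run it over each module and write the results into api-surface.md; subsequent prompts reference a few hundred tokens of signatures instead of thousands of tokens of implementation.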
Leverage the plugin system:
Claude Code plugins can preprocess context before sending. The code-simplifier plugin, for example, removes comments and normalizes formatting, reducing token count by 15-20% for analysis tasks.
/claude plugin install code-simplifier
/claude use code-simplifier analyze src/utils/
Batch related operations:
Instead of five separate requests, combine them:
Perform these operations in sequence:
1. Refactor validateUser() to use early returns
2. Extract the email validation into a separate function
3. Add TypeScript types to both functions
4. Generate unit tests for the extracted validation function
Show the final state of all modified files only.
Use mode-specific prompts:
Claude Code has different operational modes. Match your request to the appropriate mode:
- /compact mode for quick, low-token responses
- /verbose mode only when you need detailed explanations
- Default mode for balanced operation
Gemini CLI Optimization
Gemini CLI offers strong context control but requires explicit configuration.
Master the system instructions:
Create a .gemini/config.json with default constraints:
{
  "system_instruction": "You are a concise coding assistant. Provide code only unless explicitly asked for explanation. Use minimal comments. Prefer code over prose.",
  "max_output_tokens": 2048,
  "temperature": 0.1
}
This establishes efficient defaults for every session.
Use file targeting aggressively:
Gemini CLI supports precise file selection through multiple mechanisms:
# Target specific functions within files
gemini code --include src/auth.ts:validateToken,refreshToken "add rate limiting"
# Use git-aware context
gemini code --since-last-commit "review my changes"
# Include only modified files
gemini code --diff-only "refactor based on current changes"
Enable speculative decoding:
For supported models, speculative decoding reduces per-token costs significantly:
{
  "enable_speculative_decoding": true,
  "speculative_token_count": 20
}
Cache repeated patterns:
When you find yourself asking similar questions, create prompt templates:
# Create template
echo "Review this code for: 1) security issues, 2) performance problems, 3) type safety. Output as JSON." > ~/.gemini/templates/security-review.txt
# Use template
gemini code --template security-review --include src/auth.ts
Codex CLI Optimization
Codex is designed for integration with OpenAI’s models and benefits from specific API patterns.
Use the completions API for simple tasks:
Not everything needs the full agentic interface. For straightforward code generation:
# Instead of codex "write a function to..."
# Use direct completion for lower cost per token
curl https://api.openai.com/v1/completions \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-3.5-turbo-instruct",
    "prompt": "Write a Python function to validate email addresses using regex. Return only the function.",
    "max_tokens": 200,
    "temperature": 0
  }'
The completions API costs 10-50% less than chat/agent endpoints for equivalent output.
Batch with the Edits API:
For multi-file changes, OpenAI's Edits API reduced token consumption by sending an instruction plus input in a single call. Note that this API and the code-davinci-edit-001 model have since been deprecated; the same instruction-plus-input pattern still applies through the chat completions endpoint:
import openai

response = openai.Edit.create(  # legacy API, shown for the pattern
    model="code-davinci-edit-001",
    input="function greet(name) { return 'Hello ' + name; }",
    instruction="Add TypeScript types and convert to arrow function",
    temperature=0,
)
Implement response caching:
Codex does not cache automatically. Wrap calls in a cache layer:
import hashlib
import json

_cache = {}

def call_codex(prompt, context):
    # Key on the prompt plus a canonical serialization of the context
    key = hashlib.md5(
        (prompt + json.dumps(context, sort_keys=True)).encode()
    ).hexdigest()
    if key not in _cache:
        _cache[key] = make_codex_call(prompt, context)  # the actual API call goes here
    return _cache[key]
This eliminates redundant API calls for repeated operations.
Minimize function calling overhead:
Codex’s function calling feature adds tokens for schema definition. For simple tasks, prefer direct prompting:
# More efficient than function calling for this case
List the 5 most complex functions in src/ with their cyclomatic complexity scores.
Kimi Code Optimization
Kimi Code offers competitive pricing but still benefits from disciplined usage.
Use the streaming output efficiently:
Kimi supports streaming responses. For large outputs, process incrementally rather than buffering:
# Process streaming response without storing full content
total_tokens = 0
for chunk in kimi.stream_response(prompt, context):
process_chunk(chunk)
total_tokens += chunk.token_count
if total_tokens > budget_limit:
break
Leverage the context compression feature:
Kimi offers automatic context compression for long conversations:
kimi --compress-context "continue from previous discussion"
This summarizes prior exchanges instead of resending them verbatim.
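The same idea can be applied by hand for tools without a compression flag: keep the last few exchanges verbatim and replace everything older with a summary from a cheap model. A sketch, where `summarize` stands in for that summarization call and the message-dict shape is an assumption for illustration:

```python
# Rolling compression: keep the last few messages verbatim, collapse the
# rest into a single summary message. `summarize` is a placeholder for a
# call to a cheap model; the dict shape is illustrative, not a tool API.
def compress_history(messages, summarize, keep_last=4):
    if len(messages) <= keep_last:
        return messages  # nothing worth compressing yet
    older, recent = messages[:-keep_last], messages[-keep_last:]
    summary = {"role": "system", "content": "Earlier context: " + summarize(older)}
    return [summary] + recent
```

Compression trades a small, one-time summarization cost for not resending a growing transcript on every turn.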
Use the focus command:
Kimi’s focus command narrows context to specific code regions:
kimi focus src/components/UserProfile.tsx:45-78 "optimize this render function"
Line-number targeting eliminates the need to send entire files.
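Tools without a focus command can get the same effect by slicing the file before building the prompt. A small helper that parses `path:START-END` specs like the one above (the spec format follows this article's convention, not a tool standard):

```python
# Slice a file to a 1-indexed, inclusive line range before prompting.
# Parses "path:START-END" specs; this format mirrors the focus example
# above and is a convention of this guide, not a standard.
def read_region(spec):
    path, _, span = spec.rpartition(":")
    start, end = (int(n) for n in span.split("-"))
    with open(path) as f:
        lines = f.readlines()
    return "".join(lines[start - 1:end])
```

Pasting a 30-line region instead of a 600-line file routinely cuts the input side of a request by an order of magnitude.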
Workflow Patterns That Save Tokens
Beyond tool-specific techniques, certain workflow patterns dramatically reduce cumulative token consumption.
The Specification-First Approach
Do not use AI to explore what you want. Decide first, then use AI for implementation.
Inefficient workflow:
- “What are some ways to implement caching?” (2,000 tokens)
- “Compare Redis versus in-memory caching” (3,000 tokens)
- “Show me a Redis implementation” (2,500 tokens)
- “Actually, let us use in-memory instead” (2,000 tokens)
Efficient workflow:
- Decide on in-memory LRU cache based on requirements
- “Implement an LRU cache class in TypeScript with get/set methods and max size. Include unit tests.” (1,500 tokens total)
The specification-first approach requires human decision-making up front. It eliminates the exploration tax that accumulates when AI does your thinking for you.
The Review-Dont-Generate Pattern
When possible, have AI review rather than generate. Review tasks consume fewer tokens because the context is already present (the code being reviewed), and responses are typically shorter (findings rather than complete implementations).
Generation approach (expensive):
Write a complete authentication middleware for Express.js with JWT validation,
rate limiting, and error handling.
Review approach (efficient):
Review this middleware implementation for security issues:
[paste your draft implementation]
List specific vulnerabilities with line references only.
Draft the code yourself, then use AI for targeted review. This inverts the typical usage pattern and cuts costs by 40-60%.
The Checkpoint Pattern
Frequent small commits reduce the need for AI to understand large changes:
# Make small, focused changes
ai-assisted-change-1
git commit -m "refactor: extract validation logic"
ai-assisted-change-2
git commit -m "refactor: simplify error handling"
ai-assisted-change-3
git commit -m "test: add unit tests for extracted functions"
Each AI interaction works with a smaller diff context. The cumulative token savings are substantial.
The Template Library
Build a personal library of high-quality prompts for common tasks:
prompts/
├── refactor-extract-function.txt
├── add-typescript-types.txt
├── generate-unit-tests.txt
├── security-review.txt
├── performance-analysis.txt
└── documentation-update.txt
Each template includes:
- Optimized wording that produces efficient responses
- Output format specifications
- Context requirements
- Example usage
Over time, this library becomes a force multiplier. You spend less time crafting prompts and more time executing efficiently.
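Loading and filling a template can be a tiny wrapper. A minimal sketch assuming templates live in the prompts/ layout above and use Python `{placeholder}` slots; the slot syntax is an assumption of this sketch, not part of any tool:

```python
# Load a prompt template from the library and fill {placeholder} slots.
# The prompts/ layout matches the tree above; the {slot} syntax is an
# assumption for this sketch, not part of any CLI tool.
from pathlib import Path

def render_prompt(name, library="prompts", **slots):
    template = Path(library, name + ".txt").read_text()
    return template.format(**slots)
```

A wrapper like this keeps the carefully optimized wording in one place, so every invocation benefits from past prompt tuning.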
Advanced Techniques
For teams running significant AI coding workloads, these advanced techniques provide additional savings.
Context Pruning with Embeddings
For large codebases, use vector embeddings to identify relevant context automatically:
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer('all-MiniLM-L6-v2')

def find_relevant_files(query, file_descriptions, top_k=5):
    """Find the most relevant files for a given query."""
    paths = list(file_descriptions.keys())
    query_embedding = model.encode(query, normalize_embeddings=True)
    file_embeddings = model.encode(
        [file_descriptions[p] for p in paths], normalize_embeddings=True
    )
    similarities = file_embeddings @ query_embedding  # cosine similarity
    top_indices = np.argsort(similarities)[::-1][:top_k]
    return [paths[i] for i in top_indices]

# Usage
relevant = find_relevant_files(
    "add user authentication",
    file_descriptions  # map of filepath to description
)
# Only include relevant files in AI context
This reduces context size by 80-90% for large projects while maintaining relevance.
Tiered Model Strategy
Not every task requires the most capable (and expensive) model. Implement a tiered approach:
| Task Type | Model | Cost Relative |
|---|---|---|
| Simple completions, boilerplate | GPT-3.5 / Gemini 1.5 Flash | 0.1x |
| Standard coding, refactoring | GPT-4o-mini / Kimi standard | 0.3x |
| Complex architecture, debugging | GPT-4o / Claude 3.5 Sonnet | 1.0x |
| Novel problems, research | GPT-4.5 / Claude 3 Opus | 3.0x |
Route tasks to the cheapest model that can handle them:
def route_task(prompt, complexity_indicator):
    if complexity_indicator == 'simple':
        return call_cheap_model(prompt)
    elif complexity_indicator == 'standard':
        return call_standard_model(prompt)
    else:
        return call_premium_model(prompt)
Preprocessing Pipelines
Remove token waste before sending to AI:
import re

def preprocess_code(code):
    """Remove non-essential content before AI processing."""
    # Remove comments (naive: also strips '//' inside string literals, e.g. URLs)
    code = re.sub(r'//.*?$', '', code, flags=re.MULTILINE)
    code = re.sub(r'/\*.*?\*/', '', code, flags=re.DOTALL)
    # Collapse multiple blank lines
    code = re.sub(r'\n\s*\n', '\n\n', code)
    # Remove trailing whitespace
    code = re.sub(r'[ \t]+$', '', code, flags=re.MULTILINE)
    return code

# Typical savings: 15-25% reduction
Run this preprocessing automatically for analysis tasks where comments do not add value.
Token Budget Enforcement
Set hard limits and enforce them programmatically:
import tiktoken

class BudgetExceededError(Exception):
    pass

class TokenBudget:
    def __init__(self, max_tokens):
        self.max_tokens = max_tokens
        self.used_tokens = 0
        self.encoder = tiktoken.encoding_for_model("gpt-4")

    def check_budget(self, prompt):
        prompt_tokens = len(self.encoder.encode(prompt))
        if self.used_tokens + prompt_tokens > self.max_tokens:
            raise BudgetExceededError(
                f"Budget exceeded: {self.used_tokens}/{self.max_tokens}"
            )
        return prompt_tokens

    def spend(self, tokens):
        self.used_tokens += tokens

    def remaining(self):
        return self.max_tokens - self.used_tokens

# Usage
budget = TokenBudget(max_tokens=10000)
task_tokens = budget.check_budget(prompt)
response = call_ai(prompt)
budget.spend(task_tokens + response.tokens_used)
Hard budgets force discipline and prevent the gradual cost creep that accompanies unconstrained usage.
Measuring and Monitoring
Optimization requires measurement. Implement tracking to understand your actual costs and progress.
Session-Level Tracking
Wrap your CLI tool invocations to capture token usage:
#!/bin/bash
# ai-helper.sh
LOGFILE="$HOME/.ai-usage.log"
TIMESTAMP=$(date -Iseconds)
echo "[$TIMESTAMP] Starting: $*" >> "$LOGFILE"
# Capture output and extract token info
OUTPUT=$(claude "$@" 2>&1)
EXIT_CODE=$?
# Parse token usage from output (tool-specific)
TOKENS=$(echo "$OUTPUT" | grep -oP 'Used \K[0-9,]+ tokens' || echo "unknown")
echo "[$TIMESTAMP] Completed: $TOKENS" >> "$LOGFILE"
echo "$OUTPUT"
exit $EXIT_CODE
Dashboard Metrics
Aggregate your logs to identify patterns:
import json
from collections import defaultdict

def analyze_usage(logfile):
    daily_costs = defaultdict(float)
    task_costs = defaultdict(float)
    with open(logfile) as f:
        for line in f:
            entry = json.loads(line)
            date = entry['timestamp'][:10]
            task = entry['task_type']
            cost = entry['cost_usd']
            daily_costs[date] += cost
            task_costs[task] += cost
    return {
        'daily': dict(daily_costs),
        'by_task': dict(task_costs),
        'total': sum(daily_costs.values())
    }
Cost Allocation
For teams, attribute costs to projects or features:
# Tag requests with project context
claude --meta project=auth-refactor --meta team=platform "implement OAuth"
Generate reports showing which workstreams consume AI resources:
| Project | Tokens Used | Cost | Efficiency Score |
|---|---|---|---|
| Auth Refactor | 2.3M | $69 | A |
| API Migration | 4.1M | $123 | C |
| Bug Fixes | 890K | $27 | A |
Use this data to identify teams that need optimization coaching.
Real-World Savings Examples
These techniques deliver measurable results. Here are three real implementations:
Example 1: Solo Developer
Before optimization:
- Average daily usage: 450K tokens
- Monthly cost: $180 (Claude Code)
- Primary waste: Unbounded context, redundant explanations
After implementing context curation and output constraints:
- Average daily usage: 165K tokens
- Monthly cost: $66
- Savings: 63%
Example 2: Five-Person Engineering Team
Before optimization:
- Combined monthly usage: 12M tokens
- Monthly cost: $480 (mixed tools)
- Primary waste: Repeated full-file analysis, exploratory prompting
After implementing specification-first workflow and tiered models:
- Combined monthly usage: 4.2M tokens
- Monthly cost: $168
- Savings: 65%
- Productivity impact: None measurable; if anything, slightly higher due to clearer specifications
Example 3: AI-Native Startup
Before optimization:
- Monthly usage: 45M tokens across all tools
- Monthly cost: $1,800
- Primary waste: No context management, premium models for all tasks
After implementing full optimization stack:
- Monthly usage: 16M tokens
- Monthly cost: $640
- Savings: 64%
- Additional benefit: 3x faster response times due to smaller context windows
Building Sustainable Habits
Tools and techniques matter, but sustainable cost control requires behavioral change.
The Five-Second Rule
Before every AI interaction, pause for five seconds to ask:
- What is the minimum context needed for this task?
- What specific output format do I need?
- Is this the right model tier for this complexity?
- Have I checked if this exact request was made recently (cache)?
These five seconds consistently applied save hours of token waste.
Weekly Review Practice
Spend 15 minutes weekly reviewing your AI usage:
- Check total token consumption
- Identify the three most expensive requests
- Determine if they could have been more efficient
- Adjust templates or workflows accordingly
Team Standards
For engineering teams, establish shared conventions:
- Maximum context window sizes for different task types
- Approved prompt templates for common operations
- Model selection guidelines
- Required pre-processing for large codebases
Document these standards and review them monthly.
Conclusion
AI coding assistants deliver transformative productivity when used well. They become expensive liabilities when used carelessly. The difference is not the tools themselves but the discipline with which they are employed.
The techniques in this guide are not theoretical. They have been validated across hundreds of development sessions and multiple engineering teams. Apply them systematically, and you will cut your AI coding costs by 60% or more while maintaining or improving output quality.
Start with context management. It provides the highest leverage for immediate savings. Add prompt engineering discipline next. Layer in tool-specific optimizations as you become comfortable. Measure your progress, share what works with your team, and treat token efficiency as a core engineering competency.
The future of software development involves collaboration with AI. Learning to collaborate efficiently is a skill that pays dividends indefinitely.
Suggested next steps:
- Audit your current AI tool usage for one day using the measurement techniques above
- Implement context profiles for your primary project
- Create three prompt templates for your most common tasks
- Set a token budget for next week and track against it
- Review this guide monthly as you refine your approach