Large language models have transformed what software can do. Applications can now understand natural language, generate content, reason through complex problems, and interact with users conversationally. But integrating LLMs into production applications is not as simple as calling an API. Response times are measured in seconds, not milliseconds. Costs scale with token usage. Model outputs are probabilistic, not deterministic. Context windows limit how much information you can provide.
I have built LLM-powered features for customer support, content generation, and data analysis. I have implemented retrieval-augmented generation that grounds responses in company knowledge bases. I have designed agentic systems that use tools to accomplish multi-step tasks. I have optimized for latency, cost, and reliability. This guide covers the patterns that work: prompt engineering that produces consistent results, RAG architecture that grounds responses in your data, tool calling that extends LLM capabilities, streaming responses for better UX, and production patterns for monitoring and error handling.
Understanding LLM Fundamentals
Figure 1: LLM architecture overview showing how input text flows through transformer layers with attention mechanisms to produce contextual output.
Tokenization
LLMs process text in tokens, not characters or words. A token is roughly 4 characters or 0.75 words in English.
Implications:
- Input and output lengths are measured in tokens
- Pricing is per-token
- Context windows have token limits (4K, 8K, 128K, etc.)
- Different models have different tokenizers
import tiktoken
encoder = tiktoken.encoding_for_model("gpt-4")
tokens = encoder.encode("Hello, world!")
count = len(tokens) # 4 tokens
Temperature and Sampling
Temperature: Controls randomness. 0 = deterministic, 1 = creative, >1 = chaotic.
# Consistent output (good for classification, extraction)
response = client.chat.completions.create(
model="gpt-5.5", messages=messages, temperature=0.0)
# Creative output (good for brainstorming, content generation)
response = client.chat.completions.create(
model="gpt-5.5", messages=messages, temperature=0.7)
Top-p (nucleus sampling): Alternative to temperature. Considers tokens whose cumulative probability exceeds threshold p.
Context Windows
The maximum input length an LLM can process.
| Model | Context Window |
|---|---|
| GPT-5.5 | 256K |
| Claude 4.8 (Opus) | 256K |
| Claude 4.6 (Sonnet) | 200K |
| Claude 4.5 (Haiku) | 200K |
| Gemini 3.5 | 2M |
| Llama 4 Maverick | 256K |
| Llama 4 Scout | 128K |
| Mistral Medium 3.5 | 128K |
| Mistral Small 4 | 128K |
Strategies for long contexts: chunking, summarization, and RAG (retrieve only relevant context).
Model Landscape (June 2026)
The LLM ecosystem has evolved rapidly. Here are the major models available for production use as of June 2026:
OpenAI: GPT-5.5 is the current flagship, offering 256K context windows and improved reasoning over previous generations. GPT-5.5-mini provides a cost-effective option for simpler tasks. OpenAI also offers specialized models for image generation (DALL-E 4), speech (Whisper v4), and embedding (text-embedding-3-large).
Anthropic: Claude 4.8 (Opus) handles complex reasoning and deep research with 256K context. Claude 4.6 (Sonnet) is the workhorse for everyday tasks — writing, analysis, and automation. Claude 4.5 (Haiku) delivers fast, lightweight responses for quick queries and web search. Anthropic also offers Claude Code for IDE integration and Claude Cowork for agentic task completion.
Google: Gemini 3.5 leads with a 2M token context window, making it ideal for document analysis and long-form content processing. Gemini Nano Banana targets edge devices. Google’s ecosystem includes Veo for video generation, Lyria 3 for audio, and Imagen for images.
Meta (Open Source): Llama 4 Maverick and Llama 4 Scout are natively multimodal, handling text and vision through early fusion architecture. Llama 3.3 remains popular for multilingual use cases, while Llama 3.2 serves edge deployments. All are freely available for fine-tuning and self-hosting.
Mistral: Mistral Medium 3.5 and Mistral Small 4 offer strong European alternatives with 128K context. Mistral’s Vibe agent and Vibe for Code provide agentic workflows for long-horizon tasks and terminal/IDE integration.
Emerging: DeepSeek v4, Qwen 3, and Kimi k2.6 continue to push open-source performance. For specialized domains, models like Med-PaLM 3 (healthcare), AlphaCode 3 (competitive programming), and CodeQwen 2.5 (software engineering) offer domain-tuned capabilities.
Selection criteria: Choose based on context window needs, cost per token, latency requirements, and whether you need multimodal capabilities (vision, audio) or specialized reasoning. For most production applications, GPT-5.5, Claude 4.6 (Sonnet), or Gemini 3.5 provide the best balance of capability and reliability.
Prompt Engineering
Figure 2: Traditional AI rule-based systems versus modern LLM neural networks — architecture and capability differences.
Prompt engineering is the art of structuring inputs to get desired outputs.
Basic Patterns
Zero-shot: No examples, just instructions.
Classify the sentiment of this text as positive, negative, or neutral:
Text: "I love this product!"
Sentiment:
Few-shot: Include examples in prompt.
Classify the sentiment:
Text: "Amazing service!"
Sentiment: positive
Text: "Terrible experience"
Sentiment: negative
Text: "I love this product!"
Sentiment:
System prompts: Set behavior and constraints.
messages = [
{"role": "system", "content": "You are a helpful customer support agent. Be polite, concise, and always ask clarifying questions if the user's request is unclear."},
{"role": "user", "content": "My order hasn't arrived."}
]
Advanced Techniques
Chain-of-Thought: Encourage step-by-step reasoning.
Solve this step by step:
Question: A store has 100 apples. They sell 20 and get a shipment of 30. How many apples do they have?
Step 1: Start with 100 apples
Step 2: Subtract 20 sold = 80 apples
Step 3: Add 30 from shipment = 110 apples
Answer: 110 apples
Structured Output: Request JSON for programmatic use.
Extract the following information as JSON:
- name: string
- age: number
- email: string or null
Text: "John is 25 years old. Contact him at [email protected]"
{"name": "John", "age": 25, "email": "[email protected]"}
ReAct (Reasoning + Acting): Combine reasoning with tool use.
You can use these tools:
- search(query): Search the web
- calculate(expression): Calculate mathematical expressions
Question: What is the population of Paris divided by the population of London?
Thought: I need to find the populations of both cities.
Action: search("population of Paris 2026")
Observation: Paris has a population of 2.1 million
Thought: Now I need London's population.
Action: search("population of London 2026")
Observation: London has a population of 8.9 million
Thought: Now I can calculate the ratio.
Action: calculate("2.1 / 8.9")
Observation: 0.236
Answer: The population of Paris is approximately 23.6% of London's population.
Retrieval-Augmented Generation (RAG)
Figure 3: Overview of LLM application areas across business and development contexts.
RAG grounds LLM responses in your proprietary data.
Architecture Overview
Document ingestion:
Documents -> Chunks -> Embeddings -> Vector Database
Query processing:
Query -> Embedding -> Similarity Search -> Retrieved Chunks
Generation:
Query + Retrieved Chunks -> LLM -> Response
Document Chunking
Split documents into semantically coherent chunks.
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=1000,
chunk_overlap=200,
length_function=len,
separators=["\n\n", "\n", ". ", " ", ""]
)
chunks = text_splitter.split_documents(documents)
Chunking strategies: fixed size, recursive (respects natural boundaries), semantic (uses embeddings), and agentic (LLM decides boundaries).
Vector Database
Store and retrieve embeddings efficiently.
import chromadb
from chromadb.utils.embedding_functions import OpenAIEmbeddingFunction
client = chromadb.PersistentClient(path="./chroma_db")
embedding_function = OpenAIEmbeddingFunction(
api_key=os.environ["OPENAI_API_KEY"],
model_name="text-embedding-3-small"
)
collection = client.get_or_create_collection(
name="documents", embedding_function=embedding_function)
collection.add(
documents=[chunk.page_content for chunk in chunks],
metadatas=[chunk.metadata for chunk in chunks],
ids=[f"chunk_{i}" for i in range(len(chunks))]
)
results = collection.query(
query_texts=["How do I reset my password?"], n_results=5)
Retrieval Strategies
Basic similarity search:
results = collection.query(query_texts=[user_query], n_results=5)
context = "\n\n".join(results["documents"][0])
Hybrid search (vector + keyword with metadata filtering):
results = collection.query(
query_texts=[user_query], n_results=10,
where={"category": "support"})
Reranking: Retrieve many candidates, then rerank with cross-encoder.
from sentence_transformers import CrossEncoder
reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
initial_results = collection.query(query_texts=[query], n_results=20)
candidates = initial_results["documents"][0]
pairs = [[query, doc] for doc in candidates]
scores = reranker.predict(pairs)
reranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
top_contexts = [doc for doc, score in reranked[:5]]
RAG Prompt Template
RAG_PROMPT = """Answer the question based on the provided context.
Context:
{context}
Question: {question}
Instructions:
- Answer only using information from the context
- If the context does not contain the answer, say "I don't have enough information to answer that"
- Cite specific sections from the context when possible
Answer:"""
response = client.chat.completions.create(
model="gpt-5.5",
messages=[{"role": "user", "content": RAG_PROMPT.format(
context=context, question=user_query)}]
)
Tool Calling (Function Calling)
Figure 4: LLM-powered customer service chatbot interface with sentiment analysis and automated response suggestions.
Extend LLM capabilities by letting them use external tools.
Defining Tools
tools = [
{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get current weather for a location",
"parameters": {
"type": "object",
"properties": {
"location": {
"type": "string",
"description": "City and country, e.g., 'Paris, France'"
},
"unit": {
"type": "string",
"enum": ["celsius", "fahrenheit"],
"default": "celsius"
}
},
"required": ["location"]
}
}
},
{
"type": "function",
"function": {
"name": "search_orders",
"description": "Search customer orders by date or product",
"parameters": {
"type": "object",
"properties": {
"customer_id": {"type": "string"},
"date_from": {"type": "string", "format": "date"},
"date_to": {"type": "string", "format": "date"},
"product_name": {"type": "string"}
},
"required": ["customer_id"]
}
}
}
]
Tool Use Flow
def process_message(user_message):
messages = [{"role": "user", "content": user_message}]
response = client.chat.completions.create(
model="gpt-5.5", messages=messages, tools=tools, tool_choice="auto")
message = response.choices[0].message
if message.tool_calls:
for tool_call in message.tool_calls:
function_name = tool_call.function.name
arguments = json.loads(tool_call.function.arguments)
result = globals()[function_name](**arguments)
messages.append(message)
messages.append({
"role": "tool",
"tool_call_id": tool_call.id,
"content": json.dumps(result)
})
final_response = client.chat.completions.create(
model="gpt-5.5", messages=messages)
return final_response.choices[0].message.content
return message.content
Agentic Patterns
Multi-step reasoning with tool use.
class Agent:
def __init__(self, tools, model="gpt-5.5"):
self.tools = {tool.name: tool for tool in tools}
self.model = model
self.messages = []
def run(self, user_input):
self.messages.append({"role": "user", "content": user_input})
for _ in range(10):
response = client.chat.completions.create(
model=self.model, messages=self.messages,
tools=[tool.schema for tool in self.tools.values()])
message = response.choices[0].message
self.messages.append(message)
if not message.tool_calls:
return message.content
for tool_call in message.tool_calls:
result = self.execute_tool(tool_call)
self.messages.append({
"role": "tool",
"tool_call_id": tool_call.id,
"content": json.dumps(result)
})
raise Exception("Max iterations reached")
def execute_tool(self, tool_call):
tool = self.tools.get(tool_call.function.name)
if not tool:
return {"error": f"Tool {tool_call.function.name} not found"}
arguments = json.loads(tool_call.function.arguments)
return tool.run(**arguments)
Streaming Responses
Stream tokens as they are generated for better UX.
Server-Sent Events (SSE)
from flask import Flask, Response, stream_with_context
@app.route('/chat', methods=['POST'])
def chat():
user_message = request.json['message']
def generate():
stream = client.chat.completions.create(
model="gpt-5.5",
messages=[{"role": "user", "content": user_message}],
stream=True)
for chunk in stream:
if chunk.choices[0].delta.content:
data = {"token": chunk.choices[0].delta.content,
"finish_reason": chunk.choices[0].finish_reason}
yield f"data: {json.dumps(data)}\n\n"
yield "data: [DONE]\n\n"
return Response(stream_with_context(generate()),
mimetype='text/event-stream')
Frontend Integration
const eventSource = new EventSource('/chat?message=' + encodeURIComponent(message));
let response = '';
eventSource.onmessage = (event) => {
if (event.data === '[DONE]') { eventSource.close(); return; }
const data = JSON.parse(event.data);
response += data.token;
updateUI(response);
};
eventSource.onerror = () => { eventSource.close(); showError(); };
Production Patterns
Retry and Fallback Logic
from tenacity import retry, stop_after_attempt, wait_exponential
@retry(
stop=stop_after_attempt(3),
wait=wait_exponential(multiplier=1, min=4, max=10),
retry=retry_if_exception_type((RateLimitError, TimeoutError)))
def call_llm_with_retry(messages, model="gpt-5.5"):
return client.chat.completions.create(
model=model, messages=messages, timeout=30)
def call_with_fallback(messages):
try:
return call_llm_with_retry(messages, model="gpt-5.5")
except Exception as e:
return call_llm_with_retry(messages, model="gpt-5.5-mini")
Caching
Cache LLM responses for identical inputs.
import hashlib
import redis
cache = redis.Redis()
def cached_llm_call(messages, model="gpt-5.5", ttl=3600):
messages_str = json.dumps(messages, sort_keys=True)
cache_key = f"llm:{model}:{hashlib.md5(messages_str.encode()).hexdigest()}"
cached = cache.get(cache_key)
if cached:
return json.loads(cached)
response = client.chat.completions.create(
model=model, messages=messages)
result = {"content": response.choices[0].message.content,
"usage": response.usage}
cache.setex(cache_key, ttl, json.dumps(result))
return result
Cost Tracking
Monitor token usage and costs.
import tiktoken
class LLMTracker:
def __init__(self):
self.total_tokens = 0
self.total_cost = 0
def track_call(self, model, messages, response):
encoding = tiktoken.encoding_for_model(model)
input_tokens = sum(len(encoding.encode(m["content"])) for m in messages)
output_tokens = response.usage.completion_tokens
prices = {
"gpt-4": {"input": 0.03, "output": 0.06},
"gpt-5.5-mini": {"input": 0.0015, "output": 0.002}}
model_prices = prices.get(model, prices["gpt-5.5-mini"])
cost = (input_tokens / 1000 * model_prices["input"] +
output_tokens / 1000 * model_prices["output"])
self.total_tokens += input_tokens + output_tokens
self.total_cost += cost
logger.info(f"LLM call: {model}, tokens: {input_tokens + output_tokens}, cost: ${cost:.4f}")
Error Handling
class LLMError(Exception): pass
class ContextLengthExceeded(LLMError): pass
class RateLimitExceeded(LLMError): pass
def safe_llm_call(messages, model="gpt-5.5", max_retries=3):
encoding = tiktoken.encoding_for_model(model)
total_tokens = sum(len(encoding.encode(m["content"])) for m in messages)
max_tokens = {"gpt-4": 8192, "gpt-5.5": 128000, "gpt-5.5-mini": 4096}
if total_tokens > max_tokens.get(model, 4096) * 0.9:
raise ContextLengthExceeded(
f"Context length {total_tokens} near limit for {model}")
try:
return call_llm_with_retry(messages, model, max_retries)
except ContextLengthExceeded:
messages = truncate_messages(messages)
return safe_llm_call(messages, model, max_retries)
except RateLimitError:
raise RateLimitExceeded("Service temporarily unavailable")
except Exception as e:
raise LLMError(f"Failed to get response: {e}")
Common Pitfalls
Pitfall 1: Ignoring Token Costs Sending entire documents in every request. Costs escalate quickly. Use RAG to limit context.
Pitfall 2: No Input Validation Passing user input directly to LLM without sanitization. Risk of prompt injection.
# Bad
def summarize(text):
prompt = f"Summarize: {text}" # User can inject instructions
# Good
def summarize(text):
prompt = f"Summarize the following text. Text: {text}\nSummary:"
# Additional: Use system prompts that constrain behavior
Pitfall 3: Expecting Determinism Same input can produce different outputs. Do not rely on exact string matching.
Pitfall 4: No Timeout Handling LLM calls can take 10-30 seconds. Always set timeouts and handle gracefully.
Pitfall 5: Not Monitoring Costs Unlimited API usage without tracking. Costs can surprise you at scale.
Pitfall 6: Ignoring Rate Limits No backoff strategy. Application fails under load instead of gracefully degrading.
Conclusion
Building LLM-powered applications requires different patterns than traditional software. Prompt engineering is the primary interface. RAG grounds responses in your data. Tool calling extends capabilities. Streaming improves perceived performance.
Design for failure: implement retries, fallbacks, and graceful degradation. Monitor costs religiously. Cache aggressively. Validate and sanitize inputs.
LLMs are powerful tools but not magic. They hallucinate, have latency, and cost money. Use them where they add value: understanding natural language, generating content, reasoning over complex problems. Combine them with traditional software for reliability.
The field evolves rapidly. Today’s best practices may change tomorrow. Build modular architectures that can swap models, adjust prompts, and adapt to new capabilities.
Further Reading
- OpenAI API documentation: Best practices and patterns
- LangChain documentation: Framework for LLM applications
- “Building LLM Apps” by Chip Huyen: Production patterns
- Pinecone documentation: Vector search
- “Prompt Engineering Guide” by DAIR.AI: Comprehensive prompt techniques