Large language models have transformed what software can do. Applications can now understand natural language, generate content, reason through complex problems, and interact with users conversationally. But integrating LLMs into production applications is not as simple as calling an API. Response times are measured in seconds, not milliseconds. Costs scale with token usage. Model outputs are probabilistic, not deterministic. Context windows limit how much information you can provide.

I have built LLM-powered features for customer support, content generation, and data analysis. I have implemented retrieval-augmented generation that grounds responses in company knowledge bases. I have designed agentic systems that use tools to accomplish multi-step tasks. I have optimized for latency, cost, and reliability. This guide covers the patterns that work: prompt engineering that produces consistent results, RAG architecture that grounds responses in your data, tool calling that extends LLM capabilities, streaming responses for better UX, and production patterns for monitoring and error handling.

Understanding LLM Fundamentals

Diagram showing neural network language model architecture with transformer layers, attention mechanisms, and token flow from input to output Figure 1: LLM architecture overview showing how input text flows through transformer layers with attention mechanisms to produce contextual output.

Tokenization

LLMs process text in tokens, not characters or words. A token is roughly 4 characters or 0.75 words in English.

Implications:

  • Input and output lengths are measured in tokens
  • Pricing is per-token
  • Context windows have token limits (4K, 8K, 128K, etc.)
  • Different models have different tokenizers
import tiktoken
encoder = tiktoken.encoding_for_model("gpt-4")
tokens = encoder.encode("Hello, world!")
count = len(tokens)  # 4 tokens

Temperature and Sampling

Temperature: Controls randomness. 0 = deterministic, 1 = creative, >1 = chaotic.

# Consistent output (good for classification, extraction)
response = client.chat.completions.create(
    model="gpt-5.5", messages=messages, temperature=0.0)

# Creative output (good for brainstorming, content generation)
response = client.chat.completions.create(
    model="gpt-5.5", messages=messages, temperature=0.7)

Top-p (nucleus sampling): Alternative to temperature. Considers tokens whose cumulative probability exceeds threshold p.

Context Windows

The maximum input length an LLM can process.

ModelContext Window
GPT-5.5256K
Claude 4.8 (Opus)256K
Claude 4.6 (Sonnet)200K
Claude 4.5 (Haiku)200K
Gemini 3.52M
Llama 4 Maverick256K
Llama 4 Scout128K
Mistral Medium 3.5128K
Mistral Small 4128K

Strategies for long contexts: chunking, summarization, and RAG (retrieve only relevant context).

Model Landscape (June 2026)

The LLM ecosystem has evolved rapidly. Here are the major models available for production use as of June 2026:

OpenAI: GPT-5.5 is the current flagship, offering 256K context windows and improved reasoning over previous generations. GPT-5.5-mini provides a cost-effective option for simpler tasks. OpenAI also offers specialized models for image generation (DALL-E 4), speech (Whisper v4), and embedding (text-embedding-3-large).

Anthropic: Claude 4.8 (Opus) handles complex reasoning and deep research with 256K context. Claude 4.6 (Sonnet) is the workhorse for everyday tasks — writing, analysis, and automation. Claude 4.5 (Haiku) delivers fast, lightweight responses for quick queries and web search. Anthropic also offers Claude Code for IDE integration and Claude Cowork for agentic task completion.

Google: Gemini 3.5 leads with a 2M token context window, making it ideal for document analysis and long-form content processing. Gemini Nano Banana targets edge devices. Google’s ecosystem includes Veo for video generation, Lyria 3 for audio, and Imagen for images.

Meta (Open Source): Llama 4 Maverick and Llama 4 Scout are natively multimodal, handling text and vision through early fusion architecture. Llama 3.3 remains popular for multilingual use cases, while Llama 3.2 serves edge deployments. All are freely available for fine-tuning and self-hosting.

Mistral: Mistral Medium 3.5 and Mistral Small 4 offer strong European alternatives with 128K context. Mistral’s Vibe agent and Vibe for Code provide agentic workflows for long-horizon tasks and terminal/IDE integration.

Emerging: DeepSeek v4, Qwen 3, and Kimi k2.6 continue to push open-source performance. For specialized domains, models like Med-PaLM 3 (healthcare), AlphaCode 3 (competitive programming), and CodeQwen 2.5 (software engineering) offer domain-tuned capabilities.

Selection criteria: Choose based on context window needs, cost per token, latency requirements, and whether you need multimodal capabilities (vision, audio) or specialized reasoning. For most production applications, GPT-5.5, Claude 4.6 (Sonnet), or Gemini 3.5 provide the best balance of capability and reliability.

Prompt Engineering

Comparison diagram showing traditional AI rule-based system versus modern LLM neural network approach, with flowcharts highlighting differences in architecture and capability Figure 2: Traditional AI rule-based systems versus modern LLM neural networks — architecture and capability differences.

Prompt engineering is the art of structuring inputs to get desired outputs.

Basic Patterns

Zero-shot: No examples, just instructions.

Classify the sentiment of this text as positive, negative, or neutral:
Text: "I love this product!"
Sentiment:

Few-shot: Include examples in prompt.

Classify the sentiment:

Text: "Amazing service!"
Sentiment: positive

Text: "Terrible experience"
Sentiment: negative

Text: "I love this product!"
Sentiment:

System prompts: Set behavior and constraints.

messages = [
    {"role": "system", "content": "You are a helpful customer support agent. Be polite, concise, and always ask clarifying questions if the user's request is unclear."},
    {"role": "user", "content": "My order hasn't arrived."}
]

Advanced Techniques

Chain-of-Thought: Encourage step-by-step reasoning.

Solve this step by step:
Question: A store has 100 apples. They sell 20 and get a shipment of 30. How many apples do they have?

Step 1: Start with 100 apples
Step 2: Subtract 20 sold = 80 apples
Step 3: Add 30 from shipment = 110 apples

Answer: 110 apples

Structured Output: Request JSON for programmatic use.

Extract the following information as JSON:
- name: string
- age: number
- email: string or null

Text: "John is 25 years old. Contact him at [email protected]"

{"name": "John", "age": 25, "email": "[email protected]"}

ReAct (Reasoning + Acting): Combine reasoning with tool use.

You can use these tools:
- search(query): Search the web
- calculate(expression): Calculate mathematical expressions

Question: What is the population of Paris divided by the population of London?

Thought: I need to find the populations of both cities.
Action: search("population of Paris 2026")
Observation: Paris has a population of 2.1 million
Thought: Now I need London's population.
Action: search("population of London 2026")
Observation: London has a population of 8.9 million
Thought: Now I can calculate the ratio.
Action: calculate("2.1 / 8.9")
Observation: 0.236

Answer: The population of Paris is approximately 23.6% of London's population.

Retrieval-Augmented Generation (RAG)

Infographic showing diverse LLM application areas including customer service chatbots, code generation, content creation, data analysis, and language translation with icons and brief descriptions Figure 3: Overview of LLM application areas across business and development contexts.

RAG grounds LLM responses in your proprietary data.

Architecture Overview

Document ingestion:
Documents -> Chunks -> Embeddings -> Vector Database

Query processing:
Query -> Embedding -> Similarity Search -> Retrieved Chunks

Generation:
Query + Retrieved Chunks -> LLM -> Response

Document Chunking

Split documents into semantically coherent chunks.

from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
    separators=["\n\n", "\n", ". ", " ", ""]
)
chunks = text_splitter.split_documents(documents)

Chunking strategies: fixed size, recursive (respects natural boundaries), semantic (uses embeddings), and agentic (LLM decides boundaries).

Vector Database

Store and retrieve embeddings efficiently.

import chromadb
from chromadb.utils.embedding_functions import OpenAIEmbeddingFunction

client = chromadb.PersistentClient(path="./chroma_db")
embedding_function = OpenAIEmbeddingFunction(
    api_key=os.environ["OPENAI_API_KEY"],
    model_name="text-embedding-3-small"
)

collection = client.get_or_create_collection(
    name="documents", embedding_function=embedding_function)

collection.add(
    documents=[chunk.page_content for chunk in chunks],
    metadatas=[chunk.metadata for chunk in chunks],
    ids=[f"chunk_{i}" for i in range(len(chunks))]
)

results = collection.query(
    query_texts=["How do I reset my password?"], n_results=5)

Retrieval Strategies

Basic similarity search:

results = collection.query(query_texts=[user_query], n_results=5)
context = "\n\n".join(results["documents"][0])

Hybrid search (vector + keyword with metadata filtering):

results = collection.query(
    query_texts=[user_query], n_results=10,
    where={"category": "support"})

Reranking: Retrieve many candidates, then rerank with cross-encoder.

from sentence_transformers import CrossEncoder

reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
initial_results = collection.query(query_texts=[query], n_results=20)
candidates = initial_results["documents"][0]
pairs = [[query, doc] for doc in candidates]
scores = reranker.predict(pairs)

reranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
top_contexts = [doc for doc, score in reranked[:5]]

RAG Prompt Template

RAG_PROMPT = """Answer the question based on the provided context.

Context:
{context}

Question: {question}

Instructions:
- Answer only using information from the context
- If the context does not contain the answer, say "I don't have enough information to answer that"
- Cite specific sections from the context when possible

Answer:"""

response = client.chat.completions.create(
    model="gpt-5.5",
    messages=[{"role": "user", "content": RAG_PROMPT.format(
        context=context, question=user_query)}]
)

Tool Calling (Function Calling)

Screenshot of modern customer service chatbot interface showing AI-powered conversation with real-time sentiment analysis and automated response suggestions Figure 4: LLM-powered customer service chatbot interface with sentiment analysis and automated response suggestions.

Extend LLM capabilities by letting them use external tools.

Defining Tools

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "City and country, e.g., 'Paris, France'"
                    },
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                        "default": "celsius"
                    }
                },
                "required": ["location"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "search_orders",
            "description": "Search customer orders by date or product",
            "parameters": {
                "type": "object",
                "properties": {
                    "customer_id": {"type": "string"},
                    "date_from": {"type": "string", "format": "date"},
                    "date_to": {"type": "string", "format": "date"},
                    "product_name": {"type": "string"}
                },
                "required": ["customer_id"]
            }
        }
    }
]

Tool Use Flow

def process_message(user_message):
    messages = [{"role": "user", "content": user_message}]
    response = client.chat.completions.create(
        model="gpt-5.5", messages=messages, tools=tools, tool_choice="auto")
    message = response.choices[0].message

    if message.tool_calls:
        for tool_call in message.tool_calls:
            function_name = tool_call.function.name
            arguments = json.loads(tool_call.function.arguments)
            result = globals()[function_name](**arguments)
            messages.append(message)
            messages.append({
                "role": "tool",
                "tool_call_id": tool_call.id,
                "content": json.dumps(result)
            })
        final_response = client.chat.completions.create(
            model="gpt-5.5", messages=messages)
        return final_response.choices[0].message.content
    return message.content

Agentic Patterns

Multi-step reasoning with tool use.

class Agent:
    def __init__(self, tools, model="gpt-5.5"):
        self.tools = {tool.name: tool for tool in tools}
        self.model = model
        self.messages = []

    def run(self, user_input):
        self.messages.append({"role": "user", "content": user_input})
        for _ in range(10):
            response = client.chat.completions.create(
                model=self.model, messages=self.messages,
                tools=[tool.schema for tool in self.tools.values()])
            message = response.choices[0].message
            self.messages.append(message)
            if not message.tool_calls:
                return message.content
            for tool_call in message.tool_calls:
                result = self.execute_tool(tool_call)
                self.messages.append({
                    "role": "tool",
                    "tool_call_id": tool_call.id,
                    "content": json.dumps(result)
                })
        raise Exception("Max iterations reached")

    def execute_tool(self, tool_call):
        tool = self.tools.get(tool_call.function.name)
        if not tool:
            return {"error": f"Tool {tool_call.function.name} not found"}
        arguments = json.loads(tool_call.function.arguments)
        return tool.run(**arguments)

Streaming Responses

Stream tokens as they are generated for better UX.

Server-Sent Events (SSE)

from flask import Flask, Response, stream_with_context

@app.route('/chat', methods=['POST'])
def chat():
    user_message = request.json['message']
    def generate():
        stream = client.chat.completions.create(
            model="gpt-5.5",
            messages=[{"role": "user", "content": user_message}],
            stream=True)
        for chunk in stream:
            if chunk.choices[0].delta.content:
                data = {"token": chunk.choices[0].delta.content,
                        "finish_reason": chunk.choices[0].finish_reason}
                yield f"data: {json.dumps(data)}\n\n"
        yield "data: [DONE]\n\n"
    return Response(stream_with_context(generate()),
                    mimetype='text/event-stream')

Frontend Integration

const eventSource = new EventSource('/chat?message=' + encodeURIComponent(message));
let response = '';
eventSource.onmessage = (event) => {
  if (event.data === '[DONE]') { eventSource.close(); return; }
  const data = JSON.parse(event.data);
  response += data.token;
  updateUI(response);
};
eventSource.onerror = () => { eventSource.close(); showError(); };

Production Patterns

Retry and Fallback Logic

from tenacity import retry, stop_after_attempt, wait_exponential

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=4, max=10),
    retry=retry_if_exception_type((RateLimitError, TimeoutError)))
def call_llm_with_retry(messages, model="gpt-5.5"):
    return client.chat.completions.create(
        model=model, messages=messages, timeout=30)

def call_with_fallback(messages):
    try:
        return call_llm_with_retry(messages, model="gpt-5.5")
    except Exception as e:
        return call_llm_with_retry(messages, model="gpt-5.5-mini")

Caching

Cache LLM responses for identical inputs.

import hashlib
import redis

cache = redis.Redis()

def cached_llm_call(messages, model="gpt-5.5", ttl=3600):
    messages_str = json.dumps(messages, sort_keys=True)
    cache_key = f"llm:{model}:{hashlib.md5(messages_str.encode()).hexdigest()}"
    cached = cache.get(cache_key)
    if cached:
        return json.loads(cached)
    response = client.chat.completions.create(
        model=model, messages=messages)
    result = {"content": response.choices[0].message.content,
              "usage": response.usage}
    cache.setex(cache_key, ttl, json.dumps(result))
    return result

Cost Tracking

Monitor token usage and costs.

import tiktoken

class LLMTracker:
    def __init__(self):
        self.total_tokens = 0
        self.total_cost = 0

    def track_call(self, model, messages, response):
        encoding = tiktoken.encoding_for_model(model)
        input_tokens = sum(len(encoding.encode(m["content"])) for m in messages)
        output_tokens = response.usage.completion_tokens
        prices = {
            "gpt-4": {"input": 0.03, "output": 0.06},
            "gpt-5.5-mini": {"input": 0.0015, "output": 0.002}}
        model_prices = prices.get(model, prices["gpt-5.5-mini"])
        cost = (input_tokens / 1000 * model_prices["input"] +
                output_tokens / 1000 * model_prices["output"])
        self.total_tokens += input_tokens + output_tokens
        self.total_cost += cost
        logger.info(f"LLM call: {model}, tokens: {input_tokens + output_tokens}, cost: ${cost:.4f}")

Error Handling

class LLMError(Exception): pass
class ContextLengthExceeded(LLMError): pass
class RateLimitExceeded(LLMError): pass

def safe_llm_call(messages, model="gpt-5.5", max_retries=3):
    encoding = tiktoken.encoding_for_model(model)
    total_tokens = sum(len(encoding.encode(m["content"])) for m in messages)
    max_tokens = {"gpt-4": 8192, "gpt-5.5": 128000, "gpt-5.5-mini": 4096}
    if total_tokens > max_tokens.get(model, 4096) * 0.9:
        raise ContextLengthExceeded(
            f"Context length {total_tokens} near limit for {model}")
    try:
        return call_llm_with_retry(messages, model, max_retries)
    except ContextLengthExceeded:
        messages = truncate_messages(messages)
        return safe_llm_call(messages, model, max_retries)
    except RateLimitError:
        raise RateLimitExceeded("Service temporarily unavailable")
    except Exception as e:
        raise LLMError(f"Failed to get response: {e}")

Common Pitfalls

Pitfall 1: Ignoring Token Costs Sending entire documents in every request. Costs escalate quickly. Use RAG to limit context.

Pitfall 2: No Input Validation Passing user input directly to LLM without sanitization. Risk of prompt injection.

# Bad
def summarize(text):
    prompt = f"Summarize: {text}"  # User can inject instructions

# Good
def summarize(text):
    prompt = f"Summarize the following text. Text: {text}\nSummary:"
    # Additional: Use system prompts that constrain behavior

Pitfall 3: Expecting Determinism Same input can produce different outputs. Do not rely on exact string matching.

Pitfall 4: No Timeout Handling LLM calls can take 10-30 seconds. Always set timeouts and handle gracefully.

Pitfall 5: Not Monitoring Costs Unlimited API usage without tracking. Costs can surprise you at scale.

Pitfall 6: Ignoring Rate Limits No backoff strategy. Application fails under load instead of gracefully degrading.

Conclusion

Building LLM-powered applications requires different patterns than traditional software. Prompt engineering is the primary interface. RAG grounds responses in your data. Tool calling extends capabilities. Streaming improves perceived performance.

Design for failure: implement retries, fallbacks, and graceful degradation. Monitor costs religiously. Cache aggressively. Validate and sanitize inputs.

LLMs are powerful tools but not magic. They hallucinate, have latency, and cost money. Use them where they add value: understanding natural language, generating content, reasoning over complex problems. Combine them with traditional software for reliability.

The field evolves rapidly. Today’s best practices may change tomorrow. Build modular architectures that can swap models, adjust prompts, and adapt to new capabilities.


Further Reading

  • OpenAI API documentation: Best practices and patterns
  • LangChain documentation: Framework for LLM applications
  • “Building LLM Apps” by Chip Huyen: Production patterns
  • Pinecone documentation: Vector search
  • “Prompt Engineering Guide” by DAIR.AI: Comprehensive prompt techniques