Anthropic's Claude 3.5 Sonnet has set a new benchmark for intelligence, speed, and cost-effectiveness. But there is a massive gulf between running a simple API script and deploying a production-ready chatbot.
In a production environment, your chatbot needs to handle state (conversation memory), stream responses in real time to prevent user drop-off, and gracefully handle API rate limits and network failures.
In this guide, we will build a production-grade Claude chatbot in Python using the official Anthropic SDK, complete with memory management, streaming, and robust error handling.
Prerequisites
First, make sure you have the Anthropic Python SDK installed and your API key set as an environment variable.
pip install anthropic
export ANTHROPIC_API_KEY="your-api-key-here"
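If the variable is missing, the problem otherwise tends to surface only once the client is created or the first request is sent. A minimal sketch of a fail-fast startup check follows; `require_api_key` is a hypothetical helper name for this guide, not part of the Anthropic SDK.

```python
import os


def require_api_key(var_name: str = "ANTHROPIC_API_KEY") -> str:
    """Return the API key from the environment or raise a clear error.

    Hypothetical helper for this guide; it is not an SDK function.
    """
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(
            f"{var_name} is not set. Export it before starting the chatbot."
        )
    return key
```

Calling `require_api_key()` once at startup turns a missing key into an immediate, readable error instead of a failure in the middle of a conversation.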
Step 1: Building the Core Streaming and Memory Engine
A production chatbot cannot make users wait 10 seconds for a full paragraph to generate. We must stream responses chunk-by-chunk. Additionally, we need to maintain a thread history so Claude remembers the context of the conversation.
Here is the core engine, which uses a Python generator to yield text chunks as they arrive:
import os
from typing import Generator, List, Dict

from anthropic import Anthropic


class ClaudeChatEngine:
    def __init__(self, system_prompt: str = "You are a helpful, concise assistant."):
        # Initialize the client. It automatically looks for ANTHROPIC_API_KEY in env
        self.client = Anthropic()
        self.model = "claude-3-5-sonnet-20241022"
        self.system_prompt = system_prompt
        self.history: List[Dict[str, str]] = []

    def send_message(self, user_message: str) -> Generator[str, None, None]:
        """
        Sends a message to Claude, updates history, and yields tokens as they arrive.
        """
        # Append user message to state history
        self.history.append({"role": "user", "content": user_message})
        assistant_response = ""

        # Initiate a streaming request
        with self.client.messages.stream(
            model=self.model,
            max_tokens=1024,
            system=self.system_prompt,
            messages=self.history
        ) as stream:
            for text in stream.text_stream:
                assistant_response += text
                yield text

        # Append Claude's completed response to history to maintain context
        self.history.append({"role": "assistant", "content": assistant_response})


# Usage Example
if __name__ == "__main__":
    bot = ClaudeChatEngine()
    print("Chatbot initialized. Type 'exit' to quit.\n")

    while True:
        user_input = input("You: ")
        if user_input.lower() == "exit":
            break

        print("Claude: ", end="", flush=True)
        for token in bot.send_message(user_input):
            print(token, end="", flush=True)
        print("\n")
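To exercise the stream-and-remember pattern without network access or an API key, one option is a small stand-in client. The sketch below is an assumption-laden illustration: `FakeMessages`, `FakeClient`, and `stream_reply` are hypothetical names, and the fake mimics only the `messages.stream(...)` context manager and its `text_stream` attribute used in the engine above.

```python
from contextlib import contextmanager
from types import SimpleNamespace
from typing import Dict, Iterator, List


class FakeMessages:
    """Hypothetical stand-in for `client.messages`; replays canned chunks."""

    def __init__(self, chunks: List[str]):
        self.chunks = chunks

    @contextmanager
    def stream(self, **kwargs):
        # Expose only the `text_stream` attribute that the engine reads.
        yield SimpleNamespace(text_stream=iter(self.chunks))


class FakeClient:
    """Hypothetical stand-in for the `Anthropic` client."""

    def __init__(self, chunks: List[str]):
        self.messages = FakeMessages(chunks)


def stream_reply(client, history: List[Dict[str, str]], user_message: str) -> Iterator[str]:
    """Same pattern as send_message: record the user turn, yield chunks, record the reply."""
    history.append({"role": "user", "content": user_message})
    reply = ""
    with client.messages.stream(messages=history) as stream:
        for text in stream.text_stream:
            reply += text
            yield text
    # Reached only when the caller consumes the generator to the end.
    history.append({"role": "assistant", "content": reply})


history: List[Dict[str, str]] = []
chunks = list(stream_reply(FakeClient(["Hel", "lo", "!"]), history, "Hi"))
print(chunks)
print(history)
```

Because the function is a generator, the assistant turn is written to the history only after the caller has consumed every chunk. A caller that stops iterating early leaves the user turn without a matching reply.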
Step 2: Making it Production-Ready (Error Handling & Resilience)
The code above works perfectly under ideal conditions. But in production, network requests fail, APIs rate-limit you, and unexpected errors occur.
To make this production-ready, we need to wrap our API calls in a resilient layer that catches:
- RateLimitError: When you exceed your tokens-per-minute (TPM) or requests-per-minute (RPM).
- APIConnectionError: When network issues prevent connection to Anthropic's servers.
- APIStatusError: When the API returns a 4xx or 5xx HTTP status code (e.g., overloaded servers or invalid requests). RateLimitError is a subclass of APIStatusError, so it must be caught first.
Let's refactor our engine to handle these scenarios gracefully, incorporating exponential backoff for rate limits.
import time
import logging
from typing import Generator

from anthropic import (
    Anthropic,
    APIConnectionError,
    RateLimitError,
    APIStatusError,
)

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("ClaudeChat")


# Assumes ClaudeChatEngine from Step 1 is defined in the same module
class ResilientClaudeChatEngine(ClaudeChatEngine):
    def send_message_safe(self, user_message: str, max_retries: int = 3) -> Generator[str, None, None]:
        """
        Sends a message with exponential backoff for rate limits and robust error handling.
        """
        self.history.append({"role": "user", "content": user_message})
        assistant_response = ""
        delay = 1.0  # Initial delay for backoff

        for attempt in range(max_retries):
            assistant_response = ""  # Discard partial text from a failed attempt
            try:
                with self.client.messages.stream(
                    model=self.model,
                    max_tokens=1024,
                    system=self.system_prompt,
                    messages=self.history
                ) as stream:
                    for text in stream.text_stream:
                        assistant_response += text
                        yield text
                # If we successfully finished streaming, break out of retry loop
                break

            # RateLimitError subclasses APIStatusError, so it must be handled first
            except RateLimitError as e:
                if attempt == max_retries - 1:
                    logger.error(f"Rate limited (429). Retries exhausted. Error: {e}")
                    self.history.pop()  # Drop the unanswered user turn
                    yield "\n[Error: The system is currently busy. Please try again in a moment.]"
                    return
                logger.warning(f"Rate limited (429). Retrying in {delay}s... Error: {e}")
                time.sleep(delay)
                delay *= 2  # Exponential backoff

            except APIConnectionError as e:
                logger.error(f"Failed to connect to Anthropic API: {e}")
                self.history.pop()  # Drop the unanswered user turn
                yield "\n[Error: Connection issue. Please check your internet connection.]"
                return

            except APIStatusError as e:
                logger.error(f"Anthropic API returned status code {e.status_code}: {e.message}")
                self.history.pop()  # Drop the unanswered user turn
                yield f"\n[Error: An API error occurred (Status {e.status_code}).]"
                return

            except Exception as e:
                logger.error(f"Unexpected error: {e}")
                self.history.pop()  # Drop the unanswered user turn
                yield "\n[Error: An unexpected system error occurred.]"
                return

        # Save the reply if we got one; otherwise drop the unanswered user turn
        # so the next request does not resend it
        if assistant_response:
            self.history.append({"role": "assistant", "content": assistant_response})
        else:
            self.history.pop()
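The doubling delay above retries on a fixed schedule, so many clients that are throttled at the same moment will also retry at the same moment. A common refinement is capped exponential backoff with random jitter. The sketch below shows one variant ("full jitter"); `backoff_delays` and its parameters are hypothetical names chosen for illustration, not part of the SDK.

```python
import random
from typing import Iterator, Optional


def backoff_delays(
    max_retries: int = 3,
    base: float = 1.0,
    cap: float = 30.0,
    rng: Optional[random.Random] = None,
) -> Iterator[float]:
    """Yield one sleep duration per retry: capped exponential backoff with full jitter."""
    rng = rng or random.Random()
    for attempt in range(max_retries):
        # The ceiling doubles each attempt (base, 2*base, 4*base, ...) up to `cap`.
        ceiling = min(cap, base * (2 ** attempt))
        # Full jitter: pick uniformly in [0, ceiling] so clients do not retry in lockstep.
        yield rng.uniform(0.0, ceiling)


for delay in backoff_delays(max_retries=4, base=1.0, cap=5.0):
    print(f"would sleep for {delay:.2f}s")
```

Each yielded value could replace the fixed `delay` passed to `time.sleep`. The SDK client also accepts its own `max_retries` option and retries some errors automatically, so manual retries stack on top of that behavior unless you change the client setting.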
Key Takeaways for Production
- Keep Context Under Control: Claude 3.5 Sonnet has a 200K-token context window, but sending the entire history on every message increases latency and cost. Implement a sliding window or summarization strategy if your chat sessions routinely exceed 20-30 messages.
- System Prompts: Always pass system-level instructions (like persona, guardrails, and output formatting constraints) via the system parameter, not as a user message. This keeps your rules separate from user input and makes the model more likely to follow them.
- Environment Variables: Never hardcode your ANTHROPIC_API_KEY. Use env vars or secret managers (like AWS Secrets Manager or HashiCorp Vault) to inject keys at runtime.
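The sliding-window idea from the first takeaway can be sketched as a small helper that keeps only the most recent messages. This is a minimal illustration under stated assumptions: `trim_history` and the `max_messages` default are hypothetical, and the window counts messages rather than tokens.

```python
from typing import Dict, List


def trim_history(history: List[Dict[str, str]], max_messages: int = 20) -> List[Dict[str, str]]:
    """Return the most recent messages, starting on a user turn."""
    if max_messages <= 0:
        return []
    recent = history[-max_messages:]
    # Keep the conversation opening with a user turn: drop any leading
    # assistant messages left over from the cut.
    while recent and recent[0]["role"] != "user":
        recent = recent[1:]
    return recent


# Build a six-message conversation that alternates user and assistant turns.
conversation = [
    {"role": "user" if i % 2 == 0 else "assistant", "content": f"message {i}"}
    for i in range(6)
]
window = trim_history(conversation, max_messages=3)
print([m["content"] for m in window])
```

Passing `trim_history(self.history)` as the `messages` argument, instead of the full list, would cap the request size while leaving the stored history intact. A token-based budget or a running summary of older turns are alternative strategies with different trade-offs.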
What's Next?
You now have a production-ready core engine for a Claude chatbot. Your next step is to wrap this engine in an API layer like FastAPI to expose it to your frontend, or deploy it directly to a serverless environment.
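Whichever API layer you choose, the generator's chunks need some framing on the wire. One common option is Server-Sent Events. The sketch below uses only the standard library; `to_sse`, the `{"text": ...}` payload shape, and the `done` event name are assumptions made for illustration.

```python
import json
from typing import Iterable, Iterator


def to_sse(tokens: Iterable[str]) -> Iterator[str]:
    """Wrap text chunks as Server-Sent Events frames."""
    for token in tokens:
        # JSON-encode the payload so newlines inside a chunk cannot break the frame.
        payload = json.dumps({"text": token})
        yield f"data: {payload}\n\n"
    # Hypothetical end-of-stream marker; the event name is a convention of this sketch.
    yield "event: done\ndata: {}\n\n"


for frame in to_sse(["Hello", ", world"]):
    print(repr(frame))
```

In a web framework, the resulting iterator could be handed to a streaming response type with the `text/event-stream` media type, for example FastAPI's `StreamingResponse`, with `to_sse(bot.send_message_safe(user_input))` as the body.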
Are you building with Claude? Let me know in the comments below what challenges you're facing with context windows or latency!