How to Build a Production-Ready Claude Chatbot in Python

Anthropic's Claude 3.5 Sonnet offers a strong combination of capability, speed, and cost-effectiveness. But there is a wide gap between running a simple API script and deploying a production-ready chatbot.

In a production environment, your chatbot needs to maintain state (conversation memory), stream responses in real time so users are not left waiting, and gracefully handle API rate limits and network failures.

In this guide, we will build a production-grade Claude chatbot in Python using the official Anthropic SDK, complete with memory management, streaming, and robust error handling.

Prerequisites

First, make sure you have the Anthropic Python SDK installed and your API key set as an environment variable.

pip install anthropic
export ANTHROPIC_API_KEY="your-api-key-here"
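
It can help to check for the key at startup and fail with a clear message, rather than discovering the problem on the first user request. A minimal sketch, assuming a hypothetical helper named require_api_key (it is not part of the Anthropic SDK):

```python
import os


def require_api_key(var_name: str = "ANTHROPIC_API_KEY") -> str:
    """Return the API key from the environment, or raise a clear error.

    `require_api_key` is a hypothetical helper for this guide, not an SDK function.
    """
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(
            f"{var_name} is not set. Export it before starting the chatbot."
        )
    return key
```

Calling require_api_key() once before constructing the engine is enough; the returned value does not need to be passed anywhere, since the client reads the same environment variable.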

Step 1: Building the Core Streaming and Memory Engine

A production chatbot cannot make users wait 10 seconds for a full paragraph to generate. We must stream responses chunk-by-chunk. Additionally, we need to maintain a thread history so Claude remembers the context of the conversation.

Here is the core engine, which uses a Python generator to stream text in real time:

from typing import Generator, List, Dict
from anthropic import Anthropic

class ClaudeChatEngine:
    def __init__(self, system_prompt: str = "You are a helpful, concise assistant."):
        # Initialize the client. It automatically looks for ANTHROPIC_API_KEY in env
        self.client = Anthropic()
        self.model = "claude-3-5-sonnet-20241022"
        self.system_prompt = system_prompt
        self.history: List[Dict[str, str]] = []

    def send_message(self, user_message: str) -> Generator[str, None, None]:
        """
        Sends a message to Claude, updates history, and yields tokens as they arrive.
        """
        # Append user message to state history
        self.history.append({"role": "user", "content": user_message})

        assistant_response = ""

        # Initiate a streaming request
        with self.client.messages.stream(
            model=self.model,
            max_tokens=1024,
            system=self.system_prompt,
            messages=self.history
        ) as stream:
            for text in stream.text_stream:
                assistant_response += text
                yield text

        # Append Claude's completed response to history to maintain context
        self.history.append({"role": "assistant", "content": assistant_response})

# Usage Example
if __name__ == "__main__":
    bot = ClaudeChatEngine()
    print("Chatbot initialized. Type 'exit' to quit.\n")

    while True:
        user_input = input("You: ")
        if user_input.lower() == "exit":
            break

        print("Claude: ", end="", flush=True)
        for token in bot.send_message(user_input):
            print(token, end="", flush=True)
        print("\n")
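
One property of this generator-based design is worth spelling out: the assistant turn is appended to history only after the caller has consumed the whole generator. The sketch below illustrates this with a stand-in class, StubChatEngine, whose canned chunks replace the real API stream. Both the class and the chunks are hypothetical and exist only for illustration.

```python
from typing import Dict, Generator, List


class StubChatEngine:
    """Mimics ClaudeChatEngine.send_message with canned chunks instead of an API call."""

    def __init__(self) -> None:
        self.history: List[Dict[str, str]] = []

    def send_message(self, user_message: str) -> Generator[str, None, None]:
        self.history.append({"role": "user", "content": user_message})
        assistant_response = ""
        for text in ["Hello", ", ", "world"]:  # stands in for stream.text_stream
            assistant_response += text
            yield text
        # Reached only once the caller has consumed every chunk
        self.history.append({"role": "assistant", "content": assistant_response})


bot = StubChatEngine()
stream = bot.send_message("Hi")
roles_before = [m["role"] for m in bot.history]   # generator body has not started yet
first_chunk = next(stream)
roles_partial = [m["role"] for m in bot.history]  # only the user turn is recorded
rest = "".join(stream)                            # drain the remaining chunks
roles_after = [m["role"] for m in bot.history]    # assistant turn is now recorded
```

If a caller stops iterating early (for example, when a client disconnects), the assistant turn is never recorded, and the next request would send a history that ends in an unanswered user message.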

Step 2: Making it Production-Ready (Error Handling & Resilience)

The code above works under ideal conditions. But in production, network requests fail, APIs rate-limit you, and unexpected errors occur.

To make this production-ready, we need to wrap our API calls in a resilient layer that catches:

RateLimitError: Raised when you exceed your tokens-per-minute (TPM) or requests-per-minute (RPM) limit (HTTP 429).
APIConnectionError: Raised when network issues prevent a connection to Anthropic's servers.
APIStatusError: Raised when the API returns an error status code (4xx or 5xx), e.g., overloaded servers or invalid requests. RateLimitError is a subclass of this error, so it must be caught first.

Let's refactor our engine to handle these scenarios gracefully, incorporating exponential backoff for rate limits.

import time
import logging
from typing import Generator
from anthropic import (
    APIConnectionError,
    RateLimitError,
    APIStatusError
)

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("ClaudeChat")

class ResilientClaudeChatEngine(ClaudeChatEngine):
    def send_message_safe(self, user_message: str, max_retries: int = 3) -> Generator[str, None, None]:
        """
        Sends a message with exponential backoff for rate limits and robust error handling.
        """
        self.history.append({"role": "user", "content": user_message})
        assistant_response = ""
        completed = False
        delay = 1.0  # Initial delay for backoff

        for attempt in range(max_retries):
            assistant_response = ""  # Start each attempt with a clean buffer
            try:
                with self.client.messages.stream(
                    model=self.model,
                    max_tokens=1024,
                    system=self.system_prompt,
                    messages=self.history
                ) as stream:
                    for text in stream.text_stream:
                        assistant_response += text
                        yield text

                # If we successfully finished streaming, break out of retry loop
                completed = True
                break

            except RateLimitError as e:
                if attempt == max_retries - 1:
                    logger.error(f"Rate limited (429) and out of retries: {e}")
                    yield "\n[Error: The system is currently busy. Please try again in a moment.]"
                    break
                logger.warning(f"Rate limited (429). Retrying in {delay}s... Error: {e}")
                time.sleep(delay)
                delay *= 2  # Exponential backoff

            except APIConnectionError as e:
                logger.error(f"Failed to connect to the Anthropic API: {e}")
                yield "\n[Error: Connection issue. Please check your internet connection.]"
                break

            except APIStatusError as e:
                logger.error(f"Anthropic API returned status code {e.status_code}: {e.message}")
                yield f"\n[Error: An API error occurred (Status {e.status_code}).]"
                break

            except Exception as e:
                # logger.exception also records the traceback
                logger.exception(f"Unexpected error: {e}")
                yield "\n[Error: An unexpected system error occurred.]"
                break

        if completed and assistant_response:
            # Save the assistant turn only after a fully streamed response
            self.history.append({"role": "assistant", "content": assistant_response})
        else:
            # Roll back the unanswered user turn so a failed request
            # does not leave history in an inconsistent state
            self.history.pop()
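
The fixed doubling used above can be factored into a small helper. The sketch below computes capped exponential delays with optional "full jitter" (each delay is drawn at random between 0 and the exponential ceiling), a common variation that spreads out retries when many clients are rate-limited at once. The names backoff_delays, base, cap, and jitter are hypothetical and not part of the Anthropic SDK. Keep in mind that the SDK client also retries some failures on its own (configurable through the client's max_retries option), so manual retries stack on top of that.

```python
import random
from typing import Iterator, Optional


def backoff_delays(
    max_retries: int,
    base: float = 1.0,
    cap: float = 30.0,
    jitter: bool = True,
    rng: Optional[random.Random] = None,
) -> Iterator[float]:
    """Yield one sleep duration (in seconds) per retry attempt.

    The ceiling doubles each attempt (base, 2*base, 4*base, ...) up to `cap`.
    With jitter enabled, each delay is drawn uniformly from [0, ceiling].
    """
    rng = rng or random.Random()
    for attempt in range(max_retries):
        ceiling = min(cap, base * (2 ** attempt))
        yield rng.uniform(0.0, ceiling) if jitter else ceiling


# Without jitter, the schedule matches the doubling in send_message_safe
print(list(backoff_delays(4, jitter=False)))  # [1.0, 2.0, 4.0, 8.0]
```

The cap matters for long retry chains: without it, the seventh attempt alone would wait over a minute.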

Key Takeaways for Production

Keep Context Under Control: Claude 3.5 Sonnet has a 200K-token context window, but sending the entire history on every message increases latency and cost. Implement a sliding window or summarization strategy if your chat sessions exceed 20-30 messages.
System Prompts: Always pass system-level instructions (like persona, guardrails, and output formatting constraints) via the system parameter, not as a user message. This keeps your rules separate from user input and makes the model more likely to follow them consistently.
Environment Variables: Never hardcode your ANTHROPIC_API_KEY. Use env vars or secret managers (like AWS Secrets Manager or HashiCorp Vault) to inject keys at runtime.
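
A sliding window can be as simple as keeping the last N messages. The following is a minimal sketch, assuming a hypothetical helper named trim_history and an arbitrary default of 20 messages. It also drops any leading assistant turn, on the assumption that the conversation sent to the API should open with a user message.

```python
from typing import Dict, List


def trim_history(
    history: List[Dict[str, str]], max_messages: int = 20
) -> List[Dict[str, str]]:
    """Return the most recent messages, starting on a user turn.

    Keeps at most `max_messages` entries and drops leading assistant turns
    so the trimmed list still begins with a user message.
    """
    if max_messages <= 0:
        return []
    trimmed = history[-max_messages:]
    while trimmed and trimmed[0]["role"] != "user":
        trimmed = trimmed[1:]
    return trimmed


# Six alternating turns: m0 (user), m1 (assistant), ..., m5 (assistant)
history = [
    {"role": "user" if i % 2 == 0 else "assistant", "content": f"m{i}"}
    for i in range(6)
]
recent = trim_history(history, max_messages=3)
print([m["content"] for m in recent])  # ['m4', 'm5']
```

In the engine, this could be applied when building the request, for example messages=trim_history(self.history), while the full history stays available for logging or summarization.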

What's Next?

You now have a production-ready core engine for a Claude chatbot. Your next step is to wrap this engine in an API layer like FastAPI to expose it to your frontend, or deploy it directly to a serverless environment.
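
Whichever web framework you choose, the text generator has to be turned into something a browser can consume incrementally. One common option is Server-Sent Events (SSE). The sketch below is framework-agnostic and uses only the standard library; to_sse, the JSON payload shape, and the closing done event are hypothetical choices made for illustration.

```python
import json
from typing import Iterable, Iterator


def to_sse(chunks: Iterable[str]) -> Iterator[str]:
    """Wrap text chunks as Server-Sent Events frames.

    Each chunk is JSON-encoded so that embedded newlines cannot break the
    event framing. The final "done" event is a convention assumed for this
    sketch, not something the SSE format requires.
    """
    for chunk in chunks:
        payload = json.dumps({"text": chunk})
        yield f"data: {payload}\n\n"
    yield "event: done\ndata: {}\n\n"


# Any iterable of strings works, including the engine's generator
frames = list(to_sse(["Hi", "a\nb"]))
```

A framework's streaming response type (for example, FastAPI's StreamingResponse with the text/event-stream media type) could then iterate over to_sse(engine.send_message_safe(user_input)), where engine is an instance of the resilient engine from Step 2.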

Are you building with Claude? Let me know in the comments below what challenges you're facing with context windows or latency!