AI providers bill per token. Without per-user limits, a single user can exhaust your entire monthly budget through prompt attacks, runaway loops, or simply heavy legitimate use.
A token bucket rate limit maps directly onto how AI billing works: you estimate the cost of each request in tokens, deduct it from the user’s bucket, and deny the request when the bucket does not hold enough tokens to cover it. The bucket refills over time, giving each user a sustained allowance without sharp rate-limit cliffs.
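The mechanics can be shown with a minimal in-memory token bucket for a single user. It uses the same numbers as the examples in this guide: 2,000 tokens refilled per hour and a capacity of 5,000. This is an illustrative sketch, not Arcjet’s implementation. It assumes the bucket starts full and refills in whole intervals, and Arcjet’s exact refill timing may differ. The `TokenBucket` class and its `consume(requested, now)` method are hypothetical names. `now` is passed in explicitly, in seconds, so the behavior is deterministic.

```python
class TokenBucket:
    """Single-user, in-memory token bucket (illustrative sketch, not Arcjet's code)."""

    def __init__(
        self, refill_rate: int, interval: float, capacity: int, now: float = 0.0
    ) -> None:
        self.refill_rate = refill_rate  # tokens added at the end of each interval
        self.interval = interval  # interval length in seconds
        self.capacity = capacity  # maximum tokens the bucket can hold
        self.tokens = capacity  # assumption: start full, so a new user can burst
        self.last_refill = now  # start of the current, not-yet-completed interval

    def _refill(self, now: float) -> None:
        # Add refill_rate tokens for every whole interval that has elapsed,
        # never exceeding capacity.
        elapsed_intervals = int((now - self.last_refill) // self.interval)
        if elapsed_intervals > 0:
            self.tokens = min(
                self.capacity, self.tokens + elapsed_intervals * self.refill_rate
            )
            self.last_refill += elapsed_intervals * self.interval

    def consume(self, requested: int, now: float) -> bool:
        """Deduct `requested` tokens and allow, or deny if too few tokens remain."""
        self._refill(now)
        if requested > self.tokens:
            return False  # denied requests do not spend any tokens
        self.tokens -= requested
        return True


if __name__ == "__main__":
    # 2,000 tokens per hour, capacity 5,000
    bucket = TokenBucket(refill_rate=2_000, interval=3_600, capacity=5_000)
    print(bucket.consume(4_000, now=0))  # True: 1,000 tokens left
    print(bucket.consume(2_000, now=60))  # False: no refill yet, only 1,000 left
    print(bucket.consume(2_000, now=3_600))  # True: one refill brought it to 3,000
```

Denied requests leave the bucket untouched. A user who is over budget can therefore still send a smaller request, or wait for the next refill.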
Alternatively, you can use a fixed window or sliding window limit to enforce a hard cap on spend per user per day, week, or month. See the rate limiting algorithms reference for details on different approaches.
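A fixed window hard cap works differently. It can be sketched as a per-user counter that resets at each window boundary. This is a hypothetical, in-memory illustration: the `FixedWindowCap` class is not an Arcjet API, and the 10,000-token daily cap was chosen only for the example.

```python
from collections import defaultdict


class FixedWindowCap:
    """Per-user hard cap on tokens per fixed window (illustrative, in-memory sketch)."""

    def __init__(self, max_tokens: int, window_seconds: int) -> None:
        self.max_tokens = max_tokens
        self.window_seconds = window_seconds
        # (user_id, window index) -> tokens spent in that window.
        # Old windows are never evicted here; a real store would expire them.
        self.spent: defaultdict[tuple[str, int], int] = defaultdict(int)

    def consume(self, user_id: str, requested: int, now: float) -> bool:
        # Windows are aligned to multiples of window_seconds (UTC days for
        # 86,400 s when `now` is a Unix timestamp), so the cap resets all at once.
        window = int(now // self.window_seconds)
        key = (user_id, window)
        if self.spent[key] + requested > self.max_tokens:
            return False  # over the cap: deny and spend nothing
        self.spent[key] += requested
        return True


if __name__ == "__main__":
    cap = FixedWindowCap(max_tokens=10_000, window_seconds=86_400)
    print(cap.consume("alice", 6_000, now=0))  # True
    print(cap.consume("alice", 5_000, now=3_600))  # False: 11,000 would exceed the cap
    print(cap.consume("alice", 5_000, now=86_400))  # True: new day, new window
```

Unlike the token bucket, this approach returns nothing to the user during the window. Once the cap is reached, requests are denied until the next boundary. Total spend per window is therefore easy to reason about, at the cost of the sharp cliff that the token bucket avoids.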
Arcjet manages bucket state across all instances of your application, so no Redis or other external state store is required.
Get started
In this example we use the Vercel AI SDK to create a simple AI chat endpoint with Next.js, and Arcjet to enforce per-user token budgets to prevent cost overruns. The same principles can be applied to any AI application, including those built with other frameworks.
We assume you already have a Next.js app set up.
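Install the dependencies:

```sh
# Export your Arcjet API key from https://app.arcjet.com
export ARCJET_KEY="ajkey_..."

# Export your OpenAI API key (read by @ai-sdk/openai)
export OPENAI_API_KEY="sk-..."

npm install @arcjet/next ai @ai-sdk/openai @ai-sdk/react
```

Create an AI chat endpoint at app/api/chat/route.ts, the default endpoint that useChat calls: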
```ts
import { openai } from "@ai-sdk/openai";
import arcjet, { tokenBucket } from "@arcjet/next";
import type { UIMessage } from "ai";
import { convertToModelMessages, streamText } from "ai";

const aj = arcjet({
  key: process.env.ARCJET_KEY!, // Get your site key from https://app.arcjet.com
  // Track budgets per user. Replace "userId" with any stable identifier
  characteristics: ["userId"],
  rules: [
    tokenBucket({
      mode: "LIVE", // Blocks requests. Use "DRY_RUN" to log only
      refillRate: 2_000, // Refill 2,000 tokens per hour
      interval: "1h",
      capacity: 5_000, // Maximum 5,000 tokens in the bucket
    }),
  ],
});

export async function POST(req: Request) {
  // Replace with your session/auth lookup to get a stable user ID
  const userId = "user-123";

  const { messages }: { messages: UIMessage[] } = await req.json();
  const modelMessages = await convertToModelMessages(messages);

  // Estimate token cost: ~1 token per 4 characters of text (rough heuristic).
  // For accurate counts use https://www.npmjs.com/package/tiktoken
  const totalChars = modelMessages.reduce((sum, m) => {
    const content =
      typeof m.content === "string" ? m.content : JSON.stringify(m.content);
    return sum + content.length;
  }, 0);
  const estimate = Math.ceil(totalChars / 4);

  // Deduct the estimated tokens from the user's budget
  const decision = await aj.protect(req, { userId, requested: estimate });

  if (decision.isDenied()) {
    return new Response("AI usage limit exceeded", { status: 429 });
  }

  // streamText returns its result synchronously and streams in the background
  const result = streamText({
    model: openai("gpt-4o"),
    messages: modelMessages,
  });

  return result.toUIMessageStreamResponse();
}
```

And hook it up to a chat UI (for example, in app/page.tsx):

```tsx
"use client";

import { useChat } from "@ai-sdk/react";
import { useState } from "react";

export default function Chat() {
  const [input, setInput] = useState("");
  const [errorMessage, setErrorMessage] = useState<string | null>(null);
  const { messages, sendMessage } = useChat({
    // Show the endpoint's error (e.g. the 429 budget message) to the user
    onError: (error) => setErrorMessage(error.message),
  });

  return (
    <div className="flex flex-col w-full max-w-md py-24 mx-auto stretch">
      {messages.map((message) => (
        <div key={message.id} className="whitespace-pre-wrap">
          {message.role === "user" ? "User: " : "AI: "}
          {message.parts.map((part, i) => {
            switch (part.type) {
              case "text":
                return <div key={`${message.id}-${i}`}>{part.text}</div>;
              default:
                return null; // Ignore non-text parts in this simple UI
            }
          })}
        </div>
      ))}

      {errorMessage && (
        <div className="text-red-500 text-sm mb-4">{errorMessage}</div>
      )}

      <form
        onSubmit={(e) => {
          e.preventDefault();
          if (!input.trim()) return; // Don't spend budget on empty messages
          sendMessage({ text: input });
          setInput("");
          setErrorMessage(null);
        }}
      >
        <input
          className="fixed dark:bg-zinc-900 bottom-0 w-full max-w-md p-2 mb-8 border border-zinc-300 dark:border-zinc-800 rounded shadow-xl"
          value={input}
          placeholder="Say something..."
          onChange={(e) => setInput(e.currentTarget.value)}
        />
      </form>
    </div>
  );
}
```

Then run the server:
```sh
npm run dev
```

You will see requests being processed in your Arcjet dashboard in real time.
In this example we use LangChain to create a simple AI chat server with FastAPI, and Arcjet to enforce per-user token budgets to prevent cost overruns. The same principles can be applied to any AI application, including those built with other frameworks.
Set up the environment and install dependencies (uses uv, but you can also use pip to install the Arcjet Python SDK):
```sh
# Export your Arcjet API key from https://app.arcjet.com
export ARCJET_KEY="ajkey_..."
export ARCJET_ENV=development

# Export your OpenAI API key (used by LangChain)
export OPENAI_API_KEY="sk-..."

# Install dependencies
uv add arcjet fastapi uvicorn langchain langchain-openai
```

Create the chat server in main.py:
```python
import logging
import math
import os

from arcjet import Mode, arcjet, token_bucket
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from pydantic import BaseModel

app = FastAPI()

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

arcjet_key = os.getenv("ARCJET_KEY")
if not arcjet_key:
    raise RuntimeError("ARCJET_KEY is required. Get one at https://app.arcjet.com")

openai_api_key = os.getenv("OPENAI_API_KEY")
if not openai_api_key:
    raise RuntimeError(
        "OPENAI_API_KEY is required. Get one at https://platform.openai.com"
    )

llm = ChatOpenAI(model="gpt-4o-mini", api_key=openai_api_key)

prompt = ChatPromptTemplate.from_messages(
    [
        ("system", "You are a helpful assistant."),
        ("human", "{message}"),
    ]
)

chain = prompt | llm | StrOutputParser()


class ChatRequest(BaseModel):
    message: str


# Create a single Arcjet client at startup and reuse it across requests
aj = arcjet(
    key=arcjet_key,  # Get your key from https://app.arcjet.com
    rules=[
        # Token bucket rate limiting is best for AI budget control
        token_bucket(
            mode=Mode.LIVE,  # Blocks requests. Use Mode.DRY_RUN to log only
            # Track budgets per user. Replace "userId" with any stable
            # identifier. Removing this falls back to IP-based rate limiting.
            characteristics=["userId"],
            refill_rate=2_000,  # Refill 2,000 tokens per interval
            interval=3_600,  # Refill every hour (in seconds)
            capacity=5_000,  # Maximum 5,000 tokens in the bucket
        ),
    ],
)


@app.post("/chat")
async def chat(request: Request, body: ChatRequest):
    # Replace with your session/auth lookup to get a stable user ID
    user_id = "user-123"

    # Estimate token cost: ~1 token per 4 characters of text (rough heuristic).
    # For accurate counts use https://github.com/openai/tiktoken
    estimate = math.ceil(len(body.message) / 4)

    # Deduct the estimated tokens from the user's budget
    decision = await aj.protect(
        request,
        requested=estimate,
        characteristics={"userId": user_id},
    )

    if decision.is_denied():
        # The token_bucket rule is the only rule configured, so the only
        # possible denial reason is RATE_LIMIT (429).
        return JSONResponse({"error": "AI usage limit exceeded"}, status_code=429)

    reply = await chain.ainvoke({"message": body.message})

    return {"reply": reply}
```

Then run the server:
```sh
uv run uvicorn main:app --reload
```

And send a message to the API endpoint:

```sh
curl -X POST http://localhost:8000/chat \
  -H "Content-Type: application/json" \
  -d '{"message": "What is the capital of France?"}'
```

You will see requests being processed in your Arcjet dashboard in real time.
In this example we use LangChain to create a simple AI chat server with Flask, and Arcjet to enforce per-user token budgets to prevent cost overruns. The same principles can be applied to any AI application, including those built with other frameworks.
Set up the environment and install dependencies (uses uv, but you can also use pip to install the Arcjet Python SDK):
```sh
# Export your Arcjet API key from https://app.arcjet.com
export ARCJET_KEY="ajkey_..."
export ARCJET_ENV=development

# Export your OpenAI API key (used by LangChain)
export OPENAI_API_KEY="sk-..."

# Install dependencies
uv add arcjet flask langchain langchain-openai
```

Create the chat server in app.py:
```python
import logging
import math
import os

from arcjet import Mode, arcjet_sync, token_bucket
from flask import Flask, jsonify, request
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

app = Flask(__name__)

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

arcjet_key = os.getenv("ARCJET_KEY")
if not arcjet_key:
    raise RuntimeError("ARCJET_KEY is required. Get one at https://app.arcjet.com")

openai_api_key = os.getenv("OPENAI_API_KEY")
if not openai_api_key:
    raise RuntimeError(
        "OPENAI_API_KEY is required. Get one at https://platform.openai.com"
    )

llm = ChatOpenAI(model="gpt-4o-mini", api_key=openai_api_key)

prompt = ChatPromptTemplate.from_messages(
    [
        ("system", "You are a helpful assistant."),
        ("human", "{message}"),
    ]
)

chain = prompt | llm | StrOutputParser()

# Create a single Arcjet client at startup and reuse it across requests
aj = arcjet_sync(
    key=arcjet_key,  # Get your key from https://app.arcjet.com
    rules=[
        # Token bucket rate limiting is best for AI budget control
        token_bucket(
            mode=Mode.LIVE,  # Blocks requests. Use Mode.DRY_RUN to log only
            # Track budgets per user. Replace "userId" with any stable
            # identifier. Removing this falls back to IP-based rate limiting.
            characteristics=["userId"],
            refill_rate=2_000,  # Refill 2,000 tokens per interval
            interval=3_600,  # Refill every hour (in seconds)
            capacity=5_000,  # Maximum 5,000 tokens in the bucket
        ),
    ],
)


@app.post("/chat")
def chat():
    # Replace with your session/auth lookup to get a stable user ID
    user_id = "user-123"

    # silent=True returns None instead of raising on a missing or non-JSON body
    body = request.get_json(silent=True)
    message = body.get("message") if isinstance(body, dict) else None
    if not isinstance(message, str) or not message.strip():
        return jsonify(error="message is required"), 400

    # Estimate token cost: ~1 token per 4 characters of text (rough heuristic).
    # For accurate counts use https://github.com/openai/tiktoken
    estimate = math.ceil(len(message) / 4)

    # Deduct the estimated tokens from the user's budget
    decision = aj.protect(
        request,
        requested=estimate,
        characteristics={"userId": user_id},
    )

    if decision.is_denied():
        # The token_bucket rule is the only rule configured, so the only
        # possible denial reason is RATE_LIMIT (429).
        return jsonify(error="AI usage limit exceeded"), 429

    reply = chain.invoke({"message": message})

    return jsonify(reply=reply)


if __name__ == "__main__":
    app.run(debug=True)
```

Then run the server:
```sh
uv run python app.py
```

And send a message to the API endpoint:

```sh
curl -X POST http://localhost:5000/chat \
  -H "Content-Type: application/json" \
  -d '{"message": "What is the capital of France?"}'
```

You will see requests being processed in your Arcjet dashboard in real time.
Configuring the rate limit
- `characteristics: ["userId"]` - Tracks a separate bucket for each user. Replace "userId" with the characteristic that identifies a unique user in your application (e.g. a session token, API key, or authenticated user ID), and pass its value to `aj.protect()` on every request. In JavaScript it is a property of the second argument (`{ userId, requested: estimate }`); in Python it goes in the `characteristics` dict (`characteristics={"userId": user_id}`).
- `refillRate` and `interval` (`refill_rate` and `interval` in Python, where `interval` is in seconds) - Set the sustained allowance. `refillRate: 2_000, interval: "1h"` gives each user 2,000 tokens per hour. Adjust these to match your AI provider’s pricing and your cost targets. They are hard-coded in this example, but you can also calculate them dynamically, for example from the user’s subscription level, and pass the calculated values to the rule.
- `capacity` - The maximum number of tokens a user can accumulate. Setting `capacity: 5_000` with `refillRate: 2_000` lets users burst up to 5,000 tokens if they haven’t used their allowance recently.
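To check a configuration against a cost target, you can work out the most a single user could spend in a month: the starting bucket plus every refill. The sketch below does that arithmetic and also shows one way to pick parameters from a subscription plan. It assumes the bucket starts full. The plan names, the "pro" numbers, and the price per million tokens are hypothetical placeholders, not real pricing. `BucketConfig`, `config_for`, `max_tokens_per_period`, and `cost_usd` are illustrative names. The three fields correspond to the rule’s refill rate, interval (in seconds), and capacity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BucketConfig:
    refill_rate: int  # tokens added per interval
    interval: int  # interval length in seconds
    capacity: int  # maximum tokens a user can accumulate


# Hypothetical plans: "free" mirrors the example rule, "pro" is made up.
PLANS = {
    "free": BucketConfig(refill_rate=2_000, interval=3_600, capacity=5_000),
    "pro": BucketConfig(refill_rate=20_000, interval=3_600, capacity=50_000),
}


def config_for(plan: str) -> BucketConfig:
    # Unknown or missing plans fall back to the most restrictive limits.
    return PLANS.get(plan, PLANS["free"])


def max_tokens_per_period(cfg: BucketConfig, days: int = 30) -> int:
    # Worst case: the user spends the full starting bucket, then every refill.
    refills = days * 86_400 // cfg.interval
    return cfg.capacity + refills * cfg.refill_rate


def cost_usd(tokens: int, usd_per_million_tokens: float) -> float:
    # Convert a token count into spend at a given (hypothetical) price.
    return tokens / 1_000_000 * usd_per_million_tokens


if __name__ == "__main__":
    worst_case = max_tokens_per_period(config_for("free"))
    print(worst_case)  # 1445000
    print(round(cost_usd(worst_case, usd_per_million_tokens=10.0), 2))  # 14.45
```

This bounds the *estimated* tokens the bucket lets through, not the provider’s actual bill. The real cost also depends on how accurate the estimate is.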
Token estimation
The example uses a characters / 4 heuristic (~1 token per 4 characters of common English text). This is a reasonable starting point. It avoids extra dependencies and is accurate enough for budget enforcement, where a small margin of error is acceptable. Note that it only estimates the prompt: the model’s reply is also billed, but the examples do not deduct it from the bucket.
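The heuristic can be written as a small helper, in the same Python as the FastAPI and Flask examples. Like the Next.js example, this sketch sums characters across every message in a request. It also adds an optional up-front allowance for the reply, which the prompt alone cannot predict. The function name and the `reserved_output` parameter are hypothetical and are not part of any SDK.

```python
import math


def estimate_request_tokens(messages: list[str], reserved_output: int = 0) -> int:
    """Rough token estimate for one chat request.

    Uses ~1 token per 4 characters across all messages (a heuristic, not a
    tokenizer), plus `reserved_output`: a hypothetical fixed allowance for the
    model's reply, which cannot be measured before the request is sent.
    """
    total_chars = sum(len(message) for message in messages)
    return math.ceil(total_chars / 4) + reserved_output


if __name__ == "__main__":
    question = "What is the capital of France?"  # 30 characters
    print(estimate_request_tokens([question]))  # 8
    print(estimate_request_tokens([question], reserved_output=500))  # 508
```

Reserving output up front makes short replies look more expensive than they are, but it prevents long replies from bypassing the budget entirely. A natural choice for the reserve is whatever response-length cap you configure on the model.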
For accurate counts, use a tokenizer:
- JavaScript / TypeScript: tiktoken
- Python: tiktoken
- Anthropic: provides a token counting API
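In the Python examples, tiktoken can replace the heuristic. This sketch counts tokens with tiktoken’s encoding for the given model. It falls back to characters / 4 if tiktoken is not installed, does not recognize the model, or cannot load its encoding data. It counts only the text itself, so the few extra tokens that chat formatting adds per message are not included. `count_tokens` is an illustrative name, and the default model matches the one used in the FastAPI and Flask examples.

```python
import math


def count_tokens(text: str, model: str = "gpt-4o-mini") -> int:
    """Count tokens with tiktoken when possible, else estimate with chars / 4."""
    try:
        import tiktoken

        # Maps the model name to its encoding; may download encoding data on first use.
        encoding = tiktoken.encoding_for_model(model)
        # disallowed_special=() treats strings like "<|endoftext|>" in user input
        # as plain text instead of raising an error.
        return len(encoding.encode(text, disallowed_special=()))
    except Exception:
        # Broad on purpose: a missing package, an unknown model (KeyError), or a
        # failed download should degrade to the heuristic, not fail the request.
        return math.ceil(len(text) / 4)
```

In the FastAPI handler, for example, `estimate = count_tokens(body.message)` would replace the `math.ceil(len(body.message) / 4)` line.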