AI providers bill per token. Without per-user limits, a single user can exhaust your entire monthly budget, whether through prompt attacks, runaway loops, or simply heavy legitimate use.
A token bucket rate limit maps directly onto how AI billing works: you estimate the cost of each request in tokens, deduct it from the user’s bucket, and deny requests when the bucket is empty. The bucket refills over time, giving each user a sustained allowance without sharp rate-limit cliffs.
Alternatively, you can use a fixed window or sliding window limit to enforce a hard cap on spend per user per day, week, or month. See the rate limiting algorithms reference for details on different approaches.
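To make the bucket mechanics concrete, here is a minimal, self-contained sketch of the token bucket algorithm in Python. This is illustrative only: Arcjet manages this state for you across all instances, so you never write this yourself.

```python
import time


class TokenBucket:
    """Minimal token bucket sketch; not the Arcjet implementation."""

    def __init__(self, capacity: float, refill_rate: float, interval_seconds: float):
        self.capacity = capacity
        self.refill_per_second = refill_rate / interval_seconds
        self.tokens = capacity  # the bucket starts full
        self.last_refill = time.monotonic()

    def _refill(self) -> None:
        # Add tokens for the time elapsed since the last check, capped at capacity
        now = time.monotonic()
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.last_refill = now

    def try_consume(self, requested: float) -> bool:
        """Deduct `requested` tokens; deny if the bucket doesn't hold enough."""
        self._refill()
        if self.tokens >= requested:
            self.tokens -= requested
            return True
        return False


# A 5,000-token bucket refilling 2,000 tokens per hour, matching the config below
bucket = TokenBucket(capacity=5_000, refill_rate=2_000, interval_seconds=3600)
print(bucket.try_consume(4_000))  # True: the bucket starts full
print(bucket.try_consume(4_000))  # False: only ~1,000 tokens remain
```

Because refill is continuous rather than resetting at a boundary, users get a sustained allowance with burst headroom up to `capacity`, instead of a hard cliff at the start of each window.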
Arcjet handles bucket state across all instances of your application, so no Redis or other external state store is required.
## Get started

In this example we use the Vercel AI SDK to create a simple AI chat endpoint with Next.js, and Arcjet to enforce per-user token budgets to prevent cost overruns. The same principles can be applied to any AI application, including those built with other frameworks.
We assume you already have a Next.js app set up.
Install the dependencies:
```shell
# Export your Arcjet API key from https://app.arcjet.com
export ARCJET_KEY="ajkey_..."
```

```shell
npm install @arcjet/next ai @ai-sdk/openai
```

Create an AI chat endpoint:
```ts
import { openai } from "@ai-sdk/openai";
import arcjet, { tokenBucket } from "@arcjet/next";
import type { UIMessage } from "ai";
import { convertToModelMessages, streamText } from "ai";

const aj = arcjet({
  key: process.env.ARCJET_KEY!, // Get your site key from https://app.arcjet.com
  // Track budgets per user; replace "userId" with any stable identifier
  characteristics: ["userId"],
  rules: [
    tokenBucket({
      mode: "LIVE", // Blocks requests. Use "DRY_RUN" to log only
      refillRate: 2_000, // Refill 2,000 tokens per hour
      interval: "1h",
      capacity: 5_000, // Maximum 5,000 tokens in the bucket
    }),
  ],
});

export async function POST(req: Request) {
  // Replace with your session/auth lookup to get a stable user ID
  const userId = "user-123";
  const { messages }: { messages: UIMessage[] } = await req.json();
  const modelMessages = await convertToModelMessages(messages);

  // Estimate token cost: ~1 token per 4 characters of text (rough heuristic).
  // For accurate counts use https://www.npmjs.com/package/tiktoken
  const totalChars = modelMessages.reduce((sum, m) => {
    const content =
      typeof m.content === "string" ? m.content : JSON.stringify(m.content);
    return sum + content.length;
  }, 0);
  const estimate = Math.ceil(totalChars / 4);

  // Deduct the estimated tokens from the user's budget
  const decision = await aj.protect(req, { userId, requested: estimate });

  if (decision.isDenied()) {
    return new Response("AI usage limit exceeded", { status: 429 });
  }

  const result = await streamText({
    model: openai("gpt-4o"),
    messages: modelMessages,
  });

  return result.toUIMessageStreamResponse();
}
```

And hook it up to a chat UI:
```tsx
"use client";

import { useChat } from "@ai-sdk/react";
import { useState } from "react";

export default function Chat() {
  const [input, setInput] = useState("");
  const [errorMessage, setErrorMessage] = useState<string | null>(null);
  const { messages, sendMessage } = useChat({
    onError: async (e) => setErrorMessage(e.message),
  });
  return (
    <div className="flex flex-col w-full max-w-md py-24 mx-auto stretch">
      {messages.map((message) => (
        <div key={message.id} className="whitespace-pre-wrap">
          {message.role === "user" ? "User: " : "AI: "}
          {message.parts.map((part, i) => {
            switch (part.type) {
              case "text":
                return <div key={`${message.id}-${i}`}>{part.text}</div>;
            }
          })}
        </div>
      ))}

      {errorMessage && (
        <div className="text-red-500 text-sm mb-4">{errorMessage}</div>
      )}

      <form
        onSubmit={(e) => {
          e.preventDefault();
          sendMessage({ text: input });
          setInput("");
          setErrorMessage(null);
        }}
      >
        <input
          className="fixed dark:bg-zinc-900 bottom-0 w-full max-w-md p-2 mb-8 border border-zinc-300 dark:border-zinc-800 rounded shadow-xl"
          value={input}
          placeholder="Say something..."
          onChange={(e) => setInput(e.currentTarget.value)}
        />
      </form>
    </div>
  );
}
```

Then run the server:
```shell
npm run dev
```

You will see requests being processed in your Arcjet dashboard in real time.
In this example we use LangChain to create a simple AI chat server with FastAPI, and Arcjet to enforce per-user token budgets to prevent cost overruns. The same principles can be applied to any AI application, including those built with other frameworks.
Set up the environment and install dependencies (uses uv, but you can also use pip to install the Arcjet Python SDK):
```shell
# Export your Arcjet API key from https://app.arcjet.com
export ARCJET_KEY="ajkey_..."
export ARCJET_ENV=development

# Export your OpenAI API key (used by LangChain)
export OPENAI_API_KEY="sk-..."

# Install dependencies
uv add arcjet fastapi uvicorn langchain langchain-openai
```

Create the chat server:
```python
import logging
import os

from arcjet import (
    Mode,
    arcjet,
    token_bucket,
)
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from pydantic import BaseModel

app = FastAPI()

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

arcjet_key = os.getenv("ARCJET_KEY")
if not arcjet_key:
    raise RuntimeError("ARCJET_KEY is required. Get one at https://app.arcjet.com")

openai_api_key = os.getenv("OPENAI_API_KEY")
if not openai_api_key:
    raise RuntimeError(
        "OPENAI_API_KEY is required. Get one at https://platform.openai.com"
    )

llm = ChatOpenAI(model="gpt-4o-mini", api_key=openai_api_key)

prompt = ChatPromptTemplate.from_messages(
    [
        ("system", "You are a helpful assistant."),
        ("human", "{message}"),
    ]
)

chain = prompt | llm | StrOutputParser()


class ChatRequest(BaseModel):
    message: str


aj = arcjet(
    key=arcjet_key,  # Get your key from https://app.arcjet.com
    rules=[
        # Create a token bucket rate limit. Other algorithms are supported
        token_bucket(
            # Track budgets by arbitrary characteristics of the request. Here
            # we use user ID, but you could pass any value. Removing this will
            # fall back to IP-based rate limiting.
            characteristics=["userId"],
            mode=Mode.LIVE,
            refill_rate=5,  # Refill 5 tokens per interval
            interval=10,  # Refill every 10 seconds
            capacity=10,  # Bucket capacity of 10 tokens
        ),
    ],
)


@app.post("/chat")
async def chat(request: Request, body: ChatRequest):
    # In a real app, identify the user from the request (e.g. auth token)
    user_id = "user-123"

    # Call protect() to evaluate the request against the rules
    decision = await aj.protect(
        request,
        # Deduct 5 tokens from the bucket
        requested=5,
        # Identify the user for rate limiting purposes
        characteristics={"userId": user_id},
    )

    # Handle denied requests
    if decision.is_denied():
        status = 429 if decision.reason.is_rate_limit() else 403
        return JSONResponse({"error": "Denied"}, status_code=status)

    # All rules passed, proceed with handling the request
    reply = await chain.ainvoke({"message": body.message})

    return {"reply": reply}
```

Then run the server:
```shell
uv run uvicorn main:app --reload
```

And send a message to the API endpoint:
```shell
curl -X POST http://localhost:8000/chat \
  -H "Content-Type: application/json" \
  -d '{"message": "What is the capital of France?"}'
```

You will see requests being processed in your Arcjet dashboard in real time.
In this example we use LangChain to create a simple AI chat server with Flask, and Arcjet to enforce per-user token budgets to prevent cost overruns. The same principles can be applied to any AI application, including those built with other frameworks.
Set up the environment and install dependencies (uses uv, but you can also use pip to install the Arcjet Python SDK):
```shell
# Export your Arcjet API key from https://app.arcjet.com
export ARCJET_KEY="ajkey_..."
export ARCJET_ENV=development

# Export your OpenAI API key (used by LangChain)
export OPENAI_API_KEY="sk-..."

# Install dependencies
uv add arcjet flask langchain langchain-openai
```

Create the chat server:
```python
import logging
import os

from arcjet import (
    Mode,
    arcjet_sync,
    token_bucket,
)
from flask import Flask, jsonify, request
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

app = Flask(__name__)

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

arcjet_key = os.getenv("ARCJET_KEY")
if not arcjet_key:
    raise RuntimeError("ARCJET_KEY is required. Get one at https://app.arcjet.com")

openai_api_key = os.getenv("OPENAI_API_KEY")
if not openai_api_key:
    raise RuntimeError(
        "OPENAI_API_KEY is required. Get one at https://platform.openai.com"
    )

llm = ChatOpenAI(model="gpt-4o-mini", api_key=openai_api_key)

prompt = ChatPromptTemplate.from_messages(
    [
        ("system", "You are a helpful assistant."),
        ("human", "{message}"),
    ]
)

chain = prompt | llm | StrOutputParser()

aj = arcjet_sync(
    key=arcjet_key,  # Get your key from https://app.arcjet.com
    rules=[
        # Create a token bucket rate limit. Other algorithms are supported
        token_bucket(
            # Track budgets by arbitrary characteristics of the request. Here
            # we use user ID, but you could pass any value. Removing this will
            # fall back to IP-based rate limiting.
            characteristics=["userId"],
            mode=Mode.LIVE,
            refill_rate=5,  # Refill 5 tokens per interval
            interval=10,  # Refill every 10 seconds
            capacity=10,  # Bucket capacity of 10 tokens
        ),
    ],
)


@app.post("/chat")
def chat():
    # In a real app, identify the user from the request (e.g. auth token)
    user_id = "user-123"

    # Call protect() to evaluate the request against the rules
    decision = aj.protect(
        request,
        # Deduct 5 tokens from the bucket
        requested=5,
        # Identify the user for rate limiting purposes
        characteristics={"userId": user_id},
    )

    # Handle denied requests
    if decision.is_denied():
        status = 429 if decision.reason.is_rate_limit() else 403
        return jsonify(error="Denied"), status

    # All rules passed, proceed with handling the request
    body = request.get_json()
    message = body.get("message", "") if body else ""
    reply = chain.invoke({"message": message})

    return jsonify(reply=reply)


if __name__ == "__main__":
    app.run(debug=True)
```

Then run the server:
```shell
uv run python app.py
```

And send a message to the API endpoint:
```shell
curl -X POST http://localhost:5000/chat \
  -H "Content-Type: application/json" \
  -d '{"message": "What is the capital of France?"}'
```

You will see requests being processed in your Arcjet dashboard in real time.
## Configuring the rate limit

- `characteristics: ["userId"]`: tracks the bucket per user. Replace `"userId"` with the characteristic that identifies a unique user in your application (e.g. a session token, API key, or authenticated user ID). Pass the value to `aj.protect()` as a named argument.
- `refillRate` and `interval`: set the sustained allowance. `refillRate: 2_000, interval: "1h"` gives each user 2,000 tokens per hour. Adjust to match your AI provider's pricing and your cost targets. These are hard coded in this example, but you can also calculate them dynamically based on user subscription level or other factors, then pass the calculated values to the rule.
- `capacity`: the maximum tokens a user can accumulate. Setting `capacity: 5_000` with `refillRate: 2_000` lets users burst up to 5,000 tokens if they haven't used their allowance recently.
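As a sketch of the dynamic approach, the rule parameters could be derived from a subscription tier lookup before constructing the rule. The tier names and budgets here are hypothetical, purely for illustration:

```python
# Hypothetical plan tiers; the names and budgets are illustrative, not Arcjet values
PLAN_BUDGETS = {
    "free": {"refill_rate": 2_000, "interval": 3600, "capacity": 5_000},
    "pro": {"refill_rate": 20_000, "interval": 3600, "capacity": 50_000},
}


def bucket_params_for(plan: str) -> dict:
    """Pick token bucket parameters based on the user's subscription plan."""
    # Unknown plans fall back to the free tier's budget
    return PLAN_BUDGETS.get(plan, PLAN_BUDGETS["free"])


params = bucket_params_for("pro")
# Pass these into the rule, e.g. token_bucket(refill_rate=..., interval=..., capacity=...)
print(params["capacity"])  # 50000
```

This keeps the budget policy in one table, so pricing changes don't touch the request-handling code.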
## Token estimation

The example uses a `characters / 4` heuristic (~1 token per 4 characters for common English text). This is a reasonable starting point: it avoids introducing extra dependencies and works well enough for budget enforcement where a small margin of error is acceptable.
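The Python examples above deduct a fixed cost per request. To mirror the `characters / 4` estimate from the Next.js route instead, a sketch like the following could compute the `requested` value from the chat history; the message shape (dicts with a `content` field) is an assumption:

```python
import json
import math


def estimate_tokens(messages: list[dict]) -> int:
    """Rough estimate: ~1 token per 4 characters of message content."""
    total_chars = 0
    for m in messages:
        content = m.get("content", "")
        if not isinstance(content, str):
            # Structured content parts: count the length of their JSON form
            content = json.dumps(content)
        total_chars += len(content)
    return math.ceil(total_chars / 4)


# 11 characters of content => 3 tokens
print(estimate_tokens([{"role": "user", "content": "Hello there"}]))  # 3
```

The result would then be passed to `protect()` as `requested=estimate_tokens(...)` in place of the hard-coded `requested=5`.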
For accurate counts, use a tokenizer:

- JavaScript / TypeScript: tiktoken
- Python: tiktoken
- Anthropic: provides a token counting API