
Identifying bots

Arcjet allows you to configure a list of bots to allow or deny. To construct the list, you can specify individual bots and/or use categories to allow or deny all bots in a category.

If you are using TypeScript, these identifiers appear as autocomplete suggestions for the allow and deny options as you write your rules.

Individual bots

The bot list enumerates the known bots that Arcjet can identify.

For example:

detectBot({
  mode: "LIVE", // will block requests. Use "DRY_RUN" to log only
  // Block all bots except specific Google and Bing crawlers, and curl
  allow: [
    "GOOGLE_CRAWLER",
    "GOOGLE_CRAWLER_NEWS",
    "BING_CRAWLER",
    "CURL",
  ],
}),

Bot categories

We provide the following categories:

  • CATEGORY:ACADEMIC: Scrape data for research purposes
  • CATEGORY:ADVERTISING: Scrape data for advertising and marketing purposes
  • CATEGORY:AI: Scrape data for AI and LLM purposes
  • CATEGORY:AMAZON: Scrape data for Amazon products and services
  • CATEGORY:ARCHIVE: Scrape data for archival purposes
  • CATEGORY:FEEDFETCHER: Request data for RSS and other feeds
  • CATEGORY:GOOGLE: Scrape data for Google products and services
  • CATEGORY:META: Scrape data for Meta/Facebook products and services
  • CATEGORY:MICROSOFT: Scrape data for Microsoft products and services
  • CATEGORY:MONITOR: Interact for monitoring purposes
  • CATEGORY:OPTIMIZER: Interact for optimization purposes
  • CATEGORY:PREVIEW: Request data for image and URL previews
  • CATEGORY:PROGRAMMATIC: Interact via programming language libraries
  • CATEGORY:SEARCH_ENGINE: Index data for search engines
  • CATEGORY:SLACK: Scrape data for Slack products and services
  • CATEGORY:SOCIAL: Scrape data for social media products and services
  • CATEGORY:TOOL: Interact via command line and GUI tools
  • CATEGORY:UNKNOWN: Undetermined purposes
  • CATEGORY:VERCEL: Scrape data for Vercel products and services
  • CATEGORY:YAHOO: Scrape data for Yahoo products and services

The bot list also shows the available categories and which bots belong to each. For example:

detectBot({
  mode: "LIVE", // will block requests. Use "DRY_RUN" to log only
  // Block all bots except search engines and curl
  allow: [
    "CATEGORY:SEARCH_ENGINE", // Google, Bing, etc
    "CURL", // You can allow specific bots in addition to categories
  ],
}),

For performance reasons, only the categories you configure are checked. Each detected bot must be compared against each configured category, so the worst-case cost is count(detectedBot) * count(configuredCategories).
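
The cost described above can be illustrated with a minimal sketch. This is not Arcjet's internal implementation: the DetectedBot type, the isAllowed helper, the EXAMPLE_SCRAPER identifier, and the category memberships in the sample data are all assumptions made for illustration.

```typescript
// Hypothetical shape of a detected bot; not an Arcjet type.
type DetectedBot = { id: string; categories: string[] };

// Returns true if the bot matches any allow-list entry, either by its own
// identifier or by one of the configured categories.
function isAllowed(bot: DetectedBot, allow: string[]): boolean {
  for (const entry of allow) {
    if (entry.startsWith("CATEGORY:")) {
      // One check per (detected bot, configured category) pair.
      if (bot.categories.includes(entry)) return true;
    } else if (entry === bot.id) {
      return true;
    }
  }
  return false;
}

const allow = ["CATEGORY:SEARCH_ENGINE", "CURL"];

// Sample data only; the category memberships here are assumed.
const detected: DetectedBot[] = [
  { id: "GOOGLE_CRAWLER", categories: ["CATEGORY:GOOGLE", "CATEGORY:SEARCH_ENGINE"] },
  { id: "CURL", categories: ["CATEGORY:TOOL"] },
  { id: "EXAMPLE_SCRAPER", categories: ["CATEGORY:UNKNOWN"] },
];

// Bots that match neither an allowed identifier nor an allowed category.
const denied = detected.filter((bot) => !isAllowed(bot, allow)).map((bot) => bot.id);
```

Because unconfigured categories are never consulted, a short allow list keeps the number of comparisons per request small.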

We continuously evaluate bots to decide whether any should be reclassified. If we find enough bots to justify a new category, we’ll consider adding it. Please open an issue on our arcjet/well-known-bots repository if you need a specific category.

Detection

For basic detection on the Free plan, Arcjet uses the User-Agent header to identify specific bots.

The Arcjet Pro plan and above add bot verification using IP analysis. This helps when you are under attack from bots impersonating good bots, for example clients pretending to be Google, which you usually want to allow.

Known bots structure

The identifiers on the bot list are generated from a collection of known bots which includes details of their owner and any variations.

We welcome contributions to the arcjet/well-known-bots repository, whether you’re adding new bots or updating detection patterns. Once merged, the updates will be included in the next SDK release. Since bot detection is handled within the Arcjet WebAssembly module bundled with the SDK, new patterns must be compiled into the module as part of the release process.

Each entry in the known bots JSON represents a specific bot or crawler and includes the following fields:

  • id: A unique identifier for the bot
  • categories: An array of categories the bot belongs to (e.g. “search-engine”, “advertising”)
  • pattern: A regular expression pattern used to identify the bot in user agent strings
  • url: (optional) A URL with more information about the bot
  • verification: A list of supported methods for verifying the bot’s identity (empty if the bot cannot be verified)
  • instances: An array of example user agent strings for the bot that are validated against the pattern
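
As a rough sketch, the fields above could be modelled as TypeScript types, with a helper that checks each example user agent against the pattern. The field names come from the list above, but the exact JSON shape is an assumption: this sketch treats pattern as a single string that is compatible with JavaScript's RegExp, which may not hold for the real data. The example-crawler entry and the unmatchedInstances helper are made up for illustration.

```typescript
// Assumed shapes based on the field descriptions; not an official schema.
interface Verification {
  type: string; // e.g. "dns"
  masks: string[];
}

interface KnownBot {
  id: string;
  categories: string[];
  pattern: string; // assumed to be a single RegExp-compatible string
  url?: string;
  verification: Verification[];
  instances: string[];
}

// Returns the example user agents that the entry's pattern fails to match.
// An empty result means every instance is consistent with the pattern.
function unmatchedInstances(bot: KnownBot): string[] {
  const re = new RegExp(bot.pattern);
  return bot.instances.filter((userAgent) => !re.test(userAgent));
}

// Hypothetical entry; not taken from the real known bots JSON.
const exampleBot: KnownBot = {
  id: "example-crawler",
  categories: ["search-engine"],
  pattern: "ExampleCrawler\\/\\d+",
  url: "https://example.com/crawler",
  verification: [],
  instances: [
    "Mozilla/5.0 (compatible; ExampleCrawler/2.1; +https://example.com/crawler)",
  ],
};

const failures = unmatchedInstances(exampleBot);
```

A check like this is useful before contributing a new entry, since an instance that does not match its own pattern indicates a mistake in one of the two.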

Verification

Each verification entry contains the following fields:

  • type: The method of verification (currently only dns is supported)
  • masks: An array of mask patterns used for verification

Verification mask patterns

The mask patterns use the following special characters:

  • *: Represents 0 or 1 of any character
  • @: Acts as a wildcard, matching any number of characters

All other characters in the mask require an exact match.
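
One way to read these rules is as a translation from a mask to an anchored regular expression. The sketch below assumes that "any number of characters" for @ includes zero, and that a mask must match the whole value being checked. The maskToRegExp helper and the sample masks are hypothetical and do not reflect how Arcjet implements verification.

```typescript
// Converts a verification mask into an anchored RegExp.
// Assumptions: `*` matches zero or one character, `@` matches zero or more
// characters, and every other character must match exactly.
function maskToRegExp(mask: string): RegExp {
  let source = "";
  for (const ch of mask) {
    if (ch === "*") {
      source += ".?";
    } else if (ch === "@") {
      source += ".*";
    } else {
      // Escape characters that have special meaning in a regular expression.
      source += ch.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
    }
  }
  return new RegExp(`^${source}$`);
}

// Hypothetical masks and hostnames, for illustration only.
const wildcardMask = maskToRegExp("crawl-@.example.com");
const optionalMask = maskToRegExp("host*.example.com");

const matchesCrawler = wildcardMask.test("crawl-66-249-66-1.example.com"); // true
const matchesOneChar = optionalMask.test("host1.example.com"); // true
const matchesTwoChars = optionalMask.test("host12.example.com"); // false
```

Anchoring the expression at both ends means a value such as crawl-1.example.com.evil.org does not match the first mask, which matters when the mask is used to confirm a bot's identity.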
