Identifying bots
For basic detection, Arcjet uses the User-Agent header to identify specific bots. Advanced bot detection supplements this with additional fingerprinting techniques such as IP address analysis.
Arcjet maintains a list of known bots, which is available in our bot list. If you are using TypeScript, the bot identifiers are offered as autocomplete values for the allow or deny options while you write your rules.
You can use this list to allow or deny any or all of these bots.
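As a minimal sketch of how an allow or deny list could be evaluated against a detected bot identifier, consider the following. The `BotRule` type and `isAllowed` function are hypothetical names for illustration and are not part of the Arcjet SDK. The sketch assumes a rule carries either an allow list or a deny list, and `CURL` and `GOOGLE_CRAWLER` are used only as example identifiers that should be checked against the bot list.

```typescript
// Hypothetical rule shape (an assumption): an allow list or a deny list, not both.
type BotRule = { allow: string[] } | { deny: string[] };

// Returns true when a request from the detected bot should be let through.
function isAllowed(rule: BotRule, detectedBot: string): boolean {
  if ("allow" in rule) {
    // Allow list: only listed bots pass; every other detected bot is denied.
    return rule.allow.indexOf(detectedBot) !== -1;
  }
  // Deny list: listed bots are blocked; every other detected bot passes.
  return rule.deny.indexOf(detectedBot) === -1;
}

const allowSearch: BotRule = { allow: ["GOOGLE_CRAWLER"] };
const denyCurl: BotRule = { deny: ["CURL"] };

console.log(isAllowed(allowSearch, "GOOGLE_CRAWLER")); // true
console.log(isAllowed(allowSearch, "CURL")); // false
console.log(isAllowed(denyCurl, "CURL")); // false
console.log(isAllowed(denyCurl, "GOOGLE_CRAWLER")); // true
```

With an allow list, anything not listed is treated as denied; with a deny list, anything not listed passes. Which of the two is more convenient depends on whether you expect mostly wanted or mostly unwanted bots.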
Known bots
Arcjet’s list of known bots consists of two parts:
- The bot list shipped with the SDK provides human-readable identifiers for known bots.
- The identifiers on the bot list are generated from a collection of known bots that includes details of each bot’s owner and any variations.
We welcome contributions to the arcjet/well-known-bots repository, whether you’re adding new bots or updating detection patterns. Once merged, the updates will be included in the next SDK release. Since bot detection is handled within the Arcjet WebAssembly module bundled with the SDK, new patterns must be compiled into the module as part of the release process.
Known bots structure
Each entry in the known bots JSON represents a specific bot or crawler and includes the following fields:
- id: A unique identifier for the bot
- categories: An array of categories the bot belongs to (e.g. “search-engine”, “advertising”)
- pattern: A regular expression pattern used to identify the bot in user agent strings
- url: (optional) A URL with more information about the bot
- verification: A list of supported methods for verifying the bot’s identity (if the bot is not verifiable it should be empty).
- instances: An array of example user agent strings for the bot that are validated against the pattern
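The fields above can be sketched as a TypeScript type, together with a check that every example in instances matches the pattern. This is an illustrative sketch only: the `KnownBot` interface and the `unmatchedInstances` function are hypothetical names, the entry is made up rather than taken from the real data, and it assumes pattern is a single regular expression string as described above. The actual JSON layout in the repository may differ.

```typescript
// Hypothetical TypeScript shape for one entry, mirroring the documented fields.
interface KnownBot {
  id: string;
  categories: string[];
  pattern: string; // assumed to be a single regular expression source string
  url?: string;
  verification: { type: string; masks: string[] }[];
  instances: string[];
}

// Returns the example user agents that the entry's pattern fails to match.
function unmatchedInstances(bot: KnownBot): string[] {
  const re = new RegExp(bot.pattern);
  return bot.instances.filter((userAgent) => !re.test(userAgent));
}

// Made-up entry for illustration only; not part of the real bot collection.
const exampleBot: KnownBot = {
  id: "example-crawler",
  categories: ["search-engine"],
  pattern: "ExampleCrawler\\/\\d+",
  url: "https://example.com/crawler",
  verification: [], // empty because this made-up bot is not verifiable
  instances: [
    "Mozilla/5.0 (compatible; ExampleCrawler/1.2; +https://example.com/crawler)",
  ],
};

console.log(unmatchedInstances(exampleBot)); // []
```

A check like this is useful before contributing an entry, since an instance that does not match its own pattern indicates that one of the two is wrong.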
Verification
Each verification entry contains the following fields:
- type: The method of verification (currently only dns is supported)
- masks: An array of mask patterns used for verification
Verification mask patterns
The mask patterns use the following special characters:
- *: Represents 0 or 1 of any character
- @: Acts as a wildcard, matching any number of characters
All other characters in the mask require an exact match.
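One way to read these rules is as a translation into an anchored regular expression. The sketch below is an assumption about how a mask could be applied, not the implementation used by Arcjet: `maskToRegExp` is a hypothetical helper, the masks shown are illustrative, and it assumes a mask is compared against a hostname such as one obtained from a reverse DNS lookup.

```typescript
// Translate a verification mask into an anchored regular expression.
function maskToRegExp(mask: string): RegExp {
  let source = "";
  for (let i = 0; i < mask.length; i++) {
    const ch = mask.charAt(i);
    if (ch === "*") {
      source += ".?"; // 0 or 1 of any character
    } else if (ch === "@") {
      source += ".*"; // any number of characters
    } else {
      // Escape regex metacharacters so the character must match exactly.
      source += ch.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
    }
  }
  return new RegExp("^" + source + "$");
}

// Illustrative masks and hostnames (assumptions, not real verification data).
const octetMask = maskToRegExp("crawl-***-***-***-***.example.com");
console.log(octetMask.test("crawl-66-249-66-1.example.com")); // true
console.log(octetMask.test("crawl-66-249-66-1.example.com.evil.test")); // false

const wildcardMask = maskToRegExp("@.example.com");
console.log(wildcardMask.test("a.b.example.com")); // true
console.log(wildcardMask.test("example.com")); // false
```

Because each `*` matches at most one character, a run of three can stand for a number of up to three digits, while `@` covers a prefix of any length.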
Bot categories
In addition to identifying individual bots, we also group bots into various categories. You can leverage these categories for easier configuration of your allow or deny lists.
Currently, we provide the following categories. You can see which bots are in each category from the bot list:
- CATEGORY:ACADEMIC: Scrape data for research purposes
- CATEGORY:ADVERTISING: Scrape data for advertising and marketing purposes
- CATEGORY:AI: Scrape data for AI and LLM purposes
- CATEGORY:AMAZON: Scrape data for Amazon products and services
- CATEGORY:ARCHIVE: Scrape data for archival purposes
- CATEGORY:FEEDFETCHER: Request data for RSS and other feeds
- CATEGORY:GOOGLE: Scrape data for Google products and services
- CATEGORY:META: Scrape data for Meta/Facebook products and services
- CATEGORY:MICROSOFT: Scrape data for Microsoft products and services
- CATEGORY:MONITOR: Interact for monitoring purposes
- CATEGORY:OPTIMIZER: Interact for optimization purposes
- CATEGORY:PREVIEW: Request data for image and URL previews
- CATEGORY:PROGRAMMATIC: Interact via programming language libraries
- CATEGORY:SEARCH_ENGINE: Index data for search engines
- CATEGORY:SLACK: Scrape data for Slack products and services
- CATEGORY:SOCIAL: Scrape data for social media products and services
- CATEGORY:TOOL: Interact via command line and GUI tools
- CATEGORY:UNKNOWN: Undetermined purposes
- CATEGORY:VERCEL: Scrape data for Vercel products and services
- CATEGORY:YAHOO: Scrape data for Yahoo products and services
We’re continuously evaluating bots to decide whether any should be reclassified. If we determine that enough bots exist to justify a new category, we’ll consider adding it. Please open an issue on our arcjet/well-known-bots repository if you need a specific category.
For performance reasons, only the categories you configure are checked. Each detected bot must be compared against each configured category, so the worst case performance is count(detectedBot) * count(configuredCategories).
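To make the worst case concrete, the sketch below compares every detected bot against every configured category and counts the comparisons. It is a simplified illustration rather than Arcjet's implementation: `DetectedBot` and `matchCategories` are hypothetical names, the bots are made up, and the loop deliberately has no early exit so that it always performs the worst case number of comparisons.

```typescript
// Hypothetical shape: a detected bot and the categories it belongs to.
interface DetectedBot {
  id: string;
  categories: string[];
}

// Compares each detected bot with each configured category and counts the work.
function matchCategories(
  detected: DetectedBot[],
  configured: string[],
): { matches: string[]; comparisons: number } {
  const matches: string[] = [];
  let comparisons = 0;
  for (const bot of detected) {
    for (const category of configured) {
      comparisons += 1;
      if (bot.categories.indexOf(category) !== -1) {
        matches.push(bot.id + " -> " + category);
      }
    }
  }
  return { matches, comparisons };
}

// Made-up detected bots for illustration only.
const detected: DetectedBot[] = [
  { id: "EXAMPLE_SEARCH_BOT", categories: ["CATEGORY:SEARCH_ENGINE"] },
  { id: "EXAMPLE_AI_BOT", categories: ["CATEGORY:AI"] },
];
const configured = ["CATEGORY:AI", "CATEGORY:MONITOR", "CATEGORY:PREVIEW"];

const result = matchCategories(detected, configured);
console.log(result.comparisons); // 6, which is 2 detected bots * 3 configured categories
console.log(result.matches); // [ 'EXAMPLE_AI_BOT -> CATEGORY:AI' ]
```

Configuring fewer categories shrinks the second factor, which is why categories you have not configured are skipped entirely.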