Defining server behaviour for web crawlers
User Agent
Preliminaries
This note describes how the server can be configured to handle special use cases when it detects that a web crawler is making requests.
Requests to the server come from both browsers and crawlers. There are good reasons to distinguish between the two and to handle each differently; making that distinction is the task of the User Agent handler.
The server can use user-agent information to do two things: 1) completely block all requests from a matching requestor, or 2) suppress all speculative push requests to it.
The configured user-agent string patterns are also used in the Counters module to count requests by browser/crawler type.
A user-agent primer
Well-formed requests to the server should have a user-agent request header. Its value consists of a set of identifiers, determined by the requestor's software, that describe that software to the server. The number of identifiers varies by vendor. The most generic identifiers come first, followed by more specific ones. Each identifier consists of a name, a version, and an optional comment: the name and version are joined by a SOLIDUS (/), while the comment, when present, is enclosed in parentheses.
In practice, the sheer number of user-agent variants presents a challenge to anyone wanting to parse and classify them. The server can do the easy part: separating the header value into identifiers, and splitting each identifier into name, version, and comment. The webmaster or software engineer has to do the hard part: determining which of these are browsers and which are crawlers. There are only a few shortcuts, and none of them are guaranteed.
Crawlers
Two substrings suggest that a user-agent belongs to a crawler (a quick heuristic based on them is sketched after this list):
- The keyword "bot" is often included somewhere in the user-agent header.
- A website URL is often included somewhere in the user-agent header.
This is a list of commercial crawlers with the string patterns that were used to identify them:
Vendor | String Patterns |
---|---|
360 Spider | 360Spider |
ADmantX Platform Analyzer | admantx |
Ahrefs Backlink Research | AhrefsBot |
Alexa | ia_archiver |
Apple | Applebot |
Baidu | Baiduspider |
Bing | bingbot |
BingPreview | BingPreview |
Bit.ly Link Checker | bitlybot |
Cốc Cốc | coccoc |
Cosmos Crawler | Cosmos |
Goo | ichiro |
DuckDuckGo | DuckDuckBot |
Exa | Exabot |
Evaliant Impressions | evaliant |
Facebook | facebookexternalhit |
Google Favicon | Google Favicon |
Google | Googlebot |
Heritrix Internet Archive | archive.org_bot |
Majestic-12 Distributed Search | MJ12bot |
MSN | msnbot |
Pinterest | pinterest |
Semrush | SemrushBot |
SEOProfiler | spbot |
Slack | Slackbot |
Slack-ImgProxy | Slack-ImgProxy |
Sogou | Sogou |
Sosospider | Sosospider |
TweetMeme | TweetmemeBot |
Twitter | Twitterbot |
Yahoo! Slurp | yahoo |
Yandex | YandexBot |
WhatsApp | WhatsApp |
This is a list of special-purpose, non-browser software with the string patterns used to identify it:
Tool | String Patterns |
---|---|
curl utility | curl |
dead link checker | deadlinkchecker |
Leipzig Corpora Collection | findlinks |
Scrapy | Scrapy |
wget utility | wget |
YaCy Peer-to-Peer | yacybot |
User-agent strings can easily be spoofed, allowing bad actors to masquerade as legitimate crawlers. Webmasters should rely on other means to bolster their defenses against excessive crawling.
Browsers
For analytic purposes, it's useful to know what types of browsers are being used to access the server. This is a list of browser user-agent string patterns that can be used as a starting point for classifying browsers. The order of these patterns is important: many user-agent strings include both the keywords "Safari" and "Chrome" (Chrome's user-agent string, for example, ends with a "Safari" identifier), and the algorithm used by the server stops searching at the first pattern matched, so more specific patterns must precede more generic ones. A sketch of this first-match algorithm follows the tables.
Mobile Phone | String Patterns |
---|---|
Apple Mobile | iPhone |
Android Mobile | Android |
Windows Mobile | IEMobile |
Opera Mobile | Opera Mini |
Firefox Mobile | Mobile.*Firefox |
Tablets | String Patterns |
---|---|
Apple Tablet | iPad |
Apple Tablet | iPod |
Amazon Tablet | Silk |
Amazon Tablet | Kindle |
Amazon Tablet | NetFront |
Firefox Tablet | Tablet.*Firefox |
Desktop | String Patterns |
---|---|
Opera Desktop | Opera |
Opera Desktop | OPR |
Firefox Desktop | Firefox |
Firefox Desktop | Seamonkey |
Microsoft Desktop | MSIE |
Microsoft Desktop | Trident |
Microsoft Desktop | Edge |
Chrome Desktop | Chromium |
Chrome Desktop | Chrome |
Apple Desktop | Safari |
Unclassified | String Patterns |
---|---|
Other | '*' |
Configuration
The modules section must include a user-agent entry with the value on in order to enable the User Agent module.
The user-agent section of the configuration comprises a set of user-agent entries, each of which has a string-pattern and a comma-separated list of group names.
There are two meaningful group names: noserve, which instructs the server to refuse all requests, and nopush, which instructs the server not to use the speculative push protocol.
The request header value is tested sequentially against each configuration entry's string-pattern, and the first matching entry's group names are assigned to the current work order.
Information Headers
The following information headers may be issued in conjunction with the User Agent handler:
Information Header | Trigger |
---|---|
rw-user-agent-noserve | A request was declined because the requestor was identified as a crawler. The response returns a 403 status code. |
rw-user-agent-nopush | A speculative push request was declined because the requestor was identified as a crawler. The source document is compiled and returned with a normal status code. |
EBNF
SP                  ::= U+20
CR                  ::= U+0D
ASTERISK            ::= U+2A
COMMA               ::= U+2C
EQUALS-SIGN         ::= U+3D
LEFT-CURLY-BRACKET  ::= U+7B
RIGHT-CURLY-BRACKET ::= U+7D
ua-common-name      ::= ALPHA*
ua-group-name       ::= ALPHA*
ua-pattern          ::= ASTERISK 'pattern' EQUALS-SIGN ALPHA*
ua-groups           ::= ASTERISK 'groups' EQUALS-SIGN ua-group-name (COMMA ua-group-name)*
user-agent-entry    ::= ua-common-name SP ua-pattern SP ua-groups CR
user-agent-section  ::= 'user-agent' SP LEFT-CURLY-BRACKET CR user-agent-entry* RIGHT-CURLY-BRACKET CR
Cookbook
Example 1: Default patterns and groups
modules {
    user-agent on
}

request {
    user-agent {
        // Crawlers
        360-Spider *pattern=360Spider *groups=crawler
        ADmantX-Platform-Analyzer *pattern=admantx *groups=crawler
        Ahrefs-Backlink-Research *pattern=AhrefsBot *groups=crawler
        Alexa *pattern=ia_archiver *groups=crawler
        Apple *pattern=Applebot *groups=crawler
        Baidu *pattern=Baiduspider *groups=crawler
        Bing *pattern=bingbot *groups=crawler
        BingPreview *pattern=BingPreview *groups=crawler
        Bitly-Link-Checker *pattern=bitlybot *groups=crawler
        Cốc-Cốc *pattern=coccoc *groups=crawler
        Cosmos-Crawler *pattern=Cosmos *groups=crawler
        Goo *pattern=ichiro *groups=crawler
        DuckDuckGo *pattern=DuckDuckBot *groups=crawler
        Exa *pattern=Exabot *groups=crawler
        Evaliant-Impressions *pattern=evaliant *groups=crawler
        Facebook *pattern=facebookexternalhit *groups=crawler
        Google-Favicon *pattern='Google Favicon' *groups=crawler
        Google *pattern=Googlebot *groups=crawler
        Heritrix-Internet-Archive *pattern='archive.org_bot' *groups=crawler
        Majestic-12-Distributed-Search *pattern=MJ12bot *groups=crawler
        MSN *pattern=msnbot *groups=crawler
        Pinterest *pattern=pinterest *groups=crawler
        Semrush *pattern=SemrushBot *groups=crawler
        SEOProfiler *pattern=spbot *groups=crawler
        Slack *pattern=Slackbot *groups=crawler
        Slack-ImgProxy *pattern=Slack-ImgProxy *groups=crawler
        Sogou *pattern=Sogou *groups=crawler
        Sosospider *pattern=Sosospider *groups=crawler
        TweetMeme *pattern=TweetmemeBot *groups=crawler
        Twitter *pattern=Twitterbot *groups=crawler
        Yahoo!-Slurp *pattern=yahoo *groups=crawler
        Yandex *pattern=YandexBot *groups=crawler
        WhatsApp *pattern=WhatsApp *groups=crawler

        // Tools
        curl *pattern=curl *groups=tool
        dead-link-checker *pattern=deadlinkchecker *groups=tool
        Leipzig-Corpora-Collection *pattern=findlinks *groups=tool
        Scrapy *pattern=Scrapy *groups=tool
        wget *pattern=wget *groups=tool
        YaCy-Peer-to-Peer *pattern=yacybot *groups=tool

        // Mobile Phones
        Apple-Mobile *pattern=iPhone *groups=mobile
        Android-Mobile *pattern=Android *groups=mobile
        Windows-Mobile *pattern=IEMobile *groups=mobile
        Opera-Mobile *pattern='Opera Mini' *groups=mobile
        Firefox-Mobile *pattern='Mobile.*Firefox' *groups=mobile

        // Tablets
        Apple-Tablet *pattern=iPad *groups=tablet
        Apple-Tablet *pattern=iPod *groups=tablet
        Amazon-Tablet *pattern=Silk *groups=tablet
        Amazon-Tablet *pattern=Kindle *groups=tablet
        Amazon-Tablet *pattern=NetFront *groups=tablet
        Firefox-Tablet *pattern='Tablet.*Firefox' *groups=tablet

        // Desktop
        Opera-Desktop *pattern=Opera *groups=desktop
        Opera-Desktop *pattern=OPR *groups=desktop
        Firefox-Desktop *pattern=Firefox *groups=desktop
        Firefox-Desktop *pattern=Seamonkey *groups=desktop
        Microsoft-Desktop *pattern=MSIE *groups=desktop
        Microsoft-Desktop *pattern=Trident *groups=desktop
        Microsoft-Desktop *pattern=Edge *groups=desktop
        Chrome-Desktop *pattern=Chromium *groups=desktop
        Chrome-Desktop *pattern=Chrome *groups=desktop
        Apple-Desktop *pattern=Safari *groups=desktop
    }
}
Example 2: Denying all requests from a crawler
modules {
    user-agent on
}

request {
    user-agent {
        Googlebot *pattern=Googlebot *groups=noserve
    }
}
Example 3: Denying speculative push requests from a crawler
modules {
    user-agent on
}

request {
    user-agent {
        Googlebot *pattern=Googlebot *groups=nopush
    }
}
Review
Key points to remember:
- The user-agent request header may be used to distinguish between automated crawlers and browser requests.
- Crawlers can be blocked from making any requests at all, or only from receiving speculative pushes.