Defining server behaviour for web crawlers
This note describes how the server can be configured to handle special use cases when it detects that a web crawler is making requests.
Requests to the server are made by browsers and crawlers. There are good reasons to distinguish between the two, and to handle each in different ways. Distinguishing between the two is the task of the User Agent handler.
The server can use user-agent information to do two things: (1) refuse all requests from the agent, or (2) suppress all speculative push requests to it.
The configured user-agent string patterns are also used in the Counters module to count requests by browser/crawler type.
A user-agent primer
Well-formed requests to the server should have a user-agent request header. Its value consists of a set of identifiers, determined by the requestor's software, that describe it to the server. The number of identifiers in the header value varies by vendor. The most generic identifiers come first, followed by more specific ones. Each identifier consists of a name, a version, and a comment, where the name and version are joined by a SOLIDUS, while the comment, when present, is enclosed in parentheses.
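The "easy part" described below, splitting a header value into (name, version, comment) identifiers, can be sketched in a few lines of Python. The regular expression and function name are illustrative assumptions, not part of the server:

```python
import re

# Split a user-agent header value into identifiers: a name and a
# version joined by a SOLIDUS, optionally followed by a comment in
# parentheses. This is a sketch, not the server's actual parser.
IDENT = re.compile(r'([^\s/()]+)/([^\s()]+)(?:\s+\(([^)]*)\))?')

def split_identifiers(ua: str):
    """Return a list of (name, version, comment) tuples; the comment
    is an empty string when absent."""
    return IDENT.findall(ua)

ua = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
      "AppleWebKit/537.36 (KHTML, like Gecko) "
      "Chrome/124.0.0.0 Safari/537.36")
for name, version, comment in split_identifiers(ua):
    print(name, version, comment or "-")
```

Note how the most generic identifier (Mozilla/5.0) comes first and the most specific ones (Chrome, Safari) come last, as described above.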
In practice, the number of different user-agent variants presents a challenge to anyone wanting to parse and classify them. The server can do the easy part: separating the parts into identifiers, and splitting the identifiers into name, version, and comment. The webmaster or software engineer has to do the hard part: determining which of these are browsers and which are crawlers. There are only a few shortcuts, and they are not guaranteed.
Substrings to look for when determining whether a user-agent belongs to a crawler:
- The keyword "bot" is often included somewhere in the user-agent header.
- A website URL is often included somewhere in the user-agent header.
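The two heuristics above can be sketched as follows. This is a rough illustration, not the server's actual logic, and, as noted, neither test is guaranteed:

```python
def looks_like_crawler(ua: str) -> bool:
    """Heuristic check: crawlers often include "bot" or a vendor
    website URL somewhere in the user-agent header value."""
    lowered = ua.lower()
    if "bot" in lowered:                               # e.g. AhrefsBot, MJ12bot
        return True
    if "http://" in lowered or "https://" in lowered:  # vendor URL
        return True
    return False
```

A typical crawler header such as "Mozilla/5.0 (compatible; AhrefsBot/7.0; +http://ahrefs.com/robot/)" trips both tests, while a typical browser header trips neither.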
This is a list of commercial crawlers with the string patterns that were used to identify them:
|Commercial Crawler||String Pattern|
|ADmantX Platform Analyzer||admantx|
|Ahrefs Backlink Research||AhrefsBot|
|Bit.ly Link Checker||bitlybot|
|Google Favicon||Google Favicon|
|Heritrix Internet Archive||archive.org_bot|
|Majestic-12 Distributed Search||MJ12bot|
This is a list of special-purpose, non-browser, software with the string patterns that were used to identify them:
|Software||String Pattern|
|dead link checker||deadlinkchecker|
|Leipzig Corpora Collection||findlinks|
User-agent strings can easily be spoofed, allowing bad actors to masquerade as legitimate crawlers. Webmasters should rely on other means to bolster their defenses against excessive crawling.
For analytic purposes, it's useful to know which types of browsers are being used to access the server. This is a list of browser user-agent string patterns that can be used as a starting point toward classifying browsers. The order of these patterns is important, since many user-agent strings include the keyword "Chrome". Be aware that the algorithm used by the server stops searching at the first pattern matched.
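The ordering pitfall can be illustrated with a hypothetical pattern list; the real entries come from the server's configuration, and the patterns and names below are assumptions for illustration only:

```python
# Hypothetical ordered pattern list. First match wins, so more
# specific patterns must precede generic ones like "Chrome",
# which appears in many unrelated user-agent strings.
BROWSER_PATTERNS = [
    ("Opera Mini", "Opera Mobile"),
    ("Edg",        "Microsoft Edge"),  # Edge UAs also contain "Chrome"
    ("Chrome",     "Chrome"),
    ("Safari",     "Safari"),          # must come after "Chrome"
]

def classify_browser(ua: str) -> str:
    for pattern, name in BROWSER_PATTERNS:
        if pattern in ua:
            return name                # stop at the first pattern matched
    return "unknown"
```

If "Safari" were listed before "Chrome", every Chrome request would be misclassified as Safari, since Chrome user-agent strings end with a Safari identifier.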
|Mobile Browser||String Patterns|
|Opera Mobile||Opera Mini|
The modules section must include a user-agent entry with the value on in order to enable the User Agent module.
The user-agent section of the configuration comprises a set of user-agent entries, each of which has a string-pattern and a comma-separated list of group names.
There are two meaningful group names:
- noserve, which instructs the server to refuse all requests, and
- nopush, which instructs the server to not use the speculative push protocol.
The user-agent request header value is tested sequentially against each configuration entry's string-pattern; the first matching entry's group names are assigned to the current work order.
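This matching step can be sketched as follows, assuming entries are (string-pattern, group-names) pairs taken from the user-agent section in configuration order. The entries and function name are illustrative, not the server's internals:

```python
# Hypothetical configuration entries, in configuration order.
ENTRIES = [
    ("Googlebot", {"noserve"}),
    ("MJ12bot",   {"nopush"}),
]

def groups_for(ua: str) -> set:
    """Return the group names to assign to the current work order."""
    for pattern, groups in ENTRIES:
        if pattern in ua:
            return groups        # first matching entry wins
    return set()                 # no entry matched: serve normally
```

A request from Googlebot would be assigned the noserve group and refused outright, while a request whose header matches no entry proceeds with no restrictions.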
The following information headers may be issued in conjunction with the User Agent handler:
| ||A request was declined because the requestor was identified as a crawler. The response returns a |
| ||A speculative push request was declined because the requestor was identified as a crawler. The source document is compiled and returned with a normal status code.|
|ua-pattern||::=||ASTERISK 'pattern' EQUALS-SIGN ALPHA*|
|ua-groups||::=||ASTERISK 'groups' EQUALS-SIGN ('noserve' | 'nopush') (COMMA ('noserve' | 'nopush'))*|
|user-agent-entry||::=||ua-common-name SP ua-pattern SP ua-groups CR|
|user-agent-section||::=||'user-agent' SP LEFT-CURLY-BRACKET CR|
Example 1: Denying all requests from a crawler
Googlebot *pattern=Googlebot *groups=noserve
Example 2: Denying speculative push requests from a crawler
Googlebot *pattern=Googlebot *groups=nopush
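Putting the grammar and the examples together, a complete user-agent section might look like the following sketch. The second entry and the closing RIGHT-CURLY-BRACKET are assumptions based on the grammar above, and the patterns are illustrative:

user-agent {
Googlebot *pattern=Googlebot *groups=noserve
MJ12bot *pattern=MJ12bot *groups=nopush
}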
Key points to remember:
- The user-agent request header may be used to distinguish between automated crawlers and browser requests.
- Crawlers can be blocked from all requests or just speculative push requests.