Defining server behaviour for web crawlers

User Agent

Preliminaries

This note describes how the server can be configured to handle special use cases when it detects that a web crawler is making requests.

Requests to the server are made by browsers and crawlers. There are good reasons to distinguish between the two, and to handle each in different ways. Distinguishing between the two is the task of the User Agent handler.

The server can use user-agent information to do two things: 1) it can refuse all requests from a matching user agent, and 2) it can suppress speculative push requests for a matching user agent.

The configured user-agent string patterns are also used in the Counters module to count requests by browser/crawler type.

A user-agent primer

Well-formed requests to the server should have a user-agent request header. Its value consists of a series of identifiers, chosen by the requesting software, that describe it to the server. The number of identifiers varies by vendor. The most generic identifiers come first, followed by more specific ones. Each identifier consists of a name, a version, and an optional comment: the name and version are joined by a SOLIDUS, while the comment, when present, is enclosed in parentheses. For example, a desktop Chrome browser typically sends something like Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36.

In practice, the sheer number of user-agent variants presents a challenge to anyone wanting to parse and classify them. The server can do the easy part: separating the header value into identifiers, and splitting each identifier into name, version, and comment. The webmaster or software engineer has to do the hard part: determining which of these are browsers and which are crawlers. There are only a few shortcuts, and they are not guaranteed.
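
As a rough illustration only (not the server's actual parser), the following Python sketch splits a user-agent header value into its identifiers; the regular expression and the sample header are assumptions chosen for the example.

import re

# Minimal sketch: split a user-agent header value into product identifiers,
# each with a name, a version, and an optional parenthesized comment.
IDENTIFIER = re.compile(r'(?P<name>[^/\s()]+)/(?P<version>\S+)(?:\s+\((?P<comment>[^)]*)\))?')

def split_identifiers(user_agent):
    # Returns a list of (name, version, comment) tuples.
    return [(m.group('name'), m.group('version'), m.group('comment') or '')
            for m in IDENTIFIER.finditer(user_agent)]

sample = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
          '(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36')
for name, version, comment in split_identifiers(sample):
    print(name, version, comment)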

Crawlers

Substrings to look for to determine whether a user-agent is a crawler (a minimal check is sketched after this list):

  1. The keyword "bot" is often included somewhere in the user-agent header.
  2. A website URL is often included somewhere in the user-agent header.
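
As a hedged illustration of these two heuristics (crude substring checks, not the server's logic), a quick test might look like this:

def looks_like_crawler(user_agent):
    ua = user_agent.lower()
    # 1. the keyword "bot" appears somewhere in the header
    if 'bot' in ua:
        return True
    # 2. a website URL appears somewhere in the header
    if 'http://' in ua or 'https://' in ua:
        return True
    return False

print(looks_like_crawler('Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'))  # True
print(looks_like_crawler('Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/124.0.0.0 Safari/537.36'))  # False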

This is a list of commercial crawlers with the string patterns used to identify them:

Vendor                            String Pattern
360 Spider                        360Spider
ADmantX Platform Analyzer         admantx
Ahrefs Backlink Research          AhrefsBot
Alexa                             ia_archiver
Apple                             Applebot
Baidu                             Baiduspider
Bing                              bingbot
BingPreview                       BingPreview
Bit.ly Link Checker               bitlybot
Cốc Cốc                           coccoc
Cosmos Crawler                    Cosmos
Goo                               ichiro
DuckDuckGo                        DuckDuckBot
Exa                               Exabot
Evaliant Impressions              evaliant
Facebook                          facebookexternalhit
Google Favicon                    Google Favicon
Google                            Googlebot
Heritrix Internet Archive         archive.org_bot
Majestic-12 Distributed Search    MJ12bot
MSN                               msnbot
Pinterest                         pinterest
Semrush                           SemrushBot
SEOProfiler                       spbot
Slack                             Slackbot
Slack-ImgProxy                    Slack-ImgProxy
Sogou                             Sogou
Sosospider                        Sosospider
TweetMeme                         TweetmemeBot
Twitter                           Twitterbot
Yahoo! Slurp                      yahoo
Yandex                            YandexBot
WhatsApp                          WhatsApp

This is a list of special-purpose, non-browser software with the string patterns used to identify it:

Tool                          String Pattern
curl utility                  curl
dead link checker             deadlinkchecker
Leipzig Corpora Collection    findlinks
Scrapy                        Scrapy
wget utility                  wget
YaCy Peer-to-Peer             yacybot

User-agent strings can easily be spoofed, allowing bad actors to masquerade as legitimate crawlers. Webmasters should rely on other means to bolster their defenses against excessive crawling.

Browsers

For analytic purposes, it's useful to know what types of browsers are being used to access the server. This is a list of browser user-agent string patterns that can be used as a starting point for classifying browsers. The order of these patterns is important, since many user-agent strings include both the keywords "Safari" and "Chrome". Be aware that the algorithm used by the server stops searching at the first pattern matched; a first-match sketch follows the table.

Mobile Phone         String Pattern
Apple Mobile         iPhone
Android Mobile       Android
Windows Mobile       IEMobile
Opera Mobile         Opera Mini
Firefox Mobile       Mobile.*Firefox

Tablets              String Pattern
Apple Tablet         iPad
Apple Tablet         iPod
Amazon Tablet        Silk
Amazon Tablet        Kindle
Amazon Tablet        NetFront
Firefox Tablet       Tablet.*Firefox

Desktop              String Pattern
Opera Desktop        Opera
Opera Desktop        OPR
Firefox Desktop      Firefox
Firefox Desktop      Seamonkey
Microsoft Desktop    MSIE
Microsoft Desktop    Trident
Microsoft Desktop    Edge
Chrome Desktop       Chromium
Chrome Desktop       Chrome
Apple Desktop        Safari

Unclassified         String Pattern
Other                '*'
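
The following Python sketch shows how first-match classification against an ordered list behaves; the pattern list is abbreviated from the table above, and the function name is illustrative rather than part of the server.

import re

# Ordered, abbreviated pattern list taken from the table above.
# Chrome must be tested before Safari because Chrome's user-agent also contains "Safari".
BROWSER_PATTERNS = [
    ('Apple Mobile',      'iPhone'),
    ('Android Mobile',    'Android'),
    ('Opera Mobile',      'Opera Mini'),
    ('Firefox Mobile',    'Mobile.*Firefox'),
    ('Apple Tablet',      'iPad'),
    ('Firefox Desktop',   'Firefox'),
    ('Microsoft Desktop', 'Edge'),
    ('Chrome Desktop',    'Chrome'),
    ('Apple Desktop',     'Safari'),
]

def classify(user_agent):
    for label, pattern in BROWSER_PATTERNS:
        if re.search(pattern, user_agent):
            return label               # first match wins
    return 'Other'                     # the '*' catch-all

print(classify('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
               '(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36'))   # Chrome Desktop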

Configuration

The modules section must include a user-agent entry with the value on in order to enable the User Agent module.

The user-agent section of the configuration comprises a set of user-agent entries, each of which has a string-pattern and a comma-separated list of group names.

There are two meaningful group names: noserve, which instructs the server to refuse all requests, and nopush, which instructs the server not to use the speculative push protocol.

The request header value is tested sequentially against each configuration entry's string-pattern, and the first matching entry's group names are assigned to the current work order.
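
A hedged sketch of this first-match rule in Python; the entry list, data structure, and function name are illustrative, not the server's internals:

import re

# Each entry: (common name, string-pattern, group names), tested in order.
USER_AGENT_ENTRIES = [
    ('Googlebot', 'Googlebot', {'noserve'}),
    ('Bing',      'bingbot',   {'nopush'}),
]

def groups_for(user_agent):
    for _name, pattern, groups in USER_AGENT_ENTRIES:
        if re.search(pattern, user_agent):
            return groups              # first match wins
    return set()

groups = groups_for('Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)')
if 'noserve' in groups:
    print('refuse the request (403, rw-user-agent-noserve)')
elif 'nopush' in groups:
    print('serve the document, but skip speculative push (rw-user-agent-nopush)')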

Information Headers

The following information headers may be issued in conjunction with the User Agent handler:

Information Header      Trigger
rw-user-agent-noserve   A request was declined because the requestor was identified as a crawler. The response returns a 403 status code.
rw-user-agent-nopush    A speculative push request was declined because the requestor was identified as a crawler. The source document is compiled and returned with a normal status code.

EBNF

SP ::= U+20
CR ::= U+0D
ASTERISK ::= U+2A
COMMA ::= U+2C
EQUALS-SIGN ::= U+3D
LEFT-CURLY-BRACKET ::= U+7B
RIGHT-CURLY-BRACKET ::= U+7D
ua-common-name ::= ALPHA*
ua-pattern ::= ASTERISK 'pattern' EQUALS-SIGN ALPHA*
ua-groups ::= ASTERISK 'groups' EQUALS-SIGN ('noserve' | 'nopush' COMMA)*
user-agent-entry ::= ua-common-name SP ua-pattern SP ua-groups CR
user-agent-section ::= 'user-agent' SP LEFT-CURLY-BRACKET CR
                       user-agent-entry*
                       RIGHT-CURLY-BRACKET CR

Cookbook

Example 1: Default patterns and groups
modules {
    user-agent on
}
request {
    user-agent {
        // Crawlers
        360-Spider *groups=crawler *pattern=360Spider
        ADmantX-Platform-Analyzer *groups=crawler *pattern=admantx
        Ahrefs-Backlink-Research *groups=crawler *pattern=AhrefsBot
        Alexa *groups=crawler *pattern=ia_archiver
        Apple *groups=crawler *pattern=Applebot
        Baidu *groups=crawler *pattern=Baiduspider
        Bing *groups=crawler *pattern=bingbot
        BingPreview *groups=crawler *pattern=BingPreview
        Bitly-Link-Checker *groups=crawler *pattern=bitlybot
        Cốc-Cốc *groups=crawler *pattern=coccoc
        Cosmos-Crawler *groups=crawler *pattern=Cosmos
        Goo *groups=crawler *pattern=ichiro
        DuckDuckGo *groups=crawler *pattern=DuckDuckBot
        Exa *groups=crawler *pattern=Exabot
        Evaliant-Impressions *groups=crawler *pattern=evaliant
        Facebook *groups=crawler *pattern=facebookexternalhit
        Google-Favicon *groups=crawler *pattern='Google Favicon'
        Google *groups=crawler *pattern=Googlebot
        Heritrix-Internet-Archive *groups=crawler *pattern='archive.org_bot'
        Majestic-12-Distributed-Search *groups=crawler *pattern=MJ12bot
        MSN *groups=crawler *pattern=msnbot
        Pinterest *groups=crawler *pattern=pinterest
        Semrush *groups=crawler *pattern=SemrushBot
        SEOProfiler *groups=crawler *pattern=spbot
        Slack *groups=crawler *pattern=Slackbot
        Slack-ImgProxy *groups=crawler *pattern=Slack-ImgProxy
        Sogou *groups=crawler *pattern=Sogou
        Sosospider *groups=crawler *pattern=Sosospider
        TweetMeme *groups=crawler *pattern=TweetmemeBot
        Twitter *groups=crawler *pattern=Twitterbot
        Yahoo!-Slurp *groups=crawler *pattern=yahoo
        Yandex *groups=crawler *pattern=YandexBot
        WhatsApp *groups=crawler *pattern=WhatsApp

        // Tools
        curl *groups=tool *pattern=curl
        dead-link-checker *groups=tool *pattern=deadlinkchecker
        Leipzig-Corpora-Collection *groups=tool *pattern=findlinks
        Scrapy *groups=tool *pattern=Scrapy
        wget *groups=tool *pattern=wget
        YaCy-Peer-to-Peer *groups=tool *pattern=yacybot

        // Mobile Phones
        Apple-Mobile *groups=mobile *pattern=iPhone
        Android-Mobile *groups=mobile *pattern=Android
        Windows-Mobile *groups=mobile *pattern=IEMobile
        Opera-Mobile *groups=mobile *pattern='Opera Mini'
        Firefox-Mobile *groups=mobile *pattern='Mobile.*Firefox'

        // Tablets
        Apple-Tablet *groups=tablet *pattern=iPad
        Apple-Tablet *groups=tablet *pattern=iPod
        Amazon-Tablet *groups=tablet *pattern=Silk
        Amazon-Tablet *groups=tablet *pattern=Kindle
        Amazon-Tablet *groups=tablet *pattern=NetFront
        Firefox-Tablet *groups=tablet *pattern='Tablet.*Firefox'

        // Desktop
        Opera-Desktop *groups=desktop *pattern=Opera
        Opera-Desktop *groups=desktop *pattern=OPR
        Firefox-Desktop *groups=desktop *pattern=Firefox
        Firefox-Desktop *groups=desktop *pattern=Seamonkey
        Microsoft-Desktop *groups=desktop *pattern=MSIE
        Microsoft-Desktop *groups=desktop *pattern=Trident
        Microsoft-Desktop *groups=desktop *pattern=Edge
        Chrome-Desktop *groups=desktop *pattern=Chromium
        Chrome-Desktop *groups=desktop *pattern=Chrome
        Apple-Desktop *groups=desktop *pattern=Safari
    }
}

Example 2: Denying all requests from a crawler
modules {
    user-agent on
}
request {
    user-agent {
        Googlebot *pattern=Googlebot *groups=noserve
    }
}

Example 3: Denying speculative push requests from a crawler
modules {
    user-agent on
}
request {
    user-agent {
        Googlebot *pattern=Googlebot *groups=nopush
    }
}

Review

Key points to remember:

  • The user-agent request header may be used to distinguish crawler requests from browser requests.
  • Crawlers can be blocked from all requests, or only from speculative push requests.