Defining server behaviour for web crawlers

User Agent

Preliminaries

This note describes how the server can be configured to handle special use cases when it detects that a web crawler is making requests.

Requests to the server are made by browsers and crawlers. There are good reasons to distinguish between the two, and to handle each in different ways. Distinguishing between the two is the task of the User Agent handler.

The server can use user-agent information to do two things: 1) it can refuse all requests from a matching user agent, and 2) it can suppress speculative push requests for a matching user agent.

The configured user-agent string patterns are also used in the Counters module to count requests by browser/crawler type.

A user-agent primer

Well-formed requests to the server should have a user-agent request header. Its value consists of a series of identifiers, chosen by the requesting software, that describe it to the server. The number of identifiers varies by vendor. The most generic identifiers come first, followed by more specific ones. Each identifier consists of a name, a version, and an optional comment: the name and version are joined by a SOLIDUS, while the comment, when present, is enclosed in parentheses. For example, a desktop Chrome browser typically sends something like Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36.

In practice, the sheer number of user-agent variants presents a challenge to anyone wanting to parse and classify them. The server can do the easy part: separating the header value into identifiers, and splitting each identifier into name, version, and comment. The webmaster or software engineer has to do the hard part: determining which of these are browsers and which are crawlers. There are only a few shortcuts, and they are not guaranteed.
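
As a rough illustration only (not the server's actual parser), the following Python sketch splits a user-agent header value into its identifiers; the regular expression and the sample header are assumptions chosen for the example.

import re

# Minimal sketch: split a user-agent header value into product identifiers,
# each with a name, a version, and an optional parenthesized comment.
IDENTIFIER = re.compile(r'(?P<name>[^/\s()]+)/(?P<version>\S+)(?:\s+\((?P<comment>[^)]*)\))?')

def split_identifiers(user_agent):
    # Returns a list of (name, version, comment) tuples.
    return [(m.group('name'), m.group('version'), m.group('comment') or '')
            for m in IDENTIFIER.finditer(user_agent)]

sample = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
          '(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36')
for name, version, comment in split_identifiers(sample):
    print(name, version, comment)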

Crawlers

Substrings to look for to determine whether a user-agent is a crawler (a minimal check is sketched after this list):

  1. The keyword "bot" is often included somewhere in the user-agent header.
  2. A website URL is often included somewhere in the user-agent header.
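
As a hedged illustration of these two heuristics (crude substring checks, not the server's logic), a quick test might look like this:

def looks_like_crawler(user_agent):
    ua = user_agent.lower()
    # 1. the keyword "bot" appears somewhere in the header
    if 'bot' in ua:
        return True
    # 2. a website URL appears somewhere in the header
    if 'http://' in ua or 'https://' in ua:
        return True
    return False

print(looks_like_crawler('Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'))  # True
print(looks_like_crawler('Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/124.0.0.0 Safari/537.36'))  # False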

This is a list of commercial crawlers with the string patterns used to identify them:

Vendor                            String Pattern
360 Spider                        360Spider
ADmantX Platform Analyzer         admantx
Ahrefs Backlink Research          AhrefsBot
Alexa                             ia_archiver
Apple                             Applebot
Baidu                             Baiduspider
Bing                              bingbot
BingPreview                       BingPreview
Bit.ly Link Checker               bitlybot
Cốc Cốc                           coccoc
Cosmos Crawler                    Cosmos
Goo                               ichiro
DuckDuckGo                        DuckDuckBot
Exa                               Exabot
Evaliant Impressions              evaliant
Facebook                          facebookexternalhit
Google Favicon                    Google Favicon
Google                            Googlebot
Heritrix Internet Archive         archive.org_bot
Majestic-12 Distributed Search    MJ12bot
MSN                               msnbot
Pinterest                         pinterest
Semrush                           SemrushBot
SEOProfiler                       spbot
Slack                             Slackbot
Slack-ImgProxy                    Slack-ImgProxy
Sogou                             Sogou
Sosospider                        Sosospider
TweetMeme                         TweetmemeBot
Twitter                           Twitterbot
Yahoo! Slurp                      yahoo
Yandex                            YandexBot
WhatsApp                          WhatsApp

This is a list of special-purpose, non-browser software with the string patterns used to identify it:

Tool                          String Pattern
curl utility                  curl
dead link checker             deadlinkchecker
Leipzig Corpora Collection    findlinks
Scrapy                        Scrapy
wget utility                  wget
YaCy Peer-to-Peer             yacybot

User-agent strings can easily be spoofed, allowing bad actors to masquerade as legitimate crawlers. Webmasters should rely on other means to bolster their defenses against excessive crawling.

Browsers

For analytic purposes, it's useful to know what types of browsers are being used to access the server. This is a list of browser user-agent string patterns that can be used as a starting point for classifying browsers. The order of these patterns is important, since many user-agent strings include both the keywords "Safari" and "Chrome". Be aware that the algorithm used by the server stops searching at the first pattern matched; a first-match sketch follows the table.

Mobile Phone         String Pattern
Apple Mobile         iPhone
Android Mobile       Android
Windows Mobile       IEMobile
Opera Mobile         Opera Mini
Firefox Mobile       Mobile.*Firefox

Tablets              String Pattern
Apple Tablet         iPad
Apple Tablet         iPod
Amazon Tablet        Silk
Amazon Tablet        Kindle
Amazon Tablet        NetFront
Firefox Tablet       Tablet.*Firefox

Desktop              String Pattern
Opera Desktop        Opera
Opera Desktop        OPR
Firefox Desktop      Firefox
Firefox Desktop      Seamonkey
Microsoft Desktop    MSIE
Microsoft Desktop    Trident
Microsoft Desktop    Edge
Chrome Desktop       Chromium
Chrome Desktop       Chrome
Apple Desktop        Safari

Unclassified         String Pattern
Other                '*'
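
The following Python sketch shows how first-match classification against an ordered list behaves; the pattern list is abbreviated from the table above, and the function name is illustrative rather than part of the server.

import re

# Ordered, abbreviated pattern list taken from the table above.
# Chrome must be tested before Safari because Chrome's user-agent also contains "Safari".
BROWSER_PATTERNS = [
    ('Apple Mobile',      'iPhone'),
    ('Android Mobile',    'Android'),
    ('Opera Mobile',      'Opera Mini'),
    ('Firefox Mobile',    'Mobile.*Firefox'),
    ('Apple Tablet',      'iPad'),
    ('Firefox Desktop',   'Firefox'),
    ('Microsoft Desktop', 'Edge'),
    ('Chrome Desktop',    'Chrome'),
    ('Apple Desktop',     'Safari'),
]

def classify(user_agent):
    for label, pattern in BROWSER_PATTERNS:
        if re.search(pattern, user_agent):
            return label               # first match wins
    return 'Other'                     # the '*' catch-all

print(classify('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
               '(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36'))   # Chrome Desktop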

Configuration

The modules section must include a user-agent entry with the value on in order to enable the User Agent module.

The user-agent section of the configuration comprises a set of user-agent entries, each of which has a string-pattern and a comma-separated list of group names.

There are two meaningful group names: noserve, which instructs the server to refuse all requests, and nopush, which instructs the server not to use the speculative push protocol.

The request header value is tested sequentially against each configuration entry's string-pattern, and the first matching entry's group names are assigned to the current work order.
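
A hedged sketch of this first-match rule in Python; the entry list, data structure, and function name are illustrative, not the server's internals:

import re

# Each entry: (common name, string-pattern, group names), tested in order.
USER_AGENT_ENTRIES = [
    ('Googlebot', 'Googlebot', {'noserve'}),
    ('Bing',      'bingbot',   {'nopush'}),
]

def groups_for(user_agent):
    for _name, pattern, groups in USER_AGENT_ENTRIES:
        if re.search(pattern, user_agent):
            return groups              # first match wins
    return set()

groups = groups_for('Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)')
if 'noserve' in groups:
    print('refuse the request (403, rw-user-agent-noserve)')
elif 'nopush' in groups:
    print('serve the document, but skip speculative push (rw-user-agent-nopush)')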

Information Headers

The following information headers may be issued in conjunction with the User Agent handler:

Information Header      Trigger
rw-user-agent-noserve   A request was declined because the requestor was identified as a crawler. The response returns a 403 status code.
rw-user-agent-nopush    A speculative push request was declined because the requestor was identified as a crawler. The source document is compiled and returned with a normal status code.

EBNF

SP ::= U+20
CR ::= U+0D
ASTERISK ::= U+2A
COMMA ::= U+2C
EQUALS-SIGN ::= U+3D
LEFT-CURLY-BRACKET ::= U+7B
RIGHT-CURLY-BRACKET ::= U+7D
ua-common-name ::= ALPHA*
ua-pattern ::= ASTERISK 'pattern' EQUALS-SIGN ALPHA*
ua-groups ::= ASTERISK 'groups' EQUALS-SIGN ('noserve' | 'nopush' COMMA)*
user-agent-entry ::= ua-common-name SP ua-pattern SP ua-groups CR
user-agent-section ::= 'user-agent' SP LEFT-CURLY-BRACKET CR
                       user-agent-entry*
                       RIGHT-CURLY-BRACKET CR

Cookbook

Example 1: Default patterns and groups
modules {
    user-agent on
}
request {
    user-agent {
        // Crawlers
        360-Spider *groups=crawler *pattern=360Spider
        ADmantX-Platform-Analyzer *groups=crawler *pattern=admantx
        Ahrefs-Backlink-Research *groups=crawler *pattern=AhrefsBot
        Alexa *groups=crawler *pattern=ia_archiver
        Apple *groups=crawler *pattern=Applebot
        Baidu *groups=crawler *pattern=Baiduspider
        Bing *groups=crawler *pattern=bingbot
        BingPreview *groups=crawler *pattern=BingPreview
        Bitly-Link-Checker *groups=crawler *pattern=bitlybot
        Cốc-Cốc *groups=crawler *pattern=coccoc
        Cosmos-Crawler *groups=crawler *pattern=Cosmos
        Goo *groups=crawler *pattern=ichiro
        DuckDuckGo *groups=crawler *pattern=DuckDuckBot
        Exa *groups=crawler *pattern=Exabot
        Evaliant-Impressions *groups=crawler *pattern=evaliant
        Facebook *groups=crawler *pattern=facebookexternalhit
        Google-Favicon *groups=crawler *pattern='Google Favicon'
        Google *groups=crawler *pattern=Googlebot
        Heritrix-Internet-Archive *groups=crawler *pattern='archive.org_bot'
        Majestic-12-Distributed-Search *groups=crawler *pattern=MJ12bot
        MSN *groups=crawler *pattern=msnbot
        Pinterest *groups=crawler *pattern=pinterest
        Semrush *groups=crawler *pattern=SemrushBot
        SEOProfiler *groups=crawler *pattern=spbot
        Slack *groups=crawler *pattern=Slackbot
        Slack-ImgProxy *groups=crawler *pattern=Slack-ImgProxy
        Sogou *groups=crawler *pattern=Sogou
        Sosospider *groups=crawler *pattern=Sosospider
        TweetMeme *groups=crawler *pattern=TweetmemeBot
        Twitter *groups=crawler *pattern=Twitterbot
        Yahoo!-Slurp *groups=crawler *pattern=yahoo
        Yandex *groups=crawler *pattern=YandexBot
        WhatsApp *groups=crawler *pattern=WhatsApp

        // Tools
        curl *groups=tool *pattern=curl
        dead-link-checker *groups=tool *pattern=deadlinkchecker
        Leipzig-Corpora-Collection *groups=tool *pattern=findlinks
        Scrapy *groups=tool *pattern=Scrapy
        wget *groups=tool *pattern=wget
        YaCy-Peer-to-Peer *groups=tool *pattern=yacybot

        // Mobile Phones
        Apple-Mobile *groups=mobile *pattern=iPhone
        Android-Mobile *groups=mobile *pattern=Android
        Windows-Mobile *groups=mobile *pattern=IEMobile
        Opera-Mobile *groups=mobile *pattern='Opera Mini'
        Firefox-Mobile *groups=mobile *pattern='Mobile.*Firefox'

        // Tablets
        Apple-Tablet *groups=tablet *pattern=iPad
        Apple-Tablet *groups=tablet *pattern=iPod
        Amazon-Tablet *groups=tablet *pattern=Silk
        Amazon-Tablet *groups=tablet *pattern=Kindle
        Amazon-Tablet *groups=tablet *pattern=NetFront
        Firefox-Tablet *groups=tablet *pattern='Tablet.*Firefox'

        // Desktop
        Opera-Desktop *groups=desktop *pattern=Opera
        Opera-Desktop *groups=desktop *pattern=OPR
        Firefox-Desktop *groups=desktop *pattern=Firefox
        Firefox-Desktop *groups=desktop *pattern=Seamonkey
        Microsoft-Desktop *groups=desktop *pattern=MSIE
        Microsoft-Desktop *groups=desktop *pattern=Trident
        Microsoft-Desktop *groups=desktop *pattern=Edge
        Chrome-Desktop *groups=desktop *pattern=Chromium
        Chrome-Desktop *groups=desktop *pattern=Chrome
        Apple-Desktop *groups=desktop *pattern=Safari
    }
}

Example 2: Denying all requests from a crawler
modules {
    user-agent on
}
request {
    user-agent {
        Googlebot *pattern=Googlebot *groups=noserve
    }
}

Example 3: Denying speculative push requests from a crawler
modules {
    user-agent on
}
request {
    user-agent {
        Googlebot *pattern=Googlebot *groups=nopush
    }
}

Review

Key points to remember:

  • The user-agent request header may be used to distinguish crawler requests from browser requests.
  • Crawlers can be blocked from all requests, or only from speculative push requests.