Declaring which character set to use with documents, style sheets, and text files

Charsets

Preliminaries

This note describes how character set declarations can be added to response headers for documents, style sheets, and text files.

Most new text documents served on the web today are prepared using the Unicode character set with UTF-8 encoding, but not all. The transition from the ISO/IEC 8859-x and Windows-125x character sets to UTF-8 has been underway for years and will continue indefinitely. And while older DOS-based code-page encodings (CP437 and its cousins) are rarely seen any more, the Shift_JIS, Big5, and EUC-KR character encodings for Japanese, Chinese, and Korean are still in active use.

It is common practice to add a meta tag like this to the <head> section of an HTML document:

<meta charset="UTF-8">

or a rule like this at the top of a CSS file:

@charset "UTF-8";

or an attribute added to a <script> tag:

<script type="text/javascript" charset="UTF-8" src="xyz.js"></script>

or an explicit declaration added to an XML document:

<?xml version="1.0" encoding="UTF-8"?>

These declarations are necessary only when a server fails to declare, in its response headers, which character set is used for the content.

When the server is properly configured, these declarations can be removed, because the server tells the browser directly, via response headers, which character encoding applies.
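
One way to confirm whether a server already declares the character set is to look at the content-type value it returns. The following is a minimal Python sketch, not part of the server itself; the function name declared_charset is hypothetical, and the header value is assumed to have been obtained by some other means (a browser's network inspector, for example).

```python
from email.message import Message


def declared_charset(content_type):
    """Return the charset parameter of a content-type header value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_param() reads from the Content-Type header by default and
    # returns None when the parameter is absent.
    return msg.get_param("charset")


print(declared_charset("text/html; charset=UTF-8"))
print(declared_charset("text/html"))
```

If the function returns None for a text document, the server is not declaring a charset, and the in-document declarations are still doing the work.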

Character set identifiers should always be used exactly as specified, with proper capitalization and hyphenation. For example, UTF-8, ISO-8859-1, Windows-1252, and Shift_JIS are all properly specified. Lower case variants should not be used.

For more about character set identifiers, refer to IETF RFC 2978, "IANA Charset Registration Procedures".

Configuration

The charset configuration section is used to declare character set identifiers for all documents of a given MIME-type. It comprises a collection of two-part entries: the left-hand side is a MIME-type, and the right-hand side is the character set identifier.

If the website has some documents encoded in one character set and other documents (of the same MIME-type) encoded in a different character set, the charset configuration section should not be used for that MIME-type. Instead, follow the common practice outlined above and declare the character set inside each document.

When a requested file is served, the declared charset identifier is appended to the content-type response header. For example, an HTML file encoded in UTF-8 would have a header of text/html; charset=UTF-8.
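
The effect on the header can be sketched in a few lines of Python. This is only an illustration of the behavior described above, not the server's implementation; the CHARSETS table and the function name content_type_header are hypothetical stand-ins for a configured charset section.

```python
# Hypothetical table mirroring the entries of a charset configuration section.
CHARSETS = {
    "text/html": "UTF-8",
    "text/css": "UTF-8",
    "application/json": "UTF-8",
}


def content_type_header(mime_type, charsets=CHARSETS):
    """Return the content-type value, with a charset appended if one is configured."""
    charset = charsets.get(mime_type)
    if charset is None:
        # No entry for this MIME-type: the header is left unchanged.
        return mime_type
    return f"{mime_type}; charset={charset}"


print(content_type_header("text/html"))
print(content_type_header("image/png"))
```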

Charsets should only be configured for text and application media types; they are not meaningful for image, audio, or video.

Note that each content type has a different set of rules regarding which charsets may be used. For example, JSON can only be UTF-8, UTF-16, or UTF-32.

Placement

The charset configuration section may appear in either the server/response subsection or a host/response subsection. When values occur in both the server/response and host/response subsections, they are merged according to the standard rules defined for the merge attribute.

EBNF

SP ::= U+20
CR ::= U+0D
SOLIDUS ::= U+2F
ASTERISK ::= U+2A
LEFT-CURLY-BRACKET ::= U+7B
RIGHT-CURLY-BRACKET ::= U+7D
media-type ::= 'text' | 'application'
subtype ::= (ALPHA | DIGIT | †)*
MIME-type ::= media-type SOLIDUS subtype
charset-identifier ::= (ALPHA | DIGIT | ††)*
charset-entry ::= MIME-type SP charset-identifier CR
charset-section ::= 'charset' SP LEFT-CURLY-BRACKET CR
charset-entry*
RIGHT-CURLY-BRACKET CR

† See section 4.2 of RFC 6838 for exact rules

†† See IETF RFC 2978 for guidance
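
To make the grammar concrete, here is a small Python sketch that reads a charset section into a dictionary. It is an approximation written for illustration only: the function name parse_charset_section is hypothetical, the subtype character class is taken from the restricted-name rules of RFC 6838, the charset identifier is accepted as any run of non-space characters, and leading indentation and blank lines are tolerated even though the grammar above does not mention them.

```python
import re

# One entry: MIME-type, whitespace, charset identifier.
# Only the 'text' and 'application' media types are accepted, per the grammar.
ENTRY = re.compile(
    r"^(text|application)/([A-Za-z0-9][A-Za-z0-9!#$&^_.+-]*)\s+(\S+)$"
)


def parse_charset_section(text):
    """Parse a 'charset { ... }' block into a dict of MIME-type -> charset."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if len(lines) < 2 or not re.fullmatch(r"charset\s+\{", lines[0]) or lines[-1] != "}":
        raise ValueError("expected 'charset {' ... '}'")
    entries = {}
    for line in lines[1:-1]:
        match = ENTRY.match(line)
        if match is None:
            raise ValueError(f"invalid charset entry: {line!r}")
        mime_type = f"{match.group(1)}/{match.group(2)}"
        entries[mime_type] = match.group(3)
    return entries


sample = """charset {
    text/html UTF-8
    application/xhtml+xml UTF-8
}
"""
print(parse_charset_section(sample))
```

An entry for a media type other than text or application, such as image/png, raises ValueError in this sketch.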

Cookbook

Example 1: Using the UTF-8 charset throughout

server {
    response {
        charset {
            text/css UTF-8
            text/html UTF-8
            text/blue UTF-8
            application/javascript UTF-8
            application/json UTF-8
            application/xhtml+xml UTF-8
            application/xml UTF-8
        }
    }
}

Example 2: Using the ISO-8859-1 charset for HTML documents

server {
    response {
        charset {
            text/html ISO-8859-1
        }
    }
}

Review

Key points to remember:

  • The charset configuration section associates MIME-types with character set identifiers.
  • A charset attribute is appended to the content-type response header.
  • No charset declaration is sent, even if properly configured, if no content-type response header is generated.
