
Handling character encodings in HTML and CSS (tutorial)

Why should you read this?

If a browser cannot recognize the character encoding used on a web page, the content may be illegible. The information in this tutorial is especially important for those who maintain and extend a multilingual website, but specifying the character encoding of a document matters to everyone who writes HTML or CSS containing non-ASCII characters. Even if a page looks fine to you, other readers' browser settings may make it unreadable for them. This tutorial will help you understand the subject so that you can make the right decisions.

Objectives

After completing this tutorial, you should:

  • have a clear understanding of the factors involved in choosing the character encoding of (X)HTML documents, and of the benefits of Unicode
  • know when and how to specify the character encoding of HTML and CSS documents
  • be aware of some problems with the serving and character encoding of HTML files in older browsers that affect the above
  • know what the terms byte-order mark and normalization mean, what effect each has, and how to deal with them
  • know when and how to use escapes to represent characters

Intended audience: HTML and CSS developers. The advice here applies both to documents written by hand in an editor and to documents generated by scripts.

This tutorial gives you an ordered collection of references to articles that, taken together, will help you understand the basic aspects of characters and encodings when writing HTML and CSS.

Short and sweet

Save your pages as UTF-8 whenever possible.

Always declare the character encoding of the document: in the HTTP header if possible, and always in the document itself as well.

You can use @charset or HTTP headers to specify the character encoding of your stylesheet, but you only need to do this if the stylesheet contains non-ASCII characters (e.g. in font names, ID or class identifiers, etc.) and you cannot be sure that the HTML and its associated stylesheet use the same character encoding.

Don't use a BOM in UTF-8. Save HTML code in Unicode normalization form C (NFC).

Don't use character escapes, except for invisible or ambiguous characters. Don't use Unicode control characters if there is markup for them.

Important background information

If you are new to the field, there are some basic concepts that you should understand in order to be able to follow the rest of this tutorial. Once you are comfortable with these concepts, you can skip to the next section.

Choose and apply a character encoding

Content is made up of a sequence of characters. Characters represent the letters of the alphabet, punctuation marks, and so on. In a computer, however, the content is stored as a sequence of bytes, which are numerical values. Some characters are represented by more than one byte. As with ciphers in espionage, the way in which sequences of bytes are converted into characters depends on the key used to encode the text. In this context, the key is called the character encoding. Many different character encodings are available.
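To make the byte-versus-character distinction concrete, here is a minimal Python sketch. The sample character and the choice of ISO-8859-1 as the "wrong key" are illustrative assumptions, not something prescribed by this tutorial.

```python
# One character, two encodings: the stored bytes differ.
text = "\u00e9"  # the character "é"

utf8_bytes = text.encode("utf-8")         # stored as two bytes
latin1_bytes = text.encode("iso-8859-1")  # stored as one byte

# Reading UTF-8 bytes with the wrong "key" produces garbled text.
garbled = utf8_bytes.decode("iso-8859-1")

print(utf8_bytes)    # b'\xc3\xa9'
print(latin1_bytes)  # b'\xe9'
print(garbled)       # Ã©
```

This is the effect a reader sees when a browser guesses the wrong encoding for a page.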

Choose and apply a character encoding gives you simple advice on what character encoding to use for your content and how to apply it.

How the character encoding is specified

You should always specify the character encoding used for an HTML or XML document. Otherwise there is a risk that characters in the content will not be interpreted correctly. This affects more than readability for humans: machines, too, increasingly need to be able to understand your data. You should also check that you are not specifying different character encodings in different places.
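The check for conflicting declarations can be sketched roughly as follows. The helper names are hypothetical, and the regular expressions are a deliberate simplification: a real tool would use a proper HTML parser and follow the browser's encoding-detection rules.

```python
import re

def charset_from_header(content_type):
    """Return the charset parameter of a Content-Type header value, or None."""
    match = re.search(r'charset\s*=\s*"?([\w.:-]+)"?', content_type, re.IGNORECASE)
    return match.group(1).lower() if match else None

def charset_from_meta(html):
    """Return the encoding named in a <meta> element, or None.

    Covers both <meta charset="..."> and the older http-equiv form.
    """
    match = re.search(r'<meta[^>]*?charset\s*=\s*["\']?([\w.:-]+)', html, re.IGNORECASE)
    return match.group(1).lower() if match else None

def declarations_agree(content_type, html):
    """True unless the header and the document name different encodings."""
    header = charset_from_header(content_type)
    meta = charset_from_meta(html)
    return header is None or meta is None or header == meta

page = '<!DOCTYPE html><html><head><meta charset="utf-8"></head></html>'
print(declarations_agree("text/html; charset=UTF-8", page))       # True
print(declarations_agree("text/html; charset=ISO-8859-1", page))  # False
```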

Specification of the character encoding in HTML gives brief recommendations for those who want to know what to do quickly and more detailed information for those who need it.

Specification of the character encoding in CSS gives information for CSS.

 

The BOM (byte-order mark)

You may encounter the BOM (byte-order mark) when using a Unicode-based character encoding such as UTF-8 or UTF-16. In some cases you have to remove the BOM; in others you have to make sure that one is present.
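For the case where a UTF-8 BOM has to be removed, a minimal Python sketch might look like this. The function names are hypothetical; `codecs.BOM_UTF8` is the standard-library constant for the three bytes EF BB BF.

```python
import codecs

def has_utf8_bom(data: bytes) -> bool:
    """Report whether the byte string starts with the UTF-8 byte-order mark."""
    return data.startswith(codecs.BOM_UTF8)

def strip_utf8_bom(data: bytes) -> bytes:
    """Return the data without a leading UTF-8 BOM, if one is present."""
    if has_utf8_bom(data):
        return data[len(codecs.BOM_UTF8):]
    return data

with_bom = codecs.BOM_UTF8 + "<p>caf\u00e9</p>".encode("utf-8")
print(has_utf8_bom(with_bom))                  # True
print(has_utf8_bom(strip_utf8_bom(with_bom)))  # False
```

When reading text rather than raw bytes, Python's `utf-8-sig` codec also drops a leading BOM during decoding.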

The BOM (byte-order mark) in HTML helps you make sense of this.

Unicode normalization forms

Normalization must be considered when writing HTML pages with CSS stylesheets in UTF-8 (or another Unicode encoding), especially when dealing with text in a script that uses accents or other diacritical marks (e.g. umlauts).
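The following Python sketch shows why normalization matters, using the letter ü as an assumed example. The two strings look identical on screen but compare as different, which is how a class name in the HTML can fail to match the same-looking selector in the CSS.

```python
import unicodedata

composed = "\u00fc"     # ü as a single code point
decomposed = "u\u0308"  # u followed by a combining diaeresis

print(composed == decomposed)  # False

# Converting to normalization form C (NFC) makes the two forms identical.
nfc = unicodedata.normalize("NFC", decomposed)
print(nfc == composed)            # True
print(len(decomposed), len(nfc))  # 2 1
```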

Normalization in HTML and CSS explains this in more detail.

Use of character escapes

Any Unicode character can be represented in HTML, XML, or CSS by means of a character escape, which itself consists only of ASCII characters.
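As a rough illustration, this Python sketch builds the hexadecimal escape for a character in both syntaxes. The helper names are hypothetical, and the trailing space in the CSS form is one common way to mark where the escape ends.

```python
import html

def html_escape_char(ch: str) -> str:
    """Hexadecimal numeric character reference for HTML or XML, e.g. &#xE9;"""
    return f"&#x{ord(ch):X};"

def css_escape_char(ch: str) -> str:
    """CSS escape: backslash, hex code point, and a terminating space."""
    return f"\\{ord(ch):X} "

print(html_escape_char("\u00e9"))       # &#xE9;
print(repr(css_escape_char("\u00e9")))  # '\\E9 '

# The standard library can turn the HTML form back into the character.
print(html.unescape("&#xE9;") == "\u00e9")  # True
```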

Use of character escapes in markup and CSS gives advice on when and how to use escapes when needed.

Characters or markup?

There are some control characters in Unicode, some of which have the same function as markup. The question arises: which should you use and which should you avoid?

Characters or markup? answers this question.

Further reading