W3C I18N Q&A: HTML, XHTML, XML and Control Codes

Legacy applications sometimes create data that incorporates control codes. When migrating these applications or their data to the web, it is therefore important to understand how markup languages support controls.

Two ranges of the Unicode character set are assigned as control codes. The Unicode Standard makes no particular use of these controls and leaves their definition up to the application. If the application does not specify their use, they are to be interpreted according to the semantics of ISO/IEC 6429. Most of you will recognize many of the 6429 controls: ACK, NAK, BEL, LF, FF, VT, CR, and others. The ISO 8859 family and other character standards base their control codes on the ISO 6429 standard.

The control codes in the range U+0000-U+001F are known as the "C0" range [1]. This range begins with the NUL control, U+0000. The control codes in the range U+0080-U+009F are known as the "C1" range [2]. DELETE, U+007F, is also a control and is adjacent to the beginning of the C1 range.

A few points are worth noting about controls and markup:

Whereas the ISO 8859 family reserves the C1 range for controls, Microsoft character sets (e.g. code pages 1250-1258) place printable characters in this range. Content authors sometimes mistakenly use the Microsoft code points, instead of the Unicode values, when creating Numeric Character References (NCRs). Because this mistake is so prevalent, many browsers display the Microsoft characters for NCRs in this range. This is incorrect behavior, and it further misleads the developer by appearing to confirm the mistaken value. The problem may only be discovered later, when some application treats the data as a control rather than as the intended character.
When control codes are used for formatting text, for example Form Feed, U+000C, it is better to replace the controls with appropriate markup [3].
If the data is not really textual, but binary, then it may be more practical to encode it, for example using base64.
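The NCR mistake described in the first point can be repaired mechanically in many cases. The following is a minimal sketch, not a definitive tool: it assumes the author meant Windows-1252 specifically (other Microsoft code pages assign different characters to these values), and the function name `fix_c1_ncrs` is hypothetical.

```python
import re

# Matches decimal (&#146;) and hexadecimal (&#x92;) numeric character references.
_NCR = re.compile(r"&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));")


def fix_c1_ncrs(markup: str) -> str:
    """Rewrite NCRs whose value lies in U+0080-U+009F as the Unicode value
    of the Windows-1252 character the author probably intended.

    Assumption: the mistaken values came from Windows-1252.
    """
    def repl(match: re.Match) -> str:
        hex_digits, dec_digits = match.groups()
        value = int(hex_digits, 16) if hex_digits else int(dec_digits)
        if not 0x80 <= value <= 0x9F:
            return match.group(0)  # not in the C1 range, leave untouched
        try:
            char = bytes([value]).decode("cp1252")
        except UnicodeDecodeError:
            return match.group(0)  # byte is unassigned in Windows-1252
        return f"&#x{ord(char):04X};"

    return _NCR.sub(repl, markup)


print(fix_c1_ncrs("It&#146;s &#x93;quoted&#x94; &#233;"))
# prints: It&#x2019;s &#x201C;quoted&#x201D; &#233;
```

Values that Windows-1252 leaves unassigned (such as 0x81) are left as they are, since there is no intended character to recover.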

When control codes represent other kinds of text data (neither formatting nor binary data), it can be important to preserve their values in context. However, how browsers display most of the controls is unspecified. Preserving control codes in text is generally more important for data interchange. Programmers working with legacy applications that may have data in the C0 range should be aware of which markup languages support the range.

The table below, "Controls and Markup Language", summarizes which markup languages support the control codes.

Solutions

If you need to represent the C0 controls in HTML, XML 1.0 or XHTML, you can create a convention to represent them and replace every occurrence with that convention. An alternative is to encode the data, for example as base64 or as hexadecimal values, so that only supported characters appear in the markup language text, and of course to decode it again when reading the files. Note that XML Schema provides data types for these encodings (base64Binary and hexBinary).
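As an illustration of the encoding approach, here is a minimal sketch that stores the same legacy string once as base64 and once as hexadecimal inside an XML 1.0 document, then decodes both on reading. The element name `payload`, the `encoding` attribute and the sample data are assumptions made for this example, not part of any standard vocabulary.

```python
import base64
import xml.etree.ElementTree as ET

# Hypothetical legacy record containing FS (U+001C) and ESC (U+001B).
legacy = "field1\x1cfield2\x1bend"
raw = legacy.encode("utf-8")

# Writing: only ASCII letters, digits, "+", "/" and "=" reach the markup.
record = ET.Element("record")
b64 = ET.SubElement(record, "payload", encoding="base64")
b64.text = base64.b64encode(raw).decode("ascii")
hx = ET.SubElement(record, "payload", encoding="hex")
hx.text = raw.hex().upper()
document = ET.tostring(record, encoding="unicode")

# Reading: decode according to the (hypothetical) encoding attribute.
decoded = {}
for element in ET.fromstring(document).findall("payload"):
    if element.get("encoding") == "base64":
        decoded["base64"] = base64.b64decode(element.text).decode("utf-8")
    else:
        decoded["hex"] = bytes.fromhex(element.text).decode("utf-8")

print(document)
print(decoded["base64"] == legacy, decoded["hex"] == legacy)
```

Base64 is more compact than hexadecimal, while hexadecimal is easier to inspect by eye; either keeps the control codes out of the markup text itself.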

Another alternative is to store the data in an external document and reference it from the XML document.

Controls and Markup Language
Controls	Range	HTML 4	XML 1.0	XHTML	XML 1.1
C0, except TAB, LF, CR	U+0000 (NUL)	Illegal	Illegal	Illegal	Illegal
C0, except TAB, LF, CR	U+0001-U+001F	Illegal	Illegal	Illegal	NCR, CER [4]
DELETE + C1	U+007F-U+009F	Supported	Supported	Supported	NCR, CER [4]
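The XML columns of the table can be checked programmatically. The sketch below classifies a code point using the Char production of XML 1.0 and the Char and RestrictedChar productions of XML 1.1. The function name and the three status labels are hypothetical, and HTML 4 and XHTML are not covered.

```python
def xml_char_status(cp: int, version: str = "1.0") -> str:
    """Return "illegal", "literal" or "ncr-only" for a code point.

    "literal" means the character may appear directly in the document;
    "ncr-only" means it may appear only as a character reference.
    """
    if version not in ("1.0", "1.1"):
        raise ValueError("version must be '1.0' or '1.1'")
    # Ranges shared by both versions (surrogates, U+FFFE and U+FFFF excluded).
    common = (0x20 <= cp <= 0xD7FF or 0xE000 <= cp <= 0xFFFD
              or 0x10000 <= cp <= 0x10FFFF)
    if version == "1.0":
        # XML 1.0 Char: TAB, LF, CR plus the common ranges.
        return "literal" if cp in (0x09, 0x0A, 0x0D) or common else "illegal"
    # XML 1.1 Char: everything from U+0001 up, so only NUL is lost from C0.
    if not (0x01 <= cp <= 0x1F or common):
        return "illegal"
    # XML 1.1 RestrictedChar: allowed, but only via a character reference.
    restricted = (0x01 <= cp <= 0x08 or cp in (0x0B, 0x0C)
                  or 0x0E <= cp <= 0x1F or 0x7F <= cp <= 0x84
                  or 0x86 <= cp <= 0x9F)
    return "ncr-only" if restricted else "literal"


for cp in (0x00, 0x1B, 0x09, 0x7F, 0x85):
    print(f"U+{cp:04X}", xml_char_status(cp, "1.0"), xml_char_status(cp, "1.1"))
```

Note that U+0085 (NEL) is the one C1 control that XML 1.1 permits literally, which the summary table does not break out separately.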

In XML 1.1, the simplest alternative is to represent any occurrence of a control with an NCR. For example, the control code "ESCAPE" U+001B would be represented by either the &#x1B; (hexadecimal) or &#27; (decimal) Numeric Character Reference.
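A serializer targeting XML 1.1 could apply this rule as in the following sketch. The function name `escape_controls_xml11` and its `hexadecimal` parameter are hypothetical; the character class follows the RestrictedChar production of XML 1.1, and escaping of `&` and `<` is deliberately left to the caller.

```python
import re

# XML 1.1 RestrictedChar: controls that may appear only as NCRs.
# TAB, LF, CR and NEL (U+0085) are not restricted and stay literal.
_RESTRICTED = re.compile("[\x01-\x08\x0b\x0c\x0e-\x1f\x7f-\x84\x86-\x9f]")


def escape_controls_xml11(text: str, hexadecimal: bool = True) -> str:
    """Replace each restricted control in text with a numeric character
    reference, in hexadecimal (default) or decimal form."""
    if "\x00" in text:
        # NUL is illegal in XML 1.1 even as a character reference.
        raise ValueError("U+0000 cannot be represented in XML 1.1")

    def repl(match: re.Match) -> str:
        cp = ord(match.group())
        return f"&#x{cp:X};" if hexadecimal else f"&#{cp};"

    return _RESTRICTED.sub(repl, text)


print(escape_controls_xml11("start\x1b[0mend\tok"))
# prints: start&#x1B;[0mend	ok   (the TAB is left as a literal character)
print(escape_controls_xml11("start\x1b[0mend", hexadecimal=False))
# prints: start&#27;[0mend
```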

NOTES:

[4] Character Entity Reference (CER) is a term defined in the HTML standard for a named entity that contains a single character. For example, &eacute; is the Character Entity Reference that represents "é". These Character Entity References are predefined and so are available to all HTML files. XML does not use the term Character Entity Reference, but we use it here to refer to an entity that you might define to represent characters that may be controls.
