How do HTML, XML, and XHTML support the Control Codes in the C0 (U+0000-U+001F) and C1 (U+0080-U+009F) ranges?
Legacy applications sometimes create data that incorporates controls. When migrating these applications or their data to the web,
it is therefore important to understand how controls are supported in markup languages.
There are two ranges of the Unicode Character Set that are
assigned as Control Codes. The Unicode Standard makes no
particular use of these controls and leaves their definition up
to the application. If the application does not specify their
use, then they are to be interpreted according to the semantics
of ISO/IEC 6429. Most of you will recognize many of the 6429 controls: ACK, NAK, BEL, LF, FF, VT, CR, and others.
The ISO 8859 family and other character standards base their control codes on the ISO/IEC 6429 standard.
The control codes in the range U+0000-U+001F are known as the "C0" range (see note 1).
This range begins with the NUL U+0000 control.
The control codes in the range U+0080-U+009F are known as the "C1" range (see note 2).
Delete U+007F is also a control and is adjacent to the beginning of the C1 range.
A few points are worth noting about controls and markup:
- Whereas the ISO 8859 family reserves the C1 range for
controls, Microsoft character sets (e.g. 1250-1258) place
printable characters in this range. Content authors sometimes mistakenly
use the Microsoft code points when creating Numeric
Character References (NCRs), instead of using the Unicode values.
Because of the prevalence of this mistake, many browsers display
the Microsoft characters in this range. This is incorrect
behavior and further misleads the developer by incorrectly
confirming the mistaken value. The problem may eventually be
discovered when the data is treated by some application as a
control and not the erroneous character.
- When control codes are used for formatting text,
for example Form Feed, U+000C, it is
better to replace the controls with appropriate
markup (see note 3).
- If the data is not really textual, but binary, then it may be more practical to encode it, for example using base64.
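The first point above can be illustrated with a small repair pass. The following is only a sketch: it assumes the author really meant windows-1252 code points (any genuine C1 reference would be rewritten too), and the function name `repair_c1_ncrs` is hypothetical.

```python
import re

# Matches decimal (&#150;) and hexadecimal (&#x96;) numeric character references.
NCR = re.compile(r"&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));")


def repair_c1_ncrs(text: str) -> str:
    """Rewrite NCRs in 0x80-0x9F as the Unicode value of the windows-1252 character.

    Assumption: the NCRs were written using windows-1252 byte values by mistake.
    """
    def fix(match: re.Match) -> str:
        hex_digits, dec_digits = match.group(1), match.group(2)
        value = int(hex_digits, 16) if hex_digits else int(dec_digits)
        if 0x80 <= value <= 0x9F:
            try:
                char = bytes([value]).decode("cp1252")
            except UnicodeDecodeError:
                # This byte is unassigned in windows-1252; leave the NCR alone.
                return match.group(0)
            return f"&#x{ord(char):04X};"
        return match.group(0)

    return NCR.sub(fix, text)


if __name__ == "__main__":
    # &#150; is the windows-1252 position of EN DASH, whose Unicode value is U+2013.
    print(repair_c1_ncrs("pages 10&#150;12"))
```

References outside 0x80-0x9F are passed through unchanged, so the pass can be run over a whole document.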
When control codes represent other kinds of text data (neither formatting nor binary data), it can be important to preserve their values in
context. However, how browsers display most of the controls is unspecified.
Preserving control codes in text generally matters most for data interchange.
Programmers working with legacy applications that may have data in the C0 range should be aware of which markup languages support the range.
The following table summarizes which markup languages support the control codes:
Controls and Markup Language
Controls | Range | HTML 4 | XML 1.0 | XHTML | XML 1.1 |
C0, except TAB, LF, CR | U+0000 (NUL) | Illegal | Illegal | Illegal | Illegal |
C0, except TAB, LF, CR | U+0001-U+001F | Illegal | Illegal | Illegal | NCR, CER (note 4) |
DELETE + C1 | U+007F-U+009F | Supported | Supported | Supported | NCR, CER (note 4) |
- The NUL control is illegal and cannot be represented by NCR or encoded directly in markup languages.
- HTML, XML 1.0, and XHTML do not support the C0 range, except for Tab U+0009, LF U+000A,
and CR U+000D. The C1 range is supported, i.e. you can encode the controls directly or represent them as NCRs.
- XML 1.1 restricts the range U+007F-U+009F (DELETE plus C1), except for U+0085 (NEL, the EBCDIC newline),
as well as the C0 range, apart from TAB, LF, and CR: these controls may not appear directly in the document.
However, XML 1.1 allows them to be represented by Numeric Character References
(NCR) or Character Entity References (CER, see note 4).
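The rules summarized above can be sketched as a lookup function. This is an illustration based on our reading of the `Char` and `RestrictedChar` productions in the XML 1.0 and XML 1.1 specifications; the function name `xml_support` and the three result strings are hypothetical.

```python
def xml_support(cp: int, version: str = "1.0") -> str:
    """Classify a code point as 'literal or NCR', 'NCR only', or 'illegal'."""
    # Ranges shared by both versions above the control area.
    upper_ok = (0x20 <= cp <= 0xD7FF or 0xE000 <= cp <= 0xFFFD
                or 0x10000 <= cp <= 0x10FFFF)
    if version == "1.0":
        # XML 1.0: only TAB, LF, and CR are allowed below U+0020.
        allowed = cp in (0x09, 0x0A, 0x0D) or upper_ok
        return "literal or NCR" if allowed else "illegal"
    if version == "1.1":
        # XML 1.1: everything from U+0001 up is a Char (NUL stays illegal) ...
        if not (0x01 <= cp <= 0x1F or upper_ok):
            return "illegal"
        # ... but restricted characters may only appear as references.
        restricted = (0x01 <= cp <= 0x08 or cp in (0x0B, 0x0C)
                      or 0x0E <= cp <= 0x1F
                      or 0x7F <= cp <= 0x84 or 0x86 <= cp <= 0x9F)
        return "NCR only" if restricted else "literal or NCR"
    raise ValueError(f"unknown XML version: {version!r}")


if __name__ == "__main__":
    for cp in (0x00, 0x09, 0x1B, 0x7F, 0x85, 0x9F):
        print(f"U+{cp:04X}", xml_support(cp, "1.0"), "|", xml_support(cp, "1.1"))
```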
Solutions
If you need to represent the C0 controls in HTML, XML 1.0 or XHTML, you can create a convention to represent them
and replace every occurrence with that convention.
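One possible convention, shown here purely as an illustration, is to substitute each unsupported C0 control with the corresponding symbol from the Unicode Control Pictures block (U+2400 onward). The function names are hypothetical, and the mapping is reversible only if the original data never contains those symbols itself.

```python
# Hypothetical convention: C0 control U+00xx <-> Control Picture U+24xx.
# TAB, LF, and CR are left alone because the markup languages accept them.
UNSUPPORTED_C0 = [cp for cp in range(0x00, 0x20) if cp not in (0x09, 0x0A, 0x0D)]
TO_PICTURE = {cp: 0x2400 + cp for cp in UNSUPPORTED_C0}
FROM_PICTURE = {picture: cp for cp, picture in TO_PICTURE.items()}


def protect_controls(text: str) -> str:
    """Replace unsupported C0 controls before writing text into markup."""
    return text.translate(TO_PICTURE)


def restore_controls(text: str) -> str:
    """Undo protect_controls() after reading text back out of markup."""
    return text.translate(FROM_PICTURE)


if __name__ == "__main__":
    original = "bold\x1b[1m on\ttab kept"
    safe = protect_controls(original)
    print(safe)
    print(restore_controls(safe) == original)
```

Both the producer and the consumer of the document have to agree on whatever convention is chosen.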
An alternative is to encode the data, for example as base64 or as hexadecimal values,
to ensure that only supported characters are used in the markup language text.
(The text must of course be decoded again when the files are read.) Note that XML Schema provides data types for these encodings.
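A minimal sketch of the encoding approach, using Python's standard library. The element name `payload` and the helper names are hypothetical; the XML Schema data types in question are `xs:base64Binary` and `xs:hexBinary`.

```python
import base64


def to_base64_text(data: bytes) -> str:
    """Encode bytes using only characters that are safe in markup."""
    return base64.b64encode(data).decode("ascii")


def from_base64_text(text: str) -> bytes:
    """Decode the base64 text read back from the document."""
    return base64.b64decode(text)


def to_hex_text(data: bytes) -> str:
    """Encode bytes as hexadecimal digits."""
    return data.hex().upper()


def from_hex_text(text: str) -> bytes:
    """Decode hexadecimal digits back to bytes."""
    return bytes.fromhex(text)


if __name__ == "__main__":
    raw = b"\x02ID42\x1b[0m\x03"  # data containing STX, ESC, and ETX controls
    element = f"<payload>{to_base64_text(raw)}</payload>"
    print(element)
    print(from_base64_text(to_base64_text(raw)) == raw)
```

Base64 is more compact than hexadecimal, while hexadecimal is easier to inspect by eye.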
Another alternative is to store the data in an external document and reference it from the XML document.
In XML 1.1, the simplest alternative is to represent any occurrence of a control with an NCR.
For example, the control code "ESCAPE" U+001B would be
represented by either the &#x1B; (hexadecimal) or &#27; (decimal) Numeric Character References.
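Producing such references can be automated. The sketch below assumes the output is an XML 1.1 document and uses a hypothetical function name; it handles only the restricted controls, so markup characters such as & and < still need the usual escaping.

```python
def escape_controls_xml11(text: str) -> str:
    """Replace XML 1.1 restricted controls with hexadecimal NCRs."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp == 0x00:
            # NUL cannot be represented at all, not even as an NCR.
            raise ValueError("U+0000 (NUL) cannot appear in XML")
        restricted = (0x01 <= cp <= 0x08 or cp in (0x0B, 0x0C)
                      or 0x0E <= cp <= 0x1F
                      or 0x7F <= cp <= 0x84 or 0x86 <= cp <= 0x9F)
        out.append(f"&#x{cp:X};" if restricted else ch)
    return "".join(out)


if __name__ == "__main__":
    print(escape_controls_xml11("reset\x1b[0m"))
```

TAB, LF, CR, and NEL (U+0085) pass through unchanged because XML 1.1 allows them directly.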
NOTES:
1 More details on the C0 range are available in the Unicode Code Chart: C0 Controls and Basic Latin.
2 More details on the C1 range are available in the Unicode Code Chart: C1 Controls and Latin-1 Supplement.
3 The document Unicode in XML and other Markup Languages
contains guidelines on the use of the Unicode Standard in conjunction with markup languages such as XML.
4 Character Entity Reference (CER) is a term defined in the HTML standard for a named entity
that contains a single character. For example, &eacute; is the Character Entity Reference
that represents "é". These Character Entity References are predefined and so are available to all HTML files.
XML does not use the term Character Entity Reference, but
we use it here to refer to an entity that you might define to represent characters that may be controls.