W3C GEO Task Force
Glossary of Internationalization Terminology for the World Wide Web

DRAFT Version

NOTE: Links to other websites will appear in a new browser window.

A list of terms and definitions related to internationalization and localization in the web environment. The list is created and maintained by the W3C I18N GEO Task Force.

The glossary is in support of HTML Techniques (Draft)

If you would like to link to any of the definitions on this page, you can find the appropriate fragment identifier (i.e. the link's anchor name) by placing your mouse over the term. The identifier will appear in a tooltip. Append a hash mark ("#") and the identifier to the URL for this page: www.i18nguy.com/markup/i18n-glossary.html.

For example, place your mouse over the term "i18n". The tooltip will show that its identifier is also "i18n". (Most of the terms use their name as their identifier.) To link to the "i18n" entry, therefore, specify: www.i18nguy.com/markup/i18n-glossary.html#i18n

W3C I18N GEO Task Force Home Page

I18nGuy home page

Editor's notes:

Suggested that we add: BDO, LRE, PDF, LRM, ZWNJ. Do we want to document Unicode control characters?

Glossary of Internationalization Terminology for the World Wide Web
abjad A type of writing system where only consonants are generally written.
abugida A type of writing system whose basic characters denote consonants followed by a particular vowel, and in which diacritics denote the other vowels.
ANSI American National Standards Institute.
Microsoft's collective name for all or any Windows code pages. (As in "ANSI code page".) Sometimes used specifically for code page 1252, which is a superset of ISO/IEC 8859-1.
ASCII American Standard Code for Information Interchange. The U.S. national variant of ISO/IEC 646.
bidi Internationalization industry jargon. Abbreviation for bidirectional text.
Bidirectional text Also abbreviated as "bidi". Describes text that is primarily written from right to left and is often mixed with left-to-right text. Examples include text written in the Hebrew and Arabic scripts.
Basic Multilingual Plane (BMP) TBD
BMP Basic Multilingual Plane
BOM Byte Order Mark, U+FEFF. Also used as a Unicode character encoding signature.
byte order mark U+FEFF, also known as BOM and ZWNSP. Also used as Character Encoding Signature for Unicode encodings (UTF-8, UTF-16, et al.)
character A member of a set of elements used for the organization, control, or representation of data. For example, "LATIN CAPITAL LETTER A" names a character.
character encoding TBD
character entity TBD
character set TBD
charset TBD
character encoding signature TBD
character escape TBD
character repertoire A set of characters (in the mathematical sense)
coded character set TBD
code point TBD
compatibility character TBD
complex script TBD
DBCS Double-Byte Character Set. A specific type of MBCS: a character encoding in which each character is represented by either one or two bytes. Sometimes DBC is used for double-byte character.
diacritic TBD
document character set TBD
escape see "character escape"
fragment TBD
GEO W3C Abbreviation for Guidelines, Education, and Outreach. See www.w3.org/International/geo/
glyph TBD
goober A type of consideration for the internationalization of software or Web applications due to local legal, regulatory, or other governmental requirements. See Web Services Internationalization Usage Scenarios, Section 4.15 Legal and Regulatory Goobers
HTTP HyperText Transfer Protocol
HTTP header TBD
i18n Abbreviation. See internationalization. Also see "Origin of the abbreviation i18n".
IANA Internet Assigned Numbers Authority www.iana.org
IANA Charset Registry Registry for character encodings used by MIME, Web standards, and others.
internationalization Designing software to be usable around the world.
IRI W3C acronym for Internationalized Resource Identifier, an internationalized form of URI. See www.w3.org/International/O-URL-and-ident.
MBCS Multi-Byte Character Set. A type of character encoding where characters are of varying byte length. In some encodings, for example, a character may be encoded as 1, 2, 3 or 4 bytes.
mojibake (文字化け) Japanese jargon for "garbage", "changed", "ghost" or "disguised" characters: what is shown when Japanese characters are not displayed correctly (various black boxes or other nonsense characters). Here are some examples that look like mojibake: █ █ (You should see some black boxes.) There can also be white boxes: █ █ or ǶǶǶ. In Japan, these are sometimes called "tofu".
NCR Numeric Character Reference. (See HTML specification.)
NFC Unicode acronym for Normalization Form C
NLS Software industry abbreviation for National Language Support. General term referring to the features, libraries, and related data that support internationalization within an operating system or product. Example usage: "NLS Library".
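normalization Unicode term. Converting text to a standard form so that equivalent character sequences have the same representation. See NFC.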
quirks mode TBD
PUA Abbreviation for Unicode term: Private Use Area
SBCS Single-Byte Character Set. Some vendors refer to this as a code page. A character encoding where each character is represented by one 8-bit value. Sometimes SBC is used for single-byte character.
standards mode TBD
supplementary character TBD
tofu (豆腐) Japanese jargon for the white box character that is displayed by default for an unassigned or unknown character. For example: Ƕ. See mojibake
transcoding TBD
UCS Abbreviation for Unicode term: Universal Character Set which is specified by International Standard ISO/IEC 10646. Sometimes also used as Unicode Character Standard.
Unicode Unicode Character Standard (UCS), Universal Character Set. See Unicode Consortium. Also see ISO 10646.
user agent (UA) TBD
UTF Abbreviation, Unicode term for Unicode Transformation Format. Also see UTF-8, UTF-16, UTF-16LE, UTF-16BE, UTF-32, UTF-32BE, UTF-32LE
virama TBD
W3C Abbreviation for World Wide Web Consortium. See www.w3.org
WAI W3C abbreviation for Web Accessibility Initiative. See www.w3.org/WAI/
XML eXtensible Markup Language
XML declaration TBD
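ZWNSP Zero Width No-Break Space, U+FEFF. Use of U+FEFF as a zero width no-break space is deprecated (U+2060 WORD JOINER is preferred for that purpose). The same code point serves as the Byte Order Mark.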
Å, Å The symbol for Ångström (U+212B) and the letter A-ring (U+00C5, or U+0041 and U+030A, that is, A and Combining Ring Above). Scandinavian alphabets sort the letter A-ring after the letter Z.
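
The "BOM" and "byte order mark" entries above describe U+FEFF as a character encoding signature. As a rough illustration (a sketch, not part of the glossary's source material), the Python fragment below checks the leading bytes of a buffer against the BOM constants in the standard codecs module. The function name sniff_signature and the returned labels are hypothetical, chosen only for this example.

```python
import codecs

# Byte sequences that U+FEFF produces in each Unicode encoding form.
# UTF-32 is listed first because the UTF-32LE signature (FF FE 00 00)
# begins with the UTF-16LE signature (FF FE).
SIGNATURES = [
    (codecs.BOM_UTF32_LE, "UTF-32LE"),
    (codecs.BOM_UTF32_BE, "UTF-32BE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
]


def sniff_signature(data: bytes):
    """Return the encoding implied by a leading BOM, or None if there is none."""
    for bom, name in SIGNATURES:
        if data.startswith(bom):
            return name
    return None
```

A missing signature does not identify the encoding. It only means the data has to be labeled some other way, for example in an HTTP header.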
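
The SBCS, DBCS and MBCS entries above differ in how many bytes one character may occupy. The following Python sketch makes that visible by encoding each character separately and reporting its length in bytes. The helper name byte_lengths is an assumption for illustration; the codec names ("latin-1", "shift_jis", "utf-8") are ones that ship with Python.

```python
def byte_lengths(text: str, encoding: str) -> list:
    """Return the number of bytes each character of text occupies in encoding."""
    return [len(ch.encode(encoding)) for ch in text]


# Single-byte encoding: every character is one byte.
print(byte_lengths("A\u00e9", "latin-1"))

# Double-byte encoding: one byte for ASCII, two for kanji.
print(byte_lengths("A\u6587", "shift_jis"))

# Multi-byte encoding: UTF-8 uses one to four bytes per character.
print(byte_lengths("A\u00e9\u6587\U0001F600", "utf-8"))
```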
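
One common way to get the nonsense characters described under "mojibake" above is to decode bytes with the wrong character encoding. The Python sketch below reproduces this under one assumed scenario: text stored as UTF-8 is read as ISO 8859-1 ("latin-1"). The function names garble and repair are hypothetical, and the repair only works because ISO 8859-1 maps every byte value to a character, so no information is lost.

```python
def garble(text: str, actual: str = "utf-8", assumed: str = "latin-1") -> str:
    """Encode text in its actual encoding, then decode it with the wrong one."""
    return text.encode(actual).decode(assumed)


def repair(garbled: str, actual: str = "utf-8", assumed: str = "latin-1") -> str:
    """Undo garble() by reversing the mistaken decoding step."""
    return garbled.encode(assumed).decode(actual)


original = "\u6587\u5b57\u5316\u3051"  # the word "mojibake" in Japanese
broken = garble(original)
print(len(original), len(broken))      # 4 characters become 12
print(repair(broken) == original)
```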
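
The last entry above lists three ways to represent what looks like the same letter. As a small illustration of the "normalization" and "NFC" entries, the Python sketch below applies Normalization Form C with the standard unicodedata module and shows that all three representations become U+00C5. The variable names are chosen only for this example.

```python
import unicodedata

angstrom_sign = "\u212B"   # ANGSTROM SIGN
a_ring = "\u00C5"          # LATIN CAPITAL LETTER A WITH RING ABOVE
decomposed = "\u0041\u030A"  # A followed by COMBINING RING ABOVE

# Before normalization the three strings compare as different.
print(angstrom_sign == a_ring, decomposed == a_ring)

# After NFC all three are the single code point U+00C5.
normalized = [unicodedata.normalize("NFC", s)
              for s in (angstrom_sign, a_ring, decomposed)]
print([f"U+{ord(s):04X}" for s in normalized])
```

Comparing or searching text without normalizing first would treat these three strings as unequal.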

Other Terminology Resources