FACTOID # 8: Bookworms: Vermont has the highest number of high school teachers per capita and third highest number of librarians per capita.
 Home   Encyclopedia   Statistics   States A-Z   Flags   Maps   FAQ   About 


FACTS & STATISTICS    Advanced view

Search encyclopedia, statistics and forums:



(* = Graphable)



Encyclopedia > Numeric character reference

A numeric character reference (NCR) is a common markup construct used in SGML and other SGML-based markup languages such as HTML and XML. It consists of a short sequence of characters that, in turn, represent a single character from the Universal Character Set (UCS) of Unicode. NCRs are typically used in order to represent characters that are not directly encodable in a particular document. When the document is interpreted by a markup-aware reader, each NCR is treated as if it were the character it represents. Markup refers to the use of a markup language to describe the structure and appearance of a particular document. ... The Standard Generalized Markup Language (SGML) is a metalanguage in which one can define markup languages for documents. ... HTML, short for Hypertext Markup Language, is the predominant markup language for the creation of web pages. ... The Extensible Markup Language (XML) is a general-purpose markup language. ... The international standard ISO/IEC 10646 defines the Universal Character Set (UCS) as a character encoding. ... Unicode is an industry standard designed to allow text and symbols from all of the writing systems of the world to be consistently represented and manipulated by computers. ...



In SGML, HTML, and XML, the following are all valid numeric character references for the Greek capital letter Sigma ("Σ"):

 Σ Σ Σ Σ 


Markup languages are typically defined in terms of UCS or Unicode characters. That is, a document consists, at its most fundamental level of abstraction, of a sequence of characters, which are abstract units that exist independently of any encoding. A character encoding or character set (sometimes referred to as code page) consists of a code that pairs a sequence of characters from a given set with something else, such as a sequence of natural numbers, octets or electrical pulses, in order to facilitate the storage of text in computers...

Ideally, when the characters of a document utilizing a markup language are encoded for storage or transmission over a network as a sequence of bits, the encoding that is used will be one that supports representing each and every character in the document, if not in the whole of Unicode, directly as a particular bit sequence. This article is about the unit of information. ...

Sometimes, though, for reasons of convenience or due to technical limitations, documents are encoded with an encoding that cannot represent some characters directly. For example, the widely used encodings based on ISO 8859 can only represent, at most, 256 unique characters as one 8-bit byte each. ISO 8859, more formally ISO/IEC 8859, is a joint ISO and IEC standard for 8-bit character encodings for use by computers. ... In computer science a byte is a unit of measurement of information storage, most often consisting of eight bits. ...

Documents are rarely, in practice, ever allowed to use more than one encoding internally, so the onus is usually on the markup language to provide a means for document authors to express unencodable characters in terms of encodable ones. This is generally done through some kind of "escaping" mechanism.

The SGML-based markup languages allow document authors to use special sequences of characters from the ASCII range (the first 128 code points of Unicode) to represent, or reference, any Unicode character, regardless of whether the character being represented is directly available in the document's encoding. These special sequences are character references.

Character references that are based on the referenced character's UCS or Unicode "code point" are called numeric character references. In HTML 4 and in all versions of XHTML and XML, the code point can be expressed either as a decimal (base 10) number or as a hexadecimal (base 16) number. The syntax is as follows: The Extensible HyperText Markup Language, or XHTML, is a markup language that has the same depth of expression as HTML, but also conforms to XML syntax. ...

Character U+0026 (ampersand), followed by character U+0023 (number sign), followed by one of the following choices:

  • one or more decimal digits zero (U+0030) through nine (U+0039); or
  • character U+0078 ("x") followed by one or more hexadecimal digits, which are zero (U+0030) through nine (U+0039), Latin capital letter A (U+0041) through F (U+0046), and Latin small letter a (U+0061) through f (U+0066);

all followed by character U+003B (semicolon). Older versions of HTML disallowed the hexadecimal syntax.

The characters that comprise a numeric character reference can be represented in every character encoding used in computing and telecommunications today, so there is no risk of the reference itself being unencodable.

There is another kind of character reference called a character entity reference, which allows a character to be referred to by a name instead of a number. (Naming a character creates a character entity.) HTML defines some character entities, but not many; all other characters can only be included by direct encoding or using NCRs. HTML has been in use since 1991 (note that the W3C international standard is now XHTML), but the first standardized version with a reasonably complete treatment of international characters was version 4. ... An SGML entity is an abbreviation for part of a document. ...


The Universal Character Set defined by ISO 10646 is the "document character set" of SGML, HTML 4, so by default, any character in such a document, and any character referenced in such a document, must be in the UCS.

While the syntax of SGML does not prohibit references to unassigned code points, such as , SGML-derived markup languages such as HTML and XML can, and often do, restrict numeric character references to only those code points that are assigned to characters or that have not been permanently left unassigned.

Restrictions may also apply for other reasons. For example, in HTML 4, , which is a reference to a non-printing "form feed" control character, is allowed because a form feed character is allowed. But in XML, the form feed character cannot be used, not even by reference. As another example, €, which is a reference to another control character, is not allowed to be used or referenced in either HTML or XML, but when used in HTML, it is usually not flagged as an error by web browsers—some of which attempt to interpret it as a reference to the character represented by code value 128 in the Windows-1252 encoding: "€", which actually should be represented as €. As a further example, prior to the publication of XML 1.0 Second Edition on October 6, 2000, XML 1.0 was based on an older version of ISO 10646 and prohibited using characters above U+FFFD, except in character data, thus making a reference like 𐀀 (U+10000) illegal. In XML 1.1 and newer editions of XML 1.0, such a reference is allowed, because the available character repertoire was explicitly extended. ISO 8859-1, more formally cited as ISO/IEC 8859-1 or less formally as Latin-1, is part 1 of ISO/IEC 8859, a standard character encoding defined by ISO. It encodes what it refers to as Latin alphabet no. ... October 6 is the 279th day of the year (280th in leap years). ... 2000 (MM) was a leap year starting on Saturday of the Gregorian calendar. ...

Markup languages also place restrictions on where character references can occur.

See also



Share your thoughts, questions and commentary here
Your name
Your comments

Want to know more?
Search encyclopedia, statistics and forums:


Press Releases |  Feeds | Contact
The Wikipedia article included on this page is licensed under the GFDL.
Images may be subject to relevant owners' copyright.
All other elements are (c) copyright NationMaster.com 2003-5. All Rights Reserved.
Usage implies agreement with terms, 1022, m