Unicode and HTML
Because of technical limitations, some web browsers may not display some special characters in this article.

Web pages authored using Hypertext Markup Language (HTML) may contain multilingual text represented with the Unicode universal character set.


The relationship between Unicode and HTML tends to be a difficult topic for many computer professionals, document authors, and web users alike. The accurate representation of text in web pages from different natural languages and writing systems is complicated by the details of character encoding, markup language syntax, fonts, and varying levels of support by web browsers.

HTML document characters

Web pages are typically HTML or XHTML documents. Both types of documents consist, at a fundamental level, of characters, which are graphemes and grapheme-like units, independent of how they manifest in computer storage systems and networks.


An HTML document is a sequence of Unicode characters. More specifically, HTML 4.0 documents are required to consist of characters in the HTML document character set: a character repertoire wherein each character is assigned a unique, non-negative integer code point. This set is defined in the HTML 4.0 DTD, which also establishes the syntax (allowable sequences of characters) that can produce a valid HTML document. The HTML document character set for HTML 4.0 consists of most, but not all, of the characters jointly defined by Unicode and ISO/IEC 10646: the Universal Character Set (UCS).


Like an HTML document, an XHTML document is a sequence of Unicode characters. However, an XHTML document is an XML document, which, while not having an explicit "document character" layer of abstraction, nevertheless relies upon a similar definition of permissible characters that covers most, but not all, of the Unicode/UCS character definitions. The sets used by HTML and XHTML/XML are slightly different, but these differences have little effect on the average document author.


Regardless of whether the document is HTML or XHTML, when stored on a file system or transmitted over a network, the document's characters are encoded as a sequence of octets (bytes) according to a particular character encoding. This encoding may either be a Unicode Transformation Format, like UTF-8, that can directly encode any Unicode character, or a legacy encoding, like Windows-1252, that cannot.
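
The difference between the two kinds of encoding can be illustrated with a minimal Python sketch. The helper name try_encode is hypothetical, and cp1252 is Python's codec name for Windows-1252; the character chosen (U+8449) is only an illustration.

```python
def try_encode(text, encoding):
    """Return the encoded bytes, or None if the encoding lacks a character."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        return None

leaf = "\u8449"  # a Chinese character, chosen here only as an illustration

utf8_bytes = try_encode(leaf, "utf-8")     # three octets: E8 91 89
legacy_bytes = try_encode(leaf, "cp1252")  # None: Windows-1252 has no such character
```

A character that the legacy encoding cannot represent has to be written some other way, which is the role of the character references described in the following sections.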


Numeric character references

In order to work around the limitations of legacy encodings, HTML is designed such that it is possible to represent characters from the whole of Unicode inside an HTML document by using a numeric character reference: a sequence of characters that explicitly spells out the Unicode code point of the character being represented. A character reference takes the form &#N;, where N is either a decimal number for the Unicode code point, or a hexadecimal number, in which case it must be prefixed by x (as in &#xN;). The characters that make up the numeric character reference are universally representable in every encoding approved for use on the Internet.


For example, a Unicode code point like 33865 (decimal), which corresponds to a particular Chinese character, has to be preceded by &# and followed by ;, like this: &#33865;, which produces this: 葉 (if it doesn't look like a Chinese character, see the special characters note at the top of this article).
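
As a rough illustration, the following Python sketch builds and decodes such a reference. The function name to_decimal_ncr is hypothetical; html.unescape and the xmlcharrefreplace error handler are part of the Python standard library.

```python
import html

def to_decimal_ncr(char):
    """Spell out a single character as a decimal numeric character reference."""
    return "&#{};".format(ord(char))

reference = to_decimal_ncr("\u8449")   # "&#33865;"
decoded = html.unescape(reference)     # back to the character itself

# Encoding to ASCII with this error handler replaces every character that
# ASCII cannot represent with its decimal character reference.
ascii_safe = "Leaf: \u8449".encode("ascii", "xmlcharrefreplace")
```

The resulting markup contains only ASCII characters, so it survives storage in any of the legacy encodings mentioned above.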


The support for hexadecimal in this context is more recent, so older browsers might have problems displaying characters referenced with hexadecimal numbers, although they will probably have a problem displaying Unicode characters above code point 255 anyway. To ensure better compatibility with older browsers, it is still a common practice to convert the hexadecimal code point into a decimal value (for example &#9824; instead of &#x2660;).
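
The conversion from hexadecimal to decimal references can be automated. The sketch below is one possible approach, not a standard tool; the names HEX_NCR and hex_refs_to_decimal are hypothetical.

```python
import re

# Matches hexadecimal numeric character references such as &#x2660;
HEX_NCR = re.compile(r"&#[xX]([0-9A-Fa-f]+);")

def hex_refs_to_decimal(markup):
    """Rewrite hexadecimal character references as decimal ones."""
    return HEX_NCR.sub(lambda m: "&#{};".format(int(m.group(1), 16)), markup)

converted = hex_refs_to_decimal("spade: &#x2660;")  # "spade: &#9824;"
```

References that are already decimal do not match the pattern and are left unchanged.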


Named character entities

In HTML, there is a standard set of 252 named character entities for characters (some common, some obscure) that are either not found in certain character encodings or are markup sensitive in some contexts (for example angle brackets and quotation marks). Although any Unicode character can be referenced by its numeric code point, some HTML document authors prefer to use these named entities instead, where possible, as they are less cryptic and were better supported by early browsers.


Character entities can be included in an HTML document via the use of entity references, which take the form &EntityName;, where EntityName is the name of the entity. For example, &mdash;, much like &#8212; or &#x2014;, represents U+2014: the em dash character — like this — even if the character encoding used doesn't contain that character.
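
The equivalence of the three spellings can be checked with Python's standard library, which ships a table of the HTML 4 entity names (html.entities.name2codepoint). This is only a small sketch of the lookup:

```python
import html
from html.entities import name2codepoint

code_point = name2codepoint["mdash"]   # 8212, that is, U+2014

# The named, decimal, and hexadecimal forms all decode to the same character.
forms = ["&mdash;", "&#8212;", "&#x2014;"]
decoded = {html.unescape(form) for form in forms}
```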


Character encoding determination

In order to correctly process HTML, a web browser must ascertain which Unicode characters are represented by the encoded form of an HTML document. In order to do this, the web browser must know what encoding was used. When a document is transmitted via a MIME message or a transport that uses MIME content types such as an HTTP response, the message may signal the encoding via a Content-Type header, such as Content-Type: text/html; charset=ISO-8859-1. Other external means of declaring encoding are permitted, but rarely used. The encoding may also be declared within the document itself, in the form of a META element, like <meta http-equiv="content-type" content="text/html; charset=ISO-8859-1">. When there is no encoding declaration, the default varies depending on the localisation of the browser.
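
Both kinds of declaration can be read with the Python standard library. The sketch below is a simplified illustration, not a description of how any particular browser works; the names charset_from_header, MetaCharsetFinder, and charset_from_meta are hypothetical.

```python
from email.message import Message
from html.parser import HTMLParser

def charset_from_header(content_type):
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()  # lower-cased name, or None if absent

class MetaCharsetFinder(HTMLParser):
    """Record the charset declared in a <meta http-equiv="content-type"> element."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        http_equiv = (attrs.get("http-equiv") or "").lower()
        if tag == "meta" and http_equiv == "content-type":
            self.charset = charset_from_header(attrs.get("content") or "")

def charset_from_meta(markup):
    """Return the charset named by a META element in the markup, if any."""
    finder = MetaCharsetFinder()
    finder.feed(markup)
    return finder.charset

from_header = charset_from_header("text/html; charset=ISO-8859-1")
from_meta = charset_from_meta(
    '<meta http-equiv="content-type" content="text/html; charset=ISO-8859-1">'
)
```

Note that reading a META element presupposes that the markup has already been decoded well enough to find it, which works in practice because the declaration itself uses only ASCII characters.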


For a system set up mainly for Western European languages, it will generally be ISO-8859-1 or its close relation Windows-1252. For a browser from a location where multibyte character encodings are the norm, some form of autodetection is likely to be applied.


Because of the legacy of 8-bit text representations in programming languages and operating systems, and the desire to avoid burdening users with needing to understand the nuances of encoding, many text editors used by HTML authors are unable or unwilling to offer a choice of encodings when saving files to disk, and often do not even allow input of characters beyond a very limited range. Consequently, many HTML authors are unaware of encoding issues and may not have any idea what encoding their documents actually use. It is also a common misunderstanding that the encoding declaration effects a change in the actual encoding, whereas it is actually just a label that could be inaccurate.
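
The effect of an inaccurate label can be reproduced in a few lines of Python. The word used is only an illustration: the text is saved as UTF-8, but decoded as though the ISO-8859-1 label were true.

```python
text = "caf\u00e9"                 # the word "café"
stored = text.encode("utf-8")      # the bytes the editor actually wrote to disk

as_labelled = stored.decode("iso-8859-1")  # decoder trusts an inaccurate label
as_stored = stored.decode("utf-8")         # decoding with the real encoding
```

Here as_labelled comes out as "cafÃ©" because the two UTF-8 octets of the accented letter are each read as a separate ISO-8859-1 character, while as_stored recovers the original word.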


Many HTML documents are served with inaccurate encoding declarations, or no declarations at all. In order to determine the encoding in such cases, many browsers allow the user to manually select one from a list. They may also employ an encoding autodetection algorithm that works in concert with the manual override. The manual override may apply to all documents, or only those for which the encoding cannot be ascertained by looking at declarations and/or byte patterns. The fact that the manual override is present and widely used hinders the adoption of accurate encoding declarations on the Web; therefore the problem is likely to persist. This has been addressed somewhat by XHTML, which, being XML, requires that encoding declarations be accurate and that no workarounds be employed when they're found to be inaccurate.
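
The interplay of manual override, declaration, and guessing from byte patterns might be sketched as follows. This is an assumption-laden illustration rather than the algorithm of any real browser: the function name choose_encoding, the order of the candidates, and the fallback list are all hypothetical.

```python
def choose_encoding(data, declared=None, user_override=None,
                    fallbacks=("utf-8", "windows-1252")):
    """Pick an encoding for raw bytes: override, then declaration, then trial decoding."""
    candidates = [user_override, declared, *fallbacks]
    for name in candidates:
        if not name:
            continue
        try:
            data.decode(name)
            return name
        except (UnicodeDecodeError, LookupError):
            continue  # wrong guess or unknown label; try the next candidate
    # Assumed last resort: every byte sequence decodes under ISO-8859-1.
    return "iso-8859-1"

guess_utf8 = choose_encoding("caf\u00e9".encode("utf-8"))
guess_legacy = choose_encoding(b"caf\xe9", declared="utf-8")
```

In the second call the declared label is rejected because the bytes are not valid UTF-8, so the sketch falls through to Windows-1252, much as a lenient browser would paper over an inaccurate declaration.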


Web browser support

Many browsers are only capable of displaying a small subset of the full Unicode repertoire. Here is how your browser displays various Unicode code points:

Code point  HTML char ref  Unicode name                                                 What your browser displays
U+0041      &#65;          Latin capital letter A                                       A
U+00DF      &#223;         Latin small letter Sharp S                                   ß
U+00FE      &#254;         Latin small letter Thorn                                     þ
U+0394      &#916;         Greek capital letter Delta                                   Δ
U+0419      &#1049;        Cyrillic capital letter Short I                              Й
U+05E7      &#1511;        Hebrew letter Qof                                            ק
U+0645      &#1605;        Arabic letter Meem                                           م
U+0E57      &#3671;        Thai digit 7                                                 ๗
U+1250      &#4688;        Ge'ez syllable Qha                                           ቐ
U+3042      &#12354;       Hiragana letter A (Japanese)                                 あ
U+53F6      &#21494;       CJK Unified Ideograph-53F6 (Simplified Chinese "Leaf")       叶
U+8449      &#33865;       CJK Unified Ideograph-8449 (Traditional Chinese "Leaf")      葉
U+B5AB      &#46507;       Hangul syllable Tteolp (Korean "Ssangtikeut Eo Rieulpieup")  떫
U+10346     &#66374;       Gothic letter Faihu                                          𐍆

To display all of the characters above, you may need to install one or more large multilingual fonts, like Code2000 (and Code2001 for some extinct languages, for example Gothic).
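
The first three columns of such a table follow mechanically from the code point. A small Python sketch (the function name describe is hypothetical; unicodedata is part of the standard library, and it reports the formal Unicode names in upper case):

```python
import unicodedata

def describe(code_point):
    """Return (U+ notation, decimal character reference, Unicode name, character)."""
    char = chr(code_point)
    return (
        "U+{:04X}".format(code_point),
        "&#{};".format(code_point),
        unicodedata.name(char, "<unnamed>"),
        char,
    )

rows = [describe(cp) for cp in (0x0041, 0x00DF, 0x0394, 0x8449, 0x10346)]
```

Whether the fourth element of each row is actually visible still depends on the fonts installed, exactly as for the table above.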

Some web browsers, such as Mozilla Firefox, Opera, and Safari, are able to display multilingual web pages by intelligently choosing a font to display each individual character on the page. They will correctly display any mix of Unicode blocks, as long as appropriate fonts are present in the operating system.


Internet Explorer version 6 for Windows is capable of displaying the full range of Unicode characters, but characters that are not present in the first available font specified in the web page will only display if they are present in the designated fallback font for the current international script[1] (for example, only the Arial font will be considered for a block beginning with Latin text, or Arial Unicode MS if it is also installed; subsequent fonts specified in a list are ignored).[2] Otherwise, Explorer will display placeholder squares. For characters not present in a web page's fonts, web page authors must guess which other appropriate fonts might be present on users' systems, and manually specify them as the preferred choices for each block or range of text containing such characters. Microsoft recommends using CSS to specify a font for each block of text in a different language or script. The characters in the table above haven't been assigned specific fonts, yet most should render correctly if appropriate fonts have been installed.


Older browsers, such as Netscape Navigator 4.77, can only display text supported by the current font associated with the character encoding of the page, and may misinterpret numeric character references as being references to code values within the current character encoding, rather than references to Unicode code points. When you are using such a browser, it is unlikely that your computer has all of the required fonts, or that the browser can use all available fonts on the same page. As a result, the browser will not display the text in the examples above correctly, though it may display a subset of them. Because they are encoded according to the standard, though, they will display correctly on any system that is compliant and does have the characters available. Further, those characters given names for use in named entity references are likely to be more commonly available than others.
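
The misinterpretation of numeric character references can be made concrete with a small Python sketch. The reference &#147; and the Windows-1252 page encoding are illustrative assumptions, and both function names are hypothetical.

```python
def ncr_per_standard(number):
    """Per the standard, the number in a reference names a Unicode code point."""
    return chr(number)

def ncr_as_legacy_byte(number, page_encoding):
    """The older-browser misreading: treat the number as a byte in the page's encoding."""
    return bytes([number]).decode(page_encoding)

standard = ncr_per_standard(147)             # U+0093, a control character
misread = ncr_as_legacy_byte(147, "cp1252")  # U+201C, left double quotation mark

# For many values the two readings agree, which is why the error often goes unnoticed.
same = ncr_per_standard(223) == ncr_as_legacy_byte(223, "cp1252")
```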


For displaying characters outside the Basic Multilingual Plane, like the Gothic letter faihu in the table above, some systems (like Windows 2000) need manual adjustments of their settings. Fonts that cover more Unicode blocks and contain a larger character set are more likely to display such characters than ordinary fonts.
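
Characters outside the Basic Multilingual Plane are those with code points above U+FFFF. The Python sketch below checks this for the Gothic letter faihu; the function name is_outside_bmp is hypothetical, and the UTF-16 detail is added context rather than something stated in the text above.

```python
def is_outside_bmp(char):
    """True if the character lies beyond the Basic Multilingual Plane (above U+FFFF)."""
    return ord(char) > 0xFFFF

faihu = "\U00010346"                     # Gothic letter faihu
utf16_units = faihu.encode("utf-16-be")  # two 16-bit code units (a surrogate pair)
```

In UTF-16 such a character occupies four bytes (D8 00 DF 46 for faihu), so software that assumes one 16-bit unit per character needs extra support to handle it.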


References

  1. ^ Microsoft (2006), “Globalization Step-by-Step: Fonts” at Microsoft Global Development and Computing Portal. URL retrieved on 2006-04-26.
  2. ^ Girt By Net (2005), “Internet Explorer Makes Me ☹” at girtby.net. URL retrieved on 2006-04-26.

The Wikipedia article included on this page is licensed under the GFDL.