Han unification
Because of technical limitations, some web browsers may not display some special characters in this article.

Han unification is the process used by the authors of Unicode and the Universal Character Set to map multiple character sets of the CJK languages into a single set of unified characters. Chinese characters are common to Chinese (where they are called hanzi), Japanese (where they are called kanji), Korean (where they are called hanja) and Vietnamese (Hán Tự). Modern Korean, Chinese and Japanese typefaces may represent a given Han character as somewhat different glyphs. However, in the formulation of Unicode, these different glyphs were treated as the same character. This unification is referred to as "Han unification", and the resulting character repertoire is sometimes referred to as Unihan.


Unihan can also refer to the Unihan Database web site maintained by the Unicode Consortium, which provides information about all of the unified Han characters encoded in the Unicode standard, including representative glyphs, mappings to various national standards, dictionary numbers, and definitions for compound words drawn from the free Japanese EDICT and Chinese CEDICT dictionary projects (which are provided for convenience and are not a formal part of the Unicode standard).


Standard

Rules for Han unification are given in the East Asian Scripts chapter of the various versions of the Unicode Standard (Chapter 11 in Unicode 4.0) [1]. The Ideographic Rapporteur Group (IRG) [2], made up of experts from the Chinese-speaking countries, North and South Korea, Japan, Vietnam, and other countries, is responsible for the process.


Details

The article "The secret life of Unicode" on IBM developerWorks explains this issue in a way that also illustrates some of the confusion:

The problem stems from the fact that Unicode encodes characters rather than "glyphs," which are the visual representations of the characters. There are four basic traditions for East Asian character shapes: traditional Chinese, simplified Chinese, Japanese, and Korean. While the Han root character may be the same for CJK languages, the glyphs in common use for the same characters may not be, and new characters were invented in each country.

For example, the traditional Chinese glyph for "grass" uses four strokes for the "grass" radical, whereas the simplified Chinese, Japanese, and Korean glyphs use three. But there is only one Unicode point for the grass character (草, U+8349) regardless of writing system. Another example is the ideograph for "one" (壹, 壱, or 一), which is different in Chinese, Japanese, and Korean. Many people think that the three versions should be encoded differently.

In fact, the three ideographs for "one" are encoded separately in Unicode. They are not national variants. The first and second are used on financial instruments to prevent forgery, while the third is the common form in all three countries.


A slight difference in rendering characters might be considered a serious problem if it changes the meaning or reflects the wrong cultural tradition. Beyond the simple nuisance of Japanese text looking like Chinese, names might be displayed with a different glyph: the same character in the sense of encoding, but a different character in the view of the users. This rendering problem is often used to criticize Westerners for not being aware of subtle distinctions, even though the unification is being carried out by East Asian experts. The display error occurs only when rendering plain text in a single font, and not when rendering language-specific text and names in language-appropriate fonts.


The problem of one character representing semantically different concepts is also present in the Latin part of Unicode. The Unicode character for an apostrophe is the same as the character for a right single quote: ’.
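
The point can be checked with a minimal sketch using Python's standard unicodedata module. The variable names are illustrative only; the sketch simply shows that the code point used for a typographic apostrophe carries a single identity, that of the closing quotation mark.

```python
import unicodedata

curly = "\u2019"  # the character shown above, used for both roles

# Unicode gives this code point a single name and a single identity.
print(f"U+{ord(curly):04X}", unicodedata.name(curly))

# The same code point serves as an apostrophe and as a closing quote.
contraction = "don\u2019t"
quotation = "\u2018quoted\u2019"
same_character = contraction[3] == quotation[-1] == curly
print(same_character)
```

Software that needs to tell the two roles apart has to infer the difference from context, since the encoding itself does not record it.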


The process of Han Unification was controversial, with most of the opposition coming from Japan. Opponents of Han unification state that it steamrolls over thousands of years of cultural tradition, misses many of the subtleties that are one of the most important features of these languages, and renders serious literature and academic research in these languages impossible. Proponents of Han unification point out that the unification process is in the hands of specialists from China, Korea, and Japan, and that the objections to unification of specific characters are made without regard to their histories. Characters which some Japanese today consider completely distinct were historically the same, and were taught as the same in Japanese schools until the 1950s. As for historical research, Unicode now encodes far more characters than any other standard, and far more than were listed in any dictionary, with many more being processed for inclusion as fast as the scholars can agree on their identities.


Some characters used only in names are not included in Unicode. This is not a form of cultural imperialism, as is sometimes feared. These characters are generally not included in their national character sets either.


Controversy

Some of the controversy stems from the fact that the decision to perform Han unification was made by the initial Unicode Consortium, which at the time was a consortium of North American companies and organizations (most of them in California) [3] and included no East Asian government representatives. The initial design goal was to create a 16-bit standard, and Han unification was therefore a critical step for avoiding tens of thousands of character duplications [4]. This 16-bit requirement later had to be abandoned. The controversy later extended to the internationally representative ISO: the initial CJK-JRG group favored a proposal (DIS 10646) for a non-unified character set, "which was thrown out in favor of unification with the Unicode Consortium's unified character set by the votes of American and European ISO members" (even though the Japanese position was unclear) [5]. Endorsing the Unicode Han unification was a necessary step for the heated ISO 10646/Unicode merger.


Much of the controversy surrounding Han unification rests on the distinction between characters and glyphs, as defined in Unicode, and the related but distinct idea of graphemes. Unicode defines abstract characters, as opposed to glyphs, which are particular visual representations of a character in a font, and graphemes, which are the basic units of writing in a particular language. One character may be represented by many distinct glyphs, for example a "g" or an "a", both of which may have one loop or two. In Dutch, "ij" is sometimes considered a single letter (ĳ), and thus arguably a grapheme (a digraph); for example, the whole digraph is capitalized at the start of "IJsselmeer". The same holds for "ch" in some Spanish-speaking countries and "lj" in Croatian. Graphemes present in national character code standards have been added to Unicode, as required by Unicode's Source Separation rule, even where they can be composed of characters already available.
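
As a rough illustration of graphemes that exist as single code points even though they can be composed from existing letters, the sketch below applies compatibility decomposition (NFKD) with Python's standard unicodedata module. Treating U+0133 and U+01C9 as examples of the Dutch and Croatian graphemes discussed above is an assumption made for this illustration, as are the helper and variable names.

```python
import unicodedata

# Precomposed digraph letters (assumed examples of the graphemes above).
digraphs = {
    "\u0133": "ij",  # LATIN SMALL LIGATURE IJ (Dutch)
    "\u01C9": "lj",  # LATIN SMALL LETTER LJ (Croatian)
}

def decompose(text: str) -> str:
    """Apply compatibility decomposition (NFKD) to the text."""
    return unicodedata.normalize("NFKD", text)

for single, pair in digraphs.items():
    expanded = decompose(single)
    # One code point on the left, two ordinary letters on the right.
    print(f"U+{ord(single):04X} -> {expanded!r} ({len(expanded)} code points)")
```

Each digraph is one character to the encoding, yet normalizes to the same two letters a user could have typed separately.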


Unicode publishes charts with pictures for each character, but these are illustrations only and do not mandate the character's shape. References like [6] below seem to assume that what the Unicode standard pictures is how each character must be displayed, and protest when it doesn't match the local appearance of the character. The way things are supposed to work is that a Japanese user will have a font with Japanese-style characters, a Chinese user will have a font with Chinese-style characters, etc., and everyone will see the "right" characters for them. Problems are introduced when several languages must be represented in the same text document, and users expect different fonts for the different languages. This falls outside the scope of the Unicode standard, and is intended to be handled with higher-level markup defining the language used for each string of characters; the fact that software support for this has tended to be cumbersome and often inadequate has contributed toward the misunderstanding of the effects of unification.


Note that most of the opposition to Han unification appears to be Japanese, because of increased sensitivity to the distinctions between Chinese and Japanese styles of characters. There has been very little opposition from Chinese speakers, since Unicode did not unify simplified characters with their traditional forms. Although the Taiwan Big5 character set does not include Simplified characters, the PRC has character set standards with and without them. Unicode is seen as neutral with regard to the politically charged issue of Simplified versus Traditional characters, encoding Simplified and Traditional Chinese glyphs separately (e.g. the ideograph for "discard" is 丟 U+4E1F for Traditional Chinese Big5 #A5E1 and 丢 U+4E22 for Simplified Chinese GB #2210). Traditional and Simplified characters must be encoded separately according to Unicode Han unification rules, because they are distinguished in pre-existing PRC character sets, not just because they have different shapes. Mapping between Traditional and Simplified characters is not one-to-one, which also prevents unification.
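
The "discard" example can be explored with a short sketch that relies only on Python's bundled big5 and gb2312 codecs. Reading "GB #2210" as a row-cell position (row 22, cell 10) in the GB 2312 table is an assumption made here, and the variable names are illustrative.

```python
# The two forms of "discard" cited above, with their Unicode code points.
traditional = "\u4E1F"  # 丟
simplified = "\u4E22"   # 丢

# They are separate characters in Unicode, not one unified character.
print(f"U+{ord(traditional):04X}", f"U+{ord(simplified):04X}")

# Python's bundled legacy codecs give each form its own byte sequence.
big5_bytes = traditional.encode("big5")
gb_bytes = simplified.encode("gb2312")
print(big5_bytes.hex().upper(), gb_bytes.hex().upper())

# Assumed reading of "GB #2210": row 22, cell 10. In the EUC-CN byte form
# used by the gb2312 codec, each part is offset by 0xA0.
row, cell = gb_bytes[0] - 0xA0, gb_bytes[1] - 0xA0
print(row, cell)
```

Because each legacy standard already assigns the two forms different codes, converting to Unicode and back can preserve the distinction only if Unicode keeps them apart as well.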


Specialist character sets developed to address, or regarded by some as not suffering from, these perceived deficiencies include:

  • ISO/IEC 2022
  • CNS 11643
  • CCCII
  • TRON
  • UTF-2000

However, none of these alternative standards has been as widely adopted as Unicode, which is now the base character set for many new standards and protocols, and is built into the architecture of operating systems (Microsoft Windows, Apple Mac OS X, and many versions of Unix), programming languages (Perl, Python, C#, Java, Common LISP, APL), libraries (IBM International Components for Unicode (ICU) along with the Pango, Graphite, Scribe, Uniscribe, and ATSUI rendering engines), font formats (TrueType and OpenType) and so on.


Examples of language dependent characters

In each row of the following table, the same character is repeated in all five columns. However, each column is marked (via the HTML lang attribute) as being in a different language: Chinese (three varieties: unmarked "Chinese", simplified characters, and traditional characters), Japanese, or Korean. Your browser should select, for each character, a glyph (from a font) suitable for the specified language. This fallback glyph selection only works if you have CJK fonts installed on your system and the font selected to display this article does not include glyphs for these characters. Note also that Unicode includes non-graphical language tag characters in the range U+E0000–U+E007F for plain-text language tagging.

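Code     Chinese (generic)   Chinese (simplified)   Chinese (traditional)   Japanese   Korean
U+4E0E   与   与   与   与   与
U+4ECA   今   今   今   今   今
U+4EE4   令   令   令   令   令
U+514D   免   免   免   免   免
U+5165   入   入   入   入   入
U+5168   全   全   全   全   全
U+5177   具   具   具   具   具
U+5203   刃   刃   刃   刃   刃
U+5316   化   化   化   化   化
U+5340   區   區   區   區   區
U+5916   外   外   外   外   外
U+60C5   情   情   情   情   情
U+624D   才   才   才   才   才
U+6B21   次   次   次   次   次
U+6D77   海   海   海   海   海
U+6F22   漢   漢   漢   漢   漢
U+753B   画   画   画   画   画
U+76F4   直   直   直   直   直
U+771F   真   真   真   真   真
U+7A7A   空   空   空   空   空
U+7D00   紀   紀   紀   紀   紀
U+8349   草   草   草   草   草
U+89D2   角   角   角   角   角
U+8ACB   請   請   請   請   請
U+9053   道   道   道   道   道
U+9913   餓   餓   餓   餓   餓
U+9AA8   骨   骨   骨   骨   骨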
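
A table of this kind can be produced with a few lines of code. The sketch below is only an illustration: the language tags in COLUMNS are an assumption about what such a page might use, and tagged_row is a hypothetical helper name. It emits one HTML row in which the same code point appears under five different lang attributes, leaving glyph choice to the browser and its fonts.

```python
from html import escape

# Language tags for the five columns (assumed values for illustration).
COLUMNS = ["zh", "zh-Hans", "zh-Hant", "ja", "ko"]

def tagged_row(code_point: int) -> str:
    """Return one HTML table row repeating the character under each tag."""
    char = escape(chr(code_point))
    cells = "".join(f'<td lang="{lang}">{char}</td>' for lang in COLUMNS)
    return f"<tr><th>U+{code_point:04X}</th>{cells}</tr>"

# The "grass" character discussed earlier in the article.
print(tagged_row(0x8349))
```

The underlying text is identical in every cell; only the markup differs, which is why the distinction disappears when the same text is copied into a plain-text context.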


Examples of some non-unified Han Ideographs

For some glyph differences, Unicode has encoded a second, separate character for the variant. For these characters, one does not have to switch font or language tag to show the difference: simply selecting the variant character will display the language-dependent glyph even without language tags. In the following table, the separate rows in each group contain "the equivalent" character using different code points. Note that for characters such as 入 (U+5165), the only way to display the two variants is to change font (or language tag) as described in the previous table. However, for 內 (U+5167), there is an alternate character 内 (U+5185), as illustrated below. For some characters, like 兌/兑 (U+514C/U+5151), either method can be used to display the glyph differences.

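Code     Chinese (generic)   Chinese (simplified)   Chinese (traditional)   Japanese   Korean
U+9AD8   高   高   高   高   高
U+9AD9   髙   髙   髙   髙   髙

U+7D05   紅   紅   紅   紅   紅
U+7EA2   红   红   红   红   红

U+4E1F   丟   丟   丟   丟   丟
U+4E22   丢   丢   丢   丢   丢

U+4E57   乗   乗   乗   乗   乗
U+4E58   乘   乘   乘   乘   乘

U+4FA3   侣   侣   侣   侣   侣
U+4FB6   侶   侶   侶   侶   侶

U+514C   兌   兌   兌   兌   兌
U+5151   兑   兑   兑   兑   兑

U+5167   內   內   內   內   內
U+5185   内   内   内   内   内

U+7522   產   產   產   產   產
U+7523   産   産   産   産   産

U+7A05   稅   稅   稅   稅   稅
U+7A0E   税   税   税   税   税

U+4E80   亀   亀   亀   亀   亀
U+9F9C   龜   龜   龜   龜   龜
U+9F9F   龟   龟   龟   龟   龟

U+5225   別   別   別   別   別
U+522B   别   别   别   别   别

U+4E21   両   両   両   両   両
U+4E24   两   两   两   两   两
U+5169   兩   兩   兩   兩   兩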

Unicode ranges


The "CJK Unified Ideographs" range has 20,924 characters. Together with the extension A and B ranges, the total number of characters included for CJK ideographs (as of Unicode 5.0) is 70,226 characters. These ideographic characters appear in the following three blocks:

  • CJK Unified Ideographs (4E00–9FFF)
  • CJK Unified Ideographs Extension A (3400–4DBF)
  • CJK Unified Ideographs Extension B (20000–2A6DF)
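
The three ranges above translate directly into a lookup. The following is a minimal sketch with hypothetical names (CJK_BLOCKS, cjk_block) that covers only the blocks listed here, not later extensions.

```python
from typing import Optional

# Block boundaries as listed above (inclusive, hexadecimal).
CJK_BLOCKS = [
    ("CJK Unified Ideographs", 0x4E00, 0x9FFF),
    ("CJK Unified Ideographs Extension A", 0x3400, 0x4DBF),
    ("CJK Unified Ideographs Extension B", 0x20000, 0x2A6DF),
]

def cjk_block(char: str) -> Optional[str]:
    """Return the name of the listed block containing char, or None."""
    cp = ord(char)
    for name, first, last in CJK_BLOCKS:
        if first <= cp <= last:
            return name
    return None

# Size of each block's range; not every position in a range is assigned.
for name, first, last in CJK_BLOCKS:
    print(f"{name}: {last - first + 1} code points")

print(cjk_block("\u8349"))  # the "grass" character, U+8349
```

Extension B lies outside the Basic Multilingual Plane, so code that assumes 16-bit characters will not reach it.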

Unicode includes supporting CJKV radicals, strokes, punctuation, marks and symbols in the following blocks:

Additional compatibility (discouraged use) characters appear in these blocks:

  • Kangxi Radicals (2F00–2FDF)
  • Enclosed CJK Letters and Months (3200–32FF)
  • CJK Compatibility (3300–33FF)
  • CJK Compatibility Ideographs (F900–FAFF)
  • CJK Compatibility Ideographs Supplement (2F800–2FA1F)
  • CJK Compatibility Forms (FE30–FE4F)

These compatibility characters are included for compatibility with legacy text handling systems and other legacy character sets. They include forms of characters for vertical text layout and rich text characters that Unicode recommends handling through other means.



External links

  • Unihan Database
    • Example of data for the han character "中"
  • Unicode standard
  • Han Unification in Unicode by Otfried Cheong
  • Why Unicode Won't Work on the Internet: Linguistic, Political, and Technical Limitations
  • Why Unicode Will Work On The Internet
  • Per-character summary of differences in characters
  • The secret life of Unicode
  • GB18030 Support Package for Windows 2000/XP, including Chinese, Tibetan, Yi, Mongolian and Thai font by Microsoft
  • Proposal to encode additional grass radicals in the UCS – A humorous proposal to encode all possible variants of the grass radical, made as an April Fool's Day joke (link is dead)
  • Unicode Technical Note 26: On the Encoding of Latin, Greek, Cyrillic, and Han
  • "Unicode Revisited" – the strong point of view of some people working on the competing TRON proposal
  • "Unicode in Japan, guide to a technical and psychological struggle" – A more balanced take on the arguments for and against Unicode for Japanese.

The Wikipedia article included on this page is licensed under the GFDL.