Punycode

Punycode is an encoding syntax by which a string of Unicode characters can be translated into the more limited character set permitted in network host names. The encoding is specified in Request for Comments (RFC) 3492.


The encoding is used as part of IDNA, a system that enables the use of internationalized domain names in all scripts supported by Unicode, where the burden of translation lies entirely with the user application (a web browser, for example). IDNA performs significant pre- and post-processing in addition to its use of Punycode. For further information on IDNA, including the pre- and post-processing and the spoofing concerns, see the Internationalized domain name article.


Encoding procedure

This section demonstrates the procedure for Punycode encoding, showing how the string "bücher" is encoded as "bcher-kva".


Separation of ASCII characters

First, all basic (ASCII) characters in the string are copied directly from input to output, skipping over other characters (e.g. "bücher" → "bcher"). If and only if one or more basic characters were copied, an ASCII hyphen is then added to the output (e.g. "bücher" → "bcher-").
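The separation step can be sketched in a few lines of Python. The function name copy_basic_code_points is made up for this illustration; the sketch assumes that "basic" means any code point below 128, as described above.

```python
def copy_basic_code_points(text: str) -> str:
    """Copy the ASCII characters of text, adding a hyphen if any were copied."""
    basic = "".join(ch for ch in text if ord(ch) < 128)
    # The hyphen delimiter is added only when at least one basic
    # character was copied to the output.
    return basic + "-" if basic else basic


print(copy_basic_code_points("bücher"))  # bcher-
```

A string with no ASCII characters at all therefore produces an empty prefix and no hyphen.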


Encoding of non-ASCII character insertions as code numbers

To understand the next part of the encoding process, we first need to understand the behaviour of the decoder. The decoder is a state machine with two state variables, i and n. i is an index into the string, ranging from zero (representing a potential insertion at the start) to the current length of the extended string (representing a potential insertion at the end).


i starts at zero while n starts at 128 (the first non-ASCII code point). The state progression is monotonic. A state change either increments i or, if i is at its maximum, resets i to zero and increments n. At each state change, the code point denoted by n is either inserted or not inserted.
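The state machine can be sketched literally, one state change at a time. This is an illustrative sketch rather than a practical decoder: the name apply_insertions is hypothetical, the skip counts are passed in as plain integers, and the step that advances i past a newly inserted character follows the decoding procedure in RFC 3492.

```python
def apply_insertions(basic: str, skips: list[int], initial_n: int = 128) -> str:
    """Insert non-ASCII characters into basic, given one skip count per insertion."""
    out = list(basic)
    n, i = initial_n, 0
    for skip in skips:
        for _ in range(skip):
            # One state change without an insertion: advance i, or wrap
            # to position zero of the next code point.
            if i == len(out):
                i, n = 0, n + 1
            else:
                i += 1
        out.insert(i, chr(n))
        i += 1  # continue just past the character that was inserted
    return "".join(out)


print(apply_insertions("bcher", [745]))  # bücher
```

A real decoder reaches the same state arithmetically, using a division and a remainder by the current length plus one, instead of stepping through every skipped possibility.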


The code numbers generated by the encoder encode how many possibilities the decoder should skip before an insertion is made. "ü" has code point 252. The five-character string "bcher" offers six insertion positions (zero to five), so before we reach the possibility of inserting ü in position one, it is necessary to skip over six potential insertions of each of the 124 preceding non-ASCII code points (128 to 251) and one possible insertion (at position zero) of code point 252. The decoder must therefore be told to skip a total of (6 × 124) + 1 = 745 possible insertions before reaching the one required.
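The arithmetic for the first insertion can be written as a small formula. The sketch below uses a hypothetical helper name, first_skip_count, and covers only the first non-ASCII character of a string, when the decoder still starts from n = 128 and i = 0.

```python
def first_skip_count(basic_length: int, code_point: int, position: int,
                     initial_n: int = 128) -> int:
    """Number of possible insertions skipped before the first real insertion."""
    # Each skipped code point offers basic_length + 1 insertion positions.
    skipped_code_points = code_point - initial_n
    return (basic_length + 1) * skipped_code_points + position


# "bcher" has 5 characters and ü (code point 252) goes at position 1.
print(first_skip_count(len("bcher"), ord("ü"), 1))  # 745
```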


Re-encoding of code numbers as ASCII sequences

Punycode uses generalized variable-length integers to represent these values. For example, this is how "kva" is used to represent the code number 745:


A number system with little-endian ordering is used, which allows variable-length codes without separate delimiters: a digit lower than a threshold value marks that it is the most significant digit, and hence the end of the number. The threshold value depends on the position in the number and also on previous insertions, to increase efficiency. Correspondingly, the weights of the digits (like the third digit from the right in ordinary numbers having a weight of 100) vary.
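How the threshold depends on position and on previous insertions can be sketched as follows. The constants are the Punycode parameter values from RFC 3492 (they are not stated in the text above), and the function names threshold and adapt are chosen for this illustration.

```python
BASE, TMIN, TMAX = 36, 1, 26
SKEW, DAMP, INITIAL_BIAS = 38, 700, 72


def threshold(position: int, bias: int) -> int:
    """Threshold for the digit at a zero-based position within one number."""
    k = BASE * (position + 1)
    return min(max(k - bias, TMIN), TMAX)


def adapt(delta: int, num_points: int, first_time: bool) -> int:
    """New bias after encoding or decoding a number (RFC 3492, section 6.1)."""
    delta = delta // DAMP if first_time else delta // 2
    delta += delta // num_points
    k = 0
    while delta > ((BASE - TMIN) * TMAX) // 2:
        delta //= BASE - TMIN
        k += BASE
    return k + ((BASE - TMIN + 1) * delta) // (delta + SKEW)


# Thresholds for the first number of a string (bias 72): 1, 1, 26.
print([threshold(p, INITIAL_BIAS) for p in range(3)])
# After the first number of "bücher" (745, with 6 code points placed so far)
# the bias drops to 0, so later small numbers fit in a single digit.
print(adapt(745, 6, True))
```

The bias is the part that "remembers" previous insertions: a large first number is usually followed by small ones, so the thresholds are shifted to make small numbers short.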


In this case a "number system" with 36 "digits" is used, with the case-insensitive letters 'a' through 'z' equal to the numbers 0 through 25, and '0' through '9' equal to 26 through 35. Thus "kva" corresponds to "10 21 0". The second digit has a weight of 35 instead of 36 because, in a number of more than one digit, the first (least significant) digit is in the range b–9: an "a" in that position would mark the end of the number. Therefore "kva" represents the number 10 + 35 × 21 = 745.
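The digit arithmetic can be checked with a short sketch. It assumes the RFC 3492 parameter values (base 36, thresholds clamped between 1 and 26, initial bias 72); the function names are hypothetical and only a single number is handled, without the bias adaptation that follows each number in a full implementation.

```python
BASE, TMIN, TMAX, INITIAL_BIAS = 36, 1, 26, 72
DIGITS = "abcdefghijklmnopqrstuvwxyz0123456789"


def threshold(k: int, bias: int) -> int:
    return min(max(k - bias, TMIN), TMAX)


def decode_number(text: str, bias: int = INITIAL_BIAS) -> tuple[int, int]:
    """Return (value, digits consumed) for one variable-length integer."""
    value, weight, k = 0, 1, BASE
    for consumed, ch in enumerate(text, start=1):
        digit = DIGITS.index(ch.lower())
        value += digit * weight
        t = threshold(k, bias)
        if digit < t:  # a digit below the threshold ends the number
            return value, consumed
        weight *= BASE - t
        k += BASE
    raise ValueError("input ended in the middle of a number")


def encode_number(value: int, bias: int = INITIAL_BIAS) -> str:
    """Inverse of decode_number for a single value."""
    out, k = [], BASE
    while True:
        t = threshold(k, bias)
        if value < t:
            break
        out.append(DIGITS[t + (value - t) % (BASE - t)])
        value = (value - t) // (BASE - t)
        k += BASE
    return "".join(out) + DIGITS[value]


print(decode_number("kva"))  # (745, 3)
print(encode_number(745))    # kva
```

The final "a" contributes nothing to the value; it is there because it is the first digit below its threshold (26 at that position), which is what terminates the number.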


For the insertion of a second special character in "bücher", the first possibility is "büücher" with code "bcher-kvaa", the second "bücüher" with code "bcher-kvab", etc. After "bücherü" with code "bcher-kvae" comes "ýbücher" with code "bcher-kvaf", etc.
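These examples can be reproduced with the "punycode" codec in Python's standard library, which encodes a single label without the IDNA processing. The wrapper name punycode_encode is only for this illustration.

```python
def punycode_encode(label: str) -> str:
    """Encode one label with the standard library's Punycode codec."""
    return label.encode("punycode").decode("ascii")


for label in ["bücher", "büücher", "bücüher", "bücherü", "ýbücher"]:
    print(label, "->", punycode_encode(label))
```

The last letter of each five-letter code is the skip count after the first ü was placed: a to e cover the remaining positions for a second ü, and f is the first position for the next code point, ý.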


To keep the encoding and decoding algorithms simple, no attempt has been made to prevent some encoded values from encoding inadmissible Unicode values; however, these should be checked for and detected during decoding.


As a further example, compare the ASCII 'punycoded' URL http://xn--tdali-d8a8w.lv/ with its Unicode form http://tūdaliņ.lv, which contains the Latvian "u with a macron" (ū) and "n with cedilla" (ņ) instead of the unmarked base characters.
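The host name in this URL can be converted in both directions with the "idna" codec in Python's standard library, which applies the IDNA pre-processing and the "xn--" prefix around the Punycode step. This is a sketch using that codec (it implements the original 2003 IDNA rules); the helper names to_ascii and to_unicode are hypothetical.

```python
def to_ascii(hostname: str) -> str:
    """Convert a Unicode host name to its ASCII (punycoded) form."""
    return hostname.encode("idna").decode("ascii")


def to_unicode(hostname: str) -> str:
    """Convert an ASCII (punycoded) host name back to Unicode."""
    return hostname.encode("ascii").decode("idna")


print(to_ascii("tūdaliņ.lv"))            # xn--tdali-d8a8w.lv
print(to_unicode("xn--tdali-d8a8w.lv"))  # tūdaliņ.lv
```

Only the label that contains non-ASCII characters is rewritten; the plain ASCII label "lv" passes through unchanged.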


Punycode is designed to work across all scripts, and to be self-optimizing by attempting to adapt to the character set ranges within the string as it operates. It is optimized for the case where the string is composed of zero or more ASCII characters plus characters from only one other script system, but it will cope with any arbitrary Unicode string. Note that for DNS use, the domain name string is assumed to have been normalized using Nameprep and (for top-level domains) filtered against an officially registered language table before being punycoded, and that the DNS protocol sets limits on the acceptable lengths of the output Punycode string.



The Wikipedia article included on this page is licensed under the GFDL.