Byte Order Mark

A Byte Order Mark (BOM) is the character at code point U+FEFF ("zero-width no-break space") when that character is used to denote the endianness of a string of UCS/Unicode characters encoded in UTF-16 or UTF-32, or as a marker indicating that text is encoded in UTF-8, UTF-16 or UTF-32.


In most encodings the BOM is a byte sequence that is unlikely to appear in more conventional encodings or in other Unicode encodings, where it usually looks like a sequence of obscure control codes. If a BOM is misinterpreted as an actual character within the text, it will generally be invisible because it is a zero-width no-break space. The "zero-width no-break space" semantics of U+FEFF were deprecated in Unicode 3.2, allowing the character to be used solely as a BOM.
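
The invisibility described above can be made concrete with a short Python sketch. It assumes a hypothetical input (`raw`) and relies on the fact that Python's plain "utf-8" codec keeps a leading BOM as an ordinary U+FEFF character; other languages and libraries may behave differently.

```python
# Hypothetical input: a UTF-8 BOM followed by plain ASCII text.
raw = b"\xef\xbb\xbfname,value"

# The plain "utf-8" codec does not remove the BOM; it decodes it as U+FEFF.
text = raw.decode("utf-8")

# The string looks like "name,value" when displayed, but it is one
# character longer and does not compare equal to it.
has_bom = text.startswith("\ufeff")
matches_plain = (text == "name,value")

# Dropping the first character recovers the intended text.
cleaned = text[1:] if has_bom else text
```

Here `has_bom` is true and `matches_plain` is false, which is the kind of silent mismatch that an unnoticed BOM tends to cause.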


In UTF-16, a BOM is expressed as the two-byte sequence FE FF at the beginning of the encoded string to indicate that the encoded characters that follow use big-endian byte order, or as the byte sequence FF FE to indicate little-endian order.
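
As a minimal sketch of how a reader might act on these two sequences, the Python function below inspects the first two bytes and picks the byte order accordingly. The function name `decode_utf16_with_bom` and the big-endian fallback for BOM-less input are assumptions made for illustration, not part of any standard API.

```python
import codecs


def decode_utf16_with_bom(data: bytes, default: str = "utf-16-be") -> str:
    """Decode UTF-16 bytes, using a leading BOM (if any) to pick the byte order."""
    if data.startswith(codecs.BOM_UTF16_BE):  # FE FF: big-endian
        return data[2:].decode("utf-16-be")
    if data.startswith(codecs.BOM_UTF16_LE):  # FF FE: little-endian
        return data[2:].decode("utf-16-le")
    # No BOM present: fall back to an assumed byte order (illustrative choice).
    return data.decode(default)


# The same two characters, "Hi", in each byte order.
big = b"\xfe\xff\x00H\x00i"
little = b"\xff\xfeH\x00i\x00"
```

Both `decode_utf16_with_bom(big)` and `decode_utf16_with_bom(little)` yield the same string, since the BOM tells the decoder how to pair up the bytes.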


While UTF-8 does not have byte order issues, a BOM encoded in UTF-8 may be used to mark text as UTF-8. Quite a lot of Windows software (including Windows Notepad) adds one to UTF-8 files. On Unix-like systems, however, which make heavy use of text files for configuration, this practice is not recommended, because it interferes with correct processing of important codes such as the hash-bang at the start of an interpreted script. It may also interfere with source code for programming languages that do not recognise it. For example, gcc reports stray characters at the beginning of a source file, and in PHP, if output buffering is disabled, the BOM has the subtle effect of causing the page to start being sent to the browser, which prevents the PHP script from specifying custom headers. The UTF-8 representation of the BOM is the byte sequence EF BB BF, which appears as the ISO-8859-1 characters "ï»¿" in most text editors and web browsers not prepared to handle UTF-8.
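
The hash-bang problem and the ISO-8859-1 rendering can both be reproduced in a few lines of Python. This is only a sketch: `strip_utf8_bom` is a hypothetical helper and the script contents are made up for the example, while `codecs.BOM_UTF8` and the "utf-8-sig" codec are part of the Python standard library.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM (EF BB BF), if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


# Hypothetical shell script saved by an editor that prepends a BOM.
script = codecs.BOM_UTF8 + b"#!/bin/sh\necho hello\n"

# With the BOM in place, the file no longer begins with "#!".
starts_with_shebang = script.startswith(b"#!")  # False
fixed = strip_utf8_bom(script)                  # begins with b"#!" again

# Read as ISO-8859-1, the three BOM bytes become three visible characters:
# U+00EF, U+00BB and U+00BF.
as_latin1 = codecs.BOM_UTF8.decode("iso-8859-1")

# The "utf-8-sig" codec drops a leading BOM while decoding.
first_line = script.decode("utf-8-sig").splitlines()[0]  # "#!/bin/sh"
```

Stripping the mark at the byte level, as `strip_utf8_bom` does, suits files that other tools will execute or parse; decoding with "utf-8-sig" is the simpler choice when the text is only being read into a program.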


A BOM could also be used with UTF-32, although this encoding is almost never used for transmission.


Representations of Byte Order Marks by Encoding

Encoding                Representation
UTF-8                   EF BB BF
UTF-16 Big Endian       FE FF
UTF-16 Little Endian    FF FE
UTF-32 Big Endian       00 00 FE FF
UTF-32 Little Endian    FF FE 00 00
SCSU                    0E FE FF
UTF-7                   2B 2F 76, followed by one of the byte sequences 38, 39, 2B, 2F or 38 2D
UTF-EBCDIC              DD 73 66 73
BOCU-1                  FB EE 28
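
The table above can be turned into a simple sniffing routine. The sketch below is one possible approach in Python, not a definitive detector: the name `detect_bom` is hypothetical, and it returns only the encoding label from the table. Longer marks are checked first because the UTF-32 little-endian mark (FF FE 00 00) begins with the UTF-16 little-endian mark (FF FE). That case remains ambiguous, since the same bytes could also be UTF-16 text whose first character is U+0000.

```python
from typing import Optional

# (encoding label, BOM bytes) pairs from the table, longest marks first.
BOMS = [
    ("UTF-32 Big Endian", b"\x00\x00\xfe\xff"),
    ("UTF-32 Little Endian", b"\xff\xfe\x00\x00"),
    ("UTF-EBCDIC", b"\xdd\x73\x66\x73"),
    ("UTF-8", b"\xef\xbb\xbf"),
    ("SCSU", b"\x0e\xfe\xff"),
    ("BOCU-1", b"\xfb\xee\x28"),
    ("UTF-16 Big Endian", b"\xfe\xff"),
    ("UTF-16 Little Endian", b"\xff\xfe"),
]

# UTF-7: a fixed three-byte prefix followed by one of several bytes.
UTF7_PREFIX = b"\x2b\x2f\x76"
# The two-byte alternative 38 2D also starts with 38, so checking the
# first byte after the prefix covers every alternative in the table.
UTF7_NEXT = {b"\x38", b"\x39", b"\x2b", b"\x2f"}


def detect_bom(data: bytes) -> Optional[str]:
    """Return the table's encoding label for a leading BOM, or None."""
    for name, mark in BOMS:
        if data.startswith(mark):
            return name
    if data.startswith(UTF7_PREFIX) and data[3:4] in UTF7_NEXT:
        return "UTF-7"
    return None
```

For example, `detect_bom(b"\xef\xbb\xbfhello")` gives "UTF-8", while input with no recognised mark gives `None`, in which case the caller has to fall back on some other way of determining the encoding.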

The Wikipedia article included on this page is licensed under the GFDL.