
Collation

In textual criticism and bibliography, collation is the reading of two (or more) texts side-by-side in order to note their differences.


In printing and photocopying, collation is the arrangement of pages in order when several copies of a document are bound after printing or copying.


Collation can also refer to the detailed bibliographical description of a book or the comparison of the physical makeup of two copies of a book.


This article discusses the term as used in library and information science and computer science, where collation is the assembly of written information into a standard order. In common usage, this is called alphabetisation, though collation is not limited to ordering letters of the alphabet. Collating lists of words or names into alphabetical order is the basis of most office filing systems, library catalogs and reference books.


Collation differs from classification in that classification is concerned with arranging information into logical categories, while collation is concerned with the partial ordering of those categories.


Collation differs from a sorting algorithm: the sorting algorithm decides which pairs of elements to compare, while the collation defines the total order (usually a lexicographical order) that the algorithm consults to determine when to swap the elements of a pair. In fact, sorting algorithms are often implemented to take a collation as an input.
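The division of labour described above can be sketched in Python. This is only an illustration: the function names (insertion_sort, by_code_point, ignoring_case) and the word list are invented for the example, and insertion sort merely stands in for whatever sorting algorithm is in use.

```python
from functools import cmp_to_key


def insertion_sort(items, compare):
    """Return a sorted copy; compare(a, b) is negative, zero or positive."""
    result = list(items)
    for i in range(1, len(result)):
        j = i
        # The algorithm chooses which pairs to look at;
        # the collation (compare) decides their relative order.
        while j > 0 and compare(result[j - 1], result[j]) > 0:
            result[j - 1], result[j] = result[j], result[j - 1]
            j -= 1
    return result


def by_code_point(a, b):
    """Collation 1: plain code-point order."""
    return (a > b) - (a < b)


def ignoring_case(a, b):
    """Collation 2: code-point order after lowercasing."""
    return by_code_point(a.lower(), b.lower())


words = ["pear", "apple", "Fig"]
code_point_order = insertion_sort(words, by_code_point)
case_blind_order = insertion_sort(words, ignoring_case)
# The built-in sort accepts the same collation through cmp_to_key.
builtin_order = sorted(words, key=cmp_to_key(ignoring_case))
```

The same algorithm yields different results purely because a different collation was passed in.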

Collation systems

Numerical sorting, sorting of single characters

The simplest collation system is numerical sorting: ordering numbers by their magnitude. For example, the list of numbers 4 · 17 · 3 · 5 collates to 3 · 4 · 5 · 17.


While this might appear to work only for numbers, computers can use this method for any textual information since computers internally use character sets which assign a numeric code point to each letter or glyph. For example, a computer using ASCII code (or any of its supersets such as Unicode) and numerical sorting would collate the list of characters a · b · C · d · $ to $ · C · a · b · d.


The numerical values that ASCII uses are $ = 36, a = 97, b = 98, C = 67, and d = 100, resulting in ASCIIbetical order.


This style of collation is commonly used, often with the refinement of converting uppercase letters to lowercase before comparing ASCII values, since most people do not expect capitalised words to jump to the head of the list.
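As a small illustration in Python (whose default string comparison is by code point), the example list can be sorted both ways. The variable names are chosen for this sketch only.

```python
chars = ["a", "b", "C", "d", "$"]

# Numeric code point of each character, as assigned by ASCII/Unicode.
code_points = {c: ord(c) for c in chars}

# Plain numerical (code-point) sorting.
ascii_order = sorted(chars)

# The common refinement: lowercase each item before comparing.
folded_order = sorted(chars, key=str.lower)
```

With lowercasing applied, C falls between b and d instead of preceding every lowercase letter.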


Alphabetical sorting

A collation system for multiple-character words is alphabetical sorting in lexicographical order, based on the conventional order of letters in an alphabet or abjad (most of which have a single conventional order). Each nth letter is compared with the nth letter of other words in the list, starting at the first letter of each word and advancing to the second, third, fourth, and so on, until the order is established.


For example, the list of words foo · bar · bibble collates to bar · bibble · foo because (1) f comes after b so bar and bibble both precede foo and (2) a comes before i so bar precedes bibble.
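The letter-by-letter comparison can be written out explicitly. The following Python sketch assumes lowercase words over the 26-letter English alphabet, and it adds the usual dictionary convention that a word sorts before any longer word it is a prefix of; that tie-break is an assumption of the sketch, and compare_words is a name invented here.

```python
from functools import cmp_to_key

ALPHABET = "abcdefghijklmnopqrstuvwxyz"
RANK = {letter: i for i, letter in enumerate(ALPHABET)}


def compare_words(a, b):
    """Compare the nth letters in turn until one pair differs."""
    for x, y in zip(a, b):
        if RANK[x] != RANK[y]:
            return -1 if RANK[x] < RANK[y] else 1
    # Every shared position matched: the shorter word goes first.
    return (len(a) > len(b)) - (len(a) < len(b))


ordered = sorted(["foo", "bar", "bibble"], key=cmp_to_key(compare_words))
```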


Numerical sorting (not to be confused with sorting of numbers, see below) on a computer and alphabetical sorting often produce the same ordering for English.


The difference between computer-style numerical sorting and true alphabetical sorting becomes obvious in languages using an extended Latin alphabet.


For example, the thirty-letter alphabet of Spanish treats ñ as a basic letter following n, and formerly treated ch and ll as basic letters following c and l, respectively. Ch and ll are still considered letters, but are now alphabetized as digraphs; the new alphabetization rule was issued by the Royal Spanish Academy in 1994. (The letter rr, by contrast, follows rqu as expected, both with and without the 1994 rule.) A numeric sort may incorrectly order ñ after z, and it treats ch as c + h, which is also incorrect under the pre-1994 alphabetization.
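The three behaviours can be compared with a small Python sketch. The two alphabet lists, the helper make_key and the sample words are assumptions made for illustration; accented vowels and rr are left out, and a production system would rely on a full collation library instead.

```python
import unicodedata

ENYE = "\u00f1"  # ñ as one precomposed code point

# Post-1994 rule: ñ is a letter after n; ch and ll are plain digraphs.
MODERN = list("abcdefghijklmn") + [ENYE] + list("opqrstuvwxyz")
# Pre-1994 rule: ch follows c and ll follows l as letters of their own.
TRADITIONAL = (list("abc") + ["ch"] + list("defghijkl") + ["ll"]
               + list("mn") + [ENYE] + list("opqrstuvwxyz"))


def make_key(alphabet):
    """Build a sort key that ranks words by the given letter order."""
    rank = {unit: i for i, unit in enumerate(alphabet)}
    longest = max(len(unit) for unit in alphabet)

    def key(word):
        word = unicodedata.normalize("NFC", word.lower())
        ranks, i = [], 0
        while i < len(word):
            # Prefer the longest unit, so "ch" wins over "c" when listed.
            for size in range(longest, 0, -1):
                unit = word[i:i + size]
                if unit in rank:
                    ranks.append(rank[unit])
                    i += len(unit)
                    break
            else:
                raise ValueError(f"no rule for {word[i]!r}")
        return ranks

    return key


words = ["zorro", ENYE + "u", "nube", "cosa", "chico", "luz", "llave"]
by_code_point = sorted(words)
modern = sorted(words, key=make_key(MODERN))
traditional = sorted(words, key=make_key(TRADITIONAL))
```

Code-point order leaves ñu after zorro; the modern key moves it to just after nube; the traditional key additionally places chico after cosa and llave after luz.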


Similar differences between computer numeric sorting and alphabetic sorting occur in Danish and Norwegian (in some cases, aa is ordered as å at the end of the alphabet), German (ß is ordered as s + s; ä, ö, ü are ordered as a + e, o + e, u + e in phone books, but as a, o, u elsewhere, and after a, o, u in Austria), Icelandic (ð follows d), English (æ is ordered as a + e), and many other languages.


Usually the spaces or hyphens between words are ignored.


Languages that use a syllabary or abugida instead of an alphabet (for example, Cherokee) can use approximately the same system if there is a set ordering for the symbols.


For a comprehensive list of the collation orders in various languages, see Alphabets derived from the Latin.


Radical-and-stroke sorting

The Character Palette from Mac OS X is an example of use of radical-and-stroke sorting on a computer to provide an interface to input Chinese, Japanese, and Korean characters

Another form of collation is radical-and-stroke sorting, used for non-alphabetic writing systems such as Chinese hanzi and Japanese kanji, whose thousands of symbols defy ordering by convention. In this system, common components of characters are identified; these are called radicals in Chinese and logographic systems derived from Chinese. Characters are then grouped by their primary radical, then ordered by number of pen strokes within radicals. When there is no obvious radical or more than one radical, convention governs which is used for collation. For example, the Chinese character for "mother" (媽) is sorted as a thirteen-stroke character under the three-stroke primary radical (女).
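In code, radical-and-stroke sorting amounts to a two-part sort key. The Python sketch below uses a tiny hand-entered table, not a real dictionary database; it assumes that radicals are ordered by their Kangxi radical number and that characters within a radical are ordered by the strokes they have beyond the radical.

```python
# Illustration data only:
# character -> (Kangxi radical number, strokes beyond the radical)
RADICAL_INDEX = {
    "媽": (38, 10),  # radical 女 (3 strokes) + 10 more = 13 strokes
    "好": (38, 3),   # radical 女
    "明": (72, 4),   # radical 日
    "林": (75, 4),   # radical 木
    "木": (75, 0),   # the radical 木 itself
}


def radical_stroke_key(character):
    """Group by primary radical first, then by remaining stroke count."""
    return RADICAL_INDEX[character]


ordered = sorted(["林", "媽", "明", "木", "好"], key=radical_stroke_key)
```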


The radical-and-stroke system is cumbersome compared to an alphabetical system in which there are few characters, all unambiguous. The choice of which components of a logograph comprise separate radicals and which radical is primary is not clear-cut. As a result, logographic languages often supplement radical-and-stroke ordering with alphabetic sorting of a phonetic conversion of the logographs. For example, the kanji word Tōkyō (東京), the Japanese name of Tokyo, can be sorted as if it were spelled out in Japanese kana as "to-u-ki-yo-u" (とうきょう).


Nevertheless, the radical-and-stroke system is the only practical method for constructing dictionaries that someone may use to look up a logograph whose pronunciation is unknown.


Multilingual ordering

When lists of names or words need to be ordered, but the context does not define a particular single language or alphabet, the Unicode Collation Algorithm provides a way to put them in sequence.


Complications

Conventions in typography and in sorting systems

In typography and in the writing of scientific articles, elements such as headers, sections, lists and pages might use alphabetical numbering instead of numerical numbering. However, this does not always mean that the full alphabet of a particular language is used. Often alphabetical numbering, or enumeration, uses only a subset of the full alphabet. For example, the Russian alphabet has 33 letters, but typically only 28 are used in typographical enumeration (Ukrainian, Belarusian and Bulgarian Cyrillic enumeration shows similar features). Two Russian letters, Ъ and Ь, are used only to modify the preceding consonants, so they naturally fall out. The other three could have been used, but mostly are not: Ы never begins a Russian word; Й almost never begins a word either, is perhaps too similar to И, and is a relatively new character; Ё is also relatively new and much debated, and in proper alphabetical sorting words beginning with Ё are sometimes listed under Е. (These "rules" are of course relaxed in some settings, e.g. in phone catalogs, where foreign (non-Russian) names may frequently begin with Й or Ы.)

This points to a simple fact: alphabets are not only tools for writing. Letters are often kept in the alphabet of a language even though they are not used in writing, not least because they are used in alphabetical enumeration. For instance, X, W and Z are not used in writing the Norwegian language, except in loanwords; still they are kept in the Norwegian alphabet and used in alphabetical lists. Likewise, earlier versions of the Russian alphabet contained letters which had only two purposes: writing Greek words and using the Greek counting system in its Cyrillic form.


Compound words and special characters

A complication in alphabetical sorting can arise due to disagreements over how groups of words (separated compound words, names, titles, etc.) should be ordered. One rule is to remove spaces for purposes of ordering, another is to consider a space as a character that is ordered before numbers and letters (this method is consistent with ordering by ASCII or Unicode codepoint), and a third is to order a space after numbers and letters. Given the strings "catch", "cattle" and "cat food" to alphabetize, the first rule produces "catch" "cat food" "cattle", the second "cat food" "catch" "cattle", and the third "catch" "cattle" "cat food". The first rule is used in most (but not all) dictionaries, the second in telephone directories (so that Wilson, Jim K appears with other people named Wilson, Jim and not after Wilson, Jimbo). The third rule is rarely used.
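Each of the three rules can be expressed as a sort key. In this Python sketch the function names are invented, and the third rule is approximated by substituting a very high code point (U+FFFF) for the space, which is an assumption that only holds while the text contains no higher code points.

```python
words = ["catch", "cattle", "cat food"]


def spaces_removed(text):
    """Rule 1: ignore spaces altogether."""
    return text.replace(" ", "")


def space_first(text):
    """Rule 2: a space (U+0020) already sorts before digits and letters."""
    return text


def space_last(text):
    """Rule 3: make the space sort after digits and letters."""
    return text.replace(" ", "\uffff")


rule_1 = sorted(words, key=spaces_removed)
rule_2 = sorted(words, key=space_first)
rule_3 = sorted(words, key=space_last)
```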


A similar complication arises when special characters such as hyphens or apostrophes appear in words or names. Any of the same rules as above can be used in this case as well; however, the strict ASCII sorting no longer corresponds exactly to any of the rules.


Name/surname ordering

The telephone directory example sheds light on another complication. In cultures where family names are written after given names, it is usually still desired to sort by family name first. In this case, names need to be reordered to be sorted properly. For example, Juan Hernandes and Brian O'Leary should be sorted as Hernandes, Juan and O'Leary, Brian even if they are not written this way. Capturing this rule in a computer collation algorithm is difficult, and simple attempts will necessarily fail. For example, unless the algorithm has at its disposal an extensive list of family names, there is no way to decide if "Gillian Lucille van der Waal" is "van der Waal, Gillian Lucille", "Waal, Gillian Lucille van der", or even "Lucille van der Waal, Gillian".


In telephone directories in English speaking countries, surnames beginning with Mc are sometimes sorted as if starting with Mac and placed between "Mabxxx" and "Madxxx". Under these rules, the telephone directory order of the following names would be: Maam, McAllan, Macbeth, MacCarthy, McDonald, Macy, Mboko.
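A minimal Python sketch of this directory rule follows. The function name directory_key is invented, and the key also lowercases the surname, which is an added assumption so that Macbeth and MacCarthy interleave as they do in the list above.

```python
def directory_key(surname):
    """Sort key that files a leading "Mc" as if it were spelled "Mac"."""
    if surname.startswith("Mc"):
        surname = "Mac" + surname[2:]
    return surname.lower()


names = ["Mboko", "Macy", "McDonald", "MacCarthy",
         "Macbeth", "McAllan", "Maam"]
directory_order = sorted(names, key=directory_key)
```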


Abbreviations and common words

When abbreviations are used, it is sometimes desired to expand the abbreviations for sorting. In this case, "St. Paul" comes before "Shanghai". To capture this behavior in a collation algorithm, a list of abbreviations is needed. It may be more practical in some cases to store two sets of strings, one for sorting and one for display. A similar problem arises when letters are replaced by numbers or special symbols in an irregular manner, for example 1337 for leet or the movie Se7en. In this case, proper sorting necessitates keeping two sets of strings.
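The two-string approach can be sketched in Python as follows. The one-entry abbreviation table and the third place name are made up for the example; a real table would be far longer and context dependent ("St." can also stand for "Street").

```python
# Hypothetical abbreviation list for this sketch.
ABBREVIATIONS = {"St.": "Saint"}


def sort_form(display):
    """Expand known abbreviations and lowercase to get the sorting string."""
    words = [ABBREVIATIONS.get(word, word) for word in display.split()]
    return " ".join(words).lower()


places = ["Shanghai", "St. Paul", "Sydney"]

# Keep two strings per entry: one to sort by, one to show.
records = [(sort_form(place), place) for place in places]
ordered = [display for _, display in sorted(records)]

# For comparison: sorting the display strings directly.
naive = sorted(places)
```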


In certain contexts, very common words (such as articles) at the beginning of a sequence of words are not considered for ordering, or are moved to the end. So "The Shining" is considered "Shining" or "Shining, The" when alphabetizing and therefore is ordered before "Summer of Sam". This rule is fairly easy to capture in an algorithm, but many programs rely instead on simple lexicographic ordering. One fairly quaint exception to this rule is the flying of the flag of The Former Yugoslav Republic of Macedonia at the United Nations between those of Thailand and Timor Leste.
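A short Python sketch of the article rule is given below. The list of English articles and the function names title_key and filing_form are assumptions of this example; other languages need their own lists.

```python
ARTICLES = ("the ", "a ", "an ")  # English only, for illustration


def title_key(title):
    """Sort key that skips one leading article."""
    lowered = title.lower()
    for article in ARTICLES:
        if lowered.startswith(article):
            return lowered[len(article):]
    return lowered


def filing_form(title):
    """Move a leading article to the end: "The Shining" -> "Shining, The"."""
    lowered = title.lower()
    for article in ARTICLES:
        if lowered.startswith(article):
            n = len(article)
            return f"{title[n:]}, {title[:n].strip()}"
    return title


titles = ["Summer of Sam", "The Shining"]
ordered = sorted(titles, key=title_key)
naive = sorted(titles)
```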


Numerical sorting of strings

Sometimes, it is desired to order text with embedded numbers using proper numerical order. For example, "Figure 7b" goes before "Figure 11a", even though '7' comes after '1' in Unicode. This can be extended to Roman numerals. This behavior is not particularly difficult to produce as long as only integers are to be sorted, although it can slow down sorting significantly.
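One common way to get this behavior is to split each string into alternating text and digit runs and compare the digit runs as integers. The Python sketch below handles only unsigned integers; natural_key is a name chosen for this example.

```python
import re


def natural_key(text):
    """Split into text and digit runs; compare digit runs as integers."""
    # With a capturing group, re.split returns text runs at even
    # indices and digit runs at odd indices.
    parts = re.split(r"(\d+)", text)
    return [(1, int(part), "") if index % 2 else (0, 0, part)
            for index, part in enumerate(parts)]


labels = ["Figure 11a", "Figure 7b"]
code_point_order = sorted(labels)
natural_order = sorted(labels, key=natural_key)
```

The extra work of splitting and converting every string is one reason this kind of sort is slower than a plain code-point comparison.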


For example, Windows XP does this when sorting file names. Sorting decimals properly is a bit more difficult, due to the fact that different locales use different symbols for a decimal point, and sometimes the same character used as a decimal point is also used as a separator, for example "Section 3.2.5". There is no universal answer for how to sort such strings; any rules are application dependent.


Sorting of numbers

Ascending order of numbers differs from alphabetical order, e.g. 11 comes alphabetically before 2. This can be fixed with leading zeros: 02 comes alphabetically before 11. See e.g. ISO 8601.


Also -13 comes alphabetically after -12 although it is less. With negative numbers, to make ascending order correspond with alphabetical sorting, more drastic measures are needed such as adding a constant to all numbers to make them all positive.
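Both effects, and both remedies, can be seen in a few lines of Python. The offset of 100 and the width of 3 are arbitrary values chosen for this sketch; they must be large enough for the range of numbers being stored.

```python
numbers = [2, 11, -12, -13]

# Plain alphabetical (code-point) order of the decimal strings.
as_text = sorted(str(n) for n in numbers)

# Leading zeros repair the order of non-negative numbers.
padded = sorted(str(n).zfill(2) for n in numbers if n >= 0)


def offset_key(n, offset=100, width=3):
    """Shift n to be non-negative, then pad to a fixed width."""
    shifted = n + offset
    if not 0 <= shifted < 10 ** width:
        raise ValueError("number outside the range this encoding covers")
    return str(shifted).zfill(width)


encoded = sorted(offset_key(n) for n in numbers)
decoded = [int(text) - 100 for text in encoded]
```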



External links and references


Tools

  • sort The GNU implementation of the standard Unix sort utility.
  • msort A sort program that provides an unusual level of flexibility in defining collations and extracting keys.

The Wikipedia article included on this page is licensed under the GFDL.