Speech synthesis

Speech synthesis is the artificial production of human speech. A computer system used for this purpose is called a speech synthesizer, and can be implemented in software or hardware. A text-to-speech (TTS) system converts normal language text into speech; other systems render symbolic linguistic representations like phonetic transcriptions into speech.[1]


Synthesized speech can be created by concatenating pieces of recorded speech that are stored in a database. Systems differ in the size of the stored speech units; a system that stores phones or diphones provides the largest output range, but may lack clarity. For specific usage domains, the storage of entire words or sentences allows for high-quality output. Alternatively, a synthesizer can incorporate a model of the vocal tract and other human voice characteristics to create a completely "synthetic" voice output.[2]


The quality of a speech synthesizer is judged by its similarity to the human voice, and by its ability to be understood. An intelligible text-to-speech program allows people with visual impairments or reading disabilities to listen to written works on a home computer. Many computer operating systems have included speech synthesizers since the early 1980s.


Overview of text processing

[Figure: Overview of a typical TTS system]

Audio sample: Microsoft Sam, Windows XP's default speech synthesizer voice, saying "The quick brown fox jumps over the lazy dog 1,234,567,890 times."

A text-to-speech system (or "engine") is composed of two parts: a front-end and a back-end. The front-end has two major tasks. First, it converts raw text containing symbols like numbers and abbreviations into the equivalent of written-out words. This process is often called text normalization, pre-processing, or tokenization. The front-end then assigns phonetic transcriptions to each word, and divides and marks the text into prosodic units, like phrases, clauses, and sentences. The process of assigning phonetic transcriptions to words is called text-to-phoneme or grapheme-to-phoneme conversion.[3] Phonetic transcriptions and prosody information together make up the symbolic linguistic representation that is output by the front-end. The back-end—often referred to as the synthesizer—then converts the symbolic linguistic representation into sound.
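The front-end/back-end division can be made concrete with a short sketch in Python. The fragment below is illustrative only: the abbreviation table, the toy lexicon, and the print-based back-end are hypothetical stand-ins for the large data files and signal-processing code of a real engine.

    # Minimal sketch of a TTS front-end and back-end (illustrative only).
    ABBREVIATIONS = {"Dr.": "doctor", "St.": "street"}
    LEXICON = {
        "hello":  ["HH", "AH", "L", "OW"],
        "doctor": ["D", "AA", "K", "T", "ER"],
        "smith":  ["S", "M", "IH", "TH"],
    }

    def normalize(text):
        # Front-end task 1: text normalization (expand abbreviations, etc.).
        return [ABBREVIATIONS.get(tok, tok.lower().strip(".,!?"))
                for tok in text.split()]

    def to_phonemes(words):
        # Front-end task 2: grapheme-to-phoneme conversion by dictionary
        # lookup, with a crude letter-by-letter fallback for unknown words.
        return [LEXICON.get(w, list(w.upper())) for w in words]

    def back_end(phoneme_seq):
        # Back-end stub: a real synthesizer would render audio here.
        for phones in phoneme_seq:
            print(" ".join(phones))

    back_end(to_phonemes(normalize("Hello Dr. Smith")))
    # HH AH L OW
    # D AA K T ER
    # S M IH TH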


History

Mechanical devices

Long before electronic signal processing was invented, there were those who tried to build machines to create human speech. Early examples of "speaking heads" were made by Gerbert of Aurillac (d. 1003 AD), Albertus Magnus (1198–1280), and Roger Bacon (1214–1294).


In 1779, the Danish scientist Christian Kratzenstein, working at the Russian Academy of Sciences, built models of the human vocal tract that could produce the five long vowel sounds (in International Phonetic Alphabet notation, they are [aː], [eː], [iː], [oː] and [uː]).[4] This was followed by the bellows-operated "acoustic-mechanical speech machine" by Wolfgang von Kempelen of Vienna, Austria, described in a 1791 paper.[5] This machine added models of the tongue and lips, enabling it to produce consonants as well as vowels. In 1837, Charles Wheatstone produced a "speaking machine" based on von Kempelen's design, and in 1857, M. Faber built the "Euphonia". Wheatstone's design was resurrected in 1923 by Paget.[6]


In the 1930s, Bell Labs developed the VOCODER, a keyboard-operated electronic speech analyzer and synthesizer that was said to be clearly intelligible. Homer Dudley refined this device into the VODER, which he exhibited at the 1939 New York World's Fair.


The Pattern playback was built by Dr. Franklin S. Cooper and his colleagues at Haskins Laboratories in the late 1940s and completed in 1950. There were several different versions of this hardware device but only one currently survives. The machine converts pictures of the acoustic patterns of speech in the form of a spectrogram back into sound. Using this device, Alvin Liberman and colleagues were able to discover acoustic cues for the perception of phonetic segments (consonants and vowels).


Early electronic speech synthesizers sounded robotic and were often barely intelligible. However, the quality of synthesized speech has steadily improved, and output from contemporary speech synthesis systems is sometimes indistinguishable from actual human speech.


Electronic devices

The first computer-based speech synthesis systems were created in the late 1950s, and the first complete text-to-speech system was completed in 1968. In 1961, physicist John Larry Kelly, Jr and colleague Louis Gerstman[7] used an IBM 704 computer to synthesize speech, one of the most prominent events in the history of Bell Labs. Kelly's voice recorder synthesizer (vocoder) recreated the song "Daisy Bell", with musical accompaniment from Max Mathews. Coincidentally, Arthur C. Clarke was visiting his friend and colleague John Pierce at the Bell Labs Murray Hill facility. Clarke was so impressed by the demonstration that he used it in the climactic scene of his screenplay for his novel 2001: A Space Odyssey,[8] where the HAL 9000 computer sings the same song as it is being put to sleep by astronaut Dave Bowman.[9] Despite the success of purely electronic speech synthesis, research is still being conducted into mechanical speech synthesizers for use in humanoid robots.[10]


Synthesizer technologies

The most important qualities of a speech synthesis system are naturalness and intelligibility. Naturalness describes how closely the output sounds like human speech, while intelligibility is the ease with which the output is understood. The ideal speech synthesizer is both natural and intelligible. Speech synthesis systems usually try to maximize both characteristics.


The two primary technologies for generating synthetic speech waveforms are concatenative synthesis and formant synthesis. Each technology has strengths and weaknesses, and the intended uses of a synthesis system will typically determine which approach is used.


Concatenative synthesis

Concatenative synthesis is based on the concatenation (or stringing together) of segments of recorded speech. Generally, concatenative synthesis produces the most natural-sounding synthesized speech. However, differences between natural variations in speech and the nature of the automated techniques for segmenting the waveforms sometimes result in audible glitches in the output. There are three main sub-types of concatenative synthesis.


Unit selection synthesis

Unit selection synthesis uses large databases of recorded speech. During database creation, each recorded utterance is segmented into some or all of the following: individual phones, syllables, morphemes, words, phrases, and sentences. Typically, the division into segments is done using a specially modified speech recognizer set to a "forced alignment" mode with some manual correction afterward, using visual representations such as the waveform and spectrogram.[11] An index of the units in the speech database is then created based on the segmentation and acoustic parameters like the fundamental frequency (pitch), duration, position in the syllable, and neighboring phones. At runtime, the desired target utterance is created by determining the best chain of candidate units from the database (unit selection). This process is typically achieved using a specially weighted decision tree.
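Finding the "best chain of candidate units" is essentially a shortest-path problem, commonly solved with dynamic programming. The sketch below is a simplified illustration with made-up cost functions; real systems compute target and join costs from acoustic features such as pitch and spectral distance.

    # Illustrative unit-selection search: choose one candidate per target
    # position, minimizing total target cost plus join (concatenation) cost.
    def select_units(candidates, target_cost, join_cost):
        # Viterbi-style dynamic programming over the candidate lattice.
        best = [(target_cost(0, c), [c]) for c in candidates[0]]
        for t in range(1, len(candidates)):
            step = []
            for c in candidates[t]:
                cost, path = min((prev + join_cost(p[-1], c), p)
                                 for prev, p in best)
                step.append((cost + target_cost(t, c), path + [c]))
            best = step
        return min(best)

    # Toy example: units are (name, pitch-in-Hz) pairs; the join cost
    # penalizes pitch discontinuities at the concatenation point.
    cands = [[("a1", 100), ("a2", 120)], [("b1", 110), ("b2", 200)]]
    cost, path = select_units(
        cands,
        target_cost=lambda t, c: 0.0,
        join_cost=lambda u, v: abs(u[1] - v[1]) / 100.0,
    )
    print([u[0] for u in path], cost)   # ['a1', 'b1'] 0.1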


Unit selection provides the greatest naturalness, because it applies only a small amount of digital signal processing (DSP) to the recorded speech. DSP often makes recorded speech sound less natural, although some systems use a small amount of signal processing at the point of concatenation to smooth the waveform. The output from the best unit-selection systems is often indistinguishable from real human voices, especially in contexts for which the TTS system has been tuned. However, maximum naturalness typically requires unit-selection speech databases to be very large, in some systems ranging into the gigabytes of recorded data, representing dozens of hours of speech.[12]


Diphone synthesis

Diphone synthesis uses a minimal speech database containing all the diphones (sound-to-sound transitions) occurring in a language. The number of diphones depends on the phonotactics of the language: for example, Spanish has about 800 diphones, and German about 2500. In diphone synthesis, only one example of each diphone is contained in the speech database. At runtime, the target prosody of a sentence is superimposed on these minimal units by means of digital signal processing techniques such as linear predictive coding, PSOLA or MBROLA.[13] The quality of the resulting speech is generally worse than that of unit-selection systems, but more natural-sounding than the output of formant synthesizers. Diphone synthesis suffers from the sonic glitches of concatenative synthesis and the robotic-sounding nature of formant synthesis, and has few of the advantages of either approach other than small size. As such, its use in commercial applications is declining, although it continues to be used in research because there are a number of freely available software implementations.
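As a concrete illustration, the first step of such a system is to map the phoneme string onto the diphone units to retrieve from the database. A minimal sketch (phoneme names are arbitrary):

    # Decompose a phoneme sequence into the diphones a synthesizer would
    # fetch from its database; "_" marks silence at the utterance edges.
    def diphones(phones):
        seq = ["_"] + phones + ["_"]
        return [(seq[i], seq[i + 1]) for i in range(len(seq) - 1)]

    print(diphones(["h", "e", "l", "o"]))
    # [('_', 'h'), ('h', 'e'), ('e', 'l'), ('l', 'o'), ('o', '_')]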


Domain-specific synthesis

Domain-specific synthesis concatenates prerecorded words and phrases to create complete utterances. It is used in applications where the variety of texts the system will output is limited to a particular domain, like transit schedule announcements or weather reports. The technology is very simple to implement, and has been in commercial use for a long time, in devices like talking clocks and calculators. The level of naturalness of these systems can be very high because the variety of sentence types is limited, and they closely match the prosody and intonation of the original recordings.
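A talking clock is the classic example. In the sketch below, the whole "synthesis" step reduces to choosing which prerecorded clips to play back to back; the filenames are hypothetical.

    # Domain-specific synthesis in miniature: a talking clock that
    # concatenates prerecorded fragments (filenames are made up).
    HOURS   = {7: "seven.wav", 8: "eight.wav"}
    MINUTES = {15: "fifteen.wav", 30: "thirty.wav"}

    def announce(hour, minute):
        # A real device would play these clips in order through its codec.
        return ["it_is.wav", HOURS[hour], MINUTES[minute]]

    print(announce(7, 30))   # ['it_is.wav', 'seven.wav', 'thirty.wav']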


Because these systems are limited by the words and phrases in their databases, they are not general-purpose and can only synthesize the combinations of words and phrases with which they have been preprogrammed. The blending of words within naturally spoken language can still cause problems, however, unless the many variations are taken into account. For example, in non-rhotic dialects of English the <r> in words like <clear> /ˈkliːə/ is usually only pronounced when the following word begins with a vowel sound (e.g. <clear out> is realized as /ˌkliːəɹˈɑʊt/). Likewise in French, many final consonants that are normally silent are pronounced when the following word begins with a vowel, an effect called liaison. This alternation cannot be reproduced by a simple word-concatenation system, which would require additional complexity to be context-sensitive.


Formant synthesis

Formant synthesis does not use human speech samples at runtime. Instead, the synthesized speech output is created using an acoustic model. Parameters such as fundamental frequency, voicing, and noise levels are varied over time to create a waveform of artificial speech. This method is sometimes called rules-based synthesis; however, many concatenative systems also have rules-based components.
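The following sketch illustrates the idea in Python: a pulse train at the fundamental frequency excites a cascade of two-pole resonators, one per formant. The formant frequencies and bandwidths are rough, illustrative values for an [a]-like vowel, not taken from any published rule set.

    import math

    RATE = 16000  # sampling rate in Hz

    def resonator(signal, freq, bandwidth):
        # Two-pole IIR resonator: y[n] = x[n] + a1*y[n-1] + a2*y[n-2]
        r  = math.exp(-math.pi * bandwidth / RATE)
        a1 = 2.0 * r * math.cos(2.0 * math.pi * freq / RATE)
        a2 = -r * r
        out, y1, y2 = [], 0.0, 0.0
        for x in signal:
            y = x + a1 * y1 + a2 * y2
            out.append(y)
            y1, y2 = y, y1
        return out

    # Source: a glottal pulse train at a 120 Hz fundamental, 0.2 s long.
    period = RATE // 120
    source = [1.0 if n % period == 0 else 0.0 for n in range(RATE // 5)]

    # Cascade resonators for the first two formants of an [a]-like vowel.
    wave = resonator(resonator(source, 700, 80), 1200, 90)

A rule-driven synthesizer changes the source frequency and the resonator settings over time to move between sounds, which is where the "rules" in rules-based synthesis come in.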


Many systems based on formant synthesis technology generate artificial, robotic-sounding speech that would never be mistaken for human speech. However, maximum naturalness is not always the goal of a speech synthesis system, and formant synthesis systems have advantages over concatenative systems. Formant-synthesized speech can be reliably intelligible, even at very high speeds, avoiding the acoustic glitches that commonly plague concatenative systems. High-speed synthesized speech is used by the visually impaired to quickly navigate computers using a screen reader. Formant synthesizers are usually smaller programs than concatenative systems because they do not have a database of speech samples. They can therefore be used in embedded systems, where memory and microprocessor power are especially limited. Because formant-based systems have complete control of all aspects of the output speech, a wide variety of prosodies and intonations can be output, conveying not just questions and statements, but a variety of emotions and tones of voice.


Examples of non-real-time but highly accurate intonation control in formant synthesis include the work done in the late 1970s for the Texas Instruments toy Speak & Spell, and in the early 1980s for Sega arcade machines.[14] Creating proper intonation for these projects was painstaking, and the results have yet to be matched by real-time text-to-speech interfaces.[15]


Articulatory synthesis

Articulatory synthesis refers to computational techniques for synthesizing speech based on models of the human vocal tract and the articulation processes occurring there. The first articulatory synthesizer regularly used for laboratory experiments was developed at Haskins Laboratories in the mid-1970s by Philip Rubin, Tom Baer, and Paul Mermelstein. This synthesizer, known as ASY, was based on vocal tract models developed at Bell Laboratories in the 1960s and 1970s by Paul Mermelstein, Cecil Coker, and colleagues.


Until recently, articulatory synthesis models have not been incorporated into commercial speech synthesis systems. A notable exception is the NeXT-based system originally developed and marketed by Trillium Sound Research, a spin-off company of the University of Calgary, where much of the original research was conducted. Following the demise of the various incarnations of NeXT (started by Steve Jobs in the late 1980s and merged with Apple Computer in 1997), the Trillium software was published under the GNU General Public License, with work continuing as gnuspeech. The system, first marketed in 1994, provides full articulatory-based text-to-speech conversion using a waveguide or transmission-line analog of the human oral and nasal tracts controlled by Carré's "distinctive region model".


HMM-based synthesis

HMM-based synthesis is a synthesis method based on hidden Markov models. In this system, the frequency spectrum (vocal tract), fundamental frequency (vocal source), and duration (prosody) of speech are modeled simultaneously by HMMs. Speech waveforms are generated from HMMs themselves based on the maximum likelihood criterion.[16]
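Without the delta (dynamic) features that real HMM-synthesis systems use, the maximum-likelihood parameter track degenerates to each state's mean repeated for its predicted duration, which makes the idea easy to sketch. All numbers below are invented for illustration.

    # Toy HMM-based parameter generation: each state carries a mean value
    # for an acoustic parameter (here F0) and a duration in frames.
    states = [
        {"mean_f0": 120.0, "frames": 5},
        {"mean_f0": 135.0, "frames": 8},
        {"mean_f0": 110.0, "frames": 4},
    ]

    track = []
    for s in states:
        track += [s["mean_f0"]] * s["frames"]
    # 'track' would drive a vocoder frame by frame; real systems model the
    # spectrum too, and use delta features to produce smooth trajectories.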


Sinewave synthesis

Sinewave synthesis is a technique for synthesizing speech by replacing the formants (main bands of energy) with pure tone whistles.[17]
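A sketch of the idea, assuming two fixed formant frequencies (real sinewave replicas vary three or four tone frequencies over time to follow measured formant tracks):

    import math

    RATE = 16000
    FORMANTS = [500.0, 1500.0]   # illustrative formant frequencies in Hz

    # Half a second of "speech" as a sum of pure tones, one per formant.
    samples = [sum(0.3 * math.sin(2 * math.pi * f * n / RATE)
                   for f in FORMANTS)
               for n in range(RATE // 2)]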


Challenges

Text normalization challenges

The process of normalizing text is rarely straightforward. Texts are full of heteronyms, numbers, and abbreviations that all require expansion into a phonetic representation. There are many spellings in English which are pronounced differently based on context. For example, "My latest project is to learn how to better project my voice" contains two pronunciations of "project".


Most text-to-speech (TTS) systems do not generate semantic representations of their input texts, as processes for doing so are not reliable, well understood, or computationally effective. As a result, various heuristic techniques are used to guess the proper way to disambiguate homographs, like examining neighboring words and using statistics about frequency of occurrence.


Deciding how to convert numbers is another problem that TTS systems have to address. It is a simple programming challenge to convert a number into words, like "1325" becoming "one thousand three hundred twenty-five." However, numbers occur in many different contexts; when part of an address, "1325" should likely be read as "thirteen twenty-five", or, when part of a social security number, as "one three two five". A TTS system can often infer how to expand a number based on surrounding words, numbers, and punctuation, and sometimes the system provides a way to specify the context if it is ambiguous.
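A sketch of such context-dependent expansion, with a toy cardinal converter (numbers up to four digits) and a dispatch on an externally supplied context label; a real normalizer would infer the context from neighboring tokens.

    ONES  = ["zero", "one", "two", "three", "four",
             "five", "six", "seven", "eight", "nine"]
    TEENS = ["ten", "eleven", "twelve", "thirteen", "fourteen",
             "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
    TENS  = ["", "", "twenty", "thirty", "forty",
             "fifty", "sixty", "seventy", "eighty", "ninety"]

    def two_digit(n):
        if n < 10:  return ONES[n]
        if n < 20:  return TEENS[n - 10]
        return TENS[n // 10] + ("-" + ONES[n % 10] if n % 10 else "")

    def cardinal(n):
        parts = []
        if n >= 1000:
            parts.append(ONES[n // 1000] + " thousand"); n %= 1000
        if n >= 100:
            parts.append(ONES[n // 100] + " hundred");   n %= 100
        if n:
            parts.append(two_digit(n))
        return " ".join(parts)

    def expand(number, context):
        if context == "ssn":       # read digit by digit
            return " ".join(ONES[int(d)] for d in number)
        if context == "address":   # read as pairs of digits
            return two_digit(int(number[:2])) + " " + two_digit(int(number[2:]))
        return cardinal(int(number))

    print(expand("1325", "cardinal"))  # one thousand three hundred twenty-five
    print(expand("1325", "address"))   # thirteen twenty-five
    print(expand("1325", "ssn"))       # one three two five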


Similarly, abbreviations can be ambiguous. For example, the abbreviation "in" for "inches" must be differentiated from the word "in", and the address "12 St John St." uses the same abbreviation for both "Saint" and "Street". TTS systems with intelligent front ends can make educated guesses about ambiguous abbreviations, while others provide the same result in all cases, resulting in nonsensical (and sometimes comical) outputs.


Text-to-phoneme challenges

Speech synthesis systems use two basic approaches to determine the pronunciation of a word based on its spelling, a process which is often called text-to-phoneme or grapheme-to-phoneme conversion (phoneme is the term used by linguists to describe distinctive sounds in a language). The simplest approach to text-to-phoneme conversion is the dictionary-based approach, where a large dictionary containing all the words of a language and their correct pronunciations is stored by the program. Determining the correct pronunciation of each word is a matter of looking up each word in the dictionary and replacing the spelling with the pronunciation specified in the dictionary. The other approach is rule-based, in which pronunciation rules are applied to words to determine their pronunciations based on their spellings. This is similar to the "sounding out", or synthetic phonics, approach to learning reading.


Each approach has advantages and drawbacks. The dictionary-based approach is quick and accurate, but completely fails if it is given a word which is not in its dictionary. As dictionary size grows, so too do the memory requirements of the synthesis system. On the other hand, the rule-based approach works on any input, but the complexity of the rules grows substantially as the system takes into account irregular spellings or pronunciations. (Consider that the word "of" is very common in English, yet is the only word in which the letter "f" is pronounced [v].) As a result, nearly all speech synthesis systems use a combination of these approaches.
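The sketch below combines the two approaches in the usual way: an exceptions dictionary is consulted first, and a greedy longest-match pass over letter-to-sound rules handles everything else. Both tables are tiny, made-up fragments, not a real lexicon or rule set.

    LEXICON = {"of": ["AH", "V"], "the": ["DH", "AH"]}   # exceptions first
    RULES = {"ph": ["F"], "sh": ["SH"], "a": ["AE"], "b": ["B"],
             "e": ["EH"], "n": ["N"], "o": ["OW"], "s": ["S"], "t": ["T"]}

    def grapheme_to_phoneme(word):
        if word in LEXICON:            # dictionary: fast and accurate
            return LEXICON[word]
        phones, i = [], 0
        while i < len(word):           # rules: greedy longest match
            for size in (2, 1):
                chunk = word[i:i + size]
                if chunk in RULES:
                    phones += RULES[chunk]
                    i += size
                    break
            else:
                i += 1                 # no rule: skip the letter
        return phones

    print(grapheme_to_phoneme("of"))      # ['AH', 'V']  (dictionary)
    print(grapheme_to_phoneme("phone"))   # ['F', 'OW', 'N', 'EH']  (rules)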


Some languages, like Spanish, have a very regular writing system, and the prediction of the pronunciation of words based on their spellings is quite successful. Speech synthesis systems for such languages often use the rule-based method extensively, resorting to dictionaries only for those few words, like foreign names and borrowings, whose pronunciations are not obvious from their spellings. On the other hand, speech synthesis systems for languages like English, which have extremely irregular spelling systems, are more likely to rely on dictionaries, and to use rule-based methods only for unusual words, or words that aren't in their dictionaries.


Evaluation challenges

It is very difficult to evaluate speech synthesis systems consistently because there is no universally agreed evaluation criterion and different organizations usually use different speech data. The quality of a speech synthesis system also depends heavily on the quality of its recordings, so evaluating a synthesis system is partly a matter of evaluating the recordings it was built from.


Recently, researchers have begun evaluating speech synthesis systems using a common speech dataset.[18] This makes it easier to compare the differences between technologies rather than between recordings.


Computer operating systems or outlets with speech synthesis

Apple

The first speech system integrated into an operating system was Apple Computer's MacInTalk in 1984. Since the 1980s, Macintosh computers have offered text-to-speech capabilities through the MacInTalk software. In the early 1990s, Apple expanded these capabilities with system-wide text-to-speech support, and with the introduction of faster PowerPC-based computers it included higher-quality voice sampling. Apple also introduced speech recognition into its systems, providing a fluid command set. More recently, Apple has added sample-based voices. Starting as a curiosity, the speech system of the Apple Macintosh has evolved into a fully supported program, PlainTalk, for people with vision problems. VoiceOver, a screen reader built on this speech system, is included with all installations of Mac OS X 10.4 Tiger and Mac OS X 10.5 Leopard.


AmigaOS

The second operating system with advanced speech synthesis capabilities was AmigaOS, introduced in 1985. The voice synthesis was licensed by Commodore International from a third-party software house (Don't Ask Software, now Softvoice, Inc.) and it featured a complete system of voice emulation, with both male and female voices and "stress" indicator markers, made possible by advanced features of the Amiga hardware audio chipset.[19] It was divided into a narrator device and a translator library. Amiga Speak Handler featured a text-to-speech translator. AmigaOS considered speech synthesis a virtual hardware device, so the user could even redirect console output to it. Some Amiga programs, such as word processors, made extensive use of the speech system.


Microsoft Windows

Modern Windows systems use SAPI 4- and SAPI 5-based speech systems that include a speech recognition engine (SRE). SAPI 4.0 was available on Microsoft-based operating systems as a third-party add-on for systems like Windows 95 and Windows 98. Windows 2000 added Narrator, a speech synthesis program directly available to users, which is also included in Windows XP. All Windows-compatible programs can make use of speech synthesis features, available through menus once installed on the system. Microsoft Speech Server is a complete package for voice synthesis and recognition, for commercial applications such as call centers.
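On a Windows machine with the pywin32 package installed, the SAPI 5 synthesizer can be driven from Python through COM. A minimal example (the voice heard depends on what is installed on the system):

    # Speak a sentence through the default SAPI 5 voice (Windows + pywin32).
    import win32com.client

    voice = win32com.client.Dispatch("SAPI.SpVoice")
    voice.Speak("The quick brown fox jumps over the lazy dog.")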


Internet

Currently, there are many applications, plugins and gadgets, such as Power Text to Speech Reader and TextAloud, that can read messages directly from an e-mail client and web pages from a web browser. Some specialized software, such as the RSS-to-speech Google gadget, can narrate RSS feeds.


On one hand, online RSS narrators simplify information delivery by allowing users to listen to their favourite news sources and to convert them to podcasts. On the other hand, online RSS readers are available on almost any PC connected to the Internet. Users can download the generated audio files to portable devices, for example with the help of a podcast receiver, and listen to them while walking, jogging or commuting to work.


Aside from RSS-based speech synthesis, there are other useful text-to-speech web services. For example, Bluemountain.com has featured eCards that allow a user to produce custom-made vocal greetings from a computer-generated voice. These eCards usually consist of pre-made images, but some allow the user to select an image of whatever the user wants.[20] The Pediaphon project provides dynamically generated text-to-speech podcasts of all English, French and German language Wikipedia articles.


Others

Speech synthesis has also appeared on many other platforms and products, including the Texas Instruments TI-99/4A home computer, free software such as the Festival Speech Synthesis System and MBROLA on GNU/Linux, and commercial systems such as Lernout & Hauspie and Acapela Group voices, DECtalk, and IBM ViaVoice.

Speech synthesis markup languages

A number of markup languages have been established for the rendition of text as speech in an XML-compliant format. The most recent is Speech Synthesis Markup Language (SSML), which became a W3C recommendation in 2004. Older speech synthesis markup languages include Java Speech Markup Language (JSML) and SABLE. Although each of these was proposed as a standard, none of them has been widely adopted.
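For illustration, here is a small SSML 1.0 document; the say-as, break, and prosody elements shown are part of the W3C recommendation, though support for them varies between synthesizers:

    <?xml version="1.0"?>
    <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
           xml:lang="en-US">
      Your total is <say-as interpret-as="cardinal">1325</say-as>.
      <break time="500ms"/>
      <prosody rate="slow" pitch="low">Thank you for calling.</prosody>
    </speak>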


Speech synthesis markup languages are distinguished from dialogue markup languages. VoiceXML, for example, includes tags related to speech recognition, dialogue management and touchtone dialing, in addition to text-to-speech markup.


Applications

Accessibility

Speech synthesis has long been a vital assistive technology tool and its application in this area is significant and widespread. It allows environmental barriers to be removed for people with a wide range of disabilities. The longest application has been in the use of screen readers for people with visual impairment, but text-to-speech systems are now commonly used by people with dyslexia and other reading difficulties as well as by pre-literate children. They are also frequently employed to aid those with severe speech impairment, usually through a dedicated voice-output communication aid (VOCA).


News service

Sites such as Ananova have used speech synthesis to convert written news to audio content, which can be used for mobile applications.


Entertainment

On May 1, 2007, speech synthesis software developer Animo Limited and anime/manga creation tool developer CELSYS, Inc. announced the development of a speech synthesis software package geared towards customers in the anime, game, and other entertainment industries. The software, based on Animo's speech synthesis software FineSpeech, would generate narration and lines of dialogue according to user specifications.[23]


References

  1. ^ Jonathan Allen, M. Sharon Hunnicutt, Dennis Klatt, From Text to Speech: The MITalk system. Cambridge University Press: 1987. ISBN 0521306418
  2. ^ Rubin, P., Baer, T., & Mermelstein, P. (1981). An articulatory synthesizer for perceptual research. Journal of the Acoustical Society of America, 70, 321-328.
  3. ^ P. H. Van Santen, Richard William Sproat, Joseph P. Olive, and Julia Hirschberg, Progress in Speech Synthesis. Springer: 1997. ISBN 0387947019
  4. ^ History and Development of Speech Synthesis, Helsinki University of Technology. Retrieved on November 4, 2006.
  5. ^ Mechanismus der menschlichen Sprache nebst der Beschreibung seiner sprechenden Maschine ("Mechanism of the human speech with description of its speaking machine," J. B. Degen, Wien).
  6. ^ Mattingly, Ignatius G. Speech synthesis for phonetic and phonological models. In Thomas A. Sebeok (Ed.), Current Trends in Linguistics, Volume 12, Mouton, The Hague, pp. 2451-2487, 1974.
  7. ^ NY Times obituary for Louis Gerstman: http://query.nytimes.com/search/query?ppds=per&v1=GERSTMAN%2C%20LOUIS&sort=newest
  8. ^ Arthur C. Clarke online Biography
  9. ^ Bell Labs: Where "HAL" First Spoke (Bell Labs Speech Synthesis website)
  10. ^ Anthropomorphic Talking Robot Waseda-Talker Series
  11. ^ Alan W. Black, Perfect synthesis for all of the people all of the time. IEEE TTS Workshop 2002. (http://www.cs.cmu.edu/~awb/papers/IEEE2002/allthetime/allthetime.html)
  12. ^ John Kominek and Alan W. Black. (2003). CMU ARCTIC databases for speech synthesis. CMU-LTI-03-177. Language Technologies Institute, School of Computer Science, Carnegie Mellon University.
  13. ^ T. Dutoit, V. Pagel, N. Pierret, F. Bataille, O. van der Vrecken. The MBROLA Project: Towards a set of high quality speech synthesizers of use for non-commercial purposes. ICSLP Proceedings, 1996.
  14. ^ Examples include Astro Blaster, Space Fury, and Star Trek: Strategic Operations Simulator.
  15. ^ John Holmes and Wendy Holmes. Speech Synthesis and Recognition, 2nd Edition. CRC: 2001. ISBN 0748408568.
  16. ^ The HMM-based Speech Synthesis System, http://hts.sp.nitech.ac.jp/
  17. ^ Remez, R.E., Rubin, P.E., Pisoni, D.B., & Carrell, T.D. Speech perception without traditional speech cues. Science, 1981, 212, 947-950.
  18. ^ Blizzard Challenge http://festvox.org/blizzard
  19. ^ Miner, Jay et al (1991). Amiga Hardware Reference Manual: Third Edition. Addison-Wesley Publishing Company, Inc. ISBN 0-201-56776-8.
  20. ^ "http://veepers.bluemountain.com/service/Start?memstat=afu&email=&path=82948&prodnum=3085679&bc=Talking%20eCards!82947*Anytime!0&src=bma&adisplay=1&va=0". 
  21. ^ Smithsonian Speech Synthesis History Project (SSSHP) 1986-2002
  22. ^ gnuspeech
  23. ^ Speech Synthesis Software for Anime Announced


See also

  • Articulatory synthesis
  • Chinese speech synthesis
  • Natural language processing
  • Sinewave synthesis
  • Speech processing
  • Speech recognition
  • PlainTalk
  • Festival Speech Synthesis System
  • FreeTTS
  • Screen reader
  • Praat
  • Software Automatic Mouth
  • Text2Speech
  • Vocoder
