CPU design

CPU design is the hardware design of a central processing unit. Design focuses on these areas:

  1. Datapaths (such as ALUs and pipelines)
  2. Logic which controls the datapaths
  3. Memory components such as register files, caches
  4. Clock circuitry such as clock drivers, PLLs, clock distribution networks
  5. Pad transceiver circuitry
  6. Logic gate cell library which is used to implement the logic

CPUs designed for high performance markets might require custom designs for each of these items to achieve frequency, power-dissipation, and chip-area goals.


CPUs designed for lower performance markets might lessen the implementation burden by:

  • acquiring some of these items by purchasing them as intellectual property
  • using control logic implementation techniques (logic synthesis using CAD tools) to implement the other components: datapaths, register files, clocks

Common logic implementation techniques used in CPU design include finite state machines, microprograms, and programmable logic arrays (PLAs).

A CPU design project generally has these major tasks: writing the register transfer level (RTL) description, circuit design and logic synthesis, static timing analysis, floorplanning, place and route, signal integrity checking, and design rule checking (DRC).

As with most complex electronic designs, the logic verification effort (proving that the design does not have bugs) now dominates the project schedule of a CPU.


Key CPU architectural innovations include cache, virtual memory, instruction pipelining, superscalar, CISC, RISC, virtual machine, emulators, microprogram, and stack.


Goals

The first CPUs were designed to do mathematical calculations faster and more reliably than human computers.


Each successive generation of CPU might be designed to achieve some of these goals:

  • higher performance levels of a single program or thread
  • higher throughput levels of multiple programs/threads
  • less power consumption for the same performance level
  • lower cost for the same performance level
  • greater connectivity to build larger, more parallel systems
  • more specialization to aid in specific targeted markets

Re-designing a CPU core to a smaller die-area helps achieve several of these goals.

  • Shrinking everything (a "photomask shrink"), resulting in the same number of transistors on a smaller die, improves performance (smaller transistors switch faster), reduces power (smaller wires have less parasitic capacitance) and reduces cost (more CPUs fit on the same wafer of silicon).
  • Releasing a CPU on the same size die, but with a smaller CPU core, keeps the cost about the same but allows higher levels of integration within one VLSI chip (additional cache, multiple CPUs, or other components), improving performance.

Because there are too many programs to test a CPU's speed on all of them, benchmarks were developed. The most famous benchmarks are the SPECint and SPECfp benchmarks developed by Standard Performance Evaluation Corporation and the ConsumerMark benchmark developed by the Embedded Microprocessor Benchmark Consortium [1].


Some important measurements include:

  • Most consumers pick a computer architecture (normally Intel IA32 architecture) to be able to run a large base of pre-existing, pre-compiled software. Being relatively uninformed on computer benchmarks, most of them pick a particular CPU based on operating frequency.
  • System designers building parallel computers, such as Google, pick CPUs based on their speed per watt of power, because the cost of powering the CPU outweighs the cost of the CPU itself. [2][3]
  • Some system designers building parallel computers pick CPUs based on the speed per dollar.
  • System designers building real-time computing systems want to guarantee worst-case response. That is easier to do when the CPU has low interrupt latency and a deterministic response (as in a DSP).
  • Computer programmers who program directly in assembly language want a CPU to support a full featured instruction set.


Some of these measures conflict. In particular, many design techniques that make a CPU run faster make the "performance per watt", "performance per dollar", and "deterministic response" much worse, and vice versa.


History of general purpose CPUs

1950s: early designs

Each of the computer designs of the early 1950s was a unique design; there were no upward-compatible machines or computer architectures with multiple, differing implementations. Programs written for one machine would not run on another kind, even other kinds from the same company. This was not a major drawback at the time because there was not a large body of software developed to run on computers, so starting programming from scratch was not seen as a large barrier.


The design freedom of the time was very important, for designers were very constrained by the cost of electronics, yet just beginning to explore how a computer could best be organized. Some of the basic features introduced during this period included index registers (on the Ferranti Mark I), a return-address saving instruction (UNIVAC I), immediate operands (IBM 704), and the detection of invalid operations (IBM 650).


By the end of the 1950s commercial builders had developed factory-constructed, truck-deliverable computers. The most widely installed computer was the IBM 650, which used drum memory onto which programs were loaded using either paper tape or punch cards. Some very high-end machines also included core memory which provided higher speeds. Hard disks were also starting to become popular.


Computers are automatic abaci. The type of number system affects the way they work. In the early 1950s most computers were built for specific numerical processing tasks, and many machines used decimal numbers as their basic number system – that is, the mathematical functions of the machines worked in base-10 instead of base-2 as is common today. These were not merely binary coded decimal. The machines actually had ten vacuum tubes per digit in each register. Some early Soviet computer designers implemented systems based on ternary logic; that is, a bit could have three states: +1, 0, or -1, corresponding to positive, no, or negative voltage.
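
A decimal machine spends more storage per value than a binary one. The small C sketch below contrasts a pure binary encoding with binary coded decimal (BCD), which keeps one decimal digit per 4-bit group; the value 1959 and the printing format are arbitrary choices made for illustration.

    #include <stdio.h>

    /* Pack each decimal digit of n into its own 4-bit field (BCD), the way
       decimal machines effectively stored one digit per group of circuitry
       rather than one binary number. */
    unsigned to_bcd(unsigned n) {
        unsigned bcd = 0, shift = 0;
        while (n > 0) {
            bcd |= (n % 10) << shift;   /* one digit per nibble */
            n /= 10;
            shift += 4;
        }
        return bcd;
    }

    int main(void) {
        unsigned n = 1959;
        printf("binary: 0x%X (11 bits used)\n", n);          /* 0x7A7  */
        printf("BCD:    0x%X (16 bits used)\n", to_bcd(n));  /* 0x1959 */
        return 0;
    }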


An early project for the U.S. Air Force, BINAC attempted to make a lightweight, simple computer by using binary arithmetic. It deeply impressed the industry.


As late as 1970, major computer languages were unable to standardize their numeric behavior because decimal computers had groups of users too large to alienate.


Even when designers used a binary system, they still had many odd ideas. Some used sign-magnitude arithmetic (-1 = 10001), or ones' complement (-1 = 11110), rather than modern two's complement arithmetic (-1 = 11111). Most computers used six-bit character sets, because they adequately encoded Hollerith cards. It was a major revelation to designers of this period to realize that the data word should be a multiple of the character size. They began to design computers with 12-, 24- and 36-bit data words (e.g. see the TX-2).
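
The three encodings named above differ only in how the sign is folded into the bit pattern. The short C sketch below reproduces the 5-bit examples from the text; the helper name print5 and the 5-bit width are choices made for illustration, not features of any particular machine.

    #include <stdio.h>

    /* Print the 5-bit encodings of -1 under the three schemes named above. */
    static void print5(const char *name, unsigned v) {
        printf("%-16s ", name);
        for (int bit = 4; bit >= 0; bit--)
            putchar((v >> bit) & 1 ? '1' : '0');
        putchar('\n');
    }

    int main(void) {
        unsigned mag = 1;                                /* |-1| */
        print5("sign-magnitude",   (1u << 4) | mag);     /* 10001 */
        print5("ones' complement", ~mag & 0x1F);         /* 11110 */
        print5("two's complement", (~mag + 1) & 0x1F);   /* 11111 */
        return 0;
    }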


In this era, Grosch's law dominated computer design: Computer cost increased as the square of its speed.


1960s: the computer revolution and CISC

One major problem with early computers was that a program for one would not work on others. Computer companies found that their customers had little reason to remain loyal to a particular brand, as the next computer they purchased would be incompatible anyway. At that point, price and performance were usually the only concerns.


In 1962, IBM tried a new approach to designing computers. The plan was to make an entire family of computers that could all run the same software, but with different performances, and at different prices. As users' requirements grew they could move up to larger computers, and still keep all of their investment in programs, data and storage media.


In order to do this they designed a single reference computer called the System/360 (or S/360). The System/360 was a virtual computer, a reference instruction set and capabilities that all machines in the family would support. In order to provide different classes of machines, each computer in the family would use more or less hardware emulation, and more or less microprogram emulation, to create a machine capable of running the entire System/360 instruction set.


For instance a low-end machine could include a very simple processor for low cost. However this would require the use of a larger microcode emulator to provide the rest of the instruction set, which would slow it down. A high-end machine would use a much more complex processor that could directly process more of the System/360 design, thus running a much simpler and faster emulator.


IBM chose to make the reference instruction set quite complex, and very capable. This was a conscious choice. Even though the computer was complex, its "control store" containing the microprogram would stay relatively small, and could be made with very fast memory. Another important effect was that a single instruction could describe quite a complex sequence of operations. Thus the computers would generally have to fetch fewer instructions from the main memory, which could be made slower, smaller and less expensive for a given combination of speed and price.


As the S/360 was to be a successor to both scientific machines like the 7090 and data processing machines like the 1401, it needed a design that could reasonably support all forms of processing. Hence the instruction set was designed to manipulate not just simple binary numbers, but text, scientific floating-point (similar to the numbers used in a calculator), and the binary coded decimal arithmetic needed by accounting systems.


Almost all following computers included these innovations in some form. This basic set of features is now called a "complex instruction set computer," or CISC (pronounced "sisk"), a term not invented until many years later.


In many CISCs, an instruction could access either registers or memory, usually in several different ways. This made the CISCs easier to program, because a programmer could remember just thirty to a hundred instructions, and a set of three to ten addressing modes rather than thousands of distinct instructions. This was called an "orthogonal instruction set." The PDP-11 and Motorola 68000 architecture are examples of nearly orthogonal instruction sets.


There was also the BUNCH (Burroughs, UNIVAC, NCR, CDC, and Honeywell), which competed against IBM at this time, though IBM dominated the era with the S/360.


The Burroughs Corporation (which later merged with Sperry/Univac to become Unisys) offered an alternative to the S/360 with its B5000 series machines. Introduced in 1961, the B5000 had virtual memory, symmetric multiprocessing, and a multiprogramming operating system (the Master Control Program, or MCP) written in ALGOL 60, and the industry's first recursive-descent compilers appeared as early as 1963.


1970s: Large Scale Integration

In the 1960s, the Apollo guidance computer and Minuteman missile made the integrated circuit economical and practical.


Around 1971, the first calculator and clock chips began to show that very small computers might be possible. The first microprocessor was the 4004, designed in 1971 for a calculator company (Busicom) and produced by Intel. It was soon followed by the 8-bit 8008, the start of a line that led through the 8080 and 8086 to the Intel 80386 and today's x86 processors.


By the mid-1970s, the use of integrated circuits in computers was commonplace. The whole decade was marked by upheavals caused by the falling price of transistors.


It became possible to put an entire CPU on a single printed circuit board. The result was that minicomputers, usually with 16-bit words and 4K to 64K of memory, came to be commonplace.


CISCs were believed to be the most powerful types of computers, because their microcode was small and could be stored in very high-speed memory. The CISC architecture also addressed the "semantic gap" as it was perceived at the time: the distance between the machine language and the higher-level languages people used to program a machine. It was felt that compilers could do a better job with a richer instruction set.


Custom CISCs were commonly constructed using "bit slice" computer logic such as the AMD 2900 chips, with custom microcode. A bit slice component is a piece of an ALU, register file or microsequencer. Most bit-slice integrated circuits were 4 bits wide.


By the early 1970s, the PDP-11 was developed, arguably the most advanced small computer of its day. Almost immediately, wider-word CISCs were introduced, the 32-bit VAX and 36-bit PDP-10.


Also, to control a cruise missile, Intel developed a more-capable version of its 8008 microprocessor, the 8080.


IBM continued to make large, fast computers. However, the definition of large and fast now meant more than a megabyte of RAM, clock speeds near one megahertz [4][5], and disk drives holding tens of megabytes.


IBM's System 370 was a version of the 360 tweaked to run virtual computing environments. The virtual computer was developed in order to reduce the possibility of an unrecoverable software failure.


The Burroughs B5000/B6000/B7000 series reached its largest market share. It was a stack computer whose OS was programmed in a dialect of Algol.


All these different developments competed fiercely for market share.


Early 1980s: the lessons of RISC

In the early 1980s, researchers at UC Berkeley and IBM both discovered that most computer language compilers and interpreters used only a small subset of the instructions of a CISC. Much of the power of the CPU was simply being ignored in real-world use. They realized that by making the computer simpler and less orthogonal, they could make it faster and less expensive at the same time.


At the same time, CPU calculation became faster relative to the time needed for memory accesses. Designers also experimented with using large sets of internal registers. The idea was to cache intermediate results in the registers under the control of the compiler. This also reduced the number of addressing modes and orthogonality.


The computer designs based on this theory were called Reduced Instruction Set Computers, or RISC. RISCs generally had larger numbers of registers, accessed by simpler instructions, with a few instructions specifically to load and store data to memory. The result was a very simple core CPU running at very high speed, supporting the exact sorts of operations the compilers were using anyway.


A common variation on the RISC design employs the Harvard architecture, as opposed to the Von Neumann or Stored Program architecture common to most other designs. In a Harvard Architecture machine, the program and data occupy separate memory devices and can be accessed simultaneously. In Von Neumann machines the data and programs are mixed in a single memory device, requiring sequential accessing which produces the so-called "Von Neumann bottleneck."


One downside to the RISC design has been that the programs that run on them tend to be larger. This is because compilers have to generate longer sequences of the simpler instructions to accomplish the same results. Since these instructions need to be loaded from memory anyway, the larger code size offsets some of the RISC design's fast memory handling.


Recently, engineers have found ways to compress the reduced instruction sets so they fit in even smaller memory systems than CISCs. Examples of such compression schemes include the ARM's "Thumb" instruction set. In applications that do not need to run older binary software, compressed RISCs are coming to dominate sales.


Another approach to RISCs was the MISC, "niladic", or "zero-operand" instruction set. This approach was based on the observation that most of the space in an instruction is used to identify its operands. These machines placed the operands on a push-down (last-in, first-out) stack. The instruction set was supplemented with a few instructions to fetch and store memory. Most used simple caching to provide extremely fast RISC machines, with very compact code. Another benefit was that the interrupt latencies were extremely small, smaller than most CISC machines (a rare trait in RISC machines). The Burroughs large systems architecture used this approach. The B5000 was designed in 1961, long before the term "RISC" was invented. The architecture puts six 8-bit instructions in a 48-bit word, and was a precursor to VLIW design (see below: 1990 to today).
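
Because a zero-operand instruction names no registers, an "add" is just an opcode: it pops the top two stack entries and pushes the sum. The toy C evaluator below illustrates the idea; the opcode names and the (2 + 3) * 4 program are invented for illustration and do not correspond to any real MISC or Burroughs encoding.

    #include <stdio.h>

    /* A toy zero-operand machine: instructions name no registers, so opcodes
       are tiny; operands live on a push-down stack. */
    enum op { PUSH, ADD, MUL, PRINT, HALT };

    struct insn { enum op op; int imm; };   /* imm is used only by PUSH */

    int main(void) {
        /* computes (2 + 3) * 4 */
        struct insn prog[] = {
            {PUSH, 2}, {PUSH, 3}, {ADD, 0}, {PUSH, 4}, {MUL, 0}, {PRINT, 0}, {HALT, 0}
        };
        int stack[16], sp = 0;
        for (int pc = 0; ; pc++) {
            struct insn i = prog[pc];
            switch (i.op) {
            case PUSH:  stack[sp++] = i.imm; break;
            case ADD:   sp--; stack[sp-1] += stack[sp]; break;
            case MUL:   sp--; stack[sp-1] *= stack[sp]; break;
            case PRINT: printf("%d\n", stack[sp-1]); break;   /* prints 20 */
            case HALT:  return 0;
            }
        }
    }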


The Burroughs architecture was one of the inspirations for Charles H. Moore's Forth language, which placed six 5-bit instructions in a 32-bit word. Commercial variants were mostly characterized as "Forth" machines, and probably failed in the marketplace because the power and advantages of that language were not commonly understood. Also, the machines were developed by defense contractors at exactly the time that the Cold War ended. Loss of funding may have broken up the development teams before the companies could perform adequate commercial marketing.


RISC chips now dominate the market for 32-bit embedded systems. Smaller RISC chips are even becoming common in the cost-sensitive 8-bit embedded-system market. The main market for RISC CPUs has been systems that require low power or small size.


Even some CISC processors (based on architectures that were created before RISC became dominant) translate instructions internally into a RISC-like instruction set. These CISC chips include newer x86 and VAX models.


These figures may surprise many, because the "market" is perceived to be desktop computers. Intel x86 designs dominate the vast majority of desktop sales, with RISC found only in some of the Apple, Sun and SGI desktop lines. However, desktop computers are only a tiny fraction of the computers now sold: most people own more computers in embedded systems in their car and house than on their desks.


Mid-1980s to today: exploiting instruction level parallelism

In the mid-to-late 1980s, designers began using a technique known as "instruction pipelining", in which the processor works on multiple instructions in different stages of completion. For example, the processor may be retrieving the operands for the next instruction while calculating the result of the current one. Modern CPUs may use over a dozen such stages. MISC processors achieve single-cycle execution of instructions without the need for pipelining.
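
Pipelining raises throughput rather than shortening any single instruction: once the pipe is full, one instruction completes per cycle even though each instruction still passes through every stage. The small C model below makes that arithmetic concrete; the five-stage, 100-instruction, no-stall figures are assumptions, not measurements of any particular CPU.

    #include <stdio.h>

    /* Idealized timing: S pipeline stages, N instructions, one cycle per
       stage, no stalls.  Unpipelined, each instruction occupies the whole
       machine for S cycles; pipelined, a new instruction enters every cycle. */
    int main(void) {
        int S = 5, N = 100;                      /* assumed values */
        int unpipelined = S * N;
        int pipelined   = S + (N - 1);           /* fill the pipe, then 1/cycle */
        printf("unpipelined: %d cycles\n", unpipelined);    /* 500 */
        printf("pipelined:   %d cycles\n", pipelined);      /* 104 */
        printf("speedup:     %.2fx\n", (double)unpipelined / pipelined);
        return 0;
    }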


A similar idea, introduced only a few years later, was to execute multiple instructions in parallel on separate arithmetic-logic units (ALUs). Instead of operating on only one instruction at a time, the CPU will look for several similar instructions that are not dependent on each other, and execute them in parallel. This approach is known as superscalar processor design.


Such techniques are limited by the degree of instruction level parallelism (ILP), the number of non-dependent instructions in the program code. Some programs are able to run very well on superscalar processors due to their inherent high ILP, notably graphics. However, more general problems have far less ILP, which lowers the speedups achievable with these techniques.


Branching is one major culprit. For example, the program might add two numbers and branch to a different code segment if the number is bigger than a third number. In this case even if the branch operation is sent to the second ALU for processing, it still must wait for the results from the addition. It thus runs no faster than if there were only one ALU. The most common solution for this type of problem is to use a type of branch prediction.
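
One common predictor, used here purely as an example of the idea rather than as the scheme of any specific CPU, is a two-bit saturating counter kept per branch: two mispredictions in a row are needed to flip the prediction, so a loop branch that is almost always taken is predicted well. A minimal C sketch:

    #include <stdio.h>

    /* 2-bit saturating counter: states 0-1 predict "not taken", 2-3 predict
       "taken"; the counter moves one step toward each actual outcome, so a
       single anomaly does not flip the prediction. */
    int main(void) {
        int counter = 2;                  /* start weakly "taken" */
        /* outcome pattern of a loop branch: taken 7 times, then falls through */
        int outcomes[] = {1, 1, 1, 1, 1, 1, 1, 0};
        int n = sizeof outcomes / sizeof outcomes[0], correct = 0;
        for (int i = 0; i < n; i++) {
            int predict_taken = (counter >= 2);
            if (predict_taken == outcomes[i]) correct++;
            if (outcomes[i]  && counter < 3) counter++;
            if (!outcomes[i] && counter > 0) counter--;
        }
        printf("%d/%d predictions correct\n", correct, n);   /* 7/8 */
        return 0;
    }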


To make further use of the multiple functional units available in superscalar designs, operand register dependencies were found to be another limiting factor. To minimize these dependencies, out-of-order execution of instructions was introduced. In such a scheme, the instruction results which complete out of order must be re-ordered in program order by the processor for the program to be restartable after an exception. Out-of-order execution was the main advance of the computer industry during the 1990s. A similar concept is speculative execution, where instructions from one direction of a branch (the predicted direction) are executed before the branch direction is known. When the branch direction is known, the predicted direction and the actual direction are compared. If the predicted direction was correct, the speculatively-executed instructions and their results are kept; if it was incorrect, these instructions and their results are thrown out. Speculative execution coupled with an accurate branch predictor gives a large performance gain.


These advances, which were originally developed from research for RISC-style designs, allow modern CISC processors to execute several instructions per clock cycle, when traditional CISC designs could take twelve or more cycles to execute just one instruction.


The resulting instruction scheduling logic of these processors is large, complex and difficult to verify. Furthermore, the higher complexity requires more transistors, increasing power consumption and heat. In this respect RISC is superior because the instructions are simpler, have less interdependence and make superscalar implementations easier. However, as Intel has demonstrated, the concepts can be applied to a CISC design, given enough time and money.

Historical note: Some of these techniques (e.g. pipelining) were originally developed in the late 1950s by IBM on their Stretch mainframe computer.


1990 to today: looking forward

VLIW and EPIC

The instruction scheduling logic that makes a processor superscalar is just Boolean logic. In the early 1990s, a significant innovation was to realize that the coordination of a multiple-ALU computer could be moved into the compiler, the software that translates a programmer's instructions into machine-level instructions.


This type of computer is called a very long instruction word (VLIW) computer.


Statically scheduling the instructions in the compiler (as opposed to letting the processor do the scheduling dynamically) can reduce CPU complexity. This can improve performance, reduce heat, and reduce cost.


Unfortunately, the compiler lacks accurate knowledge of runtime scheduling issues. Merely changing the CPU core frequency multiplier will have an effect on scheduling. Actual operation of the program, as determined by input data, will have major effects on scheduling. To overcome these severe problems a VLIW system may be enhanced by adding the normal dynamic scheduling, losing some of the VLIW advantages.


Static scheduling in the compiler also assumes that dynamically generated code will be uncommon. Prior to the creation of Java, this was in fact true. It was reasonable to assume that slow compiles would only affect software developers. Now, with JIT virtual machines for Java and .NET, slow code generation affects users as well.


There were several unsuccessful attempts to commercialize VLIW. The basic problem is that a VLIW computer does not scale to different price and performance points, as a dynamically scheduled computer can. Another issue is that compiler design for VLIW computers is extremely difficult, and the current crop of compilers (as of 2005) don't always produce optimal code for these platforms.


Also, VLIW computers optimise for throughput, not low latency, so they were not attractive to the engineers designing controllers and other computers embedded in machinery. The embedded systems markets had often pioneered other computer improvements by providing a large market that did not care about compatibility with older software.


In January 2000, a company called Transmeta took the interesting step of placing a compiler in the central processing unit, and making the compiler translate from a reference byte code (in their case, x86 instructions) to an internal VLIW instruction set. This approach combines the hardware simplicity, low power and speed of VLIW RISC with the compact main memory system and software reverse-compatibility provided by popular CISC.


Intel released a chip, called the Itanium, based on what they call an Explicitly Parallel Instruction Computing (EPIC) design. This design supposedly provides the VLIW advantage of increased instruction throughput. However, it avoids some of the issues of scaling and complexity, by explicitly providing in each "bundle" of instructions information concerning their dependencies. This information is calculated by the compiler, as it would be in a VLIW design. The early versions are also backward-compatible with current x86 software by means of an on-chip emulation mode. Integer performance was disappointing and despite improvements, sales in volume markets continue to be low.


Multi-threading

Current designs work best when the computer is running only a single program; however, nearly all modern operating systems allow the user to run multiple programs at the same time. For the CPU to change over and do work on another program requires expensive context switching. In contrast, multi-threaded CPUs can handle instructions from multiple programs at once.


To do this, such CPUs include several sets of registers. When a context switch occurs, the contents of the "working registers" are simply copied into one of a set of registers for this purpose.
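
The text above describes copying the working registers into a spare set; an equivalent way to picture it is that the CPU keeps one full register bank per hardware thread and merely changes which bank is currently "live". The C sketch below models that bank-selection view; the bank sizes, thread count, and function name are assumptions made for illustration only.

    #include <stdio.h>

    #define NREGS    32
    #define NTHREADS 4

    /* One full architectural register set per hardware thread.  A context
       switch becomes a change of the active bank index instead of a copy of
       every register out to memory and back. */
    static int bank[NTHREADS][NREGS];
    static int active = 0;                 /* which thread's registers are live */

    static void context_switch(int next_thread) {
        active = next_thread;              /* no memory traffic at all */
    }

    int main(void) {
        bank[0][5] = 111;                  /* thread 0 leaves a value in r5 */
        context_switch(1);
        bank[active][5] = 222;             /* thread 1 uses its own r5 */
        context_switch(0);
        printf("thread 0 r5 = %d\n", bank[active][5]);   /* still 111 */
        return 0;
    }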


Such designs often include thousands of registers instead of hundreds as in a typical design. On the downside, registers tend to be somewhat expensive in chip space needed to implement them. This chip space might otherwise be used for some other purpose.


Multi-core

Multi-core CPUs are typically multiple CPU cores on the same die. The cores may share a cache to main memory, and they share the same bus both to talk to other devices and to communicate with each other.


Reconfigurable logic

Another track of development is to combine reconfigurable logic with a general-purpose CPU. In this scheme, a special computer language compiles fast-running subroutines into a bit-mask to configure the logic. Slower, or less-critical parts of the program can be run by sharing their time on the CPU. This process has the capability to create devices such as software radios, by using digital signal processing to perform functions usually performed by analog electronics.


Public domain processors

As the lines between hardware and software increasingly blur due to progress in design methodology and availability of chips such as FPGAs and cheaper production processes, even open source hardware has begun to appear. Loosely knit communities like OpenCores have recently announced completely open CPU architectures such as OpenRISC, which can be readily implemented on FPGAs or in custom-produced chips by anyone, without paying license fees, and even established processor manufacturers like Sun Microsystems have released processor designs (e.g. OpenSPARC) under open-source licenses.


High-end processor economics

Developing new, high-end CPUs is a very expensive proposition. Both the logical complexity (needing very large logic design and logic verification teams and simulation farms with perhaps thousands of computers) and the high operating frequencies (needing large circuit design teams and access to the state-of-the-art fabrication process) account for the high cost of design for this type of chip. The design cost of a high-end CPU will be on the order of US $100 million. Since the design of such high-end chips nominally takes about five years to complete, to stay competitive a company has to fund at least two of these large design teams to release products at the rate of one generation every 2.5 years. Only the personal computer mass market (with production rates in the hundreds of millions, producing billions of dollars in revenue) can support such economics. As of 2004, only four companies are actively designing and fabricating state-of-the-art general-purpose CPU chips: Intel, AMD, IBM and Fujitsu. Motorola has spun off its semiconductor division as Freescale because that division was dragging down profit margins for the rest of the company. Texas Instruments, TSMC and Toshiba are a few examples of companies that manufacture CPU chips designed by other companies.


Embedded design

The majority of computer systems in use today are embedded in other machinery, such as telephones, clocks, appliances, vehicles, and infrastructure. An embedded system usually has minimal requirements for memory and program length and may require simple but unusual input/output systems. For example, most embedded systems lack keyboards, screens, disks, printers, or other recognizable I/O devices of a personal computer. They may control electric motors, relays or voltages, and read switches, variable resistors or other electronic devices. Often, the only I/O device readable by a human is a single light-emitting diode, and severe cost or power constraints can even eliminate that.


Latency

In contrast to general-purpose computers, embedded systems often seek to minimize interrupt latency over instruction throughput.


When an electronic device causes an interrupt, the intermediate results held in the registers have to be saved before the software responsible for handling the interrupt can run, and then must be put back after it is finished. If there are more registers, this saving and restoring process takes more time, increasing the latency.


Low-latency CPUs generally have relatively few registers in their central processing units, or they have "shadow registers" that are only used by the interrupt software.
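
A rough way to see why register count matters here is to charge each saved-and-restored register a pair of memory operations; shadow registers remove that term entirely. The C sketch below does only that arithmetic; all of the cycle counts are made-up round numbers, not figures for any real CPU.

    #include <stdio.h>

    /* Rough latency model for interrupt entry/exit: each register that must
       be saved and later restored costs one store plus one load.  With a
       dedicated shadow set, the handler switches register banks instead, and
       the save/restore term disappears. */
    int main(void) {
        int regs_to_save    = 16;   /* registers the handler would clobber   */
        int mem_op_cycles   = 2;    /* assumed cost of one store or one load */
        int dispatch_cycles = 10;   /* fixed cost: vectoring, pipeline drain */

        int without_shadow = dispatch_cycles + 2 * regs_to_save * mem_op_cycles;
        int with_shadow    = dispatch_cycles;   /* bank switch assumed free  */

        printf("latency without shadow registers: %d cycles\n", without_shadow);
        printf("latency with shadow registers:    %d cycles\n", with_shadow);
        return 0;
    }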


Higher integration

In contrast to general-purpose CPUs, many embedded CPUs do not have an address bus or a data bus, because they integrate all the RAM and non-volatile memory on the same chip as the CPU. Because they need fewer pins, the chip can be placed in a much smaller, cheaper package.


Integrating the memory and other peripherals on a single chip and testing them as a unit increases the cost of that chip, but often results in decreased net cost of the embedded system as a whole. (Even if the cost of a CPU that has integrated peripherals is slightly more than the cost of a CPU plus external peripherals, having fewer chips typically allows a smaller and cheaper circuit board, and reduces the labor required to assemble and test the circuit board.) This trend leads to microcontroller and system-on-a-chip design.


Other design issues

Optical communication

One interesting near-term possibility would be to eliminate the front side bus. Modern vertical laser diodes enable this change. In theory, an optical computer's components could directly connect through a holographic or phased open-air switching system. This would provide a large increase in effective speed and design flexibility, and a large reduction in cost. Since a computer's connectors are also its most likely failure point, a busless system might be more reliable, as well.


Optical processors

Another farther-term possibility is to use light instead of electricity for the digital logic itself. In theory, this could run about 30% faster and use less power, as well as permit a direct interface with quantum computational devices. The chief problem with this approach is that for the foreseeable future, electronic devices are faster, smaller (i.e. cheaper) and more reliable. An important theoretical problem is that electronic computational elements are already smaller than some wavelengths of light, and therefore even wave-guide based optical logic may be uneconomic compared to electronic logic. The majority of development effort, as of 2006, is focused on electronic circuitry. See also optical computing.


Clockless CPUs

Yet another possibility is the "clockless CPU" (asynchronous CPU). Unlike conventional processors, clockless processors have no central clock to coordinate the progress of data through the pipeline. Instead, stages of the CPU are coordinated using logic devices called "pipeline controllers" or "FIFO sequencers": the pipeline controller starts the next stage of logic as soon as the existing stage is complete, so a central clock is unnecessary (a minimal sketch of this handshake follows the list below). There are two advantages to clockless CPUs over clocked CPUs:

  • Components can run at different speeds in the clockless CPU. In a clocked CPU, no component can run faster than the clock rate.
  • In a clocked CPU, the clock can go no faster than the worst-case performance of the slowest stage. In a clockless CPU, when a stage finishes faster than normal, the next stage can immediately take the results rather than waiting for the next clock tick. A stage might finish faster than normal because of the particular data inputs (multiplication can be very fast if it is multiplying by 0 or 1), or because it is running at a higher voltage or lower temperature than normal.
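
The following C sketch is a software model of such stage-to-stage handshaking, not a real asynchronous circuit; the stage names and data values are invented, and the point is only that a transfer happens whenever the producing latch is full and the consuming latch is empty, with no clock involved.

    #include <stdio.h>

    struct latch { int full; int data; };     /* storage between stages    */

    /* "Pipeline control": a transfer happens when the producer has data
       and the consumer-side latch is empty; no clock is involved.         */
    static int transfer(struct latch *from, struct latch *to)
    {
        if (from->full && !to->full) {
            to->data = from->data;
            to->full = 1;
            from->full = 0;
            return 1;                          /* handshake completed       */
        }
        return 0;
    }

    int main(void)
    {
        struct latch fetch = {1, 42}, decode = {0, 0}, execute = {0, 0};
        /* Transfers fire whenever a handshake is possible, in any order.  */
        transfer(&decode, &execute);           /* nothing to move yet       */
        transfer(&fetch, &decode);             /* 42 advances to decode     */
        transfer(&decode, &execute);           /* 42 advances to execute    */
        printf("execute stage holds %d\n", execute.data);
        return 0;
    }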

Two examples of asynchronous CPUs are the AMULET, which implements the ARM architecture, and the MiniMIPS, an asynchronous implementation of the MIPS R3000.


The biggest disadvantage of the clockless CPU is that most CPU design tools assume a clocked CPU (a synchronous circuit), so making a clockless CPU (designing an asynchronous circuit) involves modifying the design tools to handle clockless logic and doing extra testing to ensure that the design avoids metastability problems. For example, the group that designed the aforementioned AMULET developed a tool called LARD to cope with the complex design of AMULET3.


Soft microprocessors

Main article: Soft microprocessor

A soft microprocessor is a microprocessor core implemented entirely in a hardware description language (HDL), which allows it to be synthesized onto programmable logic or an ASIC rather than laid out as a fixed circuit.

Concepts

In general, all processors, micro or otherwise, carry out the same basic sequence of steps over and over (a toy version of this loop is sketched in code after the list):

  1. read an instruction and decode it
  2. find any associated data that is needed to process the instruction
  3. process the instruction
  4. write the results out
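
The toy C interpreter below walks through these four steps for a made-up two-operand instruction set; the opcodes, register count and program are entirely hypothetical and exist only to show the shape of the loop.

    #include <stdio.h>

    enum { OP_ADD, OP_SUB, OP_HALT };
    struct insn { int op, dst, src; };

    int main(void)
    {
        struct insn program[] = {              /* "main memory"                  */
            {OP_ADD, 0, 1}, {OP_SUB, 0, 2}, {OP_HALT, 0, 0}
        };
        int reg[4] = {10, 5, 3, 0};
        int pc = 0;

        for (;;) {
            struct insn i = program[pc++];     /* 1. read the instruction (pre-decoded here) */
            int a = reg[i.dst], b = reg[i.src];/* 2. find the associated data    */
            int result = 0;                    /* 3. process the instruction     */
            if (i.op == OP_ADD)      result = a + b;
            else if (i.op == OP_SUB) result = a - b;
            else break;                        /* HALT                           */
            reg[i.dst] = result;               /* 4. write the results out       */
        }
        printf("r0 = %d\n", reg[0]);           /* 10 + 5 - 3 = 12                */
        return 0;
    }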

Complicating this simple-looking series of events is the fact that main memory has always been slower than the processor itself. Step (2) often introduces a lengthy (in CPU terms) delay while the data arrives over the computer bus. A considerable amount of research has been put into designs that avoid these delays as much as possible. This often requires complex circuitry and was at one time found only on hand-wired supercomputer designs, but as manufacturing processes have improved, such techniques have become a common feature of almost all designs.


RISC

The basic concept of RISC is to pin down exactly what step 2 does. In older processor designs, now retroactively known as CISC, instructions were offered in a number of different addressing modes, which meant that step 2 took an unpredictable length of time to complete. In RISC, almost all instructions come in exactly one mode that reads data from one place: the registers. The addressing modes are instead handled by the compiler, which emits code to load the data into the registers and store it back out. For this reason the term load-store is often used to describe this design philosophy; there are many processors with limited instruction sets that are not really RISC.
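

A purely illustrative C fragment makes the load-store decomposition concrete: the single statement a[i] = a[i] + 1 is written out as an explicit load, a register-only ALU operation, and a store, mirroring what a compiler for a hypothetical load-store machine would emit.

    #include <stdio.h>

    int main(void)
    {
        int a[4] = {7, 8, 9, 10};
        int i = 2;

        int r1 = a[i];        /* load:  memory -> register          */
        r1 = r1 + 1;          /* ALU:   operates on registers only  */
        a[i] = r1;            /* store: register -> memory          */

        printf("a[2] = %d\n", a[2]);   /* 10 */
        return 0;
    }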


This change has two side effects. One is that the resulting logic core is much smaller, largely because steps 1 and 2 become much simpler. The other is that step 2 always takes one cycle, which also reduces the complexity of the overall chip design; it would otherwise require complex interlocks to ensure the processor completes one instruction before starting the next. For any given level of performance, a RISC design will have a much smaller "gate count" (number of transistors), the main driver of overall cost; in other words, a fast RISC chip is much cheaper than a fast CISC chip.


The downside is that programs become much longer because the compiler has to write out explicit instructions for memory handling; the "code density" is lower. This increases the number of instructions that have to be read over the computer bus. When RISC was first being introduced, there were arguments that the increased bus traffic would overwhelm these gains, and that such designs would actually be slower. In theory this might be true, but the real point of RISC was to allow instruction pipelines to be built much more easily.


Instruction pipelining

Main article: instruction pipeline

One of the first, and most powerful, techniques to improve performance is the instruction pipeline. Early microcoded designs would carry out all of the steps above for one instruction before moving on to the next, so large portions of the circuitry were left idle at any one step; for instance, the instruction decoding circuitry would be idle during execution.


Pipelines improve performance by allowing a number of instructions to work their way through the processor at the same time. In the same basic example, the processor would start to decode (step 1) a new instruction while the last one was still waiting for results. This would allow up to four instructions to be "in flight" at one time, making the processor look four times as fast. Although any one instruction still takes just as long to complete (there are still four steps), the CPU as a whole "retires" instructions much faster and can be run at a much higher clock speed.
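

The cycle counts behind this claim can be checked with a few lines of C; the four-stage depth and the ten-instruction program length below are arbitrary example numbers.

    #include <stdio.h>

    int main(void)
    {
        const int stages = 4;
        const int instructions = 10;

        /* Unpipelined: every instruction occupies the machine for all
           four steps before the next one starts.                        */
        int serial_cycles = instructions * stages;

        /* Pipelined: once the first instruction fills the pipeline, one
           instruction is retired every cycle.                           */
        int pipelined_cycles = stages + (instructions - 1);

        printf("serial:    %d cycles\n", serial_cycles);     /* 40 */
        printf("pipelined: %d cycles\n", pipelined_cycles);  /* 13 */
        return 0;
    }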


RISC makes pipelines smaller and much easier to construct by cleanly separating each stage of the instruction process and making the stages take the same amount of time: one cycle. The processor as a whole operates in an assembly line fashion, with instructions coming in one side and results out the other. Thanks to the reduced complexity of the classic RISC pipeline, the pipelined core and an instruction cache could be placed on the same size die that would otherwise hold the core alone of a CISC design. This was the real reason RISC was faster; early designs like the SPARC and MIPS often ran over 10 times as fast as Intel and Motorola CISC solutions at the same clock speed and price.


Pipelines are by no means limited to RISC designs. By 1986 the top-of-the-line VAX (the 8800) was a heavily pipelined design, slightly predating the first commercial MIPS and SPARC designs. Most modern CPUs (even embedded CPUs) are now pipelined, and microcoded CPUs with no pipelining are seen only in the most area-constrained embedded processors. Large CISC machines, from the VAX 8800 to the modern Pentium 4 and Athlon, are implemented with both microcode and pipelines. Improvements in pipelining and caching are the two major microarchitectural advances that have enabled processor performance to keep pace with the circuit technology on which they are based.


Cache

It was not long before improvements in chip manufacturing allowed even more circuitry to be placed on the die, and designers started looking for ways to use it. One of the most common was to add an ever-increasing amount of cache memory on-die. Cache is simply very fast memory that can be accessed in a few cycles, as opposed to the "many" needed to talk to main memory. The CPU includes a cache controller which automates reading and writing from the cache: if the data is already in the cache it simply "appears", whereas if it is not, the processor is "stalled" while the cache controller reads it in.
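

The C sketch below models a very small direct-mapped cache to show the hit/miss behaviour the controller implements; the line count, line size and addresses are invented, and a real controller would of course also move the data itself.

    #include <stdio.h>

    #define LINES 8
    #define LINE_BYTES 16

    struct line { int valid; unsigned tag; };
    static struct line cache[LINES];

    /* Returns 1 on a hit; on a miss the controller would stall the CPU
       while the line is fetched, then fill the entry as done here.      */
    static int access(unsigned addr)
    {
        unsigned index = (addr / LINE_BYTES) % LINES;
        unsigned tag   = addr / (LINE_BYTES * LINES);
        if (cache[index].valid && cache[index].tag == tag)
            return 1;                          /* data "appears" at once   */
        cache[index].valid = 1;                /* miss: fetch and fill     */
        cache[index].tag = tag;
        return 0;
    }

    int main(void)
    {
        unsigned addrs[] = {0x100, 0x104, 0x100, 0x200, 0x100};
        for (int i = 0; i < 5; i++)
            printf("0x%03x -> %s\n", addrs[i], access(addrs[i]) ? "hit" : "miss");
        return 0;                              /* miss, hit, hit, miss, miss */
    }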


RISC designs started adding cache in the mid-to-late 1980s, often only 4 KB in total. This number grew over time, and typical CPUs now have about 512 KB, while more powerful CPUs come with 1 or 2 MB, organized in multiple levels of a memory hierarchy. Generally speaking, more cache means more speed.


Superscalar

Even with all of the added complexity and gates needed to support the concepts outlined above, improvements in chip manufacturing soon left room to spare on the die. This led to the rise of superscalar processors in the early 1990s: processors that could run more than one instruction at once.


In the outline above the processor works on parts of a single instruction at a time. If one were simply to place two entire cores on a die, the processor would be able to run two instructions at once. However, this is not actually required, because in the average program certain instructions are much more common than others. For instance, the load-store instructions on a RISC design are more common than floating point, so building two complete cores is not as efficient a use of space as building two load-store units and only one floating-point unit.


In modern designs it is common to find two load units, one store unit (many instructions have no results to store), two or more integer math units, two or more floating point units, and often a SIMD unit of some sort. The decoder grows in complexity: it reads in a long run of instructions from memory and hands them off to whichever units are idle at that point. The results are then collected and re-ordered at the end.
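

A simple C model of this per-cycle dispatch is sketched below. The unit mix, the issue width of four, and the instruction sequence are invented, and dependences between instructions are ignored; the point is only that issue stops when no matching unit (or issue slot) remains free in the current cycle.

    #include <stdio.h>

    enum kind { LOAD, STORE, INT_ALU, FP };
    static const int units[] = {2, 1, 2, 1};   /* load, store, integer, FP units */

    int main(void)
    {
        enum kind program[] = {LOAD, INT_ALU, LOAD, FP, INT_ALU, STORE, LOAD};
        int n = (int)(sizeof program / sizeof program[0]);
        int pc = 0, cycle = 0;

        while (pc < n) {
            int avail[4] = {units[0], units[1], units[2], units[3]};
            int issued = 0;
            /* Issue in order, at most four per cycle (an arbitrary width),
               stopping at the first instruction with no free matching unit. */
            while (pc < n && issued < 4 && avail[program[pc]] > 0) {
                avail[program[pc]]--;
                pc++;
                issued++;
            }
            printf("cycle %d: issued %d instruction(s)\n", ++cycle, issued);
        }
        return 0;
    }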


Out-of-order execution

The addition of caches reduces the frequency or duration of stalls due to waiting for data to be fetched from the memory hierarchy, but does not get rid of these stalls entirely. In early designs a cache miss would force the cache controller to stall the processor and wait. Of course there may be some other instruction in the program whose data is available in the cache at that point. Out-of-order execution allows that ready instruction to be processed while the processor waits on the cache, then re-orders the results to make it appear that everything happened in the normal order.
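

The C sketch below captures just the selection step: while a load waits on a cache miss, later instructions whose operands are marked ready are executed first. The instruction names, readiness flags and the three-cycle miss are all invented, and the re-ordering of results at retirement is omitted.

    #include <stdio.h>

    struct insn { const char *name; int ready; int done; };

    int main(void)
    {
        struct insn window[] = {
            {"load  r1, [a]", 0, 0},   /* cache miss: not ready for a while */
            {"add   r4, r5",  1, 0},   /* independent of the load           */
            {"mul   r6, r7",  1, 0},   /* independent of the load           */
            {"add   r2, r1",  0, 0},   /* needs r1, so waits on the load    */
        };
        int n = 4;

        for (int cycle = 1; cycle <= 4; cycle++) {
            if (cycle == 3) {          /* the missing line finally arrives  */
                window[0].ready = 1;
                window[3].ready = 1;   /* r1 now available (simplified)     */
            }
            for (int i = 0; i < n; i++) {
                if (window[i].ready && !window[i].done) {
                    printf("cycle %d: execute %s\n", cycle, window[i].name);
                    window[i].done = 1;
                    break;             /* one instruction per cycle here    */
                }
            }
        }
        return 0;
    }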


Speculative execution

One problem with an instruction pipeline is that there is a class of instructions that must make their way entirely through the pipeline before execution can continue. In particular, conditional branches need to know the result of some prior instruction before "which side" of the branch to run is known. For instance, an instruction that says "if x is larger than 5 then do this, otherwise do that" will have to wait for the result of x to be known before it knows whether the instructions for this or that should be fetched.


For a small four-deep pipeline this means a delay of up to three cycles (the decode can still happen). But as clock speeds increase, the depth of the pipeline increases with them, and modern processors may have 20 stages or more. In this case the CPU is stalled for the vast majority of its cycles every time one of these instructions is encountered.


The solution, or one of them, is speculative execution, also known as branch prediction. In reality one side or the other of the branch will be taken much more often than the other, so it is often correct to simply go ahead and say "x will likely be smaller than five, start processing that". If the prediction turns out to be correct, a huge amount of time is saved. Modern designs have rather complex prediction systems, which watch the results of past branches to predict the future with greater accuracy.
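

One common textbook scheme, a 2-bit saturating counter per branch, is sketched in C below for a single branch with an invented outcome pattern; real predictors track many branches and are considerably more elaborate.

    #include <stdio.h>

    static int counter = 2;                  /* 0..1 predict not-taken, 2..3 predict taken */

    static int predict(void) { return counter >= 2; }

    static void update(int taken)
    {
        if (taken  && counter < 3) counter++;
        if (!taken && counter > 0) counter--;
    }

    int main(void)
    {
        /* A loop branch that is taken many times and then falls through. */
        int outcomes[] = {1, 1, 1, 1, 1, 1, 1, 0};
        int correct = 0;

        for (int i = 0; i < 8; i++) {
            if (predict() == outcomes[i])
                correct++;
            update(outcomes[i]);
        }
        printf("%d of 8 predictions correct\n", correct);   /* 7 of 8 */
        return 0;
    }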


Multiprocessing and multithreading

Computer architects have become stymied by the growing mismatch between CPU operating frequencies and DRAM access times. None of the techniques that exploited instruction-level parallelism within one program could make up for the long stalls that occurred when data had to be fetched from main memory. For this reason, newer generations of computers have started to exploit higher levels of parallelism that exist outside of a single program or program thread.


This trend is sometimes known as throughput computing. This idea originated in the mainframe market where online transaction processing emphasized not just the execution speed of one transaction, but the capacity to deal with massive numbers of transactions. With transaction-based applications such as network routing and web-site serving greatly increasing in the last decade, the computer industry has re-emphasized capacity and throughput issues.


One technique for achieving this parallelism is multiprocessing: building computer systems with multiple CPUs. Once reserved for high-end mainframes and supercomputers, small-scale (2-8 CPU) multiprocessor servers have become commonplace in the small business market. For large corporations, large-scale (16-256 CPU) multiprocessors are common. Even personal computers with multiple CPUs have appeared since the 1990s.


With further transistor-size reductions made available by semiconductor technology advances, chip-level multiprocessing (CMP) has appeared, in which multiple CPUs are implemented on the same silicon chip. It was initially used in chips targeting embedded markets, where simpler and smaller CPUs allow multiple instantiations to fit on one piece of silicon. By 2005, semiconductor technology allowed dual high-end desktop CPUs to be manufactured in volume on a single CMP chip. Some designs, such as Sun Microsystems' UltraSPARC T1, have reverted to simpler (scalar, in-order) cores in order to fit more processors on one piece of silicon.


Another technique that has become more popular recently is multithreading. In multithreading, when the processor has to fetch data from slow system memory, instead of stalling for the data to arrive, the processor switches to another program or program thread which is ready to execute. Though this does not speed up a particular program/thread, it increases the overall system throughput by reducing the time the CPU is idle.


Conceptually, multithreading is equivalent to a context switch at the operating system level. The difference is that a multithreaded CPU can do a thread switch in one CPU cycle instead of the hundreds or thousands of CPU cycles a context switch normally requires. This is achieved by replicating the state hardware (such as the register file and program counter) for each active thread.
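

The C sketch below models switch-on-miss multithreading with two hardware threads: each has its own program counter and register file (here just small structs), so a "switch" is only a change of index and nothing is copied. The miss pattern and cycle count are invented for the example.

    #include <stdio.h>

    struct hw_thread { int pc; int regs[4]; };   /* replicated per-thread state */

    int main(void)
    {
        struct hw_thread t[2] = { {0, {0}}, {100, {0}} };
        int current = 0;

        for (int cycle = 1; cycle <= 6; cycle++) {
            /* Invented miss pattern: thread 0 misses on every third instruction. */
            int miss = (current == 0) && (t[0].pc % 3 == 2);
            if (miss) {
                current = 1;                     /* select the other state bank;  */
                printf("cycle %d: thread 0 stalls, switch to thread 1\n", cycle);
                continue;                        /* nothing is copied             */
            }
            t[current].regs[0]++;                /* "execute" one instruction     */
            t[current].pc++;
            printf("cycle %d: thread %d at pc %d\n", cycle, current, t[current].pc);
        }
        return 0;
    }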


A further enhancement is simultaneous multithreading. This technique allows superscalar CPUs to execute instructions from different programs/threads simultaneously in the same cycle.


