# Solution Paper – II

# **High Performance Computer Architecture**

# Q. 1. Explain the machine architecture of the 8085 processor.

Ans. The 8085 is an 8-bit general-purpose microprocessor that can address up to 64K bytes of memory. It requires a single +5 V supply and can operate with a 3 MHz single-phase clock. The processor has an 8-bit data bus and a 16-bit address bus, and the data bus is multiplexed with the address bus.

That means the 8-bit data bus is also used as the lower 8 bits of the address bus whenever an address has to be placed on the address bus.



The arithmetic logic unit includes an 8-bit accumulator, an 8-bit temporary register, arithmetic and logic circuits, and five flags. These five flags (Sign, Zero, Auxiliary Carry, Parity and Carry) indicate conditions such as a zero result or a carry that arise during arithmetic and logical operations. The processor has six general-purpose registers named B, C, D, E, H and L. These registers can be combined in pairs as BC, DE and HL to perform 16-bit operations. The accumulator is named A, and one of the operands of an instruction may reside in it. The stack pointer and program counter are 16-bit registers. The stack pointer (SP) is used by the programmer to maintain a stack in memory. The program counter (PC) keeps track of the memory address of the instruction to be executed next. The increment/decrement address latch is also 16 bits wide.
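To make the five flags concrete, here is a minimal Python sketch of how an 8-bit ADD would set them. It assumes the usual 8085 flag-register layout (S, Z, AC, P and CY in bits D7, D6, D4, D2 and D0, with the unused bits simply left at 0 in this model). The function names `add8` and `pair` are hypothetical; they model the behavior, not the processor's internal circuitry.

```python
# Bit positions of the five 8085 flags in the flag register (D7..D0 = S Z - AC - P - CY).
S, Z, AC, P, CY = 7, 6, 4, 2, 0


def add8(a: int, b: int) -> tuple[int, int]:
    """Model ADD: return (new accumulator, flag byte) for A + operand, both 8-bit."""
    total = a + b
    result = total & 0xFF                 # the accumulator keeps only 8 bits
    flags = 0                             # unused flag bits stay 0 in this model
    if result & 0x80:
        flags |= 1 << S                   # Sign: copy of bit 7 of the result
    if result == 0:
        flags |= 1 << Z                   # Zero: result is 0
    if (a & 0x0F) + (b & 0x0F) > 0x0F:
        flags |= 1 << AC                  # Auxiliary carry: carry out of bit 3
    if bin(result).count("1") % 2 == 0:
        flags |= 1 << P                   # Parity: set for an even number of 1s
    if total > 0xFF:
        flags |= 1 << CY                  # Carry: carry out of bit 7
    return result, flags


def pair(high: int, low: int) -> int:
    """Combine two 8-bit registers (e.g. H and L) into one 16-bit register pair."""
    return (high << 8) | low
```

For example, `add8(0x9B, 0x75)` returns `(0x10, 0x11)`: the 8-bit result wraps to 10H, and the carry and auxiliary-carry flags are set. `pair(0x20, 0x50)` gives 2050H, the value an HL pair holds when H = 20H and L = 50H.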

The instructions of the processor can be classified into the following categories:

- 1. Data transfer
- 2. Arithmetic operations
- 3. Logical operations
- 4. Branching operations.
- 5. Machine control operations
- 6. Assembler directives.

# Q. 2. Write a note on: (a) VLIW architecture (b) Superscalar processor.

Ans. **(a) VLIW Architecture:** Very long instruction word (VLIW) architecture is a modification of the superscalar approach; it implements instruction-level parallelism (ILP). A VLIW processor fetches a very long instruction word containing several operations and dispatches them for parallel execution on different functional units. The instruction width varies from 128 to 1024 bits. VLIW architecture uses static (compile-time) scheduling, whereas superscalar architecture uses dynamic (run-time) scheduling. That means VLIW provides a fixed plan, worked out by the compiler, for the execution of instructions by the functional units. Due to static scheduling, a VLIW architecture can handle up to eight operations per clock cycle, while a superscalar architecture can handle up to five operations at a time. VLIW requires the compiler to have complete knowledge of the hardware, such as the processor and its functional units. It is fast but inefficient for object-oriented and event-driven programming, where superscalar architecture is used instead. Hence VLIW and superscalar architectures are each important in different respects.

**(b) Superscalar Processor:** A scalar processor executes one instruction on one set of operands at a time. A superscalar architecture allows multiple instructions to execute at the same time in different pipelines. Multiple processing elements work on different instructions at the same time, and pipelining is also implemented in each processing element.

The instruction fetch unit fetches multiple instructions at a time from the cache. The instruction decode unit checks that these instructions are independent so that they can be executed in parallel. There must be multiple execution units so that multiple instructions can be executed at the same time. The slowest of the fetch, decode and execute stages determines the overall performance of the system. Ideally these three stages should be equally fast; in practice the execution stage is the slowest and drastically affects the performance of the system.
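The difference between the two scheduling styles can be sketched in Python. As an assumption of this sketch, each instruction is a `(destination, sources)` tuple of register names, and two instructions are treated as dependent if one reads or overwrites a register the other writes (RAW, WAW and WAR hazards). The hypothetical `vliw_bundles` plays the compiler's role, packing the whole program into fixed-width long words ahead of time and padding with NOPs; the hypothetical `superscalar_issue` models the decode unit checking, cycle by cycle, whether the next fetched instructions are independent. The issue model is deliberately simple (in order, no register renaming), so it illustrates the idea rather than any real processor.

```python
def conflicts(earlier, later):
    """True if `later` depends on `earlier` (RAW, WAW or WAR on a register)."""
    e_dst, e_src = earlier
    l_dst, l_src = later
    return e_dst in l_src or e_dst == l_dst or l_dst in e_src


def vliw_bundles(program, width):
    """Static scheduling (the VLIW compiler's job): pack operations into
    long instruction words of `width` slots before the program runs."""
    bundles = []   # each bundle is one long instruction word
    placed = []    # bundle index chosen for each operation, in program order
    for i, op in enumerate(program):
        # Earliest legal bundle: just after every earlier operation it depends on.
        earliest = 0
        for j in range(i):
            if conflicts(program[j], op):
                earliest = max(earliest, placed[j] + 1)
        # Skip bundles whose slots are already full.
        k = earliest
        while k < len(bundles) and len(bundles[k]) >= width:
            k += 1
        if k == len(bundles):
            bundles.append([])
        bundles[k].append(op)
        placed.append(k)
    # Unused slots in a long word are filled with NOPs.
    return [b + ["nop"] * (width - len(b)) for b in bundles]


def superscalar_issue(program, width):
    """Dynamic scheduling (in-order superscalar): each cycle the decode unit
    issues the oldest instruction plus following ones while they are independent."""
    cycles = []
    i = 0
    while i < len(program):
        group = [program[i]]   # the oldest waiting instruction always issues
        i += 1
        while (i < len(program) and len(group) < width
               and not any(conflicts(g, program[i]) for g in group)):
            group.append(program[i])
            i += 1
        cycles.append(group)
    return cycles
```

For the program `[("r1", ("r2", "r3")), ("r4", ("r1", "r5")), ("r6", ("r7", "r8")), ("r9", ("r6", "r2"))]` and a width of 2, the static scheduler produces two full long words because it is free to move the independent third instruction up, whereas the simple in-order issue model needs three cycles (groups of 1, 2 and 1 instructions).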

# Q. 3. How many cycles are required to execute an instruction on the 8086, 8088, Intel 286, 386, 486, Pentium, K6 series, Pentium II/III/4/Celeron, and Athlon/Athlon XP/Duron?

**Ans.** The time required to execute instructions on the different processors is as follows:

- 8086 and 8088: These take an average of about 12 cycles to execute a single instruction.
- 286 and 386: These improve the rate to about 4.5 cycles per instruction.
- 486: The 486 and most other fourth-generation Intel-compatible processors, such as the AMD 5x86, drop the rate further, to about 2 cycles per instruction.
- Pentium, K6 series: The Pentium architecture and other fifth-generation Intel-compatible processors, such as those from AMD and Cyrix, include twin instruction pipelines and other improvements that allow them to execute one or two instructions per cycle.
- Pentium Pro, Pentium II/III/4/Celeron, and Athlon/Athlon XP/Duron: These P6 and P7 (sixth- and seventh-generation) processors can execute three or more instructions per cycle.
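The effect of these figures can be sketched with the basic relation CPU time = instruction count × CPI / clock rate. In the Python sketch below, the CPI values are the approximate ones quoted above (taking one instruction per cycle for the Pentium class and three per cycle for P6/P7). The 1-million-instruction workload and the fixed 100 MHz clock are hypothetical, chosen only to isolate the effect of CPI; real processors of these generations also ran at very different clock rates.

```python
# Approximate cycles per instruction (CPI) quoted in the text above.
# Pentium/K6 uses the conservative end (1 instruction per cycle); P6/P7 uses 3 per cycle.
CPI = {
    "8086/8088": 12.0,
    "286/386": 4.5,
    "486": 2.0,
    "Pentium/K6": 1.0,
    "P6/P7": 1 / 3,
}


def execution_time(instructions, cpi, clock_hz):
    """CPU time in seconds = instruction count x cycles per instruction / clock rate."""
    return instructions * cpi / clock_hz


if __name__ == "__main__":
    n, f = 1_000_000, 100e6   # hypothetical workload: 1 million instructions at 100 MHz
    for name, cpi in CPI.items():
        t = execution_time(n, cpi, f)
        print(f"{name:12s} CPI={cpi:5.2f}  time={t * 1e3:7.2f} ms")
```

At the same clock rate, the 8086 figure of 12 CPI makes the workload take six times as long as the 486 figure of 2 CPI.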

#### Q. 4. Compare SIMD and MIMD machines.

Ans. **SIMD:** A SIMD machine executes one instruction on multiple data items at a time. All processors receive the same instruction from the control unit and apply it to different data items, so a single control unit handles multiple execution units.

**MIMD:** An MIMD computer executes multiple instructions on multiple data streams, so this type of computer involves multiprocessing. MIMD involves multiple control units, multiple processing units and multiple execution units. These computers provide the highest level of parallelism by having multiple processors.
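The distinction can be modeled in Python. This is a minimal sketch of the semantics only: the hypothetical `simd_step` stands for one control unit broadcasting a single operation to every data item, and the hypothetical `mimd_run` gives each "processor" its own program and data and runs them concurrently on a thread pool. CPython threads do not give real parallel speed-up for CPU-bound work, so the point is the shape of the computation, not performance.

```python
from concurrent.futures import ThreadPoolExecutor


def simd_step(op, lanes):
    """SIMD: one control unit broadcasts a single instruction `op`;
    every processing element applies it to its own data item."""
    return [op(x) for x in lanes]


def mimd_run(programs):
    """MIMD: each processor has its own instruction stream (a separate
    function here) working on its own data, and all of them run concurrently."""
    with ThreadPoolExecutor(max_workers=len(programs)) as pool:
        futures = [pool.submit(prog, data) for prog, data in programs]
        return [f.result() for f in futures]
```

`simd_step(lambda x: x * 2, [1, 2, 3, 4])` returns `[2, 4, 6, 8]` (one instruction, many data items), whereas `mimd_run([(sum, [1, 2, 3]), (max, [4, 9, 2])])` runs two different instruction streams on two different data sets and returns `[6, 9]`.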

# Q. 5. What is meant by a hierarchical bus system for a multiprocessing system?

**Ans.** When a system's workload cannot be handled satisfactorily by a single processor, one response is to apply multiple processors to the problem; this is known as a multiprocessing environment. There are two types of multiprocessing system:

- 1. Symmetrical multiprocessing system
- 2. Asymmetric multiprocessing system.

Hierarchical buses are used in both types of multiprocessing system.

- 1. **Symmetrical multiprocessing system**: In this system, all of the processors are essentially identical and perform identical functions. The characteristics of such a system are:
- (a) Any processor can initiate an I/O operation, can handle any external interrupt, and can run any process in the system.
- (b) All of the processors are potentially available to handle whatever needs to be done next.



Fig. Buses in Asymmetric Multiprocessing

- 2. **Asymmetric multiprocessing system**: Asymmetry implies imbalance, or some difference between processors. Thus, in asymmetric multiprocessing, different processors do different things. From a design point of view, it is often implemented so that one processor's job is to control the rest, i.e. it is the supervisor of the others. Some advantages and disadvantages of this approach are:
- (a) In some situations, I/O operations or application program processing may be faster because they do not have to contend with other operations or programs for access to a processor, i.e. many processors may be available for a single job.
- (b) In other situations, I/O operations or application program processing can be slowed down because not all of the processors are available to handle peak loads.
- (c) In asymmetric multiprocessing, if the supervisor processor handling a specific task fails, the entire system goes down.



Fig. Buses in Symmetric Multiprocessing

#### Q. 6. Write a short note on:

- (A) Parallel computing
- (B) Distributed computing
- (C) Serial and parallel interface

**Ans.** (a) **Parallel computing**: Parallel computing provides simultaneous data processing to increase the computation speed of a computer. An important goal of computer architecture is to attain high performance, and implementing parallelism in a uniprocessor or across multiple processors can enhance the performance of a computer. Concurrency in a uniprocessor or a superscalar processor, in terms of both hardware and software implementation, can lead to faster execution of programs. Real-time applications require fast responses from the computer.

There are three main techniques for implementing parallel processing:

- (a) Multiprocessor system.
- (b) Pipelining.
- (c) Vector processing or computing.

(b) **Distributed computing**: In a distributed system, each processor has its own local memory and clock rather than sharing a memory or a clock as in a parallel system. The processors communicate with one another through various communication media, such as telephone lines and high-speed buses. Such systems are referred to as loosely coupled systems. The processors may be referred to as sites, and they range from workstations and minicomputers to large general-purpose computers.

Several factors motivate building such systems:

- (a) **Resource sharing**: Users at different locations can share resources; for example, a user at location A can use a printer at location B, or vice versa.
- (b) **Computation speed**: When a single computational job is divided into a number of sub-jobs that execute concurrently, the job naturally completes faster.
- (c) **Reliability**: The failure of one processor merely slows the system down, but if each system is assigned a pre-specified task, the failure of one system can halt the whole system. One way to overcome this problem is to use another processor that works as a backup.
- (d) **Communication**: Programs at different sites may need to exchange information, so they communicate with one another, for example via electronic mail.

(c) **Serial and parallel interface:** In a parallel interface, there are parallel, independent data lines from the I/O port to the device. In addition to these lines, a few other optional lines are needed to synchronize the transfer of data. Serial interfaces have only a single data line for bit-by-bit data transfer. Every serial interface has a shift register that converts serial data into parallel form and vice versa. Serial interfaces are used for longer-distance communication and can connect two computers, or a computer and a remote device. A serial interface standard such as RS-232-C allows interconnection of devices up to about 15 meters (50 feet) apart.
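The shift-register conversion described above can be sketched in Python. The sketch assumes the common asynchronous frame used on RS-232 links: one start bit (0), eight data bits sent least-significant bit first, and one stop bit (1), with no parity. The function names are hypothetical.

```python
def to_serial(byte):
    """Parallel-to-serial: shift an 8-bit value out one bit at a time, LSB first."""
    bits = []
    for _ in range(8):
        bits.append(byte & 1)   # bit currently at the output end of the shift register
        byte >>= 1              # shift the register right by one position
    return bits


def from_serial(bits):
    """Serial-to-parallel: shift 8 incoming bits (LSB first) into an 8-bit register."""
    value = 0
    for bit in bits:
        value = (value >> 1) | (bit << 7)   # new bit enters at the top, older bits move down
    return value


def async_frame(byte):
    """Asynchronous frame: start bit (0), 8 data bits LSB first, one stop bit (1)."""
    return [0] + to_serial(byte) + [1]
```

For the character 'A' (41H), `async_frame(0x41)` yields `[0, 1, 0, 0, 0, 0, 0, 1, 0, 1]`: ten bit times on the line for eight bits of data, and `from_serial` recovers 41H from the eight data bits.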

The two types of serial interface are the asynchronous serial interface and the synchronous serial interface.

Another type of interface is needed in a large number of applications, where analog information has to be converted to digital form for input to the computer system. Similarly, where the output to a device has to be a smoothly varying voltage or current, the digital output needs to be converted into analog form. A variety of devices are needed to convert physical quantities such as pressure, temperature and light intensity into electrical signals; these converting devices are called transducers. The output of a transducer is a smoothly varying electrical signal, which is converted into discrete digital patterns by an A/D (analog-to-digital) converter.
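The A/D step can be illustrated with an ideal converter model in Python. The 0 to 5 V input range, the 8-bit resolution and truncation to the lower quantization step are assumptions chosen for the example; real converters (successive-approximation, flash and others) add their own timing and error behavior. The hypothetical `dac` function shows the reverse, digital-to-analog, direction mentioned above.

```python
def adc(voltage, vref=5.0, bits=8):
    """Ideal n-bit A/D conversion: map 0..vref volts onto codes 0..2**bits - 1."""
    levels = 2 ** bits
    code = int(voltage / vref * levels)    # truncate to the quantization step below
    return max(0, min(code, levels - 1))   # clamp out-of-range inputs


def dac(code, vref=5.0, bits=8):
    """Ideal D/A conversion: the analog voltage represented by `code`."""
    return code * vref / 2 ** bits
```

With these settings one step is 5/256 V, about 19.5 mV, so `adc(2.5)` gives code 128 and `dac(128)` gives back 2.5 V, while an input of 1.0 V truncates to code 51.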

# Q. 7. What do you mean by memory hierarchy? Briefly discuss.

Ans. Memory is technically any form of electronic storage. Personal computer systems have a hierarchical memory structure consisting of auxiliary memory (disks), main memory (DRAM) and cache memory (SRAM). A design objective of computer system architects is to have the memory hierarchy work as though it were entirely composed of the fastest memory type in the system.



Fig. Memory Hierarchy.
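One way to see how a hierarchy can approach the speed of its fastest level is to compute its average access time. The Python sketch below assumes a simple model in which an access tries each level in turn, pays that level's access time, and moves down only on a miss; the function name is hypothetical, and the access times and hit ratios in the example are round illustrative numbers, not measurements.

```python
def effective_access_time(levels):
    """Average access time of a memory hierarchy.

    `levels` is a list of (access_time_ns, hit_ratio) pairs from fastest to
    slowest; the last level must always hit (hit_ratio = 1.0)."""
    total = 0.0
    reach = 1.0                          # fraction of accesses that get this far down
    for access_time, hit_ratio in levels:
        total += reach * access_time     # every access reaching this level pays its time
        reach *= (1.0 - hit_ratio)       # only misses continue to the next level
    return total
```

With a 1 ns cache that hits 95% of the time in front of 50 ns main memory, `effective_access_time([(1, 0.95), (50, 1.0)])` is about 3.5 ns, far closer to cache speed than to DRAM speed. Adding a 5 ms disk behind a main memory that hits 99.99% of the time, `[(1, 0.95), (50, 0.9999), (5_000_000, 1.0)]`, raises the average to about 28.5 ns, which shows why misses to the slowest level must be rare.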

# Q. 8. What is virtual memory? Explain address mapping using pages.

**Ans.** Virtual memory was developed to automate the movement of program code and data between main memory and secondary storage to give the appearance of a single large store.

- 1. This technique greatly simplified the programmer's job, particularly when program code and data exceeded the size of main memory. The basic technology proved readily adaptable to modern multiprogramming environments, which, in addition to a "virtual" single-level memory, also require support for large address spaces, process protection, address space organization and the execution of processes only partially residing in memory.
- 2. Consequently, virtual memory has become widely used, and most processors have hardware to support it.
- 3. The virtual memory is stored in a hard-disk image, and the physical memory holds a small number of virtual pages in physical page frames. Our focus is on the mechanisms and structures popular in today's operating systems and microprocessors, which are geared toward demand-paged virtual memory.

**Address mapping using pages:** Virtual memory keeps only the most frequently used portions of an address space in main memory and retrieves other portions from disk as needed. The virtual memory space is divided into uniform virtual pages, each identified by a virtual page number, and these are mapped to page frames, as shown in the figure. The physical memory is divided into uniform page frames, each identified by a page frame number. The page frames are so named because they frame, or hold, a page's data. At its simplest, then, virtual memory is a mapping of virtual page numbers to page frame numbers.

The mapping is a function, i.e. a given virtual page can have only one physical location. However, the inverse mapping, from page frame numbers to virtual page numbers, is not necessarily a function, and thus it is possible to have several virtual pages mapped to the same page frame.

The table implementation of the address mapping is simplified if the information in the address space and the memory space are each divided into groups of fixed size. The physical memory is broken down into groups of equal size called blocks. The term page refers to groups of address space of the same size. Programs are also considered to be split into pages. Portions of a program are moved from auxiliary memory to main memory in records equal to the size of a page. The term 'page frame' is sometimes used to denote a block. A simple mapping between virtual and physical memory is shown in the figure.

Let us illustrate with an address space (virtual memory) of 8K words and a memory space (physical memory) of 4K words. If we split each into groups of 1K words, we get eight pages and four page frames, as shown in the figure. At any given time, up to four pages of virtual memory may reside in main memory, each in any one of the four page frames.



Fig. Relationship between address space of virtual memory and physical memory.

**Address Mapping Using a Memory Page Table**

The organization of the memory mapping table is shown in the next figure. The memory page table in a paged system consists of eight words, one for each page. The address in the page table specifies the page number, and the content of the word gives the frame number where the page is stored in main memory. The table shows that pages 1, 3, 6 and 7 are now available in main memory in page frames 1, 2, 0 and 3, respectively. A presence bit in each location indicates whether the page has been transferred from auxiliary memory into main memory: a '1' in the presence bit indicates that the page is available in main memory, and a '0' indicates that it is not.

The CPU references a word in memory with a 13-bit virtual address. The three high-order bits of the virtual address specify a page number and also an address for the memory page table. The content of the word in the memory page table at the page-number address is read out into the memory table buffer register. If the presence bit is 1, the frame number thus read is transferred to the two high-order bits of the main memory address register, while the ten low-order bits of the virtual address (the line number within the page) form the ten low-order bits of that register. A read signal to main memory then transfers the content of the word to the memory buffer register, where it is ready to be used by the CPU. If the presence bit is 0, the word referenced by the virtual address does not reside in main memory. A call to the OS (operating system) is then generated to fetch the required page from auxiliary memory and transfer it into main memory before computation resumes.
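The translation steps above can be traced in a short Python sketch that hard-codes the example's page table (pages 1, 3, 6 and 7 resident in frames 1, 2, 0 and 3). It assumes 1K-word pages, so a 13-bit virtual address splits into a 3-bit page number and a 10-bit line number, and the physical address is the 2-bit frame number followed by the same 10-bit line. `PageFault` is a hypothetical stand-in for the call to the operating system, and the frame numbers stored for absent pages are placeholders that are never used.

```python
PAGE_BITS = 10   # 1K words per page -> 10-bit line number within the page

# Memory page table from the example: one (presence_bit, frame_number) entry per page.
# Pages 1, 3, 6 and 7 are resident in frames 1, 2, 0 and 3; other frame fields are unused.
PAGE_TABLE = [
    (0, 0), (1, 1), (0, 0), (1, 2),
    (0, 0), (0, 0), (1, 0), (1, 3),
]


class PageFault(Exception):
    """Raised when the presence bit is 0; the OS must bring the page in."""


def translate(virtual_address):
    """Map a 13-bit virtual address (0..8191) to a 12-bit physical address."""
    page = virtual_address >> PAGE_BITS               # 3 high-order bits: page number
    line = virtual_address & ((1 << PAGE_BITS) - 1)   # 10 low-order bits: line in page
    present, frame = PAGE_TABLE[page]                 # read the page-table word
    if not present:
        raise PageFault(f"page {page} is not in main memory")
    return (frame << PAGE_BITS) | line                # 2-bit frame + 10-bit line
```

For example, virtual address 6149 (page 6, line 5) maps to physical address 5 in frame 0, while any address in page 0 raises `PageFault`.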



### Q. 9. What are the different characteristics and versions of the Dhrystone benchmark?

**Ans.** The Dhrystone benchmark is used to test performance factors important in non-numeric system programming, such as operating systems, compilers and word processors.

Its main characteristics are given below:

- 1. The benchmark does not contain any floating-point operations.
- 2. The benchmark spends a considerable percentage of its time in string functions, making the test very dependent on the way such operations are performed (for example, as in-line code or as routines written in assembly language) and making it prone to manufacturers' fine-tuning of critical routines.
- 3. The benchmark contains hardly any tight loops, so with very small caches the majority of instruction accesses will be misses. Performance improves markedly once the cache reaches a critical size and can hold the main measurement loop.
- 4. The benchmark manipulates only a small amount of global data compared with Whetstone.

Versions of the Dhrystone benchmark: Dhrystone version 1.1 became a popular benchmark for CPU/compiler performance measurement, particularly for microcomputers, workstations, PCs and microprocessors. The two versions of Dhrystone are version 1.1 and version 2.1. Version 1.1 contained some dead code that could be removed by optimizing compilers; version 2.1 corrected this deficiency and is the version used in current practice. Some manufacturers still quote the (better) results of version 1.1, so care must be taken to check which version was used when comparing Dhrystone performance.
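Dhrystone results are commonly reported as "DMIPS", obtained by dividing the measured Dhrystones per second by 1757, the score of the VAX 11/780 that is conventionally rated at 1 MIPS. That normalization is not part of the text above, so treat this Python sketch (with hypothetical function names) as an illustration only; as the paragraph warns, any two scores being compared must come from the same Dhrystone version.

```python
# Dhrystones per second of the VAX 11/780, the reference machine rated at 1 MIPS.
VAX_11_780_DHRYSTONES_PER_SEC = 1757


def dmips(dhrystones_per_second):
    """Normalize a Dhrystone score to VAX MIPS (DMIPS)."""
    return dhrystones_per_second / VAX_11_780_DHRYSTONES_PER_SEC


def dmips_per_mhz(dhrystones_per_second, clock_mhz):
    """Clock-normalized score, often quoted for embedded processors."""
    return dmips(dhrystones_per_second) / clock_mhz
```

A hypothetical processor scoring 175,700 Dhrystones per second would be rated at 100 DMIPS, or 2 DMIPS/MHz at a 50 MHz clock.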

# Q. 10. To achieve a speedup of 3 on a program that originally took 78 sec to execute, what must the execution time of the program be reduced to?

**Ans**. Here we know the speedup and the execution time before the improvement. Substituting these into the formula for speedup and solving for the execution time after shows that the execution time must be reduced to 26 sec to achieve a speedup of 3.

$$\text{Speedup} = \frac{\text{Execution time before}}{\text{Execution time after}}$$

$$\text{Execution time after} = \frac{\text{Execution time before}}{\text{Speedup}} = \frac{78\ \text{sec}}{3} = 26\ \text{sec}$$