Department of Computer Science Engineering
ADVANCED COMPUTER ARCHITECTURE LECTURE NOTES
Subject Code : CS2354
Name of the subject : ADVANCED COMPUTER ARCHITECTURE
Interconnection Networks and Clusters
Interconnection Networks Media
There is a hierarchy of media to interconnect computers that varies in cost, performance, and reliability. Network media have another figure of merit, the maximum distance between nodes. This section covers three popular examples, and Figure 8.11 illustrates them.
Category 5 Unshielded Twisted Pair ("Cat5"):
The first medium is twisted pairs of copper wires. These are two insulated wires, each about 1 mm thick. They are twisted together to reduce electrical interference, since two parallel lines form an antenna but a twisted pair does not. Because they can transfer a few megabits per second over several kilometers without amplification, twisted pairs were the mainstay of the telephone system. Telephone companies bundled together (and sheathed) many pairs coming into a building. Twisted pairs can also offer tens of megabits per second of bandwidth over shorter distances, making them plausible for LANs.
The original telephone-line quality was called Level 1. Level 3 was good enough for 10 Mbits/second Ethernet. The desire for even greater bandwidth led to Level 5, or Category 5, which is sufficient for 100 Mbits/second Ethernet. By limiting the length to 100 meters, “Cat5” wiring can be used for 1000 Mbits/second Ethernet links today. It uses the RJ-45 connector, which is similar to the connector found on telephone lines.
The second medium, coaxial cable, was deployed by cable television companies to deliver a higher rate over a few kilometers. To offer high bandwidth and good noise immunity, insulating material surrounds a single stiff copper wire, and then a cylindrical conductor, often woven as a braided mesh, surrounds the insulator. A 50-ohm baseband coaxial cable delivers 10 megabits per second over a kilometer.
The third transmission medium is fiber optics, which transmits digital data as pulses of light. A fiber optic network has three components:
1 the transmission medium, a fiber optic cable;
2 the light source, an LED or laser diode;
3 the light detector, a photodiode.
First, cladding surrounds the glass fiber core to confine the light. A buffer then surrounds the cladding to protect the core and cladding. Note that unlike twisted pairs or coax, fibers are one-way, or simplex, media. A two-way, or full duplex, connection between two nodes requires two fibers.
Since light bends or refracts at interfaces, it can slowly spread as it travels down the cable unless the cable is narrowed to a few wavelengths of light, in which case it travels in a straight line. Thus, fiber optic cables are of two forms:
1. Multimode fiber: It uses inexpensive LEDs as a light source. It is much larger than the wavelength of light: typically 62.5 microns in diameter vs. the 1.3-micron wavelength of infrared light. Since it is wider, it has more dispersion problems, where some wave frequencies have different propagation velocities. The LEDs and dispersion limit it to a few hundred meters at 1000 Mbits/second or a few kilometers at 100 Mbits/second. It is older and less expensive than single-mode fiber.
2. Single-mode fiber: This “single-wavelength” fiber (typically 8 to 9 microns in diameter) requires more expensive laser diodes for light sources and currently transmits gigabits per second for hundreds of kilometers, making it the medium of choice for telephone companies. The loss of signal strength as it passes through a medium, called attenuation, limits the length of the fiber.
To achieve even more bandwidth from a fiber, wavelength division multiplexing (WDM) sends different streams simultaneously on the same fiber using different wavelengths of light, and then demultiplexes the different wavelengths at the receiver. In 2001, WDM can deliver a combined 40 Gbits/second using about 8 wavelengths, with plans to go to 80 wavelengths and deliver 400 Gbits/second.
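The WDM figures above are simple multiplication, which the following sketch makes explicit. It assumes every wavelength carries the same rate; the 5 Gbits/second per-wavelength value is derived here from 40 Gbits/second over 8 wavelengths and is not stated in the notes, and the function name is only illustrative.

```python
def wdm_aggregate_gbps(per_wavelength_gbps: float, wavelengths: int) -> float:
    """Combined bandwidth of one fiber carrying `wavelengths` independent streams.

    Assumes (for illustration) that every wavelength runs at the same rate.
    """
    return per_wavelength_gbps * wavelengths


# Derived assumption: 40 Gbits/second spread evenly over 8 wavelengths.
per_wavelength = 40 / 8

print(wdm_aggregate_gbps(per_wavelength, 8))   # the 2001 configuration
print(wdm_aggregate_gbps(per_wavelength, 80))  # the planned 80-wavelength configuration
```

Under that equal-rate assumption, scaling the wavelength count by ten scales the combined bandwidth by ten, which matches the 400 Gbits/second plan quoted above.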
Practical Issues for Commercial Interconnection Networks:
There are practical issues in addition to the technical issues described so far that are important considerations for some interconnection networks: connectivity, standardization, and fault tolerance.
The number of machines that communicate affects the complexity of the network and its protocols. The protocols must target the largest size of the network and handle the types of anomalous events that occur. Hundreds of machines communicating is a much easier problem than millions.
Connecting the Network to the Computer
Computers have a hierarchy of buses with different cost/performance. For example, a personal computer in 2001 has a memory bus, a PCI bus for fast I/O devices, and a USB bus for slow I/O devices. I/O buses follow open standards and have less stringent electrical requirements. Memory buses, on the other hand, provide higher bandwidth and lower latency than I/O buses. Where to connect the network to the machine depends on the performance goals; in practice, all LANs and WANs plug into the I/O bus.
The location of the network connection significantly affects the software interface to the network as well as the hardware. A memory bus is more likely to be cache-coherent than an I/O bus and is therefore more likely to avoid the extra cache flushes that a non-coherent connection requires to keep memory and caches consistent. DMA is the best way to send large messages. Whether to use DMA to send small messages depends on the efficiency of the interface to the DMA. The DMA interface is usually memory-mapped, and so each interaction is typically at the speed of main memory rather than of a cache access.
Standardization: Cross-Company Interoperability
LANs and WANs use standards and interoperate effectively. WANs involve many types of companies and must connect to many brands of computers, so it is difficult to imagine a proprietary WAN ever being successful. The ubiquitous nature of the Ethernet shows the popularity of standards for LANs as well as WANs, and it seems unlikely that many customers would tie the viability of their LAN to the stability of a single company.
Message Failure Tolerance
The communication system must have mechanisms for retransmission of a message in case of failure. Often this is handled in higher layers of the software protocol at the end points, requiring retransmission at the source. Given the long time of flight for WANs, they can often retransmit from hop to hop rather than relying only on retransmission from the source.
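To see why hop-to-hop retransmission helps on long paths, here is a minimal sketch under a deliberately simple model that is not part of these notes: every hop loses a message independently with the same probability `loss` (strictly less than 1), and the cost of interest is the expected number of link transmissions, which is proportional to time in flight if all links have equal delay. The function names are hypothetical.

```python
def hop_by_hop_link_sends(hops: int, loss: float) -> float:
    """Expected link transmissions when each hop retries its own link until it succeeds."""
    return hops / (1.0 - loss)


def end_to_end_link_sends(hops: int, loss: float) -> float:
    """Expected link transmissions when only the source retransmits.

    A failed attempt wastes every link it crossed before the message was lost.
    """
    if loss == 0.0:
        return float(hops)
    path_ok = (1.0 - loss) ** hops               # chance one attempt crosses every hop
    links_per_attempt = (1.0 - path_ok) / loss   # expected links crossed in one attempt
    attempts = 1.0 / path_ok                     # expected attempts until one succeeds
    return attempts * links_per_attempt


for hops in (1, 10, 30):
    print(hops, hop_by_hop_link_sends(hops, 0.1), end_to_end_link_sends(hops, 0.1))
```

With a single hop the two schemes cost the same; as the path grows, source-only retransmission repeats more and more work that had already succeeded, which is the intuition behind retransmitting from hop to hop in WANs.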
Node Failure Tolerance
The second practical issue refers to whether or not the interconnection relies on all the nodes being operational in order for the interconnection to work properly. Since software failures are generally much more frequent than hardware failures, the question is whether a software crash on a single node can prevent the rest of the nodes from communicating.
Clearly, WANs would be useless if they demanded that thousands of computers spread across a continent be continuously available, and so they all tolerate the failures of individual nodes. LANs connect dozens to hundreds of computers together, and again it would be impractical to require that no computer ever fail. All successful LANs normally survive node failures.
There are many mainframe applications (such as databases, file servers, Web servers, simulations, and multiprogramming/batch processing) amenable to running on more loosely coupled machines than the cache-coherent NUMA machines. These applications often need to be highly available, requiring some form of fault tolerance and repairability. Such applications, plus the similarity of the multiprocessor nodes to desktop computers and the emergence of high-bandwidth, switch-based local area networks, led to clusters of off-the-shelf, whole computers for large-scale processing.
Performance Challenges of Clusters
One drawback is that clusters are usually connected using the I/O bus of the computer, whereas multiprocessors are usually connected on the memory bus of the computer. The memory bus has higher bandwidth and much lower latency, allowing multiprocessors to drive the network link at higher speed and to have fewer conflicts with I/O traffic on I/O-intensive applications. This connection point also means that clusters generally use software-based communication while multiprocessors use hardware for communication. However, connecting on the memory bus makes the connection nonstandard and hence more expensive.
A second weakness is the division of memory: a cluster of N machines has N independent memories and N copies of the operating system, but a shared address multiprocessor allows a single program to use almost all the memory in the computer. Thus, a sequential program in a cluster has 1/Nth the memory available compared to a sequential program in a shared memory multiprocessor. Interestingly, the drop in DRAM prices has made memory costs so low that this multiprocessor advantage is much less important in 2001 than it was in 1995. The primary issue in 2001 is whether the maximum memory per cluster node is sufficient for the application.
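The 1/N relationship can be written down directly. The sketch below is an upper bound only: it ignores the memory taken by each copy of the operating system, assumes memory is split evenly across nodes, and uses illustrative totals rather than figures from a specific machine.

```python
def memory_per_sequential_program_gb(total_gb: float, nodes: int, shared_address: bool) -> float:
    """Upper bound on the memory a single sequential program can address.

    shared_address=True models a shared address multiprocessor (one memory);
    shared_address=False models a cluster whose memory is split evenly over `nodes`.
    """
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    return total_gb if shared_address else total_gb / nodes


# Illustrative numbers: 32 GB in total, spread over 32 cluster nodes.
print(memory_per_sequential_program_gb(32.0, 32, shared_address=False))  # 1.0
print(memory_per_sequential_program_gb(32.0, 32, shared_address=True))   # 32.0
```

The question raised above is then whether the per-node figure is large enough for the application, not whether the total matches.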
Dependability and Scalability Advantage of Clusters
The weakness of separate memories for program size turns out to be a strength in system availability and expansibility. Since a cluster consists of independent computers connected through a local area network, it is much easier to replace a machine without bringing down the system in a cluster than in a shared memory multiprocessor. Because clusters are constructed from whole computers and independent, scalable networks, this isolation also makes it easier to expand the system without bringing down the application that runs on top of the cluster. High availability and rapid, incremental extensibility make clusters attractive to service providers for the World Wide Web.
Pros and Cons of Cost of Clusters
One drawback of clusters has been the cost of ownership. Administering a cluster of N machines is close to the cost of administering N independent machines, while the cost of administering a shared address space multiprocessor with N processors is close to the cost of administering a single, big machine.
Another difference between the two tends to be the price for equivalent computing power for large-scale machines. Since large-scale multiprocessors have small volumes, the extra development costs of large machines must be amortized over few systems, resulting in higher cost to the customer. Originally, the partitioning of memory into separate modules in each node was a significant disadvantage to clusters, as division means memory is used less efficiently than on a shared address computer. The incredible drop in the price of memory has mitigated this weakness, dramatically changing the trade-offs in favor of clusters.
Popularity of Clusters
Low cost, scaling, and fault isolation proved a perfect match for the companies providing services over the Internet since the mid-1990s. Internet applications such as search engines and email servers are amenable to more loosely coupled computers, as the parallelism consists of millions of independent tasks. Hence, companies like Amazon, AOL, Google, Hotmail, Inktomi, WebTV, and Yahoo rely on clusters of PCs or workstations to provide services used by millions of people every day.
Clusters are growing in popularity in the scientific computing market as well. Figure 8.30 shows the mix of architecture styles between 1993 and 2000 for the top 500 fastest scientific computers. One attraction is that individual scientists can afford to construct clusters themselves, allowing them to dedicate their cluster to their problem. Shared supercomputers are placed on a monthly allocation of CPU time, so it is plausible for a scientist to get more work done from a private cluster than from a shared supercomputer. It is also relatively easy for scientists to scale their computing over time as they get more money for computing.
Designing a Cluster:
Consider a system with about 32 processors, 32 GB of DRAM, and 32 or 64 disks. Figure 8.33 lists the components we use to construct the cluster, including their prices.
Figure 8.33 confirms some of the philosophical points of the prior section. Note the difference in cost and speed of the processors in the smaller systems versus the larger multiprocessor. In addition, the price per DRAM DIMM goes up with the size of the computers.
The higher price of the DRAM is harder to explain based on cost. For example, all include ECC. The uniprocessor uses 133 MHz SDRAM, and the 2-way and 8-way both use registered DIMM (RDIMM) SDRAM. There might be a slightly higher cost for the buffered DRAM between the uniprocessor and 2-way boxes, but it is hard to explain a price increase of 1.5 times for the 8-way SMP vs. the 2-way SMP. In fact, the 8-way SDRAM operates at just 100 MHz. Presumably, customers willing to pay a premium for processors for an 8-way SMP are also willing to pay more for memory.
The reasons for the higher price matter little to the designer of a cluster. The task is to minimize cost for a given performance target. To motivate this section, here is an overview of the four examples:
1 Cost of Cluster Hardware Alternatives with Local Disk: The first example compares the cost of building from a uniprocessor, a 2-way SMP, and an 8-way SMP. In this example, the disks are directly attached to the computers in the cluster.
2 Cost of Cluster Hardware Alternatives with Disks over SAN: The second example moves the disk storage behind a RAID controller on a Storage Area Network.
3 More Realistic Cost of Cluster Options: The third example includes the cost of software, the cost of space, some maintenance costs, and operator costs.
4 Cost and Performance of a Cluster for Transaction Processing: This final example describes a similar cluster tailored by IBM to run the TPC-C benchmark. This example has more memory and many more disks to achieve a high TPC-C result, and at the time of this writing, it is the 13th fastest TPC-C system. In fact, the machine with the fastest TPC-C result is just a replicated version of this cluster with a bigger LAN switch.
First Example: Cost of Cluster Hardware Alternatives with Local Disk
This first example looks only at the hardware cost of the three alternatives using the IBM pricing information.
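Before any prices enter the comparison, the three alternatives already differ in how many boxes are needed. The sketch below works out node counts for the 32-processor, 32 GB target stated earlier; it assumes DRAM is divided evenly across nodes, and it leaves out cost entirely because the Figure 8.33 prices are not reproduced in these notes.

```python
TOTAL_PROCESSORS = 32  # target from "Designing a Cluster"
TOTAL_DRAM_GB = 32     # target from "Designing a Cluster"


def cluster_shape(processors_per_node: int) -> tuple[int, float]:
    """Return (number of nodes, DRAM per node in GB) for one building block.

    Assumes DRAM is spread evenly over the nodes, which is an illustration only.
    """
    nodes = TOTAL_PROCESSORS // processors_per_node
    return nodes, TOTAL_DRAM_GB / nodes


for name, way in [("uniprocessor", 1), ("2-way SMP", 2), ("8-way SMP", 8)]:
    nodes, dram = cluster_shape(way)
    print(f"{name}: {nodes} nodes, {dram} GB DRAM per node")
```

Fewer, larger nodes mean fewer boxes and network ports to buy and manage, which has to be weighed against the higher per-processor and per-DIMM prices of the larger SMPs.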
Second Example: Using a SAN for disks.
The previous example uses disks local to the computer. Although this can reduce costs and space, the problem for the operator is that
1) there is no protection against a single disk failure, and
2) there is state in each computer that must be managed separately.
Hence, the system is down on a disk failure until the operator arrives, and there is no separate visibility or access to storage. This second example centralizes the disks behind a RAID controller in each case, using FC-AL as the Storage Area Network.
Third Example: Accounting for Other Costs
The first and second examples only calculated the cost of the hardware. There are two other obvious costs not included: software and the cost of a maintenance agreement for the hardware.
Fourth Example: Cost and Performance of a Cluster for Transaction Processing:
This cluster also has 32 processors, uses the same IBM computers as building blocks, and it uses the same switch to connect computers together.
Disk size: Since TPC-C cares more about I/Os per second (IOPS) than disk capacity, this cluster uses many small, fast disks. The use of small disks gives many more IOPS for the same capacity. These disks also rotate at 15000 RPM vs. 10000 RPM, delivering more IOPS per disk.
RAID: Since the TPC-C benchmark does not factor in human costs for running a system, there is little incentive to use a SAN. TPC-C does require a RAID protection of disks, however. IBM used a RAID product that plugs into a PCI card and provides four SCSI strings. To get higher availability and performance, each enclosure attaches to two SCSI buses.
Memory: Conventional wisdom for TPC-C is to pack as much DRAM as possible into the servers. Hence, each of the four 8-way SMPs is stuffed with the maximum of 32 GB, yielding a total of 128 GB.
Processor: This benchmark uses the 900 MHz Pentium III with a 2 MB L2 cache. The price is $6599, compared with $1799 for the 700 MHz Pentium III with a 1 MB L2 cache in the prior 8-way clusters.
PCI slots: This cluster uses 7 of the 12 available PCI bus slots for the RAID controllers, compared to 1 PCI bus slot for an external SCSI or FC-AL controller in the prior 8-way clusters. This greater utilization follows the guideline of trying to use all resources of a large SMP.
Tape Reader, Monitor, Uninterruptible Power Supply: To make the system easier to bring up and to keep running for the benchmark, IBM includes one DLT tape reader, four monitors, and four UPSs.
Maintenance and spares: TPC-C allows the use of spares to reduce maintenance costs; the rule is a minimum of two spares or 10% of the items. Hence, there are two spare Ethernet switches, host adapters, and cables for TPC-C.
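Two of the items above reduce to short calculations, sketched here under stated assumptions. The rotational latency formula (half a revolution on average) is standard; the 5 ms average seek time is a hypothetical value chosen only to show the trend, and transfer time is ignored. The spares rule is read as "whichever is larger of two spares or 10% of the items, rounded up", which is an interpretation of the sentence above rather than the wording of the TPC-C rules.

```python
def avg_rotational_latency_ms(rpm: int) -> float:
    """Average rotational latency: half a revolution, in milliseconds."""
    return 0.5 * 60_000.0 / rpm


def estimated_iops(rpm: int, avg_seek_ms: float = 5.0) -> float:
    """Rough random I/Os per second for one disk (seek + rotation only).

    avg_seek_ms is a hypothetical value, not a figure from these notes.
    """
    return 1000.0 / (avg_seek_ms + avg_rotational_latency_ms(rpm))


def spares_required(items: int) -> int:
    """Larger of two spares or 10% of the items (rounded up), as interpreted above."""
    ten_percent_rounded_up = (items + 9) // 10  # integer math avoids float rounding
    return max(2, ten_percent_rounded_up)


print(avg_rotational_latency_ms(15000))  # 2.0
print(avg_rotational_latency_ms(10000))  # 3.0
print(estimated_iops(15000) > estimated_iops(10000))
print(spares_required(4))
```

With only a handful of switches, host adapters, and cable types in the configuration, the two-spare minimum is the binding limit, consistent with the two spares of each listed above.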