SIRIUS: the MacCHESS Linux Cluster

Motivation, Goals, and Uses at (Mac)CHESS

Parallel Software Development at (Mac)CHESS


Hardware Design

Frank Labonté at CHESS was principally responsible for the selection of hardware components as well as physical construction of the cluster. Please direct nitty-gritty hardware-specific questions to him.

Server Node

The server for the cluster is housed in a standalone tower with a Tyan Thunder K7 S2462 motherboard, dual AMD processors (1.2 GHz), 1 GB of DDR RAM, a 137 GB SCSI disk array, dual 100 Mb/s Ethernet ports, and a PCI64B-2 Myrinet interface. The SCSI array consists of an Adaptec 3200S SCSI RAID controller and four Seagate Cheetah X15 36LP series (model ST336752LW) disks in a RAID 0 configuration.
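As a rough check on the array size, here is a minimal sketch of the RAID 0 capacity arithmetic. It assumes a nominal 36.7 GB (decimal) per ST336752LW drive, a figure that is not stated on this page. Under that assumption, the quoted 137 figure corresponds to the striped total expressed in binary units.

```python
# Assumption (not from this page): nominal capacity of one ST336752LW
# drive, in decimal gigabytes.
DRIVE_GB = 36.7
N_DRIVES = 4  # from the hardware description above


def raid0_capacity_bytes(n_drives: int, drive_gb: float) -> float:
    """RAID 0 stripes data with no redundancy, so usable capacity
    is simply the sum of the member drives."""
    return n_drives * drive_gb * 1e9


total_bytes = raid0_capacity_bytes(N_DRIVES, DRIVE_GB)
print(f"decimal units: {total_bytes / 1e9:.1f} GB")
print(f"binary units:  {total_bytes / 2**30:.1f} GiB")
```

Note that RAID 0 trades safety for speed and capacity: the loss of any one of the four disks loses the whole array.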

Client Nodes

There are 31 diskless client nodes, each in a "1U" case and mounted in a heavy-duty rolling rack. Each client node is an APPRO 1124 "server" pre-configured (i.e., via purchase options) with a Tyan K7 motherboard, dual AMD processors (1.2 GHz), 1 GB of DDR RAM, dual 100 Mb/s Ethernet ports, and a PCI64B-2 Myrinet interface. Fully loaded, this rack weighs more than 1200 lbs. Power is distributed to groups of 8 client nodes through 4 APC SurgeArrest units, each requiring a dedicated 15A electrical service. We estimate the power requirements of the rack to be 4 to 5 kW, most of which is lost as heat that must be dissipated by adequate room cooling and ventilation.
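For anyone sizing a room for a similar rack, the figures above can be turned into a quick back-of-the-envelope budget. This is only a sketch: the 120 V mains voltage is my assumption (it is not stated here), and dividing the whole rack estimate by the node count overstates the per-node draw, since the switches share the same power.

```python
MAINS_VOLTS = 120.0        # assumption: not stated on this page
CIRCUIT_AMPS = 15.0        # one dedicated service per SurgeArrest unit
N_CIRCUITS = 4
N_CLIENTS = 31
GROUP_SIZE = 8
EST_LOAD_KW = (4.0, 5.0)   # estimated rack power from the text
BTU_PER_HR_PER_KW = 3412.14  # standard conversion: 1 kW = 3412.14 BTU/h

# Split 31 nodes into groups of at most 8, one group per surge unit.
groups = [min(GROUP_SIZE, N_CLIENTS - start)
          for start in range(0, N_CLIENTS, GROUP_SIZE)]

# Nominal ceiling of the four circuits combined, in kW.
capacity_kw = N_CIRCUITS * CIRCUIT_AMPS * MAINS_VOLTS / 1000.0

print("nodes per surge unit:", groups)
print(f"nominal circuit capacity: {capacity_kw:.1f} kW")
for kw in EST_LOAD_KW:
    per_node_w = kw * 1000.0 / N_CLIENTS   # upper bound per client node
    heat_btu = kw * BTU_PER_HR_PER_KW      # cooling load if all power ends up as heat
    print(f"{kw:.0f} kW rack: <= {per_node_w:.0f} W per node, "
          f"about {heat_btu:.0f} BTU/h of heat")
```

The point of the exercise is the cooling number: essentially every watt delivered to the rack has to be removed from the room again.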

Networking

Two LinkSys EF2S24 v2 EtherFast II switches provide 100 Mb/s Ethernet connections for client boot-up and "background" NFS communications. Gigabit optical networking for message passing during parallel program execution is provided by a Myrinet-2000 32-port switch equipped with a monitoring line card.

MacCHESS Senior Research Associate Dave Schuller hiding behind SIRIUS. 13 of 15 "1U" nodes are present and working in the upper section of the rack (the missing pair above the surge protectors was removed for service), while 16 additional nodes reside in the lower section below Dave's hand. The Myrinet switch is mounted in the front center of the rack (green lights, orange cabling) between 2 pairs of APC Network SurgeArrest units, while the Ethernet switches and cabling are located on the back (not visible).


Software Design

Server Node

The server runs a Linux kernel (currently 2.4.14) "optimized" for the dual processor server (SMP) in the Red Hat Linux OS (currently 7.1) environment. Client-specific root filesystems are automatically generated by a custom shell script which is run on the server prior to client boot-up. Myrinet GM software is used for low-level message passing, while MPICH-GM software provides a user-level interface for parallel programs written using the MPICH implementation of the MPI library specification.

Client Nodes

Each client obtains its network information via DHCP and boots a Linux kernel (currently 2.4.14) "optimized" for the diskless, dual processor client nodes via TFTP from the server node. Each client mounts its root filesystem using ClusterNFS software from the server node.
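To make the DHCP half of that boot sequence concrete, here is a small sketch that generates per-node host entries in the style of an ISC dhcpd configuration file. Everything specific in it is hypothetical: this page does not say which DHCP server SIRIUS uses, and the node names, addresses, MAC addresses, and kernel file name below are made up for illustration. The installation recipe has the real files.

```python
import ipaddress


def dhcp_host_stanza(index: int, mac: str,
                     base_ip: str = "192.168.1.100",      # hypothetical private subnet
                     server_ip: str = "192.168.1.1",      # hypothetical server address
                     boot_file: str = "vmlinuz.client"    # hypothetical kernel image name
                     ) -> str:
    """Return one ISC-dhcpd-style host entry for client number `index`.

    Each client gets a fixed address plus the TFTP server (next-server)
    and the file to fetch from it (filename).
    """
    ip = ipaddress.IPv4Address(base_ip) + index
    name = f"node{index:02d}"  # hypothetical naming scheme
    return (
        f"host {name} {{\n"
        f"    hardware ethernet {mac};\n"
        f"    fixed-address {ip};\n"
        f"    next-server {server_ip};\n"
        f"    filename \"{boot_file}\";\n"
        f"}}\n"
    )


# Placeholder MAC addresses from the range reserved for documentation.
macs = [f"00:00:5e:00:53:{i:02x}" for i in range(1, 32)]
config = "\n".join(dhcp_host_stanza(i, mac) for i, mac in enumerate(macs, start=1))
print(config)
```

With 31 nearly identical clients, generating these entries from a list of MAC addresses is less error-prone than typing them by hand.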

Step-by-step Software Installation

If you want to clone this machine, I have attempted to completely document a step-by-step installation recipe for all the needed software and to provide copies of the important system files that I modified to get things to work. I can almost guarantee that this will not work the first time you try it; I have undoubtedly omitted some crucial step, and you will undoubtedly introduce changes trying to make "improvements" in this procedure or in adapting it to your hardware. If you are making a serious effort to reproduce this or a very similar system, I would be very pleased to hear from you.

To my great delight, there is now at least one person, Essam Metwally, who has succeeded in adapting the SIRIUS installation recipe to a new diskless Linux cluster. His machine, modestly called Ramses, wields its computational power at the Scripps Institute.

Links to Parallel Programming and Software on SIRIUS


© Arthur J. Weaver
ajw33@cornell.edu
art@ajwresearch.com
Last updated 2002.04.02