              Portable MPI Model Implementation over GM
             Version mpich-1.2.1..7, September 27, 2001

README-GM
=========

MPICH is a portable implementation of MPI, developed by Argonne National
Laboratory. MPICH is designed to be highly portable and is currently
used by a large number of providers of MPI implementations.

    http://www.mcs.anl.gov/mpi/mpich/

MPICH-GM is a port of MPICH on top of GM (ch_gm) and is supported by
Myricom.

Note: Linux 2.2 and 2.4 are supported.

MPICH-GM, version 1.2.1..7 requires gm-1.5(preX).
All patches from Argonne National Lab have been applied.

Features of this new MPICH-GM implementation include:

* The regcache mechanism has been rewritten and uses all of the
  available DMA-able memory to keep buffers registered as long as
  possible (the previous maximum was 64 MB; it is now half of your
  physical memory).

* Alignment constraints no longer force a memory copy on LANai4.

* Uses GM directed sends (PUT) (gm_directed_send_with_callback()) for
  large messages (Rendez-vous) to improve scalability. Uses all of the
  send and recv tokens efficiently.

* The Eager/Rendez-vous threshold can be changed at run-time. The
  default is 16KB.

* GM (Myrinet) is used by default for all communications.

* Directcopy is disabled by default (will be enabled in the future
  after extensive tests).

* Shared memory support is compiled but disabled by default, and may
  be enabled at run-time. If you pass the "--gm-use-shmem" flag to
  mpirun.ch_gm at run-time, the shared memory device will be used (when
  you select several processes per node in your conf file). It is no
  longer necessary to compile two MPICH-GM trees.

* The user can change the behavior of the blocking MPI calls by
  specifying the "--recv-mode" flag to mpirun.ch_gm at run-time.
  (Type "mpirun.ch_gm --gm-h" for details.)

************************************************************************
For updates to this software, visit `http://www.myri.com/scs'.
The FAQ file is located at `http://www.myri.com/scs/GM_FAQ.html'.
All Myrinet hardware and software questions should be directed to
help@myri.com.
************************************************************************

Table of Contents:

  I.   Installation of MPICH-GM (compilation and usage) (READ THIS!!)
  II.  Run-Time (Tuning) Options
  III. Running TOTALVIEW
  IV.  Memory Leak Detection
  V.   Other Notes


I. Installation of MPICH-GM (compilation and usage)
===================================================

The installation of MPICH-GM involves the following 4 steps:

  1. Configure MPICH-GM
  2. Make
  3. Create a conf file
  4. Run a program


STEP 1: Configure MPICH-GM
--------------------------

Configuring MPICH-GM involves setting the environment variable GM_HOME
and then running configure:

    setenv GM_HOME
    ./configure

GM_HOME must name the top of your GM tree; the sample configure line in
Section A references $GM_HOME/include and $GM_HOME/binary/lib.

For a complete listing of all options to MPICH-GM configure, type

    ./configure --help

To assist with this process, we have provided several scripts which set
the environment variable GM_HOME, contain a sample configure line for
the respective compiler, run make, and log the output to a file. These
scripts perform Steps 1 and 2 as listed above.

The currently available scripts are named:

    mpich.make          -- GNU gcc and g77
    mpich.make.absoft   -- Absoft Fortran compilers and GNU gcc
    mpich.make.fujitsu  -- Fujitsu Fortran and C compilers
    mpich.make.pgi      -- Portland Group Fortran and C compilers
    mpich.make.intel    -- Intel Fortran and C compilers

We encourage the user to edit one of these scripts as needed and then
type:

    ./mpich.make
  or
    ./mpich.make.absoft
  or
    ./mpich.make.fujitsu
  or
    ./mpich.make.pgi
  or
    ./mpich.make.intel

and proceed to Step 3.

A. Configuring on Linux:
------------------------

WARNING: You must compile/link MPICH-GM on the same architecture and
         OS version on which you'll be running. Do NOT compile on a
         linux-2.2.x box if you'll be running under linux-2.4.x.

For example, the mpich.make.pgi script for the Portland Group compilers
uses a configure line like:

    ./configure -nodevdebug -cflags="-I$GM_HOME/binary/include \
      -I$GM_HOME/include -Msignextend -tp px" -opt=-fast -device=ch_gm \
      -noromio -noc++ --lib="-L$GM_HOME/binary/lib/ -L$GM_HOME/lib/ \
      -lgm -lpthread" -arch=LINUX -fc=pgf77 -fflags="-Msignextend -tp px" \
      -f90=pgf90 -f90flags="-Msignextend -tp px" \
      -cc=pgcc -rsh=rsh

This example configure line is also given on the Portland Group FAQ
(http://www.pgroup.com/faq.htm#cdk1d).

Note that the "-tp px" flag is a generic pentium flag. Replace the "px"
with the flag appropriate for your hardware. (For example, p6 is
optimized for the pentium pro/PII/PIII.)

Note: There is no longer a "-shared-memory-support" flag at configure
      time. Shared memory support has been compiled by default since
      1.2.1..4, and it may be ENABLED at runtime with the
      "--gm-use-shmem" flag to mpirun.ch_gm.

B. Configuring on other architectures:
--------------------------------------

Other architectures will need a different -arch= parameter in the
configure line. See the bottom of the file for Silicon Graphics
machines and Compaq/Digital Unix (OSF).

We have a special (beta) mpich version for running on NT. Contact
help@myri.com for more information.

At this time, Solaris does not support memory registration, so this
version of MPICH cannot be run on Solaris. Please use mpich-1.2..8 from
our ftp site (mpich-1.2.1..6 should support Solaris).


STEP 2: Make
------------

If you used one of the mpich.make scripts, this step has already been
performed, and the make output is logged in make.log. Inspect the make
output for any errors, and then proceed to Step 3.

Otherwise, after configuring MPICH-GM, go into the main mpich directory
and type:

    make

Note: The mpich make process generates lots of output. If the make
      fails in one directory, it will skip that directory and continue
      with the rest of the make.
      The result is that a later application build fails with
      'undefined reference to' errors. In that case, make mpich again
      using

          make >& make.log

      to redirect the output of the command to a file, and then check
      the output file (make.log) for the first error. Fix the problem
      and then try making again.


STEP 3: Create a conf file
--------------------------

The "conf" file specifies the hosts on which the MPI application will
run. This file must be accessible on all the nodes where MPI processes
will be running. Each process reads this conf file, as does the mpirun
script. Comments (lines that begin with '#') and blank lines are
allowed.

The default location for this file is $HOME/.gmpi/conf. If the file is
in a different location or has a different name, the --gm-f option can
be used to tell mpirun.ch_gm where to find the conf information.

The file gives the number of nodes first, followed by one line per node
with the node name, the GM port, and an optional board number. The
generic description of the contents of this file is:

    [node_0_board_optional]
    [node_1_board_optional]
    [node_2_board_optional]
    . . .
    [node_N_board_optional]

An example conf file is given below:

    # .gmpi/conf file begin
    # first the number of nodes in the file
    8
    # the list of (node, port, board) that make the MPI World
    node1.myri.com 2
    node1.myri.com 4
    node1.myri.com 5
    node1.myri.com 6
    node2.myri.com 2
    node2.myri.com 4
    node2.myri.com 4 1
    node2.myri.com 5 1
    # .gmpi/conf file end

To set up the conf file for SMP use, list the SMP machine N times (once
for each processor), using a different GM port on each line. The
example above uses two SMP machines (node1 and node2), each of which
has 4 processors. If you have multiple boards in a single machine, add
the board number to the end of the line. Missing board numbers are
assumed to be zero.

In general, GM has 8 software ports. Ports 2, 4, 5, 6, and 7 are for
users. Other ports are not for general user-process use. Refer to the
gm-1.5/README for further information on GM ports.
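
The conf layout described above (a node count, then "host port [board]"
lines, with '#' comments and blank lines ignored and a missing board
taken as 0) can be checked before a job is launched. The following is a
minimal Python sketch and is not part of MPICH-GM: the function name
parse_gmpi_conf, the host names, and the two consistency checks marked
as assumptions are illustrative only. The user-port set is the one
listed in this README.

```python
USER_PORTS = {2, 4, 5, 6, 7}  # user ports as listed in this README


def parse_gmpi_conf(text):
    """Return a list of (host, port, board) tuples from conf-file text."""
    # Comments (lines starting with '#') and blank lines are allowed.
    lines = [ln.strip() for ln in text.splitlines()]
    lines = [ln for ln in lines if ln and not ln.startswith("#")]
    if not lines:
        raise ValueError("conf file has no node count")

    declared = int(lines[0])  # first data line: number of nodes
    entries = []
    for ln in lines[1:]:
        fields = ln.split()
        if len(fields) not in (2, 3):
            raise ValueError(f"expected 'host port [board]', got {ln!r}")
        host = fields[0]
        port = int(fields[1])
        # A missing board number is assumed to be zero.
        board = int(fields[2]) if len(fields) == 3 else 0
        if port not in USER_PORTS:
            raise ValueError(f"port {port} is not a user port: {ln!r}")
        entries.append((host, port, board))

    # Assumption: the declared count should match the number of entries.
    if declared != len(entries):
        raise ValueError(f"count says {declared}, found {len(entries)}")
    # Assumption: two processes should not share one (host, port, board).
    if len(set(entries)) != len(entries):
        raise ValueError("duplicate (host, port, board) entry")
    return entries


# Hypothetical hosts: hostA has one board, hostB has two.
SAMPLE_CONF = """\
# sample conf text for the sketch
4
hostA.example.com 2
hostA.example.com 4
hostB.example.com 2
hostB.example.com 4 1
"""

if __name__ == "__main__":
    for entry in parse_gmpi_conf(SAMPLE_CONF):
        print(entry)
```

Running such a check on the file passed with --gm-f catches a reserved
port or a repeated line before any process is started.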

In the example conf file above, node2 has two boards and will run one
process on board 0, port 2; another process on board 0, port 4; another
process on board 1, port 4; and a fourth process on board 1, port 5.

The machine names are what gm-1.5/binary/bin/gm_board_info shows in the
routing tables (the full host names).

Note: The exception is for multiple boards in a machine. The hostname
      in the conf file needs to be a valid hostname that can be used
      for rsh. The name in the gm_board_info output will be
      machinename:1 for board #1. Do not put the ":1" in the conf file;
      just use "machinename" and put the "1" at the end of the line to
      indicate board "1".

Here is the ./gm_board_info output for the aforementioned conf file
(with two cards in "node2").

Routing table for this node follows:
gmID MAC Address       Hostname                         Route
---- ----------------- -------------------------------- ---------------------
  96 00:60:dd:7f:ec:fa node1.myri.com                   87 be bb
  99 00:60:dd:7f:ec:f9 node2.myri.com                   87 be b9
 100 00:60:dd:7f:ee:a5 node2.myri.com:1                 80 (this node)


STEP 4: Run a program
---------------------

Sample test programs are in examples/basic, examples/perftest and
examples/tests.

To run the cpi program in examples/basic:

    cd examples/basic
    make
    ../../bin/mpirun.ch_gm --gm-v cpi

examples/perftest/myrunex will gather performance information.

If the make process fails with 'undefined reference' errors, see the
NOTE under Step 2 on making mpich.


II. Run-Time (Tuning) Options
=============================

A number of run-time tuning options can be supplied to mpirun.ch_gm.

Usage:

    mpirun.ch_gm [--gm-v] [-np ] [--gm-f ] [--gm-h] prog [options]

      --gm-v           verbose - includes comments
      -np              specifies the number of processes to run
      --gm-np          same as '-np' (use one or the other)
      --gm-f           specifies a configuration file
      --gm-use-shmem   enable the shared memory support
      --gm-shmem-file  specifies a shared memory file name
      --gm-shf         explicitly removes the shared memory file
      --gm-h           generates this message
      --gm-r           start machines in reverse order
      --gm-w           wait n secs between starting each machine
      --gm-kill        n secs after first process exits, kill all other
                       processes
      --gm-dryrun      don't actually execute the commands, just print
                       them
      --gm-recv        specifies the recv mode, 'polling', 'blocking'
                       or 'hybrid'
      --gm-recv-verb   specifies verbose for recv mode selection
      -tv              specifies totalview debugger
      prog [options]   specifies which program to run, with its options

Examples:
--------

1. Specifying "--recv-mode" as a run-time option changes the behavior
   of the blocking MPI calls. Three modes may be specified at runtime.
   For example:

       ./mpirun.ch_gm --recv-mode polling -np 4 foo.x
       ./mpirun.ch_gm --recv-mode blocking -np 4 foo.x
       ./mpirun.ch_gm --recv-mode hybrid -np 4 foo.x

   You can use "--recv-mode-verb" to check the selected mode.

   The default is "--recv-mode polling".

   The "polling" mode asks MPI to "poll" all devices continually to
   check for the completion of an event -- send or receive. This mode
   provides the lowest latency but also has the highest CPU
   utilization. It is enabled by default as it provides the best
   performance when each process has a dedicated processor.

   The "blocking" mode uses a "blocking GM receive" function called
   gm_blocking_receive_no_spin() (i.e., each MPI blocking function call
   will effectively block, sleeping in the kernel waiting for an
   interrupt from the Myrinet interface).
   The CPU utilization is minimal, as a blocked process will not be
   scheduled on any processor, but the cost of the interrupt and
   context switches adds 15-40 microseconds of latency, depending upon
   the architecture. This recv mode is very efficient when several
   processes compete for the same processor. This is the case for some
   multi-threaded applications or some MPI applications that spawn
   several processes per processor by default (e.g., GAMESS).

   The "hybrid" mode is a combination of the two previous modes. In
   this mode, the process will "poll" for one millisecond
   (gm_blocking_receive) and then sleep as in the blocking mode. This
   recv mode provides a good balance between wasted CPU time and the
   cost of the interrupt overhead.

2. To change the Eager/Rendez-vous threshold at run time, set the
   variable GMPI_EAGER_SIZE (size in bytes) as shown below:

       ./mpirun.ch_gm GMPI_EAGER_SIZE=4096 -np 2 foo.x

   The default value of GMPI_EAGER_SIZE is 16228, and the minimum value
   that can be specified is 128.

   WARNING: Do not change this value unless you know what you are
            doing!

   The Eager protocol is a non-blocking protocol where the sender sends
   a message without knowing if the receiver has posted a matching
   receive. If the receiver does not provide a matching receive in
   time, the message is saved in a temporary buffer. This protocol is
   used for small messages as it provides the lowest latency.

   The Rendez-vous protocol forces synchronization between sender and
   receiver by handshaking with small messages. The data is then
   transmitted with a gm_directed_send_with_callback() (PUT), and is
   written directly to the recv buffer without intermediate buffering.
   This protocol is used for large messages, where the buffering
   overhead becomes unmanageable.

   In MPICH, the usual threshold between these two protocols is 16K. We
   are consistent with this value to ensure the same behavior as other
   MPICH backends (devices).
   Indeed, the MPI specification says that an application may not
   assume anything about the blocking behavior of the MPI functions for
   different message sizes. However, a large set of MPI applications
   violate this rule and may deadlock if the value of this threshold is
   changed.

3. To enable the shared memory support (disabled by default), use the
   mpirun.ch_gm flag "--gm-use-shmem", as shown below. This will enable
   shared memory between local processes for this run only.

       ./mpirun.ch_gm --gm-use-shmem -np 4 foo.x

   Enabling shared memory may improve or reduce performance, depending
   closely on the MPI application: the latency is much better, but the
   peak bandwidth depends on the performance of the memory copy code
   provided by the OS. Memory bus traffic and cache thrashing are two
   nasty side effects of shared memory.


III. Running TOTALVIEW
======================

To run with totalview, first set the TOTALVIEW environment variable,
and then run with the -tv flag:

    setenv TOTALVIEW /totalview
  or
    export TOTALVIEW=/totalview

    ../../bin/mpirun.ch_gm -tv -np 2 cpi

Warning: ANL's KnownBugs file (see ANL_docs/KnownBugs) says:

    3. Totalview access to the message queues may not work. We've done
       some preliminary testing but due to bureaucratic snafus haven't
       been able to fully test this.


IV. Memory Leak Detection
=========================

This mpich provides memory leak detection. To enable this feature use

    -use-debug-malloc

in configure. Using the flag in the configure line will replace (Doug
Lea's) malloc in MPICH-GM by Dmalloc, and it will automatically produce
one dmalloc_log_X file per process, where X is the MPI id, with memory
leak information, memory usage, fragmentation, etc. Beware that this
flag reduces overall performance: use it only for debugging.

To understand the content of the log files see:

    http://dmalloc.com


V. Other Notes
==============

1. The previous version of mpich supported a runtime checksum flag.
   That is no longer supported; instead there is a define in the file
   mpid/ch_gm/gmpi.h. Set this define to 1:

       #define GMPI_DEBUG_CHECKSUM 1

   then run make in that directory, and then rebuild the application.

2. MPI_Abort

   > Should aborttest work in mpich-gm?
   >
   > aborttest completes if I use one process. It hangs otherwise.
   >

   In fact, MPICH-GM makes no attempt at terminating the other nodes
   when MPI_Abort is called on one node. Only the calling process is
   terminated, and the others do not notice (and so will generally hang
   at some point).

   Although this is a buggish shortcut, some other MPI implementations
   behaved similarly, which is presumably why the aborttest comes last
   in the "env" test suite (not counting the Fortran tests): in this
   position it does not prevent further tests from being run.

   It is not simple to implement an MPI_Abort with correct semantics.
   In the meantime, the best solution is to use the "--gm-kill" flag at
   runtime, on the mpirun.ch_gm command line. This flag tells
   mpirun.ch_gm to kill all of the MPI processes N seconds after the
   first one exits or aborts. Thus the abort is propagated to the rest
   of the MPI job, but outside of the MPI code.
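
The "--gm-kill" behavior described above (wait for the first process to
exit, allow N seconds, then kill whatever is still running) can be
pictured with a small launcher. The following Python sketch only
illustrates that policy and is not how mpirun.ch_gm is implemented: the
function name run_with_kill_delay is hypothetical, and it starts local
child processes rather than remote ones through rsh.

```python
import subprocess
import sys
import time


def run_with_kill_delay(commands, kill_after, poll_interval=0.05):
    """Start every command; kill_after seconds after the first one
    exits, kill the processes that are still running.

    Returns the list of exit codes, in the order of `commands`.
    """
    procs = [subprocess.Popen(cmd) for cmd in commands]

    # Wait until at least one process has exited (or aborted).
    while all(p.poll() is None for p in procs):
        time.sleep(poll_interval)

    # Grace period: let the remaining processes finish on their own.
    deadline = time.monotonic() + kill_after
    while time.monotonic() < deadline:
        if all(p.poll() is not None for p in procs):
            break
        time.sleep(poll_interval)

    # Anything still alive after the grace period is killed.
    for p in procs:
        if p.poll() is None:
            p.kill()
    return [p.wait() for p in procs]


if __name__ == "__main__":
    # One process exits at once with status 3; two others would
    # otherwise keep running, like ranks that never notice an abort.
    quick = [sys.executable, "-c", "import sys; sys.exit(3)"]
    stuck = [sys.executable, "-c", "import time; time.sleep(60)"]
    print(run_with_kill_delay([quick, stuck, stuck], kill_after=1.0))
```

The point of the delay is the same as for the N in "--gm-kill": a job
that ends normally has time for every process to exit by itself, while
a job in which one process aborted is cleaned up instead of hanging.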