Mpiexec is a replacement program for the script mpirun, which is part of
the mpich package. It is used to initialize a parallel job from within a
PBS batch or interactive environment. See the man page for detailed
information.

Copyright (C) Pete Wyckoff, 2000-6.


Installation instructions
-------------------------

1.  First, it is recommended that you apply a patch to your PBS
    distribution. This patch adds the functionality which allows the stdio
    streams from a parallel process to be sent directly to mpiexec. It also
    provides the capability to send stdin to more than just process number
    zero, if you so choose.

    Applying this patch is not mandatory. Without it, these stdio
    redirection features will not work, but the basic MPI spawning through
    the TM interface of PBS will still function just fine.

    The PBS distributions with which this is known to work are:

        OpenPBS-2.3.11 through .16 (http://www.openpbs.org)
        SPBS 1.0.0 rc1 through rc4 (http://www.supercluster.org/projects/pbs/)
        Torque (unknown versions, see below)

    See below for information on use with PBSPro or older versions of
    OpenPBS.

    Apply the patch by doing something like this:

        cd /usr/local/src/pbs-2.3.12
        patch -p1 -sNE < /home/pw/src/mpiexec/patch/pbs-2.3.12-mpiexec.diff

    An attempt has been made to ensure that the behavior of PBS does not
    change unless explicitly instructed to do so by mpiexec. You'll need to
    build and install PBS as usual, then restart all the MOMs on the
    compute nodes.

    If you are using Torque from supercluster.org, it already has the
    mpiexec patch worked in for versions 1.2.0 or newer. Anything earlier
    than that is probably not stable enough to use, but here's the ancient
    patch for historical reasons:

        cd /usr/local/src/torque-1.1.0p0
        patch -p1 -sNE < /home/pw/src/mpiexec/patch/torque-1.1.0p0-mpiexec.diff

    Thanks to Brett Pemberton of VPAC for generating the patch. No
    guarantees it will work on any particular version of Torque, though.

1a.
    (EXPERIMENTAL) A second patch to PBS is necessary if you would like
    mpiexec jobs to survive across a restart of the pbs_mom using the "-p"
    flag to reattach existing jobs. If you do not plan to kill and restart
    pbs_mom on a node while it has jobs running, do not bother with this
    patch, although it should do no harm. It does four things:

      - Fix coredump resulting from tm_spawn to restarted pbs_mom.
      - Avoid race condition by which pbs_mom would sometimes kill itself
        as tasks exit.
      - Make a restarted pbs_mom search for and report exiting tasks from
        jobs which were started before the old mom was killed.
      - Change response of pbs_mom to various signals. Now the default is
        to leave all jobs running. If you want to stop all jobs, USR1 can
        be used to achieve the old behavior.

    Without this patch, mpiexec will exit with "tm: system error" when the
    new pbs_mom is started with the "-p" argument.

    If you want to experiment with this capability, apply the second patch
    similarly. Be warned that this adds a function to the machine-specific
    code for linux, but no other architectures, thus this entire experiment
    requires linux:

        patch -p1 -sNE < /home/pw/src/mpiexec/patch/pbs-2.3.12-mom-restart.diff

    Note that on my linux redhat 7.3 systems, PBS 2.3.12 will not actually
    compile out of the box without another patch unrelated to mpiexec. Grab
    and apply http://www.osc.edu/~pw/pbs/no-linux-headers.patch if this is
    the case for you, or read about all the patches we use at
    http://www.osc.edu/~pw/pbs/

1b. Old MPICH/P4 only. If you are using an mpich older than 1.2.4, see the
    mpich section below for a necessary patch.

1c. Old MPICH/GM only. WARNING! If you are using an MPICH-GM distribution
    from Myricom that is older than 1.2.4..8, this version of mpiexec will
    not work. Fall back to mpiexec-0.69 or upgrade your mpich-gm.

2.  Run ./configure with the usual configure syntax. Options specific to
    mpiexec are:

    --with-pbs=PATH
        Specify the location of the PBS library.
        Default is /usr/local/pbs, where the Makefile will expect to find
        files lib/libpbs.a and lib/liblog.a containing the TM interface
        functions, and header file include/tm.h.

    --enable-pbspro-helper
        Choose this option if you use PBSPro. That batch system does not
        have the mpiexec patch, and unless you have the source and have
        patched it yourself, you will not get standard IO streams
        redirection. This builds a separate executable that handles the
        redirection for the processes, then starts the parallel code. Do
        not use this option for OpenPBS or Torque. See the man page for
        mpiexec-redir-helper for more information.

    Choosing devices and a default.

    Every version of MPI that mpiexec knows about will be supported in the
    code, but you can choose to disable the ones you don't want. Note that
    no MPI libraries are required, so there is no need to disable an option
    just because you don't currently use it on your system. They're
    harmless.

    You should choose a default, though. If you don't, it will be the first
    non-disabled device listed in the order below. This may very well not
    be what you want, e.g. if you have no Myrinet, you'll either want to
    disable it, or to use, say, --with-default-comm=mpich-p4, otherwise
    your users will always have to specify --comm mpich-p4 at every
    invocation of mpiexec (or use an environment variable).

    To choose a default:

    --with-default-comm=(mpich-gm|mpich-mx|mpich-p4|mpich-ib|mpich-rai|
                         mpich2-pmi|lam|shmem|emp|none)
        If the user does not use the "-comm" argument to mpiexec, and does
        not set the MPIEXEC_COMM environment variable, this named
        communication device will be used. Another effect of this option is
        to guess the name of the MPI library when compiling the test
        program, but users will never see this.

    Rarely used configure options:

    --with-mpicc=PATH
    --with-mpif77=PATH
        Name of mpicc code or script used to compile an mpi program. This
        is only used for the test program and will not affect your mpiexec
        at all.
        Default is "mpicc" which will look in your path for a suitable
        script. Another possible choice would be, for example:
        "--with-mpicc=/home/frog/my-mpich/bin/mpicc". Similar option for
        finding a fortran compiler, again completely optional.

    --with-sed=PATH
        Name of external program to use to implement --transform-hostname.
        This defaults to "sed" whose location is then looked up in the
        current path when configure is run. The exact location must be
        available on the compute nodes when mpiexec runs. You may supply a
        different path or program name here too. If the argument is
        absolute, with a leading '/', it is accepted as given, otherwise it
        is searched for in the current path.

        Crazy example for perl devotees:

            configure --with-sed=perl

        Then at runtime one might do:

            mpiexec --transform-hostname='while (<>) { s/amd/mamd/; print }'

    (EXPERIMENTAL)
    --with-fast-dist=PATH
        Normally mpiexec expects all the compute nodes to share a file
        system where the executable program lives, such as NFS from a
        single server. If this is not the case, it is up to you to move the
        program out to the same location on all the nodes in advance. This
        option lets you use an external program to move the executable to
        the compute nodes with a fast, tree-based algorithm that operates
        natively on InfiniBand. It is extremely quick compared to NFS.

        To enable mpiexec to stage executables, install the code from
        http://www.osc.edu/~dennis/fastdist/ and compile mpiexec to tell it
        where to find the program "fast_dist". If you do not give an
        absolute path for PATH, configure will search for it in your
        current PATH.

    Now for the individual communication libraries, and their options. It
    is quite likely that you will not need to be concerned with any of this
    section.

    MPICH/GM and MPICH/MX

    --disable-mpich-gm
        Disable the use of Myrinet devices using MPICH over GM or MX.
        Default is to support MPICH/GM and MPICH/MX.
        Note that MX is the newer message passing interface from Myricom,
        but it is handled in mpiexec with the same code that does MPICH/GM.

    MPICH/p4

    --disable-mpich-p4
        Disable the use of sockets devices using MPICH with the p4 library.
        This is what people generally use with ethernet hardware. Default
        is to include support for MPICH/p4.

    --disable-p4-shmem
        For SMP machines, specify that MPICH/P4 was compiled without shared
        memory support.

        You must select whether you plan to use shared memory with MPICH/P4
        when you compile the mpich library. To use shared memory, add the
        configure option "--with-comm=shared" when you build mpich. It is
        highly recommended that you enable shared-memory communication in
        this way.

        Then when you configure mpiexec, if you have added that option to
        the mpich build, it is not necessary to do anything. However, if
        you choose NOT to build mpich/p4 to use shared memory, you should
        add the flag "--disable-p4-shmem" here. Note that you must make
        sure that mpich and mpiexec are compatible in this regard or
        applications will not start.

        The mpiexec command-line flags "-mpich-p4-no-shmem" and
        "-mpich-p4-shmem" can be used to specify MPICH/P4 configuration
        information explicitly at runtime, overriding this compile option.

        To summarize, configure lines should match as follows:

            mpich/configure --with-device=ch_p4 --with-comm=shared ...
            mpiexec/configure --with-default-comm=mpich-p4 ...

        Or

            mpich/configure --with-device=ch_p4 ...
            mpiexec/configure --with-default-comm=mpich-p4 --disable-p4-shmem ...

    MPICH/IB

    --disable-mpich-ib
        Disable the ability to start parallel processes compiled against an
        InfiniBand version of MPICH. More information about this device can
        be found at http://nowlab.cis.ohio-state.edu/projects/mpi-iba/.
        This version of mpiexec supports OSU MVAPICH releases 0.9.2 and
        0.9.4 (and likely others) by autodetecting during process startup
        based on a version number in the protocol.
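
    As a cross-check of the MPICH/P4 shared-memory pairing summarized
    above, here is a minimal sketch that prints the mpiexec configure flags
    matching a given mpich configure line. It is not part of mpiexec: the
    function name p4_mpiexec_flags is hypothetical, and it assumes the only
    relevant detail is whether "--with-comm=shared" appears among the mpich
    configure arguments.

```shell
#!/bin/sh
# Hypothetical helper, not shipped with mpiexec: given the arguments that
# were passed to mpich's configure, print the matching mpiexec configure
# flags for the MPICH/P4 device.
p4_mpiexec_flags() {
    case " $* " in
        *" --with-comm=shared "*)
            # mpich/p4 was built with shared memory: nothing extra needed.
            echo "--with-default-comm=mpich-p4"
            ;;
        *)
            # mpich/p4 was built without shared memory: tell mpiexec so.
            echo "--with-default-comm=mpich-p4 --disable-p4-shmem"
            ;;
    esac
}

# Example invocations for the two cases in the summary.
p4_mpiexec_flags --with-device=ch_p4 --with-comm=shared
p4_mpiexec_flags --with-device=ch_p4
```

    If the two builds do end up mismatched, the runtime flags
    "-mpich-p4-shmem" and "-mpich-p4-no-shmem" described above can
    override the compiled-in choice without rebuilding mpiexec.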

    MPICH/RAI

    --disable-mpich-rai
        Disable the code to start parallel processes compiled against the
        Rapid Array Interconnect version of MPICH used by Cray on their XD1
        machines. These are Opteron clusters with custom message passing
        code on an InfiniBand physical-layer transport. The MPICH device
        comes from the MVIA heritage and thus looks a lot like the
        old-style MPICH/IB startup code.

    MPICH2/PMI

    --disable-mpich2-pmi
        Disable the ability to start parallel processes compiled against
        the MPICH2 library PMI process management interface. This mechanism
        is designed to support all underlying communication hardware
        supported by the new MPICH2 library. More information is available
        at http://www-unix.mcs.anl.gov/mpi/mpich2/.

        This code is known to work with the ch3 device in MPICH2, but may
        work with other devices as they become available. When compiling
        ch3, you have a choice of channels. These are known to work as of
        mpich2-1.0.1 and mpiexec-0.78:

            --with-device=ch3:sock
            --with-device=ch3:shm
            --with-device=ch3:ssm

        Unlike with MPICH1, it is not necessary to explain to mpiexec which
        variant you plan to use.

        Note that as of mpich2-1.0.3, MPI_Abort called in one task does not
        try to terminate the entire parallel process. It would be nice if
        the aborting process told the process manager that an abort is in
        progress. This does happen in mpich1/gm, mpich1/ib, and partially
        in mpich1/p4. Instead, in mpich2, the processes not calling
        MPI_Abort will exit only if they happen to try to communicate with
        the aborting process. Watch PMI_Abort() in
        mpich2/src/pmi/simple/simple_pmi.c to see if they ever add this
        functionality, at which time we can add support to mpiexec.

    LAM

    --disable-lam
        Disable the use of the LAM device. There really isn't any code in
        here specific to LAM, as mpiexec is used only to start up the lamd
        on each node, and it spawns the actual user applications. The LAM
        device acts exactly like the "none" device.
        There are more notes on LAM at the bottom of this file, and in
        README.lam.

    SHMEM

    --disable-shmem
        Disable the use of the SHMEM device. The SHMEM device is only used
        on single-node configurations, like for large SMPs. There is no
        support for ethernet or any other out-of-box communication. The
        options above about shmem under the P4 and GM sections are not
        related to this SHMEM device, but rather sub-drivers in the P4 and
        GM drivers, respectively. If you have just one big Sun or HP SMP
        machine, for example, or some other single node multi-processor
        box, you will want to use the SHMEM device.

    EMP

    --disable-emp
        Disable the use of the EMP device. The procedure to start up an EMP
        job is much like that of GM, without the need for a globally
        readable configuration file. More information about EMP is
        available at http://www.osc.edu/~pw/emp/.

    NONE

    --disable-none
        This communication layer does not set anything in the environment,
        or build any configuration files. Handy if you want to run
        something on each processor of your job allocation without wanting
        mpiexec to bother to build an environment for it.

3.  Build it:

        make

    Note that GNU make is required. It may be called "gmake" on your
    system.

4.  Run the tests. (OPTIONAL)

    You'll need a working MPICH of some flavor to build the hello test
    program. The default compiler used for this task is "mpicc" unless you
    have configured with the "--with-mpicc" switch.

        make hello

    After compiling, be sure to take a look at the script "runtests.pl",
    especially the comments towards the top where there are some
    configurable items. Then run it:

        ./runtests.pl

    It invokes the batch system once for each of about 50 tests. Each of
    these creates many little files:

        testqs.*  - PBS job scripts submitted with qsub
        testqo.*  - PBS joined stdout/stderr
        testho.*  - mpiexec joined stdout/stderr
        testc.*   - config file passed to mpiexec with "-config" flag

    Successful tests will show the one-line qsub output and print dots
    until the test is complete.
    Unsuccessful runs might say "Got 7 lines in ..., expected 8" or
    "Unexpected line: ...", in which case you may want to investigate the
    relevant output files. Expect the "-segv" tests to generate some
    unexpected lines which vary depending on the communication library.
    Also, the shell tests can cause some problems depending on what you
    have in /etc/shells and /etc/profile.d, etc.

    When done, rm test* to clean up. There's no need to look at the
    successful output files unless you're curious what happened.

5.  Install:

        make install

    This puts the executable in /usr/local/bin (or <prefix>/bin if you have
    told configure otherwise using, e.g. --prefix), and a man page in
    /usr/local/man/man1. You may need to be root to do this.

6.  Cleanup:

        make clean

    Or "make distclean" if you want to zap the config.* output files too.


Concurrent tests
----------------

The concurrent mpiexec feature is described in the man page. It allows
running multiple independent parallel programs in the same batch job. Each
parallel program has its own invocation of mpiexec, with all the subsequent
ones relying on the first for communication with PBS (as required per
limitations in the TM library).

It creates a directory /tmp/mpiexec-sock with permissions 01777 and a
separate subdirectory in the format with permissions 00700 under that, one
for each user. Named pipes of the form . are used for communication in the
context of a single PBS job. It tries to clean up after itself but will
handle gracefully the case where any of the directories or files still
exist.

You can test this concurrent code by starting an interactive job with a
bunch of processors, and inside the shell of the job, run "./contests.pl".
It needs the "hello" program to exist just like runtests.pl. You'll see a
bunch of dots indicating each invocation successfully running, for some
large number of these in parallel. If there is any output text, try to
figure out the problem and send mail to the list if there seems to be a
bug.


Problems?
---------

Here are some notes collected from solving various installation and usage
problems with mpiexec, organized into a FAQ format.

1.  Does mpiexec work with OpenPBS 2.4?

    There is no OpenPBS 2.4. Veridian changed the code in 2.3.16 so that it
    claims to be "OpenPBS_2.4". Type "l s" at a qmgr prompt to see this.
    The code is still 2.3.16 in spirit since it is hardly different from
    2.3.15 or the last couple years of earlier versions for that matter.

2.  The configure script can't find my PBS library, but I gave it the
    correct path.

    You probably need to compile mpiexec using whatever compiler you used
    to build PBS, otherwise some symbols may not be defined. This will show
    up as configure complaining "PBS library not found ...". Check
    config.log to verify if it really was not found, or if you chose a
    different compiler. Override the compiler choice at configure time by
    setting the environment variables CC and CFLAGS. You can run
    "bash -x ./configure ..." to see everything it does to try to figure
    out what's wrong.

3.  Mpiexec exits immediately with the message
    "mpiexec: Error: get_hosts: tm_init: tm: system error".

    This is the very first line in the code where mpiexec attempts to talk
    to the local PBS mom. Lots of things can go wrong so that PBS will not
    let that happen.

    One problem could be that name resolution is not working correctly. You
    need to have entries in /etc/hosts (or a working DNS resolver) for both
    localhost and for your PBS server, like this:

        127.0.0.1   localhost
        10.0.0.254  front-end fe  # pbs server

    Other variations might work too. On the server, you probably need hosts
    entries for all the other nodes, too, but I suspect you'd notice
    something else broken before mpiexec. Don't forget to restart pbs_mom
    or pbs_server as appropriate after changing a system configuration file
    like /etc/hosts.

4.  Are there any debugging tools to figure out why the entire mess does
    not work? Especially this confusing "system error" message?

    There are lots of bits that must cooperate to run a parallel job: PBS
    server, PBS mother superior, other PBS moms, mpiexec, mpich library,
    and your application code. It's tough to figure out where the fault
    lies when something fails.

    PBS problems are frequently logged. See on the mother superior node
    (the compute node which holds process #0 of your parallel job) the file

        /var/spool/pbs/mom_logs/20021030

    or whatever the date is today. On the PBS server machine, you'll find
    log messages in

        /var/spool/pbs/server_logs/20021030

    If you install into a different location you'll have to change the path
    prefix, of course.

    The "big hammer" of debugging tools here is strace. If mpiexec
    complains when talking to the PBS mom, grab the mpiexec with an strace
    and watch what it's doing right before it prints out the error message:

        strace -vfF -s 400 -o /tmp/strace.mpiexec.out mpiexec myjob

    Look through the output file for the error message, then back up a few
    lines and try to guess what went wrong. If it looks harmless, maybe the
    PBS mom is causing the problem. As root, find the pid of the pbs_mom on
    the node, then attach to it with strace in a different terminal
    session:

        strace -vfF -s 400 -o /tmp/strace.mom.out -p <pid>

    then start your job and watch what happens.

5.  When I do "mpiexec