SYNOPSIS
mpiexec [OPTION]... executable [args]...
mpiexec [OPTION]... -config configfile
mpiexec -server
DESCRIPTION
Mpiexec is a replacement program for the script mpirun, which is part
of the mpich package. It is used to initialize a parallel job from
within a PBS batch or interactive environment. It further generates
the environment variables and configuration files necessary to
initialize a parallel program for the appropriate MPI message-passing
library.
Mpiexec uses the task manager library, tm(3B), of PBS(1B) to spawn
copies of the executable on all the nodes in a PBS allocation. It is
almost functionally equivalent to
rsh node "cd $cwd; exec executable arguments",
using the current working directory from where mpiexec is invoked, and
the shell specified in the environment, or from the password file.
The standard input of the mpiexec process is forwarded to task number
zero in the parallel job, allowing for use of the construct
mpiexec mycode < inputfile
This behavior can be modified using the -nostdin or -allstdin flags.
Standard output and error are also forwarded to mpiexec, allowing redi-
rection of the outputs of all processes. This can be turned off using
-nostdout so that the standard output and error streams go through the
normal PBS mechanisms, to the batch job output files, or to your termi-
nal in the case of an interactive job. See qsub(1) for more informa-
tion.
OPTIONS
All options may be introduced using either a single dash, or double
dashes as are common in most GNU utilities. Options may be shortened
as long as they remain unambiguous. Options that require arguments may
appear as separate words in the argument list, or they may be separated
from the option by an equals sign.
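The abbreviation rule can be illustrated with a small shell sketch. The
helper expand_opt below is hypothetical and is not how mpiexec parses
its arguments; it only checks a prefix against the option names that
appear in this page, and it assumes that an exact match takes priority
over longer names sharing the same prefix.

```shell
# Option names documented in this page (without the leading dash).
MPIEXEC_OPTS="n verbose nostdin allstdin nostdout comm mpich-p4-shmem
mpich-p4-no-shmem pernode npernode nolocal transform-hostname gige tv
totalview kill config version server"

# Hypothetical helper: print the full option name for an abbreviation,
# or fail if the abbreviation is unknown or ambiguous.
expand_opt() {
    prefix=$1
    matches=""
    count=0
    for opt in $MPIEXEC_OPTS; do
        # Assumption: an exact match always wins (so -n stays -n).
        if [ "$opt" = "$prefix" ]; then
            echo "$opt"
            return 0
        fi
        case $opt in
            "$prefix"*)
                matches="$matches $opt"
                count=$((count + 1))
                ;;
        esac
    done
    if [ "$count" -eq 1 ]; then
        echo "${matches# }"
        return 0
    fi
    return 1
}
```

With this list, conf and verb each identify a single option, while co
is rejected because it could mean either -comm or -config.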
-n numproc
Use only the specified number of processes. Default is to use
all that were provided in the PBS environment.
-verbose
Talk more about what mpiexec is doing.
-nostdout
Do not forward the standard output and error streams of each
process back to the mpiexec process. Output on these streams
will go through the normal PBS mechanisms instead, to wit: files
of the form job.ojobid and job.ejobid for batch jobs, and
directly to the controlling terminal for interactive jobs.
-comm type
Specify the communication library used by your code. Each MPI
library has different mechanisms for starting all the processes
of a parallel job, thus you must specify to mpiexec which library
you use so that it can set up the environment of the processes
correctly. The argument type must be one of: mpich-gm, mpich-
mx, mpich-p4, mpich-ib, mpich-rai, mpich2-pmi, lam, shmem, emp,
none; although the code may not have been compiled with support
for some of those. If this argument is not specified, mpiexec
will look for the environment variable MPIEXEC_COMM which could
specify one of those arguments. If this fails, the compiled-in
default communication library is chosen.
-mpich-p4-shmem
-mpich-p4-no-shmem
The MPICH/P4 library may be configured either to support shared
memory within a multiprocessor node or not. It is necessary that
mpiexec know in which way the library was configured to success-
fully start jobs. While this is generally chosen at compile time
using the --disable-p4-shmem configure flag, it is possible to
choose explicitly at runtime with one of these flags.
-pernode (SMP only)
Allocate only one process per compute node. For SMP nodes, only
one processor will be allocated a job. This flag is used to
implement multiple level parallelism with MPI between nodes, and
threads within a node, assuming the code is set up to do that.
-npernode nprocs (SMP only)
Allocate no more than nprocs processes per compute node. This is
a generalization of the -pernode flag that can be used to place,
for example, two tasks on each 4-way SMP.
-nolocal (not MPICH/P4)
Do not run any MPI processes on the local compute node. In a
batch job, one of the machines allocated to run a parallel job
will run the batch script and thus invoke mpiexec. Normally it
participates in running the parallel application, but this
option disables that for special situations where that node is
needed for other processing.
-transform-hostname sed_expression
Use an alternate hostname for message passing. Processes will be
spawned using a separate hostname for their message passing com-
munications. This is necessary if you use, say, one ethernet
card for PBS hostnames, and another ethernet card for message
expected to be used only by power users at sites with complex
network setups.
-gige This option is deprecated, but still accepted and synonymous to
the preferred option -transform-hostname=s/node/gige/.
-tv, -totalview
Debug using totalview. The process on node zero attempts to open
an X window to $DISPLAY, and all processes are attached by
totalview threads. See totalview(1) for more information.
-kill If any one of the processes dies, wait a little, then kill all
the other processes in the parallel job. Your message passing
library should handle this for you in most circumstances.
-config configfile
Process executable and arguments are specified in the given con-
figuration file. This flag permits the use of heterogeneous jobs
using multiple executables, architectures, and command line argu-
ments. No executable is given on the command line when using the
-config flag. If configfile is "-", then the configuration is
read from standard input. In this case the flag -nostdin is
mandatory, as it is not possible to separate the contents of the
configuration file from process input.
-version
Display the mpiexec version number and configure arguments.
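The effect of a -transform-hostname expression can be previewed with
sed itself. This sketch assumes that mpiexec applies the expression to
each hostname the same way sed would; transform_hostname is a
hypothetical helper and node03 is a made-up hostname.

```shell
# Hypothetical helper: show the message-passing hostname that a given
# sed expression would produce from a PBS hostname.
#   $1  sed expression, as passed to -transform-hostname
#   $2  hostname as known to PBS
transform_hostname() {
    printf '%s\n' "$2" | sed -e "$1"
}
```

For example, transform_hostname 's/node/gige/' node03 prints gige03,
which is the substitution named in the description of -gige.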
MPI LIBRARY OPTIONS
Different MPI libraries may support tuning options which can change
their behavior or performance. Mpiexec does not explicitly support
these, but it does pass along the environment variables used to set the
options. For example, MPICH/GM has an option to set the maximum size
for "eager" (as opposed to rendez-vous) messages. In sh or bash, this
can be set with:
GMPI_EAGER=16384 mpiexec mycode
or in csh or tcsh:
setenv GMPI_EAGER 16384
mpiexec mycode
Other options can be found in the MPI documentation, such as
GMPI_SHMEM, GMPI_RECV, P4_SOCKBUFSIZE and P4_GLOBMEMSIZE.
Although not an MPI library implementation, the "none" communication
device can be handy for running many copies of the same serial program.
Programs spawned with this device are provided an extra environment
variable, MPIEXEC_RANK, which they can use to generate a unique identi-
fier in the context of the pseudo-parallel job.
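A wrapper script can use MPIEXEC_RANK to give each copy its own files.
The sketch below is illustrative only: the file naming scheme is an
assumption, and the fallback to 0 merely lets the function be tried by
hand outside mpiexec.

```shell
# Hypothetical helper for a script started with: mpiexec -comm none ...
# Prints the input and output file names for this copy of the program.
rank_files() {
    # MPIEXEC_RANK is provided by mpiexec for the "none" device; the
    # default of 0 is only a convenience for testing by hand.
    rank=${MPIEXEC_RANK:-0}
    echo "input.$rank output.$rank"
}
```

A wrapper could read the two names from rank_files and then run its
serial program with standard input and output redirected to them.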
A node specification is a space-separated list of hostnames. Each
element in the list is interpreted as a case-insensitive standard shell
wildcard pattern (see glob(7) and fnmatch(3)), and so may match multiple
hostnames. It is not an error to specify nodes in the node specification
that are not actually part of the PBS allocation. This allows a
single generic configuration file to be used in multiple situations.
Config file example
node03 node04 node1* : myexe -s 4
-n 5 : otherexe -f 2 -large
If processors are available on the nodes, run the code myexe on
node03, node04, and any machine with a hostname matching node1*.
Pick up to five other nodes on which to run otherexe, depending
on availability and any -n arguments.
Note that each node listed in a node specification is chosen only once
to run a given process. If using multiprocessor nodes, and you do want
to run two or more copies of the code on a given node, list that node
twice in the line, or duplicate the config file entry. Also note that
node-anonymous specifications (e.g., -n 6) may choose other processors
on a node that already has processes assigned; use the -pernode flag on
the command line if you want node-exclusive behavior.
There is no way to run more than one process per processor using
mpiexec. You must explicitly spawn threads in your code if you wish to
do this. The presence of a -n argument on the command line limits the
total number of processors available to the configuration file selec-
tion process, just as the flag -pernode limits the available nodes.
It is not an error if some lines in the configuration file can not be
satisfied with the available nodes. If, however, a -n <numproc> argu-
ment requests more than can be satisfied, or if no tasks could be allo-
cated, an error is reported.
Finally, the order of lines in the configuration file is the same as
the order of tasks in the MPI sense when the process is started. Com-
ments starting with '#' to the end of the line are ignored anywhere
they appear in the configuration file.
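The points above about repeated nodes, comments, and line order can be
combined in one small configuration file. The sketch below writes such
a file from a batch script; write_config is a hypothetical helper, and
the host and program names are placeholders.

```shell
# Hypothetical helper: write an example configuration file to $1.
write_config() {
    cat > "$1" <<'EOF'
# Tasks are numbered in the order in which the lines appear.
node03 node03 : myexe -s 4    # node listed twice: two copies on node03
-n 2 : otherexe -f 2          # two more tasks wherever processors remain
EOF
}

# Typical use inside a PBS job (not executed here):
#   write_config my.config
#   mpiexec -config my.config
```

If both lines can be satisfied, the copies of myexe come first in the
MPI task ordering, followed by the copies of otherexe.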
CONCURRENT MPIEXEC
You can invoke mpiexec multiple times in the same batch job, one
after the other, sequentially. But you can also run multiple mpiexecs
in the same batch job concurrently. In a 10-node PBS allocation, for
example:
mpiexec -n 5 a.out args1 < input1 > output1 &
mpiexec -n 5 a.out args2 < input2 > output2 &
wait
Finally, since only one mpiexec can be the master at a time, if your
code setup requires that mpiexec exit to get a result, you can start a
"dummy" mpiexec first in your batch job:
mpiexec -server
It runs no tasks itself but handles the connections of other transient
mpiexec clients. It will shut down cleanly when the batch job exits or
you may kill the server explicitly. If the server is killed with
SIGTERM (or HUP or INT), it will exit with a status of zero if there
were no clients connected at the time. If there were still clients
using the server, the server will kill all their tasks, disconnect from
the clients, and exit with status 1.
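A batch script might use the server along the following lines. This is
a sketch under assumptions: run_cases is a hypothetical helper, a.out
and its arguments are placeholders, and the script does not check that
the server is ready before the first client starts.

```shell
# Hypothetical helper: run one client per argument under a server and
# print how many clients returned a non-zero status.
run_cases() {
    mpiexec -server &            # runs no tasks, serves the clients
    server_pid=$!

    pids=""
    for c in "$@"; do
        mpiexec -n 2 a.out "$c" &
        pids="$pids $!"
    done

    failed=0
    for p in $pids; do
        # Each client exits on its own, so its status can be collected.
        wait "$p" || failed=$((failed + 1))
    done

    # All clients are done; stop the server with SIGTERM.
    kill "$server_pid"
    server_status=0
    wait "$server_pid" || server_status=$?
    echo "$failed"
}
```

Because the clients have all finished before the server is signaled,
the server is expected to report a status of zero, as described above.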
If you are using mpich/p4, be aware that limitations in the mpich/p4
library restrict all task zeros to be on the same node as the mpiexec
process itself, hence concurrency is severely limited. You can use
-pernode to permit one concurrent job for each CPU in the node, though.
EXAMPLES
mpiexec a.out
Run the executable a.out as a parallel MPI code on each process
allocated by PBS.
mpiexec -n 2 a.out -b 4
Run the code with arguments -b 4 on only two processors.
mpiexec -pernode -conf my.config
Run only one process on each node, using the nodes and executa-
bles listed in the configuration file my.config.
mpiexec mycode >out 2>err
Using a sh-compatible shell, send the standard output of all
processes to the file out, and the standard error to err.
mpiexec mycode >& output
Using a csh-compatible shell, combine the standard output and
error streams of all processes to the file output.
mpiexec mycode | sort > output
Sort the output of the processes. Standard error will appear as
the standard error of the mpiexec process.
mpiexec -comm none -pernode mkdir /tmp/my-temp-dir
Run the standard unix command mkdir on each of the SMP nodes in
your PBS node allocation for this job.
mpiexec -comm mpich-p4 mycode-p4
Run a code compiled using MPICH/P4, even though your system
administrator has chosen MPICH/GM as a default.
To specify a default communication library, the variable MPIEXEC_COMM
may be set to one of the accepted values for -comm as documented above.
The command-line argument takes precedence over the environment vari-
able, and if neither is set, the compiled-in default is used.
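The precedence just described can be written out as a short shell
function. It mirrors the documented order only and is not taken from
the mpiexec source; effective_comm is a hypothetical helper, and
COMPILED_DEFAULT stands in for whatever default the local build uses.

```shell
# Assumption for illustration: the site built mpiexec with this default.
COMPILED_DEFAULT=mpich-gm

# Hypothetical helper: print the communication library that would be
# selected. $1 is the value given with -comm, or empty if none.
effective_comm() {
    cli_comm=$1
    if [ -n "$cli_comm" ]; then
        echo "$cli_comm"              # command line wins
    elif [ -n "${MPIEXEC_COMM:-}" ]; then
        echo "$MPIEXEC_COMM"          # then the environment variable
    else
        echo "$COMPILED_DEFAULT"      # then the compiled-in default
    fi
}
```

With MPIEXEC_COMM set to lam, effective_comm "" prints lam, while
effective_comm mpich-p4 prints mpich-p4.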
Note that mpiexec does pass all variables in the environment which it
was given, but PBS will not copy your entire environment for batch jobs
at job submission time unless you invoke qsub with the -V argument.
DIAGNOSTICS
mpiexec: Warning: tasks <tasknum>,... exited with status <exitval>.
One or more of the tasks in the parallel process exited with a
non-zero exit status. This is the value a program returns to its
environment when it finishes, either with "return exitval" or
"exit(exitval)", or in FORTRAN, "STOP exitval". Tradition holds
that a program which terminates correctly should return zero, and
hence mpiexec warns if it sees otherwise. Due to race conditions
inherent in the TM interface, sometimes mpiexec will report an
exit value of zero even though it was actually otherwise.
mpiexec: Warning: task <tasknum> died with signal <signum>
One of the tasks in the parallel process exited due to receipt of
an uncaught signal. The symbolic names of signal numbers can be
listed with "kill -l". Common ones are SIGSEGV (11) and SIGBUS
(7), both of which generally indicate a program error. Others,
SIGINT (2), SIGKILL (9), and SIGTERM (15), may occur when the
task is killed or interrupted externally.
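When scanning job output, the signal number in such a warning can be
turned into a name with kill -l. The sketch below assumes the warning
has the form shown above; signal_from_warning is a hypothetical helper,
and the names printed depend on the local system.

```shell
# Hypothetical helper: given one line of mpiexec output, print the name
# of the signal mentioned in a "died with signal N" warning.
signal_from_warning() {
    # Extract the number that follows "died with signal".
    signum=$(printf '%s\n' "$1" |
        sed -n 's/.*died with signal \([0-9][0-9]*\).*/\1/p')
    # No number found: not a signal warning, so report failure.
    [ -n "$signum" ] && kill -l "$signum"
}
```

Lines that are not signal warnings produce no output and a non-zero
return status.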
ERRORS
tm: not connected
A fatal error occurred in communications between the mpiexec
process and the local pbs_mom. This might occur due to bugs in
pbs_mom, and is not recoverable.
mpiexec: Error: PBS_JOBID not set in environment. Code must be run
from a PBS script, perhaps interactively using "qsub -I".
It is not possible to run mpiexec unless you are within a PBS
environment, either created in a batch or interactive PBS job.
See the man page for qsub on how to submit a job.
EXIT VALUE
Mpiexec returns to its environment the exit status of process number
zero in a parallel task. With this, scripts which use mpiexec can
access the return value of the parallel program. If task zero exited
with a signal, as opposed to naturally with STOP or exit(), mpiexec
returns 256 + signum, where signum is the signal that killed task zero.
This is a convention inherited from PBS.