LAM/MPI mpiexec support

Ben Webb      28 May 2002   Original 6.6b1 patch and this document
Mark Hartner  25 May 2003   Port to lam 6.5.8

These patches:

    lam-mpiexec-tm-6.6b1.patch
    lam-mpiexec-6.5.8.patch

if applied to the 6.6beta1 or 6.5.8 release of LAM/MPI respectively,
add a switch --with-mpiexec to configure. If it is specified, LAM's
lamboot utility will use mpiexec to start up the LAM system, rather
than rsh. (You need to run autoheader and autoconf after applying this
patch.)

If using LAM from CVS, grab instead:

    http://bellatrix.pcl.ox.ac.uk/~ben/pbs/lam-mpiexec-tm-cvs.patch

Expect the latter patch to change frequently as the LAM CVS is updated.
The LAM team is working on a generic system services model which will
add support for multiple methods of booting, and include this patch or
a variation.

Notes on the design:

I took this route because, while I think it would be possible to have
mpiexec do the lamboot / mpirun / lamhalt sequence itself, you would
lose the extra functionality that using LAM's tools gives you. (But I
see no reason why this support could not be added to mpiexec's
-comm=lam in the future.)

The patch is non-destructive; the code is all #ifdef'd properly, so the
original LAM behaviour can be recovered by not defining
LAM_WITH_MPIEXEC, and it adds a suitable configure test to detect an
installed mpiexec.

Basically, to use LAM at present with rsh/ssh/etc. you first run
"lamboot", which sets up a lamd daemon on each node. Then you use
mpirun to run your jobs, and this talks to the lamd daemons. When
you're done, you use "lamhalt" to shut down the lamds. Both mpirun and
lamhalt use the network of lamd daemons to do their business, so they
do not need rsh/ssh or mpiexec; it's only lamboot that requires
modification.

Currently, lamboot does the following:-

- Set up a listening TCP socket on node 0
- rsh to each node:
    - run "hboot" on it, telling it the hostname of node 0, and the
      listening port number
    - hboot in turn spawns the "lamd" daemon, which sets up another TCP
      listening socket, and connects back to node 0
    - lamboot then accepts the connection from lamd, and receives the
      port number of lamd's listening socket
- Once all nodes have been contacted, lamboot's TCP socket is closed
- lamboot contacts each lamd via the port received earlier, and tells
  each one the numbers of the listening ports on every other node
- lamboot's job is now done; the lamd daemons now have full
  connectivity, and take over from here.

Essentially, my patch changes this behaviour to:-

- Set up N listening sockets, one for each node in the cluster
- Create an mpiexec configuration file with the necessary hboot
  commands for each node
- Fork and run mpiexec in the background, passing it the configuration
  file
- Accept connections from each lamd, and receive a port number from
  each
- Contact each lamd in the same way as before.

I have hacked hboot and lamd such that they do not daemonise, so the
spawned mpiexec lasts for the duration of the job.

Once lamboot has completed, lamnodes, mpirun, etc. should work as
normal. When the job completes, PBS will kill the mpiexec process and
thus the spawned lamds, although you can do this the "proper" way, by
running lamhalt, which will kill every lamd and thus prompt mpiexec to
exit.

"wipe" is very simple; it just runs "tkill" on each node to kill off
the lamd process. I don't think you should ever need to do this, as
just killing mpiexec should kill the spawned lamds anyway. The patch
does include code to use mpiexec to run the tkill commands, but it
won't actually work in practice because MOM won't let mpiexec connect
to TM twice (and it'll already be connected once, for the lamboot
call).

I haven't touched recon, lamgrow, or lamshrink, so these won't work.
I don't think they'd be too difficult to fix though, if people really
wanted them.

I've hacked lamboot so that the default boot schema is the PBS
nodefile, so running a LAM/MPI job via PBS and mpiexec should be as
simple as putting the following in a PBS script:-

    lamboot
    mpirun C /path/to/mpi/binary
    lamhalt

CAVEATS:

- This patch is not 100% perfect yet. Obviously.
- My error handling isn't very robust, so if mpiexec isn't installed,
  or you feed it garbage, bad things will happen.
- It'd be nicer if mpiexec could read a configuration file from stdin,
  so that I didn't have to mess around with temporary files.

Any suggestions for improvements to this patch, or comments,
welcomed...
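
Illustration of the connect-back handshake:

The patched boot sequence described above (N listening sockets, each
lamd connecting back to report its own port, then lamboot sending every
lamd the full list of ports) can be sketched in a few lines of Python.
This is only a toy model under loud assumptions: it is not LAM's code
and not LAM's wire format. The names boot, fake_lamd and recv_exact are
made up for the example, the "message format" (16-bit port numbers in
network byte order) is invented, everything runs on 127.0.0.1, and
threads stand in for the hboot/lamd processes that mpiexec would start.

```python
import socket
import struct
import threading


def recv_exact(sock, n):
    """Read exactly n bytes from sock, or raise if the peer closes early."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed before sending all data")
        buf += chunk
    return buf


def fake_lamd(boot_host, boot_port, tables):
    """Hypothetical stand-in for hboot/lamd on one node."""
    # Open our own listening socket before reporting back.
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS choose
    listener.listen(1)
    my_port = listener.getsockname()[1]

    # Connect back to the boot process and report our port number.
    with socket.create_connection((boot_host, boot_port)) as s:
        s.sendall(struct.pack("!H", my_port))

    # Wait for the boot process to send the ports of all nodes.
    conn, _ = listener.accept()
    with conn:
        (count,) = struct.unpack("!H", recv_exact(conn, 2))
        ports = struct.unpack("!%dH" % count, recv_exact(conn, 2 * count))
    listener.close()
    tables[my_port] = list(ports)


def boot(n_nodes):
    """Toy version of the patched lamboot flow for n_nodes nodes."""
    # One listening socket per node.
    listeners = []
    for _ in range(n_nodes):
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock.bind(("127.0.0.1", 0))
        sock.listen(1)
        listeners.append(sock)

    # Threads stand in for "fork and run mpiexec in the background".
    tables = {}
    workers = [
        threading.Thread(
            target=fake_lamd,
            args=("127.0.0.1", sock.getsockname()[1], tables),
        )
        for sock in listeners
    ]
    for w in workers:
        w.start()

    # Accept one connection per node and receive its port number.
    ports = []
    for sock in listeners:
        conn, _ = sock.accept()
        with conn:
            ports.append(struct.unpack("!H", recv_exact(conn, 2))[0])
        sock.close()

    # Contact each daemon on its reported port and send the full list.
    msg = struct.pack("!H", len(ports))
    msg += struct.pack("!%dH" % len(ports), *ports)
    for port in ports:
        with socket.create_connection(("127.0.0.1", port)) as s:
            s.sendall(msg)

    for w in workers:
        w.join()
    return ports, tables


if __name__ == "__main__":
    ports, tables = boot(3)
    print("ports reported by the fake daemons:", ports)
    print("every daemon got the full list:",
          all(t == ports for t in tables.values()))
```

The only point of the sketch is the ordering: each daemon must be
listening before it reports its port, and the boot process must have
heard from every daemon before it distributes the list.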