Wednesday, May 28, 2008

MPI with Python on EC2

For such a seemingly straightforward task, setting up MPI to run on EC2 with Python wrappers (I *really* don't feel like writing glue code in C for lightly communicating processes), I struggled for way too long to get it to actually work.

So... the magic incantation is:


mpdboot -n 5 -f mpd.hosts

python /usr/local/bin/mpirun.py -n 5 pyMPI -c "import os; import mpi; print(mpi.rank); os.system('hostname')"


This assumes you have made an mpd.hosts file listing the internal IP addresses of 5 running instances, each with MPICH2 and pyMPI installed, and that you have opened all your ports between them. You should see the numbers 0 through 4 (the MPI ranks) as well as the machine names. Your mileage may vary - I clearly suck at this.
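If you'd rather generate the hosts file and the two commands than type them by hand, here is a minimal Python 3 sketch. The helper names (format_mpd_hosts, launch_commands) and the 10.0.0.x addresses are made up for illustration; the commands themselves are the ones from this post. It only builds and prints the command lines, it does not run them.

```python
import shlex


def format_mpd_hosts(ips):
    """Return mpd.hosts content: one host (here an internal IP) per line."""
    return "\n".join(ips) + "\n"


def launch_commands(n, hosts_path="mpd.hosts"):
    """Build the mpdboot and mpirun.py command lines from the post for n processes."""
    snippet = "import os; import mpi; print(mpi.rank); os.system('hostname')"
    boot = ["mpdboot", "-n", str(n), "-f", hosts_path]
    run = ["python", "/usr/local/bin/mpirun.py", "-n", str(n),
           "pyMPI", "-c", snippet]
    return boot, run


# Placeholder addresses; substitute the internal IPs of your own instances.
example_ips = ["10.0.0.%d" % i for i in range(1, 6)]
print(format_mpd_hosts(example_ips), end="")
for cmd in launch_commands(len(example_ips)):
    # shlex.quote keeps the -c snippet intact when pasted into a shell.
    print(" ".join(shlex.quote(arg) for arg in cmd))
```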

Tip: establishing connections between machines is very slow the first one or two times around. I suggest running a script, before MPI or anything else, that reads your mpd.hosts file and SSHs, in both directions, between all pairs of instances. Twice. Running it periodically during your job may help too, but I'm not at that point yet in my development.
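A rough sketch of such a warm-up script, in Python 3. This is not the script I actually used, just one way to do it under a few assumptions: you can SSH from the machine running it to every instance as root without a password, each instance can do the same to every other, and mpd.hosts has one host per line (optionally followed by ":ncpus"). The function names are made up for illustration.

```python
import itertools
import subprocess

# Non-interactive ssh options so an unreachable host fails fast instead of hanging.
SSH_OPTS = ["-o", "BatchMode=yes",
            "-o", "StrictHostKeyChecking=no",
            "-o", "ConnectTimeout=10"]


def read_hosts(text):
    """Parse mpd.hosts content, dropping blank lines and any ':ncpus' suffix."""
    hosts = []
    for line in text.splitlines():
        fields = line.split()
        if fields:
            hosts.append(fields[0].split(":", 1)[0])
    return hosts


def warmup_commands(hosts, user="root"):
    """One command per ordered pair (a, b): ssh to a, and from there ssh to b.

    Ordered pairs cover both directions between every two instances.
    """
    cmds = []
    for a, b in itertools.permutations(hosts, 2):
        inner = "ssh %s %s@%s true" % (" ".join(SSH_OPTS), user, b)
        cmds.append(["ssh"] + SSH_OPTS + ["%s@%s" % (user, a), inner])
    return cmds


def warm_up(hosts_path="mpd.hosts", rounds=2):
    """Run every pairwise connection `rounds` times; return the commands that failed."""
    with open(hosts_path) as f:
        hosts = read_hosts(f.read())
    failures = []
    for _ in range(rounds):
        for cmd in warmup_commands(hosts):
            if subprocess.call(cmd) != 0:
                failures.append(cmd)
    return failures
```

Calling warm_up() before mpdboot does the "twice" part by default. With 5 instances that is 20 ordered pairs per round, run one after another, so expect it to take a little while on the first pass.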

3 comments:

Peter said...

I found the secret to avoiding a lot of MPI errors on EC2, but haven't found time to do an additional post...

The secret seems to be that just because Amazon says an instance is "running" doesn't mean its ssh daemon is available yet. This caused all kinds of intermittent problems setting up the hosts, and my old scripts would fail silently.

In my current codebase, I do some checks like the following:

# Python 2; BOOTING_INSTANCE and SSH_OPTS are defined earlier in the script
import commands
import time

print "Instance is %s" % BOOTING_INSTANCE

# wait for instance description to return "running" and grab HOSTNAME variable
print "Polling server status (ec2-describe-instances %s)" % BOOTING_INSTANCE
while 1:
    print "waiting for instance to boot..."
    HOSTNAME = commands.getoutput("ec2-describe-instances %s | grep running | awk '{print $4}'" % BOOTING_INSTANCE)
    if len(HOSTNAME) > 1:
        print "-------Instance booted, The server is available at %s" % HOSTNAME
        DOM_NAME = commands.getoutput("ec2-describe-instances %s | grep running | awk '{print $5}'" % BOOTING_INSTANCE).split('.')[0]
        break
    time.sleep(1)

# sometimes it takes a while for the ssh service to start, even when the ec2 api describes an instance as running.
# A machine in the "running" state may not have finished booting. Try executing a no-op command until a valid response is found
print "verifying ssh daemon has started..."
ec2_launch_failed = False  # must be set before the check at the bottom
counter = 0
while 1:
    print "Waiting for ssh daemon to start..."
    counter += 1
    REPLY = commands.getoutput('''ssh %s "root@%s" 'echo "hello"' ''' % (SSH_OPTS, HOSTNAME))
    if REPLY == 'hello':
        print "-------ssh has started, proceeding with AMI build"
        break
    if counter > 24:
        print "Instance not responding to SSH hails, aborting..."
        # sshd should not take more than 2 minutes to launch (24 tries, 5 seconds apart)
        terminate_status = commands.getoutput('ec2-terminate-instances %s' % BOOTING_INSTANCE)
        ec2_launch_failed = True
        print "Base Instance terminated"
        break
    time.sleep(5)

# this excerpt lives inside a function, hence the return
if ec2_launch_failed:
    print "Aborting build"
    return
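For anyone on Python 3, where the commands module no longer exists, the same wait-then-give-up idea can be sketched with a plain TCP check against port 22. Note this is a different technique from the code above: it only shows that something is listening on the SSH port, not that a login would succeed, so treat it as a cheaper first gate rather than a replacement. The function name and the defaults (24 attempts, 5 seconds apart, matching the two-minute budget above) are assumptions for illustration.

```python
import socket
import time


def wait_for_port(host, port=22, attempts=24, delay=5.0, timeout=3.0):
    """Return True once a TCP connection to host:port succeeds, else False.

    Tries up to `attempts` times, sleeping `delay` seconds between failures.
    """
    for attempt in range(attempts):
        try:
            # create_connection raises OSError on refusal, timeout, or no route.
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            if attempt + 1 < attempts:
                time.sleep(delay)
    return False
```

A caller would terminate the instance and abort when wait_for_port(HOSTNAME) returns False, the same way the loop above does after 24 failed tries.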

Leo Meyerovich said...

A little late but:

I had actually been doing that from the get-go. I wrote a script that would do a point-to-point ssh communication between all peers. Ultimately, it seemed to just have been some sort of Python/MPI/*Nix flavor compatibility issue.

More fun was learning that pyMPI does not support Python threads: it gets nailed by the global interpreter lock. I ended up writing some dinky socket code and overriding send/recv with it. I'll probably have to do better once this is past the prototyping stage :)

Joanne said...

Hi,

Thanks for your writeup! It's very helpful. I'm running into an error with mpdtrace and was hoping for some of your insight into it. I am running mpd as root, with one node for simplicity.

I can successfully start up mpd on the instance with mpdboot and "mpd &":
root@...:/etc# mpdboot -n 1 -f mpd.hosts
root@...:/etc# mpd &
[1] 2280

but "mpdtrace -l" gives me an error:
root@ip-10-251-143-0:/etc# mpdtrace -l
mpdtrace: unexpected msg from mpd=:{'error_msg': 'invalid secretword to root mpd'}:

I have tried all pairwise combinations of having MPD_SECRETWORD=secretword or secretword=secretword in ~/.mpd.conf and /etc/mpd.conf, all of which were set to read/write for root only.

I also can't run "mpdallexit":
root@...:~# mpdallexit
mpdallexit: mpd_uncaught_except_tb handling:
<type 'exceptions.KeyError'>: 'cmd'
/usr/local/bin/mpich2-install/bin/mpdallexit 53 mpdallexit
elif msg['cmd'] != 'mpdallexit_ack':
/usr/local/bin/mpich2-install/bin/mpdallexit 59 <module>
mpdallexit()

I can also run mpdcheck as a server and have it listen for mpdcheck as a client from the same instance (in a different window).

Suggestions/help? I'd greatly appreciate any advice you have on this problem. Thanks --