libSkylark in SoftLayer Cloud

Running MPI computations over IBM SoftLayer cloud clusters

We detail the steps to take to:

  • launch a cluster of virtual server instances in IBM SoftLayer cloud

  • run parallel, MPI-based, computations with libSkylark over this cluster

Examples of interactive parallel computations on this cloud cluster are also explored.

With a starting cost of $0.04/hour/server - the default in our configuration - a single one-hour launch of the 3-machine cluster we assemble below costs 3 x $0.04 = $0.12, which is all one needs to spend to play with the suggested setup and examples.

All steps can be conveniently performed from a browser on our laptop, connected to local and remote jupyter notebook servers.

So, let's get started!

IBM SoftLayer credentials

When you open a SoftLayer account you get an IBMid, and you use this together with a password to access your dashboard portal through https://control.softlayer.com. This portal has all the details of your current and past purchases (devices, storage, etc.), charging and account information, and also lets you buy new resources, cancel existing ones, and so on.

So, you go to Account > Users in the portal and you retrieve just two pieces of information from there:

  • your username, typically a short alphanumeric starting with SL
  • your api_key, typically a very long alphanumeric

You don't want to share this information with others(!!), but you definitely want to make it available to some of the python snippets in this session. One way to avoid including it in the notebook is to populate two environment variables (the python snippets will by default recover the values from there), e.g. in bash:

export SL_USERNAME=<your username>
export SL_API_KEY=<your api_key>

Imports

A common theme in cloud computing services is to make their API callable from the browser: a user constructs a long URL that essentially embeds their credentials and the service they want, and makes an HTTPS request. The answer that comes back is typically a JSON or XML string. On top of this, people build lightweight wrappers for many languages, all of which basically assemble the URL, make the HTTPS request and parse the string that comes back.
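
As an illustration of this pattern (not needed for the rest of the notebook), the sketch below queries SoftLayer's REST flavour of the API directly with Python's standard library; the exact URL layout is our assumption here, and the official bindings we use below hide these details anyway.

import base64
import json
import os
import urllib2

# Fetch the account object directly over HTTPS. The URL layout is assumed for
# illustration; the SoftLayer package wraps exactly this kind of request.
url = 'https://api.softlayer.com/rest/v3/SoftLayer_Account/getObject.json'
auth = base64.b64encode('%s:%s' % (os.environ['SL_USERNAME'],
                                   os.environ['SL_API_KEY']))
request = urllib2.Request(url, headers={'Authorization': 'Basic ' + auth})
print json.loads(urllib2.urlopen(request).read())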

IBM SoftLayer provides python bindings of the underlying XMLRPC API in a package called SoftLayer. We install it (e.g. pip install SoftLayer), make sure that paramiko - a python ssh client - is also installed, and import these together with a few more or less standard packages for what follows.

In [1]:
import SoftLayer
import paramiko
In [2]:
import numpy as np
import os
import cStringIO as StringIO

By the way, we can now confirm that the SoftLayer credentials are there for the SoftLayer API to pick up, e.g.:

In [3]:
os.environ['SL_USERNAME']
Out[3]:
'SL1193343'

Launching the machines

The fundamental object in SoftLayer is the Client. Let's make one and then construct a VSManager object through which we are going to launch our virtual server instances.

In [4]:
client = SoftLayer.Client()
In [5]:
vs_manager = SoftLayer.VSManager(client)
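
As an aside, SoftLayer.Client() locates the credentials through the SL_USERNAME and SL_API_KEY environment variables we exported earlier; it also accepts them as keyword arguments, so the following sketch (not used in the rest of the notebook) should be equivalent:

import os
import SoftLayer

# Equivalent to SoftLayer.Client() picking SL_USERNAME/SL_API_KEY up on its
# own; shown only to make explicit where the credentials come from.
explicit_client = SoftLayer.Client(username=os.environ['SL_USERNAME'],
                                   api_key=os.environ['SL_API_KEY'])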

We also need a description of the characteristics of the virtual server instances we plan to launch. Things like:

  • Where do we need these to be physically located (i.e. in what datacenter)? This could be important if we need to upload huge datasets to these machines, or download them, under very strict latency constraints.
  • What is the domain to put these machines in?
  • How many cpus?
  • What is the RAM of the machines?
  • Is charging applied on a per hour basis?
  • What operating system will these machines have?
  • Available local disks?
In [6]:
vs_descriptor_base = {'datacenter' : {'name' : 'tor01'},
                      'domain': 'skylark.com',
                      'startCpus': 1,
                      'maxMemory': 1024,
                      'hourlyBillingFlag': 'true',
                      'operatingSystemReferenceCode': 'UBUNTU_14_64',
                      'localDiskFlag': 'false'}

Here we plan to launch servers at the Toronto datacenter and put them under skylark.com - actually a dummy setting for our scenario, since we do not further tweak DNS - each with 1 cpu, 1 GB of RAM and Ubuntu 14.04 preinstalled. Hourly billing and no local disks are also typical, minimal options, but good enough for our purposes.

We'll launch 3 nodes (i.e. 3 virtual server instances). One of them will be the server of the NFS shared folder and the other 2 will be the NFS clients, hence the names. We also assign these machines the short hostnames node00, node01 and node02, update their descriptions, print one of them, and look at the lists of short hostnames (server_name, client_names).

In [7]:
num_nodes  = 3
In [8]:
nodename_list = ['node%02d' % i for i in range(num_nodes)]
server_name  = nodename_list[0]
client_names = nodename_list[1:]
In [9]:
vs_descriptor_dict = {}
for nodename in nodename_list:
    vs_descriptor = vs_descriptor_base.copy()
    vs_descriptor.update({'hostname' : nodename})
    vs_descriptor_dict[nodename] = vs_descriptor  
In [10]:
vs_descriptor_dict['node00']
Out[10]:
{'datacenter': {'name': 'tor01'},
 'domain': 'skylark.com',
 'hostname': 'node00',
 'hourlyBillingFlag': 'true',
 'localDiskFlag': 'false',
 'maxMemory': 1024,
 'operatingSystemReferenceCode': 'UBUNTU_14_64',
 'startCpus': 1}
In [11]:
server_name
Out[11]:
'node00'
In [12]:
client_names
Out[12]:
['node01', 'node02']

At this point, everything is set up for launching our machines (and thus to start charging ourselves!). We collect the handles to the launch requests (which execute asynchronously on the remote side in Toronto) in a dictionary keyed by the short hostnames...

In [13]:
vsi_dict = {}
vsi_dict[server_name] = client['Virtual_Guest'].createObject(vs_descriptor_dict[server_name])

for client_name in client_names:
    vsi_dict[client_name] = client['Virtual_Guest'].createObject(
        vs_descriptor_dict[client_name])

... and enter a polling loop which, when finished, will basically tell us that our Toronto machines are ready to be accessed and used:

In [14]:
ready = [False] * num_nodes
while not np.all(ready):
    for i, nodename in enumerate(nodename_list):
        ready[i] = vs_manager.wait_for_ready(vsi_dict[nodename]['id'], 
                                         limit=10, delay=1, pending=False)

It can take somewhere between 5 and 10 minutes for these 3 machines to come to life.
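
If we want some feedback while waiting, a slight variant of the loop above reports which nodes are ready after each round and skips nodes that are already up (a sketch; wait_for_ready is the same VSManager call used above):

# Poll each node until it is ready, printing progress after every round and
# skipping nodes that have already reported ready.
ready = dict.fromkeys(nodename_list, False)
while not all(ready.values()):
    for nodename in nodename_list:
        if not ready[nodename]:
            ready[nodename] = vs_manager.wait_for_ready(
                vsi_dict[nodename]['id'], limit=10, delay=1, pending=False)
    print 'ready so far:', sorted(name for name, ok in ready.items() if ok)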

Getting ready for setting up the cloud cluster

OK... Now we are ready to ssh into these machines and start working. For sshing we need the username (default: root), the password and the public IP that has been assigned to each of our machines. Let's collect them in a dictionary and have them handy.

In [15]:
credential_dict = {}
for nodename in nodename_list:
    vsi                = vs_manager.get_instance(vsi_dict[nodename]['id'])
    username           = vsi['operatingSystem']['passwords'][0]['username']
    password           = vsi['operatingSystem']['passwords'][0]['password']
    primary_ip_address = vsi['primaryIpAddress']
    vsi_dict[nodename] = vsi
    credential_dict[nodename] = {'username' : username, 
                                 'password' : password,
                                 'primary_ip_address' : primary_ip_address}
    
In [16]:
credential_dict
Out[16]:
{'node00': {'password': 'JaTCH8T8',
  'primary_ip_address': '169.55.169.150',
  'username': 'root'},
 'node01': {'password': 'DpgUDHc5',
  'primary_ip_address': '169.55.169.147',
  'username': 'root'},
 'node02': {'password': 'M4BZdT6s',
  'primary_ip_address': '169.55.169.148',
  'username': 'root'}}

We need to make sure that each machine knows the IPs of the others and has proper mnemonics (names) for them. This name resolution can be served by /etc/hosts, to which we'll append later. So let's cook up what to append, now that we know all the relevant information.

In [17]:
f = StringIO.StringIO()
domain = vs_descriptor_base['domain']
for key, value in credential_dict.iteritems():
    hostname = key
    full_hostname = '%s.%s' % (hostname, domain)
    primary_ip_address = value['primary_ip_address']
    print >> f, '%s\t%s\t%s' % (primary_ip_address, full_hostname, hostname)
etc_hosts_fragment = f.getvalue()
f.close()
print etc_hosts_fragment
169.55.169.148	node02.skylark.com	node02
169.55.169.150	node00.skylark.com	node00
169.55.169.147	node01.skylark.com	node01

Our method for automatically installing all the necessary software - turning the bunch of generic machines we get from the IBM SoftLayer launch into a cluster ready to run MPI computations with libSkylark - is pretty simple but effective:

  • List the commands we need to run with root privileges and the commands we need to run with normal-user privileges

  • Script ssh sessions to each of the machines for automating the execution of these commands

In this context it makes sense to splice some parameters into these command lists/templates, for flexibility. We collect these parameters here:

In [18]:
cmd_params = {
    'etc_hosts_fragment' : etc_hosts_fragment,
    'shared_folder' : '/shared',
    'user_name' : 'user',
    'user_password' : 'user',
    'gecos' : 'user',
    'miniconda_url' : 'https://repo.continuum.io/miniconda/Miniconda2-latest-Linux-x86_64.sh',
    'miniconda_filename' : 'Miniconda2-latest-Linux-x86_64.sh',
    'channel_url' : 'https://github.com/xdata-skylark/libskylark/releases/download/v0.20/channel.tar.gz',
    'channel_filename' : 'channel.tar.gz'
}

So now we can put together the individual command template sets.

What do we need to execute as root at the (NFS) server node?

  • install NFS server software
  • install an editor of our choice, here emacs
  • export the folder to share with all nodes
  • add a "normal user" with its home being this exported folder
  • update '/etc/hosts'
In [19]:
server_root_setup_cmd_template_list = [
    'apt-get update',
    'apt-get install -y openssh-server',
    'apt-get install -y nfs-server', 
    'apt-get install -y emacs',
    'mkdir %(shared_folder)s',
    'echo "%(shared_folder)s *(rw,sync)" | tee -a /etc/exports',
    'service nfs-kernel-server restart', 
    'adduser --home %(shared_folder)s --uid 1100 --gecos %(gecos)s --disabled-password %(user_name)s',
    'echo "%(user_name)s:%(user_password)s" | chpasswd',
    'chown %(user_name)s %(shared_folder)s',
    'echo -ne "%(etc_hosts_fragment)s" >> /etc/hosts'
    ]

What do we need to execute as "normal user" at the (NFS) client node?

  • generate a (public, private) ssh key pair in order to allow password-less ssh login
  • install libSkylark which consists of the following steps:
    • install miniconda2 (i.e. miniconda for python 2.7.xx)
    • download the conda channel with libSkylark dependencies (i.e. tarballs of prebuilt binaries)
    • install fftw conda package (from conda-forge repo) and libskylark from the conda channel
    • update PATH and LD_LIBRARY_PATH environment variables and make sure these are exported at ssh login time
In [20]:
server_user_setup_cmd_template_list = [
    'cat /dev/zero | ssh-keygen -t rsa -q -N "" > /dev/null',
    'cd .ssh',
    'cat id_rsa.pub >> authorized_keys',
    'cd ${HOME}',
    'curl -L -O %(miniconda_url)s',
    'bash %(miniconda_filename)s -b -p ${HOME}/miniconda2',
    'curl -L -O %(channel_url)s',
    'tar -xvzf %(channel_filename)s',
    'export PATH=${HOME}/miniconda2/bin:${PATH}',
    'conda install fftw --yes --channel conda-forge',
    'conda install libskylark --yes --channel file://${HOME}/channel',
    'echo "export PATH=${PATH}" >> ${HOME}/.bashrc',
    'echo "export LD_LIBRARY_PATH=${HOME}/miniconda2/lib:${LD_LIBRARY_PATH}" >> ${HOME}/.bashrc',
    'echo "export LD_LIBRARY_PATH=${HOME}/miniconda2/lib64:${LD_LIBRARY_PATH}" >> ${HOME}/.bashrc',
    'echo "[[ -f ${HOME}/.bashrc ]] && . ${HOME}/.bashrc" >> ${HOME}/.bash_profile'
    ]

What do we need to execute as root at the (NFS) client nodes?

  • install NFS client software
  • install an editor of our choice, here emacs
  • mount the folder exported by the (NFS) server node
  • add a "normal user" with its home being this mounted folder
  • update '/etc/hosts'

No further commands need to be executed as a "normal user" at the (NFS) client nodes, because related settings (installs, environment variables) have already been configured at the (NFS) server node and have been shared with all nodes through the exported folder, which also serves as the "home" of this user.

In [21]:
client_root_setup_cmd_template_list = [
    'apt-get update',
    'apt-get install -y openssh-server',
    'apt-get install -y nfs-client',
    'apt-get install -y emacs',
    'mkdir %(shared_folder)s',
    'mount %(server_ip)s:%(shared_folder)s %(shared_folder)s',
    'adduser --home %(shared_folder)s --uid 1100 --gecos %(gecos)s --disabled-password %(user_name)s',
    'echo "%(user_name)s:%(user_password)s" | chpasswd',
    'echo -ne "%(etc_hosts_fragment)s" >> /etc/hosts'
]

client_user_setup_cmd_template_list = []

To complete the command parameter setup, we add the public IP of the (NFS) server machine:

In [22]:
cmd_params.update({'server_ip' : 
                   credential_dict[server_name]['primary_ip_address']})

Now the command parameters look like this:

In [23]:
cmd_params
Out[23]:
{'channel_filename': 'channel.tar.gz',
 'channel_url': 'https://github.com/xdata-skylark/libskylark/releases/download/v0.20/channel.tar.gz',
 'etc_hosts_fragment': '169.55.169.148\tnode02.skylark.com\tnode02\n169.55.169.150\tnode00.skylark.com\tnode00\n169.55.169.147\tnode01.skylark.com\tnode01\n',
 'gecos': 'user',
 'miniconda_filename': 'Miniconda2-latest-Linux-x86_64.sh',
 'miniconda_url': 'https://repo.continuum.io/miniconda/Miniconda2-latest-Linux-x86_64.sh',
 'server_ip': '169.55.169.150',
 'shared_folder': '/shared',
 'user_name': 'user',
 'user_password': 'user'}

We splice the parameters into the command templates:

In [24]:
server_root_setup_cmd = '; '.join(server_root_setup_cmd_template_list) % cmd_params
server_user_setup_cmd = '; '.join(server_user_setup_cmd_template_list) % cmd_params
In [25]:
client_root_setup_cmd = '; '.join(client_root_setup_cmd_template_list) % cmd_params
client_user_setup_cmd = '; '.join(client_user_setup_cmd_template_list) % cmd_params

We can also briefly inspect the command strings that are to be executed at the newly launched nodes:

In [26]:
server_root_setup_cmd
Out[26]:
'apt-get update; apt-get install -y openssh-server; apt-get install -y nfs-server; apt-get install -y emacs; mkdir /shared; echo "/shared *(rw,sync)" | tee -a /etc/exports; service nfs-kernel-server restart; adduser --home /shared --uid 1100 --gecos user --disabled-password user; echo "user:user" | chpasswd; chown user /shared; echo -ne "169.55.169.148\tnode02.skylark.com\tnode02\n169.55.169.150\tnode00.skylark.com\tnode00\n169.55.169.147\tnode01.skylark.com\tnode01\n" >> /etc/hosts'
In [27]:
server_user_setup_cmd
Out[27]:
'cat /dev/zero | ssh-keygen -t rsa -q -N "" > /dev/null; cd .ssh; cat id_rsa.pub >> authorized_keys; cd ${HOME}; curl -L -O https://repo.continuum.io/miniconda/Miniconda2-latest-Linux-x86_64.sh; bash Miniconda2-latest-Linux-x86_64.sh -b -p ${HOME}/miniconda2; curl -L -O https://github.com/xdata-skylark/libskylark/releases/download/v0.20/channel.tar.gz; tar -xvzf channel.tar.gz; export PATH=${HOME}/miniconda2/bin:${PATH}; conda install fftw --yes --channel conda-forge; conda install libskylark --yes --channel file://${HOME}/channel; echo "export PATH=${PATH}" >> ${HOME}/.bashrc; echo "export LD_LIBRARY_PATH=${HOME}/miniconda2/lib:${LD_LIBRARY_PATH}" >> ${HOME}/.bashrc; echo "export LD_LIBRARY_PATH=${HOME}/miniconda2/lib64:${LD_LIBRARY_PATH}" >> ${HOME}/.bashrc; echo "[[ -f ${HOME}/.bashrc ]] && . ${HOME}/.bashrc" >> ${HOME}/.bash_profile'
In [28]:
client_root_setup_cmd
Out[28]:
'apt-get update; apt-get install -y openssh-server; apt-get install -y nfs-client; apt-get install -y emacs; mkdir /shared; mount 169.55.169.150:/shared /shared; adduser --home /shared --uid 1100 --gecos user --disabled-password user; echo "user:user" | chpasswd; echo -ne "169.55.169.148\tnode02.skylark.com\tnode02\n169.55.169.150\tnode00.skylark.com\tnode00\n169.55.169.147\tnode01.skylark.com\tnode01\n" >> /etc/hosts'
In [29]:
client_user_setup_cmd
Out[29]:
''

Setting up the cloud cluster

At this point we are ready to execute our crafted setup commands by ssh-ing into the (NFS) server and client machines as root and as the "normal user" and running the commands. We automate this process by scripting those ssh sessions. For convenient reference we also capture the stdout and stderr streams of the setup commands.

In [30]:
setup_streams = {}
In [31]:
# server
hostname = server_name
In [32]:
# server root
cmd = server_root_setup_cmd
ssh = paramiko.SSHClient()
ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
ssh.connect(credential_dict[hostname]['primary_ip_address'], 
            username=credential_dict[hostname]['username'], 
            password=credential_dict[hostname]['password'])
stdin, stdout, stderr = ssh.exec_command(cmd)
stdout_lines = stdout.readlines()
stderr_lines = stderr.readlines()
primary_ip_address = credential_dict[hostname]['primary_ip_address']
setup_streams[primary_ip_address] = {'root' : {}}
setup_streams[primary_ip_address]['root']['stdout'] = ''.join(stdout_lines)
setup_streams[primary_ip_address]['root']['stderr'] = ''.join(stderr_lines)
In [33]:
# server user
cmd = server_user_setup_cmd
ssh = paramiko.SSHClient()
ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
ssh.connect(credential_dict[hostname]['primary_ip_address'], 
            username=cmd_params['user_name'], 
            password=cmd_params['user_password'])
stdin, stdout, stderr = ssh.exec_command(cmd)
stdout_lines = stdout.readlines()
stderr_lines = stderr.readlines()
primary_ip_address = credential_dict[hostname]['primary_ip_address']
setup_streams[primary_ip_address]['user'] = {}
setup_streams[primary_ip_address]['user']['stdout'] = ''.join(stdout_lines)
setup_streams[primary_ip_address]['user']['stderr'] = ''.join(stderr_lines)
In [34]:
# client root
cmd = client_root_setup_cmd
ssh = paramiko.SSHClient()
ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
for hostname in client_names:
    ssh.connect(credential_dict[hostname]['primary_ip_address'], 
            username=credential_dict[hostname]['username'], 
            password=credential_dict[hostname]['password'])
    stdin, stdout, stderr = ssh.exec_command(cmd)
    stdout_lines = stdout.readlines()
    stderr_lines = stderr.readlines()
    primary_ip_address = credential_dict[hostname]['primary_ip_address']
    setup_streams[primary_ip_address] = {'root' : {}}
    setup_streams[primary_ip_address]['root']['stdout'] = ''.join(stdout_lines)
    setup_streams[primary_ip_address]['root']['stderr'] = ''.join(stderr_lines)
In [35]:
# client user
cmd = client_user_setup_cmd
ssh = paramiko.SSHClient()
ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
for hostname in client_names:
    ssh.connect(credential_dict[hostname]['primary_ip_address'],
                username=cmd_params['user_name'], 
                password=cmd_params['user_password'])
    stdin, stdout, stderr = ssh.exec_command(cmd)
    stdout_lines = stdout.readlines()
    stderr_lines = stderr.readlines()
    primary_ip_address = credential_dict[hostname]['primary_ip_address']
    setup_streams[primary_ip_address]['user'] = {}
    setup_streams[primary_ip_address]['user']['stdout'] = ''.join(stdout_lines)
    setup_streams[primary_ip_address]['user']['stderr'] = ''.join(stderr_lines)
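
The four cells above repeat the same connect/execute/capture pattern. A small helper along the following lines could factor it out; this is just a sketch, not part of the original notebook, and the run_setup name is our own:

def run_setup(ip, username, password, cmd):
    """Run cmd over ssh on the host at ip and return its stdout/stderr."""
    ssh = paramiko.SSHClient()
    ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    ssh.connect(ip, username=username, password=password)
    stdin, stdout, stderr = ssh.exec_command(cmd)
    streams = {'stdout': ''.join(stdout.readlines()),
               'stderr': ''.join(stderr.readlines())}
    ssh.close()
    return streams

# Example usage with a harmless command:
for hostname in nodename_list:
    ip = credential_dict[hostname]['primary_ip_address']
    print hostname, run_setup(ip, credential_dict[hostname]['username'],
                              credential_dict[hostname]['password'],
                              'hostname')['stdout'].strip()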

After the setup commands have completed we can ssh into any of the cluster nodes, e.g. the NFS server one, by following the printed instruction:

In [36]:
ssh_login_template = 'Type "ssh %s@%s" at a terminal and input "%s" as password when prompted'

print ssh_login_template % (cmd_params['user_name'], 
                            credential_dict[server_name]['primary_ip_address'], 
                            cmd_params['user_password'])
Type "ssh user@169.55.169.150" at a terminal and input "user" as password when prompted

We are in!

Working with the cloud cluster

As soon as we log in, it is convenient to have a small text file with the short hostnames of the machines we intend to use in our MPI runs. In our case a file hosts containing these 3 lines suffices (one way to create it without leaving the local notebook is sketched right below them):

node00
node01
node02
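
A minimal sketch for writing this file over SFTP, assuming the paramiko session and the credentials collected during setup; since the "normal user" home is the NFS-shared folder, one copy is visible on all nodes:

# Write the hosts file into the NFS-shared home of the "normal user" on node00.
ssh = paramiko.SSHClient()
ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
ssh.connect(credential_dict[server_name]['primary_ip_address'],
            username=cmd_params['user_name'],
            password=cmd_params['user_password'])
sftp = ssh.open_sftp()
f = sftp.open('hosts', 'w')
f.write('\n'.join(nodename_list) + '\n')
f.close()
sftp.close()
ssh.close()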

Let's see how this works. First let's do passwordless ssh to node01 and node02 from the front-end node node00:

ssh node01
exit

ssh node02
exit

We now print the hostname using either the linux command or a python one-liner. Note that we ask for 6 processes (that is, twice the number of launched machines), so the MPI runtime will run 2 processes per server.

In [ ]:
mpirun -np 6 -f hosts hostname
In [ ]:
mpirun -np 6 -f hosts python -c "import socket; print socket.gethostname()"

MPI runs from the command line: libSkylark-based tools

We download and extract a tarball of data we'll need as input:

In [ ]:
wget  https://www.dropbox.com/s/0ilbyeem2ihmvzw/data.tar.gz
tar -xvzf data.tar.gz

Then we can try the examples in http://xdata-skylark.github.io/libskylark/docs/stable/sphinx/quick_start.html#command-line-usage.

We use all cloud cluster machines in our MPI runs and use 2 processes per machine - 6 MPI processes total.

In [ ]:
mpirun -np 6 -f hosts skylark_svd -k 10 --prefix usps data/usps.train
user@node00:~$ mpirun -np 6 -f hosts skylark_svd -k 10 --prefix usps data/usps.train
Reading the matrix... took 3.41e+00 sec
Computing approximate SVD...Took 2.29e+00 sec
Writing results...took 1.54e-01 sec
In [ ]:
mpirun -np 6 -f hosts skylark_linear data/cpu.train cpu.sol
user@node00:~$ mpirun -np 6 -f hosts skylark_linear data/cpu.train cpu.sol
Reading the matrix... took 1.28e-01 sec
Solving the least squares...Took 3.17e-01 sec
Writing results...took 6.23e-04 sec
In [ ]:
mpirun -np 6 -f hosts skylark_krr -a 1 -k 0 -g 10 -f 1000 --model model data/usps.train
user@node00:~$ mpirun -np 6 -f hosts skylark_krr -a 1 -k 0 -g 10 -f 1000 --model model data/usps.train
# Generated using kernel_regression using the following command-line:
#       skylark_krr -a 1 -k 0 -g 10 -f 1000 --model model data/usps.train
# Number of ranks is 6
Reading the matrix... took 3.06e+00 sec
Training...
        Dummy coding... took 8.92e-02 sec
        Solving...
                Computing kernel matrix... took 3.21e+00 sec
                Creating precoditioner...
                        Applying random features transform... took 1.88e+00 sec
                        Computing covariance matrix... took 6.55e+00 sec
                        Factorizing... took 5.37e-01 sec
                        Prepare factor... took 2.90e+00 sec
                Took 1.19e+01 sec
                Solving linear equation...
                        CG: Iteration 0, Relres = 1.52e+00, 0 rhs converged
                        CG: Iteration 10, Relres = 3.59e-01, 0 rhs converged
                        CG: Iteration 20, Relres = 1.28e-01, 0 rhs converged
                        CG: Iteration 30, Relres = 4.19e-02, 0 rhs converged
                        CG: Iteration 40, Relres = 2.18e-02, 0 rhs converged
                        CG: Iteration 50, Relres = 9.09e-03, 0 rhs converged
                        CG: Iteration 60, Relres = 3.60e-03, 0 rhs converged
                        CG: Iteration 70, Relres = 1.57e-03, 2 rhs converged
                        CG: Iteration 80, Relres = 9.98e-04, 7 rhs converged
                        CG: Iteration 85, Relres = 5.17e-04, 10 rhs converged
                        CG: Convergence!
                Took 2.80e+01 sec
        Solve took 4.31e+01 sec
Training took 4.32e+01 sec
Saving model... took 1.21e-01 sec
In [ ]:
mpirun -np 6 -f hosts skylark_krr --outputfile output --model model usps.test
user@node00:~$ mpirun -np 6 -f hosts skylark_krr --outputfile output --model model usps.test
# Generated using kernel_regression using the following command-line:
#       skylark_krr --outputfile output --model model usps.test
# Number of ranks is 6
Reading the matrix... took 1.90e-04 sec
Training...
        Dummy coding... took 1.14e-03 sec
        Solving...
                Computing kernel matrix... took 4.68e-03 sec
                Creating precoditioner...
                        Applying random features transform... took 2.62e-04 sec
                        Computing covariance matrix... took 5.35e-03 sec
                        Factorizing... took 2.05e+00 sec
                        Prepare factor... took 1.09e+00 sec
                Took 3.16e+00 sec
                Solving linear equation...
                        CG: Iteration 0, Relres = -nan, 0 rhs converged
                        CG: Convergence!
                Took 3.36e-03 sec
        Solve took 3.17e+00 sec
Training took 3.17e+00 sec
Saving model... took 6.83e-04 sec
In [ ]:
mpirun -np 6 -f hosts skylark_ml -g 10 -k 1 -l 2 -i 30 -f 1000 --trainfile data/usps.train --valfile data/usps.test --modelfile model
# [shortened output]
user@node00:~$ mpirun -np 6 -f hosts skylark_ml -g 10 -k 1 -l 2 -i 30 -f 1000 --trainfile data/usps.train --valfile data/usps.test --modelfile model
# Generated using skylark_mlusing the following command-line:
#       skylark_ml -g 10 -k 1 -l 2 -i 30 -f 1000 --trainfile data/usps.train --valfile data/usps.test --modelfile model
#
# Regression? = False
# Training File = data/usps.train
# Model File = model
# Validation File = data/usps.test
# Test File =
# File Format = 0
# Loss function = 2 (Hinge Loss (SVMs))
# Regularizer = 0 (No Regularizer)
# Kernel = 1 (Gaussian)
# Kernel Parameter = 10
# Second Kernel Parameter = 0
# Third Kernel Parameter = 1
# Regularization Parameter = 0
# Maximum Iterations = 30
# Tolerance = 0.001
# rho = 1
# Seed = 12345
# Random Features = 1000
# Cache transforms? = False
# Use fast, if availble? = False
# Sequence = 0 (Monte Carlo)
# Number of feature partitions = 1
# Threads = 1
# Number of MPI Processes = 6
Mode: Training. Loading data...
Reading from file data/usps.train
Reading and distributing chunk 0 to 7290 (7291 elements )
Read Matrix with dimensions: 7291 by 256 (3.03376secs)
Loading validation data.
Reading from file data/usps.test
Reading and distributing chunk 0 to 2006 (2007 elements )
Read Matrix with dimensions: 2007 by 256 (0.783422secs)
iteration 1 objective 80201 accuracy 0.00 time 1.82563 seconds
iteration 2 objective 80201 accuracy 0.00 time 2.38801 seconds
iteration 3 objective 16334.1 accuracy 92.87 time 2.94534 seconds
iteration 4 objective 2574.06 accuracy 93.02 time 3.49865 seconds
...
iteration 28 objective 735.573 accuracy 94.67 time 16.8254 seconds
iteration 29 objective 716.133 accuracy 94.67 time 17.3821 seconds
iteration 30 objective 697.646 accuracy 94.72 time 17.9359 seconds
In [ ]:
mpirun -np 6 -f hosts skylark_ml --outputfile output --testfile data/usps.test --modelfile model
user@node00:~$ mpirun -np 6 -f hosts skylark_ml --outputfile output --testfile data/usps.test --modelfile model
Mode: Predicting (from file). Loading data...
# Generated using skylark_mlusing the following command-line:
#       skylark_ml --outputfile output --testfile data/usps.test --modelfile model
#
# Regression? = False
# Training File =
# Model File = model
# Validation File =
# Test File = data/usps.test
# File Format = 0
# Loss function = 0 (Squared Loss)
# Regularizer = 0 (No Regularizer)
# Kernel = 0 (Linear)
# Kernel Parameter = 1
# Second Kernel Parameter = 0
# Third Kernel Parameter = 1
# Regularization Parameter = 0
# Maximum Iterations = 20
# Tolerance = 0.001
# rho = 1
# Seed = 12345
# Random Features = 0
# Cache transforms? = False
# Use fast, if availble? = False
# Sequence = 0 (Monte Carlo)
# Number of feature partitions = 1
# Threads = 1
# Number of MPI Processes = 6
Reading from file data/usps.test
Reading and distributing chunk 0 to 2006 (2007 elements )
Read Matrix with dimensions: 2007 by 256 (0.785275secs)
Error rate: 5.18%
In [ ]:
mpirun -np 6 -f hosts skylark_community --graphfile data/two_triangles --seed A1
user@node00:~$ mpirun -np 6 -f hosts skylark_community --graphfile data/two_triangles --seed A1
Reading the adjacency matrix... Reading the adjacency matrix... Reading the adjacency matrix...


Finished reading... Reading the adjacency matrix... Reading the adjacency matrix...

Finished reading... Finished reading... took took 1.45e-03 sec
Finished reading... 1.49e-03 sec
Reading the adjacency matrix...
Analysis complete! Took Finished reading... 9.35e-04 sec
Cluster found:Analysis complete! Took
A3took took 2.19e-032.21e-03 sec
 sec
Finished reading... took 5.02e-04Analysis complete! Took  A2 A1 Analysis complete! Took
1.16e-03Conductivity =  sec
0.142857Cluster found:8.71e-04

A3 sec
 sec
9.60e-04Cluster found:
 sec
A3A2 Cluster found:A2
A1A1 A3
 Conductivity =  0.142857

A2 Conductivity = A10.142857

took 6.01e-03 sec
Conductivity = 0.142857
Analysis complete! Took 6.14e-03 sec
Cluster found:
A3 A2 A1
Conductivity = 0.142857
Analysis complete! Took 6.83e-03 sec
Cluster found:
A3 A2 A1
Conductivity = 0.142857
In [ ]:
skylark_graph_se -k 2 data/two_triangles
user@node00:~$ skylark_graph_se -k 2 data/two_triangles
Reading the graph... took 1.87e-04 sec
Computing embeddings... took 3.42e-03 sec
Writing results... took 2.75e-04 sec

Interactive MPI runs in Jupyter

Setting things up

Let's first install the package enabling interactive MPI sessions in jupyter, unless it is already installed as part of the cluster setup process. Then we start a notebook server on any of the cloud cluster nodes and keep the token that is printed handy:

conda install -y ipyparallel
jupyter notebook --no-browser --port=12345

Then we open a terminal on our laptop and build an ssh tunnel to the jupyter notebook server we just started at the cloud cluster. In what follows, 169.55.169.150 is the IP of the node where the jupyter notebook server is running:

ssh -f -N -L3333:localhost:12345 user@169.55.169.150

Finally we open a local browser, point it to http://localhost:3333 in our case, input the token, and we are ready to start a notebook (New > Python 2) and two terminal sessions (New > Terminal), which will be connected to the cluster node where the notebook server was started.

At this point we can also start one controller and three engines from within the notebook terminal sessions. The engines are basically network-enabled Python kernels (they can receive input and send output over network connections) and in our case - since they are launched through the MPI launcher - they can also take part in an (interactive) MPI computation, e.g. using the mpi4py Python package. These engines register with the controller process, which we start separately:

ipcontroller --ip=169.55.169.150
mpirun -np 3 -f hosts ipengine --mpi=mpi4py

After connections are established, the log messages in the two terminal sessions will look like the following (shortened output):

user@node00:~$ ipcontroller --ip=169.55.169.150
2017-02-06 18:36:47.058 [IPControllerApp] Hub listening on tcp://169.55.169.150:49374 for registration.
2017-02-06 18:36:47.061 [IPControllerApp] Hub using DB backend: 'DictDB'
2017-02-06 18:36:47.315 [IPControllerApp] hub::created hub
2017-02-06 18:36:47.338 [IPControllerApp] writing connection info to /shared/.ipython/profile_default/security/ipcontroller-client.j
son
...
2017-02-06 18:38:44.350 [IPControllerApp] registration::finished registering engine 2:2d6d9a64-aa11-4199-b10b-58df4f56258b
2017-02-06 18:38:44.351 [IPControllerApp] engine::Engine Connected: 2
user@node00:~$ mpirun -np 3 -f hosts ipengine --mpi=mpi4py
2017-02-06 18:38:36.186 [IPEngineApp] Initializing MPI:
2017-02-06 18:38:36.187 [IPEngineApp] from mpi4py import MPI as mpi
mpi.size = mpi.COMM_WORLD.Get_size()
mpi.rank = mpi.COMM_WORLD.Get_rank()
...
2017-02-06 18:38:41.080 [IPEngineApp] Starting to monitor the heartbeat signal from the hub every 3010 ms.
2017-02-06 18:38:41.110 [IPEngineApp] Completed registration with id 1
2017-02-06 18:38:41.112 [IPEngineApp] Completed registration with id 2
2017-02-06 18:38:41.089 [IPEngineApp] Completed registration with id 0

Interactive MPI examples

Let's now see the interactive MPI computation session in action.

In the notebook we can then enter the following snippet:

from ipyparallel import Client
import numpy as np

c = Client()
view = c[:]
view.activate()
view.scatter('a',np.arange(16,dtype='float'))
view['a']

and it should output:

[array([ 0.,  1.,  2.,  3.,  4.,  5.]),
 array([  6.,   7.,   8.,   9.,  10.]),
 array([ 11.,  12.,  13.,  14.,  15.])]

Or even revisit the "hostname" example above in an interactive setting by entering in the notebook:

view.execute('import socket; host=socket.gethostname()')
view['host']

and it should output:

['node02.skylark.com', 'node01.skylark.com', 'node00.skylark.com']

We can also enter the following snippet in a file (e.g. sketch_example.py), using the editor that comes with the notebook; the file will be saved on the cloud cluster node:

# sketch_example.py

import El
from skylark import sketch
A = El.DistMatrix()
El.Uniform(A, 10, 10)
S = sketch.FJLT(10, 4) 
SA = S * A
sketched_matrix = [[SA.Get(i, j) for j in range(10)] for i in range(4)]

Then we can run this over the cluster and get 3 identical 4 x 10 sketched matrices in the sketched_matrix variable (one per node); the view we get is a list of 3 nested lists, shortened below:

view.run('sketch_example.py')
view['sketched_matrix']
[[[-0.3101992999959424,
   0.6483628527959093,
   0.1958483439974383,
   0.4446715646290061,
   0.9886868537882584,
   -0.26960721013585903,
   -0.9024278447032668,
   1.0498253423008397,
   -0.04086442729353306,
   -1.957760961341717],
  [-1.339483931706679,
   0.3757472310582834,
   ...
   -0.3312500853340126,
   0.46249798425486943,
   -1.9693807622535253,
   0.3678465840888163]]]

Terminating the cloud cluster

When we are done using the cloud cluster we can terminate all instances.

Doing so stops the charging, so we should always do this at the end of our computing session.

In [37]:
for nodename in nodename_list:
    vs_manager.cancel_instance(vsi_dict[nodename]['id'])
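
If we want to double-check that nothing is left running (and charging), we can list what remains under the account; a minimal sketch, assuming the default list_instances() result includes the hostname field (just-cancelled instances may linger in the listing for a short while):

# List virtual server instances still associated with the account and keep
# those whose hostname matches our node naming scheme.
remaining = [vsi['hostname'] for vsi in vs_manager.list_instances()
             if vsi.get('hostname', '').startswith('node')]
print 'still listed:', remaining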