
.. include:: global.rst

.. _HPCscript:


Jupyter and the ipyrad API
===========================

The *ipyrad* API was specifically designed for use inside 
`jupyter-notebooks <http://jupyter.org>`_, a tool for reproducible science.
This section of the documentation is about how to start and run jupyter
notebooks, which you can then use to run your ipyrad analyses using
the ipyrad API. For instructions on how to use the ipyrad API 
(after you have a notebook started) go here: 
(`ipyrad API <http://nbviewer.jupyter.org/github/dereneaton/ipyrad/blob/master/tests/API_user-guide.ipynb>`__). 
An example of a complete notebook showing assembly and analysis of 
a RAD data set with the ipyrad API can be found here:
(`Pedicularis API <http://nbviewer.jupyter.org/github/dereneaton/ipyrad/blob/master/tests/cookbook-empirical-API-1-pedicularis.ipynb>`__).

Jupyter notebooks let you run interactive code that is 
documented with embedded Markdown (formatted text) 
to create a shareable and executable document. 
Running *ipyrad* interactively in a notebook 
is easy on a laptop or workstation and slightly more difficult
on an HPC cluster, but this tutorial walks through the extra 
steps involved. If this is your first time using jupyter, it is 
easiest to try it on your laptop before moving to a cluster. 
For the cluster case, the examples below include a job submission 
script for SLURM; other job submission systems should be similar. 


The following tools are used in this section:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ ipyrad (used for RAD-seq assembly)  
+ jupyter-notebook (an environment in which we run Python code)  
+ ipcluster (used to parallelize code within a notebook)  
+ ssh (used to connect to a notebook running on HPC)  


Starting a jupyter-notebook **locally**
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
To start a jupyter-notebook on your local computer (e.g., laptop)
execute the command below in a terminal. This will start a local 
notebook server and open a window in your default web-browser. 
Leave the server running in the terminal. You will not need to 
touch it again until you want to stop the notebook server.
You can now interact with the notebook server through your 
web-browser. You should see a page showing the files and folders
in the directory where you started the notebook. In the upper
right you will see a tab where you can select <new> and then 
a Python kernel (e.g., <Python 3>) to start a new Python notebook.

.. code-block:: bash

    jupyter-notebook


Starting a jupyter-notebook **remotely** (HPC)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Because jupyter works by sending and receiving information 
(i.e., it's a server) it is easy to run a jupyter notebook through 
your browser even if the notebook server is running on a remote computer 
that is far away, for example on a computing cluster. Start by 
assigning a password to your notebook server which will give it 
added security. 

.. code-block:: bash

    ## Run this on the remote machine (i.e., the cluster)
    ## It will ask you to enter a password which will be 
    ## encrypted and stored for use when connecting.
    jupyter-notebook password


.. code-block:: bash

    ## Run this on the remote machine (i.e., the cluster)
    jupyter-notebook --no-browser --ip=$(hostname -i) --port=9999  


Once the notebook starts it will print some information including 
the IP address of the machine it is running on (this will be something
like 10.115.0.25), and the port number that it is using (this will
likely be 9999; however, if that port is already in use it 
will select a different port number, so check the output). 
You will need these two pieces of information, the IP address
and the port number, for the next command. Replace the placeholders 
in angle brackets with the appropriate values, where ``<user>@<login>`` 
is the username and login address you normally use to ssh to the cluster. 


.. code-block:: bash

    ## Run this on your local machine (i.e., your laptop)
    ssh -N -L <port>:<ip-address>:<port> <user>@<login>


.. code-block:: bash

    ## This would be an example with real values entered:
    ssh -N -L 9999:10.115.0.25:9999 deren@hpc.columbia.edu  


You will now be able to connect to the jupyter notebook on your 
local machine (i.e., laptop) by going to the web address 
``localhost:<port>`` where you enter in the port number your 
notebook is being served on.

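As a minimal sketch, the tunnel command above can also be assembled in 
Python from its four pieces. The helper name ``tunnel_command`` is 
hypothetical (it is not part of ipyrad or jupyter), and the values are the 
example ones used above.

.. code-block:: python

    import shlex


    def tunnel_command(port, ip_address, user, login_host):
        """Return the ssh port-forwarding command as a list of arguments."""
        port = int(port)
        if not 1 <= port <= 65535:
            raise ValueError(f"port out of range: {port}")
        # forward local <port> to <ip_address>:<port> via the login host
        forward = f"{port}:{ip_address}:{port}"
        return ["ssh", "-N", "-L", forward, f"{user}@{login_host}"]


    ## example values from this tutorial; replace with your own
    cmd = tunnel_command(9999, "10.115.0.25", "deren", "hpc.columbia.edu")
    print(shlex.join(cmd))

The returned list can be passed to ``subprocess.run`` on your local 
machine if you prefer to open the tunnel from a script.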

Starting jupyter through a batch script:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

TL;DR: a short video tutorial.

.. raw:: html

    <div style="position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden; max-width: 100%; height: auto;">
        <iframe src="https://www.youtube.com/embed/hjBJw1fY5Uo" frameborder="0" allowfullscreen style="position: absolute; top: 0; left: 0; width: 100%; height: 100%;"></iframe>
    </div>
    <br>


Step 1: Submit a batch script to start jupyter
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Copy and paste the code block below into a text editor and save the script as 
``slurm_jupyter.sbatch``. The #SBATCH section of the script may need to be edited 
slightly to conform to your cluster. The stdout (output) of the job will be 
printed to a log file named ``jupyter-log-%J.txt``, where %J will be replaced 
by the job ID number. We'll need to look at the log file once the job starts
to get information about how to connect to the jupyter server that we've started.


Single-node setup: this example requests one node with 20 cores. 

.. code-block:: bash

    #!/bin/bash
    #SBATCH --partition general
    #SBATCH --nodes 1
    #SBATCH --ntasks-per-node 20
    #SBATCH --exclusive
    #SBATCH --time 4:00:00
    #SBATCH --mem-per-cpu 4000
    #SBATCH --job-name tunnel
    #SBATCH --output jupyter-log-%J.txt

    ## get tunneling info
    XDG_RUNTIME_DIR=""
    ipnport=$(shuf -i8000-9999 -n1)
    ipnip=$(hostname -i)

    ## print tunneling instructions to jupyter-log-{jobid}.txt 
    echo -e "
        Copy/Paste this in your local terminal to ssh tunnel with remote 
        -----------------------------------------------------------------
        ssh -N -L $ipnport:$ipnip:$ipnport user@host                     
        -----------------------------------------------------------------

        Then open a browser on your local machine to the following address
        ------------------------------------------------------------------
        localhost:$ipnport  (prefix w/ https:// if using password)
        ------------------------------------------------------------------
        "

    ## launch the jupyter server on the chosen port and address
    jupyter-notebook --no-browser --port=$ipnport --ip=$ipnip


Now submit the sbatch script to the cluster to reserve the node and 
start the jupyter notebook server running on it. The notebook server 
will continue running until it hits the walltime limit, or you stop it.

.. code-block:: bash

    ## submit the job script to your cluster job scheduler
    sbatch slurm_jupyter.sbatch



Step 2: Connecting to the jupyter server
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
After submitting your sbatch script to the queue you can check whether
it has started with the ``squeue -u {username}`` command. 
Once it starts, information will be printed to the log file, which 
we named ``jupyter-log-{jobid}.txt``. Use the command ``less`` 
to look at this file and you should see something like the output below. 

.. code-block:: yaml

     Copy/paste this in your local terminal to ssh tunnel with remote
     ---------------------------------------------------------------- 
     ssh -N -L 8193:xx.yyy.zzz:8193 user@host
     ---------------------------------------------------------------
 
     Then open a browser on your local machine to the following address
     ------------------------------------------------------------------
     localhost:8193  (prefix w/ https:// if using password)
     ------------------------------------------------------------------

Follow the instructions and paste the ``ssh`` command into a terminal on your 
local machine (e.g., laptop), replacing ``user@host`` with your own login. 
This creates the SSH tunnel from your local 
machine to the port on the cluster where the jupyter server is running. 
As long as the SSH tunnel is open you will be able to interact with the 
jupyter-notebook through your browser. You can close the SSH tunnel at any time 
and the notebook will continue running on the cluster. You can 
re-connect to it later by re-opening the tunnel with the same SSH command.


.. code-block:: bash

    ## This would be an example with real values entered:
    ssh -N -L 8193:10.115.0.25:8193 deren@hpc.columbia.edu  

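If you would rather not read the log by eye, the sketch below pulls the 
port and address out of the log text with a regular expression. It assumes 
the log contains the ``ssh -N -L`` line exactly as printed by the 
``echo`` command in ``slurm_jupyter.sbatch``; the function name 
``read_tunnel_info`` and the sample text are hypothetical.

.. code-block:: python

    import re

    ## matches: ssh -N -L <port>:<address>:<port> <user>@<host>
    TUNNEL_RE = re.compile(r"ssh -N -L (\d+):(\S+):(\d+) (\S+)")


    def read_tunnel_info(log_text):
        """Return port, node address and ssh target found in the log text."""
        match = TUNNEL_RE.search(log_text)
        if match is None:
            # the job has not started yet, or the log format differs
            return None
        local_port, node_ip, remote_port, target = match.groups()
        return {
            "local_port": int(local_port),
            "ip": node_ip,
            "remote_port": int(remote_port),
            "target": target,
        }


    ## sample text shaped like the log shown above
    sample = """
        ssh -N -L 8193:10.115.0.25:8193 user@host
        localhost:8193  (prefix w/ https:// if using password)
    """
    print(read_tunnel_info(sample))

In practice you would pass in the contents of your own 
``jupyter-log-{jobid}.txt`` file instead of the sample string.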

Security/tokens
~~~~~~~~~~~~~~~~
If you did not create a password earlier, then when you connect to 
the jupyter-notebook server it will ask you for a password/token. 
You can find an automatically generated token in your jupyter-log 
file near the bottom. It is the long string printed after the word 
``token``. Copy just that portion and paste it into the token field. I 
find it easier to use a password. See the jupyter documentation for how
to set up further security. 

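A minimal sketch for finding the token in the log text is shown below. It 
assumes the server prints a URL containing ``?token=`` followed by a 
hexadecimal string, which is how the notebook server usually reports it, 
but check your own log if nothing is found. The function name 
``find_token`` and the sample line are hypothetical.

.. code-block:: python

    import re

    ## a token is assumed to appear as ...?token=<hex string> in a URL
    TOKEN_RE = re.compile(r"[?&]token=([0-9a-f]+)")


    def find_token(log_text):
        """Return the last token printed in the log text, or None."""
        tokens = TOKEN_RE.findall(log_text)
        return tokens[-1] if tokens else None


    ## made-up log line for illustration only
    sample = "http://10.115.0.25:8193/?token=0123456789abcdef0123456789abcdef"
    print(find_token(sample))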

Using jupyter
~~~~~~~~~~~~~~
Once connected you can open an existing notebook or create a new one. 
The notebooks are physically located on your cluster, meaning all of your data 
and results will be saved there. I usually keep notebooks associated with 
different projects in different directories, where each directory is also a 
github repo, which makes them easy to share. When running ipyrad I usually set 
the "project_dir" to be a location in the scratch directory of the cluster, since
it is faster for reading/writing large files. 


Using ipcluster on a multi-node MPI setup:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In the example above we started a notebook on a node with 20 cores available. 
Once connected, the first thing I typically do is start an ipcluster instance
in a terminal so that I can use it to parallelize computations
(see our `ipyparallel tutorial`__). If you want to connect to multiple nodes, 
however, it is better to start the ipcluster instance in its 
own job submission script. Here is an example. Importantly, we 
tell ipcluster to use a specific ``--profile`` name, in this case ``MPI60``, 
to indicate that we're connecting to 60 cores using MPI. When we connect
to the client later we will need to provide the same profile name. I name this file
``slurm_ipcluster_MPI.sbatch``. 

For this setup we also add a command to load the MPI module. You will probably
need to modify ``module load OpenMPI`` to whatever the appropriate module 
name is for MPI on your system. If you do not know what this is then look
it up or ask the system administrator. 


.. code-block:: bash

    #!/bin/bash
    #SBATCH --partition general
    #SBATCH --nodes 3
    #SBATCH --ntasks-per-node 20
    #SBATCH --exclusive
    #SBATCH --time 30-00:00:00
    #SBATCH --mem-per-cpu 4000
    #SBATCH --job-name MPI60
    #SBATCH --output ipcluster-log-%J.txt

    ## set the profile name here
    profile="MPI60"

    ## Start an ipcluster instance. This server will run until killed.
    module load OpenMPI
    sleep 10
    ipcluster start --n=60 --engines=MPI --ip='*' --profile=$profile



Connecting to the ipcluster instance in Python
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Now when you are in the jupyter notebook you can connect to this ipcluster
instance -- which is running as a completely separate job on your cluster -- 
with the following simple Python code. The object ``ipyclient`` can then
be used to distribute your computation on the remote cluster. When you
run ipyrad, pass the ``ipyclient`` object to tell it this is the cluster you want
computation to occur on. The results of your computation will still be 
printed in your jupyter notebook.


.. code-block:: python
    
    import ipyrad as ip
    import ipyparallel as ipp

    ## connect to the client
    ipyclient = ipp.Client(profile="MPI60")

    ## print how many engines are connected
    print(len(ipyclient), 'cores')

    ## or, use ipyrad to print cluster info
    ip.cluster_info(ipyclient)


.. code-block:: yaml

    60 cores
    host compute node: [20 cores] on c14n02.farnam.hpc.yale.edu
    host compute node: [20 cores] on c14n03.farnam.hpc.yale.edu
    host compute node: [20 cores] on c14n04.farnam.hpc.yale.edu


When running the ipyrad API you would distribute work by passing the
ipyclient object in the ipyclient argument. See the ipyrad API for more
information. 

.. code-block:: python

    ## run step 3 of ipyrad assembly across 60 cores of the cluster    
    data.run(steps='3', ipyclient=ipyclient)


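Engines started by a separate job can take a little while to register, so 
it may help to wait until all of them are connected before running a step. 
The sketch below polls a counting function until the expected number of 
engines is reached. The helper ``wait_until_connected`` is hypothetical and 
is not part of ipyrad or ipyparallel; it relies only on ``len(ipyclient)`` 
as used above.

.. code-block:: python

    import time


    def wait_until_connected(count_engines, expected, timeout=120.0, interval=2.0):
        """Poll until `expected` engines are connected or `timeout` seconds pass."""
        deadline = time.monotonic() + timeout
        while True:
            connected = count_engines()
            if connected >= expected:
                return connected
            if time.monotonic() >= deadline:
                raise TimeoutError(
                    f"only {connected} of {expected} engines connected"
                )
            time.sleep(interval)


    ## in a notebook with the client from above you could call:
    ## wait_until_connected(lambda: len(ipyclient), expected=60)

Passing a function rather than the client itself keeps the helper 
independent of any particular ipyparallel version.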

The slurm_jupyter.sbatch script explained
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
So what is the ``slurm_jupyter.sbatch`` script above doing? 
The ``XDG_RUNTIME_DIR`` assignment is a little obscure: it simply fixes a 
bug where SLURM otherwise sets this variable to something that
is incompatible with jupyter. ``ipnport`` is a random number between 8000 and 9999
that selects which port the notebook is served on. ``ipnip`` is the IP 
address of the compute node where the job runs (the output of ``hostname -i``); 
the SSH tunnel forwards to this address through the login node. The ``echo`` command 
simply prints the tunneling information to the log file. 

In the multi-node ipcluster script we use the ``module load``
command to load the system-wide MPI software. Then we call ipcluster
with arguments to find cores across all available nodes using MPI, and
we provide a name (profile) for this cluster so it will be easy 
to connect to.

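The ``shuf`` line in ``slurm_jupyter.sbatch`` picks a port at random 
without checking whether it is already in use. As a rough Python sketch of 
the same idea, the code below draws a port from the same 8000-9999 range 
and additionally tests that it can be bound. The check is an addition that 
is not in the original script, and the names ``port_is_free`` and 
``pick_port`` are hypothetical.

.. code-block:: python

    import random
    import socket


    def port_is_free(port, host=""):
        """Return True if `port` can currently be bound on this machine."""
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            try:
                sock.bind((host, port))
            except OSError:
                return False
        return True


    def pick_port(low=8000, high=9999, attempts=50):
        """Draw random ports in [low, high] until a free one is found."""
        for _ in range(attempts):
            port = random.randint(low, high)
            if port_is_free(port):
                return port
        raise RuntimeError(f"no free port found between {low} and {high}")


    print(pick_port())

A port that is free at the time of the check could still be taken before 
the server starts, so the notebook output remains the authoritative place 
to read the port from.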

Restarting ipcluster
~~~~~~~~~~~~~~~~~~~~~
Once the connection is established you can stop and restart ``ipcluster`` 
if you run into a problem with the parallel engines, for example a stalled 
job on one of the engines. The easiest way to do this is to open a new 
terminal from the jupyter dashboard 
by selecting [new]/[terminal] on the right side, and then run
the commands below to restart ``ipcluster``. If you are using a multi-node
setup, you will instead need to resubmit the ipcluster job script in 
order to connect to multiple nodes again. 

.. code-block:: bash

    ## stop the running ipcluster instance
    ipcluster stop

    ## start a new ipcluster instance on the current node
    ipcluster start


Terminating the connection
~~~~~~~~~~~~~~~~~~~~~~~~~~~
To stop a running jupyter notebook server, cancel the job in your cluster's queue 
or, if working locally, press control-c in the terminal window. If you 
disconnect from a remote notebook and later reconnect, you can continue 
using the notebook without needing to restart it by going to the Kernel menu and 
selecting Reconnect. If progress bars were printing output while you 
were disconnected, that output may not show up, but the job will have kept running.
The loss of progress bars is a shortcoming that will likely be 
fixed in the near future.