\chapter{Running your \ocamlpiiil\ program} \label{cap:run}
We give here a practical tutorial on how to use the system, without entering into the implementation details of the current version of \ocamlpiiil. As mentioned above, once you have written an \ocamlpiiil\ program, you have several choices for its execution, since you can, {\em without touching your source}:
\begin{description}
\item[sequential] run your program sequentially on one machine, to test the logic of the algorithm you implemented with all the usual sequential debugging tools.
\item[graphics] get a picture of the processor network described by your \ocamlpiiil\ skeleton expression, to grasp the parallel structure of your program.
\item[parallel] run your program in parallel over a network of workstations after a simple recompilation.
\end{description}
\noindent Presumably, you would run the parallel version once the program has satisfactorily passed the sequential debugging phase. In the following sections, our running example is the computation of the Mandelbrot fractal set. We describe the program, then compile and run it in the three ways listed above.
\section{The Mandelbrot example program}
The Mandelbrot example program computes the Mandelbrot set at a given resolution in a given area of the graphic display. This is the actual program provided in the Examples directory of the distribution. The computing engine of the program is the function {\tt pixel\_row}, which computes the colors of a row of pixels from the convergence of a sequence of complex numbers $z_n$ defined by the initial term $z_0$ and the formula $z_{n+1} = z_{n}^2 + z_0$. More precisely, given a point $p$ in the complex plane, we associate to $p$ the sequence $z_n$ starting with $z_0 = p$.
Now, we compute the integer $m$ such that $z_m$ is the first term of the sequence satisfying the following condition: either the sum of the squares of the real and imaginary parts of $z_m$ exceeds a given threshold, or the number of iterations exceeds some fixed maximum {\em resolution} limit. The integer $m$ defines the color of $p$. This corresponds to the following \ocaml\ code:
\begin{alltt}
open Graphics;;

let n = 300;;   (* the size of the square screen windows in pixels *)
let res = 100;; (* the resolution: maximum number of iterations allowed *)

(* convert an integer in the range 0..res into a screen color *)
let color_of c res =
  Pervasives.truncate (((float c)/.(float res))*.(float Graphics.white));;

(* compute the color of a pixel by iterating z_m+1=z_m^2+c *)
(* j is the k-th row, initialized so that j.(i),k are the coordinates *)
(* of the pixel (i,k) *)
let pixel_row (j,k,res,n) =
  let zr = ref 0.0 in
  let zi = ref 0.0 in
  let cr = ref 0.0 in
  let ci = ref 0.0 in
  let zrs = ref 0.0 in
  let zis = ref 0.0 in
  let d = ref (2.0 /. ((float n) -. 1.0)) in
  let colored_row = Array.create n (Graphics.black) in
  for s = 0 to (n-1) do
    let j1 = ref (float j.(s)) in
    let k1 = ref (float k) in
    begin
      zr := !j1 *. !d -. 1.0;
      zi := !k1 *. !d -. 1.0;
      cr := !zr;
      ci := !zi;
      zrs := 0.0;
      zis := 0.0;
      for i=0 to (res-1) do
        begin
          if(not((!zrs +. !zis) > 4.0)) then
            begin
              zrs := !zr *. !zr;
              zis := !zi *. !zi;
              zi := 2.0 *. !zr *. !zi +. !ci;
              zr := !zrs -. !zis +. !cr;
              Array.set colored_row s (color_of i res);
            end;
        end
      done
    end
  done;
  (colored_row,k);;
\end{alltt}
In this code, the global complex interval sampled stays within {\tt (-1.0, -1.0)} and {\tt (1.0, 1.0)}. In this 2-unit wide square, the {\tt pixel\_row} function computes rows of pixels separated by the distance {\tt d}.
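The inner loop of {\tt pixel\_row} is a classical escape-time iteration. As a self-contained illustration (a sketch we add here for clarity, not part of the distributed example), the computation for a single point $p = c_r + i\,c_i$ can be written as a standalone function:

```ocaml
(* Escape-time iteration for a single complex point (cr, ci):
   returns the index of the first term whose squared modulus
   exceeds 4.0, or res if the sequence stays bounded for
   res iterations. *)
let escape_time cr ci res =
  let zr = ref cr and zi = ref ci in
  let m = ref res in
  (try
     for i = 0 to res - 1 do
       let zrs = !zr *. !zr and zis = !zi *. !zi in
       if zrs +. zis > 4.0 then begin m := i; raise Exit end;
       (* z := z^2 + c, componentwise *)
       zi := 2.0 *. !zr *. !zi +. ci;
       zr := zrs -. zis +. cr
     done
   with Exit -> ());
  !m
```

For instance, {\tt escape\_time 0.0 0.0 100} returns {\tt 100} (the origin never escapes), while {\tt escape\_time 2.0 0.0 100} returns {\tt 1}.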
The {\tt pixel\_row} function takes four parameters: {\tt j}, an array holding the column indices of the pixels in the row; {\tt k}, the index of the row to be drawn; {\tt res}, the resolution; and {\tt n}, the number of pixels in a row. The iteration counts are converted into actual screen colors by the {\tt color\_of} function. In this program, the threshold is fixed to {\tt 4.0}. We name {\tt zr} and {\tt zi} the real and imaginary parts of $z_n$; similarly, the real and imaginary parts of $c$ are {\tt cr} and {\tt ci}; {\tt zrs} and {\tt zis} are temporary variables holding the squares of {\tt zr} and {\tt zi}; {\tt d} is the distance between two rows.

The Mandelbrot computation over the whole set of points within {\tt (-1.0,-1.0)} and {\tt (1.0,1.0)} in the complex plane can be performed in parallel by exploiting farm parallelism: the set of points is split by \texttt{gen\_rows} into pixel rows that build up the input stream; the computation of the Mandelbrot set on each row of complex points is independent, and can be carried out by the worker processes using {\tt pixel\_row}; the result is a stream of rows of pixel colors, each corresponding to an input pixel row.
\begin{small}
\begin{alltt}
(* draw a line on the screen using fast image functions *)
let show_a_result r =
  match r with
    (col,j) -> draw_image (make_image [| col |]) 0 j;;

(* generate the tasks *)
let gen_rows =
  let seed = ref 0 in
  let ini = Array.create n 0 in
  let iniv = for i=0 to (n-1) do Array.set ini i i done; ini in
  (function () ->
     if(!seed < n) then
       let r = (iniv,!seed,res,n) in (seed:=!seed+1;r)
     else raise End_of_file);;
\end{alltt}
\end{small}
The actual farm is defined by the \texttt{mandel} function, which uses the \texttt{parfun} skeleton to turn a farm instance with 10 workers into an \ocaml\ sequential function.
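The generator protocol used by \texttt{gen\_rows} (a stateful closure that yields a fresh task at each call and raises \texttt{End\_of\_file} when the input is exhausted) can be exercised in plain \ocaml, independently of the skeleton machinery. A minimal sketch, where the names \texttt{make\_gen} and \texttt{drain} are ours and introduced for illustration only:

```ocaml
(* A generator in the style of gen_rows: each call returns the next
   task, and End_of_file signals that the task stream is over. *)
let make_gen n =
  let seed = ref 0 in
  fun () ->
    if !seed < n then (incr seed; !seed - 1)
    else raise End_of_file

(* Drain a generator into a list, mimicking what a stream consumer
   does with the successive tasks. *)
let drain gen =
  let rec loop acc =
    match (try Some (gen ()) with End_of_file -> None) with
    | Some x -> loop (x :: acc)
    | None -> List.rev acc
  in
  loop []
```

For example, \texttt{drain (make\_gen 3)} evaluates to \texttt{[0; 1; 2]}: the closure remembers its position between calls, exactly as \texttt{gen\_rows} remembers the next row index in \texttt{seed}.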
Notice that the \texttt{seq} skeleton has been used to turn the \texttt{pixel\_row} function into a stream process, which can be used to instantiate a skeleton. Finally, the \texttt{pardo} skeleton takes care of opening and closing a display window on the end-node (the one running \texttt{pardo}), and of actually activating the farm by invoking \texttt{mandel}. The function \texttt{show\_a\_result} actually displays a pixel row on the end-node. Notice that this code would need to be written anyway, perhaps arranged differently, even for a purely sequential implementation.\\
\begin{small}
\begin{alltt}
(* the skeleton expression to compute the image *)
let mandel = parfun (fun () -> farm(seq(pixel_row),10));;

pardo
  (fun () ->
     print_string "opening...";print_newline();
     open_graph (" "^(string_of_int n)^"x"^(string_of_int n));
     (* here we do the parallel computation *)
     List.iter show_a_result
       (P3lstream.to_list (mandel (P3lstream.of_fun gen_rows)));
     print_string "Finishing";print_newline();
     (* crude delay loop, to keep the window visible for a while *)
     for i=0 to 50000000 do let _ = i*i in () done;
     print_string "Finishing";print_newline();
     close_graph()
  )
\end{alltt}
\end{small}
%\index{streams!\verb|P3lstream.of_fun|}
\section{Sequential execution}
We assume the program is stored in a file named \texttt{mandel.ml}. We compile the sequential version using \texttt{ocamlp3lcc} as follows:
\begin{verbatim}
ocamlp3lcc --sequential mandel
\end{verbatim}
\begin{remark}
In the current implementation, this boils down to adding on top of \texttt{mandel.ml} the line
\begin{verbatim}
open Seqp3l;;
\end{verbatim}
to obtain a temporary file \texttt{mandel.seq.ml}, which is then compiled via the regular Caml compiler \texttt{ocamlc} with the proper modules and libraries linked.
Depending on the configuration of your system, this may look like the following:
\begin{verbatim}
ocamlc -custom unix.cma graphics.cma seqp3l.cmo -o mandel.seq mandel.seq.ml -cclib -lunix -cclib -lgraphics -cclib -L/usr/X11R6/lib -cclib -lX11
\end{verbatim}
We strongly recommend against calling \texttt{ocamlc} explicitly: use the \texttt{ocamlp3lcc} compiler, which is especially devoted to the compilation of \ocamlpiiil\ programs.
\hfill $\diamond$
\end{remark}
After the compilation, we get an executable file, \texttt{mandel.seq}, whose execution produces the picture shown on the left side of Figure~\ref{f:mandelparseq}.
\begin{figure}
\includegraphics[scale=.30]{mandelparseq}
\caption{A snapshot of the execution of \texttt{mandel.ml}
(left is sequential execution, right is parallel execution on 5 machines).}
\label{f:mandelparseq}
\end{figure}
\section{Graphical execution}
It is often useful to look at the structure of the application process network, for example when tuning the performance of the final program. In \ocamlpiiil, this can be done by compiling the program with the special option \texttt{--graphical}, which automatically creates a picture displaying the `logical' parallel program structure:
\begin{verbatim}
ocamlp3lcc --graphical mandel
\end{verbatim}
\begin{remark}
In the current implementation, this boils down to adding on top of \texttt{mandel.ml} the line
\begin{verbatim}
open Grafp3l;;
\end{verbatim}
to obtain a temporary file \texttt{mandel.gra.ml}, which is then compiled via \texttt{ocamlc} with the proper modules and libraries. Depending on the configuration of your system, this may look like the following:
\begin{verbatim}
ocamlc -custom graphics.cma grafp3l.cmo -o mandel.gra mandel.gra.ml -cclib -lgraphics -cclib -L/usr/X11R6/lib -cclib -lX11
\end{verbatim}
Once more, we strongly recommend against calling \texttt{ocamlc} explicitly: use the \texttt{ocamlp3lcc} compiler, which is especially devoted to the compilation of \ocamlpiiil\ programs.
\hfill $\diamond$
\end{remark}
After compilation, we get the executable file \texttt{mandel.gra}, whose execution produces the following picture.
\begin{center}
\includegraphics[scale=.35]{mandelgra}
\end{center}
\section{Parallel execution}
Once we have checked the sequential version of our code, and obtained a picture of the structure of the parallel network, we are ready to speed up the computation by using a network of computers.
\subsection{Compilation for parallel execution}
We call the compiler with the special option \texttt{--parallel}, devoted to compilation for parallel execution:
\begin{verbatim}
ocamlp3lcc --parallel mandel
\end{verbatim}
\begin{remark}
In the current implementation, this boils down to adding on top of \texttt{mandel.ml} the lines
\begin{verbatim}
open Parp3l;;
open Nodecode;;
open Template;;
\end{verbatim}
to obtain a temporary file \texttt{mandel.par.ml}, which is then compiled via \texttt{ocamlc} with the proper modules and libraries. Depending on the configuration of your system, this may look like the following:
\begin{verbatim}
ocamlc -custom unix.cma p3lpar.cma -o mandel.par mandel.par.ml -cclib -lunix -cclib -lgraphics -cclib -L/usr/X11R6/lib -cclib -lX11
\end{verbatim}
Once again, we strongly recommend against calling \texttt{ocamlc} explicitly: use the \texttt{ocamlp3lcc} compiler, which is especially devoted to the compilation of \ocamlpiiil\ programs.
\hfill $\diamond$
\end{remark}
The compilation produces an executable file named \texttt{mandel.par}.
\section{Common options}
The parallel compilation of \ocamlpiiil\ programs creates executables that are equipped with the following set of predefined options:
\begin{itemize}
\item \verb|-p3lroot|, to declare this invocation of the program as the root node.
\item \verb|-dynport|, to force this node to use a dynamic port number instead of the default \verb|p3lport|; in addition, the option prints the chosen port number (useful if you want to run several slave copies on the same machine).
\item \verb|-debug|, to enable debugging for this node at level $n$ (currently, all levels are equivalent).
\item \verb|-ip|, to force the use of a specified IP address. This is useful when the machine has several network interfaces (or is a laptop known as \texttt{localhost}) and you want to choose which one to use.
\item \verb|-strict|, to specify a strict mapping between physical and virtual processors.
\item \verb|-version|, to print version information.
\item \verb|-help| or \verb|--help|, to display this list of options.
\end{itemize}
\subsection{Parallel computation overview}
The executable produced by using the {\tt --parallel} option of the compiler behaves either as a generic computation node, or as the unique {\em root configuration node}, according to the arguments provided at launch time. To set up and launch the parallel computation network, we need to run multiple invocations of the parallel executable:
\begin{itemize}
\item run one instance of \texttt{mandel.par}, with no arguments, on each machine that takes part in the parallel computation; these processes wait for configuration information sent by the designated {\em root node};
\item create the root node, by launching one extra copy of \texttt{mandel.par} with the special option \texttt{-p3lroot}.
\end{itemize}
\noindent Once created, the root node configures all other participating nodes and then executes locally the sequential code encapsulated in \verb|pardo|. In addition to the \texttt{-p3lroot} special option, the root node invocation must specify the information concerning the machines involved in the computational network (their IP address or name, their port, and their color).
\section{Launching the parallel computation}
Here is a simple script to launch the parallel network on several machines:
\begin{small}
\begin{verbatim}
#!/bin/sh
# The list of machines
NODES="machine1 machine2 machine3 machine4"
# The name of the executable to be launched
PAR="./mandel.par"
COUNT=0
COLORED_NODES=""
echo -n "Launching OcamlP3L $PAR on the cluster:"
for NODE in $NODES; do                                         #(*1*)
  echo -n " $NODE"
  # launch a generic computation node on each machine
  ssh $NODE $PAR 1> log-$NODE 2> err-$NODE &
  COUNT=`expr $COUNT + 1`
  # a possible coloring of machines
  case $NODE in                                                #(*2*)
    machine1) COLORED_NODES="$COLORED_NODES $NODE#1";;
    *)        COLORED_NODES="$COLORED_NODES $NODE#2";;
  esac
done
echo "Starting computation with $COUNT node(s): $COLORED_NODES..."
# launch the unique root configuration node                    #(*3*)
$PAR -p3lroot $COLORED_NODES 1> log-root 2> err-root
echo "Finished."
\end{verbatim}
\end{small}
\noindent This script assumes \verb|mandel.par| to be accessible on all participating machines, and does the following:
\begin{itemize}
\item runs \verb|mandel.par| on all participating machines (\verb|#(*1*)|),
\item generates a coloring for the participating nodes (\verb|#(*2*)|),
\item launches the computation by starting the root process on the local machine (\verb|#(*3*)|), providing it with the list of colored participating hosts.
\end{itemize}
In future versions, especially those incorporating the MPI communication layer, the startup mechanism may work differently (typically, the initialization steps will be performed by the MPI layer).
\section{Common errors} \label{sec:commonerrors}
A few words of warning now: even if the user program is easy to write, compile and execute, you should not forget that the underlying machinery is quite sophisticated, and that in some situations you may not get what you expected.
Two typical problems you may encounter are the following:
\begin{description}
\item[output value: code mismatch] If you see this error during the parallel execution of your program, it means that two incompatible versions of your program are trying to communicate. \ocaml\ uses an MD5 check of the code area before sending closures over a channel, because this operation only makes sense between ``identical'' programs.\\
Two possible reasons for the error are:
\begin{itemize}
\item an old version of your program is still running somewhere and is trying to communicate with the newer version you are running now: you should kill all the running processes and try again;
\item you are running copies of the program compiled for different architectures: this is not yet supported, and you should run the program on homogeneous architectures.
\end{itemize}
\item[references] You should remember that the user functions provided to the skeletons will all be executed on different machines, so their behaviour \emph{must not} rely on the existence of implicitly shared data, such as global references: if it does, the sequential and parallel behaviours will differ. This does not imply that all user functions must be pure functions (you can use local store to keep a counter, for example), but an access to a global reference is certainly a mistake, since every node will access its \emph{own private} copy of the data, thus defeating the purpose of the shared data.
\end{description}
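To make the second pitfall concrete, here is a small sketch (our own illustration, not code from the distribution). The first worker reads a global reference, so in a parallel run every node updates its \emph{own} copy of \texttt{global\_count} and the results diverge from the sequential run; the second keeps its counter in a closure-local reference, which is harmless because that state is never meant to be shared across nodes:

```ocaml
(* WRONG in parallel: depends on a global reference. Each node gets
   a private copy of global_count, so increments performed by one
   worker are invisible to the others. *)
let global_count = ref 0
let bad_worker x =
  incr global_count;
  x + !global_count

(* SAFE: the counter is local to each worker instance; no sharing
   across nodes is ever assumed. *)
let make_counting_worker () =
  let count = ref 0 in
  fun x -> incr count; x + !count
```

Sequentially the two behave identically; the difference only appears once several nodes each hold their own copy of the global reference.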