\chapter{Running your \ocamlpiiil\ program} \label{cap:run}
We give here a practical tutorial on how to use the system, without entering into the implementation details of the current version of \ocamlpiiil. As mentioned above, once you have written an \ocamlpiiil\ program, you have several choices for its execution, since you can, {\em without touching your source}:
\begin{description}
\item[sequential] run your program sequentially on one machine, to test the logic of the algorithm you implemented with all the usual sequential debugging tools.
\item[graphics] get a picture of the processor network described by your \ocamlpiiil\ skeleton expression, to grasp the parallel structure of your program.
\item[parallel] run your program in parallel over a network of workstations after a simple recompilation.
\end{description}
\noindent Presumably, you would run the parallel version once the program has satisfactorily passed the sequential debugging phase. In the following sections, our running example is the computation of the Mandelbrot fractal set. We describe the program, then compile and run it in the three ways listed above.
\section{The Mandelbrot example program}
The Mandelbrot example program computes the Mandelbrot set at a given resolution in a given area of the graphic display. This is the actual program provided in the Examples directory of the distribution. The computing engine of the program is the function {\tt pixel\_row}, which computes the colors of a row of pixels from the convergence of a sequence of complex numbers $z_n$ defined by the initial term $z_0$ and the formula $z_{n+1} = z_{n}^2 + z_0$. More precisely, given a point $p$ in the complex plane, we associate to $p$ the sequence $z_n$ starting with $z_0 = p$.
Now, we compute the integer $m$ such that $z_m$ is the first term of the sequence satisfying the following condition: either the sum of the squares of the real and imaginary parts of $z_m$ exceeds a given threshold, or the number of iterations exceeds some fixed maximum {\em resolution} limit. The integer $m$ defines the color of $p$. This corresponds to the following \ocaml\ code:
\begin{alltt}
open Graphics;;

let n = 300;;   (* the size of the square screen windows in pixels *)
let res = 100;; (* the resolution: maximum number of iterations allowed *)

(* convert an integer in the range 0..res into a screen color *)
let color_of c res =
  Pervasives.truncate (((float c)/.(float res))*.(float Graphics.white));;

(* compute the color of a pixel by iterating z_m+1=z_m^2+c *)
(* j is the k-th row, initialized so that j.(i),k are the coordinates *)
(* of the pixel (i,k) *)
let pixel_row (j,k,res,n) =
  let zr = ref 0.0 in
  let zi = ref 0.0 in
  let cr = ref 0.0 in
  let ci = ref 0.0 in
  let zrs = ref 0.0 in
  let zis = ref 0.0 in
  let d = ref (2.0 /. ((float n) -. 1.0)) in
  let colored_row = Array.create n (Graphics.black) in
  for s = 0 to (n-1) do
    let j1 = ref (float j.(s)) in
    let k1 = ref (float k) in
    begin
      zr := !j1 *. !d -. 1.0;
      zi := !k1 *. !d -. 1.0;
      cr := !zr;
      ci := !zi;
      zrs := 0.0;
      zis := 0.0;
      for i=0 to (res-1) do
        begin
          if(not((!zrs +. !zis) > 4.0)) then
            begin
              zrs := !zr *. !zr;
              zis := !zi *. !zi;
              zi := 2.0 *. !zr *. !zi +. !ci;
              zr := !zrs -. !zis +. !cr;
              Array.set colored_row s (color_of i res);
            end;
        end
      done
    end
  done;
  (colored_row,k);;
\end{alltt}
In this code, the global complex interval sampled stays within {\tt (-1.0, -1.0)} and {\tt (1.0, 1.0)}. In this 2-unit wide square, the {\tt pixel\_row} function computes rows of pixels separated by the distance {\tt d}.
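The inner loop of {\tt pixel\_row} is a classical escape-time iteration. As a self-contained illustration (a sketch we add here for clarity, not part of the distributed example), the computation for a single point $p = c_r + i\,c_i$ can be written as a standalone function:

```ocaml
(* Escape-time iteration for a single complex point (cr, ci):
   returns the index of the first term whose squared modulus
   exceeds 4.0, or res if the sequence stays bounded for
   res iterations. *)
let escape_time cr ci res =
  let zr = ref cr and zi = ref ci in
  let m = ref res in
  (try
     for i = 0 to res - 1 do
       let zrs = !zr *. !zr and zis = !zi *. !zi in
       if zrs +. zis > 4.0 then begin m := i; raise Exit end;
       (* z := z^2 + c, componentwise *)
       zi := 2.0 *. !zr *. !zi +. ci;
       zr := zrs -. zis +. cr
     done
   with Exit -> ());
  !m
```

For instance, {\tt escape\_time 0.0 0.0 100} returns {\tt 100} (the origin never escapes), while {\tt escape\_time 2.0 0.0 100} returns {\tt 1}.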
The {\tt pixel\_row} function takes four parameters: {\tt j}, an array holding the column indices of the pixels in the row; {\tt k}, the index of the row to be drawn; {\tt res}, the resolution; and {\tt n}, the number of pixels in a row. The iteration counts are converted into actual screen colors by the {\tt color\_of} function. In this program, the threshold is fixed to {\tt 4.0}. We name {\tt zr} and {\tt zi} the real and imaginary parts of $z_n$; similarly, the real and imaginary parts of $c$ are {\tt cr} and {\tt ci}; {\tt zrs} and {\tt zis} are temporary variables holding the squares of {\tt zr} and {\tt zi}; {\tt d} is the distance between two rows.

The Mandelbrot computation over the whole set of points within {\tt (-1.0,-1.0)} and {\tt (1.0,1.0)} in the complex plane can be performed in parallel by exploiting farm parallelism: the set of points is split by \texttt{gen\_rows} into pixel rows that build up the input stream; the computation of the Mandelbrot set on each row of complex points is independent, and can be carried out by the worker processes using {\tt pixel\_row}; the result is a stream of rows of pixel colors, each corresponding to an input pixel row.
\begin{small}
\begin{alltt}
(* draw a line on the screen using fast image functions *)
let show_a_result r =
  match r with
    (col,j) -> draw_image (make_image [| col |]) 0 j;;

(* generate the tasks *)
let gen_rows =
  let seed = ref 0 in
  let ini = Array.create n 0 in
  let iniv = for i=0 to (n-1) do Array.set ini i i done; ini in
  (function () ->
     if(!seed < n) then
       let r = (iniv,!seed,res,n) in (seed:=!seed+1;r)
     else raise End_of_file);;
\end{alltt}
\end{small}
The actual farm is defined by the \texttt{mandel} function, which uses the \texttt{parfun} skeleton to turn a farm instance with 10 workers into an \ocaml\ sequential function.
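The generator protocol used by \texttt{gen\_rows} (a stateful closure that yields a fresh task at each call and raises \texttt{End\_of\_file} when the input is exhausted) can be exercised in plain \ocaml, independently of the skeleton machinery. A minimal sketch, where the names \texttt{make\_gen} and \texttt{drain} are ours and introduced for illustration only:

```ocaml
(* A generator in the style of gen_rows: each call returns the next
   task, and End_of_file signals that the task stream is over. *)
let make_gen n =
  let seed = ref 0 in
  fun () ->
    if !seed < n then (incr seed; !seed - 1)
    else raise End_of_file

(* Drain a generator into a list, mimicking what a stream consumer
   does with the successive tasks. *)
let drain gen =
  let rec loop acc =
    match (try Some (gen ()) with End_of_file -> None) with
    | Some x -> loop (x :: acc)
    | None -> List.rev acc
  in
  loop []
```

For example, \texttt{drain (make\_gen 3)} evaluates to \texttt{[0; 1; 2]}: the closure remembers its position between calls, exactly as \texttt{gen\_rows} remembers the next row index in \texttt{seed}.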
Notice that the \texttt{seq} skeleton has been used to turn the \texttt{pixel\_row} function into a stream process, which can be used to instantiate a skeleton. Finally, the \texttt{pardo} skeleton takes care of opening and closing a display window on the end-node (the one running \texttt{pardo}), and of actually activating the farm by invoking \texttt{mandel}. The function \texttt{show\_a\_result} actually displays a pixel row on the end-node. Notice that this code would need to be written anyway, perhaps arranged differently, even for a purely sequential implementation.\\
\begin{small}
\begin{alltt}
(* the skeleton expression to compute the image *)
let mandel = parfun (fun () -> farm(seq(pixel_row),10));;

pardo
  (fun () ->
     print_string "opening...";print_newline();
     open_graph (" "^(string_of_int n)^"x"^(string_of_int n));
     (* here we do the parallel computation *)
     List.iter show_a_result
       (P3lstream.to_list (mandel (P3lstream.of_fun gen_rows)));
     print_string "Finishing";print_newline();
     (* crude delay loop, to keep the window visible for a while *)
     for i=0 to 50000000 do let _ = i*i in () done;
     print_string "Finishing";print_newline();
     close_graph()
  )
\end{alltt}
\end{small}
%\index{streams!\verb|P3lstream.of_fun|}
\section{Sequential execution}
We assume the program is stored in a file named \texttt{mandel.ml}. We compile the sequential version using \texttt{ocamlp3lcc} as follows:
\begin{verbatim}
ocamlp3lcc --sequential mandel
\end{verbatim}
\begin{remark}
In the current implementation, this boils down to adding on top of \texttt{mandel.ml} the line
\begin{verbatim}
open Seqp3l;;
\end{verbatim}
to obtain a temporary file \texttt{mandel.seq.ml}, which is then compiled via the regular Caml compiler \texttt{ocamlc} with the proper modules and libraries linked.
Depending on the configuration of your system, this may look like the following:
\begin{verbatim}
ocamlc -custom unix.cma graphics.cma seqp3l.cmo -o mandel.seq mandel.seq.ml -cclib -lunix -cclib -lgraphics -cclib -L/usr/X11R6/lib -cclib -lX11
\end{verbatim}
We strongly recommend against calling \texttt{ocamlc} explicitly: use the \texttt{ocamlp3lcc} compiler, which is especially devoted to the compilation of \ocamlpiiil\ programs.
\hfill $\diamond$
\end{remark}
After the compilation, we get an executable file, \texttt{mandel.seq}, whose execution produces the picture shown on the left side of Figure~\ref{f:mandelparseq}.
\begin{figure}
\includegraphics[scale=.30]{mandelparseq}
\caption{A snapshot of the execution of \texttt{mandel.ml}
(left is sequential execution, right is parallel execution on 5 machines).}
\label{f:mandelparseq}
\end{figure}
\section{Graphical execution}
It is often useful to look at the structure of the application process network, for example when tuning the performance of the final program. In \ocamlpiiil, this can be done by compiling the program with the special option \texttt{--graphical}, which automatically creates a picture displaying the `logical' parallel program structure:
\begin{verbatim}
ocamlp3lcc --graphical mandel
\end{verbatim}
\begin{remark}
In the current implementation, this boils down to adding on top of \texttt{mandel.ml} the line
\begin{verbatim}
open Grafp3l;;
\end{verbatim}
to obtain a temporary file \texttt{mandel.gra.ml}, which is then compiled via \texttt{ocamlc} with the proper modules and libraries. Depending on the configuration of your system, this may look like the following:
\begin{verbatim}
ocamlc -custom graphics.cma grafp3l.cmo -o mandel.gra mandel.gra.ml -cclib -lgraphics -cclib -L/usr/X11R6/lib -cclib -lX11
\end{verbatim}
Once more, we strongly recommend against calling \texttt{ocamlc} explicitly: use the \texttt{ocamlp3lcc} compiler, which is especially devoted to the compilation of \ocamlpiiil\ programs.
\hfill $\diamond$
\end{remark}
After compilation, we get the executable file \texttt{mandel.gra}, whose execution produces the following picture.
\begin{center}
\includegraphics[scale=.35]{mandelgra}
\end{center}
\section{Parallel execution}
Once we have checked the sequential version of our code, and obtained a picture of the structure of the parallel network, we are ready to speed up the computation by using a network of computers.
\subsection{Compilation for parallel execution}
We call the compiler with the special option \texttt{--parallel}, devoted to compilation for parallel execution:
\begin{verbatim}
ocamlp3lcc --parallel mandel
\end{verbatim}
\begin{remark}
In the current implementation, this boils down to adding on top of \texttt{mandel.ml} the lines
\begin{verbatim}
open Parp3l;;
open Nodecode;;
open Template;;
\end{verbatim}
to obtain a temporary file \texttt{mandel.par.ml}, which is then compiled via \texttt{ocamlc} with the proper modules and libraries. Depending on the configuration of your system, this may look like the following:
\begin{verbatim}
ocamlc -custom unix.cma p3lpar.cma -o mandel.par mandel.par.ml -cclib -lunix -cclib -lgraphics -cclib -L/usr/X11R6/lib -cclib -lX11
\end{verbatim}
Once again, we strongly recommend against calling \texttt{ocamlc} explicitly: use the \texttt{ocamlp3lcc} compiler, which is especially devoted to the compilation of \ocamlpiiil\ programs.
\hfill $\diamond$
\end{remark}
The compilation produces an executable file named \texttt{mandel.par}.
\section{Common options}
The parallel compilation of \ocamlpiiil\ programs creates executables that are equipped with the following set of predefined options:
\begin{itemize}
\item \verb|-p3lroot|, to declare this invocation of the program as the root node.
\item \verb|-dynport|, to force this node to use a dynamic port number instead of the default \verb|p3lport|; in addition, the option prints the chosen port number (useful if you want to run several slave copies on the same machine).
\item \verb|-debug|, to enable debugging for this node at level $n$ (currently, all levels are equivalent).
\item \verb|-ip|, to force the use of a specified IP address. This is useful when the machine has several network interfaces (or is a laptop known as \texttt{localhost}) and you want to choose which one to use.
\item \verb|-strict|, to specify a strict mapping between physical and virtual processors.
\item \verb|-version|, to print version information.
\item \verb|-help| or \verb|--help|, to display this list of options.
\end{itemize}
\subsection{Parallel computation overview}
The executable produced by using the {\tt --parallel} option of the compiler behaves either as a generic computation node, or as the unique {\em root configuration node}, according to the arguments provided at launch time. To set up and launch the parallel computation network, we need to run multiple invocations of the parallel executable:
\begin{itemize}
\item run one instance of \texttt{mandel.par}, with no arguments, on each machine that takes part in the parallel computation; these processes wait for configuration information sent by the designated {\em root node};
\item create the root node, by launching one extra copy of \texttt{mandel.par} with the special option \texttt{-p3lroot}.
\end{itemize}
\noindent Once created, the root node configures all other participating nodes and then executes locally the sequential code encapsulated in \verb|pardo|. In addition to the \texttt{-p3lroot} special option, the root node invocation must specify the information concerning the machines involved in the computational network (their IP address or name, their port, and their color).
\section{Launching the parallel computation}
Here is a simple script to launch the parallel network on several machines:
\begin{small}
\begin{verbatim}
#!/bin/sh
# The list of machines
NODES="machine1 machine2 machine3 machine4"
# The name of the executable to be launched
PAR="./mandel.par"
COUNT=0
COLORED_NODES=""
echo -n "Launching OcamlP3L $PAR on the cluster:"
for NODE in $NODES; do                                         #(*1*)
  echo -n " $NODE"
  # launch a generic computation node on each machine
  ssh $NODE $PAR 1> log-$NODE 2> err-$NODE &
  COUNT=`expr $COUNT + 1`
  # a possible coloring of machines
  case $NODE in                                                #(*2*)
    machine1) COLORED_NODES="$COLORED_NODES $NODE#1";;
    *)        COLORED_NODES="$COLORED_NODES $NODE#2";;
  esac
done
echo "Starting computation with $COUNT node(s): $COLORED_NODES..."
# launch the unique root configuration node                    #(*3*)
$PAR -p3lroot $COLORED_NODES 1> log-root 2> err-root
echo "Finished."
\end{verbatim}
\end{small}
\noindent This script assumes \verb|mandel.par| to be accessible on all participating machines, and does the following:
\begin{itemize}
\item runs \verb|mandel.par| on all participating machines (\verb|#(*1*)|),
\item generates a coloring for the participating nodes (\verb|#(*2*)|),
\item launches the computation by starting the root process on the local machine (\verb|#(*3*)|), providing it with the list of colored participating hosts.
\end{itemize}
In future versions, especially those incorporating the MPI communication layer, the startup mechanism may work differently (typically, the initialization steps will be performed by the MPI layer).
\section{Common errors} \label{sec:commonerrors}
A few words of warning now: even if the user program is easy to write, compile and execute, you should not forget that the underlying machinery is quite sophisticated, and that in some situations you may not get what you expected.
Two typical problems you may encounter are the following:
\begin{description}
\item[output value: code mismatch] If you see this error during the parallel execution of your program, it means that two incompatible versions of your program are trying to communicate. \ocaml\ uses an MD5 check of the code area before sending closures over a channel, because this operation only makes sense between ``identical'' programs.\\
Two possible reasons for the error are:
\begin{itemize}
\item an old version of your program is still running somewhere and is trying to communicate with the newer version you are running now: you should kill all the running processes and try again;
\item you are running copies of the program compiled for different architectures: this is not yet supported, and you should run the program on homogeneous architectures.
\end{itemize}
\item[references] You should remember that the user functions provided to the skeletons will all be executed on different machines, so their behaviour \emph{must not} rely on the existence of implicitly shared data, such as global references: if it does, the sequential and parallel behaviours will differ. This does not imply that all user functions must be pure functions (you can use local store to keep a counter, for example), but an access to a global reference is certainly a mistake, since every node will access its \emph{own private} copy of the data, thus defeating the purpose of the shared data.
\end{description}
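To make the second pitfall concrete, here is a small sketch (our own illustration, not code from the distribution). The first worker reads a global reference, so in a parallel run every node updates its \emph{own} copy of \texttt{global\_count} and the results diverge from the sequential run; the second keeps its counter in a closure-local reference, which is harmless because that state is never meant to be shared across nodes:

```ocaml
(* WRONG in parallel: depends on a global reference. Each node gets
   a private copy of global_count, so increments performed by one
   worker are invisible to the others. *)
let global_count = ref 0
let bad_worker x =
  incr global_count;
  x + !global_count

(* SAFE: the counter is local to each worker instance; no sharing
   across nodes is ever assumed. *)
let make_counting_worker () =
  let count = ref 0 in
  fun x -> incr count; x + !count
```

Sequentially the two behave identically; the difference only appears once several nodes each hold their own copy of the global reference.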