# Paper MPI Rank Reordering Reproducibility

[![SWH](https://archive.softwareheritage.org/badge/origin/https://gitlab.tuwien.ac.at/philippe.swartvagher/paper-mpi-rank-reordering-r13y/)](https://archive.softwareheritage.org/browse/origin/?origin_url=https://gitlab.tuwien.ac.at/philippe.swartvagher/paper-mpi-rank-reordering-r13y)

This repository contains code and scripts to reproduce experiments and results
presented in the paper *Using Mixed-Radix Decomposition to Enumerate Resources of
Deeply Hierarchical Architectures*, submitted to [ExaMPI
2023](https://sites.google.com/site/workshopexampi/home).

Authors:
- Philippe Swartvagher
- Sascha Hunold
- Jesper Larsson Träff
- Ioannis Vardas


## Requirements

### Hardware

Reproducing the presented experiments requires several interconnected HPC nodes.


### Software

To build benchmark programs, run them and execute scripts, you need:
- tools to build benchmark programs: `hwloc`, a C compiler (e.g., `gcc`), an
  MPI library (OpenMPI is recommended for its support of rankfiles),
  `pkg-config`, `make`;
- programs for various scripts: `python3` (version 3.11.2 was used), `xdot`,
  `graphviz`, `python-matplotlib` (version 3.6.3), `python-pandas` (version
  1.5.3), `cut` and `awk`.



## Experiment workflow


### List all possible orders

To list all possible orders for a hierarchy (here the hierarchy is
`2,2,4`):
```sh
python3 scripts/list_orders.py 2,2,4
```
If you add the communicator size with `--comm-size 4`, it will also display the
metrics of each order.
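If an order is simply a permutation of the hierarchy's level indices, the enumeration can be sketched in a few lines of Python (a simplification for illustration: the actual `list_orders.py` may do more, e.g. compute per-order metrics):

```python
from itertools import permutations

def list_orders(hierarchy):
    """Enumerate all orders of a hierarchy: every permutation of the
    level indices (0 = outermost level)."""
    return list(permutations(range(len(hierarchy))))

# A hierarchy with 3 levels, e.g. 2,2,4, has 3! = 6 possible orders.
orders = list_orders([2, 2, 4])
print(len(orders))   # 6
print(orders[0])     # (0, 1, 2) -- the default, unpermuted order
```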


### Visualize mappings

In the following example, the hierarchy is `4,2,2,2` and we visualize the
mapping of communicators of 8 processes each, using the order `3,0,1,2`:
```sh
python3 scripts/visualize_order.py --level-names=Node,Socket,NUMA,Core 4,2,2,2 3,0,1,2 8 | ccomps -x - | dot | gvpack -array2 | neato -Txdot -n2 | xdot -n -
```
Or, if you want to save a PNG file instead of opening an `xdot` window:
```sh
python3 scripts/visualize_order.py --level-names=Node,Socket,NUMA,Core 4,2,2,2 3,0,1,2 8 | ccomps -x - | dot | gvpack -array2 | neato -Tpng -n2 > mapping.png
```


### Check actual mappings

Folder `mpi-mapping` contains the source files of a small MPI program in which
each MPI process displays on the standard output its `MPI_COMM_WORLD` rank, the
hostname of the compute node where it is executed and the logical ID of the core
on which it is bound. Run `make` to build it, then launch it as a regular MPI
program.


### Example of permutations on a rank

Script `scripts/example.py` illustrates how the digits of rank 10 and the
hierarchy elements are permuted to compute a new rank, for different orders.
This script was used to generate Table 1 of the paper.
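The underlying computation can be sketched as follows; this is an illustration of the mixed-radix idea, not a copy of `scripts/example.py`: a rank is decomposed into one digit per hierarchy level, the digits (and their radices) are permuted according to the order, and the result is recomposed into a new rank.

```python
def decompose(rank, hierarchy):
    """Mixed-radix decomposition of a rank: one digit per hierarchy
    level, outermost level first (most significant digit)."""
    digits = []
    for size in reversed(hierarchy):
        digits.append(rank % size)
        rank //= size
    return digits[::-1]

def recompose(digits, hierarchy):
    """Inverse of decompose: rebuild a rank from its digits."""
    rank = 0
    for digit, size in zip(digits, hierarchy):
        rank = rank * size + digit
    return rank

def reorder(rank, hierarchy, order):
    """Permute digits (and radices) according to `order`, then
    recompose them into the new rank."""
    digits = decompose(rank, hierarchy)
    new_digits = [digits[i] for i in order]
    new_hierarchy = [hierarchy[i] for i in order]
    return recompose(new_digits, new_hierarchy)

# Rank 10 in the hierarchy 2,2,4 has digits (1, 0, 2):
print(decompose(10, [2, 2, 4]))            # [1, 0, 2]
# The identity order leaves the rank unchanged:
print(reorder(10, [2, 2, 4], (0, 1, 2)))   # 10
# Another order yields a different rank:
print(reorder(10, [2, 2, 4], (1, 2, 0)))   # 5
```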



### Micro-benchmark

#### Building

```sh
cd micro-benchmark
make -j
```

#### Execution

`micro-benchmark/script.slurm.sh` is a Slurm script that can be used to launch
the micro-benchmark to evaluate the impact of several orders on different
collective operations, with different communicator sizes and different data
sizes. It can be used with the following command:
```sh
sbatch -N 16 script.slurm.sh # to use 16 nodes
```

#### Plots

Once the Slurm job has finished, you can launch the script
`scripts/plot_microbenchmark.sh` (first adapt the values inside the script to
your machine) from the folder containing the output files of the job.



### Splatt

#### Building

```sh
git clone https://github.com/ShadenSmith/splatt.git
cd splatt
```
We used commit `6cb86283c1fbfddcc67c2564e025691de4f784cf`; check out this
commit if necessary.

If you cannot use rankfiles, apply the patch `splatt/reorder.patch`.

```sh
./configure --with-mpi --no-openmp # this doesn't work inside a build folder
make -j
```

Download an input tensor. We used `nell-1`, from the [FROSTT
collection](http://frostt.io/tensors/nell-1/):
```sh
wget https://s3.us-east-2.amazonaws.com/frostt/frostt_data/nell/nell-1.tns.gz
gunzip nell-1.tns.gz
```


#### mpisee

We executed Splatt and profiled it at the same time with `mpisee`:
```sh
git clone https://github.com/variemai/communicator_profiler.git
cd communicator_profiler
git checkout c88ae546cf49946f4f419c7b2a05e8c774bfa874
wget https://github.com/variemai/communicator_profiler/raw/main/mpisee-through/mpisee-through.py

# Apply patches located in `mpisee/`:
git apply /path/to/this/repository/mpisee/communicator_profiler.patch
git apply /path/to/this/repository/mpisee/mpisee-through.patch

mkdir lib
```
Adapt `CC` in the Makefile to an MPI compiler, then build the library:
```sh
make -j
```


#### Execution

First, you have to generate the list of possible orders:
```sh
python3 scripts/list_orders.py 32,2,2,8 > orders.txt
```

`splatt/script.slurm.sh` is a Slurm script that can be used to launch the
Splatt executions to evaluate the impact of the orders listed in `orders.txt`.
It will profile the execution with `mpisee` and execute each order 3 times. You
can launch it in the following way:
```sh
sbatch script.slurm.sh
```

If you want to use OpenMPI's rankfiles, first generate the rankfiles for the
different orders:
```sh
# First argument is the hierarchy, second argument is the number of cores (i.e., number of MPI processes) per node:
python3 scripts/generate_rankfiles.py 32,2,2,8 36
```
It will generate one rankfile per order. Then, modify the Slurm script to call
`mpirun -rankfile $your_rankfile` instead of `srun`, and delete the environment
variables `REORDER_HIERARCHY` and `REORDER_NEW_ORDER`.
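For illustration, here is a minimal sketch of what rankfile generation could look like for the default order, assuming OpenMPI's `rank N=host slot=core` syntax and a compact, node-by-node placement (the actual `generate_rankfiles.py` additionally permutes ranks according to each order):

```python
def rankfile_lines(hosts, cores_per_node):
    """Emit one OpenMPI rankfile line per MPI rank, binding ranks
    to cores node by node (compact placement)."""
    lines = []
    rank = 0
    for host in hosts:
        for core in range(cores_per_node):
            lines.append(f"rank {rank}={host} slot={core}")
            rank += 1
    return lines

# Two hypothetical nodes with 4 cores each -> 8 ranks:
for line in rankfile_lines(["node01", "node02"], 4):
    print(line)
# rank 0=node01 slot=0
# ...
# rank 7=node02 slot=3
```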



#### Plots and analysis

Once the job has finished, you can preprocess the results by running the
following script from the folder containing all the output files (change the
path to `mpisee-through.py` first):
```sh
/path/to/this/repository/splatt/generate_results.sh
```
This script generates a file `results.out` summarizing the performance of
Splatt with each order, and one file per order describing the profiling done by
`mpisee`.

To generate a figure representing the performance of the different rank orders,
you can use (change the value of the variable `DEFAULT_ORDER` first):
```sh
python3 splatt/plot.py
```

To sort orders by the average duration of Splatt execution:
```sh
cut -d ' ' -f 1,2 results.out | sort -k 2
```

To compute the average duration with all orders:
```sh
cut -d ' ' -f 2 results.out | awk '{x+=$0}END{print x/NR}'
```

To compute correlation between durations of Splatt executions and durations of
MPI operations, according to rank orders, you can execute:
```sh
python3 splatt/compute_correlation.py *-*-*-*.*.mpisee --sort
```


### Conjugate Gradient

We use the implementation from the [NAS Parallel
Benchmarks](https://www.nas.nasa.gov/software/npb.html).

#### Building

```sh
wget https://www.nas.nasa.gov/assets/npb/NPB3.4.2.tar.gz
tar xf NPB3.4.2.tar.gz
cd NPB3.4.2/NPB3.4-MPI/
cp config/make.def{.template,}
```

If the compiler is GCC 10, set the following `FFLAGS` in `config/make.def`:
```make
FFLAGS  = -O3 -std=legacy
```

```sh
make cg CLASS=C # adapt the class size
```


#### Execution

First, you have to generate the possible core mappings:
```sh
# The hierarchy of the compute node is 2,4,2,8
# We want to evaluate the cases where the number of MPI processes / used cores is 1, 2, 4, 8, 16, 32, 64 and 128.
python3 scripts/generate_cpu_bind.py 2,4,2,8 1,2,4,8,16,32,64,128 > mappings.txt
```
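To illustrate how different level orders change which cores are selected, here is a small sketch (an illustration only; the actual output format of `generate_cpu_bind.py` may differ): it enumerates logical core IDs in a given level order, so a compact mapping takes the first N IDs of the default order, while other orders scatter processes across the hierarchy.

```python
from itertools import product

def cores_in_order(hierarchy, order):
    """Logical core IDs visited when iterating the hierarchy with the
    given level order (default order is 0,1,...,k-1)."""
    # Stride of each level in the default (row-major) core numbering.
    strides = [1] * len(hierarchy)
    for i in range(len(hierarchy) - 2, -1, -1):
        strides[i] = strides[i + 1] * hierarchy[i + 1]
    reordered_sizes = [hierarchy[i] for i in order]
    ids = []
    for digits in product(*(range(s) for s in reordered_sizes)):
        ids.append(sum(d * strides[order[i]] for i, d in enumerate(digits)))
    return ids

# A tiny 2,2 hierarchy: the default order is compact,
# the reversed order scatters across the outer level first.
print(cores_in_order([2, 2], (0, 1)))   # [0, 1, 2, 3]
print(cores_in_order([2, 2], (1, 0)))   # [0, 2, 1, 3]
```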

`cg/script.slurm.sh` is a Slurm script that can be used to launch the CG
executions to evaluate the impact of the mappings listed in `mappings.txt`. It
will execute each configuration 5 times. You can launch it in the following way:
```sh
sbatch script.slurm.sh
```


#### Plots

Once the job has finished, you can preprocess the results by running the
following script from the folder containing all the output files:
```sh
/path/to/this/repository/cg/generate_results.sh > results.out
```

To generate a figure representing the performance of the different
configurations, you can use (change the value of the variable `DEFAULT_ORDER`
first):
```sh
python3 cg/plot.py
```