https://github.com/estherjulien/HybridML
Tip revision: 9985e6d930e8b98eb03330a964c2c3fc8788630c authored by estherjulien on 01 August 2022, 11:54:59 UTC
HybridCode deleted from test_data_gen.py
README.md
# Code for the paper:
> **Reconstructing Phylogenetic Networks via Cherry Picking and Machine Learning**  
> *Giulia Bernardini, Leo van Iersel, Esther Julien and Leen Stougie*

To run the code, the following packages have to be installed: `numpy`, `pandas`, `scikit-learn`, `networkx`, and 
`joblib`.
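
Before running anything, it can help to confirm that the listed packages are importable. The following is a minimal sketch; the helper name `missing_packages` is hypothetical and not part of the repository. Note that `scikit-learn` is imported under the name `sklearn`.

```python
from importlib.util import find_spec

# Import names for the packages listed above.
# scikit-learn is installed as "scikit-learn" but imported as "sklearn".
REQUIRED = ["numpy", "pandas", "sklearn", "networkx", "joblib"]


def missing_packages(names=REQUIRED):
    """Return the import names from `names` that cannot be found."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are available.")
```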

### Cherry picking heuristic (CPH)
The CPH is implemented in `CPH.py`. To experiment with it, run the following in a terminal:
```bash
python main_heuristic.py <instance num.> <ML model name> <leaf number> <bool (0/1) for exact input> <option>
```
The meaning of `option` depends on `exact input`:
- if `exact input` = 0, `option` is the reticulation number;
- if `exact input` = 1, `option` is the forest size.

EXAMPLE: 
```bash
python main_heuristic.py 0 N10_maxL100_random_balanced 20 0 50
```
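
The positional arguments can also be assembled programmatically, for example to run several instances in a loop. This is a sketch only: `build_cph_command` is a hypothetical helper, not part of the repository, and it merely reproduces the argument order of the command shown above.

```python
import sys


def build_cph_command(instance, model_name, n_leaves, exact_input, option):
    """Assemble the argument list for main_heuristic.py (hypothetical helper).

    `option` is the reticulation number when exact_input is False
    and the forest size when exact_input is True.
    """
    return [
        sys.executable,
        "main_heuristic.py",
        str(instance),
        model_name,
        str(n_leaves),
        "1" if exact_input else "0",
        str(option),
    ]


# Same settings as the example command above.
cmd = build_cph_command(0, "N10_maxL100_random_balanced", 20, False, 50)
print(" ".join(cmd[1:]))
# To actually run it from the repository root:
# import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting issues if a model name ever contains special characters.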
Tree sets for testing can be generated with `test_data_gen.py`.

### Generate training data
The code for initializing and updating the features can be found in `Features.py`.

To train the random forest, training data must first be generated.
The code for this is in `train_data_gen.py`.

Run the following in a terminal:
```bash
python train_data_gen.py <number of networks> <maxL>
```
`maxL`: maximum number of leaves per network

EXAMPLE:
```bash
python train_data_gen.py 10 20
```

### Train random forest
The code for training a random forest is in the `LearningCherries` folder.
Some trained random forests are stored in `joblib` format in `LearningCherries/TrainedRFModels`.
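
A stored model can be inspected outside the heuristic with `joblib.load`. The sketch below assumes that the files in `LearningCherries/TrainedRFModels` carry a `.joblib` extension and that each file holds a fitted scikit-learn estimator; neither is confirmed here. `list_models` and `load_model` are hypothetical helper names.

```python
from pathlib import Path

MODEL_DIR = Path("LearningCherries/TrainedRFModels")


def list_models(folder=MODEL_DIR, suffix=".joblib"):
    """Return the sorted names (without suffix) of model files in `folder`.

    The ".joblib" suffix is an assumption about how the files are named.
    """
    return sorted(p.stem for p in Path(folder).glob(f"*{suffix}"))


def load_model(name, folder=MODEL_DIR, suffix=".joblib"):
    """Load one stored model with joblib (imported lazily)."""
    import joblib

    return joblib.load(Path(folder) / f"{name}{suffix}")


if __name__ == "__main__":
    for name in list_models():
        print(name)
```

Since `joblib.load` unpickles arbitrary Python objects, it should only be used on files from a trusted source.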

### Preprocessed data
The `Data` folder contains:
- `Results` of CPH instances. The instances are divided into four categories: FTS (Full Tree Set), Real, RealSmall, and RTS (Restricted Tree Set).
  - `FINAL_heuristic_scores_ML[<random forest model used>]_TEST[<test instance type used>]_<number of instances>.pickle`
- `Test` for the different test instances
- `Train` for the generated training data
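
The result file names encode the model, the test instance type, and the number of instances, so they can be split apart before loading. The sketch below assumes the square brackets in the pattern above appear literally in the file names; `parse_result_name` and `load_results` are hypothetical helpers, and the structure of the pickled object is not specified here.

```python
import pickle
import re
from pathlib import Path

# Pattern taken from the file name template above; literal brackets are assumed.
RESULT_NAME = re.compile(
    r"FINAL_heuristic_scores_ML\[(?P<model>.+)\]_TEST\[(?P<test>.+)\]_(?P<n>\d+)\.pickle"
)


def parse_result_name(filename):
    """Split a result file name into model, test type and instance count.

    Returns None if the name does not follow the pattern.
    """
    match = RESULT_NAME.fullmatch(Path(filename).name)
    if match is None:
        return None
    return {
        "model": match["model"],
        "test": match["test"],
        "instances": int(match["n"]),
    }


def load_results(path):
    """Unpickle one result file; only use this on files you trust."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


# Hypothetical file name, for illustration only.
example = "FINAL_heuristic_scores_ML[N10_maxL100_random_balanced]_TEST[FTS]_10.pickle"
print(parse_result_name(example))
```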