Tip revision: 399051d40991d27f92d65c09bb2923ce26a6e00b authored by Rafael G. Mantovani on 06 February 2024, 23:16:05 UTC
Update README.md with raw results' drive link
# Decision Tree Tuning Analysis

'DecisionTreeTuningAnalysis' is R code that automatically generates the graphical analyses of our paper ['Better Trees: An empirical study on hyperparameter tuning of classification decision trees'](https://link.springer.com/article/10.1007/s10618-024-01002-5) [01]. The analysis coded here handles data generated by our hyperparameter tuning project ([HpTuning](https://github.com/rgmantovani/HpTuning)) but may be easily extended. The main features cover the hyperparameter profile of the decision tree induction algorithms, i.e., they answer the following questions:

* Question 01: Is tuning of trees really necessary?
* Question 02: When performing tuning, which are the most recommended techniques, considering our choices?
* Question 03: Which hyperparameters most impact the induced trees?
* Question 04: In which situations should we tune trees?

### Installation

To install the project, clone the repository by running the following command in your terminal session:

```
git clone https://github.com/rgmantovani/DecisionTreeTuningAnalysis
```

### General instructions

The classification algorithms analyzed must follow the ['mlr'](https://github.com/mlr-org/mlr) R package implementation [02]. A complete list of the available learners may be found [here](http://mlr-org.github.io/mlr-tutorial/release/html/integrated_learners/). The code provided here generates results for two decision tree induction algorithms: J48 (classif.J48) and CART (classif.rpart).

Hyperparameter tuning results should be placed in the ```data/hptuning_full_space/<algorithm.name>/results``` sub-directory. We did not upload the raw results to this repository since they amount to more than 50GB of data, but you can download them from [here](https://drive.google.com/drive/folders/1Ltz63VCv4tFPLxxQesvjV6Cmt5_wFrdn?usp=sharing). Instead, we developed some scripts to extract useful information from the executed jobs. These scripts are in the ```scripts``` folder. The automated analysis only works after these scripts have been run; the automated code checks for this and, if needed, tells the user how to proceed. There are 4 auxiliary scripts:

* **01_extractRepResults.R**: it extracts all the average performance measures obtained from 30 repetitions of a single job composed of an algorithm, a dataset, and a tuning technique. Most of the performance plots use this information;
* **02_extractOptPaths.R**: it extracts all the optimization paths obtained by the tuning techniques when executed 30 times in each dataset and algorithm configuration. All the convergence and learning curve plots use this information;
* **03_extractModelStats.R**: it extracts models' statistics (number of leaves, number of rules, tree's size, etc.) for each job;
* **04_createFanovaInputs.R**: it creates the input files used by the [fAnova framework](https://github.com/automl/fanova) [03], which computes hyperparameter importance using marginal distributions.

All extraction scripts require the algorithm's name as a parameter (```<algorithm.name>```).
The scripts can be run in any order, but all of them must be executed. The files they generate are later read, aggregated as ```data.frame``` objects, and used by the automated code.
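Since every script has to be run for every algorithm, the calls can be generated in a loop. The following is a minimal sketch, assuming it is run from inside the ```scripts``` folder; the `run_extraction` function and the `DRY_RUN` switch are hypothetical names introduced for this example and are not part of the project.

```shell
# Sketch: run (or only print) every extraction script for both algorithms.
# DRY_RUN is a hypothetical switch introduced for this example; it defaults
# to 1 so that nothing is executed until you opt in with DRY_RUN=0.
run_extraction() {
  for algo in classif.J48 classif.rpart; do
    for script in 01_extractRepResults.R 02_extractOptPaths.R \
                  03_extractModelStats.R 04_createFanovaInputs.R; do
      if [ "${DRY_RUN:-1}" = "1" ]; then
        # Only show the command that would be executed.
        echo "Rscript $script --algo=\"$algo\""
      else
        # Run in the foreground so that a failing script stops the loop.
        Rscript "$script" --algo="$algo" || return 1
      fi
    done
  done
}

# Print the commands without executing them.
run_extraction
```

Calling `DRY_RUN=0 run_extraction` executes the scripts one after another instead of printing them. Unlike the examples in the sections below, this sketch does not send jobs to the background, which makes failures easier to spot at the cost of a longer wall-clock time.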


#### A - Extracting main results

```shell
cd scripts
Rscript 01_extractRepResults.R --algo=<algorithm.name> &

# examples:
# Rscript 01_extractRepResults.R --algo="classif.J48" &
# Rscript 01_extractRepResults.R --algo="classif.rpart" &
```

#### B - Extracting optimization paths

```shell
cd scripts
Rscript 02_extractOptPaths.R --algo=<algorithm.name> &

# examples:
# Rscript 02_extractOptPaths.R --algo="classif.J48" &
# Rscript 02_extractOptPaths.R --algo="classif.rpart" &
```

#### C - Extracting models' statistics 

```shell
cd scripts
Rscript 03_extractModelStats.R --algo=<algorithm.name> &

# examples:
# Rscript 03_extractModelStats.R --algo="classif.J48" &
# Rscript 03_extractModelStats.R --algo="classif.rpart" &
```

#### D - FAnova hyperparameter marginal predictions

FAnova marginal predictions are obtained by an [external project](https://github.com/automl/fanova) [03]. Our script generates input files in the pattern required by the FAnova Python script. To run it:

```shell
cd scripts
Rscript 04_createFanovaInputs.R --algo=<algorithm.name> &

# examples:
# Rscript 04_createFanovaInputs.R --algo="classif.J48" &
# Rscript 04_createFanovaInputs.R --algo="classif.rpart" &
```

The output will be placed in a folder named ```data/hptuning_full_space/<algorithm.name>/fanova_input```, with one file per dataset. Provide these files to the external project, and it will generate one corresponding file per dataset. These new files should be placed in the ```data/hptuning_full_space/<algorithm.name>/fanova_output``` sub-directory.
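Because both folders are expected to hold one file per dataset, comparing their file counts is a quick sanity check before running the analysis. The sketch below assumes it is called from the repository root; `count_files` and `check_fanova_counts` are hypothetical helper names, and the check makes no assumption about how the external project names its output files.

```shell
# Sketch: compare the number of fAnova input and output files for an algorithm.
# Both folders should hold one file per dataset, so the counts should match.
count_files() {
  # Count regular files directly inside directory $1 (0 if it does not exist).
  if [ -d "$1" ]; then
    find "$1" -maxdepth 1 -type f | wc -l | tr -d ' '
  else
    echo 0
  fi
}

check_fanova_counts() {
  # $1 is the algorithm name, e.g. classif.J48
  base="data/hptuning_full_space/$1"
  n_in=$(count_files "$base/fanova_input")
  n_out=$(count_files "$base/fanova_output")
  echo "$1: $n_in input file(s), $n_out output file(s)"
  # Succeed only when both folders hold the same number of files.
  [ "$n_in" -eq "$n_out" ]
}
```

For example, `check_fanova_counts "classif.rpart"` prints both counts and returns a non-zero status when they differ. Matching counts do not prove that every dataset has its own output file, so this is only a first check.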

### Running the code

To run the main analysis, use the following command:
```shell
 Rscript 01_mainAnalysis.R --algo=<algorithm.name> &

 # examples:
 # Rscript 01_mainAnalysis.R --algo="classif.rpart" &
 # Rscript 01_mainAnalysis.R --algo="classif.J48"   &
```
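The trailing `&` sends each job to the background, so a failure can go unnoticed. The sketch below shows one way to start both algorithm runs, keep their logs, and report failures with `wait`. The `run_both` function and the log file names are hypothetical, and running the two analyses concurrently assumes they do not write to the same output files, which this README does not state.

```shell
# Sketch: launch the same command for both algorithms in the background,
# keep one log per run, and report any failure once both have finished.
# run_both, rpart.log and J48.log are hypothetical names for this example.
run_both() {
  "$@" --algo="classif.rpart" > rpart.log 2>&1 &
  pid_rpart=$!
  "$@" --algo="classif.J48" > J48.log 2>&1 &
  pid_j48=$!

  rc=0
  # wait returns the exit status of the given background job.
  wait "$pid_rpart" || { echo "classif.rpart failed, see rpart.log"; rc=1; }
  wait "$pid_j48"   || { echo "classif.J48 failed, see J48.log"; rc=1; }
  return "$rc"
}

# Usage (not executed here):
#   run_both Rscript 01_mainAnalysis.R
```

The function returns 0 only when both runs succeed, which makes it safe to chain with a later step.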

Meta-level results are independent and can be generated by:
```shell
 Rscript 02_metaAnalysis.R &
```


### Contact

Rafael Gomes Mantovani (rgmantovani@gmail.com / rafaelmantovani@utfpr.edu.br), Federal University of Technology - Paraná (UTFPR) - Apucarana - PR, Brazil.

### References

[01] Rafael Gomes Mantovani, Tomas Horvath, André L. D. Rossi, Ricardo Cerri, Sylvio Barbon Junior, Joaquin Vanschoren, André C. P. L. F. Carvalho. Better Trees: An empirical study on hyperparameter tuning of classification decision trees. *Data Min Knowl Disc* (2024). [https://doi.org/10.1007/s10618-024-01002-5](https://doi.org/10.1007/s10618-024-01002-5).

[02] B. Bischl, Michel Lang, Lars Kotthoff, Julia Schiffner, Jakob Richter, Erich Studerus, Giuseppe Casalicchio, Zachary Jones. [mlr: Machine Learning in R](https://github.com/mlr-org/mlr). *Journal of Machine Learning Research*, v.17, n.170, 2016, pgs 1-5.

[03] F. Hutter, H. Hoos, K. Leyton-Brown. [An Efficient Approach for Assessing Hyperparameter Importance](http://jmlr.org/proceedings/papers/v32/hutter14.html). In: *Proceedings of the 31st International Conference on Machine Learning*, ICML 2014, Beijing, China, 2014, pgs 754-762.