# Quick Start
Katib is a Kubernetes-native system for [Hyperparameter Tuning][1] and [Neural Architecture Search][2]. This short introduction illustrates how to use Katib to:
- Define a hyperparameter tuning experiment.
- Evaluate it using the resources in Kubernetes.
- Get the best hyperparameter combination from all of these trials.
## Requirements
Before you run the hyperparameter tuning experiment, you need to have:
- A Kubernetes cluster with [installed TF operator and Katib](https://github.com/kubeflow/katib#installation)
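To double-check the installation, you can verify that the Katib and TF operator CRDs are registered in the cluster. A quick sanity check (CRD names as registered by the v1beta1 Katib and v1 TF operator releases):
```bash
# Katib registers experiments, suggestions, and trials CRDs;
# the TF operator registers tfjobs.
kubectl get crd | grep -E 'experiments|suggestions|trials|tfjobs'
```
You should see entries such as `experiments.kubeflow.org` and `tfjobs.kubeflow.org`.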
## Katib in Kubeflow
See the following guides in the Kubeflow documentation:
* [Concepts](https://www.kubeflow.org/docs/components/hyperparameter-tuning/overview/)
in Katib, hyperparameter tuning, and neural architecture search.
* [Getting started with Katib](https://kubeflow.org/docs/components/hyperparameter-tuning/hyperparameter/).
* Detailed guide to [configuring and running a Katib
experiment](https://kubeflow.org/docs/components/hyperparameter-tuning/experiment/).
## Hyperparameter Tuning on MNIST
Katib supports multiple [Machine Learning Frameworks](https://en.wikipedia.org/wiki/Comparison_of_deep-learning_software) (e.g. TensorFlow, PyTorch, MXNet, and XGBoost).
In this quick start guide, we demonstrate how to run a hyperparameter tuning job on MNIST in Katib with TensorFlow, one of the most popular machine learning frameworks.
### Package Training Code
The first thing we need to do is package the training code into a Docker image. We use the [example code](https://github.com/kubeflow/tf-operator/blob/master/examples/v1/mnist_with_summaries/mnist_with_summaries.py), which builds a simple neural network and trains it on MNIST. The code trains the network and writes TFEvents to `/tmp` by default.
You can use the prebuilt image `gcr.io/kubeflow-ci/tf-mnist-with-summaries:1.0` and skip this step.
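If you prefer to package your own training code instead, the workflow is the usual Docker build-and-push loop. A rough sketch, assuming a Dockerfile sits next to the example script and `<your-registry>` is a registry your cluster can pull from:
```bash
# Fetch the example training code.
git clone https://github.com/kubeflow/tf-operator.git
cd tf-operator/examples/v1/mnist_with_summaries

# Build and push the image (assumes a Dockerfile in this directory).
docker build -t <your-registry>/tf-mnist-with-summaries:1.0 .
docker push <your-registry>/tf-mnist-with-summaries:1.0
```
Remember to point the `image` field of the trial template at your own image.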
### Create the Experiment
If you want to use Katib to automatically tune hyperparameters, you need to define the `Experiment`, which is a CRD that represents a single optimization run over a feasible space. Each `Experiment` contains:
1. Parallelism and budget: How many trials run in parallel (`parallelTrialCount`), the total trial budget (`maxTrialCount`), and the number of failed trials to tolerate (`maxFailedTrialCount`).
1. Objective: The metric that we want to optimize.
1. Search space: The name and the distribution (discrete valued or continuous valued) of all the hyperparameters you need to search.
1. Search algorithm: The algorithm (e.g. Random Search, Grid Search, TPE, Bayesian Optimization) used to find the best hyperparameters.
1. Trial template: The template used to generate each trial's job (here, a `TFJob` spec).
1. Metrics Collection: Definition about how to collect the metrics (e.g. accuracy, loss).
The `Experiment` for this example is defined as follows:
<details>
<summary>Click here to get YAML configuration</summary>
```yaml
apiVersion: "kubeflow.org/v1beta1"
kind: Experiment
metadata:
  namespace: kubeflow
  name: tfjob-example
spec:
  parallelTrialCount: 3
  maxTrialCount: 12
  maxFailedTrialCount: 3
  objective:
    type: maximize
    goal: 0.99
    objectiveMetricName: accuracy_1
  algorithm:
    algorithmName: random
  metricsCollectorSpec:
    source:
      fileSystemPath:
        path: /train
        kind: Directory
    collector:
      kind: TensorFlowEvent
  parameters:
    - name: learning_rate
      parameterType: double
      feasibleSpace:
        min: "0.01"
        max: "0.05"
    - name: batch_size
      parameterType: int
      feasibleSpace:
        min: "100"
        max: "200"
  trialTemplate:
    trialParameters:
      - name: learningRate
        description: Learning rate for the training model
        reference: learning_rate
      - name: batchSize
        description: Batch Size
        reference: batch_size
    trialSpec:
      apiVersion: "kubeflow.org/v1"
      kind: TFJob
      spec:
        tfReplicaSpecs:
          Worker:
            replicas: 2
            restartPolicy: OnFailure
            template:
              spec:
                containers:
                  - name: tensorflow
                    image: gcr.io/kubeflow-ci/tf-mnist-with-summaries:1.0
                    imagePullPolicy: Always
                    command:
                      - "python"
                      - "/var/tf_mnist/mnist_with_summaries.py"
                      - "--log_dir=/train/metrics"
                      - "--learning_rate=${trialParameters.learningRate}"
                      - "--batch_size=${trialParameters.batchSize}"
```
The experiment defines two hyperparameters in `parameters`: `learning_rate` and `batch_size`. It uses the random search algorithm and collects metrics from the TensorFlow event files.
</details>
Or you could just run:
```bash
kubectl apply -f https://raw.githubusercontent.com/kubeflow/katib/master/examples/v1beta1/tfjob-example.yaml
```
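After applying the manifest, you can follow the experiment with standard `kubectl` queries, since `Experiment` and `Trial` are ordinary custom resources:
```bash
# Check the overall experiment status and see whether it has succeeded.
kubectl -n kubeflow get experiment tfjob-example

# List the trials spawned by the experiment and their individual statuses.
kubectl -n kubeflow get trials
```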
### Get Trial Results
You can get the trial results with the following command (you need [`jq`](https://stedolan.github.io/jq/download/) installed to parse the JSON):
```bash
kubectl -n kubeflow get trials -o json | jq ".items[] | {assignments: .spec.parameterAssignments, observation: .status.observation}"
```
You should see output similar to the following:
```json
...
{
"assignments": [
{
"name": "learning_rate",
"value": "0.01156268890324629"
},
{
"name": "batch_size",
"value": "196"
}
],
"observation": {
"metrics": [
{
"latest": "0.968200027943",
"max": "1.0",
"min": "0.0714285746217",
"name": "accuracy_1"
}
]
}
}
...
```
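To get the best hyperparameter combination directly, you can read the `status.currentOptimalTrial` field of the experiment, which Katib keeps updated with the best observed trial:
```bash
# Print the best parameter assignments and their observed metrics.
kubectl -n kubeflow get experiment tfjob-example -o json | jq ".status.currentOptimalTrial"
```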
Alternatively, you can view the results in the Katib UI at `<Katib-URL>/katib/#/katib/hp_monitor/kubeflow/tfjob-example`.
![](./images/quickstart.png)
When you click a trial name, you will see the details of its metrics:
![](./images/quickstart-trial.png)
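If the Katib UI is not exposed through an ingress, one way to reach it locally is to port-forward the `katib-ui` service (service name and port as in a default installation; adjust if yours differs):
```bash
# Forward the Katib UI service to localhost,
# then open http://localhost:8080/katib/ in a browser.
kubectl -n kubeflow port-forward svc/katib-ui 8080:80
```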
<!-- ## Hyperparameter Tuning with Distributed Training on MNIST -->
[1]: https://en.wikipedia.org/wiki/Hyperparameter_optimization
[2]: https://en.wikipedia.org/wiki/Neural_architecture_search