# Quick Start

Katib is a Kubernetes-native system for [Hyperparameter Tuning][1] and [Neural Architecture Search][2]. This short introduction shows how to use Katib to:

- Define a hyperparameter tuning experiment.
- Evaluate it using the resources in Kubernetes.
- Get the best hyperparameter combination across all trials.

## Requirements

Before you run the hyperparameter tuning experiment, you need to have:

- A Kubernetes cluster with the [TF operator and Katib installed](https://github.com/kubeflow/katib#installation)

## Katib in Kubeflow

See the following guides in the Kubeflow documentation:

* [Concepts](https://www.kubeflow.org/docs/components/hyperparameter-tuning/overview/) 
  in Katib, hyperparameter tuning, and neural architecture search.
* [Getting started with Katib](https://kubeflow.org/docs/components/hyperparameter-tuning/hyperparameter/).
* Detailed guide to [configuring and running a Katib 

## Hyperparameter Tuning on MNIST

Katib supports multiple [Machine Learning Frameworks](https://en.wikipedia.org/wiki/Comparison_of_deep-learning_software) (e.g. TensorFlow, PyTorch, MXNet, and XGBoost).

This quick start guide uses TensorFlow, one of the most popular frameworks, to run a hyperparameter tuning job on MNIST with Katib.

### Package Training Code

The first step is to package the training code into a Docker image. We use the [example code](https://github.com/kubeflow/tf-operator/blob/master/examples/v1/mnist_with_summaries/mnist_with_summaries.py), which builds a simple neural network and trains it on MNIST. The code trains the network and writes the TFEvents to `/tmp` by default.

A prebuilt image, `gcr.io/kubeflow-ci/tf-mnist-with-summaries:1.0`, is available, so you can skip this step.

### Create the Experiment

To have Katib tune hyperparameters automatically, you need to define an `Experiment`, a custom resource (CRD) that represents a single optimization run over a feasible space. Each `Experiment` contains:

1. Parallelism configuration: How many trials run in parallel, the maximum number of trials, and the maximum number of failed trials.
1. Objective: The metric that we want to optimize.
1. Search space: The name and the distribution (discrete valued or continuous valued) of all the hyperparameters you need to search.
1. Search algorithm: The algorithm (e.g. Random Search, Grid Search, TPE, Bayesian Optimization) used to find the best hyperparameters.
1. Trial Template: The template used to define the trial.
1. Metrics Collection: How to collect the metrics (e.g. accuracy, loss).
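To make the search space and search algorithm items above concrete, here is a minimal Python sketch of random search over the two hyperparameters used by the example experiment in this guide. It is an illustration only, not Katib's implementation: uniform sampling, the `SEARCH_SPACE` layout, and the `sample_assignment` helper are assumptions made for this sketch.

```python
import random

# Search space mirroring the example experiment in this guide
# (learning_rate: double in [0.01, 0.05], batch_size: int in [100, 200]).
# The dictionary layout is hypothetical, chosen for this sketch.
SEARCH_SPACE = [
    {"name": "learning_rate", "parameterType": "double", "min": "0.01", "max": "0.05"},
    {"name": "batch_size", "parameterType": "int", "min": "100", "max": "200"},
]


def sample_assignment(search_space, rng):
    """Draw one value per hyperparameter, uniformly within its bounds."""
    assignment = {}
    for param in search_space:
        low, high = param["min"], param["max"]
        if param["parameterType"] == "int":
            value = rng.randint(int(low), int(high))  # both bounds inclusive
        else:
            value = rng.uniform(float(low), float(high))
        # Values are kept as strings, as in the trial output shown in this guide.
        assignment[param["name"]] = str(value)
    return assignment


rng = random.Random(0)  # fixed seed so repeated runs give the same suggestions
# Three suggestions, matching parallelTrialCount: 3 in the example experiment.
suggestions = [sample_assignment(SEARCH_SPACE, rng) for _ in range(3)]
for suggestion in suggestions:
    print(suggestion)
```

Each suggestion corresponds to the hyperparameter assignment of one trial.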

The `Experiment` is defined as follows:

```yaml
apiVersion: "kubeflow.org/v1beta1"
kind: Experiment
metadata:
  namespace: kubeflow
  name: tfjob-example
spec:
  parallelTrialCount: 3
  maxTrialCount: 12
  maxFailedTrialCount: 3
  objective:
    type: maximize
    goal: 0.99
    objectiveMetricName: accuracy_1
  algorithm:
    algorithmName: random
  metricsCollectorSpec:
    source:
      fileSystemPath:
        path: /train
        kind: Directory
    collector:
      kind: TensorFlowEvent
  parameters:
    - name: learning_rate
      parameterType: double
      feasibleSpace:
        min: "0.01"
        max: "0.05"
    - name: batch_size
      parameterType: int
      feasibleSpace:
        min: "100"
        max: "200"
  trialTemplate:
    trialParameters:
      - name: learningRate
        description: Learning rate for the training model
        reference: learning_rate
      - name: batchSize
        description: Batch Size
        reference: batch_size
    trialSpec:
      apiVersion: "kubeflow.org/v1"
      kind: TFJob
      spec:
        tfReplicaSpecs:
          Worker:
            replicas: 2
            restartPolicy: OnFailure
            template:
              spec:
                containers:
                  - name: tensorflow
                    image: gcr.io/kubeflow-ci/tf-mnist-with-summaries:1.0
                    imagePullPolicy: Always
                    command:
                      - "python"
                      - "/var/tf_mnist/mnist_with_summaries.py"
                      - "--log_dir=/train/metrics"
                      - "--learning_rate=${trialParameters.learningRate}"
                      - "--batch_size=${trialParameters.batchSize}"
```


The experiment has two hyperparameters defined in `parameters`: `learning_rate` and `batch_size`. It uses the random search algorithm and collects metrics from the TFEvents files.
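The trial template refers to these hyperparameters through `${trialParameters.<name>}` placeholders in the container command. The following Python sketch illustrates the idea of that substitution for a single trial. It is not Katib's code: the `render_command` helper and the dictionary shapes are hypothetical, and the assignment values are the ones from the sample trial output in this guide.

```python
import re


def render_command(command, assignments, trial_parameters):
    """Replace ${trialParameters.<name>} placeholders with assigned values.

    trial_parameters maps a template name (e.g. learningRate) to the
    search-space parameter it references (e.g. learning_rate).
    """
    pattern = re.compile(r"\$\{trialParameters\.(\w+)\}")

    def substitute(match):
        reference = trial_parameters[match.group(1)]
        return assignments[reference]

    return [pattern.sub(substitute, arg) for arg in command]


command = [
    "python",
    "/var/tf_mnist/mnist_with_summaries.py",
    "--log_dir=/train/metrics",
    "--learning_rate=${trialParameters.learningRate}",
    "--batch_size=${trialParameters.batchSize}",
]
# name -> reference, as declared under trialParameters in the experiment
trial_parameters = {"learningRate": "learning_rate", "batchSize": "batch_size"}
# One trial's hyperparameter assignment
assignments = {"learning_rate": "0.01156268890324629", "batch_size": "196"}

rendered = render_command(command, assignments, trial_parameters)
print(rendered[-2:])
# ['--learning_rate=0.01156268890324629', '--batch_size=196']
```

Arguments without a placeholder, such as `--log_dir=/train/metrics`, pass through unchanged.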


To create the experiment, apply the example manifest from the Katib repository:

```bash
kubectl apply -f https://raw.githubusercontent.com/kubeflow/katib/master/examples/v1beta1/tfjob-example.yaml
```

### Get trial results

You can get the trial results with the following command, which requires [`jq`](https://stedolan.github.io/jq/download/) to parse the JSON:

```bash
kubectl -n kubeflow get trials -o json | jq ".items[] | {assignments: .spec.parameterAssignments, observation: .status.observation}"
```

You should get output similar to the following for each trial:

```json
{
  "assignments": [
    {
      "name": "learning_rate",
      "value": "0.01156268890324629"
    },
    {
      "name": "batch_size",
      "value": "196"
    }
  ],
  "observation": {
    "metrics": [
      {
        "latest": "0.968200027943",
        "max": "1.0",
        "min": "0.0714285746217",
        "name": "accuracy_1"
      }
    ]
  }
}
```

You can also view the results in the Katib UI at `<Katib-URL>/katib/#/katib/hp_monitor/kubeflow/tfjob-example`.


When you click a trial name, the UI shows the metric details for that trial.
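To pick the best hyperparameter combination from the trial results returned by the `kubectl` command above, a short script can compare the objective metric across trials. The Python sketch below assumes the per-trial objects have been collected into a JSON array (for example by wrapping the `jq` filter in `[...]`). The first trial is the sample from this guide; the other two are hypothetical entries added for illustration. Comparing on the `latest` field is also an assumption of this sketch; pass `field="max"` or `field="min"` to compare on another field.

```python
import json

# First entry: the sample trial from this guide.
# Second and third entries: hypothetical trials, made up for illustration.
TRIALS_JSON = """
[
  {"assignments": [{"name": "learning_rate", "value": "0.01156268890324629"},
                   {"name": "batch_size", "value": "196"}],
   "observation": {"metrics": [{"latest": "0.968200027943", "max": "1.0",
                                "min": "0.0714285746217", "name": "accuracy_1"}]}},
  {"assignments": [{"name": "learning_rate", "value": "0.03"},
                   {"name": "batch_size", "value": "128"}],
   "observation": {"metrics": [{"latest": "0.95", "max": "0.96",
                                "min": "0.08", "name": "accuracy_1"}]}},
  {"assignments": [{"name": "learning_rate", "value": "0.02"},
                   {"name": "batch_size", "value": "150"}],
   "observation": null}
]
"""


def metric_value(trial, metric_name, field):
    """Return the metric as a float, or None if the trial has no observation yet."""
    observation = trial.get("observation") or {}
    for metric in observation.get("metrics", []):
        if metric["name"] == metric_name:
            return float(metric[field])
    return None


def best_trial(trials, metric_name, objective_type="maximize", field="latest"):
    """Pick the trial with the best metric value; skip trials without one."""
    scored = [(metric_value(t, metric_name, field), t) for t in trials]
    scored = [(value, t) for value, t in scored if value is not None]
    if not scored:
        return None
    choose = max if objective_type == "maximize" else min
    return choose(scored, key=lambda pair: pair[0])[1]


trials = json.loads(TRIALS_JSON)
best = best_trial(trials, "accuracy_1")
best_params = {a["name"]: a["value"] for a in best["assignments"]}
print(best_params)
# {'learning_rate': '0.01156268890324629', 'batch_size': '196'}
```

Trials that have not reported an observation yet, like the third entry, are ignored rather than treated as failures.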


<!-- ## Hyperparameter Tuning with Distributed Training on MNIST -->

[1]: https://en.wikipedia.org/wiki/Hyperparameter_optimization
[2]: https://en.wikipedia.org/wiki/Neural_architecture_search
