Tip revision: df0adb51ec0a9061ef76568f0756fe656467402e authored by YujiOshima on 19 November 2018, 05:05:30 UTC
reset x_train in burn-in
# Simple Minikube Demo
You can deploy the Katib components and try a simple MNIST demo on your laptop!

## Requirements
* VirtualBox
* Minikube
* kubectl

## Deploy
Start Katib on Minikube with [deploy.sh](./MinikubeDemo/deploy.sh).
A Minikube cluster and Katib components will be deployed!

You can check them with `kubectl -n kubeflow get pods`.
Don't worry if `vizier-core` reports an error at first.
It recovers once the DB is ready.
Wait until all components are in `Running` status.

Then, start port-forwarding for the Katib services: `6789 -> manager` and `8000 -> UI`.

For kubectl v1.10 and later:
```
$ kubectl -n kubeflow port-forward svc/vizier-core 6789:6789 &
$ kubectl -n kubeflow port-forward svc/katib-ui 8000:80 &
```

For kubectl v1.9 and earlier:

```
$ kubectl -n kubeflow port-forward $(kubectl -n kubeflow get pod -o=name | grep vizier-core | sed -e "s@pods\/@@") 6789:6789 &
$ kubectl -n kubeflow port-forward $(kubectl -n kubeflow get pod -o=name | grep katib-ui | sed -e "s@pods\/@@") 8000:80 &
```

## Create Study
### Random Suggestion Demo
```
$ kubectl apply -f random-example.yaml
```
This single command starts a study, generates hyper-parameters, and saves the results.
The configuration for the study (hyper-parameter feasible space, optimization parameter, optimization goal, suggestion algorithm, and so on) is defined in `random-example.yaml`.
In this demo, hyper-parameters are embedded as args.
You can embed them in another way (e.g. as environment variables) by using a template.
The template is defined in `WorkerSpec.GoTemplate.RawTemplate`.
It is written in [go template](https://golang.org/pkg/text/template/) format.

In this demo, three hyper-parameters are randomly generated:
* Learning Rate (--lr) - type: double
* Number of NN Layers (--num-layers) - type: int
* Optimizer (--optimizer) - type: categorical
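
To make the random generation concrete, here is a minimal Python sketch that draws one set of arguments per trial. It is not Katib's suggestion service: `FEASIBLE_SPACE` and `sample_trial` are hypothetical names, uniform sampling is an assumption, and the ranges are copied from the `random-example` spec (0.01 to 0.03 for `--lr`, 2 to 3 for `--num-layers`, and three optimizers).

```python
import random

# Feasible space copied from the random-example spec; the dict layout is hypothetical.
FEASIBLE_SPACE = {
    "--lr": {"type": "double", "min": 0.01, "max": 0.03},
    "--num-layers": {"type": "int", "min": 2, "max": 3},
    "--optimizer": {"type": "categorical", "list": ["sgd", "adam", "ftrl"]},
}


def sample_trial(space, rng):
    """Draw one value per hyper-parameter and format it as a command-line argument."""
    args = []
    for name, cfg in space.items():
        if cfg["type"] == "double":
            value = rng.uniform(cfg["min"], cfg["max"])
        elif cfg["type"] == "int":
            value = rng.randint(cfg["min"], cfg["max"])  # both bounds inclusive
        else:
            value = rng.choice(cfg["list"])
        args.append(f"{name}={value}")
    return args


rng = random.Random(0)  # fixed seed so the sketch is repeatable
for _ in range(3):      # the demo requests 3 suggestions
    print(" ".join(sample_trial(FEASIBLE_SPACE, rng)))
```

Each printed line has the same shape as the extra arguments appended to the worker command, for example `--lr=... --num-layers=... --optimizer=...`.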

```
$ kubectl -n kubeflow get studyjob
NAME             AGE
random-example   2m
```

Check the study status.

```
$ kubectl -n kubeflow describe studyjobs random-example
Name:         random-example
Namespace:    kubeflow
Labels:       controller-tools.k8s.io=1.0
Annotations:  kubectl.kubernetes.io/last-applied-configuration={"apiVersion":"kubeflow.org/v1alpha1","kind":"StudyJob","metadata":{"annotations":{},"labels":{"controller-tools.k8s.io":"1.0"},"name":"random-example"...
API Version:  kubeflow.org/v1alpha1
Kind:         StudyJob
Metadata:
  Cluster Name:
  Creation Timestamp:  2018-08-15T01:29:13Z
  Generation:          0
  Resource Version:    173289
  Self Link:           /apis/kubeflow.org/v1alpha1/namespaces/kubeflow/studyjobs/random-example
  UID:                 9e136400-a02a-11e8-b88c-42010af0008b
Spec:
  Study Spec:
    Metricsnames:
      accuracy
    Name:                random-example
    Objectivevaluename:  Validation-accuracy
    Optimizationgoal:    0.98
    Optimizationtype:    maximize
    Owner:               crd
    Parameterconfigs:
      Feasible:
        Max:          0.03
        Min:          0.01
      Name:           --lr
      Parametertype:  double
      Feasible:
        Max:          3
        Min:          2
      Name:           --num-layers
      Parametertype:  int
      Feasible:
        List:
          sgd
          adam
          ftrl
      Name:           --optimizer
      Parametertype:  categorical
  Suggestion Spec:
    Request Number:         3
    Suggestion Algorithm:   random
    Suggestion Parameters:  <nil>
  Worker Spec:
    Command:
      python
      /mxnet/example/image-classification/train_mnist.py
      --batch-size=64
    Image:        katib/mxnet-mnist-example
    Worker Type:  Default
Status:
  Best Objective Value:         <nil>
  Conditon:                     Running
  Early Stopping Parameter Id:
  Studyid:                      qb397cc06d1f8302
  Suggestion Parameter Id:
  Trials:
    Trialid:  p18ee16163b85678
    Workeridlist:
      Objective Value: <nil>
      Conditon:        Running
      Workerid:        td08f74b9939350d
    Trialid:           pb1be3dbe53a5cb0
    Workeridlist:
      Objective Value: <nil>
      Conditon:        Running
      Workerid:        p2b23e25cce4092c
    Trialid:           m64209fe0867e91a
    Workeridlist:
      Objective Value: <nil>
      Conditon:        Running
      Workerid:        q6258c1ac98a00a5
Events:                <none>
```

When the study's status (the `Conditon` field under `Status` in the output above) becomes `Completed`, the study is finished.
You can view the results at `http://127.0.0.1:8000/katib`.

### Use ConfigMap for Worker Template
In the Random example, the worker template is defined in the StudyJob manifest.
A ConfigMap can also be used for the worker template.
Let's use [this](./workerConfigMap.yaml) template.
```
kubectl apply -f workerConfigMap.yaml
```
This template is shared among the following three demos (Grid, Hyperband, and GPU).

### Grid Demo
This is almost the same as the random suggestion demo.

In this demo, Katib makes 4 grid points for the learning rate (--lr), with Min 0.03 and Max 0.07.
```
kubectl apply -f grid-example.yaml
```
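
As a rough illustration of what 4 grid points between 0.03 and 0.07 could look like, the Python sketch below spaces them evenly and includes both endpoints. Whether Katib's grid suggestion places the points exactly this way is an assumption, and `grid_points` is a hypothetical helper.

```python
def grid_points(minimum, maximum, count):
    """Return `count` evenly spaced values from minimum to maximum, inclusive."""
    if count < 2:
        return [minimum]
    step = (maximum - minimum) / (count - 1)
    return [minimum + i * step for i in range(count)]


# Candidate values for --lr, rounded for display.
print([round(lr, 4) for lr in grid_points(0.03, 0.07, 4)])
```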

### Hyperband Demo
In this demo, the eta is 3 and the R is 9.
```
kubectl apply -f hyperband-example.yaml
```
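
To see what eta = 3 and R = 9 imply, the Python sketch below computes the bracket schedule as described in the original Hyperband paper. It is not taken from Katib's code: `hyperband_schedule` is a hypothetical helper, and Katib's suggestion service may round or order things differently.

```python
def hyperband_schedule(max_resource, eta):
    """Map each bracket s to its rounds of (num_configs, resource_per_config)."""
    # s_max = floor(log_eta(max_resource)), found with integers to avoid float error.
    s_max = 0
    while eta ** (s_max + 1) <= max_resource:
        s_max += 1

    schedule = {}
    for s in range(s_max, -1, -1):
        # Initial number of configurations: ceil((s_max + 1) * eta^s / (s + 1)).
        numerator = (s_max + 1) * eta ** s
        n = -(-numerator // (s + 1))
        # Initial resource per configuration: R / eta^s.
        r = max_resource / eta ** s
        # Each round keeps about 1/eta of the configs and gives them eta times the resource.
        schedule[s] = [(n // eta ** i, r * eta ** i) for i in range(s + 1)]
    return schedule


for bracket, rounds in hyperband_schedule(max_resource=9, eta=3).items():
    print(f"bracket s={bracket}: {rounds}")
```

Under these assumptions, R = 9 and eta = 3 give three brackets that start with 9, 5, and 3 configurations respectively, and every bracket ends with a round that uses the full resource of 9.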

## UI
You can check your study results with the Web UI.
Access `http://127.0.0.1:8000/katib`.
The results are saved automatically.

### Using GPU demo
You can set any configuration for your worker pods.
Here, we set the configuration for GPUs.
The manifests of the worker pods are generated from a template.
The templates are defined in a [ConfigMap](./workerConfigMap.yaml).
There are two templates, `defaultWorkerTemplate.yaml` and `gpuWorkerTemplate.yaml`.
You can add your own worker template.
Then you should specify the template in your studyjob spec.
[This](/examples/gpu-example.yaml) is an example that uses `gpuWorkerTemplate.yaml`.
Set `/worker-template/gpuWorkerTemplate.yaml` in the `workerTemplatePath` field and specify the number of GPUs in `workerParameters/Gpu`.
You can apply it in the same way as the other examples.
```
$ kubectl apply -f gpu-example.yaml
$ kubectl -n kubeflow get studyjob

NAME             AGE
gpu-example      1m
random-example   17m

$ kubectl -n kubeflow describe studyjob gpu-example

Name:         gpu-example
Namespace:    kubeflow
Labels:       controller-tools.k8s.io=1.0
Annotations:  kubectl.kubernetes.io/last-applied-configuration={"apiVersion":"kubeflow.org/v1alpha1","kind":"StudyJob","metadata":{"annotations":{},"labels":{"controller-tools.k8s.io":"1.0"},"name":"gpu-example","n...
API Version:  kubeflow.org/v1alpha1
Kind:         StudyJob
Metadata:
  Cluster Name:
  Creation Timestamp:  2018-08-15T01:48:12Z
  Generation:          0
  Resource Version:    175002
  Self Link:           /apis/kubeflow.org/v1alpha1/namespaces/kubeflow/studyjobs/gpu-example
  UID:                 44afac4c-a02d-11e8-b88c-42010af0008b
Spec:
  Study Spec:
    Metricsnames:
      accuracy
    Name:                gpu-example

	...

  Worker Spec:
    Command:
      python
      /mxnet/example/image-classification/train_mnist.py
      --batch-size=64
    Image:  katib/mxnet-mnist-example
    Worker Parameters:
      Gpu:                 1
    Worker Template Path:  /worker-template/gpuWorkerTemplate.yaml
    Worker Type:           Default
Status:
  Best Objective Value:         <nil>
  Conditon:                     Running
  Early Stopping Parameter Id:
  Studyid:                      k549e927046f2136
  Suggestion Parameter Id:
  Trials:
    Trialid:  t721857cd426b68b
    Workeridlist:
      Objective Value: <nil>
      Conditon:        Running
      Workerid:        g07cba174ada521e
    Trialid:           f27c0ac1c6664533
    Workeridlist:
      Objective Value: <nil>
      Conditon:        Running
      Workerid:        h8d5062f2f1b8633
    Trialid:           v129109d1331a98e
    Workeridlist:
      Objective Value: <nil>
      Conditon:        Running
      Workerid:        x8f172a64645690e
```

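Check that the GPU configuration is applied correctly.
In the output below, the worker pod requests one `nvidia.com/gpu`; it stays `Pending` with `FailedScheduling` because no node in this cluster has a GPU available.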

```
$ kubectl -n kubeflow describe pod g07cba174ada521e-88wpn
Name:           g07cba174ada521e-88wpn
Namespace:      kubeflow
Node:           <none>
Labels:         controller-uid=44bfb99f-a02d-11e8-b88c-42010af0008b
                job-name=g07cba174ada521e
Annotations:    <none>
Status:         Pending
IP:
Controlled By:  Job/g07cba174ada521e
Containers:
  g07cba174ada521e:
    Image:  katib/mxnet-mnist-example
    Port:   <none>
    Command:
      python
      /mxnet/example/image-classification/train_mnist.py
      --batch-size=64
      --lr=0.0175
      --num-layers=2
      --optimizer=adam
    Limits:
      nvidia.com/gpu:  1
    Requests:
      nvidia.com/gpu:  1
    Environment:       <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-knffp (ro)
Conditions:
  Type           Status
  PodScheduled   False
Volumes:
  default-token-knffp:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-knffp
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
                 nvidia.com/gpu:NoSchedule
Events:
  Type     Reason            Age               From               Message
  ----     ------            ----              ----               -------
  Warning  FailedScheduling  6s (x21 over 4m)  default-scheduler  0/3 nodes are available: 3 Insufficient nvidia.com/gpu.
```

## Metrics Collection

### Design of Metrics Collector
![metricscollectordesign](https://user-images.githubusercontent.com/10014831/47256754-e32cb480-d4bf-11e8-98e9-4bbec562ad75.png)

### Default Metrics Collector

The default metrics collector collects metrics from the StdOut of workers.
It is deployed as a cronjob and collects and reports metrics periodically.
It collects metrics through the k8s pod log API.
Your training code should print logs in `{metrics name}={value}` style.
In the above demo, the objective value name is `Validation-accuracy` and the metric is `accuracy`, so your training code should print logs like this.
```
epoch 1:
batch1 accuracy=0.3
batch2 accuracy=0.5

Validation-accuracy=0.4

epoch 2:
batch1 accuracy=0.7
batch2 accuracy=0.8

Validation-accuracy=0.75
```
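
The Python sketch below shows one way such `{metrics name}={value}` lines could be picked out of a log. It is only an illustration of the log format, not Katib's metrics collector: `METRIC_LINE` and `parse_metrics` are hypothetical names, and the regular expression is an assumption about which characters a metric name may contain.

```python
import re

# A metric name (letters, digits, "_" or "-") followed by "=" and a decimal number.
METRIC_LINE = re.compile(r"(?P<name>[\w-]+)=(?P<value>-?\d+(?:\.\d+)?)")


def parse_metrics(log_text, metric_names):
    """Collect every logged value for the requested metric names, in log order."""
    collected = {name: [] for name in metric_names}
    for line in log_text.splitlines():
        for match in METRIC_LINE.finditer(line):
            name = match.group("name")
            if name in collected:
                collected[name].append(float(match.group("value")))
    return collected


sample_log = """epoch 1:
batch1 accuracy=0.3
batch2 accuracy=0.5

Validation-accuracy=0.4
"""
print(parse_metrics(sample_log, ["accuracy", "Validation-accuracy"]))
```

Note that `Validation-accuracy=0.4` is matched as a whole name, so it is not counted as a value of `accuracy`.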
The metrics collector collects every logged metric line.
The manifest of the metrics collector is also generated from a template, which is defined [here](/manifests/studyjobcontroller/metricsControllerConfigMap.yaml).
You can add your own template and specify it in `spec.metricsCollectorSpec.metricsCollectorTemplatePath` in a studyjob manifest.

### TF Event File Metrics Collector

The TF Event file metrics collector collects metrics from tf.event files.
It is also deployed as a cronjob.
When you use the TF Event File Metrics Collector, you need to share files between the metrics collector and the worker with a PVC.
There is an example for the TF Event file metrics collector.
First, create a PV and PVC for sharing the event files.
```
$ kubectl apply -f tfevent-volume/
```
Then, create a studyjob that uses the TF Event file metrics collector.
```
$ kubectl apply -f tf-event_test.yaml
```

It creates a TensorFlow worker and collects metrics from its event files.
The TensorFlow code is [the official tutorial for mnist with summary](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/tutorials/mnist/mnist_with_summaries.py).
It saves event files to the `/log/train` and `/log/test` directories.
Both directories contain metrics with the same names ('accuracy' and 'cross_entropy').
The training accuracy is saved in the train directory and the test accuracy in the test directory.
In a studyjob, add the directory name to the metric name as a prefix, e.g. `train/accuracy`, `test/accuracy`.
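
As a small illustration of this naming rule, the Python sketch below builds the prefixed names for the two directories and two metrics mentioned above. `prefixed_metric_names` is a hypothetical helper, not part of Katib.

```python
def prefixed_metric_names(directories, metric_names):
    """Build '<directory>/<metric>' names, e.g. 'train/accuracy'."""
    return [f"{directory}/{metric}" for directory in directories for metric in metric_names]


print(prefixed_metric_names(["train", "test"], ["accuracy", "cross_entropy"]))
```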

## Model Management

You can export model data to a YAML file with the CLI.

```
katib-cli -s {{server-cli}} pull study {{study ID or name}}  -o {{filename}}
```

You can also push your existing models to Katib with the CLI.
`mnist-models.yaml` contains 22 models trained using random suggestion with this Parameter Config.

```
configs:
    - name: --lr
      parametertype: 1
      feasible:
        max: "0.07"
        min: "0.03"
        list: []
    - name: --lr-factor
      parametertype: 1
      feasible:
        max: "0.05"
        min: "0.005"
        list: []
    - name: --lr-step
      parametertype: 2
      feasible:
        max: "20"
        min: "5"
        list: []
    - name: --optimizer
      parametertype: 4
      feasible:
        max: ""
        min: ""
        list:
        - sgd
        - adam
        - ftrl
```
You can easily explore the models in the Katib UI.

```
katib-cli push md -f mnist-models.yaml
```

## Clean
Clean up with the `./destroy.sh` script.
It stops the port-forward processes and deletes the Minikube cluster.

https://github.com/kubeflow/katib