https://github.com/kubeflow/katib
Tip revision: df0adb51ec0a9061ef76568f0756fe656467402e authored by YujiOshima on 19 November 2018, 05:05:30 UTC
reset x_train in burn-in
reset x_train in burn-in
Tip revision: df0adb5
MinikubeDemo.md
# Simple Minikube Demo
You can deploy katib components and try a simple mnist demo on your laptop!
## Requirement
* VirtualBox
* Minikube
* kubectl
## deploy
Start Katib on Minikube with [deploy.sh](./MinikubeDemo/deploy.sh).
A Minikube cluster and Katib components will be deployed!
You can check them with `kubectl -n katib get pods`.
Don't worry if the `vizier-core` get an error.
It will be recovered after DB will be prepared.
Wait until all components will be Running status.
Then, start port-forward for katib services `6789 -> manager` and `8000 -> UI`.
kubectl v1.10~
```
$ kubectl -n kubeflow port-forward svc/vizier-core 6789:6789 &
$ kubectl -n kubeflow port-forward svc/katib-ui 8000:80 &
```
kubectl ~v1.9
```
& kubectl -n kubeflow port-forward $(kubectl -n kubeflow get pod -o=name | grep vizier-core | sed -e "s@pods\/@@") 6789:6789 &
& kubectl -n kubeflow port-forward $(kubectl -n kubeflow get pod -o=name | grep katib-ui | sed -e "s@pods\/@@") 8000:80 &
```
## Create Study
### Random Suggestion Demo
```
$ kubectl apply -f random-example.yaml
```
Only this command, a study will start, generate hyper-parameters and save the results.
The configurations for the study(hyper-parameter feasible space, optimization parameter, optimization goal, suggestion algorithm, and so on) are defined in `random-example.yaml`,
In this demo, hyper-parameters are embbeded as args.
You can embbed in another way(e.g. eviroment values) by using template.
It defined in `WorkerSpec.GoTemplate.RawTemplate`.
It is written in [go template](https://golang.org/pkg/text/template/) format.
In this demo, 3 hyper parameters
* Learning Rate (--lr) - type: double
* Number of NN Layer (--num-layers) - type: int
* optimizer (--optimizer) - type: categorical
are randomly generated.
```
$ kubectl -n kubeflow get studyjob
NAME AGE
random-example 2m
```
Check the study status.
```
$ kubectl -n kubeflow describe studyjobs random-example
Name: random-example
Namespace: kubeflow
Labels: controller-tools.k8s.io=1.0
Annotations: kubectl.kubernetes.io/last-applied-configuration={"apiVersion":"kubeflow.org/v1alpha1","kind":"StudyJob","metadata":{"annotations":{},"labels":{"controller-tools.k8s.io":"1.0"},"name":"random-example"...
API Version: kubeflow.org/v1alpha1
Kind: StudyJob
Metadata:
Cluster Name:
Creation Timestamp: 2018-08-15T01:29:13Z
Generation: 0
Resource Version: 173289
Self Link: /apis/kubeflow.org/v1alpha1/namespaces/kubeflow/studyjobs/random-example
UID: 9e136400-a02a-11e8-b88c-42010af0008b
Spec:
Study Spec:
Metricsnames:
accuracy
Name: random-example
Objectivevaluename: Validation-accuracy
Optimizationgoal: 0.98
Optimizationtype: maximize
Owner: crd
Parameterconfigs:
Feasible:
Max: 0.03
Min: 0.01
Name: --lr
Parametertype: double
Feasible:
Max: 3
Min: 2
Name: --num-layers
Parametertype: int
Feasible:
List:
sgd
adam
ftrl
Name: --optimizer
Parametertype: categorical
Suggestion Spec:
Request Number: 3
Suggestion Algorithm: random
Suggestion Parameters: <nil>
Worker Spec:
Command:
python
/mxnet/example/image-classification/train_mnist.py
--batch-size=64
Image: katib/mxnet-mnist-example
Worker Type: Default
Status:
Best Objective Value: <nil>
Conditon: Running
Early Stopping Parameter Id:
Studyid: qb397cc06d1f8302
Suggestion Parameter Id:
Trials:
Trialid: p18ee16163b85678
Workeridlist:
Objective Value: <nil>
Conditon: Running
Workerid: td08f74b9939350d
Trialid: pb1be3dbe53a5cb0
Workeridlist:
Objective Value: <nil>
Conditon: Running
Workerid: p2b23e25cce4092c
Trialid: m64209fe0867e91a
Workeridlist:
Objective Value: <nil>
Conditon: Running
Workerid: q6258c1ac98a00a5
Events: <none>
```
When the Spec.Status.State become `Completed`, the study is completed.
You can look the result on `http://127.0.0.1:8000/katib`.
### Use ConfigMap for Worker Template
In Random example, the template for workers is defined in StudyJob manifest.
A ConfigMap is also used for worker template.
Let's use [this](./workerConfigMap.yaml) template.
```
kubectl apply -f workerConfigMap.yaml
```
This template will share among blow three demos(Grid, Hyperband, and GPU).
### Grid Demo
Almost same as random suggestion.
In this demo, Katib will make 4 grids for learning rate (--lr) Min 0.03 and Max 0.07.
```
kubectl apply -f grid-example.yaml
```
### Hyperband Demo
In this demo, the eta is 3 and the R is 9.
```
kubectl apply -f random-example.yaml
```
## UI
You can check your study results with Web UI.
Acsess to `http://127.0.0.1:8000/katib`
The Results will be saved automatically.
### Using GPU demo
You can set any configuration for your worker pods.
Here, try to set config for GPU.
The manifest of the worker pods are generated from a template.
The templates are defined in [ConfigMap](./workerConfigMap.yaml).
There are two templates, defaultWorkerTemplate.yaml and gpuWorkerTemplate.yaml.
You can add your template for worker.
Then you should specify the template in your studyjob spec.
[This](/examples/gpu-example.yaml) is example for using `gpuWorkerTemplate.yaml`.
Set "/worker-template/gpuWorkerTemplate.yaml at `workerTemplatePath` field and specify gpu number at `workerParameters/Gpu`
You can apply it same as other examples.
```
$ kubectl apply -f gpu-example.yaml
$ kubectl -n kubeflow get studyjob
NAME AGE
gpu-example 1m
random-example 17m
$ kubectl -n kubeflow describe studyjob gpu-example
Name: gpu-example
Namespace: kubeflow
Labels: controller-tools.k8s.io=1.0
Annotations: kubectl.kubernetes.io/last-applied-configuration={"apiVersion":"kubeflow.org/v1alpha1","kind":"StudyJob","metadata":{"annotations":{},"labels":{"controller-tools.k8s.io":"1.0"},"name":"gpu-example","n...
API Version: kubeflow.org/v1alpha1
Kind: StudyJob
Metadata:
Cluster Name:
Creation Timestamp: 2018-08-15T01:48:12Z
Generation: 0
Resource Version: 175002
Self Link: /apis/kubeflow.org/v1alpha1/namespaces/kubeflow/studyjobs/gpu-example
UID: 44afac4c-a02d-11e8-b88c-42010af0008b
Spec:
Study Spec:
Metricsnames:
accuracy
Name: gpu-example
...
Worker Spec:
Command:
python
/mxnet/example/image-classification/train_mnist.py
--batch-size=64
Image: katib/mxnet-mnist-example
Worker Parameters:
Gpu: 1
Worker Template Path: /worker-template/gpuWorkerTemplate.yaml
Worker Type: Default
Status:
Best Objective Value: <nil>
Conditon: Running
Early Stopping Parameter Id:
Studyid: k549e927046f2136
Suggestion Parameter Id:
Trials:
Trialid: t721857cd426b68b
Workeridlist:
Objective Value: <nil>
Conditon: Running
Workerid: g07cba174ada521e
Trialid: f27c0ac1c6664533
Workeridlist:
Objective Value: <nil>
Conditon: Running
Workerid: h8d5062f2f1b8633
Trialid: v129109d1331a98e
Workeridlist:
Objective Value: <nil>
Conditon: Running
Workerid: x8f172a64645690e
```
Check the GPU configuration works correctly.
```
$ kubectl -n kubeflow describe pod g07cba174ada521e-88wpn
Name: g07cba174ada521e-88wpn
Namespace: kubeflow
Node: <none>
Labels: controller-uid=44bfb99f-a02d-11e8-b88c-42010af0008b
job-name=g07cba174ada521e
Annotations: <none>
Status: Pending
IP:
Controlled By: Job/g07cba174ada521e
Containers:
g07cba174ada521e:
Image: katib/mxnet-mnist-example
Port: <none>
Command:
python
/mxnet/example/image-classification/train_mnist.py
--batch-size=64
--lr=0.0175
--num-layers=2
--optimizer=adam
Limits:
nvidia.com/gpu: 1
Requests:
nvidia.com/gpu: 1
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-knffp (ro)
Conditions:
Type Status
PodScheduled False
Volumes:
default-token-knffp:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-knffp
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
nvidia.com/gpu:NoSchedule
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 6s (x21 over 4m) default-scheduler 0/3 nodes are available: 3 Insufficient nvidia.com/gpu.
```
## Metrics Collection
### Design of Metrics Collector
![metricscollectordesign](https://user-images.githubusercontent.com/10014831/47256754-e32cb480-d4bf-11e8-98e9-4bbec562ad75.png)
### Default Metrics Collector
The default metrics will be collect from the StdOut of workers.
It is deploy as a cronjob. It will collect and report metrics periodically.
It collect metrics through k8s pod log API.
You should print logs {metrics name}={value} style.
In the above demo, the objective value name is Validation-accuracy and the metrics are accuracy, your training code should print like this.
```
epoch 1:
batch1 accuracy=0.3
batch2 accuracy=0.5
Validation-accuracy=0.4
epoch 2:
batch1 accuracy=0.7
batch2 accuracy=0.8
Validation-accuracy=0.75
```
The metrics collector will collect all logs of metrics.
The manifest of metrics collector is also generated from template and defined [here](/manifests/studyjobcontroller/metricsControllerConfigMap.yaml).
You can add your template and specify `spec.metricsCollectorSpec.metricsCollectorTemplatePath` in a studyjob manifest.
### TF Event File Metrics Collector
The TF Event file metrics collector will collect metrics from tf.event files.
It is also deploy as a cronjob.
When you use TF Event File Metrics Collector, you need to share files between a metrics collector and worker with PVC.
There is an example for TF Event file metrics collector.
First, please create PV and PVC for share event file.
```
$ kubectl apply -f tfevent-volume/
```
Then, create studyjob that use TF Event file metrics collector.
```
$ kubectl apply -f tf-event_test.yaml
```
It will create tensorflow worker and collect metrics from its eventfile.
The code of tensorflow is [the official tutorial for mnist with summary](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/tutorials/mnist/mnist_with_summaries.py).
It will save event file to `/log/train` and `/log/test` directory.
They have same named metrics ('accuracy' and 'cross_entropy').
The accuracy in training will be save in train directory and test is in test directory.
In a studyjob, please add directry name to the name of metrics as a prefix e.g. `train/accuracy`, `test/accuracy`.
## ModelManagement
You can export model data to yaml file with CLI.
```
katib-cli -s {{server-cli}} pull study {{study ID or name}} -o {{filename}}
```
And you can push your existing models to Katib with CLI.
`mnist-models.yaml` is traind 22 models using random suggestion with this Parameter Config.
```
configs:
- name: --lr
parametertype: 1
feasible:
max: "0.07"
min: "0.03"
list: []
- name: --lr-factor
parametertype: 1
feasible:
max: "0.05"
min: "0.005"
list: []
- name: --lr-step
parametertype: 2
feasible:
max: "20"
min: "5"
list: []
- name: --optimizer
parametertype: 4
feasible:
max: ""
min: ""
list:
- sgd
- adam
- ftrl
```
You can easy to explore the model on KatibUI.
```
katib-cli push md -f mnist-models.yaml
```
## Clean
Clean up with `./destroy.sh` script.
It will stop port-forward process and delete minikube cluster.