Revision c8cb2cbbe3aaaebb257531ae4d4b9f3b4b48d694 authored by Ce Gao on 06 January 2020, 04:17:41 UTC, committed by Kubernetes Prow Robot on 06 January 2020, 04:17:41 UTC

<h1 align="center">
    <img src="./docs/images/Katib_Logo.png" alt="logo" width="200">
</h1>


Katib is a Kubernetes Native System for [Hyperparameter Tuning][1] and [Neural Architecture Search][2].
The system is inspired by [Google vizier][3] and supports multiple ML/DL frameworks (e.g. TensorFlow, MXNet, and PyTorch).

<!-- START doctoc generated TOC please keep comment here to allow auto update -->
**Table of Contents**  *generated with DocToc*

- [Name](#name)
- [Concepts in Katib](#concepts-in-katib)
  - [Experiment](#experiment)
  - [Suggestion](#suggestion)
  - [Trial](#trial)
  - [Worker Job](#worker-job)
- [Components in Katib](#components-in-katib)
- [Getting Started](#getting-started)
- [Web UI](#web-ui)
- [API Documentation](#api-documentation)
- [Installation](#installation)
  - [TF operator](#tf-operator)
  - [Pytorch operator](#pytorch-operator)
  - [Katib](#katib)
  - [Running examples](#running-examples)
  - [Cleanups](#cleanups)
- [CONTRIBUTING](#contributing)

<!-- END doctoc generated TOC please keep comment here to allow auto update -->

## Name

Katib means `secretary` in Arabic. `Vizier` means a high official or a prime minister in Arabic, so this project is named Katib in honor of Vizier.

## Concepts in Katib

Katib has four core concepts: Experiment, Suggestion, Trial, and Worker Job.

### Experiment

`Experiment` represents a single optimization run over a feasible space.
Each `Experiment` contains a configuration with three parts:

1. Objective: What we are trying to optimize.
2. Search Space: Constraints for configurations describing the feasible space.
3. Search Algorithm: How to find the optimal configurations.

`Experiment` is defined as a CRD. Refer to [here](docs/ for how to customize an `Experiment`.
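As a rough illustration of the three parts of an `Experiment` configuration, the Python sketch below builds an `Experiment`-like manifest as a plain dictionary. The values mirror the `tfjob-example` output shown later in this README. The `apiVersion` string and the camelCase field names are assumptions inferred from that output and from the `v1alpha3` API path, not a verified schema, and `build_experiment` is a hypothetical helper.

```python
import json


def build_experiment(name, namespace="kubeflow"):
    """Return an Experiment-like manifest as a dict (illustrative only)."""
    return {
        # Assumed group/version; check the CRD installed in your cluster.
        "apiVersion": "kubeflow.org/v1alpha3",
        "kind": "Experiment",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # 1. Objective: what we are trying to optimize.
            "objective": {
                "type": "maximize",
                "goal": 0.99,
                "objectiveMetricName": "accuracy_1",
            },
            # 2. Search space: constraints describing the feasible space.
            #    Bounds are written as strings here, which is an assumption.
            "parameters": [
                {
                    "name": "--learning_rate",
                    "parameterType": "double",
                    "feasibleSpace": {"min": "0.01", "max": "0.05"},
                },
                {
                    "name": "--batch_size",
                    "parameterType": "int",
                    "feasibleSpace": {"min": "100", "max": "200"},
                },
            ],
            # 3. Search algorithm: how to find the optimal configurations.
            "algorithm": {"algorithmName": "random"},
            # Budget settings taken from the example output.
            "parallelTrialCount": 3,
            "maxTrialCount": 12,
            "maxFailedTrialCount": 3,
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_experiment("tfjob-example"), indent=2))
```

The trial template is left out on purpose; a real `Experiment` also needs one so that Katib knows which worker job to launch.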

### Suggestion

A `Suggestion` is a proposed solution to the optimization problem: one set of hyperparameter values, or a list of parameter assignments. A `Trial` is then created to evaluate the parameter assignments.

`Suggestion` is defined as a CRD.
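To make the idea of "a list of parameter assignments" concrete, here is a minimal sketch of how a random-search suggestion could be drawn from a search space. It is not Katib's suggestion service; `suggest_random` and the dictionary layout of `SEARCH_SPACE` are hypothetical, and the bounds are copied from the `tfjob-example` output later in this README.

```python
import random


def suggest_random(parameters, rng=None):
    """Draw one value per parameter, uniformly within its feasible space."""
    if rng is None:
        rng = random.Random()
    assignments = []
    for param in parameters:
        low, high = param["min"], param["max"]
        if param["type"] == "int":
            # randint includes both endpoints.
            value = rng.randint(int(low), int(high))
        elif param["type"] == "double":
            value = rng.uniform(float(low), float(high))
        else:
            raise ValueError("unsupported parameter type: " + param["type"])
        assignments.append({"name": param["name"], "value": value})
    return assignments


# Hypothetical in-memory description of the search space.
SEARCH_SPACE = [
    {"name": "--learning_rate", "type": "double", "min": 0.01, "max": 0.05},
    {"name": "--batch_size", "type": "int", "min": 100, "max": 200},
]

if __name__ == "__main__":
    # A seeded generator makes the draw reproducible.
    print(suggest_random(SEARCH_SPACE, random.Random(0)))
```

Each call returns one suggestion; a tuner would request as many suggestions as it has trials to run in parallel.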

### Trial

A `Trial` is one iteration of the optimization process, which is one `worker job` instance with a list of parameter assignments (corresponding to a suggestion).

`Trial` is defined as a CRD.
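The link between a suggestion and a worker job can be sketched as follows: each parameter assignment becomes a `name=value` command-line argument, which is what the `"{{.Name}}={{.Value}}"` line in the example trial template later in this README expresses. The helper `build_worker_args` and the script name `train.py` are hypothetical and only stand in for the real template rendering.

```python
def build_worker_args(base_command, assignments):
    """Append one "name=value" argument per parameter assignment."""
    args = list(base_command)  # copy so the caller's list is not modified
    for assignment in assignments:
        args.append("{}={}".format(assignment["name"], assignment["value"]))
    return args


if __name__ == "__main__":
    command = ["python", "train.py", "--log_dir=/train/metrics"]
    suggestion = [
        {"name": "--learning_rate", "value": 0.02},
        {"name": "--batch_size", "value": 109},
    ]
    print(build_worker_args(command, suggestion))
    # ['python', 'train.py', '--log_dir=/train/metrics',
    #  '--learning_rate=0.02', '--batch_size=109']
```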

### Worker Job 

A `Worker Job` refers to a process responsible for evaluating a `Trial` and calculating its objective value. 

The worker kind can be a Kubernetes Job, which is a non-distributed execution, or a Kubeflow TFJob or Kubeflow PyTorchJob, which are distributed executions.
Thus, Katib supports multiple frameworks with the help of different job kinds.

Currently Katib supports the following exploration algorithms:

* random search
* grid search
* hyperband
* bayesian optimization
* NAS based on reinforcement learning
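For contrast with random search, grid search enumerates every combination of a fixed set of candidate values. The sketch below shows the general technique with `itertools.product`; it is not Katib's grid implementation, and `grid_search` and the candidate values are hypothetical.

```python
import itertools


def grid_search(grid):
    """Yield one list of parameter assignments per point on the grid."""
    names = list(grid)
    # product() iterates over the Cartesian product of the candidate lists.
    for values in itertools.product(*(grid[name] for name in names)):
        yield [{"name": n, "value": v} for n, v in zip(names, values)]


if __name__ == "__main__":
    candidates = {
        "--learning_rate": [0.01, 0.03, 0.05],
        "--batch_size": [100, 200],
    }
    for assignments in grid_search(candidates):
        print(assignments)
```

The number of trials grows multiplicatively with each added parameter (3 x 2 = 6 here), which is one reason to prefer random search or Bayesian optimization for larger spaces.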

## Components in Katib

Katib consists of several components, as shown below. Each component runs on Kubernetes as a deployment.
The components communicate with each other via gRPC, and the API is defined at `pkg/apis/manager/v1alpha3/api.proto`.

- katib: main components.
  - katib-manager: gRPC API server of Katib, which is the DB interface.
  - katib-db: Data storage backend of Katib.
  - katib-ui: User interface of Katib.
  - katib-controller: Controller for Katib CRDs in Kubernetes.

## Getting Started

Please see [here](./docs/ for more details.

## Web UI

Katib provides a Web UI.
You can visualize the general trend of the hyperparameter space and the history of each training run. You can use
random-example or
other examples to generate a similar UI.

## API Documentation

Please refer to [](./pkg/apis/manager/v1alpha3/gen-doc/

## Installation

For a standard installation of Katib with support for all job operators, refer to the Kubeflow Official Docs and skip this section. If you want to install Katib manually, follow these steps:

    git clone

Set `MANIFESTS_DIR` to the cloned folder.


### TF operator

To install the TFJob operator, run the following:

    cd "${MANIFESTS_DIR}/tf-training/tf-job-crds/base"
    kustomize build . | kubectl apply -f -
    cd "${MANIFESTS_DIR}/tf-training/tf-job-operator/base"
    kustomize build . | kubectl apply -n kubeflow -f -


### Pytorch operator

To install the PyTorch operator, run the following:

    cd "${MANIFESTS_DIR}/pytorch-job/pytorch-job-crds/base"
    kustomize build . | kubectl apply -f -
    cd "${MANIFESTS_DIR}/pytorch-job/pytorch-operator/base/"
    kustomize build . | kubectl apply -n kubeflow -f -

### Katib

Finally, you can install Katib:

    cd "${MANIFESTS_DIR}/katib/katib-crds/base"
    kustomize build . | kubectl apply -f -
    cd "${MANIFESTS_DIR}/katib/katib-controller/base"
    kustomize build . | kubectl apply -f -


If you want to use Katib in a cluster that doesn't have a StorageClass for dynamic volume provisioning, you have to create a persistent volume manually to bind to your persistent volume claim.

This is a sample YAML file for creating a persistent volume:

    apiVersion: v1
    kind: PersistentVolume
    metadata:
      name: katib-mysql
      labels:
        type: local
        app: katib
    spec:
      capacity:
        storage: 10Gi
      accessModes:
        - ReadWriteOnce
      hostPath:
        path: /data/katib

Create this persistent volume after deploying the Katib package.

### Running examples

After deploying everything, you can run the examples to verify the installation.

This is an example for the TFJob operator:

    kubectl create -f

This is an example for the PyTorch operator:

    kubectl create -f

You can check the status of the experiment:

    $ kubectl describe experiment tfjob-example -n kubeflow

    Name:         tfjob-example
    Namespace:    kubeflow
    Labels:       <none>
    Annotations:  <none>
    API Version:
    Kind:         Experiment
      Creation Timestamp:  2019-10-06T12:25:44Z
      Generation:          1
      Resource Version:    2110410
      Self Link:           /apis/
      UID:                 6b2bef2d-e834-11e9-93ee-42010aa00075
        Algorithm Name:        random
      Max Failed Trial Count:  3
      Max Trial Count:         12
      Metrics Collector Spec:
          Kind:  TensorFlowEvent
          File System Path:
            Kind:  Directory
            Path:  /train
        Goal:                   0.99
        Objective Metric Name:  accuracy_1
        Type:                   maximize
      Parallel Trial Count:     3
        Feasible Space:
          Max:           0.05
          Min:           0.01
        Name:            --learning_rate
        Parameter Type:  double
        Feasible Space:
          Max:           200
          Min:           100
        Name:            --batch_size
        Parameter Type:  int
      Trial Template:
        Go Template:
          Raw Template:  apiVersion: ""
    kind: TFJob
      name: {{.Trial}}
      namespace: {{.NameSpace}}
        replicas: 1
        restartPolicy: OnFailure
              - name: tensorflow
                imagePullPolicy: Always
                  - "python"
                  - "/var/tf_mnist/"
                  - "--log_dir=/train/metrics"
                  {{- with .HyperParameters}}
                  {{- range .}}
                  - "{{.Name}}={{.Value}}"
                  {{- end}}
                  {{- end}}
      Completion Time:  2019-10-06T12:28:50Z
        Last Transition Time:  2019-10-06T12:25:44Z
        Last Update Time:      2019-10-06T12:25:44Z
        Message:               Experiment is created
        Reason:                ExperimentCreated
        Status:                True
        Type:                  Created
        Last Transition Time:  2019-10-06T12:28:50Z
        Last Update Time:      2019-10-06T12:28:50Z
        Message:               Experiment is running
        Reason:                ExperimentRunning
        Status:                False
        Type:                  Running
        Last Transition Time:  2019-10-06T12:28:50Z
        Last Update Time:      2019-10-06T12:28:50Z
        Message:               Experiment has succeeded because Objective goal has reached
        Reason:                ExperimentSucceeded
        Status:                True
        Type:                  Succeeded
      Current Optimal Trial:
            Name:   accuracy_1
            Value:  1
        Parameter Assignments:
          Name:          --learning_rate
          Value:         0.018532845700535087
          Name:          --batch_size
          Value:         109
      Start Time:        2019-10-06T12:25:44Z
      Trials:            4
      Trials Running:    2
      Trials Succeeded:  2
    Events:              <none>

When the `spec.Status.Condition` becomes `Succeeded`, the experiment is finished.
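If you want to check for completion in a script rather than by eye, the test boils down to finding a condition of type `Succeeded` whose status is `True`. The sketch below assumes the conditions are already available as a list of dictionaries with `type` and `status` keys (for example, parsed from the experiment's JSON representation); that layout and the helper `experiment_succeeded` are assumptions, not part of Katib.

```python
def experiment_succeeded(conditions):
    """Return True if any condition is Succeeded with status "True"."""
    return any(
        cond.get("type") == "Succeeded" and cond.get("status") == "True"
        for cond in conditions
    )


if __name__ == "__main__":
    # Condition types and statuses copied from the example output above.
    conditions = [
        {"type": "Created", "status": "True"},
        {"type": "Running", "status": "False"},
        {"type": "Succeeded", "status": "True"},
    ]
    print(experiment_succeeded(conditions))  # True
```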

You can monitor your results in the Katib UI.
Access the Katib UI via the Kubeflow dashboard if you used the standard installation, or port-forward the `katib-ui` service if you installed Katib manually:

    kubectl -n kubeflow port-forward svc/katib-ui 8080:80

You can access the Katib UI using this URL: `http://localhost:8080/katib/`.

### Cleanups

Delete installed components using `kubectl delete -f` on the respective folders. 

## CONTRIBUTING

Please feel free to test the system! [](./docs/ is a good starting point for developers.
