https://github.com/kubeflow/katib
Tip revision: 0a1cb313642a4fe6bdd72fd6031fca8cc0b43d6c authored by Orfeas Kourkakis on 23 November 2022, 13:48:43 UTC
[bugfix] Fix value passing bug in New Experiment form (#2027)
# How Katib v1beta1 tunes hyperparameters automatically in a Kubernetes native way

Follow the Kubeflow documentation guides:

- [Concepts](https://www.kubeflow.org/docs/components/katib/overview/)
  in Katib, hyperparameter tuning, and neural architecture search.
- [Getting started with Katib](https://kubeflow.org/docs/components/katib/hyperparameter/).
- Detailed guide to
  [configuring and running a Katib `Experiment`](https://kubeflow.org/docs/components/katib/experiment/).

## Example and Illustration

After installing Katib v1beta1, you can try your first Katib `Experiment`:

```
kubectl apply -f https://raw.githubusercontent.com/kubeflow/katib/master/examples/v1beta1/hp-tuning/random.yaml
```

### Experiment

When you want to tune hyperparameters for your machine learning model before
training it further, you only need to create an `Experiment` CR. To
learn which fields `Experiment.spec` includes, follow
the detailed guide to
[configuring and running a Katib `Experiment`](https://kubeflow.org/docs/components/katib/experiment/).
After the `Experiment` is created, you can retrieve it as shown below.
The Katib concepts in the rest of this document are introduced using this example.

```yaml
$ kubectl get experiment random -n kubeflow -o yaml

apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  ...
  name: random
  namespace: kubeflow
  ...
spec:
  algorithm:
    algorithmName: random
  maxFailedTrialCount: 3
  maxTrialCount: 12
  metricsCollectorSpec:
    collector:
      kind: StdOut
  objective:
    additionalMetricNames:
    - Train-accuracy
    goal: 0.99
    metricStrategies:
    - name: Validation-accuracy
      value: max
    - name: Train-accuracy
      value: max
    objectiveMetricName: Validation-accuracy
    type: maximize
  parallelTrialCount: 3
  parameters:
  - feasibleSpace:
      max: "0.03"
      min: "0.01"
    name: lr
    parameterType: double
  - feasibleSpace:
      max: "5"
      min: "2"
    name: num-layers
    parameterType: int
  - feasibleSpace:
      list:
      - sgd
      - adam
      - ftrl
    name: optimizer
    parameterType: categorical
  resumePolicy: LongRunning
  trialTemplate:
    failureCondition: status.conditions.#(type=="Failed")#|#(status=="True")#
    primaryContainerName: training-container
    successCondition: status.conditions.#(type=="Complete")#|#(status=="True")#
    trialParameters:
    - description: Learning rate for the training model
      name: learningRate
      reference: lr
    - description: Number of training model layers
      name: numberLayers
      reference: num-layers
    - description: Training model optimizer (sgd, adam or ftrl)
      name: optimizer
      reference: optimizer
    trialSpec:
      apiVersion: batch/v1
      kind: Job
      spec:
        template:
          spec:
            containers:
            - command:
              - python3
              - /opt/mxnet-mnist/mnist.py
              - --batch-size=64
              - --lr=${trialParameters.learningRate}
              - --num-layers=${trialParameters.numberLayers}
              - --optimizer=${trialParameters.optimizer}
              image: docker.io/kubeflowkatib/mxnet-mnist:v1beta1-45c5727
              name: training-container
            restartPolicy: Never
status:
  completionTime: "2021-10-01T21:47:35Z"
  conditions:
  - lastTransitionTime: "2021-10-01T21:27:46Z"
    lastUpdateTime: "2021-10-01T21:27:46Z"
    message: Experiment is created
    reason: ExperimentCreated
    status: "True"
    type: Created
  - lastTransitionTime: "2021-10-01T21:47:35Z"
    lastUpdateTime: "2021-10-01T21:47:35Z"
    message: Experiment is running
    reason: ExperimentRunning
    status: "False"
    type: Running
  - lastTransitionTime: "2021-10-01T21:47:35Z"
    lastUpdateTime: "2021-10-01T21:47:35Z"
    message: Experiment has succeeded because max trial count has reached
    reason: ExperimentMaxTrialsReached
    status: "True"
    type: Succeeded
  currentOptimalTrial:
    bestTrialName: random-gh8psfcz
    observation:
      metrics:
      - latest: "0.977707"
        max: "0.979299"
        min: "0.955215"
        name: Validation-accuracy
      - latest: "0.993570"
        max: "0.993570"
        min: "0.907932"
        name: Train-accuracy
    parameterAssignments:
    - name: lr
      value: "0.014431754535687558"
    - name: num-layers
      value: "3"
    - name: optimizer
      value: sgd
  startTime: "2021-10-01T21:27:46Z"
  succeededTrialList:
  - random-ghvj6q8z
  - random-4z4kqr5l
  - random-8ssrzrzr
  - random-gw7xtn84
  - random-zlldw6v9
  - random-9jx47rsk
  - random-rzx6zcwb
  - random-46rqvb9k
  - random-nd8d2lmc
  - random-gw7wzdw2
  - random-hq2fghf6
  - random-gh8psfcz
  trials: 12
  trialsSucceeded: 12
```
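
To make the `objective` fields above more concrete, the following Python sketch
shows one way the optimal `Trial` could be chosen from the observations: take
the value named by the metric strategy (`max` here) for the objective metric
and keep the best one. This is only an illustration, not Katib's controller
code. The function names `objective_value`, `pick_optimal_trial`, and
`goal_reached` are hypothetical, and the two observations reuse values shown in
this document.

```python
from typing import Dict, List, Optional, Tuple

Observation = List[Dict[str, str]]


def objective_value(observation: Observation, metric_name: str, strategy: str) -> Optional[float]:
    """Return the metric value selected by the strategy ("min", "max" or "latest")."""
    for metric in observation:
        if metric["name"] == metric_name:
            return float(metric[strategy])
    return None


def pick_optimal_trial(
    observations: Dict[str, Observation],
    metric_name: str,
    strategy: str,
    objective_type: str,
) -> Tuple[Optional[str], Optional[float]]:
    """Pick the Trial with the best objective value (hypothetical helper)."""
    best_name, best_value = None, None
    for name, observation in observations.items():
        value = objective_value(observation, metric_name, strategy)
        if value is None:
            continue
        if best_value is None:
            better = True
        elif objective_type == "maximize":
            better = value > best_value
        else:
            better = value < best_value
        if better:
            best_name, best_value = name, value
    return best_name, best_value


def goal_reached(value: float, goal: float, objective_type: str) -> bool:
    """Check the objective value against the Experiment goal."""
    return value >= goal if objective_type == "maximize" else value <= goal


# Observations copied from the Experiment and Trial status shown in this document.
observations = {
    "random-gh8psfcz": [
        {"name": "Validation-accuracy", "latest": "0.977707", "max": "0.979299", "min": "0.955215"},
    ],
    "random-gw7wzdw2": [
        {"name": "Validation-accuracy", "latest": "0.949542", "max": "0.949542", "min": "0.938396"},
    ],
}

best = pick_optimal_trial(observations, "Validation-accuracy", "max", "maximize")
print(best, goal_reached(best[1], 0.99, "maximize"))
```

With these two observations the sketch selects `random-gh8psfcz`, which matches
`currentOptimalTrial` above. The goal of `0.99` is not reached, which is
consistent with the `Experiment` ending because the max trial count was reached.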

### Suggestion

Katib internally creates a `Suggestion` CR for each `Experiment` CR. The
`Suggestion` CR specifies the hyperparameter tuning algorithm in the
`algorithmName` field and the number of hyperparameter sets that Katib asks
the algorithm to generate in the `requests` field. The `Suggestion` also tracks
all hyperparameter sets generated so far in `status.suggestions`. The
`Suggestion` CR is used for internal logic control, so end users can ignore it.

```yaml
$ kubectl get suggestion random -n kubeflow -o yaml

apiVersion: kubeflow.org/v1beta1
kind: Suggestion
metadata:
  ...
  name: random
  namespace: kubeflow
  ownerReferences:
  - apiVersion: kubeflow.org/v1beta1
    blockOwnerDeletion: true
    controller: true
    kind: Experiment
    name: random
    uid: 355b05f5-6951-47b2-85f6-d0b9b8be5a64
  ...
spec:
  algorithm:
    algorithmName: random
  requests: 12
  resumePolicy: LongRunning
status:
  conditions:
  - lastTransitionTime: "2021-10-01T21:27:46Z"
    lastUpdateTime: "2021-10-01T21:27:46Z"
    message: Suggestion is created
    reason: SuggestionCreated
    status: "True"
    type: Created
  - lastTransitionTime: "2021-10-01T21:28:56Z"
    lastUpdateTime: "2021-10-01T21:28:56Z"
    message: Deployment is ready
    reason: DeploymentReady
    status: "True"
    type: DeploymentReady
  - lastTransitionTime: "2021-10-01T21:28:57Z"
    lastUpdateTime: "2021-10-01T21:28:57Z"
    message: Suggestion is running
    reason: SuggestionRunning
    status: "True"
    type: Running
  startTime: "2021-10-01T21:27:46Z"
  suggestionCount: 12
  suggestions:
  ...
  - name: random-gw7wzdw2
    parameterAssignments:
    - name: lr
      value: "0.020202241839540558"
    - name: num-layers
      value: "4"
    - name: optimizer
      value: adam
  - name: random-hq2fghf6
    parameterAssignments:
    - name: lr
      value: "0.01841281609693181"
    - name: num-layers
      value: "3"
    - name: optimizer
      value: sgd
  - name: random-8ssrzrzr
    parameterAssignments:
    - name: lr
      value: "0.021473410597867483"
    - name: num-layers
      value: "2"
    - name: optimizer
      value: adam
  ...
```
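
As a rough illustration of what an algorithm service does for the `random`
algorithm, the Python sketch below draws one value per parameter from the
`feasibleSpace` of the example `Experiment`. It assumes plain uniform sampling,
which may differ from what Katib's own random search service does. The names
`PARAMETERS` and `sample_assignments` are hypothetical.

```python
import random
from typing import Dict, List

# Search space copied from the `parameters` field of the example Experiment.
PARAMETERS = [
    {"name": "lr", "parameterType": "double",
     "feasibleSpace": {"min": "0.01", "max": "0.03"}},
    {"name": "num-layers", "parameterType": "int",
     "feasibleSpace": {"min": "2", "max": "5"}},
    {"name": "optimizer", "parameterType": "categorical",
     "feasibleSpace": {"list": ["sgd", "adam", "ftrl"]}},
]


def sample_assignments(parameters: List[dict], rng: random.Random) -> List[Dict[str, str]]:
    """Draw one hyperparameter set (assumption: uniform sampling)."""
    assignments = []
    for param in parameters:
        space = param["feasibleSpace"]
        kind = param["parameterType"]
        if kind == "double":
            value = str(rng.uniform(float(space["min"]), float(space["max"])))
        elif kind == "int":
            # randint includes both bounds.
            value = str(rng.randint(int(space["min"]), int(space["max"])))
        elif kind == "categorical":
            value = rng.choice(space["list"])
        else:
            raise ValueError(f"unsupported parameterType: {kind}")
        # Values are kept as strings, as in `parameterAssignments`.
        assignments.append({"name": param["name"], "value": value})
    return assignments


rng = random.Random(0)  # fixed seed only to make the sketch repeatable
suggestions = [sample_assignments(PARAMETERS, rng) for _ in range(3)]
for assignment_set in suggestions:
    print(assignment_set)
```

Each generated set has the same shape as one `parameterAssignments` entry under
`status.suggestions`.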

### Trial

For each set of hyperparameters, Katib internally generates a `Trial` CR
that contains the hyperparameter key-value pairs, the `Worker Job` run
specification with the parameters instantiated, and some other fields, as
shown below. The `Trial` CR is used for internal logic control, so end users
can ignore it.

```yaml
$ kubectl get trial -n kubeflow

NAME              TYPE        STATUS   AGE
random-46rqvb9k   Succeeded   True     20m
random-4z4kqr5l   Succeeded   True     23m
random-8ssrzrzr   Succeeded   True     14m
random-9jx47rsk   Succeeded   True     23m
random-gh8psfcz   Succeeded   True     8m15s
random-ghvj6q8z   Succeeded   True     23m
random-gw7wzdw2   Succeeded   True     17m
random-gw7xtn84   Succeeded   True     12m
random-hq2fghf6   Succeeded   True     17m
random-nd8d2lmc   Succeeded   True     17m
random-rzx6zcwb   Succeeded   True     20m
random-zlldw6v9   Succeeded   True     11m

$ kubectl get trial random-gw7wzdw2 -o yaml -n kubeflow

apiVersion: kubeflow.org/v1beta1
kind: Trial
metadata:
  creationTimestamp: "2021-10-01T21:35:18Z"
  finalizers:
  - clean-metrics-in-db
  generation: 1
  labels:
    katib.kubeflow.org/experiment: random
  name: random-gw7wzdw2
  namespace: kubeflow
  ownerReferences:
  - apiVersion: kubeflow.org/v1beta1
    blockOwnerDeletion: true
    controller: true
    kind: Experiment
    name: random
    uid: 355b05f5-6951-47b2-85f6-d0b9b8be5a64
  ...
spec:
  failureCondition: status.conditions.#(type=="Failed")#|#(status=="True")#
  metricsCollector:
    collector:
      kind: StdOut
  objective:
    additionalMetricNames:
    - Train-accuracy
    goal: 0.99
    metricStrategies:
    - name: Validation-accuracy
      value: max
    - name: Train-accuracy
      value: max
    objectiveMetricName: Validation-accuracy
    type: maximize
  parameterAssignments:
  - name: lr
    value: "0.020202241839540558"
  - name: num-layers
    value: "4"
  - name: optimizer
    value: adam
  primaryContainerName: training-container
  runSpec:
    apiVersion: batch/v1
    kind: Job
    metadata:
      name: random-gw7wzdw2
      namespace: kubeflow
    spec:
      template:
        spec:
          containers:
          - command:
            - python3
            - /opt/mxnet-mnist/mnist.py
            - --batch-size=64
            - --lr=0.020202241839540558
            - --num-layers=4
            - --optimizer=adam
            image: docker.io/kubeflowkatib/mxnet-mnist:v1beta1-45c5727
            name: training-container
          restartPolicy: Never
  successCondition: status.conditions.#(type=="Complete")#|#(status=="True")#
status:
  completionTime: "2021-10-01T21:40:59Z"
  conditions:
  - lastTransitionTime: "2021-10-01T21:35:18Z"
    lastUpdateTime: "2021-10-01T21:35:18Z"
    message: Trial is created
    reason: TrialCreated
    status: "True"
    type: Created
  - lastTransitionTime: "2021-10-01T21:40:59Z"
    lastUpdateTime: "2021-10-01T21:40:59Z"
    message: Trial is running
    reason: TrialRunning
    status: "False"
    type: Running
  - lastTransitionTime: "2021-10-01T21:40:59Z"
    lastUpdateTime: "2021-10-01T21:40:59Z"
    message: Trial has succeeded
    reason: TrialSucceeded
    status: "True"
    type: Succeeded
  observation:
    metrics:
    - latest: "0.949542"
      max: "0.949542"
      min: "0.938396"
      name: Validation-accuracy
    - latest: "0.943164"
      max: "0.944463"
      min: "0.911081"
      name: Train-accuracy
  startTime: "2021-10-01T21:35:18Z"
```
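
The `runSpec` above is the `trialSpec` from the `Experiment` with every
`${trialParameters.<name>}` placeholder replaced by the value assigned to the
referenced parameter. The Python sketch below reproduces that substitution for
the container command. It is a simplified illustration, not Katib's template
rendering code, and `render_command` is a hypothetical helper.

```python
import re
from typing import Dict, List

# Copied from `trialTemplate.trialParameters` of the example Experiment.
TRIAL_PARAMETERS = [
    {"name": "learningRate", "reference": "lr"},
    {"name": "numberLayers", "reference": "num-layers"},
    {"name": "optimizer", "reference": "optimizer"},
]

# Copied from `spec.parameterAssignments` of Trial random-gw7wzdw2.
ASSIGNMENTS = [
    {"name": "lr", "value": "0.020202241839540558"},
    {"name": "num-layers", "value": "4"},
    {"name": "optimizer", "value": "adam"},
]

# Container command from `trialTemplate.trialSpec`.
COMMAND = [
    "python3",
    "/opt/mxnet-mnist/mnist.py",
    "--batch-size=64",
    "--lr=${trialParameters.learningRate}",
    "--num-layers=${trialParameters.numberLayers}",
    "--optimizer=${trialParameters.optimizer}",
]

PLACEHOLDER = re.compile(r"\$\{trialParameters\.([^}]+)\}")


def render_command(
    command: List[str],
    trial_parameters: List[Dict[str, str]],
    assignments: List[Dict[str, str]],
) -> List[str]:
    """Replace each placeholder with the value of the referenced parameter."""
    assigned = {a["name"]: a["value"] for a in assignments}
    # Map the trial parameter name to the value of the parameter it references.
    values = {tp["name"]: assigned[tp["reference"]] for tp in trial_parameters}
    return [PLACEHOLDER.sub(lambda m: values[m.group(1)], arg) for arg in command]


rendered = render_command(COMMAND, TRIAL_PARAMETERS, ASSIGNMENTS)
print(rendered)
```

The rendered list matches the `command` in the `runSpec` of `random-gw7wzdw2`
shown above.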

## What happens after an `Experiment` CR is created

When a user creates an `Experiment` CR, the Katib `Experiment` controller,
`Suggestion` controller, and `Trial` controller work together to tune the
hyperparameters of the user's machine learning model. The `Experiment`
workflow looks as follows:

<center>
<img width="100%" alt="image" src="images/katib-workflow.png">
</center>

1. The `Experiment` CR is submitted to the Kubernetes API server. The Katib
   `Experiment` mutating and validating webhooks are called to set the default
   values for the `Experiment` CR and to validate the CR, respectively.

1. The `Experiment` controller creates the `Suggestion` CR.

1. The `Suggestion` controller creates the algorithm deployment and service
   based on the new `Suggestion` CR.

1. When the `Suggestion` controller verifies that the algorithm service is
   ready, it calls the service to generate
   `spec.requests - len(status.suggestions)` sets of hyperparameters and appends
   them to `status.suggestions`.

1. The `Experiment` controller detects that the `Suggestion` CR has been updated
   and generates a `Trial` for each new set of hyperparameters.

1. The `Trial` controller generates the `Worker Job` based on the `runSpec`
   from the `Trial` CR with the new set of hyperparameters.

1. The related job controller
   (Kubernetes batch Job, Kubeflow TFJob, Tekton Pipeline, etc.) generates
   Kubernetes Pods.

1. The Katib Pod mutating webhook is called to inject the metrics collector
   sidecar container into the candidate Pods.

1. While the ML model container runs, the metrics collector container
   collects metrics from the injected Pod and persists them to the Katib
   DB backend.

1. When the ML model training ends, the `Trial` controller updates the status
   of the corresponding `Trial` CR.

1. When the `Trial` CR finishes, the `Experiment` controller increases the
   `requests` field of the corresponding `Suggestion` CR if more Trials are
   needed, and the workflow continues from step 4 again.
   If the `Trial` CRs meet one of the end conditions
   (hitting the `maxTrialCount`, `maxFailedTrialCount`, or `goal` limit),
   the `Experiment` controller marks the `Experiment` as finished.
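
The bookkeeping in steps 4 and 11 can be summarized with the small Python
sketch below. It is a simplified model and not the controllers' actual code.
The function names are hypothetical, the objective is assumed to be of type
`maximize`, and the exact comparison operators (`>=` for the trial count and
goal, `>` for the failed trial count) are assumptions.

```python
from typing import List, Optional


def suggestions_to_generate(requests: int, existing_suggestions: List[dict]) -> int:
    """Step 4: number of new hyperparameter sets to ask the algorithm service for."""
    return max(requests - len(existing_suggestions), 0)


def end_condition(
    trials_completed: int,
    trials_failed: int,
    max_trial_count: int,
    max_failed_trial_count: int,
    best_value: Optional[float],
    goal: float,
) -> Optional[str]:
    """Step 11: return a description of the end condition that was met, if any."""
    if best_value is not None and best_value >= goal:  # assumes `type: maximize`
        return "goal reached"
    if trials_failed > max_failed_trial_count:  # assumption: strictly greater
        return "max failed trial count exceeded"
    if trials_completed >= max_trial_count:
        return "max trial count reached"
    return None  # keep going: raise `requests` and return to step 4


# The Suggestion asks for 6 sets in total and 3 already exist: generate 3 more.
print(suggestions_to_generate(6, [{}, {}, {}]))

# Final state of the example Experiment: 12 succeeded Trials, best value below the goal.
print(end_condition(12, 0, 12, 3, 0.979299, 0.99))
```

For the example `Experiment`, the second call reports that the max trial count
was reached, which corresponds to the `ExperimentMaxTrialsReached` reason in its
status.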