diff --git a/.github/PULL_REQUEST_TEMPLATE.md b/.github/PULL_REQUEST_TEMPLATE.md index f2be2ee23ed..5881e55a6e2 100644 --- a/.github/PULL_REQUEST_TEMPLATE.md +++ b/.github/PULL_REQUEST_TEMPLATE.md @@ -1,6 +1,6 @@ diff --git a/docs/developer-guide.md b/CONTRIBUTING.md similarity index 94% rename from docs/developer-guide.md rename to CONTRIBUTING.md index 955c44e5ae5..705aba59e6c 100644 --- a/docs/developer-guide.md +++ b/CONTRIBUTING.md @@ -2,13 +2,12 @@ This developer guide is for people who want to contribute to the Katib project. If you're interesting in using Katib in your machine learning project, -see the following user guides: +see the following guides: -- [Concepts](https://www.kubeflow.org/docs/components/katib/overview/) - in Katib, hyperparameter tuning, and neural architecture search. - [Getting started with Katib](https://kubeflow.org/docs/components/katib/hyperparameter/). -- Detailed guide to [configuring and running a Katib - experiment](https://kubeflow.org/docs/components/katib/experiment/). +- [How to configure Katib Experiment](https://kubeflow.org/docs/components/katib/experiment/). +- [Katib architecture and concepts](https://www.kubeflow.org/docs/components/katib/reference/architecture/) + for hyperparameter tuning and neural architecture search. ## Requirements @@ -93,10 +92,6 @@ Below is a list of command-line flags accepted by Katib DB Manager: | --------------- | ------------- | ------- | ------------------------------------------------------- | | connect-timeout | time.Duration | 60s | Timeout before calling error during database connection | -## Workflow design - -Please see [workflow-design.md](./workflow-design.md). - ## Katib admission webhooks Katib uses three [Kubernetes admission webhooks](https://kubernetes.io/docs/reference/access-authn-authz/extensible-admission-controllers/). @@ -113,7 +108,7 @@ Katib uses three [Kubernetes admission webhooks](https://kubernetes.io/docs/refe 1. 
`mutator.pod.katib.kubeflow.org` - Mutating admission webhook to inject the metrics collector sidecar container to the training pod. Learn more about the Katib's metrics collector in the - [Kubeflow documentation](https://www.kubeflow.org/docs/components/katib/experiment/#metrics-collector). + [Kubeflow documentation](https://www.kubeflow.org/docs/components/katib/user-guides/metrics-collector/). You can find the YAMLs for the Katib webhooks [here](../manifests/v1beta1/components/webhook/webhooks.yaml). @@ -168,4 +163,4 @@ they'll be executed against every file in the repository. Specific programmatically generated files listed in the `exclude` field in [.pre-commit-config.yaml](../.pre-commit-config.yaml) are deliberately excluded -from the hooks. +from the hooks. diff --git a/README.md b/README.md index f9c0dc1806a..3d7cdeb1f9b 100644 --- a/README.md +++ b/README.md @@ -29,13 +29,13 @@ and many more. Katib stands for `secretary` in Arabic. -# Search Algorithms +## Search Algorithms Katib supports several search algorithms. Follow the -[Kubeflow documentation](https://www.kubeflow.org/docs/components/katib/experiment/#search-algorithms-in-detail) +[Kubeflow documentation](https://www.kubeflow.org/docs/components/katib/user-guides/hp-tuning/configure-algorithm/#hp-tuning-algorithms) to know more about each algorithm and check the -[Suggestion service guide](/docs/new-algorithm-service.md) to implement your -custom algorithm. +[this guide](https://www.kubeflow.org/docs/components/katib/user-guides/hp-tuning/configure-algorithm/#use-custom-algorithm-in-katib) +to implement your custom algorithm. @@ -137,141 +137,68 @@ custom algorithm.
-To perform above algorithms Katib supports the following frameworks: +To perform the above algorithms Katib supports the following frameworks: - [Goptuna](https://github.com/c-bata/goptuna) - [Hyperopt](https://github.com/hyperopt/hyperopt) - [Optuna](https://github.com/optuna/optuna) - [Scikit Optimize](https://github.com/scikit-optimize/scikit-optimize) -# Installation - -For the various Katib installs check the -[Kubeflow guide](https://www.kubeflow.org/docs/components/katib/hyperparameter/#katib-setup). -Follow the next steps to install Katib standalone. - ## Prerequisites -This is the minimal requirements to install Katib: - -- Kubernetes >= 1.27 -- `kubectl` >= 1.27 +Please check [the official Kubeflow documentation](https://www.kubeflow.org/docs/components/katib/installation/#prerequisites) +for prerequisites to install Katib. -## Latest Version +## Installation -For the latest Katib version run this command: - -``` -kubectl apply -k "github.com/kubeflow/katib.git/manifests/v1beta1/installs/katib-standalone?ref=master" -``` +Please follow [the Kubeflow Katib guide](https://www.kubeflow.org/docs/components/katib/installation/#installing-katib) +for the detailed instructions on how to install Katib. 
-## Release Version +### Installing the Control Plane -For the specific Katib release (for example `v0.14.0`) run this command: +Run the following command to install the latest stable release of Katib control plane: ``` -kubectl apply -k "github.com/kubeflow/katib.git/manifests/v1beta1/installs/katib-standalone?ref=v0.14.0" +kubectl apply -k "github.com/kubeflow/katib.git/manifests/v1beta1/installs/katib-standalone?ref=v0.17.0" ``` -Make sure that all Katib components are running: +Run the following command to install the latest changes of Katib control plane: ``` -$ kubectl get pods -n kubeflow - -NAME READY STATUS RESTARTS AGE -katib-controller-566595bdd8-hbxgf 1/1 Running 0 36s -katib-db-manager-57cd769cdb-4g99m 1/1 Running 0 36s -katib-mysql-7894994f88-5d4s5 1/1 Running 0 36s -katib-ui-5767cfccdc-pwg2x 1/1 Running 0 36s +kubectl apply -k "github.com/kubeflow/katib.git/manifests/v1beta1/installs/katib-standalone?ref=master" ``` For the Katib Experiments check the [complete examples list](./examples/v1beta1). -# Quickstart +### Installing the Python SDK -You can run your first HyperParameter Tuning Experiment using [Katib Python SDK](./sdk/python/v1beta1). +Katib implements [a Python SDK](https://pypi.org/project/kubeflow-katib/) to simplify creation of +hyperparameter tuning jobs for Data Scientists. -In the following example we are going to maximize a simple objective function: -$F(a,b) = 4a - b^2$. The bigger $a$ and the lesser $b$ value, the bigger the function value $F$. +Run the following command to install the latest stable release of Katib SDK: -```python -import kubeflow.katib as katib - -# Step 1. Create an objective function. -def objective(parameters): - # Import required packages. - import time - time.sleep(5) - # Calculate objective function. - result = 4 * int(parameters["a"]) - float(parameters["b"]) ** 2 - # Katib parses metrics in this format: =. - print(f"result={result}") - -# Step 2. Create HyperParameter search space. 
-parameters = { - "a": katib.search.int(min=10, max=20), - "b": katib.search.double(min=0.1, max=0.2) -} - -# Step 3. Create Katib Experiment. -katib_client = katib.KatibClient() -name = "tune-experiment" -katib_client.tune( - name=name, - objective=objective, - parameters=parameters, - objective_metric_name="result", - max_trial_count=12 -) - -# Step 4. Get the best HyperParameters. -print(katib_client.get_optimal_hyperparameters(name)) +```sh +pip install -U kubeflow-katib ``` -# Documentation - -- Check - [the Katib getting started guide](https://www.kubeflow.org/docs/components/katib/hyperparameter/#example-using-random-search-algorithm). - -- Learn about Katib **Concepts** in this - [guide](https://www.kubeflow.org/docs/components/katib/overview/#katib-concepts). - -- Learn about Katib **Interfaces** in this - [guide](https://www.kubeflow.org/docs/components/katib/overview/#katib-interfaces). - -- Learn about Katib **Components** in this - [guide](https://www.kubeflow.org/docs/components/katib/hyperparameter/#katib-components). - -- Know more about Katib in the [presentations and demos list](./docs/presentations.md). +## Getting Started -# Community +Please refer to [the getting started guide](https://www.kubeflow.org/docs/components/katib/getting-started/#getting-started-with-katib-python-sdk) +to quickly create your first hyperparameter tuning Experiment using the Python SDK. -We are always growing our community and invite new users and AutoML enthusiasts -to contribute to the Katib project. The following links provide information -about getting involved in the community: +## Community -- Subscribe to the - [AutoML calendar](https://calendar.google.com/calendar/u/0/r?cid=ZDQ5bnNpZWZzbmZna2Y5MW8wdThoMmpoazRAZ3JvdXAuY2FsZW5kYXIuZ29vZ2xlLmNvbQ) - to attend Working Group bi-weekly community meetings. 
+The following links provide information on how to get involved in the community: -- Check the - [AutoML and Training Working Group meeting notes](https://docs.google.com/document/d/1MChKfzrKAeFRtYqypFbMXL6ZIc_OgijjkvbqmwRV-64/edit). - -- If you use Katib, please update [the adopters list](ADOPTERS.md). +- Attend [the bi-weekly AutoML and Training Working Group](https://bit.ly/2PWVCkV) + community meeting. +- Join our [`#kubeflow-katib`](https://www.kubeflow.org/docs/about/community/#kubeflow-slack-channels) + Slack channel. +- Check out [who is using Katib](ADOPTERS.md) and [presentations about Katib project](docs/presentations.md). ## Contributing -Please feel free to test the system! [Developer guide](./docs/developer-guide.md) -is a good starting point for our developers. - -## Blog posts - -- [Kubeflow Katib: Scalable, Portable and Cloud Native System for AutoML](https://blog.kubeflow.org/katib/) - (by Andrey Velichkevich) - -## Events - -- [AutoML and Training WG Summit. 16th of July 2021](https://docs.google.com/document/d/1vGluSPHmAqEr8k9Dmm82RcQ-MVnqbYYSfnjMGB-aPuo/edit?usp=sharing) +Please refer to the [CONTRIBUTING guide](CONTRIBUTING.md). ## Citation diff --git a/docs/README.md b/docs/README.md new file mode 100644 index 00000000000..b199e066e7a --- /dev/null +++ b/docs/README.md @@ -0,0 +1,5 @@ +# Katib Documentation + +Welcome to Kubeflow Katib! + +The Katib documentation is available on [kubeflow.org](https://www.kubeflow.org/docs/components/katib/). diff --git a/docs/images-location.md b/docs/images-location.md index 0cbf78d7618..c20b60a43e4 100644 --- a/docs/images-location.md +++ b/docs/images-location.md @@ -5,7 +5,7 @@ Here you can find the location for images that are used in Katib. ## Katib Components Images The following table shows images for the -[Katib components](https://www.kubeflow.org/docs/components/katib/hyperparameter/#katib-components). 
+[Katib components](https://www.kubeflow.org/docs/components/katib/reference/architecture/#katib-control-plane-components). @@ -70,7 +70,7 @@ The following table shows images for the ## Katib Metrics Collectors Images The following table shows images for the -[Katib Metrics Collectors](https://www.kubeflow.org/docs/components/katib/experiment/#metrics-collector). +[Katib Metrics Collectors](https://www.kubeflow.org/docs/components/katib/user-guides/metrics-collector/).
@@ -113,8 +113,8 @@ The following table shows images for the ## Katib Suggestions and Early Stopping Images The following table shows images for the -[Katib Suggestions](https://www.kubeflow.org/docs/components/katib/experiment/#search-algorithms-in-detail) -and the [Katib Early Stopping algorithms](https://www.kubeflow.org/docs/components/katib/early-stopping/). +[Katib Suggestion services](https://www.kubeflow.org/docs/components/katib/reference/architecture/#suggestion) +and the [Katib Early Stopping algorithms](https://www.kubeflow.org/docs/components/katib/user-guides/early-stopping/#early-stopping-algorithms).
@@ -223,7 +223,7 @@ and the [Katib Early Stopping algorithms](https://www.kubeflow.org/docs/componen ## Training Containers Images The following table shows images for training containers which are used in the -[Katib Trials](https://www.kubeflow.org/docs/components/katib/experiment/#packaging-your-training-code-in-a-container-image). +[Katib Trials](https://www.kubeflow.org/docs/components/katib/reference/architecture/#trial).
diff --git a/docs/images/SystemFlow.png b/docs/images/SystemFlow.png deleted file mode 100644 index 8a147437676..00000000000 Binary files a/docs/images/SystemFlow.png and /dev/null differ diff --git a/docs/new-algorithm-service.md b/docs/new-algorithm-service.md deleted file mode 100644 index 6955b53d1c1..00000000000 --- a/docs/new-algorithm-service.md +++ /dev/null @@ -1,185 +0,0 @@ -# Document about how to add a new algorithm in Katib - -## Implement a new algorithm and use it in Katib - -### Implement the algorithm - -The design of Katib follows the `ask-and-tell` pattern: - -> They often follow a pattern a bit like this: 1. ask for a new set of parameters 1. walk to the Experiment and program in the new parameters 1. observe the outcome of running the Experiment 1. walk back to your laptop and tell the optimizer about the outcome 1. go to step 1 - -When an Experiment is created, one algorithm service will be created. Then Katib asks for new sets of parameters via `GetSuggestions` GRPC call. After that, Katib creates new trials according to the sets and observe the outcome. When the trials are finished, Katib tells the metrics of the finished trials to the algorithm, and ask another new sets. - -The new algorithm needs to implement `Suggestion` service defined in [api.proto](../pkg/apis/manager/v1beta1/api.proto). One sample algorithm looks like: - -```python -from pkg.apis.manager.v1beta1.python import api_pb2 -from pkg.apis.manager.v1beta1.python import api_pb2_grpc -from pkg.suggestion.v1beta1.internal.search_space import HyperParameter, HyperParameterSearchSpace -from pkg.suggestion.v1beta1.internal.trial import Trial, Assignment -from pkg.suggestion.v1beta1.hyperopt.base_service import BaseHyperoptService -from pkg.suggestion.v1beta1.internal.base_health_service import HealthServicer - - -# Inherit SuggestionServicer and implement GetSuggestions. 
-class HyperoptService( - api_pb2_grpc.SuggestionServicer, HealthServicer): - def ValidateAlgorithmSettings(self, request, context): - # Optional, it is used to validate algorithm settings defined by users. - pass - def GetSuggestions(self, request, context): - # Convert the Experiment in GRPC request to the search space. - # search_space example: - # HyperParameterSearchSpace( - # goal: MAXIMIZE, - # params: [HyperParameter(name: param-1, type: INTEGER, min: 1, max: 5, step: 0), - # HyperParameter(name: param-2, type: CATEGORICAL, list: cat1, cat2, cat3), - # HyperParameter(name: param-3, type: DISCRETE, list: 3, 2, 6), - # HyperParameter(name: param-4, type: DOUBLE, min: 1, max: 5, step: )] - # ) - search_space = HyperParameterSearchSpace.convert(request.experiment) - # Convert the trials in GRPC request to the trials in algorithm side. - # trials example: - # [Trial( - # assignment: [Assignment(name=param-1, value=2), - # Assignment(name=param-2, value=cat1), - # Assignment(name=param-3, value=2), - # Assignment(name=param-4, value=3.44)], - # target_metric: Metric(name="metric-2" value="5643"), - # additional_metrics: [Metric(name=metric-1, value=435), - # Metric(name=metric-3, value=5643)], - # Trial( - # assignment: [Assignment(name=param-1, value=3), - # Assignment(name=param-2, value=cat2), - # Assignment(name=param-3, value=6), - # Assignment(name=param-4, value=4.44)], - # target_metric: Metric(name="metric-2" value="3242"), - # additional_metrics: [Metric(name=metric=1, value=123), - # Metric(name=metric-3, value=543)], - trials = Trial.convert(request.trials) - #-------------------------------------------------------------- - # Your code here - # Implement the logic to generate new assignments for the given current request number. 
- # For example, if request.current_request_number is 2, you should return: - # [ - # [Assignment(name=param-1, value=3), - # Assignment(name=param-2, value=cat2), - # Assignment(name=param-3, value=3), - # Assignment(name=param-4, value=3.22) - # ], - # [Assignment(name=param-1, value=4), - # Assignment(name=param-2, value=cat4), - # Assignment(name=param-3, value=2), - # Assignment(name=param-4, value=4.32) - # ], - # ] - list_of_assignments = your_logic(search_space, trials, request.current_request_number) - #-------------------------------------------------------------- - # Convert list_of_assignments to - return api_pb2.GetSuggestionsReply( - trials=Assignment.generate(list_of_assignments) - ) -``` - -### Make a GRPC server for the algorithm - -Create a package under [cmd/suggestion](../cmd/suggestion). Then create the main function and Dockerfile. The new GRPC server should serve in port 6789. - -Here is an example: [cmd/suggestion/hyperopt](../cmd/suggestion/hyperopt). -Then build the Docker image. - -### Use the algorithm in Katib. - -Update the [Katib config](../manifests/v1beta1/installs/katib-standalone/katib-config.yaml) with the new algorithm entity: - -```diff - runtime: - suggestions: - - algorithmName: random - image: docker.io/kubeflowkatib/suggestion-hyperopt:$(KATIB_VERSION) - - algorithmName: tpe - image: docker.io/kubeflowkatib/suggestion-hyperopt:$(KATIB_VERSION) -+ - algorithmName: -+ image: "image built in the previous stage":$(KATIB_VERSION) -``` - -Learn more about Katib config in the -[Kubeflow documentation](https://www.kubeflow.org/docs/components/katib/katib-config/) - -### Contribute the algorithm to Katib - -If you want to contribute the algorithm to Katib, you could add unit test and/or -e2e test for it in the CI and submit a PR. 
- -#### Unit Test - -Here is an example [test_hyperopt_service.py](../test/unit/v1beta1/suggestion/test_hyperopt_service.py): - -```python -import grpc -import grpc_testing -import unittest - -from pkg.apis.manager.v1beta1.python import api_pb2_grpc -from pkg.apis.manager.v1beta1.python import api_pb2 - -from pkg.suggestion.v1beta1.hyperopt.service import HyperoptService - -class TestHyperopt(unittest.TestCase): - def setUp(self): - servicers = { - api_pb2.DESCRIPTOR.services_by_name['Suggestion']: HyperoptService() - } - - self.test_server = grpc_testing.server_from_dictionary( - servicers, grpc_testing.strict_real_time()) - - -if __name__ == '__main__': - unittest.main() -``` - -You can setup the GRPC server using `grpc_testing`, then define your own test cases. - -#### E2E Test (Optional) - -E2E tests help Katib verify that the algorithm works well. -Follow below steps to add your algorithm (Suggestion) to the Katib CI -(replace `` with your Suggestion name): - -1. Submit a PR to add a new ECR private registry to the AWS - [`ECR_Private_Registry_List`](https://github.com/kubeflow/testing/blob/master/aws/IaC/CDK/test-infra/config/static_config/ECR_Resources.py#L18). - Registry name should follow the pattern: `katib/v1beta1/suggestion-` - -1. Create a new Experiment YAML in the [examples/v1beta1](../examples/v1beta1) - with the new algorithm. - -1. Update [`setup-katib.sh`](../test/e2e/v1beta1/scripts/setup-katib.sh) - script to modify `katib-config.yaml` with the new test Suggestion image name. - For example: - - ```sh - sed -i -e "s@docker.io/kubeflowkatib/suggestion-@${ECR_REGISTRY}/${REPO_NAME}/v1beta1/suggestion-@" ${CONFIG_PATCH} - ``` - -1. Update the following variables in [`argo_workflow.py`](../test/e2e/v1beta1/argo_workflow.py): - -- [`KATIB_IMAGES`](../test/e2e/v1beta1/argo_workflow.py#L43) with your Suggestion Dockerfile location: - -```diff - . . . 
- "suggestion-goptuna": "cmd/suggestion/goptuna/v1beta1/Dockerfile", - "suggestion-optuna": "cmd/suggestion/optuna/v1beta1/Dockerfile", -+ "suggestion-": "cmd/suggestion//v1beta1/Dockerfile", - . . . -``` - -- [`KATIB_EXPERIMENTS`](../test/e2e/v1beta1/argo_workflow.py#L69) with your Experiment YAML location: - -```diff - . . . - "multivariate-tpe": "examples/v1beta1/hp-tuning/multivariate-tpe.yaml", - "cmaes": "examples/v1beta1/hp-tuning/cma-es.yaml", -+ ": "examples/v1beta1/hp-tuning/.yaml", - . . . -``` diff --git a/docs/proposals/trial-custom-crd.md b/docs/proposals/1214-custom-crd-in-trial/README.md similarity index 99% rename from docs/proposals/trial-custom-crd.md rename to docs/proposals/1214-custom-crd-in-trial/README.md index a939f77011a..bd27edb5b4e 100644 --- a/docs/proposals/trial-custom-crd.md +++ b/docs/proposals/1214-custom-crd-in-trial/README.md @@ -1,4 +1,4 @@ -# Support custom CRD in Trial Job proposal +# KEP-1214: Support custom CRD in Trial Job proposal diff --git a/docs/proposals/conformance-test.md b/docs/proposals/2044-conformance-program/README.md similarity index 98% rename from docs/proposals/conformance-test.md rename to docs/proposals/2044-conformance-program/README.md index 900399a30de..7018c5e1af3 100644 --- a/docs/proposals/conformance-test.md +++ b/docs/proposals/2044-conformance-program/README.md @@ -1,4 +1,4 @@ -# Conformance Test for AutoML and Training Working Group +# KEP-2044: Conformance Test for AutoML and Training Working Group Andrey Velichkevich ([@andreyvelich](https://github.com/andreyvelich)) Johnu George ([@johnugeorge](https://github.com/johnugeorge)) @@ -61,7 +61,7 @@ the 3 category of tests: ## Design for the CRD-based tests -![conformance-crd-test](../images/conformance-crd-test.png) +![conformance-crd-test](conformance-crd-test.png) The design is similar to the KFP conformance program for the API-based tests. 
diff --git a/docs/images/conformance-crd-test.png b/docs/proposals/2044-conformance-program/conformance-crd-test.png similarity index 100% rename from docs/images/conformance-crd-test.png rename to docs/proposals/2044-conformance-program/conformance-crd-test.png diff --git a/docs/proposals/llm-hyperparameter-optimization-api.md b/docs/proposals/2339-hpo-for-llm-fine-tuning/README.md similarity index 95% rename from docs/proposals/llm-hyperparameter-optimization-api.md rename to docs/proposals/2339-hpo-for-llm-fine-tuning/README.md index 01f74f9a2fd..f3086f13b32 100644 --- a/docs/proposals/llm-hyperparameter-optimization-api.md +++ b/docs/proposals/2339-hpo-for-llm-fine-tuning/README.md @@ -1,13 +1,13 @@ -# HyperParameter Optimization API for LLM Fine-Tuning +# KEP-2339: HyperParameter Optimization API for LLM Fine-Tuning - [HyperParameter Optimization API for LLM Fine-Tuning](#hyperparameter-optimization-api-for-llm-fine-tuning) - * [Links](#links) - * [Motivation](#motivation) - * [Goals](#goals) - * [Non-Goals](#non-goals) - * [Design for API](#design-for-api) - + [Example](#example) - * [Implementation](#implementation) + - [Links](#links) + - [Motivation](#motivation) + - [Goals](#goals) + - [Non-Goals](#non-goals) + - [Design for API](#design-for-api) + - [Example](#example) + - [Implementation](#implementation) ## Links @@ -37,10 +37,10 @@ import kubeflow.katib as katib from kubeflow.katib import KatibClient class KatibClient(object): - + def tune( self, - name: str, + name: str, namespace: Optional[str] = None, model_provider_parameters: Optional[HuggingFaceModelParams] = None, dataset_provider_parameters: Optional[Union[HuggingFaceDatasetParams, S3DatasetParams]] = None, @@ -152,9 +152,9 @@ cl.tune( lora_dropout = 0.1, bias = "none", ), - ), - objective_metric_name = "train_loss", - objective_type = "minimize", + ), + objective_metric_name = "train_loss", + objective_type = "minimize", algorithm_name = "random", max_trial_count = 10, parallel_trial_count = 
2, @@ -214,7 +214,7 @@ container_spec = client.V1Container( "--transformer_type", model_provider_parameters.transformer_type.__name__, "--model_dir", - "REPLACE_WITH_ACTUAL_MODEL_PATH", + "REPLACE_WITH_ACTUAL_MODEL_PATH", "--dataset_dir", "REPLACE_WITH_ACTUAL_DATASET_PATH", "--lora_config", @@ -228,7 +228,7 @@ container_spec = client.V1Container( **Hyperparameter Optimization**: This API will create an Experiment that defines the search space for identified tunable hyperparameters, the objective metric, optimization algorithm, etc. The Experiment will orchestrate the hyperparameter tuning process, generating Trials for each configuration. Each Trial will be implemented as a Kubernetes PyTorchJob, with the `trialTemplate` specifying the exact values for hyperparameters. The `trialTemplate` will also define master and worker containers, facilitating effective resource distribution and parallel execution of Trials. Trial results will then be fed back to the Experiment, which will evaluate the outcomes to identify the optimal set of hyperparameters. 
- **Dependencies Update**: To reuse existing assets from the Training Python SDK and integrate packages from HuggingFace, dependencies will be added to the `setup.py` of the Katib Python SDK as follows: +**Dependencies Update**: To reuse existing assets from the Training Python SDK and integrate packages from HuggingFace, dependencies will be added to the `setup.py` of the Katib Python SDK as follows: ```python setuptools.setup( diff --git a/docs/proposals/push-based-metrics-collection.md b/docs/proposals/2340-push-based-metrics-collector/README.md similarity index 98% rename from docs/proposals/push-based-metrics-collection.md rename to docs/proposals/2340-push-based-metrics-collector/README.md index 7f7020bb948..9bef7365253 100644 --- a/docs/proposals/push-based-metrics-collection.md +++ b/docs/proposals/2340-push-based-metrics-collector/README.md @@ -1,4 +1,4 @@ -# Push-based Metrics Collection Proposal +# KEP-2340: Push-based Metrics Collection Proposal ## Links @@ -12,11 +12,12 @@ In the procedure of tuning hyperparameters, Metrics Collector, which is implemen However, current implementation of Metrics Collector is pull-based, raising some [design problems](https://github.com/kubeflow/training-operator/issues/722#issuecomment-405669269) such as determining the frequency we scrape the metrics, performance issues like the overhead caused by too many sidecar containers, and restrictions on developing environments which must support sidecar containers. Thus, we should implement a new API for Katib Python SDK to offer users a push-based way to store metrics directly into the Katib DB and resolve those issues raised by pull-based metrics collection. -![](../images/push-based-metrics-collection.png) +![](./push-based-metrics-collection.png) Fig.1 Architecture of the new design ### Goals + 1. **A new parameter in Python SDK function `tune`**: allow users to specify the method of collecting metrics(push-based/pull-based). 2. 
**A new interface `report_metrics` in Python SDK**: push the metrics to Katib DB directly. @@ -24,11 +25,11 @@ Fig.1 Architecture of the new design 3. The final metrics of worker pods should be **pushed to Katib DB directly** in the push mode of metrics collection. ### Non-Goals + 1. Implement authentication model for Katib DB to push metrics. 2. Support pushing data to different types of storage system(prometheus, self-defined interface etc.) - ## API ### New Parameter in Python SDK Function `tune` @@ -58,9 +59,9 @@ def tune( packages_to_install: List[str] = None, pip_index_url: str = "https://pypi.org/simple", # The newly added parameter metrics_collector_config. - # It specifies the config of metrics collector, for example, + # It specifies the config of metrics collector, for example, # metrics_collector_config={"kind": "Push"}, - metrics_collector_config: Dict[str, Any] = {"kind": "StdOut"}, + metrics_collector_config: Dict[str, Any] = {"kind": "StdOut"}, ) ``` @@ -76,7 +77,7 @@ def tune( For example, `metrics = {"loss": 0.01, "accuracy": 0.99}`. db-manager-address: Address for the Katib DB Manager in this format: `ip-address:port`. timeout: Optional, gRPC API Server timeout in seconds to report metrics. - + Raises: RuntimeError: Unable to push Trial metrics to Katib DB. """ @@ -138,7 +139,7 @@ As mentioned above, we decided to add `metrics_collector_config` to the tune fun ### New Interface `report_metrics` in Python SDK -We decide to implement this function to push metrics directly to Katib DB with the help of gRPC. Trial name should always be passed into Katib Trials (and then into this function) as env variable `KATIB_TRIAL_NAME`. +We decide to implement this function to push metrics directly to Katib DB with the help of gRPC. Trial name should always be passed into Katib Trials (and then into this function) as env variable `KATIB_TRIAL_NAME`. Also, the function is supposed to be implemented as a **global function** because it is called in the user container. 
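The flow described for `report_metrics` (read the Trial name from `KATIB_TRIAL_NAME`, turn the metrics dict into log records, push them, and raise `RuntimeError` on failure) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the SDK implementation: `report_metrics_sketch`, `build_metric_logs`, the payload layout, and the `send` callable standing in for the gRPC call to the Katib DB Manager are all hypothetical names introduced here.

```python
import os
from datetime import datetime, timezone
from typing import Any, Callable, Dict, List


def build_metric_logs(metrics: Dict[str, Any]) -> List[Dict[str, str]]:
    """Turn {"loss": 0.01} into timestamped name/value records (hypothetical layout)."""
    timestamp = datetime.now(timezone.utc).isoformat()
    logs = []
    for name, value in metrics.items():
        # Reject values that cannot be interpreted as numbers.
        try:
            float(value)
        except (TypeError, ValueError):
            raise ValueError(f"Metric {name!r} has non-numeric value {value!r}")
        logs.append({"time_stamp": timestamp, "name": name, "value": str(value)})
    return logs


def report_metrics_sketch(
    metrics: Dict[str, Any],
    send: Callable[[Dict[str, Any], int], None],
    timeout: int = 120,
) -> Dict[str, Any]:
    """Push metrics for the current Trial through `send` (a stand-in for the gRPC call)."""
    # The proposal states that the Trial name arrives as an environment variable.
    trial_name = os.environ.get("KATIB_TRIAL_NAME")
    if trial_name is None:
        raise RuntimeError("KATIB_TRIAL_NAME is not set; not running inside a Trial?")

    payload = {"trial_name": trial_name, "metric_logs": build_metric_logs(metrics)}
    try:
        send(payload, timeout)
    except Exception as e:
        # Mirror the documented contract: failures surface as RuntimeError.
        raise RuntimeError(f"Unable to push metrics for Trial {trial_name}") from e
    return payload


if __name__ == "__main__":
    os.environ.setdefault("KATIB_TRIAL_NAME", "demo-trial")
    # A print-based sender replaces the real transport in this sketch.
    report_metrics_sketch({"loss": 0.01, "accuracy": 0.99}, lambda p, t: print(p, t))
```

Keeping the transport behind a callable is a choice made for this sketch so that it runs without a Katib cluster; in the proposal the push goes over gRPC to the address given by `db-manager-address`.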
@@ -164,6 +165,7 @@ if jobStatus.Condition == trialutil.JobSucceeded && instance.Status.Observation return errMetricsNotReported } ``` + 1. Distinguish pull-based and push-based metrics collection We decide to add an if-else statement in the code above to distinguish pull-based and push-based metrics collection. In the push-based collection, the Trial does not need to be requeued. Instead, we'll insert an unavailable value to Katib DB. @@ -172,7 +174,6 @@ We decide to add an if-else statement in the code above to distinguish pull-based In the current implementation of pull-based metrics collection, Trials will be re-queued when the metrics collector finds the `.Status.Observation` is empty. However, it's not compatible with push-based metrics collection because the forgotten metrics won't be reported in the new round of reconcile. So, we need to update its status in the function `UpdateTrialStatusCondition` in accommodation with the pull-based metrics collection. The following code will be inserted into lines before [trial_controller_util.go#L69](https://github.com/kubeflow/katib/blob/7959ffd54851216dbffba791e1da13c8485d1085/pkg/controller.v1beta1/trial/trial_controller_util.go#L69) - ```Golang else if instance.Spec.MetricCollector.Collector.Kind == "Push" { ... // Update the status of this Trial to `MetricsUnavailable` and output the reason. 
diff --git a/docs/images/push-based-metrics-collection.png b/docs/proposals/2340-push-based-metrics-collector/push-based-metrics-collection.png similarity index 100% rename from docs/images/push-based-metrics-collection.png rename to docs/proposals/2340-push-based-metrics-collector/push-based-metrics-collection.png diff --git a/docs/proposals/parameter-distribution.md b/docs/proposals/2374-parameter-distribution/README.md similarity index 99% rename from docs/proposals/parameter-distribution.md rename to docs/proposals/2374-parameter-distribution/README.md index ebe062c02d0..b6ca0fbc499 100644 --- a/docs/proposals/parameter-distribution.md +++ b/docs/proposals/2374-parameter-distribution/README.md @@ -1,4 +1,4 @@ -# Proposal for Supporting various parameter distributions in Katib +# KEP-2374: Proposal for Supporting various parameter distributions in Katib ## Summary The goal of this project is to enhance the existing Katib Experiment APIs to support various parameter distributions such as uniform, log-uniform, and qlog-uniform. Then extend the suggestion services to be able to configure distributions for search space using libraries provided in each framework. 
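To make the distributions named in the summary concrete, here is a minimal sketch of how uniform, log-uniform, and quantized log-uniform sampling are commonly defined. It is not Katib code: the function names are hypothetical, and the quantization rule (round to the nearest multiple of `q`) is an assumption borrowed from the convention used by libraries such as Hyperopt.

```python
import math
import random


def sample_uniform(rng: random.Random, low: float, high: float) -> float:
    """Draw a value with equal density everywhere in [low, high]."""
    return rng.uniform(low, high)


def sample_log_uniform(rng: random.Random, low: float, high: float) -> float:
    """Draw a value whose logarithm is uniform; requires 0 < low < high."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


def sample_qlog_uniform(rng: random.Random, low: float, high: float, q: float) -> float:
    """Log-uniform draw rounded to the nearest multiple of q.

    Rounding can move the result slightly outside [low, high].
    """
    value = sample_log_uniform(rng, low, high)
    return round(value / q) * q


if __name__ == "__main__":
    rng = random.Random(0)
    # A learning rate spanning several orders of magnitude suits log-uniform.
    print(sample_log_uniform(rng, 1e-4, 1e-1))
    # A batch-size-like parameter restricted to multiples of 8.
    print(sample_qlog_uniform(rng, 8, 512, 8))
```

Log-uniform sampling gives each order of magnitude the same probability mass, which is why it is usually preferred over plain uniform sampling for parameters such as learning rates.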
diff --git a/docs/proposals/suggestion.md b/docs/proposals/507-suggestion-crd/README.md
similarity index 84%
rename from docs/proposals/suggestion.md
rename to docs/proposals/507-suggestion-crd/README.md
index 75f44e83539..b083cbd1c5f 100644
--- a/docs/proposals/suggestion.md
+++ b/docs/proposals/507-suggestion-crd/README.md
@@ -1,28 +1,27 @@
-# Suggestion CRD Design Document
-
-Table of Contents
-=================
-
-   * [Suggestion CRD Design Document](#suggestion-crd-design-document)
-   * [Table of Contents](#table-of-contents)
-      * [Background](#background)
-      * [Goals](#goals)
-      * [Non-Goals](#non-goals)
-      * [Design](#design)
-         * [Kubernetes API](#kubernetes-api)
-         * [GRPC API](#grpc-api)
-         * [Workflow](#workflow)
-      * [Example](#example)
-      * [Algorithm Supports](#algorithm-supports)
-         * [Random](#random)
-         * [Grid](#grid)
-         * [Bayes Optimization](#bayes-optimization)
-         * [HyperBand](#hyperband)
-         * [BOHB](#bohb)
-         * [TPE](#tpe)
-         * [SMAC](#smac)
-         * [CMA-ES](#cma-es)
-         * [Sobol](#sobol)
+# KEP-507: Suggestion CRD Design Document
+
+# Table of Contents
+
+- [Suggestion CRD Design Document](#suggestion-crd-design-document)
+- [Table of Contents](#table-of-contents)
+  - [Background](#background)
+  - [Goals](#goals)
+  - [Non-Goals](#non-goals)
+  - [Design](#design)
+    - [Kubernetes API](#kubernetes-api)
+    - [GRPC API](#grpc-api)
+    - [Workflow](#workflow)
+  - [Example](#example)
+  - [Algorithm Supports](#algorithm-supports)
+    - [Random](#random)
+    - [Grid](#grid)
+    - [Bayes Optimization](#bayes-optimization)
+    - [HyperBand](#hyperband)
+    - [BOHB](#bohb)
+    - [TPE](#tpe)
+    - [SMAC](#smac)
+    - [CMA-ES](#cma-es)
+    - [Sobol](#sobol)

 Created by [gh-md-toc](https://github.com/ekalinin/github-markdown-toc)

@@ -118,7 +117,7 @@ message ExperimentSpec {
 }

 message ParameterSpecs {
-  repeated ParameterSpec parameters = 1;
+  repeated ParameterSpec parameters = 1;
 }

 message AlgorithmSpec {
@@ -228,28 +227,28 @@ spec:
     algorithmName: random
   trialTemplate:
     goTemplate:
-        rawTemplate: |-
-          apiVersion: batch/v1
-          kind: Job
-          metadata:
-            name: {{.Trial}}
-            namespace: {{.NameSpace}}
-          spec:
-            template:
-              spec:
-                containers:
-                - name: {{.Trial}}
-                  image: katib/mxnet-mnist-example
-                  command:
-                  - "python"
-                  - "/mxnet/example/image-classification/train_mnist.py"
-                  - "--batch-size=64"
-                  {{- with .HyperParameters}}
-                  {{- range .}}
-                  - "{{.Name}}={{.Value}}"
-                  {{- end}}
-                  {{- end}}
-                restartPolicy: Never
+      rawTemplate: |-
+        apiVersion: batch/v1
+        kind: Job
+        metadata:
+          name: {{.Trial}}
+          namespace: {{.NameSpace}}
+        spec:
+          template:
+            spec:
+              containers:
+              - name: {{.Trial}}
+                image: katib/mxnet-mnist-example
+                command:
+                - "python"
+                - "/mxnet/example/image-classification/train_mnist.py"
+                - "--batch-size=64"
+                {{- with .HyperParameters}}
+                {{- range .}}
+                - "{{.Name}}={{.Value}}"
+                {{- end}}
+                {{- end}}
+              restartPolicy: Never
   parameters:
     - name: --lr
       parameterType: double
@@ -265,9 +264,9 @@ spec:
       parameterType: categorical
       feasibleSpace:
         list:
-        - sgd
-        - adam
-        - ftrl
+          - sgd
+          - adam
+          - ftrl
 ```

 Then, Experiment controller needs 3 parallel trials to run. It creates the Suggestions:
diff --git a/docs/proposals/metrics-collector.md b/docs/proposals/685-metrics-collector/README.md
similarity index 98%
rename from docs/proposals/metrics-collector.md
rename to docs/proposals/685-metrics-collector/README.md
index 4a780c621c3..fd89fe5fcc0 100644
--- a/docs/proposals/metrics-collector.md
+++ b/docs/proposals/685-metrics-collector/README.md
@@ -1,4 +1,4 @@
-# Metrics Collector Proposal
+# KEP-685: Metrics Collector Proposal

 - [Metrics Collector Proposal](#metrics-collector-proposal)
   - [Links](#links)
@@ -33,7 +33,7 @@ In the new design, Katib use mutating webhook to inject metrics collector contai
 The sidecar collects metrics of the master and then store them on the persistent layer (e.x. katib-db-manager and metadata server).
-
+

 Fig. 1 Architecture of the new design

diff --git a/docs/images/metrics-collector-design.png b/docs/proposals/685-metrics-collector/metrics-collector-design.png
similarity index 100%
rename from docs/images/metrics-collector-design.png
rename to docs/proposals/685-metrics-collector/metrics-collector-design.png
diff --git a/docs/release/README.md b/docs/release/README.md
index 75823be8068..74291c3d2a8 100644
--- a/docs/release/README.md
+++ b/docs/release/README.md
@@ -4,7 +4,7 @@ This is the instruction on how to make a new release for the Katib project.

 ## Prerequisite

-- Tools, defined in the [Developer Guide](./../developer-guide.md#requirements).
+- Tools, defined in the [Contributing Guide](./../../CONTRIBUTING.md#requirements).
 - [Write](https://docs.github.com/en/organizations/managing-access-to-your-organizations-repositories/repository-permission-levels-for-an-organization#permission-levels-for-repositories-owned-by-an-organization)
   permission for the Katib repository.

diff --git a/docs/workflow-design.md b/docs/workflow-design.md
index 76dac322bc0..006b606a3ab 100644
--- a/docs/workflow-design.md
+++ b/docs/workflow-design.md
@@ -1,5 +1,7 @@
 # How Katib v1beta1 tunes hyperparameters automatically in a Kubernetes native way

+TODO (andreyvelich): This doc is out of date. We should update it and move to the Kubeflow website.
+
 Follow the Kubeflow documentation guides:

 - [Concepts](https://www.kubeflow.org/docs/components/katib/overview/)