28 Jan 15:58

v1.9.0

6f74c7f

v1.9.0 release Latest

This is the Training Operator v1.9.0 release.

This release introduces a new JAXJob, enabling seamless distributed training with JAX.
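
As a rough illustration of what a JAXJob might look like, the sketch below builds a manifest as a plain Python dict. It assumes the JAXJob spec follows the same shape as the other v1 job kinds (a `jaxReplicaSpecs` map with a `Worker` replica type and a container named `jax`); the helper name `build_jaxjob`, the image, and the command are hypothetical, so check the field names against the CRD shipped with this release before relying on them.

```python
import json


def build_jaxjob(name: str, image: str, workers: int = 2, namespace: str = "default") -> dict:
    """Return a JAXJob manifest as a plain dict (illustrative values only)."""
    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "JAXJob",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # Assumed field name, mirroring pytorchReplicaSpecs / tfReplicaSpecs.
            "jaxReplicaSpecs": {
                "Worker": {
                    "replicas": workers,
                    "restartPolicy": "OnFailure",
                    "template": {
                        "spec": {
                            "containers": [
                                {
                                    # Assumed default container name for JAXJob.
                                    "name": "jax",
                                    "image": image,
                                    # Hypothetical entry point inside the image.
                                    "command": ["python3", "train.py"],
                                }
                            ]
                        }
                    },
                }
            }
        },
    }


if __name__ == "__main__":
    # Hypothetical image reference, used only to render the manifest.
    manifest = build_jaxjob("jaxjob-demo", "example.com/jax-train:latest")
    print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved to a file and applied with `kubectl apply -f`, since Kubernetes accepts JSON manifests as well as YAML.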

Additionally, it adds the managedBy API to streamline the orchestration of training Jobs in multi-cluster environments using MultiKueue.
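
A minimal sketch of how the managedBy setting could be applied to an existing job manifest is shown below. It assumes the field lives at `spec.runPolicy.managedBy` and that MultiKueue is selected with the value `kueue.x-k8s.io/multikueue`; both are assumptions to verify against the Training Operator and Kueue documentation for your versions. The helper `with_managed_by` is hypothetical.

```python
import copy

# Assumed manager value for MultiKueue; confirm it in the Kueue documentation.
MULTIKUEUE_MANAGER = "kueue.x-k8s.io/multikueue"


def with_managed_by(job: dict, manager: str = MULTIKUEUE_MANAGER) -> dict:
    """Return a copy of a training job manifest with runPolicy.managedBy set.

    The input dict is left untouched so the same base manifest can be reused
    for clusters that are not managed by an external controller.
    """
    patched = copy.deepcopy(job)
    run_policy = patched.setdefault("spec", {}).setdefault("runPolicy", {})
    run_policy["managedBy"] = manager
    return patched


if __name__ == "__main__":
    base = {
        "apiVersion": "kubeflow.org/v1",
        "kind": "PyTorchJob",
        "metadata": {"name": "demo"},
        "spec": {"pytorchReplicaSpecs": {}},
    }
    print(with_managed_by(base)["spec"]["runPolicy"])
```

Copying the manifest rather than mutating it keeps the delegation decision explicit at the call site.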

Breaking Changes

Upgrade Kubernetes to v1.31.3 (#2330 by @astefanutti)
Upgrade Kubernetes to v1.30.7 (#2332 by @astefanutti)
Update the name of PVC in train API (#2187 by @helenxie-bit)
Remove support for MXJob (#2150 by @tariq-hasan)
Support Python 3.11 and Drop Python 3.7 (#2105 by @tenzen-y)

New Features

Distributed JAX

Add JAX controller (#2194 by @sandipanpanda)
Add JAX API (#2163 by @sandipanpanda)
JAX Integration Enhancement Proposal (#2125 by @sandipanpanda)
JAX example for MNIST SPMD and add CI testing (#2390 by @saileshd1402)

New Examples

FSDP Example for T5 Fine-Tuning and PyTorchJob (#2286 by @andreyvelich)
Add DeepSpeed Example with Pytorch Operator (#2235 by @Syulin7)

Control Plane Updates

Validate pytorchjob workers are configured when elasticpolicy is configured (#2320 by @tarat44)
[Feature] Support managed by external controller (#2203 by @mszadkow)
Update trainer to ensure type consistency for train_args and lora_config (#2181 by @helenxie-bit)
Support ARM64 platform in TensorFlow examples (#2119 by @akhilsaivenkata)
Feat: Support ARM64 platform in XGBoost examples (#2114 by @tico88612)
ARM64 supported in PyTorch examples (#2116 by @danielsuh05)

SDK Updates

[SDK] Adding env vars (#2285 by @tarekabouzeid)
[SDK] Use torchrun to create PyTorchJob from function (#2276 by @andreyvelich)
[SDK] move env var to constants.py (#2268 by @varshaprasad96)
[SDK] Allow customising base trainer and storage images in Train API (#2261 by @varshaprasad96)
[SDK] Read namespace from the current context (#2255 by @andreyvelich)
[SDK] Sync Transformers version for train API (#2146 by @andreyvelich)
[SDK] Explain Python version support cycle (#2144 by @andreyvelich)

Kubeflow Trainer V2

KEP-2170: Kubeflow Training V2 API (#2171 by @andreyvelich)
KEP-2170: Update V2 KEP with MPI Runtime info (#2345 by @andreyvelich)
Always update TrainJob status on errors (#2352 by @astefanutti)
Fix TrainJob status comparison and update (#2353 by @astefanutti)
Add required RBAC on TrainJob finalizer sub-resources (#2350 by @astefanutti)
KEP-2170: [SDK] Initial implementation of the Kubeflow Training V2 Python SDK (#2324 by @andreyvelich)
KEP-2170: Add Torch Distributed Runtime (#2328 by @andreyvelich)
KEP-2170: Add TrainJob conditions (#2322 by @tenzen-y)
KEP-2170: Add the TrainJob state transition design (#2298 by @tenzen-y)
KEP-2170: Implement Initializer builders in the JobSet plugin (#2316 by @andreyvelich)
KEP-2170: Implement JobSet, PlainML, and Torch Plugins (#2308 by @andreyvelich)
KEP-2170: Create model and dataset initializers (#2303 by @andreyvelich)
KEP-2170: Generate Python SDK for Kubeflow Training V2 (#2310 by @andreyvelich)
KEP-2170: Initialize runtimes before the manager starts (#2306 by @tenzen-y)
KEP-2170: Strictly verify the CRD marker validation and defaulting in the integration testings (#2304 by @tenzen-y)
KEP-2170: Decouple JobSet from TrainJob (#2296 by @tenzen-y)
KEP-2170: Implement TrainJob Reconciler to manage objects (#2295 by @tenzen-y)
KEP-2170: Add manifests for Kubeflow Training V2 (#2289 by @andreyvelich)
KEP-2170: Adding CEL validations on v2 TrainJob CRD (#2260 by @akshaychitneni)
KEP-2170: Rename TrainingRuntimeRef to RuntimeRef API (#2283 by @andreyvelich)
KEP-2170: Implement runtime framework (#2248 by @tenzen-y)
[v2alpha] Move GV related codebase (#2281 by @varshaprasad96)
KEP-2170: Generate clientset, openapi spec for the V2 APIs (#2273 by @varshaprasad96)
KEP-2170: Implement skeleton webhook servers (#2251 by @tenzen-y)
KEP-2170: Initial Implementations for v2 Manager (#2236 by @tenzen-y)
KEP-2170: Generate CRD manifests for v2 CustomResources (#2237 by @tenzen-y)
KEP-2170: Update Training V2 APIs in the KEP (#2240 by @andreyvelich)
KEP-2170: Add TrainJob and TrainingRuntime APIs (#2223 by @andreyvelich)
KEP-2170: Bind repository into the build environment instead of filecopy (#2222 by @tenzen-y)
KEP-2170: Add directories for the V2 APIs (#2221 by @andreyvelich)
KEP-2170: Add the apiGroup to the TrainingRuntimeRef (#2201 by @tenzen-y)
KEP-2170: Make API specification more restricting (#2198 by @tenzen-y)

Bug Fixes

[release-1.9] V1: Fix versions in HuggingFace dataset initializer (#2370 by @andreyvelich)
Pin accelerate package version in trainer (#2340 by @gavrissh)
[fix] Resolve v2alpha API exceptions (#2317 by @varshaprasad96)
[SDK] Minor fix in wait_for_job_conditions with job_kind python training API (#2265 by @saileshd1402)
[SDK] Fix typo of "get_pvc_spec" (#2250 by @helenxie-bit)
[Bug] Finish CleanupJob early if the job is suspended. (#2243 by @mszadkow)
[SDK] Fix trainer error: Update the version of base image and add "num_labels" for downloading pretrained models (#2230 by @helenxie-bit)
Update huggingface_hub Version in the storage initializer to fix ImportError (#2180 by @helenxie-bit)
[SDK] Fix Failed condition in wait Job API (#2160 by @andreyvelich)
fix volcano podgroup update issue (#2079 by @ckyuto)
[SDK] Fix Incorrect Events in get_job_logs API (#2122 by @andreyvelich)

Misc

[release-1.9] Add release branch to the image push trigger (#2377 by @andreyvelich)
Add e2e test for train API (#2199 by @helenxie-bit)
buildx link was broken ([#2356](https://github.com/kubeflow/training-operator/pul...

Contributors

astefanutti, johnugeorge, and 32 other contributors

10 Jan 23:27

andreyvelich

v1.9.0-rc.0

a0ae3b1

v1.9.0-rc.0 release Pre-release

This is the Training Operator v1.9.0-rc.0 pre-release.

Breaking Changes

Upgrade Kubernetes to v1.31.3 (#2330 by @astefanutti)
Upgrade Kubernetes to v1.30.7 (#2332 by @astefanutti)
Update the name of PVC in train API (#2187 by @helenxie-bit)
Remove support for MXJob (#2150 by @tariq-hasan)
Support Python 3.11 and Drop Python 3.7 (#2105 by @tenzen-y)

New Features

Distributed JAX

Add JAX controller (#2194 by @sandipanpanda)
Add JAX API (#2163 by @sandipanpanda)
JAX Integration Enhancement Proposal (#2125 by @sandipanpanda)

New Examples

FSDP Example for T5 Fine-Tuning and PyTorchJob (#2286 by @andreyvelich)
Add DeepSpeed Example with Pytorch Operator (#2235 by @Syulin7)

Control Plane Updates

Validate pytorchjob workers are configured when elasticpolicy is configured (#2320 by @tarat44)
[Feature] Support managed by external controller (#2203 by @mszadkow)
Update trainer to ensure type consistency for train_args and lora_config (#2181 by @helenxie-bit)
Support ARM64 platform in TensorFlow examples (#2119 by @akhilsaivenkata)
Feat: Support ARM64 platform in XGBoost examples (#2114 by @tico88612)
ARM64 supported in PyTorch examples (#2116 by @danielsuh05)

SDK Updates

[SDK] Adding env vars (#2285 by @tarekabouzeid)
[SDK] Use torchrun to create PyTorchJob from function (#2276 by @andreyvelich)
[SDK] move env var to constants.py (#2268 by @varshaprasad96)
[SDK] Allow customising base trainer and storage images in Train API (#2261 by @varshaprasad96)
[SDK] Read namespace from the current context (#2255 by @andreyvelich)
[SDK] Sync Transformers version for train API (#2146 by @andreyvelich)
[SDK] Explain Python version support cycle (#2144 by @andreyvelich)

Kubeflow Training V2

KEP-2170: Kubeflow Training V2 API (#2171 by @andreyvelich)
KEP-2170: Update V2 KEP with MPI Runtime info (#2345 by @andreyvelich)
Always update TrainJob status on errors (#2352 by @astefanutti)
Fix TrainJob status comparison and update (#2353 by @astefanutti)
Add required RBAC on TrainJob finalizer sub-resources (#2350 by @astefanutti)
KEP-2170: [SDK] Initial implementation of the Kubeflow Training V2 Python SDK (#2324 by @andreyvelich)
KEP-2170: Add Torch Distributed Runtime (#2328 by @andreyvelich)
KEP-2170: Add TrainJob conditions (#2322 by @tenzen-y)
KEP-2170: Add the TrainJob state transition design (#2298 by @tenzen-y)
KEP-2170: Implement Initializer builders in the JobSet plugin (#2316 by @andreyvelich)
KEP-2170: Implement JobSet, PlainML, and Torch Plugins (#2308 by @andreyvelich)
KEP-2170: Create model and dataset initializers (#2303 by @andreyvelich)
KEP-2170: Generate Python SDK for Kubeflow Training V2 (#2310 by @andreyvelich)
KEP-2170: Initialize runtimes before the manager starts (#2306 by @tenzen-y)
KEP-2170: Strictly verify the CRD marker validation and defaulting in the integration testings (#2304 by @tenzen-y)
KEP-2170: Decouple JobSet from TrainJob (#2296 by @tenzen-y)
KEP-2170: Implement TrainJob Reconciler to manage objects (#2295 by @tenzen-y)
KEP-2170: Add manifests for Kubeflow Training V2 (#2289 by @andreyvelich)
KEP-2170: Adding CEL validations on v2 TrainJob CRD (#2260 by @akshaychitneni)
KEP-2170: Rename TrainingRuntimeRef to RuntimeRef API (#2283 by @andreyvelich)
KEP-2170: Implement runtime framework (#2248 by @tenzen-y)
[v2alpha] Move GV related codebase (#2281 by @varshaprasad96)
KEP-2170: Generate clientset, openapi spec for the V2 APIs (#2273 by @varshaprasad96)
KEP-2170: Implement skeleton webhook servers (#2251 by @tenzen-y)
KEP-2170: Initial Implementations for v2 Manager (#2236 by @tenzen-y)
KEP-2170: Generate CRD manifests for v2 CustomResources (#2237 by @tenzen-y)
KEP-2170: Update Training V2 APIs in the KEP (#2240 by @andreyvelich)
KEP-2170: Add TrainJob and TrainingRuntime APIs (#2223 by @andreyvelich)
KEP-2170: Bind repository into the build environment instead of filecopy (#2222 by @tenzen-y)
KEP-2170: Add directories for the V2 APIs (#2221 by @andreyvelich)
KEP-2170: Add the apiGroup to the TrainingRuntimeRef (#2201 by @tenzen-y)
KEP-2170: Make API specification more restricting (#2198 by @tenzen-y)

Bug Fixes

[release-1.9] V1: Fix versions in HuggingFace dataset initializer (#2370 by @andreyvelich)
Pin accelerate package version in trainer (#2340 by @gavrissh)
[fix] Resolve v2alpha API exceptions (#2317 by @varshaprasad96)
[SDK] Minor fix in wait_for_job_conditions with job_kind python training API (#2265 by @saileshd1402)
[SDK] Fix typo of "get_pvc_spec" (#2250 by @helenxie-bit)
[Bug] Finish CleanupJob early if the job is suspended. (#2243 by @mszadkow)
[SDK] Fix trainer error: Update the version of base image and add "num_labels" for downloading pretrained models (#2230 by @helenxie-bit)
Update huggingface_hub Version in the storage initializer to fix ImportError (#2180 by @helenxie-bit)
[SDK] Fix Failed condition in wait Job API (#2160 by @andreyvelich)
fix volcano podgroup update issue (#2079 by @ckyuto)
[SDK] Fix Incorrect Events in get_job_logs API (#2122 by @andreyvelich)

Misc

[release-1.9] Add release branch to the image push trigger (#2377 by @andreyvelich)
Add e2e test for train API (#2199 by @helenxie-bit)
buildx link was broken (#2356 by @Veer0x1)
Upgrade helm/kind-action to v1.11.0 (#2357 by @astefanutti)
Upgrade Go version to v1.23 (#2302 by @tenzen-y)
Ensure code generation dependencies are downloaded (#2339 by @astefanutti)
Added test for create-pytorchjob.ipynb python notebook ([#2274](https://github.com/kubeflow/training-operator...

Contributors

astefanutti, johnugeorge, and 32 other contributors

10 Sep 15:14

andreyvelich

v1.8.1

0f8735f

v1.8.1 release

This is the Training Operator v1.8.1 release.

Bug Fixes

[Bug] Finish CleanupJob early if the job is suspended (#2243 by @mszadkow)
[SDK] Fix trainer error: Update the version of base image and add "num_labels" for downloading pretrained models (#2230 by @helenxie-bit)
Update huggingface_hub Version in the storage initializer to fix ImportError (#2180 by @helenxie-bit)

New Contributors

@mszadkow made their first contribution in #2243
@helenxie-bit made their first contribution in #2180

Contributors

mszadkow and helenxie-bit

23 Jul 18:10

andreyvelich

v1.8.0

f8687ca

v1.8.0 release

This is the Training Operator v1.8.0 release.

This release introduces a new Python API for LLMs Fine-Tuning that simplifies the ability to fine-tune foundational models using distributed PyTorch nodes.

Install the Kubeflow Training SDK as follows to try it:

pip install -U "kubeflow-training[huggingface]"

LLMs Fine-Tuning API

Train/Fine-tune API Proposal for LLMs (#1945 by @deepanker13)
[SDK] Train API for LLM Fine-Tuning (#1962 by @deepanker13)
Modify LLM Trainer to support BERT and Tiny LLaMA (#2031 by @andreyvelich)
Support arm64 for Hugging Face trainer (#2028 by @tariq-hasan)
Add Fine-Tune BERT LLM Example (#2021 by @andreyvelich)
Train api dataset download changes (#1959 by @deepanker13)
Train api init container creation (#1958 by @deepanker13)
[SDK] Add docstring for Train API (#2075 by @andreyvelich)

Breaking Changes

[SDK] Support Python 3.11 and Drop Python 3.7 (#2105 by @tenzen-y)
Support K8s v1.29 and Drop K8s v1.26 (#2039 by @tenzen-y)
Support K8s v1.28 and Drop K8s v1.25 (#2038 by @tenzen-y)
Deprecation Notice for MXJob (#2058 by @tenzen-y)
⚠️ Breaking Changes: Rename monitoring-port flag to webhook-server-port (#1925 by @afritzler)

New Features

Control Plane Updates

Upgrade scheduler-plugins to v0.28.9 (#2065 by @tenzen-y)
Implement webhook validations for the PaddleJob (#2057 by @tenzen-y)
Implement webhook validations for the XGBoostJob (#2052 by @tenzen-y)
Implement webhook validation for the TFJob (#2051 by @tenzen-y)
Implement webhook validations for the PyTorchJob (#2035 by @tenzen-y)
Upgrade PyTorchJob examples to PyTorch v2 (#2024 by @champon1020)
Upgrade Go version to v1.22 (#2046 by @tenzen-y)

SDK Improvements

[SDK] Add resources per worker for Create Job API (#1990 by @andreyvelich)
[SDK] Fix Worker and Master templates for PyTorchJob (#1988 by @andreyvelich)
[SDK] Get Kubernetes Events for Job (#1975 by @andreyvelich)
SDK: Upgrade the minimum required Kubernetes version to v1.27.2 (#2066 by @tenzen-y)
[SDK] Add information about TrainingClient logging (#1973 by @andreyvelich)
Training operator SDK unit test (#1938 by @deepanker13)
[SDK] Consolidate Naming for CRUD APIs (#1907 by @andreyvelich)

Bug Fixes

[SDK] Fix Failed condition in wait Job API (#2160 by @andreyvelich)
[SDK] Sync Transformers version for train API (#2147 by @andreyvelich)
[SDK] Changed package name to flake8 to fix pip install (#2140 by @tenzen-y)
[SDK] Fix Incorrect Events in get_job_logs API (#2138 by @tenzen-y)
Fix volcano podgroup update issue (#2079 by @ckyuto)
Fix import for HuggingFace Dataset Provider (#2085 by @andreyvelich)
Updated examples for train API (#2077 by @shruti2522)
Fail job for non-retryable exit codes (#2071 by @kellyaa)
E2E: Replace outdated images with latest ones (#2083 by @tenzen-y)
fix wrong filepath in the simple example command (#2062 by @qzoscar)
fix(example): add installation of python-etcd in Pytorch example (#2064 by @champon1020)
fix: Upgrade controller-gen to v0.14.0 (#2026 by @champon1020)
Fix build workflow config for pytorch-torchrun-example (#2020 by @PeterWrighten)
Fix Distributed Data Samplers in PyTorch Examples (#2012 by @andreyvelich)
Fix URL in python SDK setup.py (#2011 by @garymm)
Fix for Github CI to publish HF trainer image (#1987 by @johnugeorge)
train api jupyternotebook fix (#1984 by @deepanker13)
fix: volcano podgroup should have a non-empty queue name (#1977 by @lowang-bh)
Fix Master Label for PyTorchJob (#1974 by @andreyvelich)
IsMasterRole fix in pytorchjob controller (#1969 by @deepanker13)
[fix] replace ${go env GOPATH} with $(go env GOPATH) (#1952 by @double12gzh)
Fixing issues with providing existing service account (#1918 by @rpemsel)

Misc

Refine the integration tests for the immutable PyTorchJob (#2130 by @tenzen-y)
Update training operator image to latest (#2089 by @johnugeorge)
Update sdk to v1.8.0rc0 (#2087 by @johnugeorge)
Test: Simplify and Identify pod-controller envtest (#2084 by @tenzen-y)
Remove deadcode related to PodDisruptionBudget (#2073 by @tenzen-y)
docs: updating docs for local development (#2074 by @franciscojavierarceo)
PyTorchJob: Always show warnings when using elasticPolicy.nProcPerNode (#2067 by @tenzen-y)
Updated developer docs to include Kind (#2061 by @franciscojavierarceo)
adding fine tune example with s3 as the dataset store (#2006 by @deepanker13)
CI: Use a mode=min in the builder cache (#2053 by @tenzen-y)
Fix: upgrade version of crd-ref-docs, which caused panic with go v1.22 (#2043 by @jdcfd)
Remove Dockerfile.ppc64le of pytorch example (#2042 by @champon1020)
publish torchrun example via Dockerfile (#2018 by @PeterWrighten)
Updated examples/pytorch to disable istio sidecar injection (#2004 by @jdcfd)
[docs] development guide update (#1995 by @shashank-iitbhu)
Add Kubeflow Website links to README (#1983 by @andreyvelich)
publish trainer hugging face image (#1985 by @deepanker13)
Adding Training image needed for train api (#1963 by @deepanker13)
Add test to create PyTorchJob from func (#1979 by @andreyvelich)
Corrected Some Spelling And Grammatical Errors (#1980 by @daniel-hutao)
torchrun example with cpu version pytorch (#1965 by @kuizhiqing)
utils changes needed to add train api (#1954 by @deepanker13)
Adding parallel support for coveralls (#1956 by @johnugeorge)
chore: pkg import only once (#1950 by @testwill)
fix nproc env in elas...

Contributors

garymm, rpemsel, and 23 other contributors

28 Apr 18:37

johnugeorge

v1.8.0-rc.0

643af3d

v1.8.0-rc.0 release Pre-release

New features

Train/Fine-tune API Proposal for LLMs #1945 (deepanker13)
Adding Training image needed for train api #1963 (deepanker13)
[SDK] Train API #1962 (deepanker13)
Train api dataset download changes #1959 (deepanker13)
Train api init container creation #1958 (deepanker13)
Publish trainer hugging face image #1985 (deepanker13)
Support arm64 for Hugging Face trainer #2028 (tariq-hasan)
Modify LLM Trainer to support BERT and Tiny LLaMA #2031 (andreyvelich)
Implement webhook validations for the PyTorchJob #2035 (tenzen-y)
Implement webhook validations for the XGBoostJob #2052 (tenzen-y)
Implement webhook validation for the TFJob #2051 (tenzen-y)
Implement webhook warnings for the MXJob #2058 (tenzen-y)
Implement webhook validations for the PaddleJob #2057 (tenzen-y)
Fail job for non-retryable exit codes #2071 (kellyaa)
Adding fine tune example with s3 as the dataset store #2006 (deepanker13)

Bug fixes

fix nproc env in elastic mode for pytorchjob #1948 (kuizhiqing)
IsMasterRole fix in pytorchjob controller #1969 (deepanker13)
fix: volcano podgroup should have a non-empty queue name #1977 (lowang-bh)
Fix Master Label for PyTorchJob #1974 (andreyvelich)
[SDK] Fix Worker and Master templates for PyTorchJob #1988 (andreyvelich)
Fix import for HuggingFace Dataset Provider #2085 (andreyvelich)
Upgrade controller-gen to v0.14.0 #2026 (champon1020)
Fix Distributed Data Samplers in PyTorch Examples #2012 (andreyvelich)
Fix URL in python SDK setup.py #2011 (garymm)

Misc

Adding parallel support for coveralls #1956 (johnugeorge)
torchrun example with cpu version pytorch #1965 (kuizhiqing)
[SDK] Get Kubernetes Events for Job #1975 (andreyvelich)
Fix Master Label for PyTorchJob #1974 (andreyvelich)
[SDK] Add information about TrainingClient logging #1973 (andreyvelich)
PyTorchJob: Always show warnings when using elasticPolicy.nProcPerNode #2067 (tenzen-y)
SDK: Upgrade the minimum required Kubernetes version to v1.27.2 #2066 (tenzen-y)
Test: Simplify and Identify pod-controller envtest #2084 (tenzen-y)
E2E: Replace outdated images with latest ones #2083 (tenzen-y)
Upgrade scheduler-plugins to v0.28.9 #2065 (tenzen-y)

01 Nov 07:49

johnugeorge

v1.7.0

5525468

v1.7.0 release

Breaking Changes

Make scheduler-plugins the default gang scheduler. #1747 (Syulin7)
Upgrade the kubernetes dependencies to v1.27 #1834 (tenzen-y)

New features

Make scheduler-plugins the default gang scheduler. #1747 (Syulin7)
Merge kubeflow/common to training-operator #1813 (johnugeorge)
Auto-generate RBAC manifests by the controller-gen #1815 (Syulin7)
Implement suspend semantics #1859 (tenzen-y)
Set up controllers using goroutines to start the manager quickly #1869 (tenzen-y)
Set correct ENV for PytorchJob to support torchrun #1840 (kuizhiqing)

Bug fixes

Fix a bug that XGBoostJob's running condition isn't updated when the job is resumed #1866 (tenzen-y)
Set a Running condition when the XGBoostJob is completed and doesn't have a Running condition #1789 (tenzen-y)
Avoid to depend on local env when installing the code-generators #1810 (tenzen-y)

Misc

Removing reconciler code #1879 (johnugeorge)
Make Condition and ReplicaStatus optional #1862 (tenzen-y)
Use the same reasons for Condition and Event #1854 (tenzen-y)
Fully consolidate tfjob-operator to training-operator #1850 (tenzen-y)
Clean up /pkg/common/util/v1 #1845 (tenzen-y)
Refactoring tests in common/controller.v1 #1843 (tenzen-y)
remove duplicate code of add task spec annotation #1839 (lowang-bh)
fetch volcano log when e2e failed #1837 (lowang-bh)
Add check pods are not scheduled when testing gang-scheduler integrations in e2e #1835 (tenzen-y)
Replace dummy client with fake client #1818 (tenzen-y)
Add default Intel MPI env variables to MPIJob #1804 (tkatila)
Improve E2E tests for the gang-scheduling #1801 (tenzen-y)
xgb yaml container name should be consistent with xgb job default container name #1794 (Crisescode)
make timeout configurable from e2e tests #1787 (nagar-ajay)

07 Aug 13:00

johnugeorge

v1.7.0-rc.0

434cef7

v1.7.0-rc.0 release Pre-release

Breaking Changes

Make scheduler-plugins the default gang scheduler. #1747 (Syulin7)
Upgrade the kubernetes dependencies to v1.27 #1834 (tenzen-y)

New features

Make scheduler-plugins the default gang scheduler. #1747 (Syulin7)
Merge kubeflow/common to training-operator #1813 (johnugeorge)
Auto-generate RBAC manifests by the controller-gen #1815 (Syulin7)
Implement suspend semantics #1859 (tenzen-y)
Set up controllers using goroutines to start the manager quickly #1869 (tenzen-y)
Set correct ENV for PytorchJob to support torchrun #1840 (kuizhiqing)

Bug fixes

Fix a bug that XGBoostJob's running condition isn't updated when the job is resumed #1866 (tenzen-y)
Set a Running condition when the XGBoostJob is completed and doesn't have a Running condition #1789 (tenzen-y)
Avoid to depend on local env when installing the code-generators #1810 (tenzen-y)

Misc

Removing reconciler code #1879 (johnugeorge)
Make Condition and ReplicaStatus optional #1862 (tenzen-y)
Use the same reasons for Condition and Event #1854 (tenzen-y)
Fully consolidate tfjob-operator to training-operator #1850 (tenzen-y)
Clean up /pkg/common/util/v1 #1845 (tenzen-y)
Refactoring tests in common/controller.v1 #1843 (tenzen-y)
remove duplicate code of add task spec annotation #1839 (lowang-bh)
fetch volcano log when e2e failed #1837 (lowang-bh)
Add check pods are not scheduled when testing gang-scheduler integrations in e2e #1835 (tenzen-y)
Replace dummy client with fake client #1818 (tenzen-y)
Add default Intel MPI env variables to MPIJob #1804 (tkatila)
Improve E2E tests for the gang-scheduling #1801 (tenzen-y)
xgb yaml container name should be consistent with xgb job default container name #1794 (Crisescode)
make timeout configurable from e2e tests #1787 (nagar-ajay)

21 Mar 19:37

johnugeorge

v1.6.0

66aa635

v1.6.0 release

Note: Because scheduler-plugins changed its API group from sigs.k8s.io to x-k8s.io, future releases of the Training Operator (v1.7+) will not support scheduler-plugins v0.24.x or lower. Related: #1773

Note: The Python SDK v1.6 does not support earlier Training Operator versions. The minimum required Training Operator version is v1.6.0. Related: #1702

New Features

Support for k8s v1.25 in CI #1684 (johnugeorge)
HPA support for PyTorch Elastic #1701 (johnugeorge)
Adopting coscheduling plugin #1724 (tenzen-y)
Support for Paddlepaddle #1675 (kuizhiqing)
Create TFJob and PyTorchJob from Function APIs in the Training SDK #1659 (andreyvelich)
[SDK] Use Training Client without Kube Config #1740 (andreyvelich)
[SDK] Create Unify Training Client #1719 (andreyvelich)

Bug fixes

[SDK] pod has no metadata attr anymore in the get_job_logs() … #1760 (yaobaiwei)
Add PodGroup as controller watch source #1666 (ggaaooppeenngg)
fix infinite loop in init-pytorch container #1756 (kidddddddddddddddddddddd)
Fix the success condition of the job in PyTorchJob's Elastic mode. #1752 (Syulin7)
Fix XGBoost conditions bug #1737 (tenzen-y)
To fix scaledown error, upgrade PyTorch version to v1.13.1 in echo example #1733 (tenzen-y)
fix: support MxNet single host training when update mxJob status #1644 (PeterChg)
fix: fix mxnet failed to update StartTime and CompletionTime #1643 (PeterChg)
Fix the default LeaderElectionID and make it an argument #1639 (goyalankit)
fix: fix wrong parameter for resolveControllerRef #1583 (fighterhit)
fix: tfjob with restartPolicy=ExitCode not work #1562 (cheimu)
fix: Mac M1 compatible Dockerfile and bump TF version #1700 (terrytangyuan)
Fix status lost #1697 (ggaaooppeenngg)
handle all restart policies #1649 (abin-thomas-by)
[chore] fix typo #1648 (tenzen-y)

Misc

Add validation for verifying that the CustomJob (e.g., TFJob) name meets DNS1035 #1748 (tenzen-y)
Configure controller worker threads #1707 (HeGaoYuan)
Validation Spec consistency #1705 (HeGaoYuan)
[SDK] Remove Final Keyword from constants #1676 (andreyvelich)
Fix Python installation in CI #1759 (tenzen-y)
Update mpijob_controller.go #1755 (yshalabi)
Set the default value of CleanPodPolicy to None #1754 (Syulin7)
Update join Slack link #1750 (Syulin7)
Update latest operator image #1742 (johnugeorge)
Run E2E with various Python versions to verify Python SDK #1741 (tenzen-y)
Add Yuki to reviewer group #1739 (johnugeorge)
Trim down CRD descriptions #1735 (tenzen-y)
Add CI to build example images #1731 (tenzen-y)
Fix predicates of paddlepaddle-controller for scheduling.volcano.sh/v1beta1 PodGroup #1730 (tenzen-y)
Fix indents on examples for tensorflow #1726 (tenzen-y)
docs: Update Kubernetes requirement and version matrix #1721 (terrytangyuan)
chore: Update the use of MultiWorkerMirroredStrategy in TF #1715 (terrytangyuan)
Removing deprecated Job Labels #1702 (johnugeorge)
Bump certifi from 2022.9.14 to 2022.12.7 in /py/kubeflow/tf_operator #1699 (dependabot[bot])
Add myself to reviewer. #1689 (kuizhiqing)
Upgrade the envtest version #1687 (tenzen-y)
[chore] Upgrade some actions version #1686 (tenzen-y)
Upgrade Golangci-lint #1685 (johnugeorge)
Make a generic logger instead of the nil logger on dependent update #1680 (ggaaooppeenngg)
Bump protobuf from 3.8.0 to 3.18.3 in /py/kubeflow/tf_operator #1669 (dependabot[bot])
Removed GOARCH dependency for multiarch support #1674 (pranavpandit1)
Update deployment.yaml #1668 (OmriShiv)
Upgrade Go version to v1.19 #1663 (tenzen-y)
Upgrade kubernetes version for test #1667 (tenzen-y)
Adding support for linux/ppc64le in github actions for training-operator #1692 (amitmukati-2604)
style: Refine name and signature of 2 replicaName functions #1660 (houz42)
Update training operator sdk version to 1.5.0 #1651 (johnugeorge)
Add finalizers to cluster-role #1646 (ArangoGutierrez)
Update the cmd to support MPI operator in ReadME #1656 (denkensk)

Closed issues:

The default value for CleanPodPolicy is inconsistent. #1753
HPA support for PyTorch Elastic #1751
Bug: allowance of non DNS-1035 compliant PyTorchJob names results in service creation failures and missing state [#1745](https://github.com/kubeflow/t...
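
The DNS-1035 issue above comes down to a naming rule: job names feed into Service names, which must be DNS-1035 labels. A minimal sketch of that rule follows, assuming the standard Kubernetes definition (lowercase alphanumerics or `-`, starting with a letter, ending with an alphanumeric, at most 63 characters). The function name is hypothetical, and the operator's own validation may apply the check to generated names rather than the raw job name.

```python
import re

# DNS-1035 label: starts with a lowercase letter, ends with a letter or digit,
# and contains only lowercase letters, digits, and '-' in between.
_DNS1035_LABEL = re.compile(r"[a-z]([-a-z0-9]*[a-z0-9])?")
_DNS1035_MAX_LENGTH = 63


def is_dns1035_label(name: str) -> bool:
    """Return True if `name` is a valid DNS-1035 label."""
    if len(name) > _DNS1035_MAX_LENGTH:
        return False
    return _DNS1035_LABEL.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ("pytorch-job-1", "1job", "My-Job", "job-"):
        print(candidate, is_dns1035_label(candidate))
```

Running a check like this before submitting a job surfaces an invalid name at creation time instead of as a later Service creation failure.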

14 Feb 09:05

johnugeorge

v1.6.0-rc.1

27e5499

v1.6.0-rc.1 release Pre-release

Note: Because scheduler-plugins changed its API group from sigs.k8s.io to x-k8s.io, future releases of the Training Operator (v1.7+) will not support scheduler-plugins v0.24.x or lower.

Merged pull requests:

[SDK] pod has no metadata attr anymore in the get_job_logs() … #1760 (yaobaiwei)
Fix Python installation in CI #1759 (tenzen-y)
fix infinite loop in init-pytorch container #1756 (kidddddddddddddddddddddd)
Update mpijob_controller.go #1755 (yshalabi)
Set the default value of CleanPodPolicy to None #1754 (Syulin7)
Fix the success condition of the job in PyTorchJob's Elastic mode. #1752 (Syulin7)
Update join Slack link #1750 (Syulin7)
Add validation for verifying that the CustomJob (e.g., TFJob) name meets DNS1035 #1748 (tenzen-y)
Update latest operator image #1742 (johnugeorge)
Run E2E with various Python versions to verify Python SDK #1741 (tenzen-y)
[SDK] Use Training Client without Kube Config #1740 (andreyvelich)
Add Yuki to reviewer group #1739 (johnugeorge)
Fix XGBoost conditions bug #1737 (tenzen-y)
Add E2E test for gang-scheduling #1736 (tenzen-y)
Trim down CRD descriptions #1735 (tenzen-y)
To fix scaledown error, upgrade PyTorch version to v1.13.1 in echo example #1733 (tenzen-y)
Add CI to build example images #1731 (tenzen-y)
Fix predicates of paddlepaddle-controller for scheduling.volcano.sh/v1beta1 PodGroup #1730 (tenzen-y)
Fix indents on examples for tensorflow #1726 (tenzen-y)
Adopting coscheduling plugin #1724 (tenzen-y)
docs: Update Kubernetes requirement and version matrix #1721 (terrytangyuan)
[SDK] Create Unify Training Client #1719 (andreyvelich)
chore: Update the use of MultiWorkerMirroredStrategy in TF #1715 (terrytangyuan)
Configure controller worker threads #1707 (HeGaoYuan)
Validation Spec consistency #1705 (HeGaoYuan)
Removing deprecated Job Labels #1702 (johnugeorge)
HPA support for PyTorch Elastic #1701 (johnugeorge)
fix: Mac M1 compatible Dockerfile and bump TF version #1700 (terrytangyuan)
Bump certifi from 2022.9.14 to 2022.12.7 in /py/kubeflow/tf_operator #1699 (dependabot[bot])
Fix status lost #1697 (ggaaooppeenngg)
Adding support for linux/ppc64le in github actions for training-operator #1692 (amitmukati-2604)
Add myself to reviewer. #1689 (kuizhiqing)
Upgrade the envtest version #1687 (tenzen-y)
[chore] Upgrade some actions version #1686 (tenzen-y)
Upgrade Golangci-lint #1685 (johnugeorge)
Support for k8s v1.25 in CI #1684 (johnugeorge)
Make a generic logger instead of the nil logger on dependent update #1680 (ggaaooppeenngg)
[SDK] Remove Final Keyword from constants #1676 (andreyvelich)
[PaddlePaddle] support paddlejob #1675 (kuizhiqing)
Removed GOARCH dependency for multiarch support #1674 (pranavpandit1)
Bump protobuf from 3.8.0 to 3.18.3 in /py/kubeflow/tf_operator #1669 (dependabot[bot])
Update deployment.yaml #1668 (OmriShiv)
Upgrade kubernetes version for test #1667 (tenzen-y)
Add PodGroup as controller watch source #1666 (ggaaooppeenngg)
Upgrade Go version to v1.19 #1663 (tenzen-y)
style: Refine name and signature of 2 replicaName functions #1660 (houz42)
Create TFJob and PyTorchJob from Function APIs in the Training SDK #1659 (andreyvelich)
Update the cmd to support MPI operator in ReadME #1656 (denkensk)
Update training operator sdk version to 1.5.0 #1651 (johnugeorge)
handle all restart policies #1649 (abin-thomas-by)
[chore] fix typo #1648 (tenzen-y)
Add finalizers to cluster-role #1646 (ArangoGutierrez)
fix: support MxNet single host training when update mxJob status #1644 (PeterChg)
fix: fix mxnet failed to update StartTime and CompletionTime #1643 (PeterChg)
Fix the default LeaderElectionID and make it an argument #1639 (goyalankit)
fix: fix wrong parameter for resolveControllerRef #1583 (fighterhit)
fix: tfjob with restartPolicy=ExitCode not work #1562 (cheimu)

Closed issues:

The default value for CleanPodPolicy is inconsistent. #1753
HPA support for PyTorch Elastic #1751
Bug: allowance of non DNS-1035 compliant PyTorchJob names results in service creation failures and missing state #1745
paddle-operator can not get podgroup status(inqueue) with volcano when enable gang #1729
*job API(master) cannot compatible with old job [#1725](https://github.com/kubeflow/training-opera...

26 Jan 13:32

johnugeorge

v1.6.0-rc.0

b8004ae

v1.6.0-rc.0 release Pre-release

v1.6.0-rc.0 release

Releases: kubeflow/trainer
