
Releases: kubeflow/trainer

v1.9.0 release

28 Jan 15:58
6f74c7f

This is the Training Operator v1.9.0 release.

This release introduces a new JAXJob, enabling seamless distributed training with JAX.

Additionally, it adds the managedBy API to streamline the orchestration of training Jobs in multi-cluster environments using MultiKueue.

Breaking Changes

New Features

Distributed JAX

New Examples

Control Plane Updates

SDK Updates

Kubeflow Trainer V2

Bug Fixes

Misc


v1.9.0-rc.0 release

10 Jan 23:27
a0ae3b1
Pre-release

This is the Training Operator v1.9.0-rc.0 pre-release.

Breaking Changes

New Features

Distributed JAX

New Examples

Control Plane Updates

SDK Updates

Kubeflow Training V2

Bug Fixes

Misc


v1.8.1 release

10 Sep 15:14

This is the Training Operator v1.8.1 release.

Bug Fixes

  • [Bug] Finish CleanupJob early if the job is suspended (#2243 by @mszadkow)
  • [SDK] Fix trainer error: Update the version of base image and add "num_labels" for downloading pretrained models (#2230 by @helenxie-bit)
  • Update huggingface_hub Version in the storage initializer to fix ImportError (#2180 by @helenxie-bit)

New Contributors

v1.8.0 release

23 Jul 18:10
f8687ca

This is the Training Operator v1.8.0 release.

This release introduces a new Python API for LLM fine-tuning that makes it simpler to fine-tune foundation models using distributed PyTorch nodes.

Install the Kubeflow Training SDK as follows to try it:

pip install -U "kubeflow-training[huggingface]"

LLMs Fine-Tuning API

Breaking Changes

New Features

Control Plane Updates

SDK Improvements

Bug Fixes

Misc


v1.8.0-rc.0 release

28 Apr 18:37
643af3d
Pre-release

New features

Bug fixes

Misc

v1.7.0 release

01 Nov 07:49
5525468

Breaking Changes

  • Make scheduler-plugins the default gang scheduler. #1747 (Syulin7)
  • Upgrade the kubernetes dependencies to v1.27 #1834 (tenzen-y)

New features

Bug fixes

  • Fix a bug that XGBoostJob's running condition isn't updated when the job is resumed #1866 (tenzen-y)
  • Set a Running condition when the XGBoostJob is completed and doesn't have a Running condition #1789 (tenzen-y)
  • Avoid to depend on local env when installing the code-generators #1810 (tenzen-y)

Misc

v1.7.0-rc.0 release

07 Aug 13:00
434cef7
Pre-release

Breaking Changes

  • Make scheduler-plugins the default gang scheduler. #1747 (Syulin7)
  • Upgrade the kubernetes dependencies to v1.27 #1834 (tenzen-y)

New features

Bug fixes

  • Fix a bug that XGBoostJob's running condition isn't updated when the job is resumed #1866 (tenzen-y)
  • Set a Running condition when the XGBoostJob is completed and doesn't have a Running condition #1789 (tenzen-y)
  • Avoid to depend on local env when installing the code-generators #1810 (tenzen-y)

Misc

v1.6.0 release

21 Mar 19:37
66aa635

Note: Because scheduler-plugins changed its API group from sigs.k8s.io to x-k8s.io, future releases of the training operator (v1.7+) will not support scheduler-plugins v0.24.x or lower. Related: #1773

Note: The latest Python SDK (version 1.6) does not support earlier training operator versions. The minimum required training operator version is v1.6.0. Related: #1702
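
The two notes above amount to a small compatibility matrix. The sketch below is one possible way to encode them as a pre-upgrade check; it is not part of the training operator or its SDK. The function names (parse_version, check_compatibility) and the message strings are hypothetical, the parser ignores pre-release suffixes such as -rc.0, and the rules cover only what the notes state.

```python
import re
from typing import List, Optional, Tuple


def parse_version(tag: str) -> Tuple[int, int, int]:
    """Parse tags such as 'v1.6.0', '1.7.0-rc.0', or 'v0.24.9' into (major, minor, patch).

    Pre-release suffixes like '-rc.0' are ignored (a simplifying assumption).
    """
    match = re.match(r"v?(\d+)\.(\d+)(?:\.(\d+))?", tag)
    if match is None:
        raise ValueError(f"unrecognized version tag: {tag!r}")
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch or 0)


def check_compatibility(
    operator: str,
    sdk: Optional[str] = None,
    scheduler_plugins: Optional[str] = None,
) -> List[str]:
    """Return a list of problems implied by the two release notes (hypothetical helper)."""
    problems = []
    op = parse_version(operator)

    # Note 2: Python SDK 1.6 requires training operator v1.6.0 or later.
    if sdk is not None and parse_version(sdk)[:2] == (1, 6) and op < (1, 6, 0):
        problems.append("Python SDK 1.6 requires training operator v1.6.0 or later")

    # Note 1: training operator v1.7+ drops scheduler-plugins v0.24.x and lower.
    if (
        scheduler_plugins is not None
        and op[:2] >= (1, 7)
        and parse_version(scheduler_plugins)[:2] <= (0, 24)
    ):
        problems.append(
            "training operator v1.7+ does not support scheduler-plugins v0.24.x or lower"
        )

    return problems
```

For example, check_compatibility("v1.7.0", scheduler_plugins="v0.24.9") reports one problem, while the same call with an operator version of "v1.6.0" reports none.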

New Features

Bug fixes

Misc

Closed issues:

  • The default value for CleanPodPolicy is inconsistent. #1753
  • HPA support for PyTorch Elastic #1751
  • Bug: allowance of non DNS-1035 compliant PyTorchJob names results in service creation failures and missing state [#1745](https://github.com/kubeflow/t...

v1.6.0-rc.1 release

14 Feb 09:05
27e5499
Pre-release

Note: Because scheduler-plugins changed its API group from sigs.k8s.io to x-k8s.io, future releases of the training operator (v1.7+) will not support scheduler-plugins v0.24.x or lower.

Merged pull requests:

Closed issues:

  • The default value for CleanPodPolicy is inconsistent. #1753
  • HPA support for PyTorch Elastic #1751
  • Bug: allowance of non DNS-1035 compliant PyTorchJob names results in service creation failures and missing state #1745
  • paddle-operator can not get podgroup status(inqueue) with volcano when enable gang #1729
  • *job API(master) cannot compatible with old job [#1725](https://github.com/kubeflow/training-opera...

v1.6.0-rc.0 release

26 Jan 13:32
b8004ae
Pre-release
