[Epic][Improvement] Testing overhaul #368

JamesKunstle · 2024-12-16T19:53:29Z

This repo itself has only basic smoke tests. In instructlab/instructlab, there are workflow tests that consume this library and confirm that training isn't outright broken, which is a good start.

There are multiple levels of testing that we should aspire to cover.

Unit testing: These ought to prove that our utility functions (e.g. calculating packed batches with FFD) work, that our assertions (e.g. blocking unsupported model architectures) are enforced, and that our organizational logic (e.g. loading checkpoints and restarting from a given epoch) functions correctly.
Correctness and performance testing: If we're making changes to the core training loop, we ought to be able to quickly invoke a test that checks indicators like (a) the behavior of the loss curve and (b) per-iteration and per-epoch training time.
Hardware-stack testing: We now support five hardware runtime categories: CPU, MPS, Nvidia, AMD, Intel. We should be able to run appropriate tests on appropriate hardware without having to manually access machines and invoke tests ourselves.
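
To make the first two levels concrete, here is a minimal sketch of what such tests could look like. Everything in it is hypothetical: `pack_ffd` and `loss_is_decreasing` are stand-in helpers written for this example, not functions from this library, and the loss check is deliberately coarse (it only compares the mean of the first and last few values).

```python
from typing import List, Sequence


def pack_ffd(lengths: Sequence[int], capacity: int) -> List[List[int]]:
    """Hypothetical helper: pack sample lengths into bins with first-fit decreasing."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    bins: List[List[int]] = []
    free: List[int] = []  # remaining capacity of each bin, parallel to `bins`
    for length in sorted(lengths, reverse=True):
        if length > capacity:
            raise ValueError(f"sample of length {length} exceeds capacity {capacity}")
        for i, room in enumerate(free):
            if length <= room:
                # First bin with enough room wins.
                bins[i].append(length)
                free[i] -= length
                break
        else:
            # No existing bin fits, so open a new one.
            bins.append([length])
            free.append(capacity - length)
    return bins


def loss_is_decreasing(losses: Sequence[float], window: int = 3) -> bool:
    """Hypothetical check: mean of the last `window` losses is below the mean of the first."""
    if len(losses) < 2 * window:
        raise ValueError("not enough loss values for the requested window")
    head = sum(losses[:window]) / window
    tail = sum(losses[-window:]) / window
    return tail < head


# pytest-style tests: plain functions with bare asserts.
def test_pack_ffd_respects_capacity_and_keeps_all_samples():
    bins = pack_ffd([5, 3, 4, 2, 1], capacity=6)
    assert all(sum(b) <= 6 for b in bins)
    assert sorted(x for b in bins for x in b) == [1, 2, 3, 4, 5]


def test_pack_ffd_rejects_oversized_sample():
    try:
        pack_ffd([7], capacity=6)
    except ValueError:
        return
    raise AssertionError("expected ValueError for an oversized sample")


def test_loss_curve_trends_down():
    assert loss_is_decreasing([2.0, 1.8, 1.7, 1.2, 1.0, 0.9])
```

The tests use only bare `assert` statements, so pytest can collect them without any extra fixtures. A real loss-curve test would probably compare against a recorded baseline with a tolerance rather than a simple head/tail comparison.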

JamesKunstle · 2024-12-18T23:08:39Z

Child issues:

Add pytest unit testing to tox environments #377

JamesKunstle mentioned this issue Dec 16, 2024

[Epic] Training codebase maturity improvements (aka: the schlep) #362

Open
