
TST: Add regression tests #995

Closed

Conversation

BenjaminBossan
Member

This is a first step towards adding regression tests to the project. These tests allow us to load old adapters with new PEFT versions and ensure that the output generated by the model does not change.

The PR includes a framework for how to add regression artifacts and how to run regression tests based on those artifacts. Right now, only bnb + LoRA is covered, but it should be straightforward to add more tests.

Description

In general, for regression tests, we need two steps:

  1. Creating the regression artifacts, in this case the adapter checkpoint and the expected output of the model.
  2. Running the regression tests, i.e. loading the adapter and checking that the output of the model is the same as the expected output.

My approach is to re-use as much code as possible between those two steps. Therefore, the same test script can be used for both, with only an environment variable to distinguish between the two. Step 1 is invoked by calling:

`REGRESSION_CREATION_MODE=True pytest tests/regression/test_regression.py`

and to run the second step, we call:

`pytest tests/regression/test_regression.py`
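
To make the toggle concrete, here is a minimal, self-contained sketch of the pattern, with a toy linear model standing in for the PEFT model; the paths and names are illustrative, not the code added in this PR:

```python
# Sketch of the creation/check toggle: the same test either saves the
# expected output (creation mode) or compares against the saved file.
import os

import torch

CREATION_MODE = os.environ.get("REGRESSION_CREATION_MODE", "").lower() in ("true", "1")
ARTIFACT_DIR = "tests/regression/toy_linear/0.5.0"  # hypothetical artifact directory


def test_toy_regression():
    torch.manual_seed(0)
    model = torch.nn.Linear(4, 4)  # stand-in for "base model + adapter"
    output = model(torch.ones(1, 4)).detach()

    output_path = os.path.join(ARTIFACT_DIR, "output.pt")
    if CREATION_MODE:
        # step 1: persist the expected output (the real tests also save the adapter)
        os.makedirs(ARTIFACT_DIR, exist_ok=True)
        torch.save(output, output_path)
    else:
        # step 2: the output must match what an older PEFT version produced
        expected = torch.load(output_path)
        torch.testing.assert_close(output, expected)
```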

Creating regression artifacts

The first step will create an adapter checkpoint and an output for the given PEFT version and test setting in a new directory. For example, it will create a directory `tests/regression/lora_opt-125m_bnb_4bit/0.5.0/` that contains `adapter_model.bin` and `output.pt`.

Before this step runs, there is a check that the git repo is clean (no dirty worktree) and that the commit is tagged (i.e. corresponds to a release version of PEFT). Otherwise, we may accidentally create regression artifacts that do not correspond to any PEFT release.
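
For illustration, a guard along these lines could enforce both conditions; this is a sketch of the idea, not necessarily the exact check in the test script:

```python
# Refuse to create regression artifacts from a dirty or untagged checkout.
import subprocess


def assert_clean_tagged_checkout() -> None:
    # any output from `git status --porcelain` means the worktree is dirty
    status = subprocess.run(
        ["git", "status", "--porcelain"], capture_output=True, text=True, check=True
    )
    if status.stdout.strip():
        raise RuntimeError("Worktree is dirty; commit or stash your changes first.")

    # `git tag --points-at HEAD` lists the tags on the current commit (empty if none)
    tags = subprocess.run(
        ["git", "tag", "--points-at", "HEAD"], capture_output=True, text=True, check=True
    )
    if not tags.stdout.strip():
        raise RuntimeError("HEAD is not tagged; check out a release tag such as v0.5.0.")
```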

The easiest way to get such a clean state (say, for PEFT v0.5.0) is by checking out a tagged commit, e.g.:

`git checkout v0.5.0`

before running the first step.

The first step will also skip the creation of regression artifacts if they already exist.
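
That skip could look roughly like this; the helper name and directory layout are assumptions:

```python
# Skip artifact creation for settings that already have stored artifacts.
import os

import pytest


def maybe_skip_existing(artifact_dir: str) -> None:
    if os.path.exists(os.path.join(artifact_dir, "output.pt")):
        pytest.skip(f"Regression artifacts already exist in {artifact_dir}")
```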

It is possible to circumvent all the aforementioned checks by setting the environment variable `REGRESSION_FORCE_MODE` to True like so:

`REGRESSION_FORCE_MODE=True REGRESSION_CREATION_MODE=True pytest tests/regression/test_regression.py`

You should only do this if you know exactly what you're doing.

Running regression tests

The second step is much simpler. It loads the adapter and the expected output created in the first step, and compares that expected output to the output of a freshly created PEFT model using the loaded adapter. The two outputs should be the same.

If more than one version is discovered for a given test setting, all of them are tested.
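
As an illustration, discovering and checking all stored versions could look roughly like this; the directory layout and the compute_output callable are assumptions, not the PR's actual helpers:

```python
# Check the current code against every stored artifact version of a setting.
import os

import torch


def iter_version_dirs(setting_dir: str):
    # e.g. tests/regression/lora_opt-125m_bnb_4bit/0.5.0, .../0.6.0, ...
    for name in sorted(os.listdir(setting_dir)):
        path = os.path.join(setting_dir, name)
        if os.path.isdir(path):
            yield path


def check_all_versions(setting_dir: str, compute_output) -> None:
    for version_dir in iter_version_dirs(setting_dir):
        expected = torch.load(os.path.join(version_dir, "output.pt"))
        # compute_output should load the adapter from version_dir and run the model
        actual = compute_output(version_dir)
        torch.testing.assert_close(actual, expected)
```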

Notes

As is, the adapters are stored in the git repo itself. Since they're relatively small, the total size of the repo is still reasonable. It could be better to store those adapters on the HF Hub instead, but that would make things a bit more complicated (e.g. it is not obvious how to enumerate the versioned directories on the Hub).
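
For what it's worth, the versioned layout could probably be discovered on the Hub with something like the sketch below, which uses huggingface_hub.list_repo_files; the repo id is made up:

```python
# List <setting>/<version> pairs stored in a (hypothetical) Hub repo.
from huggingface_hub import list_repo_files


def discover_versions(repo_id: str = "peft-internal-testing/regression-artifacts"):
    versions = set()
    for path in list_repo_files(repo_id):
        parts = path.split("/")  # e.g. "lora_opt-125m_bnb_4bit/0.5.0/output.pt"
        if len(parts) == 3:
            setting, version, _filename = parts
            versions.add((setting, version))
    return sorted(versions)
```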

The regression tests included in this PR were used to check that #994 still allows loading checkpoints created with PEFT v0.5.0.

HuggingFaceDocBuilderDev commented Oct 5, 2023

The documentation is not available anymore as the PR was closed or merged.

@BenjaminBossan
Member Author

Tests are currently failing because bitsandbytes is not installed. Is there any specific reason why we don't have it for tests?

BenjaminBossan added a commit to BenjaminBossan/peft that referenced this pull request Nov 10, 2023
This PR supersedes huggingface#995; its description is copied and modified from that PR. For some technical reasons, it was easier for me to create a new PR than to update the previous one, sorry for that.

The regression tests included in this PR were used to check that huggingface#994 still allows loading checkpoints created with PEFT v0.6.1.
@BenjaminBossan
Member Author

Closing in favor of #1115
