Moving metrics to metrics and adding pass/fail LLM tests #186
A PR outlining the interface I have in mind for LLM tests. Right now, only the base classes exist, and there's only one test, but it's mainly to show the code structure. More tests to come once the general structure is agreed upon! A lot of this PR is going through and making the code changes to metrics that silo them into metrics. I have the following working definitions:
A metric is a function that takes an LLM prediction, prompt (optional), and gold-standard output (optional) and computes a scalar value. These values can be aggregated meaningfully over a dataset of predictions to shed light on properties of a model.
A test is a function that takes an LLM prediction, prompt (optional), and gold-standard output (optional) and computes a boolean pass/fail result that denotes whether the model output meets certain criteria. Tests are less about understanding an LLM's properties and more about assuring it can meet specific requirements with specific prompts.
There can be overlapping underlying functionality between metrics and tests.
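To make the distinction concrete, here is a rough sketch of how the two definitions could look side by side. Every name in it (`Metric`, `LLMTest`, `WordCountMetric`, `MaxWordCountTest`, `word_count`) is hypothetical and chosen for illustration; they are not the actual classes in this PR.

```python
from abc import ABC, abstractmethod
from statistics import mean
from typing import Optional


def word_count(text: str) -> int:
    """Shared helper: the overlapping functionality used by both classes below."""
    return len(text.split())


class Metric(ABC):
    """Hypothetical base class: maps one prediction to a scalar value."""

    @abstractmethod
    def compute(self, prediction: str, prompt: Optional[str] = None,
                gold: Optional[str] = None) -> float:
        ...


class LLMTest(ABC):
    """Hypothetical base class: maps one prediction to pass (True) or fail (False)."""

    @abstractmethod
    def run(self, prediction: str, prompt: Optional[str] = None,
            gold: Optional[str] = None) -> bool:
        ...


class WordCountMetric(Metric):
    """Scalar metric: number of whitespace-separated words in the prediction."""

    def compute(self, prediction, prompt=None, gold=None) -> float:
        return float(word_count(prediction))


class MaxWordCountTest(LLMTest):
    """Pass/fail test: the prediction must not exceed a word limit."""

    def __init__(self, max_words: int):
        self.max_words = max_words

    def run(self, prediction, prompt=None, gold=None) -> bool:
        return word_count(prediction) <= self.max_words


predictions = ["Paris", "The capital of France is Paris"]

# A metric is aggregated over the dataset to describe the model.
metric = WordCountMetric()
average_length = mean(metric.compute(p) for p in predictions)

# A test is checked per prediction against a fixed requirement.
test = MaxWordCountTest(max_words=3)
results = [test.run(p) for p in predictions]

print(average_length)  # 3.5
print(results)         # [True, False]
```

Both classes call the same `word_count` helper, which is the kind of overlap mentioned above: the metric reports the value, while the test compares it against a threshold.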
As of right now, there's no way for the user to specify their test suite, but that will come in the form of a spreadsheet interface (next PR). Advanced users can use the library in their own code.