discuss benchmark
slobentanzer committed Feb 14, 2024
1 parent 490c003 commit fc9e021
4 changes: 4 additions & 0 deletions content/30.discussion.md
@@ -11,6 +11,9 @@ To account for the requirements of biomedical research workflows, we take partic
We achieve this goal by implementing a living benchmarking framework that allows the automated evaluation of LLMs, prompts, and other components ([https://biochatter.org/benchmark/](https://biochatter.org/benchmark/)).
Even the most recent, biomedicine-specific benchmarking efforts are small-scale, manual approaches that do not cover the full matrix of possible combinations of components, and many benchmarks are performed through the web interfaces of LLMs, which obfuscates important parameters such as model version and temperature [@biollmbench].
Such a framework is therefore a necessary step towards the objective and reproducible evaluation of LLMs, and its results are a useful starting point for investigating why some models perform differently than expected (a minimal sketch of such a matrix-style evaluation is given after this paragraph).
For instance, the benchmark allowed immediate flagging of the drop in performance from the older (0613) to the newer (0125) version of gpt-4.
It also identified a range of pre-trained open-source models suitable for our use cases, most notably the openhermes-2.5 model in 4- or 5-bit quantisation.
This model is a variant of Mistral 7B v0.1 fine-tuned on GPT-4-generated data; the vanilla Mistral variants perform considerably worse in our benchmarks.
We prevent leakage of the benchmark datasets into the training data of new models by encrypting them (illustrated below), which is essential for the sustainability of the benchmark as new models are released.
The living benchmark will be updated with new questions and tasks as they arise in the community.
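
A minimal sketch of the matrix-style evaluation referenced above, using Pytest parametrisation; the model, prompt, and task identifiers and the `run_task` scorer are hypothetical placeholders for illustration, not the actual BioChatter benchmark code:

```python
import itertools

import pytest

# Hypothetical identifiers for illustration; the real benchmark defines its own
# matrix of models, prompts, and tasks.
MODELS = ["gpt-4-0613", "gpt-4-0125-preview", "openhermes-2.5"]
PROMPTS = ["baseline", "explicit-instructions"]
TASKS = ["entity_extraction", "query_generation"]


def run_task(model: str, prompt: str, task: str) -> float:
    """Placeholder scorer: a real runner would query the model and grade its answer."""
    return 1.0


@pytest.mark.parametrize(
    "model,prompt,task", itertools.product(MODELS, PROMPTS, TASKS)
)
def test_benchmark_case(model, prompt, task):
    # Every combination in the matrix becomes an individual, automatically
    # executed benchmark case whose score can be recorded.
    score = run_task(model, prompt, task)
    assert 0.0 <= score <= 1.0
```

Running Pytest then executes and reports every cell of the matrix, which is what makes comparisons across models, prompts, and tasks reproducible.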
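
The encryption step mentioned above can be illustrated with a minimal sketch, assuming the `cryptography` package's Fernet recipe; this shows the principle of storing benchmark questions in encrypted form, not the exact mechanism used in the repository:

```python
from cryptography.fernet import Fernet

# In practice, the key would be managed so that encrypted benchmark files can be
# published without exposing their content to web crawlers and training pipelines.
key = Fernet.generate_key()
cipher = Fernet(key)

question = b"Which gene is most strongly associated with condition X?"  # hypothetical item
encrypted = cipher.encrypt(question)   # the form that is committed and published
decrypted = cipher.decrypt(encrypted)  # the form the benchmark runner evaluates
assert decrypted == question
```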

@@ -29,6 +32,7 @@ They are not meant to replace human ingenuity and expertise but to augment it wi
Relying on generic open-source libraries such as LangChain [@langchain] and Pytest [@pytest] allows us to focus on the biomedical domain, but it also introduces technical dependencies on these libraries.
While we support those upstream libraries via pull requests, we depend on their maintainers for future updates.
In addition, keeping up with these rapid developments demands considerable developer time, which is only sustainable in a community-driven open-source effort.
For our framework to remain relevant, its components, such as the benchmark, must be maintained as the field evolves.

### Future directions

