diff --git a/content/30.discussion.md b/content/30.discussion.md
index 1c61ac3..e12b29b 100644
--- a/content/30.discussion.md
+++ b/content/30.discussion.md
@@ -11,6 +11,9 @@ To account for the requirements of biomedical research workflows, we take partic
 We achieve this goal by implementing a living benchmarking framework that allows the automated evaluation of LLMs, prompts, and other components ([https://biochatter.org/benchmark/](https://biochatter.org/benchmark/)).
 Even the most recent and biomedicine-specific benchmarking efforts are small-scale manual approaches that do not consider the full matrix of possible combinations of components, and many benchmarks are performed by accessing web interfaces of LLMs, which obfuscates important parameters such as model version and temperature [@biollmbench].
 As such, a framework is a necessary step towards the objective and reproducible evaluation of LLMs, and its results are a great starting point for delving deeper into the reasons why some models perform differently than expected.
+For instance, the benchmark immediately flagged the performance drop from the older (0613) to the newer (0125) version of gpt-4.
+It also identified a range of pre-trained open-source models suitable for our use cases, most notably the openhermes-2.5 model in 4- or 5-bit quantisation.
+This model is a variant of Mistral 7B v0.1 fine-tuned on GPT-4-generated data; notably, the vanilla Mistral 7B variants perform considerably worse in our benchmarks.
 We prevent data leakage from the benchmark datasets into the training data of new models by encryption, which is essential for the sustainability of the benchmark as new models are released.
 The living benchmark will be updated with new questions and tasks as they arise in the community.
 
@@ -29,6 +32,7 @@ They are not meant to replace human ingenuity and expertise but to augment it wi
 Depending on generic open-source libraries such as LangChain [@langchain] and Pytest [@pytest] allows us to focus on the biomedical domain but also introduces technical dependencies on these libraries.
 While we support those upstream libraries via pull requests, we depend on their maintainers for future updates.
 In addition, keeping up with these rapid developments is demanding on developer time, which is only sustainable in a community-driven open-source effort.
+To remain relevant, the framework's components, such as the benchmark, must be maintained as the field evolves.
 
 ### Future directions
 
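
The "full matrix" evaluation referenced in the first hunk maps naturally onto Pytest's test parameterisation, where every combination of model, prompt, and task expands into its own test case. The following is a minimal sketch of that pattern; `MODELS`, `PROMPTS`, `query_model`, and the example task are hypothetical stand-ins for illustration, not the actual BioChatter benchmark API.

```python
# Minimal sketch of a parameterised benchmark matrix with Pytest.
# MODELS, PROMPTS, query_model, and the example task are hypothetical
# stand-ins, not the actual BioChatter benchmark API.
import pytest

MODELS = ["gpt-4-0613", "gpt-4-0125-preview", "openhermes-2.5"]
PROMPTS = ["baseline", "task-specific"]


def query_model(model: str, prompt_style: str, question: str) -> str:
    """Hypothetical helper; replace the stub with a real LLM client
    that records model version, temperature, and other parameters."""
    return "The p53 protein is encoded by the TP53 gene."


@pytest.mark.parametrize("model", MODELS)
@pytest.mark.parametrize("prompt_style", PROMPTS)
def test_gene_naming(model: str, prompt_style: str):
    # Stacked parametrize decorators expand to the Cartesian product,
    # so every (model, prompt) combination runs as a separate test case.
    answer = query_model(model, prompt_style, "Which gene encodes p53?")
    assert "TP53" in answer
```

Because each combination is an individual test, adding a new model or prompt to the lists automatically extends the matrix on the next run, which is what makes the benchmark "living".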
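The leakage prevention mentioned in the same hunk can be illustrated similarly: the benchmark dataset is stored encrypted in the public repository and only decrypted in memory at test time, so the plaintext questions cannot be scraped into the training corpora of future models. The use of `cryptography.fernet` and the file layout below are assumptions for illustration, not necessarily the manuscript's implementation.

```python
# Sketch of keeping benchmark data encrypted at rest; the choice of
# Fernet and the file layout are illustrative assumptions, not
# necessarily the manuscript's implementation.
from cryptography.fernet import Fernet


def load_benchmark(path: str, key: bytes) -> str:
    """Decrypt the benchmark dataset in memory at test time, so the
    plaintext never sits in the public repository where it could be
    scraped into LLM training data."""
    with open(path, "rb") as handle:
        ciphertext = handle.read()
    return Fernet(key).decrypt(ciphertext).decode("utf-8")


# Example round trip with a locally generated key:
# key = Fernet.generate_key()
# token = Fernet(key).encrypt(b"Which gene encodes p53?")
```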