Add Eval Assist LLM as Judges (#1409)
* evalassist metric initial changes
  Signed-off-by: Martín Santillán Cooper <[email protected]>
* Fixed registering eval assist to catalog. Eval assist now generates assessment and summary for hardcoded prompts
  Signed-off-by: Martín Santillán Cooper <[email protected]>
* Update EvalAssistLLMasJudge
  Signed-off-by: Martín Santillán Cooper <[email protected]>
* update catalog for direct assessment metric
  Signed-off-by: Martín Santillán Cooper <[email protected]>
* add pairwiseEvaluator as EvalAssistLLMAsJudgePairwise
  Signed-off-by: Martín Santillán Cooper <[email protected]>
* Pass Rubric dynamically
  Signed-off-by: Martín Santillán Cooper <[email protected]>
* add catalog for pairwise metrics
  Signed-off-by: Martín Santillán Cooper <[email protected]>
* update example evaluate_aval_assist_rubric
  Signed-off-by: Martín Santillán Cooper <[email protected]>
* Add prometheus model for pairwise
  Signed-off-by: Martín Santillán Cooper <[email protected]>
* Add prometheus in catalog
  Signed-off-by: Martín Santillán Cooper <[email protected]>
* add pairwise criteria in preprocess_step
  Signed-off-by: Martín Santillán Cooper <[email protected]>
* update input fields in pairwise example
  Signed-off-by: Martín Santillán Cooper <[email protected]>
* add GPT rubric evaluator
  Signed-off-by: Martín Santillán Cooper <[email protected]>
* add gpt to the catalog
  Signed-off-by: Martín Santillán Cooper <[email protected]>
* Fixed incorrect merge conflict
  Signed-off-by: Martín Santillán Cooper <[email protected]>
* Removed unnecessary template files
  Signed-off-by: Martín Santillán Cooper <[email protected]>
* add GPT rubric evaluator
  Signed-off-by: Martín Santillán Cooper <[email protected]>
* Renamed eval_assist to eval_assist_direct and also using constants to create direct assessment evaluators
  Signed-off-by: Martín Santillán Cooper <[email protected]>
* eval assist pairwise now gets model name from constants to register to catalog
  Signed-off-by: Martín Santillán Cooper <[email protected]>
* Add rubrics and criteria as Enums
  Signed-off-by: Martín Santillán Cooper <[email protected]>
* update eval_assist_cinstant
  Signed-off-by: Martín Santillán Cooper <[email protected]>
* Add GPT Pairwise evaluator
  Signed-off-by: Martín Santillán Cooper <[email protected]>
* update catalog for eval assist
  Signed-off-by: Martín Santillán Cooper <[email protected]>
* rename eval assist's example files
  Signed-off-by: Martín Santillán Cooper <[email protected]>
* remove unused variable
  Signed-off-by: Martín Santillán Cooper <[email protected]>
* Major refactors and pairwise implementation
  Signed-off-by: Martín Santillán Cooper <[email protected]>
* Improve criteria parsing and classes
  Signed-off-by: Martín Santillán Cooper <[email protected]>
* Make watsonx infer call concurrent
  Signed-off-by: Martín Santillán Cooper <[email protected]>
* Adapt evaluators and examples context to be a dict instead of a str
  Signed-off-by: Martín Santillán Cooper <[email protected]>
* Rename evalassist to eval_assist as in other files
  Signed-off-by: Martín Santillán Cooper <[email protected]>
* Rename context variable in examples
  Signed-off-by: Martín Santillán Cooper <[email protected]>
* Use empty template in example cards
  Signed-off-by: Martín Santillán Cooper <[email protected]>
* Remove unused imports
  Signed-off-by: Martín Santillán Cooper <[email protected]>
* Many changes, prepares for the integration with EvalAssist
  Signed-off-by: Martín Santillán Cooper <[email protected]>
* Fix examples
  Signed-off-by: Martín Santillán Cooper <[email protected]>
* Remove model family
  Signed-off-by: Martín Santillán Cooper <[email protected]>
* Remove MatchClosestOption checks
  Signed-off-by: Martín Santillán Cooper <[email protected]>
* Clean unused files
  Signed-off-by: Martín Santillán Cooper <[email protected]>
* Remove inference changes
  Signed-off-by: Martín Santillán Cooper <[email protected]>
* remove .mailmap
  Signed-off-by: Martín Santillán Cooper <[email protected]>
* Suggested python based approach for user criteria
  Signed-off-by: Yoav Katz <[email protected]>
  Signed-off-by: Martín Santillán Cooper <[email protected]>
* add context_field attribute and minor fixes and improvements
  Signed-off-by: Martín Santillán Cooper <[email protected]>
* Add example of squad card
  Signed-off-by: Martín Santillán Cooper <[email protected]>
* Introduce context_fields and generate_summaries attributes
  Signed-off-by: Martín Santillán Cooper <[email protected]>
* Add example for running evaluations on squad
  Signed-off-by: Martín Santillán Cooper <[email protected]>
* Changes after running pre-commit
  Signed-off-by: Martín Santillán Cooper <[email protected]>
* Apply ruff suggestions
  Signed-off-by: Martín Santillán Cooper <[email protected]>
* Add updated secrets.baseline
  Signed-off-by: Martín Santillán Cooper <[email protected]>
* Remove unused function
  Signed-off-by: Martín Santillán Cooper <[email protected]>
* Several changes
  - update examples
  - Update enums to inherit from string
  - allow task data criteria to be string from catalog
  Signed-off-by: Martín Santillán Cooper <[email protected]>
* Minor changes after updating with main
  Signed-off-by: Martín Santillán Cooper <[email protected]>
* Add the option of passing the criteria as a single string description
  Signed-off-by: Martín Santillán Cooper <[email protected]>
* Fix evaluate_eval_assist_direct_user_criteria_no_catalog
  Signed-off-by: Martín Santillán Cooper <[email protected]>
* fix evaluate_existing_dataset_by_llm_as_judge_eval_assist
  Signed-off-by: Martín Santillán Cooper <[email protected]>
* Refactor EvalAssistLLMAsJudgeDirect
  Signed-off-by: Martín Santillán Cooper <[email protected]>
* Refactor EvalAssistLLMAsJudgePairwise
  Signed-off-by: Martín Santillán Cooper <[email protected]>
* Add more entries to the pairwise results
  Signed-off-by: Martín Santillán Cooper <[email protected]>
* Modularize get_results in direct assessment
  Signed-off-by: Martín Santillán Cooper <[email protected]>
* Simplified example by using standard qa template and "create_dataset". Also allowed qa multireference template to work without references (and added check to metric that required it)
  Signed-off-by: Yoav Katz <[email protected]>
* Simplified example
  Signed-off-by: Yoav Katz <[email protected]>
* Simplified example
  Signed-off-by: Yoav Katz <[email protected]>
* Removed old way to set a custom single criteria over dataset. The old way required a dedicated task field.
  Signed-off-by: Yoav Katz <[email protected]>
* Example with criteria from the dataset (WIP)
  Signed-off-by: Yoav Katz <[email protected]>
* Added example of taking judgement string from file
  Signed-off-by: Yoav Katz <[email protected]>
* Change temperature test case name
  Signed-off-by: Martín Santillán Cooper <[email protected]>
* Update criteria parsing now that operators are used
  Signed-off-by: Martín Santillán Cooper <[email protected]>
* Start pairwise change
  Signed-off-by: Martín Santillán Cooper <[email protected]>
* Fix pairwise bugs and update examples to have >1 instances
  Signed-off-by: Martín Santillán Cooper <[email protected]>
* Fixed type of pairwise judge. Added NullTemplate that returns empty prompt and no references.
  Signed-off-by: Yoav Katz <[email protected]>
* Minor simplification of example
  Signed-off-by: Yoav Katz <[email protected]>
* Added global scores of mean win rate and mean ranking per prediction names. Added check to ensure all instances have the same predictions.
  Signed-off-by: Yoav Katz <[email protected]>
* Add/update no-catalog examples
  Signed-off-by: Martín Santillán Cooper <[email protected]>
* Remove user criteria pairwise example
  Signed-off-by: Martín Santillán Cooper <[email protected]>
* Added 'criteria_feild' param to judge, instead of implicitly using a hard-coded field name.
  Signed-off-by: Yoav Katz <[email protected]>
* Add pairwise criteria operators and refactor names
  Signed-off-by: Martín Santillán Cooper <[email protected]>
* Update examples
  Signed-off-by: Martín Santillán Cooper <[email protected]>
* Add criteria to the result
  Signed-off-by: Martín Santillán Cooper <[email protected]>
* Move criteria verification to parent class
  Signed-off-by: Martín Santillán Cooper <[email protected]>
* Change '|' to Union
  Signed-off-by: Martín Santillán Cooper <[email protected]>
* Change missing '|' to Union
  Signed-off-by: Martín Santillán Cooper <[email protected]>
* fix pre-commits check
  Signed-off-by: Martín Santillán Cooper <[email protected]>
* Fix: don't add positional bias check if it is disabled
  Signed-off-by: Martín Santillán Cooper <[email protected]>
* Removed test_multi_reference_template_with_empty_references because it's now supported.
  Signed-off-by: Yoav Katz <[email protected]>
* Fixed failing unit tests by adding missing imports
  Signed-off-by: Yoav Katz <[email protected]>
* Running ruff
  Signed-off-by: Yoav Katz <[email protected]>
* Removed temp workaround for a problem
  Signed-off-by: Yoav Katz <[email protected]>
* Moved criteria existence check from verify() to before processing
  Signed-off-by: Yoav Katz <[email protected]>
* Reverted changes in log probs in inference.py
  Signed-off-by: Yoav Katz <[email protected]>
  Signed-off-by: Martín Santillán Cooper <[email protected]>
* Fix typo in LLMasJudge
  Signed-off-by: Martín Santillán Cooper <[email protected]>
* Fix catalog name task_based_ll_mas_judge -> task_based_llm_as_judge
  Signed-off-by: Martín Santillán Cooper <[email protected]>
* bring back to old TaskBasedLLMasJudge class name
  Signed-off-by: Martín Santillán Cooper <[email protected]>
* Fix examples
  Signed-off-by: elronbandel <[email protected]>
* Remove example from testing
  Signed-off-by: elronbandel <[email protected]>
* Fix type
  Signed-off-by: elronbandel <[email protected]>

---------

Signed-off-by: Martín Santillán Cooper <[email protected]>
Signed-off-by: Yoav Katz <[email protected]>
Signed-off-by: elronbandel <[email protected]>
Co-authored-by: Tejaswini Pedapati <[email protected]>
Co-authored-by: Swapnaja <[email protected]>
Co-authored-by: Swapnaja Achintalwar [email protected] <[email protected]>
Co-authored-by: Yoav Katz <[email protected]>
Co-authored-by: Yoav Katz <[email protected]>
Co-authored-by: elronbandel <[email protected]>