The OPEA Evaluation Working Group is chartered to identify standardized methodologies and frameworks for evaluating the RAG pipeline, to aid in benchmarking both the individual components and the end-to-end solution.
The evaluation will comprise both quantitative and qualitative metrics in the domains of Performance, Safety, Trustworthiness, and Scalability.
- GenAI Evaluation is a conundrum
- Most evaluations focus on LLM model performance, not on the end-to-end performance of applications that deploy LLMs
- LLM performance benchmarks have plateaued
- Lack of standardization
- Multiple leaderboards
- Methodology and Eval Frameworks
- Performance – metrics/KPIs for each component and end to end (a minimal measurement sketch follows this list)
- Trustworthiness – the ability to guarantee quality, security, robustness, and compliance with government or other policies
- Scalability / Enterprise Readiness – the ability to be used in production in enterprise environments
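To make the component-level and end-to-end focus concrete, below is a minimal Python sketch of collecting latency and throughput KPIs per pipeline stage. The retrieve/rerank/generate functions are hypothetical stand-ins (simulated with sleeps), not OPEA or real-library APIs; the timing harness around them is the point.

```python
import statistics
import time

def timed(fn, *args):
    """Run fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

# Hypothetical stage stubs; substitute real retriever/reranker/generator calls.
def retrieve(query):
    time.sleep(0.010)  # placeholder for a vector-store lookup
    return ["doc1", "doc2"]

def rerank(query, docs):
    time.sleep(0.005)  # placeholder for a cross-encoder rerank
    return docs

def generate(query, docs):
    time.sleep(0.050)  # placeholder for an LLM call
    return "answer"

def run_pipeline(query):
    """Time each stage separately, plus the end-to-end request."""
    stages = {}
    docs, stages["retrieval_s"] = timed(retrieve, query)
    docs, stages["rerank_s"] = timed(rerank, query, docs)
    _, stages["generation_s"] = timed(generate, query, docs)
    stages["end_to_end_s"] = sum(stages.values())
    return stages

e2e = [run_pipeline(f"q{i}")["end_to_end_s"] for i in range(20)]
print(f"p50 latency: {statistics.median(e2e):.3f}s")
print(f"p95 latency: {statistics.quantiles(e2e, n=20)[18]:.3f}s")
print(f"throughput:  {len(e2e) / sum(e2e):.1f} req/s (sequential)")
```

Reporting percentiles (p50/p95) rather than means is the usual practice for latency KPIs, since tail latency tends to dominate user experience.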
- Establish standardized methodologies, metrics, and frameworks for evaluating RAG components
- Evaluate Performance, Safety, Trustworthiness, and Scalability
- Identify holistic (end-to-end) metrics as well as metrics for individual components
- Identify an evaluation framework for defining the metrics
- Define at least 3 KPIs (see the sketch at the end of this section):
  - Quantitative Performance (Throughput/Latency and Accuracy)
  - Trustworthiness
  - Scalability
- Lack of standardization in evaluation frameworks, KPIs, and benchmarking
- Evangelization
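As an illustration of what defining these KPIs could look like in practice, here is a hedged sketch of a per-run KPI record covering the three domains, plus a simple readiness gate. All field names, units, and thresholds are illustrative assumptions, not an agreed OPEA schema.

```python
from dataclasses import dataclass

@dataclass
class KPIResult:
    """Scores from one evaluation run across the three proposed KPI domains.
    Field names and units are illustrative, not an agreed OPEA schema."""
    # Quantitative Performance
    throughput_rps: float    # sustained requests per second
    p95_latency_s: float     # 95th-percentile end-to-end latency, seconds
    accuracy: float          # task accuracy in [0, 1], e.g. exact match
    # Trustworthiness
    groundedness: float      # fraction of answers supported by retrieved docs
    safety_pass_rate: float  # fraction of prompts passing safety-policy checks
    # Scalability
    max_concurrency: int     # highest concurrency still meeting the latency SLO

def meets_slo(r: KPIResult, min_accuracy: float = 0.8,
              max_p95_s: float = 2.0) -> bool:
    """Example gate: flag a run as enterprise-ready only if it clears both
    accuracy and latency thresholds (thresholds are placeholders)."""
    return r.accuracy >= min_accuracy and r.p95_latency_s <= max_p95_s

run = KPIResult(throughput_rps=12.5, p95_latency_s=1.4, accuracy=0.86,
                groundedness=0.91, safety_pass_rate=0.99, max_concurrency=64)
print(meets_slo(run))  # True
```

A shared record like this is one way a working group could make KPI definitions comparable across frameworks and leaderboards.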