From 0e65737ce21c45f1f1530170dd23b1acc0160f1c Mon Sep 17 00:00:00 2001
From: harvard-edge
Date: Sun, 2 Feb 2025 20:16:18 +0000
Subject: [PATCH] Push dev branch build

---
 docs/contents/core/benchmarking/benchmarking.html |  2 +-
 docs/contents/core/efficient_ai/efficient_ai.html | 12 ++++++------
 docs/contents/core/frameworks/frameworks.html     | 14 +++++++-------
 docs/search.json                                  |  2 +-
 4 files changed, 15 insertions(+), 15 deletions(-)

diff --git a/docs/contents/core/benchmarking/benchmarking.html b/docs/contents/core/benchmarking/benchmarking.html
index f1510a07..2c6c0063 100644
--- a/docs/contents/core/benchmarking/benchmarking.html
+++ b/docs/contents/core/benchmarking/benchmarking.html
@@ -812,7 +812,7 @@

12.3 AI Benchmarks: System, Model, and Data

-

The evolution of benchmarks reaches its apex in machine learning, reflecting a journey that parallels the field’s development towards domain-specific applications. Early machine learning benchmarks focused primarily on algorithmic performance, measuring how well models could perform specific tasks (Lecun et al. 1998). As machine learning applications scaled and computational demands grew, the focus expanded to include system performance and hardware efficiency (Jouppi et al. 2017). Most recently, the critical role of data quality has emerged as the third essential dimension of evaluation (gebru2018datasheets?).

+

The evolution of benchmarks reaches its apex in machine learning, reflecting a journey that parallels the field’s development towards domain-specific applications. Early machine learning benchmarks focused primarily on algorithmic performance, measuring how well models could perform specific tasks (Lecun et al. 1998). As machine learning applications scaled and computational demands grew, the focus expanded to include system performance and hardware efficiency (Jouppi et al. 2017). Most recently, the critical role of data quality has emerged as the third essential dimension of evaluation (Gebru et al. 2021).

Jouppi, Norman P., Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, et al. 2017. “In-Datacenter Performance Analysis of a Tensor Processing Unit.” ACM SIGARCH Computer Architecture News 45 (2): 1–12. https://doi.org/10.1145/3140659.3080246.

What sets AI benchmarks apart from traditional performance metrics is their inherent variability—introducing accuracy as a fundamental dimension of evaluation. Unlike conventional benchmarks, which measure fixed, deterministic characteristics like computational speed or energy consumption, AI benchmarks must account for the probabilistic nature of machine learning models. The same system can produce different results depending on the data it encounters, making accuracy a defining factor in performance assessment. This distinction adds complexity, as benchmarking AI systems requires not only measuring raw computational efficiency but also understanding trade-offs between accuracy, generalization, and resource constraints.
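To make this trade-off concrete, the minimal Python sketch below measures both prediction accuracy and per-sample latency in a single benchmark loop. The `model` object with its `predict` method and the `dataset` iterable are hypothetical placeholders, not part of any specific benchmark suite described in the chapter.

import time

def benchmark(model, dataset):
    """Measure accuracy and average latency for a hypothetical model.

    `model` is assumed to expose a `predict(x)` method; `dataset` is an
    iterable of (input, label) pairs. Both are illustrative placeholders.
    """
    correct, total, elapsed = 0, 0, 0.0
    for x, label in dataset:
        start = time.perf_counter()
        prediction = model.predict(x)          # probabilistic model under test
        elapsed += time.perf_counter() - start
        correct += int(prediction == label)    # accuracy dimension
        total += 1
    return {
        "accuracy": correct / total,
        "avg_latency_s": elapsed / total,      # efficiency dimension
    }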

diff --git a/docs/contents/core/efficient_ai/efficient_ai.html b/docs/contents/core/efficient_ai/efficient_ai.html
index b9139e38..c8149e99 100644
--- a/docs/contents/core/efficient_ai/efficient_ai.html
+++ b/docs/contents/core/efficient_ai/efficient_ai.html
@@ -801,10 +801,10 @@

  • Compute Efficiency: Compute efficiency addresses the effective utilization of computational resources, including energy and hardware infrastructure.

  • Data Efficiency: Data efficiency emphasizes optimizing the amount and quality of data required to achieve desired performance.

  • -
    +
    -
    \begin{tikzpicture}[font=\small\sf,node distance=2mm]
    +
    \begin{tikzpicture}[font=\small\sf,node distance=2mm]
     \usetikzlibrary{positioning,arrows.meta,arrows,calc}
     \tikzset{
       Box/.style={inner xsep=1pt,
    @@ -1033,10 +1033,10 @@ 

    9.3.2 Interdependencies Between Efficiency Dimensions

    The efficiency of machine learning systems is inherently a multifaceted challenge that encompasses model design, computational resources, and data utilization. These dimensions—algorithmic efficiency, compute efficiency, and data efficiency—are deeply interdependent, forming a dynamic ecosystem where improvements in one area often ripple across the others. Understanding these interdependencies is crucial for building scalable, cost-effective, and high-performing systems that can adapt to diverse application demands.

    This interplay is best captured through a conceptual visualization. Figure 9.4 illustrates how these efficiency dimensions overlap and interact with each other in a simple Venn diagram. Each circle represents one of the efficiency dimensions, and their intersections highlight the areas where they influence one another, which we will explore next.

    -
    +
    -
    \scalebox{0.8}{%
    +
    \scalebox{0.8}{%
     \begin{tikzpicture}[font=\small\sf,scale=1.25,line width=0.75pt]
     \usetikzlibrary{arrows.meta,calc,positioning,angles}
     \def\firstcircle{(0,0) circle (1.5cm)}
    @@ -1136,10 +1136,10 @@ 

    Susta

    The Virtuous Cycle of Machine Learning Systems

    Efficiency, scalability, and sustainability are deeply interconnected, forming a virtuous cycle that propels machine learning systems toward broader impact. Efficient systems enable scalable deployments, which amplify their sustainability benefits. In turn, sustainable practices drive the need for more efficient designs, ensuring the cycle continues. This interplay creates systems that are not only technically impressive but also socially and environmentally responsible, aligning AI innovation with the needs of a global community.

    Figure 9.5 below illustrates the virtuous cycle of machine learning systems. It highlights how efficiency drives scalability, scalability fosters sustainability, and sustainability reinforces efficiency.

    -
    +
    -
    \begin{tikzpicture}[font=\small\sf,node distance=1pt,line width=0.75pt]
    +
    \begin{tikzpicture}[font=\small\sf,node distance=1pt,line width=0.75pt]
     \usetikzlibrary{arrows.meta,calc,positioning,angles}
     \def\ra{40mm}
     
    diff --git a/docs/contents/core/frameworks/frameworks.html b/docs/contents/core/frameworks/frameworks.html
    index 738c2e3c..acfba70b 100644
    --- a/docs/contents/core/frameworks/frameworks.html
    +++ b/docs/contents/core/frameworks/frameworks.html
    @@ -911,7 +911,7 @@ 

    7.3 Framework Fundamentals

    Modern machine learning frameworks operate through the integration of four key layers: Fundamentals, Data Handling, Developer Interface, and Execution and Abstraction. These layers function together to provide a structured and efficient foundation for model development and deployment, as illustrated in Figure 7.2.

    -
    +
    @@ -991,7 +991,7 @@

    Graph Basics

    Baydin, Atilim Gunes, Barak A. Pearlmutter, Alexey Andreyevich Radul, and Jeffrey Mark Siskind. 2017a. “Automatic Differentiation in Machine Learning: A Survey.” J. Mach. Learn. Res. 18: 153:1–43. https://jmlr.org/papers/v18/17-468.html.

For example, a node might represent a matrix multiplication operation, taking two input matrices (or tensors) and producing an output matrix (or tensor). To visualize this, consider the simple example in Figure 7.3. The directed acyclic graph computes \(z = x \times y\), where each variable is simply a number.
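As a framework-free illustration of this idea, the short Python sketch below represents the same DAG explicitly; the `Node` class and `evaluate` function are purely illustrative and not drawn from any particular framework.

class Node:
    """A graph node holding an operation, its inputs, and an optional value."""
    def __init__(self, op, inputs=(), value=None):
        self.op = op          # e.g. "input" or "mul"
        self.inputs = inputs  # upstream Node objects (the graph edges)
        self.value = value

def evaluate(node):
    """Recursively evaluate a node by first evaluating its inputs."""
    if node.op == "input":
        return node.value
    if node.op == "mul":
        a, b = (evaluate(n) for n in node.inputs)
        return a * b
    raise ValueError(f"unknown op: {node.op}")

# Build the graph for z = x * y with concrete numbers.
x = Node("input", value=2.0)
y = Node("input", value=3.0)
z = Node("mul", inputs=(x, y))

print(evaluate(z))  # 6.0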

    -
    +
    @@ -1016,7 +1016,7 @@

    Graph Basics

    As shown in Figure 7.4, the structure of the computation graph involves defining interconnected layers, such as convolution, activation, pooling, and normalization, which are optimized before execution. The figure also demonstrates key system-level interactions, including memory management and device placement, showing how the static graph approach enables comprehensive pre-execution analysis and resource allocation.

    -
    +
    @@ -1081,7 +1081,7 @@

    Static Graphs

    A static computation graph implements a clear separation between the definition of operations and their execution. During the definition phase, each mathematical operation, variable, and data flow connection is explicitly declared and added to the graph structure. This graph is a complete specification of the computation but does not perform any actual calculations. Instead, the framework constructs an internal representation of all operations and their dependencies, which will be executed in a subsequent phase.

    This upfront definition enables powerful system-level optimizations. The framework can analyze the complete structure to identify opportunities for operation fusion, eliminating unnecessary intermediate results. Memory requirements can be precisely calculated and optimized in advance, leading to efficient allocation strategies. Furthermore, static graphs can be compiled into highly optimized executable code for specific hardware targets, taking full advantage of platform-specific features. Once validated, the same computation can be run repeatedly with high confidence in its behavior and performance characteristics.

    Figure 7.5 illustrates this fundamental two-phase approach: first, the complete computational graph is constructed and optimized; then, during the execution phase, actual data flows through the graph to produce results. This separation enables the framework to perform comprehensive analysis and optimization of the entire computation before any execution begins.
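A minimal sketch of this two-phase pattern, written against the TensorFlow 1.x-style graph API (accessed here through `tf.compat.v1`), is shown below; the tensor shapes and names are illustrative assumptions rather than anything prescribed by the text.

import tensorflow as tf

# Phase 1: definition. Operations are declared and added to the graph;
# no numerical work happens yet.
tf.compat.v1.disable_eager_execution()
graph = tf.Graph()
with graph.as_default():
    x = tf.compat.v1.placeholder(tf.float32, shape=(None, 4), name="x")
    w = tf.Variable(tf.random.normal((4, 2)), name="w")
    y = tf.matmul(x, w, name="y")   # declared, not yet computed
    init = tf.compat.v1.global_variables_initializer()

# Phase 2: execution. Actual data flows through the fully specified graph.
with tf.compat.v1.Session(graph=graph) as sess:
    sess.run(init)
    result = sess.run(y, feed_dict={x: [[1.0, 2.0, 3.0, 4.0]]})
    print(result.shape)  # (1, 2)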

    -
    +
    @@ -1121,10 +1121,10 @@

    Static Graphs

    Dynamic Graphs

    Dynamic computation graphs, popularized by PyTorch, implement a “define-by-run” execution model. This approach constructs the graph during execution, offering greater flexibility in model definition and debugging. Unlike static graphs, which rely on predefined memory allocation, dynamic graphs allocate memory as operations execute, making them susceptible to memory fragmentation6 in long-running tasks.

    6 Memory Fragmentation: The inefficient use of memory caused by small, unused gaps between allocated memory blocks, often resulting in wasted memory or reduced performance.

    As shown in Figure 7.6, each operation is defined, executed, and completed before moving on to define the next operation. This contrasts sharply with static graphs, where all operations must be defined upfront. When an operation is defined, it is immediately executed, and its results become available for subsequent operations or for inspection during debugging. This cycle continues until all operations are complete.
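A short PyTorch sketch of define-by-run follows; the specific tensors are arbitrary, but the pattern of each operation executing as soon as it is defined is the behavior described above.

import torch

# Define-by-run: each line both defines an operation and executes it.
x = torch.tensor([2.0, 3.0], requires_grad=True)
y = torch.tensor([4.0, 5.0], requires_grad=True)

z = x * y            # graph node created and evaluated immediately
loss = z.sum()       # intermediate results are inspectable right away
print(z)             # tensor([ 8., 15.], grad_fn=<MulBackward0>)

loss.backward()      # the recorded graph is traversed to compute gradients
print(x.grad)        # tensor([4., 5.])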

    -
    +
    -
    %%| out-width: 100%
    +
    %%| out-width: 100%
     \usetikzlibrary{arrows, arrows.meta, positioning,calc,fit,backgrounds}
     \begin{tikzpicture}[font=\small\sf,node distance=15mm,outer sep=0pt]
     \tikzset{
    @@ -1916,7 +1916,7 @@ 

    Just-In-Time Comp

    7.3.6 Core Operations

    Machine learning frameworks employ multiple layers of operations that translate high-level model descriptions into efficient computations on hardware. These operations form a hierarchy: hardware abstraction operations manage the complexity of diverse computing platforms, basic numerical operations implement fundamental mathematical computations, and system-level operations coordinate resources and execution. This operational hierarchy is key to understanding how frameworks transform mathematical models into practical implementations. Figure 7.9 illustrates this hierarchy, showing the relationship between the three layers and their respective subcomponents.
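As a loose, framework-agnostic illustration of this layering (not a depiction of any particular framework's internals), the sketch below separates a hypothetical hardware-abstraction step, which selects a kernel for a device, from the basic numerical operation it dispatches to; all names in it are invented for illustration.

import numpy as np

# Hardware abstraction layer (hypothetical): map a logical device name
# to the kernel implementations registered for it.
KERNELS = {
    "cpu": {"matmul": np.matmul},
    # An accelerator backend would register its own kernels here.
}

def dispatch(op_name, device="cpu"):
    """System-level coordination: pick the kernel for `op_name` on `device`."""
    try:
        return KERNELS[device][op_name]
    except KeyError:
        raise NotImplementedError(f"{op_name} not available on {device}")

# Basic numerical operation, reached through the hierarchy.
a = np.ones((2, 3))
b = np.ones((3, 4))
matmul = dispatch("matmul", device="cpu")
print(matmul(a, b).shape)  # (2, 4)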

    -
    +
    diff --git a/docs/search.json b/docs/search.json index 9e9485ce..d7f8ab42 100644 --- a/docs/search.json +++ b/docs/search.json @@ -1554,7 +1554,7 @@ "href": "contents/core/benchmarking/benchmarking.html#ai-benchmarks-system-model-and-data", "title": "12  Benchmarking AI", "section": "12.3 AI Benchmarks: System, Model, and Data", - "text": "12.3 AI Benchmarks: System, Model, and Data\nThe evolution of benchmarks reaches its apex in machine learning, reflecting a journey that parallels the field’s development towards domain-specific applications. Early machine learning benchmarks focused primarily on algorithmic performance, measuring how well models could perform specific tasks (Lecun et al. 1998). As machine learning applications scaled and computational demands grew, the focus expanded to include system performance and hardware efficiency (Jouppi et al. 2017). Most recently, the critical role of data quality has emerged as the third essential dimension of evaluation (gebru2018datasheets?).\n\nJouppi, Norman P., Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, et al. 2017. “In-Datacenter Performance Analysis of a Tensor Processing Unit.” ACM SIGARCH Computer Architecture News 45 (2): 1–12. https://doi.org/10.1145/3140659.3080246.\nWhat sets AI benchmarks apart from traditional performance metrics is their inherent variability—introducing accuracy as a fundamental dimension of evaluation. Unlike conventional benchmarks, which measure fixed, deterministic characteristics like computational speed or energy consumption, AI benchmarks must account for the probabilistic nature of machine learning models. The same system can produce different results depending on the data it encounters, making accuracy a defining factor in performance assessment. This distinction adds complexity, as benchmarking AI systems requires not only measuring raw computational efficiency but also understanding trade-offs between accuracy, generalization, and resource constraints.\nThe growing complexity and ubiquity of machine learning systems demand comprehensive benchmarking across all three dimensions: algorithmic models, hardware systems, and training data. This multifaceted evaluation approach represents a significant departure from earlier benchmarks that could focus on isolated aspects like computational speed or energy efficiency (Hernandez and Brown 2020). Modern machine learning benchmarks must address the sophisticated interplay between these dimensions, as limitations in any one area can fundamentally constrain overall system performance.\n\nHernandez, Danny, and Tom B. Brown. 2020. “Measuring the Algorithmic Efficiency of Neural Networks.” arXiv Preprint arXiv:2005.04305, May. https://doi.org/10.48550/arxiv.2005.04305.\n\nJouppi, Norman P., Doe Hyun Yoon, Matthew Ashcraft, Mark Gottscho, Thomas B. Jablin, George Kurian, James Laudon, et al. 2021. “Ten Lessons from Three Generations Shaped Google’s TPUv4i : Industrial Product.” In 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA), 1–14. IEEE. https://doi.org/10.1109/isca52012.2021.00010.\n\nBender, Emily M., Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. 2021. “On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜.” In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, 610–23. ACM. 
https://doi.org/10.1145/3442188.3445922.\nThis evolution in benchmark complexity mirrors the field’s deepening understanding of what drives machine learning system success. While algorithmic innovations initially dominated progress metrics, the challenges of deploying models at scale revealed the critical importance of hardware efficiency (Jouppi et al. 2021). Subsequently, high-profile failures of machine learning systems in real-world deployments highlighted how data quality and representation fundamentally determine system reliability and fairness (Bender et al. 2021). Understanding how these dimensions interact has become essential for accurately assessing machine learning system performance, informing development decisions, and measuring technological progress in the field.\n\n12.3.1 Algorithmic Benchmarks\nAI algorithms must balance multiple interconnected performance objectives, including accuracy, speed, resource efficiency, and generalization capability. As machine learning applications span diverse domains—such as computer vision, natural language processing, speech recognition, and reinforcement learning—evaluating these objectives requires standardized methodologies tailored to each domain’s unique challenges. Algorithmic benchmarks, such as ImageNet (Deng et al. 2009), establish these evaluation frameworks, providing a consistent basis for comparing different machine learning approaches.\n\n\n\n\n\n\nDefinition of Machine Learning Algorithmic Benchmarks\n\n\n\nML Algorithmic benchmarks refer to the evaluation of machine learning models on standardized tasks using predefined datasets and metrics. These benchmarks measure accuracy, efficiency, and generalization to ensure objective comparisons across different models. Algorithmic benchmarks provide performance baselines, enabling systematic assessment of trade-offs between model complexity and computational cost. They drive technological progress by tracking improvements over time and identifying limitations in existing approaches.\n\n\nAlgorithmic benchmarks serve several critical functions in advancing AI. They establish clear performance baselines, enabling objective comparisons between competing approaches. By systematically evaluating trade-offs between model complexity, computational requirements, and task performance, they help researchers and practitioners identify optimal design choices. Moreover, they track technological progress by documenting improvements over time, guiding the development of new techniques while exposing limitations in existing methodologies. Through these roles, algorithmic benchmarks shape the trajectory of AI research and development, ensuring that innovations translate into measurable, real-world improvements.\n\n\n12.3.2 System Benchmarks\nAI computations, particularly in machine learning, place extraordinary demands on computational resources. The underlying hardware infrastructure, encompassing CPUs, GPUs, TPUs, and specialized accelerators, fundamentally determines the speed, efficiency, and scalability of AI solutions. System benchmarks establish standardized methodologies for evaluating hardware performance across diverse AI workloads, measuring critical metrics including computational throughput, memory bandwidth, power efficiency, and scaling characteristics [Reddi et al. 
(2019); Mattson2020].\n\n\n\n\n\n\nDefinition of Machine Learning System Benchmarks\n\n\n\nML System benchmarks refer to the evaluation of computational infrastructure used to execute AI workloads, assessing performance, efficiency, and scalability under standardized conditions. These benchmarks measure throughput, latency, and resource utilization to ensure objective comparisons across different system configurations. System benchmarks provide insights into workload efficiency, guiding infrastructure selection, system optimization, and advancements in computational architectures.\n\n\nThese benchmarks fulfill two essential functions in the AI ecosystem. First, they enable developers and organizations to make informed decisions when selecting hardware platforms for their AI applications by providing comprehensive comparative performance data across system configurations. Critical evaluation factors include training speed, inference latency, energy efficiency, and cost-effectiveness. Second, hardware manufacturers rely on these benchmarks to quantify generational improvements and guide the development of specialized AI accelerators, driving continuous advancement in computational capabilities.\nSystem benchmarks evaluate performance across multiple scales, ranging from single-chip configurations to large distributed systems, and diverse AI workloads including both training and inference tasks. This comprehensive evaluation approach ensures that benchmarks accurately reflect real-world deployment scenarios and deliver actionable insights that inform both hardware selection decisions and system architecture design.\n\n\n12.3.3 Data Benchmarks\nData quality, scale, and diversity fundamentally shape machine learning system performance, directly influencing how effectively algorithms learn and generalize to new situations. Data benchmarks establish standardized datasets and evaluation methodologies that enable consistent comparison of different approaches. These frameworks assess critical aspects of data quality, including domain coverage, potential biases, and resilience to real-world variations in input data (Gebru et al. 2021).\n\nGebru, Timnit, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. 2021. “Datasheets for Datasets.” Communications of the ACM 64 (12): 86–92. https://doi.org/10.1145/3458723.\n\n\n\n\n\n\nDefinition of Machine Learning Data Benchmarks\n\n\n\nML Data benchmarks refer to the evaluation of datasets and data quality in machine learning, assessing coverage, bias, and robustness under standardized conditions. These benchmarks measure data representativeness, consistency, and impact on model performance to ensure objective comparisons across different AI approaches. Data benchmarks provide insights into data reliability, guiding dataset selection, bias mitigation, and improvements in data-driven AI systems.\n\n\nData benchmarks serve an essential function in understanding AI system behavior under diverse data conditions. Through systematic evaluation, they help identify common failure modes, expose gaps in data coverage, and reveal underlying biases that could impact model behavior in deployment. By providing common frameworks for data evaluation, these benchmarks enable the AI community to systematically improve data quality and address potential issues before deploying systems in production environments. 
This proactive approach to data quality assessment has become increasingly critical as AI systems take on more complex and consequential tasks across different domains.\n\n\n12.3.4 Community Consensus\nThe proliferation of benchmarks spanning performance, energy efficiency, and domain-specific applications creates a fundamental challenge: establishing industry-wide standards. While early computing benchmarks primarily measured processor speed and memory bandwidth, modern benchmarks evaluate sophisticated aspects of system performance, from power consumption profiles to application-specific capabilities. This evolution in scope and complexity necessitates comprehensive validation and consensus from the computing community, particularly in rapidly evolving fields like machine learning where performance must be evaluated across multiple interdependent dimensions.\nThe lasting impact of a benchmark depends fundamentally on its acceptance by the research community, where technical excellence alone proves insufficient. Benchmarks developed without broad community input often fail to gain traction, frequently missing metrics that leading research groups consider essential. Successful benchmarks emerge through collaborative development involving academic institutions, industry partners, and domain experts. This inclusive approach ensures benchmarks evaluate capabilities most crucial for advancing the field, while balancing theoretical and practical considerations.\nBenchmarks developed through extensive collaboration among respected institutions carry the authority necessary to drive widespread adoption, while those perceived as advancing particular corporate interests face skepticism and limited acceptance. The success of ImageNet demonstrates how sustained community engagement through workshops and challenges establishes long-term viability. This community-driven development creates a foundation for formal standardization, where organizations like IEEE and ISO transform these benchmarks into official standards.\nThe standardization process provides crucial infrastructure for benchmark formalization and adoption. IEEE working groups transform community-developed benchmarking methodologies into formal industry standards, establishing precise specifications for measurement and reporting. The IEEE 2416-2019 standard for system power modeling4 exemplifies this process, codifying best practices developed through community consensus. Similarly, ISO/IEC technical committees develop international standards for benchmark validation and certification, ensuring consistent evaluation across global research and industry communities. These organizations bridge the gap between community-driven innovation and formal standardization, providing frameworks that enable reliable comparison of results across different institutions and geographic regions.\n4 IEEE 2416-2019: A standard defining methodologies for parameterized power modeling, enabling system-level power analysis and optimization in electronic design, including AI hardware.Successful community benchmarks establish clear governance structures for managing their evolution. Through rigorous version control systems and detailed change documentation, benchmarks maintain backward compatibility while incorporating new advances. This governance includes formal processes for proposing, reviewing, and implementing changes, ensuring that benchmarks remain relevant while maintaining stability. 
Modern benchmarks increasingly emphasize reproducibility requirements, incorporating automated verification systems and standardized evaluation environments.\nOpen access accelerates benchmark adoption and ensures consistent implementation. Projects that provide open-source reference implementations, comprehensive documentation, validation suites, and containerized evaluation environments reduce barriers to entry. This standardization enables research groups to evaluate solutions using uniform methods and metrics. Without such coordinated implementation frameworks, organizations might interpret benchmarks inconsistently, compromising result reproducibility and meaningful comparison across studies.\nThe most successful benchmarks strike a careful balance between academic rigor and industry practicality. Academic involvement ensures theoretical soundness and comprehensive evaluation methodology, while industry participation grounds benchmarks in practical constraints and real-world applications. This balance proves particularly crucial in machine learning benchmarks, where theoretical advances must translate to practical improvements in deployed systems (Patterson et al. 2021).\n\nPatterson, David, Joseph Gonzalez, Quoc Le, Chen Liang, Lluis-Miquel Munguia, Daniel Rothchild, David So, Maud Texier, and Jeff Dean. 2021. “Carbon Emissions and Large Neural Network Training.” arXiv Preprint arXiv:2104.10350, April. http://arxiv.org/abs/2104.10350v3.\nCommunity consensus establishes enduring benchmark relevance, while fragmentation impedes scientific progress. Through collaborative development and transparent operation, benchmarks evolve into authoritative standards for measuring advancement. The most successful benchmarks in energy efficiency and domain-specific applications share this foundation of community development and governance, demonstrating how collective expertise and shared purpose create lasting impact in rapidly advancing fields.", + "text": "12.3 AI Benchmarks: System, Model, and Data\nThe evolution of benchmarks reaches its apex in machine learning, reflecting a journey that parallels the field’s development towards domain-specific applications. Early machine learning benchmarks focused primarily on algorithmic performance, measuring how well models could perform specific tasks (Lecun et al. 1998). As machine learning applications scaled and computational demands grew, the focus expanded to include system performance and hardware efficiency (Jouppi et al. 2017). Most recently, the critical role of data quality has emerged as the third essential dimension of evaluation (Gebru et al. 2021).\n\nJouppi, Norman P., Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, et al. 2017. “In-Datacenter Performance Analysis of a Tensor Processing Unit.” ACM SIGARCH Computer Architecture News 45 (2): 1–12. https://doi.org/10.1145/3140659.3080246.\nWhat sets AI benchmarks apart from traditional performance metrics is their inherent variability—introducing accuracy as a fundamental dimension of evaluation. Unlike conventional benchmarks, which measure fixed, deterministic characteristics like computational speed or energy consumption, AI benchmarks must account for the probabilistic nature of machine learning models. The same system can produce different results depending on the data it encounters, making accuracy a defining factor in performance assessment. 
This distinction adds complexity, as benchmarking AI systems requires not only measuring raw computational efficiency but also understanding trade-offs between accuracy, generalization, and resource constraints.\nThe growing complexity and ubiquity of machine learning systems demand comprehensive benchmarking across all three dimensions: algorithmic models, hardware systems, and training data. This multifaceted evaluation approach represents a significant departure from earlier benchmarks that could focus on isolated aspects like computational speed or energy efficiency (Hernandez and Brown 2020). Modern machine learning benchmarks must address the sophisticated interplay between these dimensions, as limitations in any one area can fundamentally constrain overall system performance.\n\nHernandez, Danny, and Tom B. Brown. 2020. “Measuring the Algorithmic Efficiency of Neural Networks.” arXiv Preprint arXiv:2005.04305, May. https://doi.org/10.48550/arxiv.2005.04305.\n\nJouppi, Norman P., Doe Hyun Yoon, Matthew Ashcraft, Mark Gottscho, Thomas B. Jablin, George Kurian, James Laudon, et al. 2021. “Ten Lessons from Three Generations Shaped Google’s TPUv4i : Industrial Product.” In 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA), 1–14. IEEE. https://doi.org/10.1109/isca52012.2021.00010.\n\nBender, Emily M., Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. 2021. “On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜.” In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, 610–23. ACM. https://doi.org/10.1145/3442188.3445922.\nThis evolution in benchmark complexity mirrors the field’s deepening understanding of what drives machine learning system success. While algorithmic innovations initially dominated progress metrics, the challenges of deploying models at scale revealed the critical importance of hardware efficiency (Jouppi et al. 2021). Subsequently, high-profile failures of machine learning systems in real-world deployments highlighted how data quality and representation fundamentally determine system reliability and fairness (Bender et al. 2021). Understanding how these dimensions interact has become essential for accurately assessing machine learning system performance, informing development decisions, and measuring technological progress in the field.\n\n12.3.1 Algorithmic Benchmarks\nAI algorithms must balance multiple interconnected performance objectives, including accuracy, speed, resource efficiency, and generalization capability. As machine learning applications span diverse domains—such as computer vision, natural language processing, speech recognition, and reinforcement learning—evaluating these objectives requires standardized methodologies tailored to each domain’s unique challenges. Algorithmic benchmarks, such as ImageNet (Deng et al. 2009), establish these evaluation frameworks, providing a consistent basis for comparing different machine learning approaches.\n\n\n\n\n\n\nDefinition of Machine Learning Algorithmic Benchmarks\n\n\n\nML Algorithmic benchmarks refer to the evaluation of machine learning models on standardized tasks using predefined datasets and metrics. These benchmarks measure accuracy, efficiency, and generalization to ensure objective comparisons across different models. Algorithmic benchmarks provide performance baselines, enabling systematic assessment of trade-offs between model complexity and computational cost. 
They drive technological progress by tracking improvements over time and identifying limitations in existing approaches.\n\n\nAlgorithmic benchmarks serve several critical functions in advancing AI. They establish clear performance baselines, enabling objective comparisons between competing approaches. By systematically evaluating trade-offs between model complexity, computational requirements, and task performance, they help researchers and practitioners identify optimal design choices. Moreover, they track technological progress by documenting improvements over time, guiding the development of new techniques while exposing limitations in existing methodologies. Through these roles, algorithmic benchmarks shape the trajectory of AI research and development, ensuring that innovations translate into measurable, real-world improvements.\n\n\n12.3.2 System Benchmarks\nAI computations, particularly in machine learning, place extraordinary demands on computational resources. The underlying hardware infrastructure, encompassing CPUs, GPUs, TPUs, and specialized accelerators, fundamentally determines the speed, efficiency, and scalability of AI solutions. System benchmarks establish standardized methodologies for evaluating hardware performance across diverse AI workloads, measuring critical metrics including computational throughput, memory bandwidth, power efficiency, and scaling characteristics [Reddi et al. (2019); Mattson2020].\n\n\n\n\n\n\nDefinition of Machine Learning System Benchmarks\n\n\n\nML System benchmarks refer to the evaluation of computational infrastructure used to execute AI workloads, assessing performance, efficiency, and scalability under standardized conditions. These benchmarks measure throughput, latency, and resource utilization to ensure objective comparisons across different system configurations. System benchmarks provide insights into workload efficiency, guiding infrastructure selection, system optimization, and advancements in computational architectures.\n\n\nThese benchmarks fulfill two essential functions in the AI ecosystem. First, they enable developers and organizations to make informed decisions when selecting hardware platforms for their AI applications by providing comprehensive comparative performance data across system configurations. Critical evaluation factors include training speed, inference latency, energy efficiency, and cost-effectiveness. Second, hardware manufacturers rely on these benchmarks to quantify generational improvements and guide the development of specialized AI accelerators, driving continuous advancement in computational capabilities.\nSystem benchmarks evaluate performance across multiple scales, ranging from single-chip configurations to large distributed systems, and diverse AI workloads including both training and inference tasks. This comprehensive evaluation approach ensures that benchmarks accurately reflect real-world deployment scenarios and deliver actionable insights that inform both hardware selection decisions and system architecture design.\n\n\n12.3.3 Data Benchmarks\nData quality, scale, and diversity fundamentally shape machine learning system performance, directly influencing how effectively algorithms learn and generalize to new situations. Data benchmarks establish standardized datasets and evaluation methodologies that enable consistent comparison of different approaches. 
These frameworks assess critical aspects of data quality, including domain coverage, potential biases, and resilience to real-world variations in input data (Gebru et al. 2021).\n\nGebru, Timnit, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. 2021. “Datasheets for Datasets.” Communications of the ACM 64 (12): 86–92. https://doi.org/10.1145/3458723.\n\n\n\n\n\n\nDefinition of Machine Learning Data Benchmarks\n\n\n\nML Data benchmarks refer to the evaluation of datasets and data quality in machine learning, assessing coverage, bias, and robustness under standardized conditions. These benchmarks measure data representativeness, consistency, and impact on model performance to ensure objective comparisons across different AI approaches. Data benchmarks provide insights into data reliability, guiding dataset selection, bias mitigation, and improvements in data-driven AI systems.\n\n\nData benchmarks serve an essential function in understanding AI system behavior under diverse data conditions. Through systematic evaluation, they help identify common failure modes, expose gaps in data coverage, and reveal underlying biases that could impact model behavior in deployment. By providing common frameworks for data evaluation, these benchmarks enable the AI community to systematically improve data quality and address potential issues before deploying systems in production environments. This proactive approach to data quality assessment has become increasingly critical as AI systems take on more complex and consequential tasks across different domains.\n\n\n12.3.4 Community Consensus\nThe proliferation of benchmarks spanning performance, energy efficiency, and domain-specific applications creates a fundamental challenge: establishing industry-wide standards. While early computing benchmarks primarily measured processor speed and memory bandwidth, modern benchmarks evaluate sophisticated aspects of system performance, from power consumption profiles to application-specific capabilities. This evolution in scope and complexity necessitates comprehensive validation and consensus from the computing community, particularly in rapidly evolving fields like machine learning where performance must be evaluated across multiple interdependent dimensions.\nThe lasting impact of a benchmark depends fundamentally on its acceptance by the research community, where technical excellence alone proves insufficient. Benchmarks developed without broad community input often fail to gain traction, frequently missing metrics that leading research groups consider essential. Successful benchmarks emerge through collaborative development involving academic institutions, industry partners, and domain experts. This inclusive approach ensures benchmarks evaluate capabilities most crucial for advancing the field, while balancing theoretical and practical considerations.\nBenchmarks developed through extensive collaboration among respected institutions carry the authority necessary to drive widespread adoption, while those perceived as advancing particular corporate interests face skepticism and limited acceptance. The success of ImageNet demonstrates how sustained community engagement through workshops and challenges establishes long-term viability. 
This community-driven development creates a foundation for formal standardization, where organizations like IEEE and ISO transform these benchmarks into official standards.\nThe standardization process provides crucial infrastructure for benchmark formalization and adoption. IEEE working groups transform community-developed benchmarking methodologies into formal industry standards, establishing precise specifications for measurement and reporting. The IEEE 2416-2019 standard for system power modeling4 exemplifies this process, codifying best practices developed through community consensus. Similarly, ISO/IEC technical committees develop international standards for benchmark validation and certification, ensuring consistent evaluation across global research and industry communities. These organizations bridge the gap between community-driven innovation and formal standardization, providing frameworks that enable reliable comparison of results across different institutions and geographic regions.\n4 IEEE 2416-2019: A standard defining methodologies for parameterized power modeling, enabling system-level power analysis and optimization in electronic design, including AI hardware.Successful community benchmarks establish clear governance structures for managing their evolution. Through rigorous version control systems and detailed change documentation, benchmarks maintain backward compatibility while incorporating new advances. This governance includes formal processes for proposing, reviewing, and implementing changes, ensuring that benchmarks remain relevant while maintaining stability. Modern benchmarks increasingly emphasize reproducibility requirements, incorporating automated verification systems and standardized evaluation environments.\nOpen access accelerates benchmark adoption and ensures consistent implementation. Projects that provide open-source reference implementations, comprehensive documentation, validation suites, and containerized evaluation environments reduce barriers to entry. This standardization enables research groups to evaluate solutions using uniform methods and metrics. Without such coordinated implementation frameworks, organizations might interpret benchmarks inconsistently, compromising result reproducibility and meaningful comparison across studies.\nThe most successful benchmarks strike a careful balance between academic rigor and industry practicality. Academic involvement ensures theoretical soundness and comprehensive evaluation methodology, while industry participation grounds benchmarks in practical constraints and real-world applications. This balance proves particularly crucial in machine learning benchmarks, where theoretical advances must translate to practical improvements in deployed systems (Patterson et al. 2021).\n\nPatterson, David, Joseph Gonzalez, Quoc Le, Chen Liang, Lluis-Miquel Munguia, Daniel Rothchild, David So, Maud Texier, and Jeff Dean. 2021. “Carbon Emissions and Large Neural Network Training.” arXiv Preprint arXiv:2104.10350, April. http://arxiv.org/abs/2104.10350v3.\nCommunity consensus establishes enduring benchmark relevance, while fragmentation impedes scientific progress. Through collaborative development and transparent operation, benchmarks evolve into authoritative standards for measuring advancement. 
The most successful benchmarks in energy efficiency and domain-specific applications share this foundation of community development and governance, demonstrating how collective expertise and shared purpose create lasting impact in rapidly advancing fields.", "crumbs": [ "12  Benchmarking AI" ]