diff --git a/docs/installation/user-mode.md b/docs/installation/user-mode.md
index bb25e76..eead958 100644
--- a/docs/installation/user-mode.md
+++ b/docs/installation/user-mode.md
@@ -6,7 +6,7 @@
 :maxdepth: 4
 ```
 
-In user-mode executions, Omnistat data collectors and a companion Prometheus
+In user-mode executions, Omnistat data collectors and a companion VictoriaMetrics
 server are deployed temporarily on hosts assigned to a user's job, as
 highlighted in {numref}`fig-user-mode`. The following assumptions are made
 throughout the rest of this user-mode installation discussion:
@@ -15,7 +15,7 @@ __Assumptions__:
 * [ROCm](https://rocm.docs.amd.com/en/latest/) v6.1 or newer is pre-installed
   on all GPU hosts.
 * Installer has access to a distributed file-system; if no distributed
-  file-system is present, installation steps need to be repeated in all nodes.
+  file-system is present, installation steps need to be repeated across all nodes.
 
 ## Omnistat software installation
 
@@ -48,8 +48,8 @@ directory of the release.
    [user@login]$ ~/venv/omnistat/bin/python -m pip install .[query]
    ```
 
-3. Download Prometheus. If a `prometheus` server is not already present on the system,
-   download and extract a [precompiled binary](https://prometheus.io/download/). This binary can generally be stored in any directory accessible by the user, but the path to the binary will need to be known during the next section when configuring user-mode execution.
+3. Download a **single-node** VictoriaMetrics server. Assuming a `victoria-metrics` server is not already present on the system,
+   download and extract a [precompiled binary](https://github.com/VictoriaMetrics/VictoriaMetrics/releases/latest) from upstream. This binary can generally be stored in any directory accessible by the user, but the path to the binary will be needed in the next section when configuring user-mode execution. Note that VictoriaMetrics provides a large number of binary releases; we typically use the `victoria-metrics-linux-amd64` variant on x86_64 clusters, as illustrated in the example below.
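+
+   As a point of reference, a typical download and extraction sequence resembles the
+   following sketch; the release version shown is illustrative only, so substitute the
+   latest available release:
+
+   ```
+   [user@login]$ wget https://github.com/VictoriaMetrics/VictoriaMetrics/releases/download/v1.102.1/victoria-metrics-linux-amd64-v1.102.1.tar.gz
+   [user@login]$ tar xfz victoria-metrics-linux-amd64-v1.102.1.tar.gz
+   # the extracted binary (e.g. victoria-metrics-prod) provides the path referenced
+   # in the next section
+   ```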
 
 ## Configuring user-mode Omnistat
 
@@ -71,36 +71,77 @@ For user-mode execution, Omnistat includes additional options in the `[omnistat.usermode]
 
    [omnistat.usermode]
    ssh_key = ~/.ssh/id_rsa
-   prometheus_binary = /path/to/prometheus
-   prometheus_datadir = data_prom
-   prometheus_logfile = prom_server.log
+   victoria_binary = /path/to/victoria-metrics
+   victoria_datadir = data_prom
+   victoria_logfile = vic_server.log
+   push_frequency_mins = 5
 ```
 
-## Running a SLURM Job
+## Running Jobs
 
-In the SLURM job script, add the following lines to start and stop the data
-collection before and after running the application. Lines highlighted in
-yellow need to be customized for different installation paths.
+To enable user-mode data collection for a specified job, add logic within your job script to start and stop the collection mechanism before and after running your desired application(s). Omnistat includes an `omnistat-usermode` utility to help automate this process, and the examples below highlight the steps for simple SLURM and Flux job scripts. Note that the lines highlighted in
+yellow need to be customized for the local installation path.
+
+### SLURM example
 
 ```eval_rst
 .. code-block:: bash
-   :emphasize-lines: 1-2
-   :caption: SLURM job file using user-mode Omnistat with a 10 second sampling interval
+   :emphasize-lines: 6-7
+   :caption: Example SLURM job file using user-mode Omnistat with a 10 second sampling interval
+
+   #!/bin/bash
+   #SBATCH -N 8
+   #SBATCH -n 16
+   #SBATCH -t 02:00:00
+
    export OMNISTAT_CONFIG=/path/to/omnistat.config
    export OMNISTAT_DIR=/path/to/omnistat
 
-   # Start data collector
+   # Beginning of job - start data collector
    ${OMNISTAT_DIR}/omnistat-usermode --start --interval 10
 
    # Run application(s) as normal
    srun ./a.out
 
-   # End of job - generate summary report and stop data collection
+   # End of job - stop data collection, generate summary and store collected data by jobid
+   ${OMNISTAT_DIR}/omnistat-usermode --stopexporters
    ${OMNISTAT_DIR}/omnistat-query --job ${SLURM_JOB_ID} --interval 10
-   ${OMNISTAT_DIR}/omnistat-usermode --stop
+   ${OMNISTAT_DIR}/omnistat-usermode --stopserver
+   mv data_prom data_prom_${SLURM_JOB_ID}
+   ```
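+
+The script above is submitted like any other batch job; a minimal sketch of the submission follows (the script name and job id shown are illustrative, and SLURM's default `slurm-<jobid>.out` output naming is assumed). The summary generated by `omnistat-query` is recorded directly in the job's standard output file:
+
+```
+[user@login]$ sbatch omnistat_job.sh
+Submitted batch job 44092
+[user@login]$ grep -A4 "Omnistat Report Card" slurm-44092.out
+```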
+
+### Flux example
+
+```eval_rst
+.. code-block:: bash
+   :emphasize-lines: 8-9
+   :caption: Example Flux job file using user-mode Omnistat with a 1 second sampling interval
+
+   #!/bin/bash
+   #flux: -N 8
+   #flux: -n 16
+   #flux: -t 2h
+
+   jobid=`flux getattr jobid`
+
+   export OMNISTAT_CONFIG=/path/to/omnistat.config
+   export OMNISTAT_DIR=/path/to/omnistat
+
+   # Beginning of job - start data collector
+   ${OMNISTAT_DIR}/omnistat-usermode --start --interval 1
+
+   # Run application(s) as normal
+   flux run ./a.out
+
+   # End of job - stop data collection, generate summary and store collected data by jobid
+   ${OMNISTAT_DIR}/omnistat-usermode --stopexporters
+   ${OMNISTAT_DIR}/omnistat-query --job ${jobid} --interval 1
+   ${OMNISTAT_DIR}/omnistat-usermode --stopserver
+   mv data_prom data_prom.${jobid}
 ```
 
+In both examples above, the `omnistat-query` utility is used at the end of the job to query the collected telemetry (prior to shutting down the server) for the assigned jobid. This produces a summary report card for the job, similar to the [report card](query_report_card) example shown in the Overview, directly within the recorded job output.
+
 ## Exploring results with a local Docker environment
 
 To explore results generated for user-mode executions of Omnistat, we provide
diff --git a/docs/introduction.md b/docs/introduction.md
index 4ac3eeb..68c25ce 100644
--- a/docs/introduction.md
+++ b/docs/introduction.md
@@ -25,15 +25,15 @@ Omnistat provides a set of utilities to aid cluster administrators or individual
 * GPU type
 * GPU vBIOS version
 
-To enable scalable collection of these metrics, Omnistat provides a python-based [Prometheus](https://prometheus.io) client that supplies instantaneous metric values on-demand for periodic polling by a companion Prometheus server.
+To enable scalable collection of these metrics, Omnistat provides a python-based [Prometheus](https://prometheus.io) client that supplies instantaneous metric values on-demand for periodic polling by a companion Prometheus server (or a [VictoriaMetrics](https://github.com/VictoriaMetrics/VictoriaMetrics) server).
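+
+As a simple illustration of this pull-based model, the metrics endpoint exposed on a compute node can also be polled manually. The host name and port below are placeholders only (the listening port is governed by the Omnistat configuration):
+
+```
+# hypothetical example: manually scrape the metrics endpoint exposed on one node
+[user@login]$ curl -s http://node001:8001/metrics | head
+```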
 
 (user-vs-system)=
 ## User-mode vs System-level monitoring
 
 Omnistat utilities can be deployed with two primary use-cases in mind that differ based on the end-consumer and whether the user has administrative rights or not. The use cases are denoted as follows:
 
-1. __System-wide monitoring__: requires administrative rights and is typically used to monitor all GPU hosts within a given cluster in a 24x7 mode of operation. Use this approach to support system-wide telemetry collection for all user workloads and optionally, provide job-level insights for systems running the [SLURM](https://slurm.schedmd.com) workload manager.
-1. __User-mode monitoring__: does not require administrative rights and can be run entirely within user-space. This case is typically exercised by end application users running on production SLURM clusters who want to gather telemetry data within a single SLURM job allocation. Frequently, this approach is performed entirely within a command-line `ssh` environment but Omnistat includes support for downloading data after a job for visualization with a dockerized Grafana environment. Alternatively, standalone query utilities can be used to summarize collected metrics at the conclusion of a SLURM job.
+1. __System-wide monitoring__: requires administrative rights and is typically used to monitor all GPU hosts within a given cluster in a 24x7 mode of operation. Use this approach to support system-wide telemetry collection for all user workloads and, optionally, provide job-level insights for systems running the [SLURM](https://slurm.schedmd.com) or [Flux](https://flux-framework.org) workload managers.
+1. __User-mode monitoring__: does not require administrative rights and can be run entirely within user-space. This case is typically exercised by end application users who want to gather telemetry data within a single job allocation on production clusters managed by a resource manager. Frequently, this approach is performed entirely within a command-line `ssh` environment, but Omnistat includes support for downloading data after a job for visualization with a dockerized Grafana environment. Alternatively, standalone query utilities can be used to summarize collected metrics at the conclusion of a job. Resource managers supported by user-mode Omnistat include both [SLURM](https://github.com/SchedMD/slurm) and [Flux](https://flux-framework.org).
 
 To demonstrate the overall data collection architecture employed by Omnistat in these two modes of operation, the following diagrams highlight the data collector layout and life-cycle for both cases.
 
@@ -53,18 +53,24 @@ In the __system-wide monitoring__ case, a system administrator enables data coll
 
 In addition to enabling GPU metrics collection in the __system-wide monitoring__ case, sites may also wish to collect host-side metrics (CPU load, memory usage, etc). Other open-source Prometheus collectors exist for this purpose and we recommend enabling the [node-exporter](https://github.com/prometheus/node_exporter) in combination with Omnistat.
 
-Conversely, in the __user-mode__ case, Omnistat data collector(s) and a companion prometheus server are deployed temporarily on hosts assigned to a user's SLURM job. At the end of the job, Omnistat utilities can query cached telemetry data to summarize GPU utilization details or it can be visualized offline after the job completes. An example command-line summary from this user-mode approach is highlighted as follows:
+Conversely, in the __user-mode__ case, Omnistat data collector(s) and a companion VictoriaMetrics server are deployed temporarily on hosts assigned to a user's job. At the end of the job, Omnistat utilities can query the cached telemetry data to summarize GPU utilization details, or the data can be visualized offline after the job completes. An example command-line summary from this user-mode approach is highlighted as follows:
 
-```none
-----------------------------------------
-Omnistat Report Card for Job # 44092
-----------------------------------------
+(query_report_card)=
+```eval_rst
+.. _report_card:
+
+.. code-block:: none
+   :caption: Example telemetry summary report card in user-mode.
+
+   ----------------------------------------
+   Omnistat Report Card for Job # 44092
+   ----------------------------------------
 
-Job Overview (Num Nodes = 1, Machine = Snazzy Cluster)
- --> Start time = 2024-05-17 10:14:00
- --> End time = 2024-05-17 10:19:00
+   Job Overview (Num Nodes = 1, Machine = Snazzy Cluster)
+   --> Start time = 2024-05-17 10:14:00
+   --> End time = 2024-05-17 10:19:00
 
-GPU Statistics:
+   GPU Statistics:
 
        | Utilization (%)  | Memory Use (%)  | Temperature (C) | Power (W)       |
 GPU #  | Max      Mean    | Max      Mean   | Max      Mean   | Max      Mean   |
@@ -74,9 +80,9 @@ GPU Statistics:
 2      | 100.00   55.56   | 94.92    63.28  | 60.00    51.11  | 304.00   176.78 |
 3      | 100.00   55.56   | 94.78    63.20  | 58.00    48.89  | 354.00   184.67 |
 
---
-Query execution time = 0.1 secs
-Version = 0.2.0
+   --
+   Query execution time = 0.1 secs
+   Version = {__VERSION__}
 ```
 
 ## Software dependencies
 
@@ -94,7 +100,7 @@ System administrators wishing to deploy a system-wide GPU monitoring capability
 
 ## Resource Manager Integration
 
-Omnistat can be _optionally_ configured to map telemetry tracking to specific job Ids when using the popular [SLURM](https://github.com/SchedMD/slurm) resource manager. This is accomplished via enablement of a Prometheus info metric that tracks node-level job assignments and makes the following metadata available to Prometheus:
+Omnistat can be _optionally_ configured to map telemetry tracking to specific job IDs when using the popular [SLURM](https://github.com/SchedMD/slurm) resource manager or the new [Flux](https://flux-framework.org) framework. This is accomplished via enablement of a Prometheus info metric that tracks node-level job assignments and makes the following metadata available:
 
 * job id
 * username