Documentation updates for latest release to include introduction of Flux
support and use of VictoriaMetrics during user-mode (the new default).

Signed-off-by: Karl W. Schulz <[email protected]>
koomie committed Jan 7, 2025
1 parent e1c0b26 commit 193c256
Showing 2 changed files with 79 additions and 32 deletions.
73 changes: 57 additions & 16 deletions docs/installation/user-mode.md
@@ -6,7 +6,7 @@
:maxdepth: 4
```

In user-mode executions, Omnistat data collectors and a companion Prometheus
In user-mode executions, Omnistat data collectors and a companion VictoriaMetrics
server are deployed temporarily on hosts assigned to a user's job, as
highlighted in {numref}`fig-user-mode`. The following assumptions are made
throughout the rest of this user-mode installation discussion:
@@ -15,7 +15,7 @@ __Assumptions__:
* [ROCm](https://rocm.docs.amd.com/en/latest/) v6.1 or newer is pre-installed
on all GPU hosts.
* Installer has access to a distributed file-system; if no distributed
file-system is present, installation steps need to be repeated in all nodes.
file-system is present, installation steps need to be repeated across all nodes.


## Omnistat software installation
@@ -48,8 +48,8 @@ directory of the release.
[user@login]$ ~/venv/omnistat/bin/python -m pip install .[query]
```

3. Download Prometheus. If a `prometheus` server is not already present on the system,
download and extract a [precompiled binary](https://prometheus.io/download/). This binary can generally be stored in any directory accessible by the user, but the path to the binary will need to be known during the next section when configuring user-mode execution.
3. Download a **single-node** VictoriaMetrics server. Assuming a `victoria-metrics` server is not already present on the system,
download and extract a [precompiled binary](https://github.com/VictoriaMetrics/VictoriaMetrics/releases/latest) from upstream. This binary can generally be stored in any directory accessible by the user, but the path to the binary will be needed in the next section when configuring user-mode execution. Note that VictoriaMetrics provides a large number of binary releases; we typically use the `victoria-metrics-linux-amd64` variant on x86_64 clusters.
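
   As a point of reference, a minimal download sketch is included below. This is only an illustration: the release version, download directory, and tarball name are placeholders and should be adjusted to match the latest single-node release for your architecture.

   ```bash
   # Illustrative only: choose the desired single-node release for your platform
   VM_VERSION=v1.102.0        # placeholder version - check the releases page for the latest
   mkdir -p ~/victoria-metrics && cd ~/victoria-metrics
   wget https://github.com/VictoriaMetrics/VictoriaMetrics/releases/download/${VM_VERSION}/victoria-metrics-linux-amd64-${VM_VERSION}.tar.gz
   tar xfz victoria-metrics-linux-amd64-${VM_VERSION}.tar.gz
   # The extracted single-node binary is typically named victoria-metrics-prod;
   # note its full path for the configuration step in the next section.
   ./victoria-metrics-prod --version
   ```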

## Configuring user-mode Omnistat

@@ -71,36 +71,77 @@ For user-mode execution, Omnistat includes additional options in the `[omnistat.usermode]` section:
[omnistat.usermode]
ssh_key = ~/.ssh/id_rsa
prometheus_binary = /path/to/prometheus
prometheus_datadir = data_prom
prometheus_logfile = prom_server.log
victoria_binary = /path/to/victoria-metrics
victoria_datadir = data_prom
victoria_logfile = vic_server.log
push_frequency_mins = 5
```
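
As a quick sanity check before submitting jobs, it can be useful to confirm that the Omnistat utilities will see the customized configuration and that the configured VictoriaMetrics binary is executable. The paths below are placeholders mirroring the configuration example above.

```bash
# Placeholder paths - adjust to match your omnistat.config settings
export OMNISTAT_CONFIG=/path/to/omnistat.config
/path/to/victoria-metrics --version
```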

## Running a SLURM Job
## Running Jobs

In the SLURM job script, add the following lines to start and stop the data
collection before and after running the application. Lines highlighted in
yellow need to be customized for different installation paths.
To enable user-mode data collection for a specific job, add logic within your job script to start and stop the collection mechanism before and after running your desired application(s). Omnistat includes an `omnistat-usermode` utility to help automate this process, and the examples below highlight the steps for simple SLURM and Flux job scripts. Note that the lines highlighted in
yellow need to be customized for the local installation path.


### SLURM example
```eval_rst
.. code-block:: bash
:emphasize-lines: 1-2
:caption: SLURM job file using user-mode Omnistat with a 10 second sampling interval
:emphasize-lines: 6-7
:caption: Example SLURM job file using user-mode Omnistat with a 10 second sampling interval
#!/bin/bash
#SBATCH -N 8
#SBATCH -n 16
#SBATCH -t 02:00:00
export OMNISTAT_CONFIG=/path/to/omnistat.config
export OMNISTAT_DIR=/path/to/omnistat
# Start data collector
# Beginning of job - start data collector
${OMNISTAT_DIR}/omnistat-usermode --start --interval 10
# Run application(s) as normal
srun <options> ./a.out
# End of job - generate summary report and stop data collection
# End of job - stop data collection, generate summary and store collected data by jobid
${OMNISTAT_DIR}/omnistat-usermode --stopexporters
${OMNISTAT_DIR}/omnistat-query --job ${SLURM_JOB_ID} --interval 10
${OMNISTAT_DIR}/omnistat-usermode --stop
${OMNISTAT_DIR}/omnistat-usermode --stopserver
mv data_prom data_prom_${SLURM_JOB_ID}
```

### Flux example

```eval_rst
.. code-block:: bash
:emphasize-lines: 8-9
:caption: Example Flux job file using user-mode Omnistat with a 1 second sampling interval
#!/bin/bash
#flux: -N 8
#flux: -n 16
#flux: -t 2h
jobid=`flux getattr jobid`
export OMNISTAT_CONFIG=/path/to/omnistat.config
export OMNISTAT_DIR=/path/to/omnistat
# Beginning of job - start data collector
${OMNISTAT_DIR}/omnistat-usermode --start --interval 1
# Run application(s) as normal
flux run <options> ./a.out
# End of job - stop data collection, generate summary and store collected data by jobid
${OMNISTAT_DIR}/omnistat-usermode --stopexporters
${OMNISTAT_DIR}/omnistat-query --job ${jobid} --interval 1
${OMNISTAT_DIR}/omnistat-usermode --stopserver
mv data_prom data_prom.${jobid}
```

In both examples above, the `omnistat-query` utility is used at the end of the job to query the collected telemetry for the assigned jobid (prior to shutting down the server). This should produce a summary report card for the job, similar to the [report card](query_report_card) example shown in the Overview, directly within the recorded job output.

## Exploring results with a local Docker environment

To explore results generated for user-mode executions of Omnistat, we provide
38 changes: 22 additions & 16 deletions docs/introduction.md
@@ -25,15 +25,15 @@ Omnistat provides a set of utilities to aid cluster administrators or individual
* GPU type
* GPU vBIOS version

To enable scalable collection of these metrics, Omnistat provides a python-based [Prometheus](https://prometheus.io) client that supplies instantaneous metric values on-demand for periodic polling by a companion Prometheus server.
To enable scalable collection of these metrics, Omnistat provides a Python-based [Prometheus](https://prometheus.io) client that supplies instantaneous metric values on demand for periodic polling by a companion Prometheus server (or a [VictoriaMetrics](https://github.com/VictoriaMetrics/VictoriaMetrics) server).
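
For illustration, the exporter endpoint can also be polled manually with a tool like `curl`. The port below is purely a placeholder; use whatever port is configured for the Omnistat data collector in your installation.

```bash
# Illustrative only: manually scrape the Omnistat exporter on one host
# (replace 8001 with the port configured for your installation)
curl -s http://localhost:8001/metrics | head
```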

(user-vs-system)=
## User-mode vs System-level monitoring

Omnistat utilities can be deployed with two primary use-cases in mind that differ based on the end-consumer and whether the user has administrative rights or not. The use cases are denoted as follows:

1. __System-wide monitoring__: requires administrative rights and is typically used to monitor all GPU hosts within a given cluster in a 24x7 mode of operation. Use this approach to support system-wide telemetry collection for all user workloads and optionally, provide job-level insights for systems running the [SLURM](https://slurm.schedmd.com) workload manager.
1. __User-mode monitoring__: does not require administrative rights and can be run entirely within user-space. This case is typically exercised by end application users running on production SLURM clusters who want to gather telemetry data within a single SLURM job allocation. Frequently, this approach is performed entirely within a command-line `ssh` environment but Omnistat includes support for downloading data after a job for visualization with a dockerized Grafana environment. Alternatively, standalone query utilities can be used to summarize collected metrics at the conclusion of a SLURM job.
1. __System-wide monitoring__: requires administrative rights and is typically used to monitor all GPU hosts within a given cluster in a 24x7 mode of operation. Use this approach to support system-wide telemetry collection for all user workloads and optionally, provide job-level insights for systems running the [SLURM](https://slurm.schedmd.com) or [Flux](https://flux-framework.org) workload managers.
1. __User-mode monitoring__: does not require administrative rights and can be run entirely within user-space. This case is typically exercised by application end-users who want to gather telemetry data within a single job allocation on production clusters managed by a resource manager. Frequently, this approach is performed entirely within a command-line `ssh` environment, but Omnistat includes support for downloading data after a job for visualization with a dockerized Grafana environment. Alternatively, standalone query utilities can be used to summarize collected metrics at the conclusion of a job. Resource managers supported by user-mode Omnistat include both [SLURM](https://github.com/SchedMD/slurm) and [Flux](https://flux-framework.org).

To demonstrate the overall data collection architecture employed by Omnistat in these two modes of operation, the following diagrams highlight the data collector layout and life-cycle for both cases.

@@ -53,18 +53,24 @@ In the __system-wide monitoring__ case, a system administrator enables data coll
In addition to enabling GPU metrics collection in the __system-wide monitoring__ case, sites may also wish to collect host-side metrics (CPU load, memory usage, etc). Other open-source Prometheus collectors exist for this purpose and we recommend enabling the [node-exporter](https://github.com/prometheus/node_exporter) in combination with Omnistat.
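
As a rough sketch only (the release version and paths below are placeholders, and production deployments would normally use a systemd unit or the site's configuration management rather than a backgrounded process), the node-exporter can be fetched and launched on a host as follows:

```bash
# Illustrative only: fetch and run node_exporter on one host
NE_VERSION=1.8.2    # placeholder version - check the node_exporter releases page
wget https://github.com/prometheus/node_exporter/releases/download/v${NE_VERSION}/node_exporter-${NE_VERSION}.linux-amd64.tar.gz
tar xfz node_exporter-${NE_VERSION}.linux-amd64.tar.gz
# node_exporter listens on port 9100 by default
./node_exporter-${NE_VERSION}.linux-amd64/node_exporter --web.listen-address=":9100" &
```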
Conversely, in the __user-mode__ case, Omnistat data collector(s) and a companion prometheus server are deployed temporarily on hosts assigned to a user's SLURM job. At the end of the job, Omnistat utilities can query cached telemetry data to summarize GPU utilization details or it can be visualized offline after the job completes. An example command-line summary from this user-mode approach is highlighted as follows:
Conversely, in the __user-mode__ case, Omnistat data collector(s) and a companion VictoriaMetrics server are deployed temporarily on hosts assigned to a user's job. At the end of the job, Omnistat utilities can query the cached telemetry data to summarize GPU utilization details, or the data can be visualized offline after the job completes. An example command-line summary from this user-mode approach is highlighted as follows:
```none
----------------------------------------
Omnistat Report Card for Job # 44092
----------------------------------------
(query_report_card)=
```eval_rst
.. _report_card:
.. code-block:: none
:caption: Example telemetry summary report card in user-mode.
----------------------------------------
Omnistat Report Card for Job # 44092
----------------------------------------
Job Overview (Num Nodes = 1, Machine = Snazzy Cluster)
--> Start time = 2024-05-17 10:14:00
--> End time = 2024-05-17 10:19:00
Job Overview (Num Nodes = 1, Machine = Snazzy Cluster)
--> Start time = 2024-05-17 10:14:00
--> End time = 2024-05-17 10:19:00
GPU Statistics:
GPU Statistics:
| Utilization (%) | Memory Use (%) | Temperature (C) | Power (W) |
GPU # | Max Mean | Max Mean | Max Mean | Max Mean |
@@ -74,9 +80,9 @@
2 | 100.00 55.56 | 94.92 63.28 | 60.00 51.11 | 304.00 176.78 |
3 | 100.00 55.56 | 94.78 63.20 | 58.00 48.89 | 354.00 184.67 |
--
Query execution time = 0.1 secs
Version = 0.2.0
--
Query execution time = 0.1 secs
Version = {__VERSION__}
```

## Software dependencies
@@ -94,7 +100,7 @@ System administrators wishing to deploy a system-wide GPU monitoring capability

## Resource Manager Integration

Omnistat can be _optionally_ configured to map telemetry tracking to specific job Ids when using the popular [SLURM](https://github.com/SchedMD/slurm) resource manager. This is accomplished via enablement of a Prometheus info metric that tracks node-level job assignments and makes the following metadata available to Prometheus:
Omnistat can be _optionally_ configured to map telemetry tracking to specific job IDs when using the popular [SLURM](https://github.com/SchedMD/slurm) resource manager or the [Flux](https://flux-framework.org) framework. This is accomplished by enabling a Prometheus info metric that tracks node-level job assignments and makes the following metadata available:

* job id
* username
