docs: Update log policies
Two new features: improved log search and log signal
tara-hpe committed Oct 23, 2024
1 parent f45ebb9 commit 87c4dec
Showing 2 changed files with 67 additions and 15 deletions.
34 changes: 19 additions & 15 deletions docs/reference/experiment-config-reference.rst
@@ -304,39 +304,43 @@ if at least one of its trials completes without errors. The default value for ``
 ``log_policies``
 ================

-Optional. Defines actions in response to trial logs matching specified regex patterns (Go language
-syntax). For more information about the syntax, you can visit this `RE2 reference page
-<https://github.com/google/re2/wiki/Syntax>`__. Actions include:
+Optional. Defines actions and labels in response to trial logs matching specified regex patterns (Go
+language syntax). For more information about the syntax, you can visit this `RE2 reference page
+<https://github.com/google/re2/wiki/Syntax>`__. Each log policy can have the following fields:

-- ``exclude_node``: Excludes a failed trial's restart attempts (due to its ``max_restarts`` policy)
-  from being scheduled on nodes with matched error logs. This is useful for bypassing nodes with
-  hardware issues, like uncorrectable GPU ECC errors.
+- ``name``: Optional. A name for the log policy. If provided, this name will be displayed as a
+  label in the UI when the log policy matches.

-  Note: This option is not supported on PBS systems.
+- ``pattern``: Required. The regex pattern to match in the logs.

-  For the agent resource manager, if a trial becomes unschedulable due to enough node exclusions,
-  and ``launch_error`` in the master config is true (default), the trial fails.
+- ``action``: Optional. The action to take when the pattern is matched. Actions include:

-- ``cancel_retries``: Prevents a trial from restarting if a trial reports a log that matches the
-  pattern, even if it has remaining ``max_restarts``. This avoids using resources for retrying a
-  trial that encounters certain failures that won't be fixed by retrying the trial, such as CUDA
-  memory issues.
+  - ``exclude_node``: Excludes a failed trial's restart attempts from being scheduled on nodes
+    with matching error logs.
+  - ``cancel_retries``: Prevents a trial from restarting if it reports a matching log.

 Example configuration:

 .. code:: yaml

    log_policies:
-     - pattern: ".*uncorrectable ECC error encountered.*"
+     - name: "ECC Error"
+       pattern: ".*uncorrectable ECC error encountered.*"
        action:
          type: exclude_node
-     - pattern: ".*CUDA out of memory.*"
+     - name: "CUDA OOM"
+       pattern: ".*CUDA out of memory.*"
        action:
          type: cancel_retries

+When a log policy matches, its name (if provided) will be displayed as a label in the WebUI,
+allowing for easy identification of specific issues or events during a run.
+
 These settings may also be specified at the cluster or resource pool level through task container
 defaults.

+To find out more about log management, visit :ref:`Log Management <log-management>`.
+
 .. _log-retention-days:

 ``retention_policy``
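
The updated reference marks ``action`` as optional and ties the UI label to ``name``, which suggests a policy can be used purely for labeling. A minimal sketch of such a label-only policy follows, assuming that omitting ``action`` leaves the trial's restart behavior unchanged. The policy name and pattern are hypothetical and are not part of this commit:

.. code:: yaml

   log_policies:
     # Hypothetical label-only policy: no ``action`` is given, so (by assumption)
     # a match would only attach the "NCCL Timeout" label in the WebUI.
     - name: "NCCL Timeout"
       pattern: ".*NCCL timeout.*"

Because ``pattern`` is the only required field, it is the one key that cannot be dropped from an entry like this.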
48 changes: 48 additions & 0 deletions docs/tutorials/log-management.rst
@@ -0,0 +1,48 @@
+.. _log-management:
+
+#################
+Log Management
+#################
+
+This guide covers two log management features: Log Search and Log Signal.
+
+*************
+Log Search
+*************
+
+To perform a log search:
+
+1. Navigate to your run in the WebUI.
+2. In the Logs tab, start typing in the search box to open the search pane.
+3. To use regex search, click the "Regex" checkbox in the search pane.
+4. Click on a search result to view it in context, with logs before and after visible.
+5. Scroll up and down to fetch new logs.
+
+Note: Search results are not auto-updating. You may need to refresh to see new logs.
+
+***********
+Log Signal
+***********
+
+Log Signal allows you to configure log policies in the master configuration to display labels in the UI when specific patterns are matched in the logs.
+
+To set up a log policy:
+
+1. In the master configuration file, under ``task_container_defaults > log_policies``, define your log policies.
+2. Each policy can have a ``name``, ``pattern``, and ``action``.
+3. When a log matching the pattern is encountered, the ``name`` will be displayed as a label in the run table and run detail views.
+
+Example configuration:
+
+.. code:: yaml
+
+   log_policies:
+     - name: "CUDA OOM"
+       pattern: ".*CUDA out of memory.*"
+       action:
+         type: cancel_retries
+
+This will display a "CUDA OOM" label in the UI when a CUDA out of memory error is encountered in the logs.
+
+For more detailed information on configuring log policies, refer to the :ref:`experiment configuration reference <config-log-policies>`.
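
Step 1 of the new tutorial places policies under ``task_container_defaults > log_policies`` in the master configuration, while the tutorial's example shows only the ``log_policies`` key. A sketch of the nested form follows, assuming ordinary YAML nesting and that no other sibling keys are required. The rest of the master configuration is omitted:

.. code:: yaml

   # Master configuration fragment (assumed layout; other master settings omitted).
   task_container_defaults:
     log_policies:
       - name: "CUDA OOM"
         pattern: ".*CUDA out of memory.*"
         action:
           type: cancel_retries

Defined this way, the policy would act as a cluster-wide default, which matches the reference's statement that these settings may be specified at the cluster or resource pool level through task container defaults.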
