docs: update docs for non-Trial-centric world #10174
Conversation
✅ Deploy Preview for determined-ui canceled.
Codecov Report: All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

@@            Coverage Diff             @@
##             main   #10174       +/-   ##
==========================================
- Coverage   58.46%   54.27%    -4.20%
==========================================
  Files         754     1259      +505
  Lines      104292   157257    +52965
  Branches     3642     3643        +1
==========================================
+ Hits        60978    85355    +24377
- Misses      43181    71769    +28588
  Partials      133      133

Flags with carried forward coverage won't be shown.
nice, like the debug-models rewrite.
This step assumes you have ported (converted) your model from code outside of Determined. Otherwise,
skip to :ref:`Step 2 <step2>`.

Determined's training APIs are designed to work both on-cluster and locally (that is, without
it makes me happy that we can finally say this :')
me too
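To illustrate the on-cluster/local duality the reviewers are happy about, here is a minimal sketch of a script that behaves the same either way. The `DET_EXPERIMENT_ID` environment-variable check and the `running_on_cluster` helper are illustrative stand-ins; a real script would call `determined.get_cluster_info()`, which returns `None` when not running on a cluster.

```python
import os


def running_on_cluster() -> bool:
    # Illustrative proxy: Determined sets DET_* variables inside task
    # containers, so their absence suggests a plain local run. A real
    # script would use determined.get_cluster_info() instead.
    return "DET_EXPERIMENT_ID" in os.environ


def pick_mode() -> str:
    # The training code path is identical either way; only bookkeeping
    # (checkpoint/metric reporting) would differ in a real script.
    return "on-cluster" if running_on_cluster() else "local"
```

Run plainly, this reports "local"; inside a Determined task container it would report "on-cluster" without any code change.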
- `Step 9 - Verify that a multi-GPU experiment works`_
- `Step 1 - Verify that your training script runs locally`_
- `Step 2 - Verify that your training script runs in a notebook or shell`_
- `Step 3 - Verify that you can submit a single-GPU experiment`_
i think local GPU/distributed training should also be possible now for all high-level training APIs if you use your launcher directly. though this functionality might be a little clunky, so i dunno if it's worth calling out.
I think it isn't what the reader is probably looking for.
time_metric: epochs
max_time: 20
entrypoint: python3 model_def_adaptive.py
entrypoint: python3 model_def_adaptive.py --epochs 20
do you like this pattern of setting the length in the entrypoint? i get that the code for this is pre-existing, but generally speaking.
i find it kind of annoying when iterating between local and on-cluster modes, having two files to set the max length in. is the reasoning that we don't want an extra if cluster_info/else in the training code?
having two files to set max length in
which two files?
i train locally, i change it in train.py. i want to submit it to the cluster, i have to change the .yaml.
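One way to avoid keeping the max length in two places, along the lines of the `if cluster_info/else` idea raised above, is a single helper that prefers the experiment's hyperparameters on-cluster and falls back to a CLI flag locally. This is a sketch, not Determined's documented pattern: `get_cluster_info` here is a local stand-in for `determined.get_cluster_info()`, and the `epochs` hyperparameter name is an assumption.

```python
import argparse


def get_cluster_info():
    # Stand-in for determined.get_cluster_info(): on a Determined
    # cluster it returns task metadata; locally it returns None.
    return None


def max_epochs(argv=None, default=20):
    info = get_cluster_info()
    if info is not None:
        # On-cluster: read the length from the experiment's
        # hyperparameters, so the .yaml stays the single source of truth.
        return int(info.trial.hparams.get("epochs", default))
    # Local run: take it from the command line instead.
    parser = argparse.ArgumentParser()
    parser.add_argument("--epochs", type=int, default=default)
    return parser.parse_args(argv).epochs
```

With this, `python3 train.py --epochs 5` controls local iteration, while an on-cluster run would pick the length up from the experiment config, and neither file needs editing when switching modes.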
- `Step 1 - Verify that your training script runs locally`_
- `Step 2 - Verify that your training script runs in a notebook or shell`_
- `Step 3 - Verify that you can submit a single-GPU experiment`_
- `Step 4 - Verify that you can submit a multi-GPU experiment`_
wonder if it would make sense to link to the profiling doc in this page, as sort of a "performance debugging" step
I think that could make sense. I'm not going to make that change today though.
b1ab48a to f37b320

The model debugging guide was completely out-of-date and needed a near-total rewrite. Additionally, the Core API user guide had details that needed updating, which I missed in my first pass. There were also issues with two examples:

- the iris example was not configured to train long enough to actually converge, which looks bad for an example
- the core_api_mnist_pytorch example had a couple of show-stopper bugs, so not all of its stages ran at all

Finally, several examples touched in the searcher-context-removal project needed `make fmt` applied to them.
f37b320 to 2917e04
LGTM
(cherry picked from commit 21b0256)