Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

docs: update docs for non-Trial-centric world #10174

Merged
merged 1 commit into from
Nov 1, 2024
Merged

Conversation

rb-determined-ai
Copy link
Contributor

The model debugging guide was completely out-of-date, and needed a near-total rewrite.

Additionally, the Core API user guide had additional details that needed updating, which I missed in my first pass.

Also, there were issues with two examples:

  • the iris example was not configured to train long enough to actually converge, which looks bad for an example

  • The core_api_mnist_pytorch example had a couple show-stopper bugs, so not all of its stages ran at all.

Finally, several examples touched in the searcher-context-removal project needed make fmt applied to them.

@cla-bot cla-bot bot added the cla-signed label Oct 31, 2024
@determined-ci determined-ci requested a review from a team October 31, 2024 18:05
@determined-ci determined-ci added the documentation Improvements or additions to documentation label Oct 31, 2024
Copy link

netlify bot commented Oct 31, 2024

Deploy Preview for determined-ui canceled.

Name Link
🔨 Latest commit 2917e04
🔍 Latest deploy log https://app.netlify.com/sites/determined-ui/deploys/6723ef8c31bb4d00086ca539

Copy link

codecov bot commented Oct 31, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 54.27%. Comparing base (e3c31f0) to head (2917e04).
Report is 21 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main   #10174      +/-   ##
==========================================
- Coverage   58.46%   54.27%   -4.20%     
==========================================
  Files         754     1259     +505     
  Lines      104292   157257   +52965     
  Branches     3642     3643       +1     
==========================================
+ Hits        60978    85355   +24377     
- Misses      43181    71769   +28588     
  Partials      133      133              
Flag Coverage Δ
backend 45.94% <ø> (+2.14%) ⬆️
harness 71.15% <ø> (ø)
web 54.29% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

see 505 files with indirect coverage changes

Copy link
Contributor

@azhou-determined azhou-determined left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice, like the debug-models rewrite.


This step assumes you have ported (converted) your model from code outside of Determined. Otherwise,
skip to :ref:`Step 2 <step2>`.
Determined's training APIs are designed to work both on-cluster and locally (that is, without
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it makes me happy that we can finally say this :')

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

me too

- `Step 9 - Verify that a multi-GPU experiment works`_
- `Step 1 - Verify that your training script runs locally`_
- `Step 2 - Verify that your training script runs in a notebook or shell`_
- `Step 3 - Verify that you can submit a single-GPU experiment`_
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think local GPU/distributed training should also possible now for all high-level training APIs if you use your launcher directly. though this functionality might be a little clunky so i dunno if it's worth calling out.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it isn't what the reader is prolly looking for.

time_metric: epochs
max_time: 20
entrypoint: python3 model_def_adaptive.py
entrypoint: python3 model_def_adaptive.py --epochs 20
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you like this pattern of setting length in entrypoint? i get that the code for this is existing, but generally speaking.

i find it kind of annoying when iterating between local and on cluster modes, having two files to set max length in. is the reasoning that we don't want an extra if cluster_info/else in training code?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

having two files to set max length in

which two files?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i train locally, i change it in train.py. i want to submit it to cluster, i have to change .yaml

- `Step 1 - Verify that your training script runs locally`_
- `Step 2 - Verify that your training script runs in a notebook or shell`_
- `Step 3 - Verify that you can submit a single-GPU experiment`_
- `Step 4 - Verify that you can submit a multi-GPU experiment`_
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wonder if it would make sense to link to the profiling doc in this page, as sort of a "performance debugging" step

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that could make sense. I'm not going to make that change today though.

The model debugging guide was completely out-of-date, and needed a
near-total rewrite.

Additionally, the Core API user guide had additional details that needed
updating, which I missed in my first pass.

Also, there were issues with two examples:

 - the iris example was not configured to train long enough to actually
   converge, which looks bad for an example

 - The core_api_mnist_pytorch example had a couple show-stopper bugs,
   so not all of its stages ran at all.

Finally, several examples touched in the searcher-context-removal
project needed `make fmt` applied to them.
Copy link
Contributor

@lbliii lbliii left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@rb-determined-ai rb-determined-ai merged commit 21b0256 into main Nov 1, 2024
83 of 95 checks passed
@rb-determined-ai rb-determined-ai deleted the rb/scr-fixes branch November 1, 2024 15:56
github-actions bot pushed a commit that referenced this pull request Nov 1, 2024
The model debugging guide was completely out-of-date, and needed a
near-total rewrite.

Additionally, the Core API user guide had additional details that needed
updating, which I missed in my first pass.

Also, there were issues with two examples:

 - the iris example was not configured to train long enough to actually
   converge, which looks bad for an example

 - The core_api_mnist_pytorch example had a couple show-stopper bugs,
   so not all of its stages ran at all.

Finally, several examples touched in the searcher-context-removal
project needed `make fmt` applied to them.

(cherry picked from commit 21b0256)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport release-0.38.0 cla-signed documentation Improvements or additions to documentation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants