docs: update docs for non-Trial-centric world #10174
Conversation
✅ Deploy Preview for determined-ui canceled.
Codecov Report: All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

@@            Coverage Diff             @@
##             main   #10174       +/-   ##
==========================================
- Coverage   58.46%   54.27%    -4.20%
==========================================
  Files         754     1259      +505
  Lines      104292   157257    +52965
  Branches     3642     3643        +1
==========================================
+ Hits        60978    85355    +24377
- Misses      43181    71769    +28588
  Partials      133      133

Flags with carried forward coverage won't be shown.
nice, like the debug-models rewrite.
This step assumes you have ported (converted) your model from code outside of Determined. Otherwise,
skip to :ref:`Step 2 <step2>`.

Determined's training APIs are designed to work both on-cluster and locally (that is, without
it makes me happy that we can finally say this :')
me too
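To illustrate the on-cluster/local duality the reviewers are happy about, here is a minimal sketch of a script that behaves the same either way. The `DET_EXPERIMENT_ID` environment-variable check and the `running_on_cluster` helper are illustrative stand-ins; a real script would call `determined.get_cluster_info()`, which returns `None` when not running on a cluster.

```python
import os


def running_on_cluster() -> bool:
    # Illustrative proxy: Determined sets DET_* variables inside task
    # containers, so their absence suggests a plain local run. A real
    # script would use determined.get_cluster_info() instead.
    return "DET_EXPERIMENT_ID" in os.environ


def pick_mode() -> str:
    # The training code path is identical either way; only bookkeeping
    # (checkpoint/metric reporting) would differ in a real script.
    return "on-cluster" if running_on_cluster() else "local"
```

Run plainly, this reports "local"; inside a Determined task container it would report "on-cluster" without any code change.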
- `Step 9 - Verify that a multi-GPU experiment works`_
- `Step 1 - Verify that your training script runs locally`_
- `Step 2 - Verify that your training script runs in a notebook or shell`_
- `Step 3 - Verify that you can submit a single-GPU experiment`_
i think local GPU/distributed training should also be possible now for all high-level training APIs if you use your launcher directly. though this functionality might be a little clunky, so i dunno if it's worth calling out.
I think it isn't what the reader is probably looking for.
time_metric: epochs
max_time: 20
entrypoint: python3 model_def_adaptive.py
entrypoint: python3 model_def_adaptive.py --epochs 20
do you like this pattern of setting the length in the entrypoint? i get that the code for this is pre-existing, but generally speaking.
i find it kind of annoying when iterating between local and on-cluster modes, having two files to set the max length in. is the reasoning that we don't want an extra if cluster_info/else in the training code?
having two files to set max length in
which two files?
i train locally, i change it in train.py. i want to submit it to the cluster, i have to change the .yaml.
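One way to avoid keeping the max length in two places, along the lines of the `if cluster_info/else` idea raised above, is a single helper that prefers the experiment's hyperparameters on-cluster and falls back to a CLI flag locally. This is a sketch, not Determined's documented pattern: `get_cluster_info` here is a local stand-in for `determined.get_cluster_info()`, and the `epochs` hyperparameter name is an assumption.

```python
import argparse


def get_cluster_info():
    # Stand-in for determined.get_cluster_info(): on a Determined
    # cluster it returns task metadata; locally it returns None.
    return None


def max_epochs(argv=None, default=20):
    info = get_cluster_info()
    if info is not None:
        # On-cluster: read the length from the experiment's
        # hyperparameters, so the .yaml stays the single source of truth.
        return int(info.trial.hparams.get("epochs", default))
    # Local run: take it from the command line instead.
    parser = argparse.ArgumentParser()
    parser.add_argument("--epochs", type=int, default=default)
    return parser.parse_args(argv).epochs
```

With this, `python3 train.py --epochs 5` controls local iteration, while an on-cluster run would pick the length up from the experiment config, and neither file needs editing when switching modes.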
- `Step 1 - Verify that your training script runs locally`_
- `Step 2 - Verify that your training script runs in a notebook or shell`_
- `Step 3 - Verify that you can submit a single-GPU experiment`_
- `Step 4 - Verify that you can submit a multi-GPU experiment`_
wonder if it would make sense to link to the profiling doc in this page, as sort of a "performance debugging" step
I think that could make sense. I'm not going to make that change today though.
b1ab48a to f37b320

The model debugging guide was completely out-of-date and needed a near-total rewrite. Additionally, the Core API user guide had details that needed updating, which I missed in my first pass. There were also issues with two examples:

- the iris example was not configured to train long enough to actually converge, which looks bad for an example
- the core_api_mnist_pytorch example had a couple of show-stopper bugs, so not all of its stages ran at all

Finally, several examples touched in the searcher-context-removal project needed `make fmt` applied to them.
f37b320 to 2917e04
LGTM
(cherry picked from commit 21b0256)