Updated Tuner #26
Changes from 7 commits
```diff
@@ -1,4 +1,5 @@
 import os.path as osp
+from logging import getLogger
 from typing import Any

 import lightning.pytorch as pl
@@ -13,6 +14,8 @@

 from .core import Core

+logger = getLogger(__name__)
+

 class Tuner(Core):
     def __init__(self, cfg: str | dict, args: list[str] | tuple[str, ...] | None):
@@ -30,8 +33,23 @@ def __init__(self, cfg: str | dict, args: list[str] | tuple[str, ...] | None):
             raise ValueError("You have to specify the `tuner` section in config.")
         self.tune_cfg = self.cfg.tuner

+        # Parent tracker that only logs the best study parameters at the end
+        rank = rank_zero_only.rank
+        cfg_tracker = self.cfg.tracker
+        tracker_params = cfg_tracker.model_dump()
+        self.parent_tracker = LuxonisTrackerPL(
+            rank=rank,
+            mlflow_tracking_uri=self.cfg.ENVIRON.MLFLOW_TRACKING_URI,
+            is_sweep=False,
+            **tracker_params,
+        )
+        if self.parent_tracker.is_mlflow:
+            run = self.parent_tracker.experiment["mlflow"].active_run()
+            self.parent_run_id = run.info.run_id
```
**Review thread on `self.parent_run_id = run.info.run_id`:**

> Is this used somewhere? What is this set to if MLFlow is not used?

> If MLFlow is not used, this first tracker will still report the best hyperparameters, but the nesting won't be done because it is not supported by e.g. TensorBoard. So every trial will have its own TensorBoard logs with no connection between them. That's why using MLFlow is recommended when tuning.

> Ok, sounds good, not critical. I think it looks good; maybe we just add a comment in the code to note this is required for MLFlow to work correctly, then we can merge.

> I figured it out: the way the tracker is written, it creates the actual MLFlow run once it is interacted with. And since in our case the parent tracker needs to be created first (to then nest all child runs under it), we need to keep this line in the code. I added a note to the code.

> Is this specific to "mlflow" then, or should this be present for other underlying trackers as well? @klemen1999

> It's the case for all trackers. LuxonisTracker is structured very similarly to other default PL loggers (e.g. MLFlowLogger), where the actual objects are created when …

> So should we always call this instead of only on: …

> Sounds good, I think we can proceed with the merge in this case.
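The lazy-creation behaviour described in this thread is common to PL-style loggers. Below is a minimal sketch of the pattern, assuming a plain `mlflow` backend; the `LazyTracker` class and its names are hypothetical illustrations, not the actual `LuxonisTrackerPL` implementation:

```python
import mlflow


class LazyTracker:
    """Hypothetical sketch: the MLflow run is created on first access
    to `experiment`, not in __init__ (mirroring PL-style loggers)."""

    def __init__(self, tracking_uri: str):
        self._tracking_uri = tracking_uri
        self._run = None

    @property
    def experiment(self):
        if self._run is None:
            mlflow.set_tracking_uri(self._tracking_uri)
            # The actual MLflow run only comes into existence here
            self._run = mlflow.start_run()
        return self._run


# Instantiating the tracker does NOT create a run; touching
# `experiment` does. This is why the parent tracker has to be
# interacted with right after construction, before any child run.
parent = LazyTracker("file:./mlruns")
run = parent.experiment  # run is created on this line
print(run.info.run_id)
mlflow.end_run()
```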
```diff
+
     def tune(self) -> None:
         """Runs Optuna tuning of hyperparameters."""
+        logger.info("Starting tuning...")

         pruner = (
             optuna.pruners.MedianPruner()
@@ -57,7 +75,7 @@ def tune(self) -> None:
             storage=storage,
             direction="minimize",
             pruner=pruner,
-            load_if_exists=True,
+            load_if_exists=self.tune_cfg.continue_existing_study,
         )

         study.optimize(
@@ -66,25 +84,30 @@
             timeout=self.tune_cfg.timeout,
         )

+        best_study_params = study.best_params
+        logger.info(f"Best study parameters: {best_study_params}")
+        self.parent_tracker.log_hyperparams(best_study_params)
+
     def _objective(self, trial: optuna.trial.Trial) -> float:
         """Objective function used to optimize Optuna study."""
         rank = rank_zero_only.rank
         cfg_tracker = self.cfg.tracker
         tracker_params = cfg_tracker.model_dump()
-        tracker = LuxonisTrackerPL(
+        child_tracker = LuxonisTrackerPL(
             rank=rank,
             mlflow_tracking_uri=self.cfg.ENVIRON.MLFLOW_TRACKING_URI,
             is_sweep=True,
             **tracker_params,
         )
-        run_save_dir = osp.join(cfg_tracker.save_directory, tracker.run_name)
+
+        run_save_dir = osp.join(cfg_tracker.save_directory, child_tracker.run_name)

         curr_params = self._get_trial_params(trial)
         curr_params["model.predefined_model"] = None
         Config.clear_instance()
         cfg = Config.get_config(self.cfg.model_dump(), curr_params)

-        tracker.log_hyperparams(curr_params)
+        child_tracker.log_hyperparams(curr_params)

         cfg.save_data(osp.join(run_save_dir, "config.yaml"))

@@ -94,14 +117,15 @@ def _objective(self, trial: optuna.trial.Trial) -> float:
             save_dir=run_save_dir,
             input_shape=self.loader_train.input_shape,
         )
-        pruner_callback = PyTorchLightningPruningCallback(
-            trial, monitor="val_loss/loss"
-        )
         callbacks: list[pl.Callback] = (
             [LuxonisProgressBar()] if self.cfg.use_rich_text else []
         )
+        pruner_callback = PyTorchLightningPruningCallback(trial, monitor="val/loss")
         callbacks.append(pruner_callback)
+
+        tracker_end_run = TrackerEndRun()
+        callbacks.append(tracker_end_run)

         deterministic = False
         if self.cfg.trainer.seed:
             pl.seed_everything(cfg.trainer.seed, workers=True)
@@ -111,7 +135,7 @@ def _objective(self, trial: optuna.trial.Trial) -> float:
             accelerator=cfg.trainer.accelerator,
             devices=cfg.trainer.devices,
             strategy=cfg.trainer.strategy,
-            logger=tracker,  # type: ignore
+            logger=child_tracker,  # type: ignore
             max_epochs=cfg.trainer.epochs,
             accumulate_grad_batches=cfg.trainer.accumulate_grad_batches,
             check_val_every_n_epoch=cfg.trainer.validation_interval,
@@ -121,12 +145,20 @@ def _objective(self, trial: optuna.trial.Trial) -> float:
             deterministic=deterministic,
         )

-        pl_trainer.fit(
-            lightning_module,  # type: ignore
-            self.pytorch_loader_train,
-            self.pytorch_loader_val,
-        )
-        pruner_callback.check_pruned()
+        try:
+            pl_trainer.fit(
+                lightning_module,  # type: ignore
+                self.pytorch_loader_train,
+                self.pytorch_loader_val,
+            )
+
+            pruner_callback.check_pruned()
+
+        except optuna.TrialPruned as e:
+            # Pruning is done by raising an error.
+            # When .fit() errors out, we also have to gracefully end the trackers.
+            tracker_end_run.end_trackers(child_tracker)
+            logger.info(e)

         if "val/loss" not in pl_trainer.callback_metrics:
             raise ValueError(
@@ -177,3 +209,24 @@ def _get_trial_params(self, trial: optuna.trial.Trial) -> dict[str, Any]:
                 "No parameters to tune. Specify them under `tuner.params`."
             )
         return new_params
+
+
+class TrackerEndRun(pl.Callback):
+    """Callback that ends trackers of child processes during a tuning study."""
+
+    def teardown(
+        self, trainer: pl.Trainer, pl_module: pl.LightningModule, stage: str
+    ) -> None:
+        self.end_trackers(trainer.logger)  # type: ignore
+        return super().teardown(trainer, pl_module, stage)
+
+    def end_trackers(self, tracker: LuxonisTrackerPL) -> None:
+        """Ends WandB and MLFlow trackers.
+
+        Args:
+            tracker (LuxonisTrackerPL): Currently active tracker
+        """
+        if tracker.is_wandb:
+            tracker.experiment["wandb"].finish()
+        if tracker.is_mlflow:
+            tracker.experiment["mlflow"].end_run()
```
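The `try/except` added in the hunk above mirrors Optuna's pruning control flow: a pruned trial is terminated by raising `optuna.TrialPruned` from inside the training loop, so any code wrapping the trial (here, `pl_trainer.fit()`) must expect that exception. A self-contained sketch of that flow with a generic objective (plain Optuna, no Lightning, unrelated to the Tuner code itself):

```python
import optuna


def objective(trial: optuna.trial.Trial) -> float:
    x = trial.suggest_float("x", -10.0, 10.0)
    for step in range(10):
        intermediate = (x - 2.0) ** 2 + 10.0 / (step + 1)
        # Report intermediate values so the pruner can act on them
        trial.report(intermediate, step)
        if trial.should_prune():
            # Pruning is signalled by raising an exception, which is
            # why the Tuner wraps fit() in try/except and cleans up
            # its trackers before re-checking the trial state
            raise optuna.TrialPruned()
    return (x - 2.0) ** 2


study = optuna.create_study(
    direction="minimize", pruner=optuna.pruners.MedianPruner()
)
study.optimize(objective, n_trials=20)
print(study.best_params)
```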
**Review thread on `tracker.experiment["mlflow"].end_run()`:**

> Will the parent run remain active long enough for this to not cause any issues?

> Yes, this run is active until the very end, when the whole study finishes. Basically, every trial in the study is its own training run (like we would get by running `luxonis_train train ...`), and every trial gets its own tracker. To nest them together we need one top-level tracker, the parent tracker. In the backend, `mlflow` keeps a stack of all active runs, and when `nest=True` it binds the new run to the previous run on the stack. So this top-level tracker is only used for binding all trials together and for logging the best hyperparameters of the whole study, for ease of use, to quickly get the results of the study.
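To illustrate the run-stack behaviour described in that reply, here is a minimal standalone sketch using the public `mlflow` API (where the nesting flag is spelled `nested=True`); it is independent of `LuxonisTrackerPL` and the run names are made up:

```python
import mlflow

# Parent run: stays active for the entire study
with mlflow.start_run(run_name="tuning-study"):
    for trial_idx in range(3):
        # nested=True attaches the new run to the run currently on top
        # of MLflow's active-run stack, i.e. the parent run above
        with mlflow.start_run(run_name=f"trial-{trial_idx}", nested=True):
            mlflow.log_metric("val_loss", 1.0 / (trial_idx + 1))
    # The best study parameters are logged on the still-active parent
    mlflow.log_param("best_lr", 1e-3)
```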