File saving to track end of evaluation runs. This helps with extensions that might wait for some evals to finish before running an op. #781
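
For context, the waiting side described in the title might look roughly like the sketch below. It is an illustration only, not code from this PR: `wait_for_evals`, its parameters, and the polling strategy are hypothetical, and it uses the standard-library `pathlib` where Kauldron uses `epath.Path`. The only name taken from the diff is the `EVAL_COMPLETE_FILENAME` template.

```python
import pathlib
import time

# Mirrors the template added in kauldron/evals/eval_impl.py by this PR.
EVAL_COMPLETE_FILENAME = 'eval_{}_complete.txt'


def wait_for_evals(workdir, eval_names, timeout_s=60.0, poll_s=1.0):
  """Polls `workdir` until every eval has a marker file or the timeout expires.

  Hypothetical helper (not part of Kauldron). Returns True if all marker
  files were found, False if the timeout was reached first.
  """
  workdir = pathlib.Path(workdir)
  pending = {EVAL_COMPLETE_FILENAME.format(name) for name in eval_names}
  deadline = time.monotonic() + timeout_s
  while True:
    # Drop every marker that has appeared since the last poll.
    pending = {f for f in pending if not (workdir / f).exists()}
    if not pending:
      return True
    if time.monotonic() >= deadline:
      return False
    time.sleep(poll_s)
```

An extension could call `wait_for_evals(workdir, ['my_eval'])` before running its op, where `'my_eval'` stands for whatever `ev.name` the evaluator was configured with.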

Open: wants to merge 1 commit into base: main
3 changes: 3 additions & 0 deletions kauldron/evals/__init__.py
@@ -21,6 +21,9 @@
# Lazy-import is used here as `run_strategies` is imported from kxm and
# we do not want to trigger a full import.
with _epy.lazy_api_imports(globals()):
  from kauldron.evals.eval_impl import TRAIN_COMPLETE_FILENAME
  from kauldron.evals.eval_impl import EVAL_COMPLETE_FILENAME

  from kauldron.evals.evaluators import CollectionKeys
  from kauldron.evals.evaluators import Evaluator
  from kauldron.evals.evaluators import EvaluatorBase
15 changes: 15 additions & 0 deletions kauldron/evals/eval_impl.py
@@ -33,6 +33,7 @@
# The XManager API does not provide a way for jobs within a work-unit to
# communicate, so use files for communication.
TRAIN_COMPLETE_FILENAME = 'train_complete.txt'
EVAL_COMPLETE_FILENAME = 'eval_{}_complete.txt'


def continuous_eval(
@@ -120,11 +121,25 @@ def continuous_eval(

final_step = step

  # All every_checkpoint_evals have been processed. Mark them as complete.
  if trainer.workdir.exists():  # `TrainEvaluator` does not have a workdir
    for ev in every_checkpoint_evals:
      epath.Path(trainer.workdir).joinpath(
          EVAL_COMPLETE_FILENAME.format(ev.name)
      ).touch()

  logging.info('Running final evals...')
  for ev in last_checkpoint_evals:
    with tracker.catch_exception(name=ev.name, step=final_step):
      aux[ev.name] = ev.evaluate(state=state, step=final_step)

  # All last_checkpoint_evals have been processed. Mark them as complete.
  if trainer.workdir.exists():  # `TrainEvaluator` does not have a workdir
    for ev in last_checkpoint_evals:
      epath.Path(trainer.workdir).joinpath(
          EVAL_COMPLETE_FILENAME.format(ev.name)
      ).touch()

  tracker.maybe_reraise()

  # Return the last aux
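
The two marker-writing blocks in `continuous_eval` differ only in which list of evaluators they iterate over, so they could be folded into one helper. The sketch below is one possible shape for that, under assumptions: `mark_evals_complete` is a hypothetical name, it takes plain eval names instead of evaluator objects, and it uses the standard-library `pathlib` as a stand-in for `epath.Path`.

```python
import pathlib

# Mirrors the template added in kauldron/evals/eval_impl.py by this PR.
EVAL_COMPLETE_FILENAME = 'eval_{}_complete.txt'


def mark_evals_complete(workdir, eval_names):
  """Touches one marker file per eval name and returns the marker paths.

  Hypothetical helper (not part of Kauldron). Does nothing and returns an
  empty list when `workdir` does not exist, matching the guard in the diff.
  """
  workdir = pathlib.Path(workdir)
  if not workdir.exists():
    return []
  paths = []
  for name in eval_names:
    path = workdir / EVAL_COMPLETE_FILENAME.format(name)
    path.touch()  # Creates an empty file, or updates its mtime if present.
    paths.append(path)
  return paths
```

In the diff this would be called once with the names from `every_checkpoint_evals` and once with those from `last_checkpoint_evals`. One thing to check: if `tracker.catch_exception` defers errors until `tracker.maybe_reraise()`, as the names suggest, a marker is written even for an eval that raised, so a waiting job would read "complete" as "finished", not "succeeded".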