Merge pull request #102 from CliMA/ne/derecho
Add PBS controller, DerechoBackend
nefrathenrici authored Jul 25, 2024
2 parents 25dab9d + db5cb81 commit c2b9ca9
Showing 16 changed files with 699 additions and 241 deletions.
2 changes: 1 addition & 1 deletion .buildkite/clima_server_test/pipeline.yml
@@ -21,7 +21,7 @@ steps:

- wait
- label: "SurfaceFluxes perfect model calibration"
command: julia --project=experiments/surface_fluxes_perfect_model test/slurm_backend_e2e.jl
command: julia --project=experiments/surface_fluxes_perfect_model test/hpc_backend_e2e.jl
artifact_paths: output/surface_fluxes_perfect_model/*

- label: "Slurm job controller unit tests"
2 changes: 1 addition & 1 deletion .buildkite/pipeline.yml
@@ -17,7 +17,7 @@ steps:

- wait
- label: "SurfaceFluxes perfect model calibration"
command: julia --project=experiments/surface_fluxes_perfect_model test/slurm_backend_e2e.jl
command: julia --project=experiments/surface_fluxes_perfect_model test/hpc_backend_e2e.jl
artifact_paths: output/surface_fluxes_perfect_model/*

- label: "Slurm job controller unit tests"
2 changes: 1 addition & 1 deletion Project.toml
@@ -1,7 +1,7 @@
name = "ClimaCalibrate"
uuid = "4347a170-ebd6-470c-89d3-5c705c0cacc2"
authors = ["Climate Modeling Alliance"]
version = "0.0.1"
version = "0.0.2"

[deps]
Distributions = "31c24e10-a181-5473-b8eb-7969acd0382f"
16 changes: 5 additions & 11 deletions README.md
@@ -9,27 +9,21 @@
calibration pipelines using CalibrateEmulateSample.jl with minimal boilerplate.</strong>
</p>

[![docsbuild][docs-bld-img]][docs-bld-url]
[![dev][docs-dev-img]][docs-dev-url]
[![ghaci][gha-ci-img]][gha-ci-url]
[![codecov][codecov-img]][codecov-url]

[docs-bld-img]: https://github.com/CliMA/ClimaCalibrate.jl/workflows/Documentation/badge.svg
[docs-bld-url]: https://github.com/CliMA/ClimaCalibrate.jl/actions?query=workflow%3ADocumentation

[docs-dev-img]: https://img.shields.io/badge/docs-dev-blue.svg
[docs-dev-url]: https://CliMA.github.io/ClimaCalibrate.jl/dev/

[gha-ci-img]: https://github.com/CliMA/ClimaCalibrate.jl/actions/workflows/ci.yml/badge.svg
[gha-ci-url]: https://github.com/CliMA/ClimaCalibrate.jl/actions/workflows/ci.yml

[codecov-img]: https://codecov.io/gh/CliMA/ClimaCalibrate.jl/branch/main/graph/badge.svg
[codecov-url]: https://codecov.io/gh/CliMA/ClimaCalibrate.jl

The recommended Julia version is: Stable release v1.10.0
The recommended Julia version is: Stable release v1.10.4

This pipeline currently runs on the Resnick High Performance Computing Center.
We strive to support flexible and clearly documented calibration experiments.
Currently supported backends:
- [Resnick High Performance Computing Center](https://www.hpc.caltech.edu/)
- [NSF NCAR Supercomputer Derecho](https://ncar-hpc-docs.readthedocs.io/en/latest/compute-systems/derecho/)
- CliMA's private GPU server

## Contributing

1 change: 0 additions & 1 deletion docs/make.jl
@@ -26,7 +26,6 @@ makedocs(
"Getting Started" => "quickstart.md",
"ClimaAtmos Setup Guide" => "atmos_setup_guide.md",
"Emulate and Sample" => "emulate_sample.md",
"Precompilation" => "precompilation.md",
"API" => "api.md",
],
)
19 changes: 18 additions & 1 deletion docs/src/api.md
@@ -13,7 +13,24 @@ ClimaCalibrate.observation_map
```@docs
ClimaCalibrate.get_backend
ClimaCalibrate.calibrate
ClimaCalibrate.sbatch_model_run
ClimaCalibrate.model_run
ClimaCalibrate.module_load_string
```

## Job Scheduler
```@docs
ClimaCalibrate.wait_for_jobs
ClimaCalibrate.log_member_error
ClimaCalibrate.kill_job
ClimaCalibrate.job_status
ClimaCalibrate.kwargs
ClimaCalibrate.slurm_model_run
ClimaCalibrate.generate_sbatch_script
ClimaCalibrate.generate_sbatch_directives
ClimaCalibrate.submit_slurm_job
ClimaCalibrate.pbs_model_run
ClimaCalibrate.generate_pbs_script
ClimaCalibrate.submit_pbs_job
```

## EnsembleKalmanProcesses Interface
5 changes: 2 additions & 3 deletions docs/src/index.md
@@ -3,8 +3,7 @@
ClimaCalibrate.jl is a toolkit for developing scalable and reproducible model
calibration pipelines using CalibrateEmulateSample.jl with minimal boilerplate.

To use this framework, component models (and the coupler) define their own versions of the functions provided in the interface (`get_config`, `get_forward_model`, and `run_forward_model`).

Calibrations can either be run using pure Julia, the Caltech central cluster, or CliMA's GPU server.
To use this framework, component models (and the coupler) define their own versions of the functions provided in the interface.
Calibrations can either be run using just Julia, the Caltech central cluster, NCAR Derecho, or CliMA's GPU server.

For more information, see our Getting Started page.
1 change: 1 addition & 0 deletions src/ClimaCalibrate.jl
@@ -3,6 +3,7 @@ module ClimaCalibrate
include("ekp_interface.jl")
include("model_interface.jl")
include("slurm.jl")
include("pbs.jl")
include("backends.jl")
include("emulate_sample.jl")

148 changes: 98 additions & 50 deletions src/backends.jl
@@ -1,12 +1,17 @@
export get_backend, calibrate
export get_backend, calibrate, model_run

abstract type AbstractBackend end

struct JuliaBackend <: AbstractBackend end
abstract type SlurmBackend <: AbstractBackend end

abstract type HPCBackend <: AbstractBackend end
abstract type SlurmBackend <: HPCBackend end

struct CaltechHPCBackend <: SlurmBackend end
struct ClimaGPUBackend <: SlurmBackend end

struct DerechoBackend <: HPCBackend end

"""
get_backend()
@@ -18,6 +23,8 @@ function get_backend()
(r"^clima.gps.caltech.edu$", ClimaGPUBackend),
(r"^login[1-4].cm.cluster$", CaltechHPCBackend),
(r"^hpc-(\d\d)-(\d\d).cm.cluster$", CaltechHPCBackend),
(r"derecho([1-8])$", DerechoBackend),
(r"dec(\d\d\d\d)$", DerechoBackend), # This should be more specific
]

for (pattern, backend) in HOSTNAMES
@@ -28,12 +35,12 @@ function get_backend()
end

"""
module_load_string(T) where {T<:Type{SlurmBackend}}
module_load_string(backend)
Return a string that loads the correct modules for a given backend when executed via bash.
"""
function module_load_string(::Type{CaltechHPCBackend})
return """export MODULEPATH=/groups/esm/modules:\$MODULEPATH
return """export MODULEPATH="/groups/esm/modules:\$MODULEPATH"
module purge
module load climacommon/2024_05_27"""
end
@@ -43,32 +50,14 @@ function module_load_string(::Type{ClimaGPUBackend})
module load julia/1.10.0 cuda/julia-pref openmpi/4.1.5-mpitrampoline"""
end

"""
calibrate(::Type{JuliaBackend}, config::ExperimentConfig)
calibrate(::Type{JuliaBackend}, experiment_dir::AbstractString)
Run a calibration in Julia.
Takes an ExperimentConfig or an experiment folder.
If no backend is passed, one is chosen via `get_backend`.
This function is intended for use in a larger workflow, assuming that all needed
model interface and observation map functions are set up for the calibration.
# Example
Run: `julia --project=experiments/surface_fluxes_perfect_model`
```julia
import ClimaCalibrate
# Generate observational data and load interface
experiment_dir = dirname(Base.active_project())
include(joinpath(experiment_dir, "generate_data.jl"))
include(joinpath(experiment_dir, "observation_map.jl"))
include(joinpath(experiment_dir, "model_interface.jl"))
function module_load_string(::Type{DerechoBackend})
return """export MODULEPATH="/glade/campaign/univ/ucit0011/ClimaModules-Derecho:\$MODULEPATH"
module purge
module load climacommon
module list
"""
end

# Initialize and run the calibration
eki = ClimaCalibrate.calibrate(experiment_dir)
```
"""
calibrate(config::ExperimentConfig; ekp_kwargs...) =
calibrate(get_backend(), config; ekp_kwargs...)

@@ -86,9 +75,8 @@ function calibrate(
config::ExperimentConfig;
ekp_kwargs...,
)
initialize(config; ekp_kwargs...)
(; n_iterations, ensemble_size) = config
eki = nothing
eki = initialize(config; ekp_kwargs...)
for i in 0:(n_iterations - 1)
@info "Running iteration $i"
for m in 1:ensemble_size
@@ -103,75 +91,80 @@ function calibrate(
end

"""
calibrate(::Type{SlurmBackend}, config::ExperimentConfig; kwargs...)
calibrate(::Type{SlurmBackend}, experiment_dir; kwargs...)
calibrate(::Type{AbstractBackend}, config::ExperimentConfig; kwargs...)
calibrate(::Type{AbstractBackend}, experiment_dir; kwargs...)
Run a full calibration, scheduling the forward model runs on Caltech's HPC cluster.
Takes either an ExperimentConfig or an experiment folder.
Available Backends: CaltechHPCBackend, ClimaGPUBackend, DerechoBackend, JuliaBackend
# Keyword Arguments
- `experiment_dir`: Directory containing experiment configurations.
- `model_interface`: Path to the model interface file.
- `slurm_kwargs`: Dictionary of slurm arguments, passed through to `sbatch`.
- `verbose::Bool`: Enable verbose output for debugging.
- `hpc_kwargs`: Dictionary of resource arguments, passed to the job scheduler.
- `verbose::Bool`: Enable verbose logging.
# Usage
Open julia: `julia --project=experiments/surface_fluxes_perfect_model`
```julia
import ClimaCalibrate: CaltechHPCBackend, calibrate
using ClimaCalibrate
experiment_dir = dirname(Base.active_project())
experiment_dir = joinpath(pkgdir(ClimaCalibrate), "experiments", "surface_fluxes_perfect_model")
model_interface = joinpath(experiment_dir, "model_interface.jl")
# Generate observational data and load interface
include(joinpath(experiment_dir, "generate_data.jl"))
include(joinpath(experiment_dir, "observation_map.jl"))
include(model_interface)
slurm_kwargs = kwargs(time = 3)
eki = calibrate(CaltechHPCBackend, experiment_dir; model_interface, slurm_kwargs);
hpc_kwargs = kwargs(time = 3)
backend = get_backend()
eki = calibrate(backend, experiment_dir; model_interface, hpc_kwargs);
```
"""
function calibrate(
b::Type{<:SlurmBackend},
b::Type{<:HPCBackend},
experiment_dir::AbstractString;
slurm_kwargs,
hpc_kwargs,
ekp_kwargs...,
)
calibrate(b, ExperimentConfig(experiment_dir); slurm_kwargs, ekp_kwargs...)
calibrate(b, ExperimentConfig(experiment_dir); hpc_kwargs, ekp_kwargs...)
end

function calibrate(
b::Type{<:SlurmBackend},
b::Type{<:HPCBackend},
config::ExperimentConfig;
experiment_dir = dirname(Base.active_project()),
model_interface = abspath(
joinpath(experiment_dir, "..", "..", "model_interface.jl"),
),
verbose = false,
slurm_kwargs = Dict(:time_limit => 45, :ntasks => 1),
reruns = 1,
hpc_kwargs,
ekp_kwargs...,
)
# ExperimentConfig is created from a YAML file within the experiment_dir
(; n_iterations, output_dir, ensemble_size) = config
@info "Initializing calibration" n_iterations ensemble_size output_dir
initialize(config; ekp_kwargs...)

eki = nothing
eki = initialize(config; ekp_kwargs...)
module_load_str = module_load_string(b)
for iter in 0:(n_iterations - 1)
@info "Iteration $iter"
jobids = map(1:ensemble_size) do member
@info "Running ensemble member $member"
sbatch_model_run(
model_run(
b,
iter,
member,
output_dir,
experiment_dir,
model_interface,
module_load_str;
slurm_kwargs,
hpc_kwargs,
)
end

@@ -182,14 +175,69 @@ function calibrate(
experiment_dir,
model_interface,
module_load_str;
slurm_kwargs,
hpc_kwargs,
verbose,
reruns,
)
report_iteration_status(statuses, output_dir, iter)
@info "Completed iteration $iter, updating ensemble"
G_ensemble = observation_map(iter)
save_G_ensemble(config, iter, G_ensemble)
eki = update_ensemble(config, iter)
end
return eki
end

# Dispatch on backend type to unify `calibrate` for all HPCBackends
# Scheduler interfaces should not depend on backend struct
"""
model_run(backend, iter, member, output_dir, experiment_dir, model_interface, module_load_str; hpc_kwargs)
Construct and execute a command to run a single forward model on a given job scheduler.
Dispatches on `backend` to run [`slurm_model_run`](@ref) or [`pbs_model_run`](@ref).
Arguments:
- iter: Iteration number
- member: Member number
- output_dir: Calibration experiment output directory
- experiment_dir: Directory containing the experiment's Project.toml
- model_interface: File containing the model interface
- module_load_str: Commands which load the necessary modules
- hpc_kwargs: Dictionary containing the resources for the job. Easily generated using [`kwargs`](@ref).
"""
model_run(
b::Type{<:SlurmBackend},
iter,
member,
output_dir,
experiment_dir,
model_interface,
module_load_str;
hpc_kwargs,
) = slurm_model_run(
iter,
member,
output_dir,
experiment_dir,
model_interface,
module_load_str;
hpc_kwargs,
)
model_run(
b::Type{DerechoBackend},
iter,
member,
output_dir,
experiment_dir,
model_interface,
module_load_str;
hpc_kwargs,
) = pbs_model_run(
iter,
member,
output_dir,
experiment_dir,
model_interface,
module_load_str;
hpc_kwargs,
)
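
The backend changes above can be read as two steps: a hostname pattern selects a backend type, and dispatch on that type selects a job scheduler. The following is a minimal, self-contained sketch of that idea, not the package's implementation. The type names and hostname patterns are copied from the diff; `backend_for`, `scheduler`, the `occursin` matching loop, and the `JuliaBackend` fallback are assumptions made for illustration, since the body of `get_backend` is collapsed in this view.

```julia
# Simplified stand-ins for the backend types added in this commit.
abstract type AbstractBackend end
struct JuliaBackend <: AbstractBackend end
abstract type HPCBackend <: AbstractBackend end
abstract type SlurmBackend <: HPCBackend end
struct CaltechHPCBackend <: SlurmBackend end
struct ClimaGPUBackend <: SlurmBackend end
struct DerechoBackend <: HPCBackend end

# Hostname patterns copied from the diff.
const HOSTNAMES = [
    (r"^clima.gps.caltech.edu$", ClimaGPUBackend),
    (r"^login[1-4].cm.cluster$", CaltechHPCBackend),
    (r"^hpc-(\d\d)-(\d\d).cm.cluster$", CaltechHPCBackend),
    (r"derecho([1-8])$", DerechoBackend),
    (r"dec(\d\d\d\d)$", DerechoBackend),
]

# Hypothetical helper: return the first backend whose pattern matches.
# The fallback to JuliaBackend is an assumption, not shown in the diff.
function backend_for(hostname::AbstractString)
    for (pattern, backend) in HOSTNAMES
        occursin(pattern, hostname) && return backend
    end
    return JuliaBackend
end

# Hypothetical helper: name the scheduler chosen by dispatch, mirroring how
# `model_run` forwards to `slurm_model_run` or `pbs_model_run`.
scheduler(::Type{<:SlurmBackend}) = "slurm"
scheduler(::Type{DerechoBackend}) = "pbs"

println(scheduler(backend_for("derecho3")))
println(scheduler(backend_for("login2.cm.cluster")))
```

Because `SlurmBackend` is an abstract supertype, a new Slurm cluster only needs a new struct and a hostname pattern, while a scheduler that is neither Slurm nor PBS would need its own method.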
6 changes: 3 additions & 3 deletions src/ekp_interface.jl
@@ -171,10 +171,10 @@ function env_model_interface(env = ENV)
return string(env[key])
end

function env_iter_number(env = ENV)
key = "CALIBRATION_ITER_NUMBER"
function env_iteration(env = ENV)
key = "CALIBRATION_ITERATION"
haskey(env, key) || error(
"Iteration number not found in environment. Ensure that env variable \"CALIBRATION_ITER_NUMBER\" is set.",
"Iteration number not found in environment. Ensure that env variable \"CALIBRATION_ITERATION\" is set.",
)
return parse(Int, env[key])
end
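
Since `env_iteration` takes the environment as an argument, it can be exercised with a plain `Dict` instead of the real `ENV`. The sketch below restates the renamed function as it appears in the diff and calls it with a hypothetical `fake_env`; it is an illustration of the lookup, not a test taken from the repository.

```julia
# Restated from the diff: read the iteration number from an environment-like mapping.
function env_iteration(env = ENV)
    key = "CALIBRATION_ITERATION"
    haskey(env, key) || error(
        "Iteration number not found in environment. Ensure that env variable \"CALIBRATION_ITERATION\" is set.",
    )
    return parse(Int, env[key])
end

# `fake_env` is a hypothetical stand-in for ENV.
fake_env = Dict("CALIBRATION_ITERATION" => "3")
iteration = env_iteration(fake_env)
println(iteration)

# A missing key raises an ErrorException with the message above.
try
    env_iteration(Dict{String, String}())
catch err
    println("caught: ", err.msg)
end
```

Job scripts that still export the old `CALIBRATION_ITER_NUMBER` variable would hit the error branch after this rename.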

2 comments on commit c2b9ca9

@nefrathenrici

@JuliaRegistrator

Registration pull request created: JuliaRegistries/General/111798

Tip: Release Notes

Did you know you can add release notes too? Just add markdown formatted text underneath the comment after the text
"Release notes:" and it will be added to the registry PR, and if TagBot is installed it will also be added to the
release that TagBot creates. i.e.

@JuliaRegistrator register

Release notes:

## Breaking changes

- blah

To add them here just re-invoke and the PR will be updated.

Tagging

After the above pull request is merged, it is recommended that a tag is created on this repository for the registered package version.

This will be done automatically if the Julia TagBot GitHub Action is installed, or can be done manually through the github interface, or via:

git tag -a v0.0.2 -m "<description of version>" c2b9ca96f86af79732e45fb7e890823792468e91
git push origin v0.0.2
