[WIP] Slurm agent #3005
Signed-off-by: jiangjiawei1103 <[email protected]>
Signed-off-by: jiangjiawei1103 <[email protected]>
Signed-off-by: jiangjiawei1103 <[email protected]>
Signed-off-by: JiaWei Jiang <[email protected]>
Signed-off-by: JiaWei Jiang <[email protected]>
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #3005 +/- ##
===========================================
+ Coverage 51.08% 74.51% +23.42%
===========================================
Files 201 202 +1
Lines 21231 21452 +221
Branches 2731 2766 +35
===========================================
+ Hits 10846 15985 +5139
+ Misses 9787 4686 -5101
- Partials 598 781 +183
☔ View full report in Codecov by Sentry.
Successfully submit and run the user-defined task as a normal Python function on a remote Slurm cluster.
1. Inherit from PythonFunctionTask instead of PythonTask
2. Transfer the task module through SFTP
3. Interact with an Amazon S3 bucket on both localhost and the Slurm cluster
Signed-off-by: JiaWei Jiang <[email protected]>
Specifying the `--raw-output-data-prefix` option handles the task_module download.
Signed-off-by: JiaWei Jiang <[email protected]>
Signed-off-by: JiaWei Jiang <[email protected]>
Code Review Agent Run Status
Signed-off-by: JiaWei Jiang <[email protected]>
Signed-off-by: JiaWei Jiang <[email protected]>
Signed-off-by: JiaWei Jiang <[email protected]>
Signed-off-by: JiaWei Jiang <[email protected]>
Signed-off-by: JiaWei Jiang <[email protected]>
Add `ssh_conf` field to let users specify the connection secret.
Note that reconnection is done in both `get` and `delete`. This is just a temporary workaround.
Signed-off-by: JiaWei Jiang <[email protected]>
Signed-off-by: JiaWei Jiang <[email protected]>
Data scientists and MLEs developing Flyte workflows with the Slurm agent don't actually need to know the SSH connection details. We assume they only need to specify which Slurm cluster to use by hostname.
Signed-off-by: JiaWei Jiang <[email protected]>
1. Write the user-defined batch script to a tmp file
2. Transfer the batch script through SFTP
3. Construct the sbatch command to run on the Slurm cluster
Signed-off-by: JiaWei Jiang <[email protected]>
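The three steps in this commit message can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from this PR: the helper names, the `sbatch_conf` mapping, and the remote `/tmp` destination are hypothetical, and the `asyncssh` calls (`connect`, `start_sftp_client`, `put`, `run`) are used as documented by that library.

```python
import shlex
import tempfile
from pathlib import Path
from typing import Dict


def write_batch_script(script: str) -> Path:
    """Step 1: persist the user-defined batch script to a local temp file."""
    with tempfile.NamedTemporaryFile("w", suffix=".slurm", delete=False) as f:
        f.write(script)
    return Path(f.name)


def build_sbatch_command(remote_script: str, sbatch_conf: Dict[str, str]) -> str:
    """Step 3: turn a hypothetical {option: value} mapping into an sbatch command line."""
    parts = ["sbatch"]
    for opt, val in sbatch_conf.items():
        parts.append(f"--{opt}={val}")
    parts.append(remote_script)
    # shlex.join quotes any argument that contains shell-special characters.
    return shlex.join(parts)


async def submit(host: str, script: str, sbatch_conf: Dict[str, str]) -> str:
    """Steps 1-3 end to end; requires asyncssh and a reachable Slurm login node."""
    import asyncssh  # imported lazily so the helpers above work without it

    local = write_batch_script(script)
    remote = f"/tmp/{local.name}"  # hypothetical remote destination
    async with asyncssh.connect(host) as conn:
        async with conn.start_sftp_client() as sftp:
            await sftp.put(str(local), remote)  # Step 2: transfer through SFTP
        result = await conn.run(build_sbatch_command(remote, sbatch_conf), check=True)
    return str(result.stdout)
```

Keeping the command construction in a pure function makes it checkable without a cluster; only `submit` needs a live SSH connection.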
1. Remove SFTP for batch script transfer
   * Assume the Slurm batch script is present on the Slurm cluster
2. Support directly specifying a remote batch script path
Signed-off-by: JiaWei Jiang <[email protected]>
Signed-off-by: pryce-turner <[email protected]>
Signed-off-by: JiaWei Jiang <[email protected]>
Signed-off-by: JiaWei Jiang <[email protected]>
Signed-off-by: JiaWei Jiang <[email protected]>
Signed-off-by: JiaWei Jiang <[email protected]>
Signed-off-by: JiaWei Jiang <[email protected]>
@@ -326,7 +326,7 @@ class AsyncAgentExecutorMixin:

     def execute(self: PythonTask, **kwargs) -> LiteralMap:
         ctx = FlyteContext.current_context()
-        ss = ctx.serialization_settings or SerializationSettings(ImageConfig())
+        ss = ctx.serialization_settings or SerializationSettings(ImageConfig.auto_default_image())
why do we need this?
is this for shell task?
If we define a `SlurmTask` without specifying `container_image` (as in the example Python script provided above), `ctx.serialization_settings` will be `None`. Then an error is raised stating that `PythonAutoContainerTask` needs an image.
I think this is just a temporary workaround for local tests, and I'm still pondering how to better handle this issue.
Amazing Graph.
Thanks, bro.
`SlurmTask` and `SlurmShellTask` now share the same agent. Signed-off-by: JiaWei Jiang <[email protected]>
1. Inherit from `PythonTask` for cases in which the batch script is already on the Slurm cluster
2. Use a dummy `Interface` as a temporary workaround
Signed-off-by: JiaWei Jiang <[email protected]>
This video demos two things:
- Slurm Python task (executes a shell script on the Slurm host).
- Slurm shell task (executes a shell script provided on my computer).
output.mp4
…ript Signed-off-by: JiaWei Jiang <[email protected]>
Nice bro! I'll push
Now that we can reason more sensibly about the inner workings of this agent, I want to get a conversation going early about the object store and where it fits into all of this. IMO having a consistent persistence layer is the killer feature here and is required for composing workflows with Slurm tasks alongside all the other Flyte task types.
For this initial implementation we should assume workflows composed entirely of Slurm tasks on the local filesystem. We can then orchestrate tasks by passing filepaths around as input and output. This is a fairly naive implementation but will still have the benefit of the console, remote execution, versioning, logging, etc.
We should nevertheless build with a consistent object store in mind down the road. Some high-performance filesystems will have the option to expose an S3 interface, but I don't think we should assume that's always the case. We may need to fall back to some awscli dependency for getting/putting inputs and outputs. This is all down the road, but let's keep it in mind!
Supporting use cases where people have an on-prem Slurm cluster but no GPUs, so that they can run CPU-bound tasks there and then seamlessly offload accelerated tasks, all in the same workflow, is a killer feature. Let me know your thoughts.
1. Add back `PythonFunctionTask` to support running user-defined functions on Slurm
2. Categorize task types into `script/` and `function/`
Signed-off-by: JiaWei Jiang <[email protected]>
slurm_host = task_template.custom["slurm_host"]
srun_conf = task_template.custom["srun_conf"]
can we use task_template.custom.get("slurm_host")?
As `slurm_host` is a required field of the corresponding dataclass, we can assume the key `"slurm_host"` must exist in the `task_template.custom` dict. Then maybe accessing it directly through the bracket is more straightforward here?
Let me know what you think. Thanks!
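For readers following this thread, the trade-off between the two access styles can be shown with a plain dict. This is a generic Python illustration, not code from the PR; the function names are hypothetical and `custom` stands in for `task_template.custom`.

```python
from typing import Any, Dict, Optional


def host_strict(custom: Dict[str, Any]) -> str:
    """Bracket access: a missing required key fails immediately with KeyError."""
    return custom["slurm_host"]


def host_lenient(custom: Dict[str, Any]) -> Optional[str]:
    """.get(): a missing key becomes None, so the caller must check for it later."""
    return custom.get("slurm_host")


if __name__ == "__main__":
    incomplete = {"srun_conf": {"partition": "debug"}}  # "slurm_host" deliberately missing
    print(host_lenient(incomplete))
    try:
        host_strict(incomplete)
    except KeyError as exc:
        print(f"missing required key: {exc}")
```

When the key is guaranteed by a required dataclass field, bracket access surfaces a violated assumption at the point of use, while `.get` defers the failure to wherever the `None` is first consumed.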
TODO: Test GPU task
Signed-off-by: JiangJiaWei1103 <[email protected]>
Will add all setup back into docs.
Signed-off-by: JiangJiaWei1103 <[email protected]>
Signed-off-by: JiangJiaWei1103 <[email protected]>
Tracking issue
flyteorg/flyte#5634
Why are the changes needed?
What changes were proposed in this pull request?
Implement the Slurm agent, which submits the user-defined flytekit task to a remote Slurm cluster to run. The following describes its three core methods:
- `create`: Submit a Slurm job with `sbatch` to run a batch script on the Slurm cluster
- `get`: Check the Slurm job state
- `delete` (not yet tested): Cancel the Slurm job
How was this patch tested?
We test `create` and `get` in the development environment described as follows:
- `flytekit` installed
- `slurmctld` and `slurmd` running
- `asyncssh`
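As a rough illustration of how the `create`, `get`, and `delete` methods described above could map onto Slurm commands, the sketch below parses the output of `sbatch` and `scontrol show job` and builds an `scancel` command. It assumes the standard Slurm CLI output formats (`Submitted batch job <id>` and a `JobState=<STATE>` field); the helper names are hypothetical and the PR's actual implementation may differ.

```python
import re
from typing import Optional


def parse_job_id(sbatch_stdout: str) -> str:
    """create: extract the job id from sbatch's 'Submitted batch job <id>' line."""
    match = re.search(r"Submitted batch job (\d+)", sbatch_stdout)
    if match is None:
        raise ValueError(f"unexpected sbatch output: {sbatch_stdout!r}")
    return match.group(1)


def status_command(job_id: str) -> str:
    """get: command whose output contains the job's JobState field."""
    return f"scontrol show job {job_id}"


def parse_job_state(scontrol_stdout: str) -> Optional[str]:
    """get: pull the JobState value (e.g. PENDING, RUNNING, COMPLETED) if present."""
    match = re.search(r"JobState=(\w+)", scontrol_stdout)
    return match.group(1) if match else None


def cancel_command(job_id: str) -> str:
    """delete: command that cancels the job."""
    return f"scancel {job_id}"
```

In an agent, these strings would be sent over the SSH connection and the parsed job state translated into a task phase; that translation is left out here because the source does not specify it.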
Suppose we have a batch script to run on Slurm cluster:
We use the following python script to test Slurm agent on the client side:
The test result is shown as follows:
Setup process
As stated above
Check all the applicable boxes
Related PRs
Docs link