Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Slurm agent #3005

Draft
wants to merge 31 commits into
base: master
Choose a base branch
from

Conversation

JiangJiaWei1103
Copy link
Contributor

@JiangJiaWei1103 JiangJiaWei1103 commented Dec 16, 2024

Tracking issue

flyteorg/flyte#5634

Why are the changes needed?

What changes were proposed in this pull request?

Implement the Slurm agent, which submits the user-defined flytekit task to a remote Slurm cluster to run. Following describe three core methods:

  1. create: Submit a Slurm job with sbatch to run a batch script on Slurm cluster
  2. get: Check the Slurm job state
  3. delete (haven't been tested): Cancel the Slurm job

How was this patch tested?

We test create and get in the development environment described as follows:

  • Local: MacBook with flytekit installed
    • Test slurm agent locally following this guide
  • Remote: Single Ubuntu server with slurmctld and slurmd running
    • We plan to write a single-host setup tutorial and organize useful resources here
  • Communication is done with asyncssh

Suppose we have a batch script to run on Slurm cluster:

#!/bin/bash

echo "Working!" >> ./remote_touch.txt

We use the following python script to test Slurm agent on the client side:

import os

from flytekit import workflow
from flytekitplugins.slurm import Slurm, SlurmTask


echo_job = SlurmTask(
    name="echo-job-name",
    task_config=Slurm(
        slurm_host="<host_alias>",
        batch_script_path="<path_to_batch_script_within_slurm_cluster>",
        sbatch_conf={
            "partition": "debug",
            "job-name": "tiny-slurm",
        }
    )
)


@workflow
def wf() -> None:
    echo_job()


if __name__ == "__main__":
    from flytekit.clis.sdk_in_container import pyflyte
    from click.testing import CliRunner

    runner = CliRunner()
    path = os.path.realpath(__file__)

    # Local run
    print(f">>> LOCAL EXEC <<<")
    result = runner.invoke(pyflyte.main, ["run", path, "wf"])
    print(result.output)

The test result is shown as follows:
slurm_basic_result

Setup process

As stated above

Check all the applicable boxes

  • I updated the documentation accordingly.
  • All new and existing tests passed.
  • All commits are signed-off.

Related PRs

Docs link

Copy link

codecov bot commented Dec 19, 2024

Codecov Report

Attention: Patch coverage is 14.28571% with 12 lines in your changes missing coverage. Please review.

Project coverage is 74.51%. Comparing base (f99d50e) to head (fc3e34e).
Report is 51 commits behind head on master.

Files with missing lines Patch % Lines
flytekit/extras/tasks/shell.py 18.18% 8 Missing and 1 partial ⚠️
flytekit/extend/backend/utils.py 0.00% 2 Missing ⚠️
flytekit/extend/backend/base_agent.py 0.00% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           master    #3005       +/-   ##
===========================================
+ Coverage   51.08%   74.51%   +23.42%     
===========================================
  Files         201      202        +1     
  Lines       21231    21452      +221     
  Branches     2731     2766       +35     
===========================================
+ Hits        10846    15985     +5139     
+ Misses       9787     4686     -5101     
- Partials      598      781      +183     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Successfully submit and run the user-defined task as a normal python
function on a remote Slurm cluster.

1. Inherit from PythonFunctionTask instead of PythonTask
2. Transfer the task module through sftp
3. Interact with amazon s3 bucket on both localhost and Slurm cluster

Signed-off-by: JiaWei Jiang <[email protected]>
Specifying `--raw-output-data-prefix` option handles task_module download.

Signed-off-by: JiaWei Jiang <[email protected]>
@flyte-bot
Copy link
Contributor

Code Review Agent Run Status

  • Limitations and other issues: ❌ Failure - The AI Code Review Agent skipped reviewing this change because it is configured to exclude certain pull requests based on the source/target branch or the pull request status. You can change the settings here, or contact the agent instance creator at [email protected].

@flyte-bot
Copy link
Contributor

Code Review Agent Run Status

  • Limitations and other issues: ❌ Failure - The AI Code Review Agent skipped reviewing this change because it is configured to exclude certain pull requests based on the source/target branch or the pull request status. You can change the settings here, or contact the agent instance creator at [email protected].

Add `ssh_conf` filed to let users specify connection secret

Note that reconnection is done in both `get` and `delete`. This is just
a temporary workaround.

Signed-off-by: JiaWei Jiang <[email protected]>
@flyte-bot
Copy link
Contributor

Code Review Agent Run Status

  • Limitations and other issues: ❌ Failure - The AI Code Review Agent skipped reviewing this change because it is configured to exclude certain pull requests based on the source/target branch or the pull request status. You can change the settings here, or contact the agent instance creator at [email protected].

For data scientists and MLEs developing flyte wf with Slurm agent,
they don't actually need to know ssh connection details. We assume
they only need to specify which Slurm cluster to use by hostname.

Signed-off-by: JiaWei Jiang <[email protected]>
@flyte-bot
Copy link
Contributor

Code Review Agent Run Status

  • Limitations and other issues: ❌ Failure - The AI Code Review Agent skipped reviewing this change because it is configured to exclude certain pull requests based on the source/target branch or the pull request status. You can change the settings here, or contact the agent instance creator at [email protected].

1. Write user-defined batch script to a tmp file
2. Transfer the batch script through sftp
3. Construct sbatch command to run on Slurm cluster

Signed-off-by: JiaWei Jiang <[email protected]>
@flyte-bot
Copy link
Contributor

Code Review Agent Run Status

  • Limitations and other issues: ❌ Failure - The AI Code Review Agent skipped reviewing this change because it is configured to exclude certain pull requests based on the source/target branch or the pull request status. You can change the settings here, or contact the agent instance creator at [email protected].

1. Remove SFTP for batch script transfer
    * Assume Slurm batch script is present on Slurm cluster
2. Support directly specifying a remote batch script path

Signed-off-by: JiaWei Jiang <[email protected]>
@flyte-bot
Copy link
Contributor

Code Review Agent Run Status

  • Limitations and other issues: ❌ Failure - The AI Code Review Agent skipped reviewing this change because it is configured to exclude certain pull requests based on the source/target branch or the pull request status. You can change the settings here, or contact the agent instance creator at [email protected].

@flyte-bot
Copy link
Contributor

Code Review Agent Run Status

  • Limitations and other issues: ❌ Failure - The AI Code Review Agent skipped reviewing this change because it is configured to exclude certain pull requests based on the source/target branch or the pull request status. You can change the settings here, or contact the agent instance creator at [email protected].

Signed-off-by: JiaWei Jiang <[email protected]>
@flyte-bot
Copy link
Contributor

Code Review Agent Run Status

  • Limitations and other issues: ❌ Failure - The AI Code Review Agent skipped reviewing this change because it is configured to exclude certain pull requests based on the source/target branch or the pull request status. You can change the settings here, or contact the agent instance creator at [email protected].

@flyte-bot
Copy link
Contributor

Code Review Agent Run Status

  • Limitations and other issues: ❌ Failure - The AI Code Review Agent skipped reviewing this change because it is configured to exclude certain pull requests based on the source/target branch or the pull request status. You can change the settings here, or contact the agent instance creator at [email protected].

@Future-Outlier Future-Outlier self-assigned this Jan 13, 2025
@@ -326,7 +326,7 @@ class AsyncAgentExecutorMixin:

def execute(self: PythonTask, **kwargs) -> LiteralMap:
ctx = FlyteContext.current_context()
ss = ctx.serialization_settings or SerializationSettings(ImageConfig())
ss = ctx.serialization_settings or SerializationSettings(ImageConfig.auto_default_image())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need this?
is this for shell task?

Copy link
Contributor Author

@JiangJiaWei1103 JiangJiaWei1103 Jan 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we define a SlurmTask without specifying container_image (as the example python script provided above), ctx.serialization_settings will be None. Then, an error is raised which describes that PythonAutoContainerTask needs an image.

I think this is just a temporary workaround for local test and I'm still pondering how to better handle this issue.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Amazing Graph.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, bro.

`SlurmTask` and `SlurmShellTask` now share the same agent.

Signed-off-by: JiaWei Jiang <[email protected]>
@flyte-bot
Copy link
Contributor

Code Review Agent Run Status

  • Limitations and other issues: ❌ Failure - The AI Code Review Agent skipped reviewing this change because it is configured to exclude certain pull requests based on the source/target branch or the pull request status. You can change the settings here, or contact the agent instance creator at [email protected].

Copy link
Member

@Future-Outlier Future-Outlier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

image

1. Inherited from `PythonTask` for cases in which the batch script is
    already on the Slurm cluster
2. Use a dummy `Interface` as a tmp workaround

Signed-off-by: JiaWei Jiang <[email protected]>
Copy link
Member

@Future-Outlier Future-Outlier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This video demo 2 things.

  • Slurm Python task (executes a shell script on the Slurm host).
  • Slurm shell task (executes a shell script provided on my computer).
output.mp4

@flyte-bot
Copy link
Contributor

Code Review Agent Run Status

  • Limitations and other issues: ❌ Failure - The AI Code Review Agent skipped reviewing this change because it is configured to exclude certain pull requests based on the source/target branch or the pull request status. You can change the settings here, or contact the agent instance creator at [email protected].

@JiangJiaWei1103
Copy link
Contributor Author

JiangJiaWei1103 commented Jan 15, 2025

This video demo 2 things.

* Slurm Python task (executes a shell script on the Slurm host).

* Slurm shell task (executes a shell script provided on my computer).

output.mp4

Nice bro!

I'll push PythonFunctionTask tomorrow (got some errors to solve).

@pryce-turner
Copy link
Contributor

Now that we can reason more sensibly about the inner-workings of this agent I want to get a conversation going early about the object store and where it fits into all of this. IMO having a consistent persistence layer is the killer feature here and required for composing workflows with Slurm tasks alongside all the other flyte task types.

For this initial implementation we should assume workflows composed entirely of slurm tasks on the local filesystem. We can then orchestrate tasks by passing filepaths around as input and output. This is a fairly naive implementation but will still have the benefit of the console, remote execution, versioning, logging, etc. We should nevertheless build with a consistent object store in mind down the road.

Some high perf filesystems will have the option to expose an S3 interface but I don't think we should assume that's always the case. We may need to fall back to some awscli dependency for getting/putting inputs and outputs. This is all down the road but let's keep it in mind! Usecases where people have an on-prem slurm cluster but no GPUs for example can run CPU bound tasks there and then seamlessly offload accelerated tasks, all in the same workflow, is a killer feature.

Let me know your thoughts.

1. Add back `PythonFunctionTask` to support running user-defined functions
    on Slurm
2. Categorize task types into `script/` and `function/`

Signed-off-by: JiaWei Jiang <[email protected]>
@flyte-bot
Copy link
Contributor

Code Review Agent Run Status

  • Limitations and other issues: ❌ Failure - The AI Code Review Agent skipped reviewing this change because it is configured to exclude certain pull requests based on the source/target branch or the pull request status. You can change the settings here, or contact the agent instance creator at [email protected].

Comment on lines +41 to +42
slurm_host = task_template.custom["slurm_host"]
srun_conf = task_template.custom["srun_conf"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we use task_template.custom.get("slurm_host")?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As slurm_host is a required field of the corresponding dataclass, we could assume the key "slurm_host" must exist in task_template.custom dict. Then, maybe directly accessing through the bracket is more straightforward here?

Let me know what you think. Thanks!

@Future-Outlier
Copy link
Member

setup aws credential

https://github.com/flyteorg/flytekit/blob/e68fda9f071833a4ea5facf350f9ee53be42ced3/plugins/flytekit-slurm/demo.md#amazon-s3-bucket

@Future-Outlier
Copy link
Member

TODO: TEST gpu task

@JiangJiaWei1103
Copy link
Contributor Author

@flyte-bot
Copy link
Contributor

Code Review Agent Run Status

  • Limitations and other issues: ❌ Failure - The AI Code Review Agent skipped reviewing this change because it is configured to exclude certain pull requests based on the source/target branch or the pull request status. You can change the settings here, or contact the agent instance creator at [email protected].

@flyte-bot
Copy link
Contributor

Code Review Agent Run Status

  • Limitations and other issues: ❌ Failure - The AI Code Review Agent skipped reviewing this change because it is configured to exclude certain pull requests based on the source/target branch or the pull request status. You can change the settings here, or contact the agent instance creator at [email protected].

@flyte-bot
Copy link
Contributor

Code Review Agent Run Status

  • Limitations and other issues: ❌ Failure - The AI Code Review Agent skipped reviewing this change because it is configured to exclude certain pull requests based on the source/target branch or the pull request status. You can change the settings here, or contact the agent instance creator at [email protected].

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: In progress
Development

Successfully merging this pull request may close these issues.

4 participants