diff --git a/2024-RIKEN-AWS/JupyterNotebook/tutorial/notebook/flux.ipynb b/2024-RIKEN-AWS/JupyterNotebook/tutorial/notebook/flux.ipynb deleted file mode 100644 index 49d956d..0000000 --- a/2024-RIKEN-AWS/JupyterNotebook/tutorial/notebook/flux.ipynb +++ /dev/null @@ -1,1651 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "id": "2507d149-dcab-458a-a554-37388e0ee13a", - "metadata": { - "tags": [] - }, - "source": [ - "
\n", - "
\n", - "
" - ] - }, - { - "cell_type": "markdown", - "id": "40e867ba-f689-4301-bb60-9a448556bb84", - "metadata": { - "tags": [] - }, - "source": [ - "# Welcome to the Flux Tutorial\n", - "\n", - "> What is Flux Framework? πŸ€”οΈ\n", - " \n", - "Flux is a flexible framework for resource management, built for your site. The framework consists of a suite of projects, tools, and libraries which may be used to build site-custom resource managers for High Performance Computing centers. Flux is a next-generation resource manager and scheduler with many transformative capabilities like hierarchical scheduling and resource management (you can think of it as \"fractal scheduling\") and directed-graph based resource representations.\n", - "\n", - "> I'm ready! How do I do this tutorial? 😁️\n", - "\n", - "To step through examples in this notebook you need to execute cells. To run a cell, press Shift+Enter on your keyboard. If you prefer, you can also paste the shell commands in the JupyterLab terminal and execute them there.\n", - "Let's get started! To provide some brief, added background on Flux and a bit more motivation for our tutorial, \"Shift+Enter\" the cell below to watch our YouTube video!" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "d71ecd22-8552-4b4d-9bc4-61d86f8d33fe", - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "%%html\n", - "" - ] - }, - { - "cell_type": "markdown", - "id": "15e82c38-8465-49ac-ae2b-b0bb56a79ec9", - "metadata": { - "tags": [] - }, - "source": [ - "# Getting started with Flux\n", - "\n", - "The code and examples that this tutorial is based on can be found at [flux-framework/Tutorials](https://github.com/flux-framework/Tutorials/tree/master/2023-RADIUSS-AWS). You can also find the examples one level up in the flux-workflow-examples directory in this JupyterLab instance.\n", - "\n", - "## Resources\n", - "\n", - "> Looking for other resources? We got you covered! πŸ€“οΈ\n", - "\n", - " - [https://flux-framework.org/](https://flux-framework.org/) Flux Framework portal for projects, releases, and publication.\n", - " - [Flux Documentation](https://flux-framework.readthedocs.io/en/latest/).\n", - " - [Flux Framework Cheat Sheet](https://flux-framework.org/cheat-sheet/)\n", - " - [Flux Glossary of Terms](https://flux-framework.readthedocs.io/en/latest/glossary.html)\n", - " - [Flux Comics](https://flux-framework.readthedocs.io/en/latest/comics/fluxonomicon.html) come and meet FluxBird - the pink bird who knows things!\n", - " - [Flux Learning Guide](https://flux-framework.readthedocs.io/en/latest/guides/learning_guide.html) learn about what Flux does, how it works, and real research applications \n", - " - [Getting Started with Flux and Go](https://converged-computing.github.io/flux-go/)\n", - " - [Getting Started with Flux in C](https://converged-computing.github.io/flux-c-examples/) *looking for contributors*\n", - "\n", - "To read the Flux manpages and get help, run `flux help`. To get documentation on a subcommand, run, e.g. `flux help config`. Here is an example of running `flux help` right from the notebook. Yes, did you know we are running in a Flux Instance right now?" 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "c7d616de-70cd-4090-bd43-ffacb5ade1f6", - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "!flux help" - ] - }, - { - "cell_type": "markdown", - "id": "ae33fef6-278c-4996-8534-fd15e548b338", - "metadata": { - "tags": [] - }, - "source": [ - "Did you know you can also get help for a specific command? For example, let's run `flux help jobs` to get information on a sub-command:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "2e54f640-283a-4523-8dde-9617fd6ef0c5", - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "# We have commented this out because the output is huge! Feel free to uncomment (remove the #) and run the command\n", - "#!flux help jobs" - ] - }, - { - "cell_type": "markdown", - "id": "17e435d6-0927-4966-a4d7-47a128c94158", - "metadata": { - "tags": [] - }, - "source": [ - "### You can run any of the commands and examples that follow in the JupyterLab terminal. You can find the terminal in the JupyterLab launcher.\n", - "If you do `File -> New -> Terminal` you can open a raw terminal to play with Flux. You'll see a prompt like this: \n", - "\n", - "`Ζ’(s=4,d=0) fluxuser@6e0f43fd90eb:~$`\n", - "\n", - "`s=4` indicates the number of running Flux brokers, `d=0` indicates the Flux hierarchy depth. `@6e0f43fd90eb` references the host, which is a Docker container for our tutorial." - ] - }, - { - "cell_type": "markdown", - "id": "70e3df1d-32c9-4996-b6f7-2fa85f4c02ad", - "metadata": { - "tags": [] - }, - "source": [ - "# Creating Flux Instances\n", - "\n", - "A Flux instance is a fully functional set of services which manage compute resources under its domain with the capability to launch jobs on those resources. A Flux instance may be running as the default resource manager on a cluster, as a job in a resource manager such as Slurm, LSF, or Flux itself, or as a test instance launched locally.\n", - "\n", - "When run as a job in another resource manager, Flux is started like an MPI program, e.g., under Slurm we might run `srun [OPTIONS] flux start [SCRIPT]`. Flux is unique in that a test instance which mimics a multi-node instance can be started locally with simply `flux start --test-size=N`. This offers users a way to learn and test interfaces and commands without access to an HPC cluster.\n", - "\n", - "To start a Flux session with 4 brokers in your container, run:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "d568de50-f9e0-452f-8364-e52853013d83", - "metadata": {}, - "outputs": [], - "source": [ - "!flux start --test-size=4 flux getattr size" - ] - }, - { - "cell_type": "markdown", - "id": "e693f2d9-651f-4f58-bf53-62528caa83d9", - "metadata": {}, - "source": [ - "The output indicates the number of brokers started successfully." - ] - }, - { - "cell_type": "markdown", - "id": "eda1a33c-9f9e-4ba0-a013-e97601f79e41", - "metadata": {}, - "source": [ - "## Flux uptime\n", - "Flux provides an `uptime` utility to display properties of the Flux instance such as the state of the current instance, how long it has been running, its size, and whether scheduling is disabled or stopped. The output shows how long the instance has been up, the instance owner, the instance depth (depth in the Flux hierarchy), and the size of the instance (number of brokers)."
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "6057ce25-d1b3-4cc6-b26a-4b05a1639616", - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "!flux uptime" - ] - }, - { - "cell_type": "markdown", - "id": "dee2d6af-43fa-490e-88e9-10f13e660125", - "metadata": { - "tags": [] - }, - "source": [ - "# Submitting Jobs to Flux\n", - "## Submission CLI\n", - "### `flux`: the Job Submission Tool\n", - "\n", - "To submit jobs to Flux, you can use the `flux submit`, `run`, `bulksubmit`, `batch`, and `alloc` commands. The `flux submit` command submits a job to Flux and prints out the jobid. " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "8a5e7d41-1d8d-426c-8198-0ad4a57e7d04", - "metadata": {}, - "outputs": [], - "source": [ - "!flux submit hostname" - ] - }, - { - "cell_type": "markdown", - "id": "a7e4c25e-3ca8-4277-bb70-a0e94bcd223b", - "metadata": {}, - "source": [ - "`submit` supports common options like `--nnodes`, `--ntasks`, and `--cores-per-task`. There are short option equivalents (`-N`, `-n`, and `-c`, respectively) of these options as well. `--cores-per-task=1` is the default." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "571d8c3d-b24a-415e-b9ac-f58b99a7e92c", - "metadata": {}, - "outputs": [], - "source": [ - "!flux submit -N1 -n2 sleep inf" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "cc2bddee-f454-4674-80d4-4a39c5f1bee2", - "metadata": {}, - "outputs": [], - "source": [ - "# Let's peek at the help for flux submit!\n", - "!flux submit --help | head -n 15" - ] - }, - { - "cell_type": "markdown", - "id": "ac798095", - "metadata": {}, - "source": [ - "The `flux run` command submits a job to Flux (similar to `flux submit`) but then attaches to the job with `flux job attach`, printing the job's stdout/stderr to the terminal and exiting with the same exit code as the job:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "52d26496-dd1f-44f7-bb10-8a9b4b8c9c80", - "metadata": {}, - "outputs": [], - "source": [ - "!flux run hostname" - ] - }, - { - "cell_type": "markdown", - "id": "53357a9d-11d8-4c2d-87d8-c30ae38d01ba", - "metadata": {}, - "source": [ - "The output from the previous command is the hostname (a container ID string in this case). If the job exits with a non-zero exit code this will be reported by `flux job attach` (which occurs implicitly with `flux run`). For example, execute the following:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "fa40cb98-a138-4771-a7ef-f1860dddf7db", - "metadata": {}, - "outputs": [], - "source": [ - "!flux run /bin/false" - ] - }, - { - "cell_type": "markdown", - "id": "6b2b5c3f-e24a-45a8-a10c-e10bfdbb7b87", - "metadata": {}, - "source": [ - "A job submitted with `run` can be canceled with two `Ctrl-C`s in rapid succession, or a user can detach from the job with `Ctrl-C Ctrl-Z`. The user can then re-attach to the job by using `flux job attach JOBID`."
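, - "\n", - "For example, here is a small sketch of detaching from and re-attaching to a job, using only commands shown in this tutorial; it is best tried in the JupyterLab terminal, since the notebook cannot send `Ctrl-C`:\n", - "\n", - "```bash\n", - "flux run sleep 60                  # press Ctrl-C Ctrl-Z to detach\n", - "flux job attach $(flux job last)   # re-attach to the same job\n", - "```"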
- ] - }, - { - "cell_type": "markdown", - "id": "81e5213d", - "metadata": {}, - "source": [ - "`flux submit` and `flux run` also support many other useful flags:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "02032748", - "metadata": {}, - "outputs": [], - "source": [ - "!flux run -n4 --label-io --time-limit=5s --env-remove=LD_LIBRARY_PATH hostname" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "f52bb357-a7ce-458d-9c3f-4d664eca4fbd", - "metadata": {}, - "outputs": [], - "source": [ - "# Uncomment and run this help command if you want to see all the flags for flux run\n", - "# !flux run --help" - ] - }, - { - "cell_type": "markdown", - "id": "91e9ed6c", - "metadata": {}, - "source": [ - "The `flux bulksubmit` command enqueues jobs based on a set of inputs which are substituted on the command line, similar to `xargs` and the GNU `parallel` utility, except the jobs have access to the resources of an entire Flux instance instead of only the local system." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "f0e82702", - "metadata": {}, - "outputs": [], - "source": [ - "!flux bulksubmit --watch --wait echo {} ::: foo bar baz" - ] - }, - { - "cell_type": "markdown", - "id": "392a8056-1661-4b76-9ca3-5e536c687e82", - "metadata": {}, - "source": [ - "The `--cc` option to `submit` makes repeated submission even easier via, `flux submit --cc=IDSET`:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "0ea1962b-1831-4bd2-8dab-c61fd710df9c", - "metadata": {}, - "outputs": [], - "source": [ - "!flux submit --cc=1-10 --watch hostname" - ] - }, - { - "cell_type": "markdown", - "id": "27ca3706-8bb4-4fd6-a37c-e6135fb05604", - "metadata": {}, - "source": [ - "Try it in the JupyterLab terminal with a progress bar and jobs/s rate report: `flux submit --cc=1-100 --watch --progress --jps hostname`\n", - "\n", - "Note that `--wait` is implied by `--watch`." - ] - }, - { - "cell_type": "markdown", - "id": "4c5a18ff-8d6a-47e9-a164-931ed1275ef4", - "metadata": {}, - "source": [ - "Of course, Flux can launch more than just single-node, single-core jobs. We can submit multiple heterogeneous jobs and Flux will co-schedule the jobs while also ensuring no oversubscription of resources (e.g., cores).\n", - "\n", - "Note: in this tutorial, we cannot assume that the host you are running on has multiple cores, thus the examples below only vary the number of nodes per job. Varying the `cores-per-task` is also possible on Flux when the underlying hardware supports it (e.g., a multi-core node)." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "brazilian-former", - "metadata": {}, - "outputs": [], - "source": [ - "!flux submit --nodes=2 --ntasks=2 --cores-per-task=1 --job-name simulation sleep inf\n", - "!flux submit --nodes=1 --ntasks=1 --cores-per-task=1 --job-name analysis sleep inf" - ] - }, - { - "cell_type": "markdown", - "id": "641f446c-b2e8-40d8-b6bd-eb6b9dba3c71", - "metadata": {}, - "source": [ - "### `flux watch` to watch jobs\n", - "\n", - "Wouldn't it be cool to submit a job and then watch it? Well, yeah! We can do this now with flux watch. Let's run a fun example, and then watch the output. We have sleeps in here interspersed with echos only to show you the live action! 
πŸ₯žοΈ\n", - "Also note a nice trick - you can always use `flux job last` to get the last JOBID.\n", - "Here is an example (not runnable, as notebooks don't support environment variables) for getting and saving a job id:\n", - "\n", - "```bash\n", - "flux submit hostname\n", - "JOBID=$(flux job last)\n", - "```\n", - "\n", - "And then you could use the variable `$JOBID` in your subsequent script or interactions with Flux! So what makes `flux watch` different from `flux job attach`? Aside from the fact that `flux watch` is read-only, `flux watch` can watch many (or even all (`flux watch --all`) jobs at once!" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "5ad231c2-4cdb-4d18-afc2-7cb3a74759c2", - "metadata": {}, - "outputs": [], - "source": [ - "!flux submit ../flux-workflow-examples/job-watch/job-watch.sh\n", - "!flux watch $(flux job last)" - ] - }, - { - "cell_type": "markdown", - "id": "3f8c2af2", - "metadata": {}, - "source": [ - "### Listing job properties with `flux jobs`\n", - "\n", - "We can now list the jobs in the queue with `flux jobs` and we should see both jobs that we just submitted. Jobs that are instances are colored blue in output, red jobs are failed jobs, and green jobs are those that completed successfully. Note that the JupyterLab notebook may not display these colors. You will be able to see them in the terminal." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "institutional-vocabulary", - "metadata": {}, - "outputs": [], - "source": [ - "!flux jobs" - ] - }, - { - "cell_type": "markdown", - "id": "77ca4277", - "metadata": {}, - "source": [ - "Since those jobs won't ever exit (and we didn't specify a timelimit), let's cancel them all now and free up the resources." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "46dd8ec8-6c64-4d8d-9a00-949f5f58c07b", - "metadata": {}, - "outputs": [], - "source": [ - "# This was previously flux cancelall -f\n", - "!flux cancel --all\n", - "!flux jobs" - ] - }, - { - "cell_type": "markdown", - "id": "544aa0a9", - "metadata": {}, - "source": [ - "We can use the `flux batch` command to easily created nested flux instances. When `flux batch` is invoked, Flux will automatically create a nested instance that spans the resources allocated to the job, and then Flux runs the batch script passed to `flux batch` on rank 0 of the nested instance. \"Rank\" refers to the rank of the Tree-Based Overlay Network (TBON) used by the Flux brokers: https://flux-framework.readthedocs.io/projects/flux-core/en/latest/man1/flux-broker.html\n", - "\n", - "While a batch script is expected to launch parallel jobs using `flux run` or `flux submit` at this level, nothing prevents the script from further batching other sub-batch-jobs using the `flux batch` interface, if desired.\n", - "\n", - "Note: Flux also provides a `flux alloc` which is an interactive version of `flux batch`, but demonstrating that in a Jupyter notebook is difficult due to the lack of pseudo-terminal." 
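, - "\n", - "If you would like to see the interactive version anyway, here is a minimal sketch you could try in the JupyterLab terminal (not in the notebook):\n", - "\n", - "```bash\n", - "flux alloc -N1        # start an interactive nested instance on one node\n", - "flux uptime           # the depth is now one level deeper\n", - "flux resource list    # only the allocated resources are visible\n", - "exit                  # leave the allocation\n", - "```"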
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "blank-carpet", - "metadata": {}, - "outputs": [], - "source": [ - "!flux batch --nslots=2 --cores-per-slot=1 --nodes=2 ./sleep_batch.sh\n", - "!flux batch --nslots=2 --cores-per-slot=1 --nodes=2 ./sleep_batch.sh" - ] - }, - { - "cell_type": "markdown", - "id": "da98bfa1", - "metadata": {}, - "source": [ - "The contents of `sleep_batch.sh`:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "381a3f6c-0da1-4923-801f-486ca5226d3c", - "metadata": {}, - "outputs": [], - "source": [ - "from IPython.display import Code\n", - "Code(filename='sleep_batch.sh', language='bash')" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "edff8993-3c39-4f46-939d-4c8be5739fbc", - "metadata": {}, - "outputs": [], - "source": [ - "# Here we are submitting a job that generates output, and asking to write it to /tmp/cheese.txt\n", - "!flux submit --out /tmp/cheese.txt echo \"Sweet dreams 🌚️ are made of cheese, who am I to diss a brie? πŸ§€οΈ\"\n", - "\n", - "# This will show us JOBIDs\n", - "!flux jobs\n", - "\n", - "# We can even see jobs in sub-instances with \"-R\" (for recursive)\n", - "!flux jobs -R\n", - "\n", - "# You could copy a JOBID from above and paste it in the line below to examine the job's resources and output\n", - "# or get the last jobid with \"flux job last\" (this is what we will do here)\n", - "# JOBID=\"Ζ’FoRYVpt7\"\n", - "\n", - "# Note here we are using flux job last to see the last one\n", - "# The \"R\" here asks for the resource spec\n", - "!flux job info $(flux job last) R\n", - "\n", - "# When we attach it will direct us to our output file\n", - "!flux job attach $(flux job last)\n", - "\n", - "# And we can look at the output file to see our expected output!\n", - "from IPython.display import Code\n", - "Code(filename='/tmp/cheese.txt', language='text')" - ] - }, - { - "cell_type": "markdown", - "id": "f4e525e2-6c89-4c14-9fae-d87a0d4fc574", - "metadata": {}, - "source": [ - "To list all completed jobs, run `flux jobs -a`:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "df8a8b7c-f475-4a51-8bc6-9983dc9d78ab", - "metadata": {}, - "outputs": [], - "source": [ - "!flux jobs -a" - ] - }, - { - "cell_type": "markdown", - "id": "3e415ecc-f451-4909-a2bf-351a639cd7fa", - "metadata": {}, - "source": [ - "To restrict the output to failed (i.e., jobs that exit with nonzero exit code, time out, or are canceled or killed) jobs, run:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "032597d2-4b02-47ea-a5e5-915313cdd7f9", - "metadata": {}, - "outputs": [], - "source": [ - "!flux jobs -f failed" - ] - }, - { - "cell_type": "markdown", - "id": "04b405b1-219f-489c-abfc-e2983e82124a", - "metadata": {}, - "source": [ - "# The Flux Hierarchy\n", - "\n", - "One feature of the Flux Framework scheduler that is unique is its ability to submit jobs within instances, where an instance can be thought of as a level in a graph. Let's start with a basic image - this is what it might look like to submit to a scheduler that is not graph-based,\n", - "where all jobs go to a central job queue or database. Note that our maximum job throughput is one job per second.\n", - "\n", - "![img/single-submit.png](img/single-submit.png)\n", - "\n", - "The throughput is limited by the workload manager's ability to process a single job. We can improve upon this by simply adding another level, perhaps with three instances. 
For example, let's say we create a flux allocation or batch that has control of some number of child nodes. We might launch three new instances (each with its own scheduler and queue) at level two, and all of a sudden, we get a throughput of 1x3, or three jobs per second. \n", - "\n", - "![img/instance-submit.png](img/instance-submit.png)\n", - "\n", - "\n", - "All of a sudden, the throughput can increase exponentially because we are essentially submitting to different schedulers. The example above is not impressive, but our [learning guide](https://flux-framework.readthedocs.io/en/latest/guides/learning_guide.html#fully-hierarchical-resource-management-techniques) (Figure 10) has a beautiful example of how it can scale, done via an actual experiment. We were able to submit 500 jobs/second using only three levels, vs. close to 1 job/second with one level. \n", - "\n", - "![img/scaled-submit.png](img/scaled-submit.png)\n", - "\n", - "And for an interesting detail, you can vary the scheduler algorithm or topology within each sub-instance, meaning that you can do some fairly interesting things with scheduling work, and all without stressing the top level system instance. Next, let's look at a prototype tool called `flux-tree` that you can use to see how this works.\n", - "\n", - "## Flux tree\n", - "\n", - "Flux tree is a prototype tool that allows you to easily submit work to different levels of your flux instance, or more specifically, to create a nested hierarchy of jobs that scales out. Let's run the command, look at the output, and talk about it." - ] - }, - { - "cell_type": "code", - "execution_count": 7, - "id": "2735b1ca-e761-46be-b509-a86b771628fc", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "TreeID Elapsed(sec) Begin(Epoch) End(Epoch) Match(usec) NJobs NNodes CPN GPN\n", - "tree 3.280890 1711671101.237483 1711671104.518378 1574.710000 4 1 4 0\n", - "tree.2 1.807340 1711671101.913623 1711671103.720962 690.937000 2 1 2 0\n", - "tree.2.2 0.116084 1711671102.753773 1711671102.869857 0.000000 1 1 1 0\n", - "tree.2.1 0.113461 1711671102.525129 1711671102.638590 0.000000 1 1 1 0\n", - "tree.1 1.823700 1711671101.837330 1711671103.661027 698.328000 2 1 2 0\n", - "tree.1.2 0.114873 1711671102.689943 1711671102.804816 0.000000 1 1 1 0\n", - "tree.1.1 0.115360 1711671102.447201 1711671102.562560 0.000000 1 1 1 0\n" - ] - } - ], - "source": [ - "!flux tree -T2x2 -J 4 -N 1 -c 4 -o ./tree.out -Q easy:fcfs hostname \n", - "! cat ./tree.out" - ] - }, - { - "cell_type": "markdown", - "id": "9d5fe7a0-af54-4c90-be6f-75f50c918dea", - "metadata": {}, - "source": [ - "In the above, we are running `flux-tree` and looking at the output file. What is happening is that the `flux tree` command is creating a hierarchy of instances. Based on their names you can tell that:\n", - "\n", - " - `2x2` in the command is the topology\n", - " - It says to create two flux instances, and make them each spawn two more.\n", - " - `tree` is the root\n", - " - `tree.1` is the first instance\n", - " - `tree.2` is the second instance\n", - " - `tree.1.1` and `tree.1.2` refer to the nested instances under `tree.1`\n", - " - `tree.2.1` and `tree.2.2` refer to the nested instances under `tree.2`\n", - " \n", - "And we provided the command `hostname` to this script, but a more complex example would generate more interesting hierarchies,\n", - "and with different functionality for each. 
Note that although this is just a dummy prototype, you could use `flux-tree` for actual work,\n", - "or more likely, you would want to use `flux batch` to submit multiple commands within a single flux instance to take advantage of the same\n", - "hierarchy. \n", - "\n", - "## Flux batch\n", - "\n", - "Next, let's look at an example that doesn't use `flux tree` but instead uses `flux batch`, which is how you will likely interact with your nested instances. Let's start with a batch script `hello-batch.sh`.\n", - "\n", - "##### hello-batch.sh\n" - ] - }, - { - "cell_type": "code", - "execution_count": 35, - "id": "e82863e5-b2a1-456b-9ff1-f669b3525fa1", - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
#!/bin/bash\n",
-       "\n",
-       "flux submit --flags=waitable -N1 --out /tmp/hello-batch-1.out echo "Hello job 1 from $(hostname) πŸ’›οΈ"\n",
-       "flux submit --flags=waitable -N1 --out /tmp/hello-batch-2.out echo "Hello job 2 from $(hostname) πŸ’šοΈ"\n",
-       "flux submit --flags=waitable -N1 --out /tmp/hello-batch-3.out echo "Hello job 3 from $(hostname) πŸ’™οΈ"\n",
-       "flux submit --flags=waitable -N1 --out /tmp/hello-batch-4.out echo "Hello job 4 from $(hostname) πŸ’œοΈ"\n",
-       "# Wait for the jobs to finish\n",
-       "flux job wait --all\n",
-       "
\n" - ], - "text/latex": [ - "\\begin{Verbatim}[commandchars=\\\\\\{\\}]\n", - "\\PY{c+ch}{\\PYZsh{}!/bin/bash}\n", - "\n", - "flux\\PY{+w}{ }submit\\PY{+w}{ }\\PYZhy{}\\PYZhy{}flags\\PY{o}{=}waitable\\PY{+w}{ }\\PYZhy{}N1\\PY{+w}{ }\\PYZhy{}\\PYZhy{}out\\PY{+w}{ }/tmp/hello\\PYZhy{}batch\\PYZhy{}1.out\\PY{+w}{ }\\PY{n+nb}{echo}\\PY{+w}{ }\\PY{l+s+s2}{\\PYZdq{}}\\PY{l+s+s2}{Hello job 1 from }\\PY{k}{\\PYZdl{}(}hostname\\PY{k}{)}\\PY{l+s+s2}{ πŸ’›οΈ}\\PY{l+s+s2}{\\PYZdq{}}\n", - "flux\\PY{+w}{ }submit\\PY{+w}{ }\\PYZhy{}\\PYZhy{}flags\\PY{o}{=}waitable\\PY{+w}{ }\\PYZhy{}N1\\PY{+w}{ }\\PYZhy{}\\PYZhy{}out\\PY{+w}{ }/tmp/hello\\PYZhy{}batch\\PYZhy{}2.out\\PY{+w}{ }\\PY{n+nb}{echo}\\PY{+w}{ }\\PY{l+s+s2}{\\PYZdq{}}\\PY{l+s+s2}{Hello job 2 from }\\PY{k}{\\PYZdl{}(}hostname\\PY{k}{)}\\PY{l+s+s2}{ πŸ’šοΈ}\\PY{l+s+s2}{\\PYZdq{}}\n", - "flux\\PY{+w}{ }submit\\PY{+w}{ }\\PYZhy{}\\PYZhy{}flags\\PY{o}{=}waitable\\PY{+w}{ }\\PYZhy{}N1\\PY{+w}{ }\\PYZhy{}\\PYZhy{}out\\PY{+w}{ }/tmp/hello\\PYZhy{}batch\\PYZhy{}3.out\\PY{+w}{ }\\PY{n+nb}{echo}\\PY{+w}{ }\\PY{l+s+s2}{\\PYZdq{}}\\PY{l+s+s2}{Hello job 3 from }\\PY{k}{\\PYZdl{}(}hostname\\PY{k}{)}\\PY{l+s+s2}{ πŸ’™οΈ}\\PY{l+s+s2}{\\PYZdq{}}\n", - "flux\\PY{+w}{ }submit\\PY{+w}{ }\\PYZhy{}\\PYZhy{}flags\\PY{o}{=}waitable\\PY{+w}{ }\\PYZhy{}N1\\PY{+w}{ }\\PYZhy{}\\PYZhy{}out\\PY{+w}{ }/tmp/hello\\PYZhy{}batch\\PYZhy{}4.out\\PY{+w}{ }\\PY{n+nb}{echo}\\PY{+w}{ }\\PY{l+s+s2}{\\PYZdq{}}\\PY{l+s+s2}{Hello job 4 from }\\PY{k}{\\PYZdl{}(}hostname\\PY{k}{)}\\PY{l+s+s2}{ πŸ’œοΈ}\\PY{l+s+s2}{\\PYZdq{}}\n", - "\\PY{c+c1}{\\PYZsh{} Wait for the jobs to finish}\n", - "flux\\PY{+w}{ }job\\PY{+w}{ }\\PY{n+nb}{wait}\\PY{+w}{ }\\PYZhy{}\\PYZhy{}all\n", - "\\end{Verbatim}\n" - ], - "text/plain": [ - "#!/bin/bash\n", - "\n", - "flux submit --flags=waitable -N1 --out /tmp/hello-batch-1.out echo \"Hello job 1 from $(hostname) πŸ’›οΈ\"\n", - "flux submit --flags=waitable -N1 --out /tmp/hello-batch-2.out echo \"Hello job 2 from $(hostname) πŸ’šοΈ\"\n", - "flux submit --flags=waitable -N1 --out /tmp/hello-batch-3.out echo \"Hello job 3 from $(hostname) πŸ’™οΈ\"\n", - "flux submit --flags=waitable -N1 --out /tmp/hello-batch-4.out echo \"Hello job 4 from $(hostname) πŸ’œοΈ\"\n", - "# Wait for the jobs to finish\n", - "flux job wait --all" - ] - }, - "execution_count": 35, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "from IPython.display import Code\n", - "Code(filename='hello-batch.sh', language='bash')" - ] - }, - { - "cell_type": "markdown", - "id": "6bc17bac-2fc4-4418-8939-e930f9929976", - "metadata": {}, - "source": [ - "We would provide this script to run with `flux batch` that is going to:\n", - "\n", - "1. Create a flux instance with the top level resources you specify\n", - "2. Submit jobs to the scheduler controlled by the broker of that sub-instance\n", - "3. Run the four jobs, with `--flags=waitable` and `flux job wait --all` to wait for the output file\n", - "4. Within the batch script, you can add `--wait` or `--flags=waitable` to individual jobs, and use `flux queue drain` to wait for the queue to drain, _or_ `flux job wait --all` to wait for the jobs you flagged to finish. \n", - "\n", - "Note that when you submit a batch job, you'll get a job id back for the _batch job_, and usually when you look at the output of that with `flux job attach $jobid` you will see the output file(s) where the internal contents are written. 
Since we want to print the output file easily to the terminal, we are waiting for the batch job by adding the `--flags=waitable` and then waiting for it. Let's try to run our batch job now." - ] - }, - { - "cell_type": "code", - "execution_count": 40, - "id": "72358a03-6f1f-4c5e-91eb-cab71883a232", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Ζ’2LcUUanSB\n", - "Ζ’2LcUUanSB\n", - "Hello job 1 from c32cc16d4b78 πŸ’›οΈ\n", - "Hello job 2 from c32cc16d4b78 πŸ’šοΈ\n", - "Hello job 3 from c32cc16d4b78 πŸ’™οΈ\n", - "Hello job 4 from c32cc16d4b78 πŸ’œοΈ\n" - ] - } - ], - "source": [ - "! flux batch --flags=waitable --out /tmp/flux-batch.out -N2 ./hello-batch.sh\n", - "! flux job wait\n", - "! cat /tmp/hello-batch-1.out\n", - "! cat /tmp/hello-batch-2.out\n", - "! cat /tmp/hello-batch-3.out\n", - "! cat /tmp/hello-batch-4.out" - ] - }, - { - "cell_type": "markdown", - "id": "75c0ae3f-2813-4ae8-83be-00be3df92a4b", - "metadata": {}, - "source": [ - "Excellent! Now let's look at another batch example. Here we have two job scripts:\n", - "\n", - "- sub_job1.sh: Is going to be run with `flux batch` and submit sub_job2.sh\n", - "- sub_job2.sh: Is going to be submit by sub_job1.sh.\n", - "\n", - "You can see that below." - ] - }, - { - "cell_type": "code", - "execution_count": 38, - "id": "2e6976f8-dbb6-405e-a06b-47c571aa1cdf", - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
#!/bin/bash\n",
-       "\n",
-       "flux batch -N1 ./sub_job2.sh\n",
-       "flux queue drain\n",
-       "
\n" - ], - "text/latex": [ - "\\begin{Verbatim}[commandchars=\\\\\\{\\}]\n", - "\\PY{c+ch}{\\PYZsh{}!/bin/bash}\n", - "\n", - "flux\\PY{+w}{ }batch\\PY{+w}{ }\\PYZhy{}N1\\PY{+w}{ }./sub\\PYZus{}job2.sh\n", - "flux\\PY{+w}{ }queue\\PY{+w}{ }drain\n", - "\\end{Verbatim}\n" - ], - "text/plain": [ - "#!/bin/bash\n", - "\n", - "flux batch -N1 ./sub_job2.sh\n", - "flux queue drain\n" - ] - }, - "execution_count": 38, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "Code(filename='sub_job1.sh', language='bash')" - ] - }, - { - "cell_type": "code", - "execution_count": 39, - "id": "a0719cc9-6bf2-4285-b5d7-6cc534fc364c", - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
#!/bin/bash\n",
-       "\n",
-       "flux run -N1 sleep 30\n",
-       "
\n" - ], - "text/latex": [ - "\\begin{Verbatim}[commandchars=\\\\\\{\\}]\n", - "\\PY{c+ch}{\\PYZsh{}!/bin/bash}\n", - "\n", - "flux\\PY{+w}{ }run\\PY{+w}{ }\\PYZhy{}N1\\PY{+w}{ }sleep\\PY{+w}{ }\\PY{l+m}{30}\n", - "\\end{Verbatim}\n" - ], - "text/plain": [ - "#!/bin/bash\n", - "\n", - "flux run -N1 sleep 30\n" - ] - }, - "execution_count": 39, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "Code(filename='sub_job2.sh', language='bash')" - ] - }, - { - "cell_type": "code", - "execution_count": 41, - "id": "8640a611-38e4-42b1-a913-89e0c76c8014", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Ζ’2Mgy7vtZm\n" - ] - } - ], - "source": [ - "# Submit it!\n", - "!flux batch -N1 ./sub_job1.sh" - ] - }, - { - "cell_type": "markdown", - "id": "b29c3a4a-2b77-4ab9-8e0c-9f5228e61016", - "metadata": {}, - "source": [ - "And now that we've submit, let's look at the hierarchy for all the jobs we just ran. Here is how to try flux pstree, which normally can show jobs in an instance, but it has limited functionality given we are in a notebook! So instead of just running the single command, let's add \"-a\" to indicate \"show me ALL jobs.\"\n", - "More complex jobs and in a different environment would have deeper nesting. You can [see examples here](https://flux-framework.readthedocs.io/en/latest/jobs/hierarchies.html?h=pstree#flux-pstree-command)." - ] - }, - { - "cell_type": "code", - "execution_count": 42, - "id": "2d2b1f0b-e6c2-4583-8068-7c76fa341884", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - ".\n", - "β”œβ”€β”€ ./sub_job1.sh:CD\n", - "β”œβ”€β”€ 20*[./hello-batch.sh:CD]\n", - "β”œβ”€β”€ 2*[flux-tree-y7zovVRbptUhPPfV2MKXPJVWwQN9HigN:CD]\n", - "β”œβ”€β”€ 2*[flux-tree-V6rPdUViHrklYfiMxYosv7lEHKgrYLJF:CD]\n", - "β”œβ”€β”€ 2*[flux-tree-oEof1Dm3CMim8MBCpppJf6I3hdxx2aYI:CD]\n", - "β”œβ”€β”€ 2*[flux-tree-tZr1SYP6yAkYvbD7t3g9XziIRVtubVlP:CD]\n", - "└── 2*[flux-tree-5AO4EWvE6Nr1lPfb2qvVje99HKC2ZTYh:CD]\n" - ] - } - ], - "source": [ - "!flux pstree -a" - ] - }, - { - "cell_type": "markdown", - "id": "7724130f-b0db-4ccf-a01e-98907b9a27ca", - "metadata": {}, - "source": [ - "You can also try a more detailed view with `flux pstree -a -X`!" - ] - }, - { - "cell_type": "markdown", - "id": "03e2ae62-3e3b-4c82-a0c7-4c97ff1376d2", - "metadata": {}, - "source": [ - "# Flux Process and Job Utilities\n", - "## Flux top\n", - "Flux provides a feature-full version of `top` for nested Flux instances and jobs. In the JupyterLab terminal, invoke `flux top` to see the \"sleep\" jobs. If they have already completed you can resubmit them. \n", - "\n", - "We recommend not running `flux top` in the notebook as it is not designed to display output from a command that runs continuously.\n", - "\n", - "## Flux pstree\n", - "In analogy to `top`, Flux provides `flux pstree`. Try it out in the JupyterLab terminal or here in the notebook.\n", - "\n", - "## Flux proxy\n", - "\n", - "### Interacting with a job hierarchy with `flux proxy`\n", - "\n", - "Flux proxy is used to route messages to and from a Flux instance. We can use `flux proxy` to connect to a running Flux instance and then submit more nested jobs inside it. You may want to edit `sleep_batch.sh` with the JupyterLab text editor (double click the file in the window on the left) to sleep for `60` or `120` seconds. Then from the JupyterLab terminal, run, you'll want to run the below. 
Yes, we really want you to open a terminal in the Jupyter launcher FILE-> NEW -> TERMINAL and run the commands below!" - ] - }, - { - "cell_type": "markdown", - "id": "a609b2f8-e24d-40c7-b022-ce02e91a49f8", - "metadata": {}, - "source": [ - "```bash\n", - "# The terminal will start at the root, ensure you are in the right spot!\n", - "# jovyan - that's you! \n", - "cd /home/jovyan/flux-radiuss-tutorial-2023/notebook/\n", - "\n", - "# Outputs the JOBID\n", - "flux batch --nslots=2 --cores-per-slot=1 --nodes=2 ./sleep_batch.sh\n", - "\n", - "# Put the JOBID into an environment variable\n", - "JOBID=$(flux job last)\n", - "\n", - "# See the flux process tree\n", - "flux pstree -a\n", - "\n", - "# Connect to the Flux instance corresponding to JOBID above\n", - "flux proxy ${JOBID}\n", - "\n", - "# Note the depth is now 1 and the size is 2: we're one level deeper in a Flux hierarchy and we have only 2 brokers now.\n", - "flux uptime\n", - "\n", - "# This instance has 2 \"nodes\" and 2 cores allocated to it\n", - "flux resource list\n", - "\n", - "# Have you used the top command in your terminal? We have one for flux!\n", - "flux top\n", - "```\n", - "\n", - "`flux top` was pretty cool, right? 😎️" - ] - }, - { - "cell_type": "markdown", - "id": "997faffc", - "metadata": {}, - "source": [ - "## Submission API\n", - "Flux also provides first-class python bindings which can be used to submit jobs programmatically. The following script shows this with the `flux.job.submit()` call:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "third-comment", - "metadata": {}, - "outputs": [], - "source": [ - "import os\n", - "import json\n", - "import flux\n", - "from flux.job import JobspecV1\n", - "from flux.job.JobID import JobID" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "selective-uganda", - "metadata": {}, - "outputs": [], - "source": [ - "f = flux.Flux() # connect to the running Flux instance\n", - "compute_jobreq = JobspecV1.from_command(\n", - " command=[\"./compute.py\", \"120\"], num_tasks=1, num_nodes=1, cores_per_task=1\n", - ") # construct a jobspec\n", - "compute_jobreq.cwd = os.path.expanduser(\"~/flux-tutorial/flux-workflow-examples/job-submit-api/\") # set the CWD\n", - "print(JobID(flux.job.submit(f,compute_jobreq)).f58) # submit and print out the jobid (in f58 format)" - ] - }, - { - "cell_type": "markdown", - "id": "0c4b260f-f08a-46ae-ad66-805911a857a7", - "metadata": {}, - "source": [ - "### `flux.job.get_job(handle, jobid)` to get job info" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "ed65cb46-8d8a-41f0-bec1-92b9a89e6db2", - "metadata": {}, - "outputs": [], - "source": [ - "# This is a new command to get info about your job from the id!\n", - "fluxjob = flux.job.submit(f,compute_jobreq)\n", - "fluxjobid = JobID(fluxjob.f58)\n", - "print(f\"πŸŽ‰οΈ Hooray, we just submitted {fluxjobid}!\")\n", - "\n", - "# Here is how to get your info. 
The first argument is the flux handle, then the jobid\n", - "jobinfo = flux.job.get_job(f, fluxjobid)\n", - "print(json.dumps(jobinfo, indent=4))" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "5d679897-7054-4f96-b340-7f39245aca89", - "metadata": {}, - "outputs": [], - "source": [ - "!flux jobs -a | grep compute" - ] - }, - { - "cell_type": "markdown", - "id": "d332f9c9", - "metadata": {}, - "source": [ - "Under the hood, the `Jobspec` class is creating a YAML document that ultimately gets serialized as JSON and sent to Flux for ingestion, validation, queueing, scheduling, and eventually execution. We can dump the raw JSON jobspec that is submitted, where we can see the exact resources requested and the task set to be executed on those resources." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "efa06478", - "metadata": {}, - "outputs": [], - "source": [ - "print(compute_jobreq.dumps(indent=2))" - ] - }, - { - "cell_type": "markdown", - "id": "73bbc90e", - "metadata": {}, - "source": [ - "We can then replicate our previous example of submitting multiple heterogeneous jobs and testing that Flux co-schedules them." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "industrial-privacy", - "metadata": {}, - "outputs": [], - "source": [ - "compute_jobreq = JobspecV1.from_command(\n", - " command=[\"./compute.py\", \"120\"], num_tasks=4, num_nodes=2, cores_per_task=2\n", - ")\n", - "compute_jobreq.cwd = os.path.expanduser(\"~/flux-tutorial/flux-workflow-examples/job-submit-api/\")\n", - "print(JobID(flux.job.submit(f, compute_jobreq)))\n", - "\n", - "io_jobreq = JobspecV1.from_command(\n", - " command=[\"./io-forwarding.py\", \"120\"], num_tasks=1, num_nodes=1, cores_per_task=1\n", - ")\n", - "io_jobreq.cwd = os.path.expanduser(\"~/flux-tutorial/flux-workflow-examples/job-submit-api/\")\n", - "print(JobID(flux.job.submit(f, io_jobreq)))" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "pregnant-creativity", - "metadata": {}, - "outputs": [], - "source": [ - "!flux jobs -a | grep compute" - ] - }, - { - "cell_type": "markdown", - "id": "a8051640", - "metadata": {}, - "source": [ - "We can use the FluxExecutor class to submit large numbers of jobs to Flux. This method uses python's `concurrent.futures` interface. Example snippet from `~/flux-workflow-examples/async-bulk-job-submit/bulksubmit_executor.py`:" - ] - }, - { - "cell_type": "markdown", - "id": "binary-trace", - "metadata": {}, - "source": [ - "``` python \n", - "with FluxExecutor() as executor:\n", - " compute_jobspec = JobspecV1.from_command(args.command)\n", - " futures = [executor.submit(compute_jobspec) for _ in range(args.njobs)]\n", - " # wait for the jobid for each job, as a proxy for the job being submitted\n", - " for fut in futures:\n", - " fut.jobid()\n", - " # all jobs submitted - print timings\n", - "```" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "cleared-lawsuit", - "metadata": {}, - "outputs": [], - "source": [ - "# Submit a FluxExecutor based script.\n", - "%run ../flux-workflow-examples/async-bulk-job-submit/bulksubmit_executor.py -n200 /bin/sleep 0" - ] - }, - { - "cell_type": "markdown", - "id": "e1f041b1-ebe3-49d7-b522-79013e29acfa", - "metadata": {}, - "source": [ - "# Flux Archive\n", - "\n", - "As Flux is more increasingly used in cloud environments, you might find yourself in a situation of having a cluster without a shared filesystem! 
Have no fear, the `flux archive` command is here to help!\n", - "At a high level, `flux archive` allows you to save named pieces of data (text or data files) to the Flux key value store (KVS) for later retrieval.\n", - "Since this tutorial is running on one node, the sharing won't make a lot of sense here, but we will show you how to use it. The first thing you'll want to do is make a named archive." - ] - }, - { - "cell_type": "code", - "execution_count": 1, - "id": "0114079f-26a3-4614-a8b2-6422ee2170a2", - "metadata": {}, - "outputs": [], - "source": [ - "! touch shared-file.txt\n", - "! flux archive create --name myshare --directory $(pwd) shared-file.txt" - ] - }, - { - "cell_type": "markdown", - "id": "e33173df-adbf-4028-8795-7f68d7dc66ba", - "metadata": {}, - "source": [ - "We would then want to send this file from this node to the other nodes. We can combine two commands to do that:\n", - "\n", - "- `flux exec` executes commands on instance nodes, optionally excluding ranks with `-x`\n", - "- `flux archive extract` does the extraction\n", - "\n", - "So we might put them together like this - asking for all ranks, but excluding (`-x`) rank 0, where we are currently sitting and where the file already exists." - ] - }, - { - "cell_type": "code", - "execution_count": 6, - "id": "05769493-54a9-453c-9c5e-516123a274c2", - "metadata": {}, - "outputs": [], - "source": [ - "! flux exec --rank all -x 0 flux archive extract --name myshare --directory $(pwd) shared-file.txt" - ] - }, - { - "cell_type": "markdown", - "id": "4df4ee23-4cce-4df8-9c99-e5cd3a4ae277", - "metadata": {}, - "source": [ - "What if the extraction directory doesn't exist on the other nodes yet? No problem! We can use `flux exec` to execute a command on the other nodes to create it, again with `-x 0` to exclude rank 0 (where the directory already exists). Note that you'd run this _before_ `flux archive extract`, and that we are using `-r` as a shorthand for `--rank`." - ] - }, - { - "cell_type": "code", - "execution_count": 7, - "id": "351415e0-4644-49bc-b4b1-b3ab3544d527", - "metadata": {}, - "outputs": [], - "source": [ - "! flux exec -r all -x 0 mkdir -p $(pwd)" - ] - }, - { - "cell_type": "markdown", - "id": "781bb105-4977-4022-a0bf-0bc53d73b2e4", - "metadata": {}, - "source": [ - "When you are done, it's good practice to clean up and remove the archive. Also note that for larger files, you can use `--mmap` to memory map content." - ] - }, - { - "cell_type": "code", - "execution_count": 8, - "id": "acde2ba8-ade9-450e-8ff9-2b0f094166b9", - "metadata": {}, - "outputs": [], - "source": [ - "! flux archive remove --name myshare" - ] - }, - { - "cell_type": "markdown", - "id": "ec052119", - "metadata": {}, - "source": [ - "Finally, note that older versions of Flux used `flux filemap` instead of `flux archive`. 
It's largely the same command with a rename.\n", - "\n", - "# Diving Deeper Into Flux's Internals\n", - "\n", - "Flux uses [hwloc](https://github.com/open-mpi/hwloc) to detect the resources on each node and then to populate its resource graph.\n", - "\n", - "You can access the topology information that Flux collects with the `flux resource` subcommand:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "scenic-chassis", - "metadata": {}, - "outputs": [], - "source": [ - "!flux resource list" - ] - }, - { - "cell_type": "markdown", - "id": "0086e47e", - "metadata": {}, - "source": [ - "Flux can also bootstrap its resource graph based on static input files, like in the case of a multi-user system instance setup by site administrators. [More information on Flux's static resource configuration files](https://flux-framework.readthedocs.io/en/latest/adminguide.html#resource-configuration). Flux provides a more standard interface to listing available resources that works regardless of the resource input source: `flux resource`." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "prime-equilibrium", - "metadata": {}, - "outputs": [], - "source": [ - "# To view status of resources\n", - "!flux resource status" - ] - }, - { - "cell_type": "markdown", - "id": "5ee1c49d", - "metadata": {}, - "source": [ - "Flux has a command for controlling the queue within the `job-manager`: `flux queue`. This includes disabling job submission, re-enabling it, waiting for the queue to become idle or empty, and checking the queue status:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "800de4eb", - "metadata": {}, - "outputs": [], - "source": [ - "!flux queue disable \"maintenance outage\"\n", - "!flux queue enable\n", - "!flux queue -h" - ] - }, - { - "cell_type": "markdown", - "id": "67aa7559", - "metadata": {}, - "source": [ - "Each Flux instance has a set of attributes that are set at startup that affect the operation of Flux, such as `rank`, `size`, and `local-uri` (the Unix socket usable for communicating with Flux). Many of these attributes can be modified at runtime, such as `log-stderr-level` (1 logs only critical messages to stderr while 7 logs everything, including debug messages)." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "biblical-generic", - "metadata": {}, - "outputs": [], - "source": [ - "!flux getattr rank\n", - "!flux getattr size\n", - "!flux getattr local-uri\n", - "!flux setattr log-stderr-level 3\n", - "!flux lsattr -v" - ] - }, - { - "cell_type": "markdown", - "id": "d74fdfcf", - "metadata": {}, - "source": [ - "Services within a Flux instance are implemented by modules. To query and manage broker modules, use `flux module`. Modules that we have already directly interacted with in this tutorial include `resource` (via `flux resource`), `job-ingest` (via `flux` and the Python API) `job-list` (via `flux jobs`) and `job-manager` (via `flux queue`), and we will interact with the `kvs` module in a few cells. For the most part, services are implemented by modules of the same name (e.g., `kvs` implements the `kvs` service and thus the `kvs.lookup` RPC). In some circumstances, where multiple implementations for a service exist, a module of a different name implements a given service (e.g., in this instance, `sched-fluxion-qmanager` provides the `sched` service and thus `sched.alloc`, but in another instance `sched-simple` might provide the `sched` service)." 
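, - "\n", - "As a quick sketch of the service idea, you can check that a service is responsive and measure its round-trip latency with `flux ping` (here `kvs` is just an example target):\n", - "\n", - "```bash\n", - "flux ping --count=2 kvs\n", - "```"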
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "spatial-maintenance", - "metadata": {}, - "outputs": [], - "source": [ - "!flux module list" - ] - }, - { - "cell_type": "markdown", - "id": "ad7090eb", - "metadata": {}, - "source": [ - "We can actually unload the Fluxion modules (the scheduler modules from flux-sched) and replace them with `sched-simple` (the scheduler that comes built-into flux-core) as a demonstration of this functionality:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "df4bc2d5", - "metadata": {}, - "outputs": [], - "source": [ - "!flux module unload sched-fluxion-qmanager\n", - "!flux module unload sched-fluxion-resource\n", - "!flux module load sched-simple\n", - "!flux module list" - ] - }, - { - "cell_type": "markdown", - "id": "722c4ecf", - "metadata": {}, - "source": [ - "We can now reload the Fluxion scheduler, but this time, let's pass some extra arguments to specialize our Flux instance. In particular, let's populate our resource graph with nodes, sockets, and cores and limit the scheduling depth to 4." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "c34899ba", - "metadata": {}, - "outputs": [], - "source": [ - "!flux dmesg -C\n", - "!flux module unload sched-simple\n", - "!flux module load sched-fluxion-resource load-allowlist=node,socket,core\n", - "!flux module load sched-fluxion-qmanager queue-params=queue-depth=4\n", - "!flux module list\n", - "!flux dmesg | grep queue-depth" - ] - }, - { - "cell_type": "markdown", - "id": "ed4b0e04", - "metadata": {}, - "source": [ - "The key-value store (KVS) is a core component of a Flux instance. The `flux kvs` command provides a utility to list and manipulate values of the KVS. Modules of Flux use the KVS to persistently store information and retrieve it later on (potentially after a restart of Flux). One example of KVS use by Flux is the `resource` module, which stores the resource set `R` of the current Flux instance:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "nervous-broadcast", - "metadata": {}, - "outputs": [], - "source": [ - "!flux kvs ls \n", - "!flux kvs ls resource\n", - "!flux kvs get resource.R | jq" - ] - }, - { - "cell_type": "markdown", - "id": "c3920f9e", - "metadata": {}, - "source": [ - "Flux provides a built-in mechanism for executing commands on nodes without requiring a job or resource allocation: `flux exec`. `flux exec` is typically used by sys admins to execute administrative commands and load/unload modules across multiple ranks simultaneously." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "e9507c7b-de5c-4129-9a99-c943614a9ba2", - "metadata": {}, - "outputs": [], - "source": [ - "!flux exec -r 2 flux getattr rank # only execute on rank 2" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "6a9de119-abc4-4917-a339-2010ccc7b9b7", - "metadata": {}, - "outputs": [], - "source": [ - "!flux exec flux getattr rank # execute on all ranks" - ] - }, - { - "cell_type": "markdown", - "id": "c9c3e767-0459-4218-a8cf-0f98bd32d6bf", - "metadata": {}, - "source": [ - "# This concludes the notebook tutorial. πŸ˜­οΈπŸ˜„οΈ\n", - "\n", - "Don't worry, you'll have more opportunities for using Flux! We hope you reach out to us on any of our [project repositories](https://flux-framework.org) and ask any questions that you have. We'd love your contribution to code, documentation, or just saying hello! 
πŸ‘‹οΈ If you have feedback on the tutorial, please let us know so we can improve it for next year. \n", - "\n", - "> But what do I do now?\n", - "\n", - "Feel free to experiment more with Flux here, or (for more freedom) in the terminal. You can try more of the examples in the flux-workflow-examples directory one level up in the window to the left. If you're using a shared system like the one on the RADIUSS AWS tutorial please be mindful of other users and don't run compute intensive workloads. If you're running the tutorial in a job on an HPC cluster... compute away! ⚾️\n", - "\n", - "> Where can I learn to set this up on my own?\n", - "\n", - "If you're interested in installing Flux on your cluster, take a look at the [system instance instructions](https://flux-framework.readthedocs.io/en/latest/adminguide.html). If you are interested in running Flux on Kubernetes, check out the [Flux Operator](https://github.com/flux-framework/flux-operator). " - ] - }, - { - "cell_type": "markdown", - "id": "82657547-fc4a-459c-b628-3a60fea84c8d", - "metadata": {}, - "source": [ - "![https://flux-framework.org/flux-operator/_static/images/flux-operator.png](https://flux-framework.org/flux-operator/_static/images/flux-operator.png)\n", - "\n", - "" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3 (ipykernel)", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.10.12" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -}