AN-360 Fix slurm CI #7680

Merged
merged 15 commits into from
Jan 21, 2025

Conversation

jgainerdewar
Collaborator

@jgainerdewar jgainerdewar commented Jan 16, 2025

Description

Our SLURM CI broke when the GHA runners upgraded to Ubuntu 24. Several changes were needed to make SLURM and its containers happy, including:

  • Set the correct ownership on the SLURM /var/spool/slurmd directory
  • Switch from Singularity to Apptainer (basically the same package, renamed) and add an AppArmor profile for it
  • Switch to updated SelectType
  • Convince SLURM to run a bajillion jobs at once so we can get the test finished within two hours
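
For reference, a minimal sketch of what the scheduling-related slurm.conf lines could look like. This is an illustration, not the exact content of this PR: it assumes the updated SelectType is select/cons_tres, and it uses partition oversubscription as one possible way to let many jobs run at once. The helper name `slurm_scheduling_fragment` and the default factor of 4 are hypothetical.

```shell
# Hypothetical helper: prints a slurm.conf fragment to stdout.
# Nothing is written to /etc/slurm; the caller decides where it goes.
slurm_scheduling_fragment() {
  # Assumed oversubscription factor (jobs allowed per resource); not from the PR.
  max_share="${1:-4}"
  cat <<EOF
SelectType=select/cons_tres
SelectTypeParameters=CR_Core
PartitionName=localpartition Nodes=localhost Default=YES OverSubscribe=FORCE:${max_share}
EOF
}

# Example: print a fragment that allows up to 8 jobs per core.
slurm_scheduling_fragment 8
```

The output would be merged into the existing slurm.conf heredoc rather than appended blindly, since the file should define the partition only once.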

Release Notes Confirmation

CHANGELOG.md

  • I updated CHANGELOG.md in this PR
  • I assert that this change shouldn't be included in CHANGELOG.md because it doesn't impact community users

Terra Release Notes

  • I added a suggested release notes entry in this Jira ticket
  • I assert that this change doesn't need Jira release notes because it doesn't impact Terra users

@jgainerdewar jgainerdewar added the Don't Look At Me 🙈 (not yet ready for review) label Jan 16, 2025
@jgainerdewar jgainerdewar marked this pull request as ready for review January 21, 2025 14:36
@jgainerdewar jgainerdewar requested a review from a team as a code owner January 21, 2025 14:36
@jgainerdewar jgainerdewar removed the Don't Look At Me 🙈 (not yet ready for review) label Jan 21, 2025
@@ -31,6 +31,21 @@ cromwell::build::slurm::setup_slurm_environment() {
# Create various directories used by slurm
sudo mkdir -p /var/run/munge
sudo mkdir -p /var/spool/slurmd
sudo chown slurm:slurm /var/spool/slurmd

# Set up an AppArmor profile for Apptainer to allow non-root users to use it.
Collaborator

Aha!

Collaborator Author

You called it!

Collaborator

The homelab to day job pipeline in action
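
For context on the AppArmor step discussed in this thread: recent Ubuntu releases restrict unprivileged user namespaces through AppArmor, which Apptainer relies on when run by non-root users. Below is a minimal sketch of a profile that lifts that restriction, modeled on the pattern in Apptainer's installation notes. The helper name `apptainer_apparmor_profile` and the default binary path are assumptions, and the profile actually added in this PR may differ.

```shell
# Hypothetical helper: prints an AppArmor profile for an Apptainer binary.
# The default path is an assumption; check it with `command -v apptainer`.
apptainer_apparmor_profile() {
  apptainer_bin="${1:-/usr/bin/apptainer}"
  cat <<EOF
abi <abi/4.0>,
include <tunables/global>

profile apptainer ${apptainer_bin} flags=(unconfined) {
  userns,

  # Site-specific additions and overrides.
  include if exists <local/apptainer>
}
EOF
}

# Example: print the profile for a binary installed under /usr/local/bin.
apptainer_apparmor_profile /usr/local/bin/apptainer
```

In a CI script the output would typically be written to a file under /etc/apparmor.d/ with `sudo tee` and then loaded by reloading the AppArmor service.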

Contributor

@lucymcnatt lucymcnatt left a comment

Nice!

@LizBaldo LizBaldo left a comment

Amazing investigation 🫨

@@ -40,13 +55,14 @@ cromwell::build::slurm::setup_slurm_environment() {
cat <<SLURM_CONF | sudo tee /etc/slurm/slurm.conf >/dev/null
ClusterName=localhost
ControlMachine=localhost
NodeName=localhost
PartitionName=localpartition Nodes=localhost Default=YES
NodeName=localhost CPUs=4 Sockets=1 CoresPerSocket=2 ThreadsPerCore=2
Collaborator Author

These values were obtained by running lscpu on the GHA runner.
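
A small sketch of how those lscpu fields map onto the NodeName line, in case the runner hardware changes again. The helper name `lscpu_to_nodename` is hypothetical and is not part of this PR; it reads lscpu output on stdin so it can be tried against saved output as well as a live runner.

```shell
# Hypothetical helper: reads `lscpu` output on stdin and prints a
# slurm.conf NodeName line built from the topology fields.
lscpu_to_nodename() {
  awk -F': *' '
    {
      # Newer lscpu versions indent sub-fields, so strip leading whitespace.
      key = $1
      sub(/^[ \t]+/, "", key)
    }
    key == "CPU(s)"             { cpus = $2 }
    key == "Socket(s)"          { sockets = $2 }
    key == "Core(s) per socket" { cores = $2 }
    key == "Thread(s) per core" { threads = $2 }
    END {
      printf "NodeName=localhost CPUs=%s Sockets=%s CoresPerSocket=%s ThreadsPerCore=%s\n",
             cpus, sockets, cores, threads
    }'
}

# Example with saved output; on a runner this would be: lscpu | lscpu_to_nodename
printf '%s\n' \
  'CPU(s):                   4' \
  '    Thread(s) per core:   2' \
  '    Core(s) per socket:   2' \
  '    Socket(s):            1' | lscpu_to_nodename
```

CPUs should equal Sockets × CoresPerSocket × ThreadsPerCore (here 1 × 2 × 2 = 4), which is a quick sanity check before the line goes into slurm.conf.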

@jgainerdewar jgainerdewar merged commit a77792a into develop Jan 21, 2025
43 checks passed
@jgainerdewar jgainerdewar deleted the jd_AN-360_slurmCI branch January 21, 2025 16:55