Step-by-Step Guide to Installing SLURM with memory limit and core affinity on Ubuntu – Single Node

Learn to manually install SLURM with memory limit and core affinity via cgroups on Ubuntu 22.04 with this step-by-step tutorial.

Introduction

This tutorial aims to guide you through the manual installation and configuration of SLURM with memory limit and core affinity on a single-node Ubuntu 22.04 system. Memory limiting and core affinity are achieved via cgroups. Due to compatibility issues with the SLURM version available in Ubuntu’s repository (21.08.5), a manual installation of a newer version is necessary to ensure proper functionality of cgroups. By the end of this tutorial, you’ll have a fully operational SLURM setup tailored for efficient resource management in computational chemistry tasks on a single-node environment.

The tutorial is targeted at small to medium-sized computational groups. Consequently, it does not cover queues, QoS, and other advanced configurations that are typically needed only in large HPC clusters.

Why manual installation?

SLURM memory limit and core affinity features depend on cgroups (control groups). However, the version of SLURM available in the Ubuntu 22.04 repository (21.08.5) is outdated and incompatible with Ubuntu 22.04’s cgroup implementation, leading to issues when enabling memory management. For this reason, manual installation of a newer SLURM version is essential.
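Before building from source, you can confirm what the repository would install. The package name `slurm-wlm` is the usual Ubuntu metapackage; adjust it if your mirror uses a different name:

```shell
# Show the SLURM version available from the Ubuntu 22.04 repositories.
# On a stock 22.04 system this should report 21.08.5, the version with
# the cgroup compatibility problems described above.
apt-cache policy slurm-wlm
```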

Several sources highlight these compatibility challenges, including Google Groups, Unix Stack Exchange, Superuser, and SLURM Users Mailing List.

What you’ll learn

In this guide, you’ll learn how to:

  • Remove Existing SLURM Installations: Start fresh by completely uninstalling any previous SLURM setups.
  • Install SLURM from Source: Manually install the latest SLURM version to enable memory management.
  • Enable SLURM Memory Limit: Configure cgroups to effectively limit memory usage for jobs.
  • Verify the Setup: Test your configuration to ensure SLURM and cgroups are functioning correctly.

By the end, you’ll have a fully functional SLURM setup on your single-node system, ready to manage computational tasks efficiently.

Step-by-step instructions

(optional) Step 1: Remove existing SLURM installation

1.1 Stop SLURM Services
sudo systemctl stop slurmctld
sudo systemctl stop slurmd
1.2 Remove SLURM Packages
sudo apt-get remove --purge slurm-*
sudo apt-get remove --purge slurmctld slurmd
1.3 Clean Up Residual Files
sudo apt autoremove
sudo rm -rf /etc/slurm
sudo rm -rf /var/spool/slurmctld
sudo rm -rf /var/spool/slurmd
sudo rm -rf /var/log/slurm
sudo rm -rf /var/lib/slurm
sudo rm -rf /run/slurm*
sudo rm -rf /usr/local/etc/slurm*
1.4 Remove slurm User (if it exists)
sudo userdel -r slurm
1.5 (optional) Find and Remove Any Remaining SLURM Files

The previous steps should already have removed most leftovers from earlier installations. If you still want to check whether any file on the machine has a name containing “slurm”, run:

sudo find / -name "*slurm*"
1.6 Clean Package Manager Cache
sudo apt-get clean
sudo apt-get autoremove
1.7 (optional but recommended) Reboot the System
sudo reboot

Step 2: Install Dependencies

2.1 Update Package Lists
sudo apt-get update
2.2 Install Required Packages
sudo apt-get install build-essential fakeroot devscripts libmunge-dev libmunge2 munge
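MUNGE must be running before the SLURM daemons can authenticate with each other. As a quick sanity check of the service installed above, you can round-trip a credential:

```shell
# Start MUNGE now and enable it on every boot
sudo systemctl enable --now munge

# Encode and immediately decode a credential; a healthy setup
# prints "STATUS: Success (0)" in the unmunge output
munge -n | unmunge
```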
2.3 Upgrade System Packages
sudo apt upgrade

Step 3: Download and Extract SLURM Source Code

3.1 Create a Directory for SLURM Source
mkdir ~/slurm
cd ~/slurm
3.2 Download SLURM Source Code
wget https://download.schedmd.com/slurm/slurm-23.11.8.tar.bz2
3.3 Extract the Tarball
tar -xaf slurm-23.11.8.tar.bz2
cd slurm-23.11.8/

Step 4: Install Build Dependencies and Build SLURM

4.1 Install SLURM Package Dependencies
sudo apt install libswitch-perl equivs  # equivs is needed by mk-build-deps
sudo mk-build-deps -i debian/control    # mk-build-deps is provided by devscripts, installed in Step 2.2
4.2 Create slurm User
sudo useradd -m -r -s /bin/false slurm
4.3 Build SLURM Packages
debuild -b -uc -us
cd ..
4.4 Install the Newly Built SLURM Packages
sudo dpkg -i slurm-*.deb
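To confirm that the packages you just built were installed, rather than a leftover repository version, check the version the binaries report:

```shell
# Both the daemon and the client tools should report the version
# you built, e.g. "slurm 23.11.8"
slurmd -V
sinfo -V
```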

Step 5: Configure SLURM for memory limit and core affinity

5.1 Create SLURM Configuration Directory
sudo mkdir -p /etc/slurm
5.2 Create Configuration Files

Once SLURM packages have been installed, create your slurm.conf and cgroup.conf files and copy them to the configuration directory:

sudo cp ~/slurm.conf /etc/slurm/
sudo cp ~/cgroup.conf /etc/slurm/
Template for slurm.conf and cgroup.conf

The following templates already contain all the directives for enabling memory limiting through cgroups, core affinity, and the default MPI binding settings that are useful on a single-node machine.

# slurm.conf file generated by configurator.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ClusterName=YOUR_NODE_NAME
SlurmctldHost=localhost
MpiDefault=none
ReturnToService=2
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/lib/slurm/slurmd
SlurmUser=slurm
StateSaveLocation=/var/lib/slurm/slurmctld
SwitchType=switch/none
#
# TIMERS
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
# SCHEDULING - The SchedulerParameters entry instructs MPI how to bind tasks to cores
SchedulerType=sched/backfill
SelectType=select/cons_tres
SchedulerParameters=default_cpu_bind=cores
#
# MEMORY LIMITING AND AFFINITY - These plugins let SLURM enforce memory limits and core binding via cgroups
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup,task/affinity
SelectTypeParameters=CR_CPU_Memory
#
#AccountingStoragePort=
AccountingStorageType=accounting_storage/none
JobCompType=jobcomp/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurm/slurmd.log
#
# COMPUTE NODES - CUSTOMIZE SOCKETS, CORESPERSOCKET, CPUS AND REALMEMORY BASED ON YOUR MACHINE SPECS
NodeName=localhost Sockets=2 CoresPerSocket=128 CPUs=256 RealMemory=1500000 State=UNKNOWN
PartitionName=batch Nodes=ALL Default=YES MaxTime=INFINITE State=UP
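Rather than guessing the values for the NodeName line, you can let SLURM detect them. `slurmd -C` prints the hardware it sees in slurm.conf syntax, which you can adapt for the COMPUTE NODES section (RealMemory is reported in MB):

```shell
# Print this node's detected topology in slurm.conf format, e.g.:
# NodeName=myhost CPUs=256 Boards=1 SocketsPerBoard=2 CoresPerSocket=128 ...
slurmd -C
```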
# cgroup.conf file
CgroupAutomount=yes
CgroupMountpoint=/sys/fs/cgroup
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
ConstrainDevices=no

Step 6: Set Up SLURM Directories and Permissions

6.1 Create Necessary Directories
sudo mkdir -p /var/spool/slurmctld
sudo mkdir -p /var/spool/slurmd
sudo mkdir -p /var/log/slurm
sudo mkdir -p /var/lib/slurm
6.2 Set Ownership to slurm User
sudo chown -R slurm:slurm /var/spool/slurmctld
sudo chown -R slurm:slurm /var/spool/slurmd
sudo chown -R slurm:slurm /var/log/slurm
sudo chown -R slurm:slurm /var/lib/slurm
6.3 Set Correct Permissions
sudo chmod -R 755 /var/spool/slurmctld
sudo chmod -R 755 /var/spool/slurmd
sudo chmod -R 755 /var/log/slurm
sudo chmod -R 755 /var/lib/slurm
6.4 Set ACL for /var/spool
sudo setfacl -m u:slurm:rwx /var/spool

Step 7: Start and Enable SLURM Services

7.1 Start SLURM Services
sudo systemctl start slurmctld
sudo systemctl start slurmd
7.2 Enable SLURM Services to Start on Boot
sudo systemctl enable slurmctld
sudo systemctl enable slurmd

Step 8: Verify Installation status and logs

8.1 Check Service Status
systemctl status slurmctld
systemctl status slurmd
8.2 Check Logs for Errors
sudo journalctl -xeu slurmctld.service
sudo journalctl -xeu slurmd.service
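If both services are active, check that the scheduler actually sees your node. A freshly configured node sometimes comes up in the drained state; you can clear that with scontrol:

```shell
# The node should be listed as "idle" in the batch partition
sinfo

# Inspect the node in detail (State, CPUs, RealMemory, and a Reason field if drained)
scontrol show node localhost

# If the node is stuck in "drain" or "down", return it to service
sudo scontrol update NodeName=localhost State=RESUME
```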

(optional) Step 9: Editing GRUB Configuration for cgroups

SLURM may not work out of the box because of compatibility issues between your system’s cgroup version and the one SLURM expects. This part of the tutorial can help you fix that.

9.1 Edit the GRUB Configuration File

Firstly, open the GRUB configuration file using a text editor with administrative privileges.

sudo nano /etc/default/grub

Then, modify the GRUB_CMDLINE_LINUX line to include or update parameters that control the cgroup settings. For example, to enable cgroup v2, you might add the following:

GRUB_CMDLINE_LINUX="systemd.unified_cgroup_hierarchy=1 cgroup_no_v1=all"
9.2 Update GRUB

Once the changes to the GRUB configuration file are made, update GRUB to apply the changes.

sudo update-grub
9.3 Reboot the System

Reboot your system to apply the changes to the GRUB configuration.

sudo reboot
9.4 Verify cgroup Configuration

Once the reboot is completed, verify that the cgroup settings were applied correctly by inspecting the kernel command line:

cat /proc/cmdline

This command will display the current kernel parameters, including those set through GRUB, allowing you to confirm that the intended cgroup configurations are active.
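You can also verify which cgroup hierarchy is actually mounted. On a pure cgroup v2 system the filesystem type of /sys/fs/cgroup is cgroup2fs, while a v1 or hybrid setup typically reports tmpfs:

```shell
# Print the filesystem type mounted at /sys/fs/cgroup:
#   cgroup2fs -> cgroup v2 (what recent SLURM expects)
#   tmpfs     -> cgroup v1 or hybrid hierarchy
stat -fc %T /sys/fs/cgroup
```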

Step 10: Run a test job to check SLURM memory limit

Once you have completed all the setup steps, it’s important to check whether memory limiting actually works. The following script tries to allocate 20 GB of RAM; launch it in a SLURM job that allocates only 10 GB. If the job dies with errors saying that SLURM killed it, everything is working properly:

# memory_hog.py
import numpy as np
from time import sleep


def allocate_memory(size_in_gb):
    try:
        # Allocate a large amount of memory
        print(f"Allocating {size_in_gb} GB of memory")
        large_array = np.zeros((size_in_gb * 1024**3 // 8,), dtype=np.float64)
        large_array += 1
        sleep(30)
        print("Memory allocation successful")
    except MemoryError:
        print("Memory allocation failed due to MemoryError")
    except Exception as e:
        print(f"Memory allocation failed due to: {e}")

if __name__ == "__main__":
    allocate_memory(20)  # Try to allocate 20 GB of memory
#!/bin/bash
# test.slurm file
#SBATCH --job-name=test
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --mem=10gb
#SBATCH --output=%x_%j.out

cd $HOME

python memory_hog.py

Then submit the job as customary:

sbatch test.slurm
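While the job runs, you can monitor it as usual:

```shell
# List your queued and running jobs (ST column: R = running, PD = pending)
squeue -u "$USER"

# Show full details of current jobs, including the memory limit applied
scontrol show job
```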

If memory management killed the job, the test*.out file should contain something like this:

/var/lib/slurm/slurmd/job00006/slurm_script: line 10: 139433 Killed                  python memory_hog.py
slurmstepd: error: Detected 1 oom_kill event in StepId=6.batch. Some of the step tasks have been OOM Killed.
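Memory limiting is only half of what this setup enables. To also confirm core affinity, you can ask a job step which CPUs it is allowed to run on; with task/cgroup and task/affinity active, the list should match the cores requested rather than the whole machine:

```shell
# Request 2 CPUs and print the CPU list the task is confined to;
# with affinity working, Cpus_allowed_list shows 2 cores, not all of them
srun --ntasks=1 --cpus-per-task=2 grep Cpus_allowed_list /proc/self/status
```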

Conclusions

Setting up SLURM with cgroups on a single-node Ubuntu 22.04 system ensures robust resource management for your computational tasks. By following this step-by-step guide, you have bypassed the compatibility issues in Ubuntu’s default SLURM package and gained precise control over memory usage and core affinity. This setup is particularly beneficial for environments where strict control over memory and task scheduling is crucial, and your system is now better equipped to handle complex computational workloads efficiently.

Mattia Felice Palermo
Researcher at CIC energiGUNE

Mattia Felice Palermo, Ph.D. in Computational Chemistry, has experience in industrial R&D with corporate and SME companies. His expertise includes molecular dynamics modeling of liquid crystals and polymers, and predicting electrochemical properties using electronic structure methods. He led the computational chemistry group at Green Energy Storage and is now a researcher at the CIC energiGUNE research center.
