Compute grid¶
CIRRELT and GERAD provide a local computing cluster based on the Slurm workload manager, the same system used by the Digital Research Alliance of Canada (formerly Compute Canada). Although our resources are more limited and the configuration differs, usage remains similar, so most of the information available on the Alliance wiki's job submission page also applies to our cluster.
Rules for Using the Computing Cluster¶
- Never run jobs directly on the `slurm` machine (it is reserved for job submissions).
- Use the correct amount of memory for your jobs.
- Use the appropriate number of CPUs for your jobs.
- Jobs shorter than 10 minutes may be ignored by the system. Make sure your computations exceed this threshold.
- Use the appropriate partition for your job:
  - `optimum`: for computations under 2 days (10 machines available).
  - `optimumlong`: for computations between 2 and 7 days (1 machine available).
  - `testing`: to validate your scripts before submission (max. 15 minutes).
- Use your disk space under `/scratch` if your job performs a large amount of read/write operations or if you have large datasets.
Hardware Resources¶
| Model | Memory | CPU (per machine) |
|---|---|---|
| Dell PowerEdge R740 | 512 GB | 2 × Intel Xeon Gold 6258R (56 cores) |
Note: The `slurm` machine may be used to test your scripts (using the `testing` partition), but its resources are very limited.
Job Management¶
To run any Slurm command (squeue, sbatch, etc.), you must connect via SSH to the slurm machine.
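For example, assuming the submission host is reachable under the short name `slurm` (replace the hostname and username with the values provided by the administrators):

```bash
# Hypothetical hostname: use the address given to you by the administrators
ssh your_username@slurm
```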
Cluster Status¶
The sinfo command shows the status of the computing cluster and the maximum time limits of the partitions.
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
optimum* up 2-00:00:00 1 mix optimum01
optimum* up 2-00:00:00 1 drain optimum02
optimum* up 2-00:00:00 1 drng optimum03
optimum* up 2-00:00:00 7 idle optimum[04-10]
optimumlong up 7-00:00:00 1 idle optimum11
- `idle`: machine is completely available
- `mix`: partially used
- `drain`: under maintenance
- `drng`: running tasks but not accepting new ones; will switch to `drain` when tasks finish
Task Submission¶
The sbatch command is used to submit a task to the compute grid. Parameters can be set in the submission script, passed on the command line, or a combination of both.
To run a task on the grid, you must have a script with Slurm parameters.
Here are the most commonly used parameters:
| Parameter | Description | Example |
|---|---|---|
| `--cpus-per-task` | Number of CPUs allocated. Must match the actual needs of your program. | `--cpus-per-task=4` |
| `--mem` | Required memory. If exceeded, the task is canceled. | `--mem=16G` |
| `--time` | Maximum duration. Format: `DD-HH:MM:SS`. | `--time=2-12:00:00` |
| `--output` | Output file. Default: `slurm-<ID>.out`. | `--output=results.log` |
| `--partition` | Target partition (`optimum`, `optimumlong`, `testing`). | `--partition=optimumlong` |
| `--nodelist` | Specific nodes (e.g., `optimum[01-03]`). | `--nodelist=optimum01` |
| `--array` | Allows running identical tasks in parallel, each with a unique ID (`$SLURM_ARRAY_TASK_ID`). Avoids manually submitting each instance. | `--array=1-8` |
Best Practices¶
- Test your script with the `testing` partition before submitting to `optimum`.
- Avoid overestimating resources: higher demands result in longer wait times.
- Limit CPU usage for software like CPLEX or Gurobi (use `--cpus-per-task`) and adjust the number of threads in your program accordingly, as shown in the sketch below.
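For example, a submission script can read the number of allocated CPUs from Slurm's environment and pass it to the solver. The invocations below are illustrative sketches, not the only way to call these tools:

```bash
# Match thread counts to the allocation requested with --cpus-per-task
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

# Gurobi command-line solver (illustrative): limit its thread count
gurobi_cl Threads=$SLURM_CPUS_PER_TASK model.lp

# CPLEX interactive optimizer (illustrative): set the thread limit before solving
cplex -c "set threads $SLURM_CPUS_PER_TASK" "read model.lp" "optimize"
```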
When your script is ready, simply use the command:
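For example, assuming your submission script is named `my_script.sh`:

```bash
sbatch my_script.sh
```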
Here are three script examples to get you started:
Script for a single instance to solve.
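Below is a minimal sketch of such a script. The program name, instance file, and resource values are placeholders; adjust them to your own needs:

```bash
#!/bin/bash
#SBATCH --job-name=single_instance
#SBATCH --partition=optimum
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G
#SBATCH --time=0-12:00:00
#SBATCH --output=single_instance-%j.out

# Keep the program's thread count in line with the allocation
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

# Hypothetical solver invocation: replace with your own program and instance
./my_solver --threads "$SLURM_CPUS_PER_TASK" data/instance01.txt
```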
If you have a series of instances, use the `--array` option to submit a single sbatch job that covers all the instances you want to solve.
In this example, the total number of generated instances is 8:
- n = 2 values
- dataset = 2 values
- country = 2 values

2 × 2 × 2 = 8.
This is why we use --array=1-8 in the header. It is crucial that these numbers match so that all instances are launched.
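A sketch of such an array script is shown below. The parameter values for n, dataset, and country are hypothetical (two values each, giving the 8 combinations mentioned above):

```bash
#!/bin/bash
#SBATCH --job-name=array_example
#SBATCH --partition=optimum
#SBATCH --cpus-per-task=2
#SBATCH --mem=8G
#SBATCH --time=0-06:00:00
#SBATCH --output=array_example-%A_%a.out
#SBATCH --array=1-8

# Hypothetical parameter values: 2 x 2 x 2 = 8 combinations
N_VALUES=(50 100)
DATASETS=(A B)
COUNTRIES=(CA US)

# Map the array task ID (1 to 8) to one combination of parameters
IDX=$((SLURM_ARRAY_TASK_ID - 1))
N=${N_VALUES[$((IDX / 4))]}
DATASET=${DATASETS[$(((IDX / 2) % 2))]}
COUNTRY=${COUNTRIES[$((IDX % 2))]}

# Hypothetical solver invocation: replace with your own program
./my_solver --n "$N" --dataset "$DATASET" --country "$COUNTRY"
```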
To run a MATLAB program on the compute grid, ensure your program is not in graphical mode and can obtain all required parameters without interaction or code modification.
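A sketch of a MATLAB submission script follows. The module name and the function call are assumptions; adapt them to the MATLAB installation on the cluster and to your own code:

```bash
#!/bin/bash
#SBATCH --job-name=matlab_job
#SBATCH --partition=optimum
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G
#SBATCH --time=1-00:00:00
#SBATCH --output=matlab_job-%j.out

# Assumed module name: adjust to the local installation
module load matlab

# Run MATLAB without the graphical interface; my_solver.m and its arguments are placeholders
matlab -batch "my_solver('data/instance01.mat', 4)"
```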
Task Cancellation¶
The scancel program is used to cancel one or more tasks. Example:
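For instance (the job ID and username below are placeholders):

```bash
# Cancel a specific task by its job ID
scancel 1234567

# Cancel all of your own tasks
scancel -u your_username
```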
Task Monitoring and Statistics¶
Current Task Status¶
The squeue program is used to view tasks in the system. By default, if no options are provided, you will see everyone's tasks. Use the -u option to restrict the list to your own username, and the SQUEUE_FORMAT environment variable to change the command's default display.
For example, the Alliance uses:
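The exact string used by the Alliance may differ slightly from the one shown here; the following is a representative format of the same kind:

```bash
# Representative SQUEUE_FORMAT (adjust the fields to your preference)
export SQUEUE_FORMAT="%.15i %.8u %.12a %.14j %.3t %.10L %.5D %.4C %.7m %N (%r)"
```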
Examples¶
| Command | Description |
|---|---|
| `squeue -u username` | Displays all tasks for the user |
| `squeue -u username -t RUNNING` | Displays the user's tasks that are running |
| `squeue -u username -t PENDING` | Displays the user's tasks that are pending |
Task States¶
| Code | State | Description |
|---|---|---|
| `CA` | Canceled | The task was canceled |
| `CD` | Complete | The task is completed |
| `F` | Failed | The task exited with a non-zero exit code |
| `PD` | Pending | The task is waiting |
| `R` | Running | The task is running |
Detailed Task Information¶
The sstat and sacct commands can be used to obtain more information about tasks. The sstat command can only be used for running tasks, while sacct can also be used for completed tasks.
The available fields can be displayed with the -e option and then used with the --format option.
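For example, to list the fields and then query a running task (the job ID is a placeholder; with `sstat`, the `.batch` step is usually the one of interest):

```bash
# List all fields that sacct can report
sacct -e

# CPU and memory statistics for the batch step of a running task
sstat -j 1234567.batch --format=JobID,AveCPU,AveRSS,MaxRSS
```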
To view older tasks, you can specify a start date:
sacct --starttime 2022-04-17 --format=Account,User,JobID,Start,End,AllocCPUS,Elapsed,AllocTRES%30,CPUTime,AveRSS,MaxRSS,MaxRSSTask,MaxRSSNode,NodeList,ExitCode,State%20
The SACCT_FORMAT environment variable allows you to define a format if you do not want to specify it each time you run the program. The following example is the format used by the Alliance:
export SACCT_FORMAT=Account,User,JobID,Start,End,AllocCPUS,Elapsed,AllocTRES%30,CPUTime,AveRSS,MaxRSS,MaxRSSTask,MaxRSSNode,NodeList,ExitCode,State%20
The seff program can be used to view task execution statistics, such as CPU and memory usage percentages.
$ seff 12345
Job ID: 12345
Cluster: cluster
User/Group:
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 6
CPU Utilized: 07:31:57
CPU Efficiency: 22.34% of 1-09:42:54 core-walltime
Job Wall-clock time: 05:37:09
Memory Utilized: 49.33 GB
Memory Efficiency: 49.33% of 100.00 GB
This command lets you check how well the requested resources were used and whether the request was appropriate. In this case, only about half of the requested memory was actually used, and the CPU time used corresponds to roughly 22% of the core-walltime allocated to the task.
The reportseff program allows you to view statistics for multiple tasks at once.

In the example above (a reportseff listing for several tasks), some tasks ran for a very short time, which may indicate that the program did not work correctly. The numbers highlighted in red draw attention to tasks whose resources are poorly utilized.
You can find more information about the reportseff program on its software page.
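For example (the job IDs and username are placeholders; available options depend on the installed version):

```bash
# Efficiency summary for specific tasks
reportseff 1234567 1234568

# Efficiency summary for all of your tasks
reportseff --user your_username
```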
The seff-array program allows you to analyze the tasks of a single job array. It shows, in histogram form, how the tasks are distributed across different levels of resource usage.
$ seff-array 12345
--------------------------------------------------------
Job Information
ID: 12345
Name: test_models.sh
Cluster: cluster
User/Group: -----
Requested CPUs: 6 cores on 1 node(s)
Requested Memory: 100G
Requested Time: 14:00:00
--------------------------------------------------------
Job Status
COMPLETED: 28
--------------------------------------------------------
--------------------------------------------------------
Finished Job Statistics
(excludes pending, running, and cancelled jobs)
Average CPU Efficiency 37.26%
Average Memory Usage 26.38G
Average Run-time 18682.25s
---------------------
CPU Efficiency (%)
---------------------
+0.00e+00 - +1.00e+01 [0]
+1.00e+01 - +2.00e+01 [7] ████████████████████████████████████████
+2.00e+01 - +3.00e+01 [5] ████████████████████████████▋
+3.00e+01 - +4.00e+01 [3] █████████████████▏
+4.00e+01 - +5.00e+01 [3] █████████████████▏
+5.00e+01 - +6.00e+01 [7] ████████████████████████████████████████
+6.00e+01 - +7.00e+01 [2] ███████████▍
+7.00e+01 - +8.00e+01 [1] █████▊
+8.00e+01 - +9.00e+01 [0]
+9.00e+01 - +1.00e+02 [0]
Memory Efficiency (%)
---------------------
+0.00e+00 - +1.00e+01 [3] █████████████▍
+1.00e+01 - +2.00e+01 [7] ███████████████████████████████▏
+2.00e+01 - +3.00e+01 [9] ████████████████████████████████████████
+3.00e+01 - +4.00e+01 [4] █████████████████▊
+4.00e+01 - +5.00e+01 [4] █████████████████▊
+5.00e+01 - +6.00e+01 [0]
+6.00e+01 - +7.00e+01 [0]
+7.00e+01 - +8.00e+01 [1] ████▌
+8.00e+01 - +9.00e+01 [0]
+9.00e+01 - +1.00e+02 [0]
Time Efficiency (%)
---------------------
+0.00e+00 - +1.00e+01 [ 2] ████▎
+1.00e+01 - +2.00e+01 [ 0]
+2.00e+01 - +3.00e+01 [ 0]
+3.00e+01 - +4.00e+01 [19] ████████████████████████████████████████
+4.00e+01 - +5.00e+01 [ 7] ██████████████▊
+5.00e+01 - +6.00e+01 [ 0]
+6.00e+01 - +7.00e+01 [ 0]
+7.00e+01 - +8.00e+01 [ 0]
+8.00e+01 - +9.00e+01 [ 0]
+9.00e+01 - +1.00e+02 [ 0]
--------------------------------------------------------
Temporary Workspace¶
A temporary workspace is available in the /scratch directory on each of the optimum machines. This space is also accessible from the slurm frontend machine, allowing you to copy your data before running tasks.
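For example, from the slurm machine you could copy a dataset with rsync before submitting your tasks. The `/scratch/$USER` layout is an assumption; use the directory you were assigned:

```bash
# Copy input data to the temporary workspace before submitting tasks
rsync -av ~/projects/my_instances/ /scratch/$USER/my_instances/
```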
Note: There is no backup for files stored in this directory, so do not place any files there that you cannot afford to lose.
We ask that you use this space reasonably.