Tutorial. Working with a resource manager on an HPC infrastructure

Registration for this event.

Notice: Registration for the Fundamental Tutorials is now closed.

Synopsis:

Instructors: Eugenio Guerra - Esteban Osorio

Language: Spanish

National Laboratory for High Performance Computing (NLHPC)

Attendance: 80 people

Requirements:
To follow the course and the commands the instructors will run, basic knowledge of Linux is recommended.

This workshop shows how to use Slurm, the resource management system found on the vast majority of TOP500 supercomputers. It uses the Leftraru-Guacolda cluster of the National Laboratory for High Performance Computing (NLHPC).

The tutorial consists of two sessions of 4 hours each.

The contents of session 1 are as follows:


Module I

  • NLHPC infrastructure
  • Presentation of NLHPC infrastructure
  • Accessing the cluster and submitting tasks
  • NLHPC login nodes
  • Basic use of Slurm
  • Using the srun command and its parameters
  • Using the sbatch command
  • Basic script
  • Queuing tasks
  • Monitoring tasks
  • Cancelling tasks
  • Resource underutilization
  • Other basic tasks
  • Available software
  • Listing available software
  • Using available software
  • Computational efficiency
  • Others
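
As a rough illustration of the "Basic script" item above, the sketch below assembles a minimal Slurm batch script as text. It is not taken from the course material: the function name build_batch_script, the job name, the resource values and the hostname command are all hypothetical, and only the standard #SBATCH options --job-name, --ntasks, --time and --output are assumed.

```python
def build_batch_script(job_name, ntasks, time_limit, command):
    """Return the text of a minimal Slurm batch script (illustrative only)."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",   # name shown in the queue
        f"#SBATCH --ntasks={ntasks}",       # number of tasks requested
        f"#SBATCH --time={time_limit}",     # wall-clock limit, HH:MM:SS
        "#SBATCH --output=%x_%j.out",       # %x = job name, %j = job ID
        "",
        f"srun {command}",                  # launch the command as a job step
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical values chosen only for the example.
    print(build_batch_script("demo", 1, "00:05:00", "hostname"), end="")
```

Saved to a file such as demo.sh (a hypothetical name), a script like this would typically be submitted with sbatch, checked in the queue with squeue, and cancelled with scancel followed by the job ID.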


Module II

  • Parallel programming (basic notions)
  • Shared Memory Model (OpenMP)
  • Message passing model (MPI)
  • Running simulations
  • Sequential jobs
  • OpenMP jobs
  • MPI jobs
  • Multiple sequential jobs (job arrays)
  • Jobs that use GPUs
  • Job dependencies
  • Task scheduling using crontab
  • Checkpoint/Restart
  • Simulation monitoring
  • Monitoring simulations using HTTP
  • Monitoring simulations using Ganglia
  • Utilization graphs in notification emails
  • Installing and compiling applications
  • Compilers and flags used
  • Compiling programs from source code
  • Installing modules in Python
  • Installing modules in R
  • Frequent problems
  • Cancellation due to excess memory
  • Cancellation due to CPU underutilization
  • Cancellation due to memory underutilization
  • Resource overuse
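
To make the "job array" and "Job dependencies" items above a little more concrete, here is a minimal sketch that builds the corresponding sbatch command lines. It is only an illustration under stated assumptions: the helper names array_submit_args and dependent_submit_args, the script names and the job ID are hypothetical, while --array and --dependency=afterok are standard sbatch options.

```python
def array_submit_args(script, first, last, max_parallel=None):
    """Build an sbatch command that runs `script` once per array index."""
    spec = f"{first}-{last}"
    if max_parallel is not None:
        # "%N" limits how many array tasks run at the same time.
        spec += f"%{max_parallel}"
    return ["sbatch", f"--array={spec}", script]


def dependent_submit_args(script, job_id):
    """Build an sbatch command that starts only after `job_id` succeeds."""
    return ["sbatch", f"--dependency=afterok:{job_id}", script]


if __name__ == "__main__":
    # Hypothetical script names and job ID, used only for the example.
    print(" ".join(array_submit_args("sweep.sh", 0, 9, max_parallel=2)))
    print(" ".join(dependent_submit_args("postprocess.sh", 12345)))
```

On a login node, each list could be passed to subprocess.run. Inside an array job, every task can read its own index from the SLURM_ARRAY_TASK_ID environment variable.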