At the heart of the Run:AI solution is the Run:AI scheduler. The scheduler is the gatekeeper of your organization's hardware resources. It makes resource allocation decisions according to predefined rules.
The purpose of this document is to describe the Run:AI scheduler and explain how resource management works.
Run:AI differentiates between two types of deep learning workloads:
- Interactive build workloads. With this type of workload, the data scientist opens an interactive session via bash, Jupyter Notebook, remote PyCharm, or similar, and accesses GPU resources directly. Build workloads typically do not tax the GPU for long durations. There is also typically a real user behind an interactive workload who needs an immediate scheduling response.
- Unattended (or "non-interactive") training workloads. Training is characterized by a deep learning run that has a start and a finish. With this type of workload, the data scientist prepares a self-running workload and sends it for execution. Training workloads typically utilize large percentages of the GPU. During execution, the researcher can examine the results. A training session can take anything from a few minutes to a couple of weeks. It can be interrupted midway and restored later.
It follows that a good practice for the researcher is to save checkpoints regularly and allow the code to restore from the last checkpoint.
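The checkpointing practice can be sketched in a few lines of Python. This is a minimal illustration rather than a Run:AI API: the function names, the JSON file format, and the `checkpoint.json` path are all assumptions, and a real training job would store model and optimizer state using its framework's own tools.

```python
import json
import os
import tempfile

CHECKPOINT_PATH = "checkpoint.json"  # hypothetical path on persistent storage


def save_checkpoint(path, state):
    """Write the state atomically so an interruption cannot leave a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)  # atomic rename over the previous checkpoint


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists yet."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"epoch": 0}


def train(path, total_epochs):
    """Resume from the last completed epoch and checkpoint after each one."""
    state = load_checkpoint(path)
    for epoch in range(state["epoch"], total_epochs):
        # ... one epoch of training would run here ...
        state["epoch"] = epoch + 1
        save_checkpoint(path, state)
    return state
```

If the workload is preempted after epoch 3 and later restarted, calling `train(CHECKPOINT_PATH, 5)` again picks up at epoch 4 instead of starting over.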
Projects are quota entities that associate a project name with a deserved GPU quota as well as other preferences.
A researcher submitting a workload must associate a project with every workload request. The Run:AI scheduler then compares the request against the current allocations and the project's deserved quota, and determines whether the workload can be allocated resources or whether it should remain in a pending state.
For further information on projects and how to configure them, see: https://support.run.ai/hc/en-us/articles/360011591300-Working-with-Project-Quotas
Basic Scheduling Concepts
Interactive vs. Unattended
The Researcher uses the --interactive flag to specify whether the workload is an unattended "train" workload or an interactive "build" workload.
- Interactive workloads will get precedence over unattended workloads.
- Unattended workloads can be preempted when the scheduler determines a more urgent need for resources. Interactive workloads are never preempted.
Guaranteed Quota and Over-Quota
Every new workload is associated with a Project. The project contains a deserved GPU quota. During scheduling:
- If the newly required resources, together with the currently used resources, end up within the project's quota, then the workload is ready to be scheduled as part of the guaranteed quota.
- If the newly required resources, together with the currently used resources, end up above the project's quota, the workload will only be scheduled if there are 'spare' GPU resources. There are nuances in this flow, meant to ensure that a project does not end up with an over-quota allocation made up entirely of interactive workloads. For additional details, see below.
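The two quota rules above amount to a simple comparison, sketched here in Python. The function name, parameter names, and return labels are hypothetical, and the sketch deliberately ignores the nuances around interactive workloads in over-quota.

```python
def classify_request(requested_gpus, allocated_gpus, quota_gpus, spare_gpus):
    """Classify a new workload request against its project's deserved quota.

    requested_gpus: GPUs the new workload asks for
    allocated_gpus: GPUs the project is currently using
    quota_gpus:     the project's deserved GPU quota
    spare_gpus:     GPUs in the cluster not needed for any guaranteed quota
    """
    if allocated_gpus + requested_gpus <= quota_gpus:
        # Within quota: eligible for scheduling on guaranteed quota.
        return "guaranteed"
    if requested_gpus <= spare_gpus:
        # Above quota: only schedulable when spare GPUs exist.
        return "over-quota"
    return "pending"
```

For example, a project with a quota of 3 GPUs that already uses 2 can add a 1-GPU workload on guaranteed quota, while a 2-GPU workload would depend on spare capacity.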
Allocation & Preemption
The Run:AI scheduler wakes up periodically to perform allocation tasks on pending workloads:
- The scheduler looks at each Project separately and selects the most 'deprived' Project.
- For this deprived project it chooses a single workload to work on:
  - Interactive workloads are tried first, but only up to the project's guaranteed quota. If such a workload exists, it is scheduled even if it means preempting a running unattended workload in this Project.
  - Otherwise, it looks for an unattended workload and schedules it on guaranteed quota or over-quota.
- The scheduler then recalculates the next 'deprived' project and continues with the same flow until it finishes attempting to schedule all workloads.
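The allocation flow above can be sketched as a loop. This is an illustration under stated assumptions, not the scheduler's implementation: the text does not define how 'deprived' is measured, so the sketch assumes it means the lowest ratio of allocated GPUs to quota (with quota greater than zero). The `Workload` and `Project` classes are hypothetical, and preemption is left out to keep the sketch short.

```python
from dataclasses import dataclass, field


@dataclass
class Workload:
    name: str
    gpus: int
    interactive: bool


@dataclass
class Project:
    name: str
    quota: int  # deserved GPU quota, assumed > 0
    allocated: int = 0
    pending: list = field(default_factory=list)


def pick_workload(project, free_gpus):
    """Choose a single pending workload for the project."""
    # Interactive workloads are tried first, but only within guaranteed quota.
    for w in project.pending:
        within_quota = project.allocated + w.gpus <= project.quota
        if w.interactive and within_quota and w.gpus <= free_gpus:
            return w
    # Otherwise an unattended workload, on guaranteed quota or over-quota.
    for w in project.pending:
        if not w.interactive and w.gpus <= free_gpus:
            return w
    return None


def allocation_pass(projects, free_gpus):
    """Repeatedly serve the most deprived project until nothing more fits."""
    scheduled = []
    candidates = list(projects)
    while candidates:
        # Assumption: most deprived = lowest allocated-to-quota ratio.
        project = min(candidates, key=lambda p: p.allocated / p.quota)
        workload = pick_workload(project, free_gpus)
        if workload is None:
            candidates.remove(project)  # nothing schedulable for this project
            continue
        project.pending.remove(workload)
        project.allocated += workload.gpus
        free_gpus -= workload.gpus
        scheduled.append((project.name, workload.name))
    return scheduled
```

Recomputing the most deprived project after every single allocation is what keeps one project from draining the cluster before the others get a turn.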
During the above process, there may be a pending workload whose project is below its deserved quota but which still cannot be allocated due to a lack of GPU resources. The scheduler will then look for alternative allocations at the expense of another project that has gone over-quota, while preserving fairness between projects.
- project A has been allocated with a quota of 3 GPUs, and
- project B has been allocated with a quota of 1 GPU.
Bin-packing & Consolidation
Part of an efficient scheduler is the ability to eliminate fragmentation:
- The first step in avoiding fragmentation is bin packing: try to fill nodes (machines) up before allocating workloads to new machines.
- The next step is to consolidate jobs on demand. If a workload cannot be allocated due to fragmentation, the scheduler will try to move unattended workloads from node to node in order to free up the required number of GPUs to schedule the pending workload.
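As a rough illustration of bin packing, the sketch below uses a best-fit heuristic: among the nodes that can hold the request, pick the one with the fewest free GPUs. Best-fit is an assumption chosen for the example, not necessarily the heuristic Run:AI uses, and the function names are hypothetical.

```python
def pick_node(free_gpus_by_node, requested_gpus):
    """Best-fit bin packing: choose the fullest node that still fits the request."""
    fitting = {
        node: free
        for node, free in free_gpus_by_node.items()
        if free >= requested_gpus
    }
    if not fitting:
        return None  # no single node fits; consolidation would be the next step
    return min(fitting, key=fitting.get)


def is_fragmented(free_gpus_by_node, requested_gpus):
    """True when enough GPUs are free in total, but no single node has them."""
    total_free = sum(free_gpus_by_node.values())
    no_node_fits = pick_node(free_gpus_by_node, requested_gpus) is None
    return no_node_fits and total_free >= requested_gpus
```

With free GPUs of `{"n1": 1, "n2": 4, "n3": 2}`, a 2-GPU request goes to `n3`, which leaves the 4 free GPUs on `n2` intact for a larger workload.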
Run:AI Elasticity is explained separately. In essence, it allows unattended workloads to shrink or expand based on the cluster's availability.
- Shrinking happens when the scheduler is unable to schedule an elastic unattended workload and no amount of consolidation helps. The scheduler then repeatedly halves the number of requested GPUs and tries to reschedule.
- Shrunk jobs will expand when enough GPUs become available.
- Expanding happens when the scheduler finds enough spare GPU resources to double the number of GPUs for an elastic workload.
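The shrink and expand rules can be sketched as two small functions. The names are hypothetical, and two details are assumptions rather than documented behavior: odd GPU counts are halved with integer division, and expansion is capped at the originally requested amount.

```python
def shrink_to_fit(requested_gpus, available_gpus):
    """Halve the request until it fits; return None if even 1 GPU does not fit."""
    gpus = requested_gpus
    while gpus > available_gpus and gpus > 1:
        gpus //= 2  # assumption: integer halving for odd counts
    return gpus if gpus <= available_gpus else None


def expand_if_possible(current_gpus, requested_gpus, spare_gpus):
    """Double the allocation when spare GPUs cover the additional amount."""
    doubled = current_gpus * 2
    # Assumption: never expand beyond the originally requested amount.
    if doubled <= requested_gpus and spare_gpus >= current_gpus:
        return doubled
    return current_gpus
```

A workload that requested 8 GPUs when only 3 are available would shrink to 2 (8, then 4, then 2), and could later grow back to 4 once 2 more GPUs become spare.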