Deep learning workloads can be divided into two generic types:
- Interactive "build" sessions. With these types of workloads, the data scientist opens an interactive session via bash, Jupyter Notebook, remote PyCharm, or similar, and accesses GPU resources directly.
- Unattended "training" sessions. With these types of workloads, the data scientist prepares a self-running workload and sends it for execution. During the execution, the data scientist can examine the results.
In this walkthrough you will learn how to:
- Use the Run:AI command-line interface (CLI) to start a deep learning training workload
- View training status and resource consumption using the Run:AI user interface and the Run:AI CLI
- View training logs
- Stop the training
To complete this walkthrough you must have:
- Run:AI software installed on your Kubernetes cluster. See: https://support.run.ai/hc/en-us/articles/360010280179-Installing-Run-AI-on-an-on-premise-Kubernetes-Cluster
- Run:AI CLI installed on your machine. See: https://support.run.ai/hc/en-us/articles/360010706120-Installing-the-Run-AI-Command-Line-Interface
Step by Step Walkthrough
- Open the Run:AI user interface at https://app.run.ai
- Go to "Projects"
- Add a project named "team-a"
- Allocate 2 GPUs to the project
- At the command line run:
runai project set team-a
runai submit hyper1 -i gcr.io/run-ai-demo/quickstart -g 1
This starts an unattended training job for team-a with an allocation of a single GPU. The job is based on the sample Docker image gcr.io/run-ai-demo/quickstart. We named the job hyper1.
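The two commands above can be wrapped in a small script so that the project, job name, image, and GPU count are defined in one place. The following is a minimal sketch: the function name submit_training and the variable names are illustrative and not part of the Run:AI CLI, and it uses only the runai project set and runai submit forms shown above.

```shell
#!/bin/sh
# Hypothetical helper (not part of the Run:AI CLI): wraps the two commands
# from this walkthrough so that the values are defined in one place.
PROJECT="team-a"                        # project created in the user interface
JOB="hyper1"                            # job name used in this walkthrough
IMAGE="gcr.io/run-ai-demo/quickstart"   # sample image used in this walkthrough
GPUS=1                                  # number of GPUs requested for the job

submit_training() {
    # Select the project first; do not submit if that step fails.
    runai project set "$PROJECT" || return 1
    runai submit "$JOB" -i "$IMAGE" -g "$GPUS"
}

# Usage (requires the Run:AI CLI and a configured cluster):
#   submit_training
```

Keeping the values in variables makes it easier to resubmit the same job under a different name or with a different GPU count.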
- Follow up on the job's progress by running:
runai list
Typical statuses you may see:
- ContainerCreating - the Docker image is being downloaded from the cloud repository
- Pending - the job is waiting to be scheduled
- Running - the job is running
- Succeeded - the job has finished successfully
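Instead of rerunning the list command by hand, the status check can be polled from a script. The sketch below is illustrative only: job_status and wait_for_job are hypothetical helper names, and the parsing assumes that runai list prints one row per job containing the job name and one of the status words listed above. The exact output layout may differ between CLI versions, so adjust the matching if needed.

```shell
#!/bin/sh
# Hypothetical helpers (not part of the Run:AI CLI).

# Read job-list text on stdin and print the first known status word found
# on a line that mentions the given job name.
# Assumption: each job row contains the job name and its status word.
job_status() {
    grep -w -- "$1" \
        | grep -o -w -E 'ContainerCreating|Pending|Running|Succeeded' \
        | head -n 1
}

# Poll the job list until the job reports Succeeded.
# Arguments: job name, maximum number of polls (default 60),
# seconds to sleep between polls (default 10).
# Returns 0 on Succeeded, 1 if the polls run out first.
wait_for_job() {
    job=$1
    max_tries=${2:-60}
    delay=${3:-10}
    tries=0
    while [ "$tries" -lt "$max_tries" ]; do
        status=$(runai list | job_status "$job")
        echo "$job: ${status:-unknown}"
        [ "$status" = "Succeeded" ] && return 0
        tries=$((tries + 1))
        sleep "$delay"
    done
    return 1
}

# Usage (requires the Run:AI CLI and a configured cluster):
#   wait_for_job hyper1
```

The poll limit keeps the loop from running forever if the job never reaches Succeeded, for example because it fails or is deleted.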
To get additional status details on your job, run:
runai get hyper1
To view the training logs, run the following:
runai logs hyper1
You should see a log of a running deep learning session:
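To keep a copy of the log for later inspection, the output of the logs command can be redirected to a file. This is a minimal sketch that relies only on standard shell redirection; save_logs is a hypothetical helper name, and the default file name is an arbitrary choice.

```shell
#!/bin/sh
# Hypothetical helper (not part of the Run:AI CLI): write the job's current
# log output to a file. The default file name is <job>.log.
save_logs() {
    job=$1
    out_file=${2:-$job.log}
    runai logs "$job" > "$out_file" && echo "Saved log for $job to $out_file"
}

# Usage (requires the Run:AI CLI and a configured cluster):
#   save_logs hyper1
```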
View status on the Run:AI User Interface
- Go to https://app.run.ai
- Under Dashboards | Overview you should see:
Under "Jobs" you can view the new Workload:
The image we used for training includes the Run:AI Training library. Among other features, this library allows the reporting of metrics from within the deep learning job, such as progress, accuracy, loss, and epoch and step numbers.
- Progress can be seen in the status column above.
- To see other metrics, press the settings wheel on the top right and select additional deep learning metrics from the list
Under Nodes you can see node utilization:
To stop the training, run the following:
runai delete hyper1
This stops the training workload. You can verify this by running runai list again.
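The delete and verify steps can be combined into one function. The sketch below is an assumption-laden illustration: stop_training is a hypothetical helper name, and the check assumes that a deleted job no longer appears by name in the runai list output. Deletion may take a moment to be reflected in the list, so a job that is still listed right after the delete is not necessarily an error.

```shell
#!/bin/sh
# Hypothetical helper (not part of the Run:AI CLI): delete a job, then check
# whether its name still appears in the job list.
stop_training() {
    job=$1
    runai delete "$job" || return 1
    # Assumption: a deleted job no longer appears by name in `runai list`.
    if runai list | grep -q -w -- "$job"; then
        echo "$job is still listed"
        return 1
    fi
    echo "$job is no longer listed"
}

# Usage (requires the Run:AI CLI and a configured cluster):
#   stop_training hyper1
```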
- Follow the Walkthrough: Launch Interactive Workloads https://support.run.ai/hc/en-us/articles/360010894959-Walkthrough-Start-and-Use-Interactive-Build-Workloads-
- Use your own containers to run an unattended training workload