Some researchers do data science on bare metal. The term "bare metal" refers to connecting to a server and working directly on its operating system and disks.
This is the fastest way to start working, but it introduces problems when the data science organization scales:
- More researchers mean that the machine resources need to be efficiently shared
- Researchers need to collaborate and share data, code, and results
To address these problems, people working on bare metal typically write scripts to gather data, code, and code dependencies. Maintaining these scripts soon becomes an overwhelming task.
Why Use Docker Images?
Docker images, and containerization in general, provide a level of abstraction that, by and large, frees developers and researchers from the mundane task of setting up an environment. The image packages its own operating system files and libraries, so the environment is, by and large, part of the image.
When a Docker image is instantiated, it creates a container. A container is the running instance of a Docker image.
Moving a Data Science Environment to Docker
A data science environment typically consists of:
- Training data.
- Machine Learning (ML) code and inputs.
- Libraries: Code dependencies that must be installed before the ML code can be run.
Training Data
Training data is usually large (from several gigabytes to petabytes) and is read-only in nature. Thus, training data is typically left outside of the Docker image. Instead, the data is mounted onto the container when the image is instantiated. Mounting a volume allows the code within the container to access the data as though it were in a directory on the local file system.
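As a minimal sketch of the mounting described above, the function below composes a `docker run` command that bind-mounts a data directory read-only into the container. The host path `/mnt/shared/datasets`, the container path `/data`, and the image name `my-ml-image` are hypothetical placeholders; the `:ro` suffix on `-v` is Docker's standard way to request a read-only mount. By default the command is only printed, so the sketch is safe to try on a machine without Docker.

```shell
# Print the command by default; set DRY_RUN=0 to actually execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then printf '%s\n' "$*"; else "$@"; fi
}

# Mount a host (or shared file system) directory read-only at /data
# inside the container and list its contents.
# $1 = host data directory (hypothetical), $2 = image name (hypothetical)
mount_training_data() {
  run docker run --rm -v "$1:/data:ro" "$2" ls /data
}

mount_training_data /mnt/shared/datasets my-ml-image
```

Inside the container, the training code would then read from `/data` as if it were a local directory, while any attempt to write there fails because of the read-only flag.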
The best practice is to store the training data on a shared file system. This allows the data to be accessed uniformly on whichever machine the researcher is currently using, allowing the researcher to easily migrate between machines.
Organizations without a shared file system typically write scripts to copy data from machine to machine.
Machine Learning Code and Inputs
There are two common ways to make the code available to the container:
- The code resides in the image and is periodically pulled from the repository. This practice requires building a new container image each time the code changes.
- When a shared file system exists, the code can reside outside the image on a shared disk and be mounted onto the container via a volume.
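The two options above can be sketched as follows. This is an illustration under assumptions, not a prescribed workflow: the image name `my-ml-image`, the tag `v2`, the host directory `/mnt/shared/code`, and the container path `/workspace` are all hypothetical. The commands are printed rather than executed unless `DRY_RUN=0` is set.

```shell
# Print the command by default; set DRY_RUN=0 to actually execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then printf '%s\n' "$*"; else "$@"; fi
}

# Option 1: the code lives in the image, so every code change needs a rebuild.
# $1 = image name and tag (hypothetical), $2 = build context holding the Dockerfile
rebuild_image() {
  run docker build -t "$1" "$2"
}

# Option 2: the code stays on a shared disk and is mounted at /workspace.
# $1 = host code directory (hypothetical), $2 = image name (hypothetical)
mount_code() {
  run docker run -it --rm -v "$1:/workspace" -w /workspace "$2" bash
}

rebuild_image my-ml-image:v2 .
mount_code /mnt/shared/code my-ml-image
```

With the second option, edits made to the code on the shared disk are visible inside the container immediately, without rebuilding the image.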
ML Lifecycle: Build and Train
Deep learning workloads can be divided into two generic types:
- Interactive "build" sessions. With these types of workloads, the data scientist opens an interactive session via bash, Jupyter Notebook, remote PyCharm, or similar, and accesses GPU resources directly. Build workloads are typically meant for debugging and development sessions.
- Unattended "training" sessions. Training is characterized by a machine learning run that has a start and a finish. With these types of workloads, the data scientist prepares a self-running workload and sends it for execution. During the execution, the data scientist can examine the results. A training session can take from a few minutes to a couple of days. It can be interrupted in the middle and later resumed, provided the data scientist saves checkpoints for that purpose. Training workloads typically utilize a large percentage of the GPU and automatically free the resources at the end of the run.
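In plain Docker terms, the difference between the two workload types might look like the sketch below. It assumes a host with the NVIDIA container runtime (required for `--gpus all`); the image name, the script `train.py`, and the checkpoint directory `/mnt/shared/checkpoints` are hypothetical. The commands are printed rather than executed unless `DRY_RUN=0` is set.

```shell
# Print the command by default; set DRY_RUN=0 to actually execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then printf '%s\n' "$*"; else "$@"; fi
}

# $1 = session type ("build" or "train"), $2 = image name (hypothetical)
start_session() {
  case "$1" in
    build)
      # Interactive session: -it attaches a terminal for debugging and development.
      run docker run -it --rm --gpus all "$2" bash
      ;;
    train)
      # Unattended session: -d detaches, and the container exits (releasing
      # the GPU) when the script finishes. The checkpoint mount keeps saved
      # state outside the container so an interrupted run can be resumed.
      run docker run -d --gpus all -v /mnt/shared/checkpoints:/checkpoints "$2" python train.py
      ;;
    *)
      echo "usage: start_session build|train IMAGE" >&2
      return 1
      ;;
  esac
}

start_session build my-ml-image
start_session train my-ml-image
```

When executed for real, `docker run -d` prints the new container's ID, which can be passed to `docker logs -f` to examine results while training is in progress.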
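For a build session, a common approach is to start an interactive container from a well-known image and mount the directory that holds the code. Note that options such as -v must appear before the image name:
docker run -it .... -v /where/my/code/resides "the well known image" bash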
An image intended for training workloads is typically put together as follows:
- Base image is nvidia-tensorflow
- Install popular software.
- (Optional) Run a script.
The script can be part of the image or can be provided on the command line used to run the container. It typically installs additional dependencies and references the ML code to be run.
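A start-up script of the kind described above could look roughly like this. It is a sketch under assumptions: the paths `/workspace/requirements.txt` and `/workspace/train.py` are hypothetical, and using `pip` presumes a Python-based image. The steps are printed rather than executed unless `DRY_RUN=0` is set.

```shell
# Print the command by default; set DRY_RUN=0 to actually execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then printf '%s\n' "$*"; else "$@"; fi
}

# $1 = requirements file, $2 = training script (both hypothetical paths)
start_workload() {
  # Install additional dependencies only when a requirements file exists.
  if [ -f "$1" ]; then
    run pip install -r "$1"
  fi
  # Hand over to the ML code.
  run python "$2"
}

start_workload /workspace/requirements.txt /workspace/train.py
```

Keeping the dependency list in a file next to the code means the same image can serve several projects, at the cost of a slower container start.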
The best practice for running training workloads is to test the container image in a "build" session and then submit it for execution as a training job. For further information on how to set up and parameterize a training workload via docker or runai, see https://support.run.ai/hc/en-us/articles/360012065440-Converting-your-Workload-to-use-Unattended-Training-Execution