Run Lifecycle#
Runs are launched by specifying a YAML file with the command:
mcli run -f <yaml>
Certain fields in the YAML (cluster, gpu-type, gpus, image) can be overridden via the command line.
For example:
mcli run -f <yaml> --gpus 8
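The override behavior can be pictured as a simple merge in which any value given on the command line replaces the corresponding YAML value. The sketch below is illustrative only: `apply_overrides` is a hypothetical helper, not part of MCLI, and the dictionary keys simply mirror the field names listed above rather than the exact YAML schema.

```python
from typing import Any, Dict, Optional

# Fields the text lists as overridable from the command line.
OVERRIDABLE_FIELDS = ("cluster", "gpu-type", "gpus", "image")


def apply_overrides(yaml_config: Dict[str, Any],
                    cli_args: Dict[str, Optional[Any]]) -> Dict[str, Any]:
    """Return a copy of yaml_config with CLI values taking precedence.

    Hypothetical helper for illustration; MCLI's real merge logic may differ.
    """
    merged = dict(yaml_config)
    for field in OVERRIDABLE_FIELDS:
        value = cli_args.get(field)
        if value is not None:  # only flags that were actually passed override
            merged[field] = value
    return merged


# Roughly what `mcli run -f <yaml> --gpus 8` would mean under this model.
config = apply_overrides({"image": "bash", "gpus": 1}, {"gpus": 8})
```

Under this model, fields that are not passed on the command line keep whatever value the YAML provided.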
When a run is launched, it goes through multiple phases:
Run Construction
Queued
Starting
Running
Completed
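As a mental model, the phases listed above can be treated as an ordered sequence of states. The following is a minimal sketch, assuming a strictly linear happy path; the `Phase` enum and `next_phase` function are hypothetical names for illustration, and failure or manual stops are not modeled.

```python
from enum import Enum
from typing import List, Optional


class Phase(Enum):
    """Lifecycle phases in the order listed above (names are illustrative)."""
    CONSTRUCTION = "Run Construction"
    QUEUED = "Queued"
    STARTING = "Starting"
    RUNNING = "Running"
    COMPLETED = "Completed"


def next_phase(phase: Phase) -> Optional[Phase]:
    """Return the phase that follows `phase`, or None after the last one."""
    ordered = list(Phase)  # Enum iteration preserves definition order
    index = ordered.index(phase)
    return ordered[index + 1] if index + 1 < len(ordered) else None


# Walk the happy path from construction to completion.
path: List[str] = []
current: Optional[Phase] = Phase.CONSTRUCTION
while current is not None:
    path.append(current.value)
    current = next_phase(current)
```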
Run Construction#
The first step in launching a run is the construction phase.
In this phase, run parameters will be pulled from the YAML and command line arguments. These enter a validation phase to check for completeness and validity. For example, the GPU type and count, if provided, must be valid for the available clusters but may be omitted if only one value makes sense. Whenever possible, the platform will pick reasonable defaults for optional and unspecified fields.
If the Run Construction phase fails, MCLI reports an error describing the misconfiguration.
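The validation and defaulting rules described above might look roughly like the sketch below. Everything here is an assumption made for illustration: the `CLUSTERS` catalog, the GPU type names `gpu-a` and `gpu-b`, the second cluster name, and the `construct_run` function are hypothetical and do not reflect MCLI's actual implementation or error messages.

```python
from typing import Dict, Optional, Set

# Hypothetical catalog: cluster name -> GPU type -> allowed GPU counts.
CLUSTERS: Dict[str, Dict[str, Set[int]]] = {
    "r1z1": {"gpu-a": {1, 2, 4, 8}},
    "r2z2": {"gpu-a": {8}, "gpu-b": {8, 16}},
}


def construct_run(cluster: str,
                  gpu_type: Optional[str] = None,
                  gpus: Optional[int] = None) -> Dict[str, object]:
    """Validate run parameters, filling a default only when it is unambiguous."""
    if cluster not in CLUSTERS:
        raise ValueError(f"unknown cluster {cluster!r}")
    available = CLUSTERS[cluster]

    if gpu_type is None:
        # The GPU type may be omitted only if exactly one value makes sense.
        if len(available) != 1:
            raise ValueError(f"gpu_type required; choose one of {sorted(available)}")
        gpu_type = next(iter(available))
    elif gpu_type not in available:
        raise ValueError(f"gpu_type {gpu_type!r} is not available on {cluster!r}")

    counts = available[gpu_type]
    if gpus is None:
        # The same rule applies to the GPU count.
        if len(counts) != 1:
            raise ValueError(f"gpus required; choose one of {sorted(counts)}")
        gpus = next(iter(counts))
    elif gpus not in counts:
        raise ValueError(f"{gpus} GPUs of type {gpu_type!r} is not a valid request")

    return {"cluster": cluster, "gpu_type": gpu_type, "gpus": gpus}


run = construct_run("r1z1", gpus=8)  # gpu_type is inferred: only one option
```

An ambiguous request, such as `construct_run("r2z2")` with no GPU type, raises a `ValueError` in this sketch rather than guessing.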
Queued#
After the run is fully constructed, it is placed in the run queue to be picked up by the specified cluster. While in the queue, runs are sorted according to their priority. A “high” priority run requesting the same resources as a “low” priority run will always be picked up by the cluster first. That said, a “low” priority run that requests fewer resources may be picked up before a larger “high” priority run.
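One policy consistent with the two rules above is "take the highest-priority run that fits in the currently free resources". The sketch below implements that policy as an assumption; the real scheduler is not described in this section and may behave differently. `QueuedRun` and `pick_next` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Optional

PRIORITY_RANK = {"high": 0, "low": 1}  # lower rank is considered first


@dataclass
class QueuedRun:
    name: str
    priority: str  # "high" or "low"
    gpus: int      # number of GPUs requested


def pick_next(queue: List[QueuedRun], free_gpus: int) -> Optional[QueuedRun]:
    """Pick the highest-priority queued run that fits in the free GPUs."""
    # sorted() is stable, so runs of equal priority keep submission order.
    for run in sorted(queue, key=lambda r: PRIORITY_RANK[r.priority]):
        if run.gpus <= free_gpus:
            return run
    return None  # nothing fits; every run stays queued


queue = [QueuedRun("small-low", "low", 2), QueuedRun("big-high", "high", 8)]
chosen_when_full_node_free = pick_next(queue, free_gpus=8)
chosen_when_partly_free = pick_next(queue, free_gpus=4)
```

With 8 free GPUs the “high” priority run is chosen; with only 4 free, the smaller “low” priority run is chosen because the larger run cannot be placed yet.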
Note
A run can be queued for many reasons, but the most likely is that all of the cluster's resources are fully utilized.
After the cluster has received the run, it will be executed until it completes, fails, or is stopped.
To check the current status of your runs, use:
mcli get run
NAME              GPU_NUM  CREATED_TIME         START_TIME           END_TIME             STATUS
hello-world-1337  0        2022-06-12 06:53 PM  -                    -                    Queued
Cancelling a Run
At this point, the run can be stopped with:
mcli stop run <run_name>
Starting#
> mcli get run
NAME              GPU_NUM  CREATED_TIME         START_TIME           END_TIME             STATUS
hello-world-1337  0        2022-06-12 06:53 PM  -                    -                    Starting
After a run has been scheduled, it goes into the starting phase. In this phase, the scheduler has assigned the run a corresponding node or nodes and has started to execute the run on them.
The first thing that happens when a run starts is that the specified Docker image is pulled to the node.
If the Docker image cannot be pulled, mcli will generate a FailedPull error code.
The most likely cause is that the run did not have access to your Docker registry.
To fix this, create a Docker registry secret with mcli create secret docker and ensure that it has the correct permissions.
Running#
> mcli get run
NAME              GPU_NUM  CREATED_TIME         START_TIME           END_TIME             STATUS
hello-world-1337  0        2022-06-12 06:53 PM  2022-06-12 06:54 PM  -                    Running
Soon after the start sequence is initiated, the run goes into the running phase.
At this point, logs can be streamed from the run in real time.
By default, mcli run will automatically tail the generated logs.
When the run starts, the Integrations are executed first to set up the required run environment.
After the Integrations have run, the run command is executed and the training run should start.
If the run crashes at any point, the logs can be obtained with mcli logs.
Note
At this point, the mcli logs command should be able to follow the generated run logs:
mcli logs <run_name>
Completed#
> mcli get run
NAME              GPU_NUM  CREATED_TIME         START_TIME           END_TIME             STATUS
hello-world-1337  0        2022-06-12 06:53 PM  2022-06-12 06:54 PM  2022-06-12 06:59 PM  Completed
After the run finishes, it enters the completed state and frees up the node that it was executing on. At this point, the run's metrics should have been persisted in a tracker so that its results can be reviewed later. For persisted artifacts, the run itself is responsible for saving model outputs, weights, or anything else to an independent data store.
For more detailed information about a run, you can use mcli describe run <run_name> to see run metadata, run lifecycle states and durations, and the submitted run YAML.
> mcli describe run <run_name>
Run Metadata
Run Name hello-world-s1ZGyv
Cluster r1z1
Image bash
GPU Type None
GPU Num 0
Run Lifecycle
STATUS START_TIME END_TIME DURATION
PENDING 2023-03-27 11:42 AM 2023-03-27 11:42 AM 4s
RUNNING 2023-03-27 11:42 AM 2023-03-27 11:42 AM 3s
COMPLETED 2023-03-27 11:42 AM --
Submitted YAML
cluster: r1z1
command: |
  echo Start
  sleep 2
  echo 'Hello World!!'
gpu_num: 0
image: bash
name: hello-world
You can also launch a run using a config from a previous or existing run:
mcli run --clone <existing-run-name> --gpus 8
Run Temporary History
After a run is completed, it will only show up in mcli get run for the next 14 days.
Be sure to persist run history in a metrics datastore such as Weights and Biases or a cloud bucket.
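To make the 14-day window concrete, the sketch below checks whether a completed run would still be listed at a given moment. It assumes the window is measured from the run's end time and treats the boundary as inclusive; both are assumptions, and `visible_in_get_run` is a hypothetical helper rather than an MCLI function.

```python
from datetime import datetime, timedelta

RETENTION = timedelta(days=14)  # window stated in the text above


def visible_in_get_run(end_time: datetime, now: datetime) -> bool:
    """Return True while a completed run is still within the retention window.

    Assumes the window starts at the run's end time (not confirmed by the docs).
    """
    return now - end_time <= RETENTION


# End time taken from the Completed example earlier on this page.
end_time = datetime(2022, 6, 12, 18, 59)
still_listed = visible_in_get_run(end_time, datetime(2022, 6, 20, 9, 0))
expired = not visible_in_get_run(end_time, datetime(2022, 6, 27, 9, 0))
```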