Run Lifecycle

Runs are launched by specifying a YAML file with the command:

mcli run -f <yaml>

Certain fields in the YAML (cluster, gpu-type, gpus, image) can be overridden on the command line. For example:

mcli run -f <yaml> --gpus 8

When a run is launched, it goes through multiple phases:

  • Run Construction

  • Queued

  • Starting

  • Running

  • Completed

Run Construction

The first step in launching a run is the construction phase.

In this phase, run parameters are pulled from the YAML and the command line arguments, then validated for completeness and validity. For example, the GPU type and count, if provided, must be valid for the available clusters, but may be omitted if only one value makes sense. Whenever possible, the platform picks reasonable defaults for optional and unspecified fields.
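The merge, default, and validate steps described above can be sketched roughly as follows. This is a minimal illustration, not the platform's actual implementation: the `CLUSTERS` catalog, the dictionary keys, and the `pick` and `construct_run` helpers are all hypothetical names chosen for this example.

```python
from typing import Any, Dict, Iterable, List

# Hypothetical catalog: cluster name -> GPU type -> allowed GPU counts.
# Cluster "r1z1" appears in this page; every other value here is made up.
CLUSTERS: Dict[str, Dict[str, List[int]]] = {
    "r1z1": {"a100_80gb": [1, 2, 4, 8]},
    "r7z2": {"a100_40gb": [8, 16], "v100_16gb": [1, 2, 4, 8]},
}

# Fields the text says can be overridden on the command line.
OVERRIDABLE = ("cluster", "gpu_type", "gpus", "image")


def pick(value: Any, allowed: Iterable[Any], label: str) -> Any:
    """Return value if it is allowed; if omitted, default only when one choice exists."""
    options = list(allowed)
    if value is None:
        if len(options) == 1:
            return options[0]
        raise ValueError(f"{label} is required; choose one of {options}")
    if value not in options:
        raise ValueError(f"invalid {label} {value!r}; choose one of {options}")
    return value


def construct_run(yaml_config: Dict[str, Any], cli_args: Dict[str, Any]) -> Dict[str, Any]:
    """Merge CLI overrides into the YAML values, fill defaults, then validate."""
    config = dict(yaml_config)
    for key in OVERRIDABLE:
        if cli_args.get(key) is not None:
            config[key] = cli_args[key]  # command line wins over the YAML

    config["cluster"] = pick(config.get("cluster"), CLUSTERS, "cluster")
    gpu_types = CLUSTERS[config["cluster"]]
    config["gpu_type"] = pick(config.get("gpu_type"), gpu_types, "gpu_type")
    config["gpus"] = pick(config.get("gpus"), gpu_types[config["gpu_type"]], "gpus")
    return config
```

With this sketch, `construct_run({"cluster": "r1z1", "image": "bash", "gpus": 1}, {"gpus": 8})` keeps the YAML's cluster and image, takes the GPU count from the command line, and fills in the only GPU type available on that hypothetical cluster. A request for 3 GPUs on the same cluster raises a `ValueError` that names the valid choices.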

If the Run Construction phase fails, MCLI reports an error describing the misconfiguration.

Queued

After the run is fully constructed, it is placed in the run queue to be picked up by the specified cluster. While in the queue, runs are sorted according to their priority. A “high” priority run requesting the same resources as a “low” priority run will always be picked up by the cluster first. However, a “low” priority run that requests fewer resources may be picked up before a larger “high” priority run.
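One way to picture this ordering is a queue sorted by priority in which the cluster takes the first run that fits its free resources. The sketch below is only a toy model under that assumption; `QueuedRun`, `PRIORITY_RANK`, and `next_run` are hypothetical names, and the platform's real scheduler may weigh other factors.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical ranking: a lower number is served first.
PRIORITY_RANK = {"high": 0, "low": 1}


@dataclass
class QueuedRun:
    name: str
    priority: str      # "high" or "low"
    gpus: int          # resources requested
    submitted_at: int  # submission order, used to break ties


def next_run(queue: List[QueuedRun], free_gpus: int) -> Optional[QueuedRun]:
    """Return the highest-priority queued run that fits in the free GPUs."""
    ordered = sorted(queue, key=lambda r: (PRIORITY_RANK[r.priority], r.submitted_at))
    for run in ordered:
        if run.gpus <= free_gpus:
            return run
    return None  # nothing fits yet, so every run stays queued
```

In this model, with 4 free GPUs, a 16-GPU “high” priority run and a 2-GPU “low” priority run in the queue, `next_run` returns the smaller “low” priority run, because the larger one cannot be placed yet. When two runs request the same resources, the “high” priority one is always returned first.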

Note

A run can be queued for many reasons, but the most likely is that all of the cluster's resources are fully utilized.

After the cluster has received the run, it will be executed until it completes, fails, or is stopped.

In order to check the current status of your runs, use:

> mcli get run
NAME                 GPU_NUM   CREATED_TIME          START_TIME  END_TIME   STATUS
hello-world-1337     0         2022-06-12 06:53 PM   -           -          Queued
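If you want to inspect this status table from a script, a small parser is one option. The sketch below assumes the layout shown above, where columns are separated by two or more spaces and values contain at most single spaces; `parse_run_table` is a hypothetical helper, not part of mcli, and the table layout may change between versions.

```python
import re
from typing import Dict, List

# Columns in the sample output are separated by runs of two or more spaces.
COLUMN_GAP = re.compile(r"\s{2,}")


def parse_run_table(text: str) -> List[Dict[str, str]]:
    """Turn the text of a status table into one dict per run, keyed by column name."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines:
        return []
    header = COLUMN_GAP.split(lines[0])
    return [dict(zip(header, COLUMN_GAP.split(line))) for line in lines[1:]]
```

Applied to the sample output above, this yields a single dictionary whose `STATUS` value is `Queued` and whose `START_TIME` and `END_TIME` values are `-`.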

Cancelling a Run

At this point, the run can be stopped with:

mcli stop run <run_name>

Starting

> mcli get run
NAME                 GPU_NUM   CREATED_TIME          START_TIME  END_TIME   STATUS
hello-world-1337     0         2022-06-12 06:53 PM   -           -          Starting

After a run has been scheduled, it goes into the starting phase. In this phase, the scheduler has assigned the run a corresponding node or nodes and has started to execute the run on them.

The first thing that happens when a run starts is that the specified Docker image is pulled to the node. If the image cannot be pulled, mcli reports a FailedPull error code. The most likely cause is that the run does not have access to your Docker registry. To fix this, create a Docker registry secret with mcli create secret docker and ensure that it has the correct permissions.

Running

> mcli get run
NAME                 GPU_NUM   CREATED_TIME          START_TIME            END_TIME   STATUS
hello-world-1337     0         2022-06-12 06:53 PM   2022-06-12 06:54 PM   -          Running

Soon after the start sequence is initiated, the run goes into the running phase. At this point, logs can be streamed from the run in real time. By default, mcli run will automatically tail the generated logs.

When the run starts, the Integrations are executed first to set up the required run environment.

After the Integrations have run, the run command is executed and the training run should start. If the run crashes at any point, the logs can be obtained with mcli logs.

Note

At this point, the mcli logs command should be able to follow the generated run logs:

mcli logs <run_name>

Completed

> mcli get run
NAME                 GPU_NUM   CREATED_TIME          START_TIME            END_TIME             STATUS
hello-world-1337     0         2022-06-12 06:53 PM   2022-06-12 06:54 PM   2022-06-12 06:59 PM  Completed

After the run finishes, it enters the completed state and frees up the node that it was executing on. By this point, the run's metrics should have been persisted in a tracker so that you can review the run's results. The run itself is responsible for saving persistent artifacts, such as model outputs, weights, or anything else, to an independent data store.

For more detailed information about a run, you can use mcli describe run <run_name> to see the run metadata, the run lifecycle states and their durations, and the submitted run YAML.


> mcli describe run <run_name>
Run Metadata
Run Name     hello-world-s1ZGyv
Cluster      r1z1
Image        bash
GPU Type     None
GPU Num      0

Run Lifecycle
STATUS     START_TIME           END_TIME             DURATION
PENDING    2023-03-27 11:42 AM  2023-03-27 11:42 AM  4s
RUNNING    2023-03-27 11:42 AM  2023-03-27 11:42 AM  3s
COMPLETED  2023-03-27 11:42 AM  --

Submitted YAML
cluster: r1z1
command: |
  echo Start
  sleep 2
  echo 'Hello World!!'
gpu_num: 0
image: bash
name: hello-world

You can also launch a run using a config from a previous or existing run:

mcli run --clone <existing-run-name> --gpus 8

Run Temporary History

After a run is completed, it will only show up in mcli get run for the next 14 days. Be sure to persist run history in an external store, such as Weights and Biases or a cloud storage bucket.
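As a quick way to reason about the 14-day window, the sketch below computes when a completed run would drop out of the listing. It assumes the window is counted from the END_TIME value, in the timestamp format shown in the tables on this page, and that the boundary is inclusive; `visible_until` and `is_still_listed` are hypothetical helpers, not mcli functions.

```python
from datetime import datetime, timedelta

RETENTION = timedelta(days=14)     # retention window stated in the text
TIME_FORMAT = "%Y-%m-%d %I:%M %p"  # e.g. "2022-06-12 06:59 PM"


def visible_until(end_time: str) -> datetime:
    """Return the last moment a run that ended at end_time would still be listed."""
    return datetime.strptime(end_time, TIME_FORMAT) + RETENTION


def is_still_listed(end_time: str, now: datetime) -> bool:
    """Assumes the window is inclusive of its final moment."""
    return now <= visible_until(end_time)
```

For the completed run shown earlier, which ended at 2022-06-12 06:59 PM, this sketch gives a cutoff of 2022-06-26 at 18:59, so any metrics or artifacts you need should be exported before then.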