Run Schema#

Run submissions to the MosaicML platform can be configured through a YAML file or using our Python API’s RunConfig class.

The fields are identical across both methods:

Field          Required?  Type
name           required   str
cluster        optional   str
gpu_num        required   int
gpu_type       optional   str
scheduling     optional   SchedulingConfig
image          required   str
command        required   str
integrations   optional   List[Dict]
env_variables  optional   List[Dict]
parameters     optional   Dict[str, Any]

Here’s an example run configuration:

name: hello-composer
gpu_num: 0
scheduling:
  priority: low
image: mosaicml/pytorch:latest
command: 'echo $MESSAGE'
integrations:
- integration_type: git_repo
  git_repo: mosaicml/benchmarks
  git_branch: main
env_variables:
  - name: welcome_message
    key: MESSAGE
    value: hello composer!
The same configuration using the Python API:

from mcli.sdk import RunConfig

config = RunConfig(
    name='hello-composer',
    gpu_num=0,
    scheduling={'priority': 'low'},
    image='mosaicml/pytorch:latest',
    command='echo $MESSAGE',
    integrations=[
        {
            'integration_type': 'git_repo',
            'git_repo': 'mosaicml/benchmarks',
            'git_branch': 'main'
        }
    ],
    env_variables=[
        {'name': 'welcome_message', 'key': 'MESSAGE', 'value': 'hello composer!'}
    ],
)

Field Types#

Run Name#

Used to identify your run. For each run, a unique identifier is automatically appended to the provided run name. After submitting a run, the finalized unique name is displayed in the terminal, and can also be viewed with mcli get runs.

$ mcli run -f my_run.yaml --name run-test

✔  Run run-test-zwml submitted.

For the python API, the run name can be retrieved from the returned Run object from create_run.

Resource Fields#

The cluster, gpu_type, and gpu_num fields are used to request compute resources for your run.

When requesting compute resources, first specify a cluster, then a valid gpu_type within that cluster, and finally a valid gpu_num.

To see valid combinations of (cluster, gpu_type, gpu_num) available to you:

> mcli get clusters
    NAME           NAMESPACE  GPU_TYPES_AND_NUMS
    onprem-oregon  hanlin     a100_40gb: [1, 2, 4, 8, 16, 32, 64, 128]
                              none (CPU only): [0]
    aws-us-west-2  hanlin     a100_80gb: [1, 2, 4, 8, 16]
                              none (CPU only): [0]
    aws-us-east-1  hanlin     a100_40gb: [1, 2, 4, 8, 16]
                              none (CPU only): [0]
    oracle-sjc     hanlin     a100_40gb: [1, 2, 4, 8, 16, 32, 64, 128, 256]
                              none (CPU only): [0]

Optional Resource Requests

The cluster field is optional if you only have one available cluster. Similarly, the gpu_type is also optional if that is the only available choice.

Launching multi-node runs is simple. Just request a number of GPUs that span multiple machines (e.g. 16).
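As a sketch, a run configuration pinning all three resource fields might look like the following. The cluster and GPU type are taken from the `mcli get clusters` output above; the launch command is hypothetical:

```yaml
name: multi-node-example
cluster: oracle-sjc            # a cluster listed by `mcli get clusters`
gpu_type: a100_40gb            # a valid GPU type for that cluster
gpu_num: 16                    # spans two 8-GPU nodes
image: mosaicml/pytorch:latest
command: 'composer train.py'   # hypothetical training launch command
```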

Scheduling#

The scheduling field governs how the MosaicML platform’s scheduler will manage your run. It is a simple dictionary, currently containing one key: priority.

Field     Required?  Type
priority  optional   str

priority: Runs in the platform’s scheduling queue are first sorted by their priority, then by their creation time. The priority field can be one of three values: low, default, and high. When omitted, the default value is used. As a best practice, large batches of exploratory runs (think hyperparameter sweeps) should be submitted at low priority, whereas important “hero” runs should be submitted at high priority.
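As a sketch, an important run could be bumped ahead of the queue by adding the following to its configuration:

```yaml
scheduling:
  priority: high
```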

Image#

Runs are executed within Docker containers defined by a Docker image.

Images on Docker Hub can be referenced as <organization>/<image name>. For private Docker Hub repositories, add a docker secret with:

mcli add secrets docker

For more details, see the Docker Secret Page.

Using Alternative Docker Registries

While we default to Docker Hub, custom registries are supported; see the Docker documentation.
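For instance, an image hosted on an alternative registry can be referenced by prefixing the registry host. The registry, organization, and image name below are placeholders:

```yaml
image: ghcr.io/my-org/my-training-image:1.0   # hypothetical custom-registry image
```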

Command#

The command field specifies what to execute when the run starts, typically the launch command for your training job.

For example, the following command:

command: |
  echo Hello World!

will result in a run that prints “Hello World!” to the console.

If you are training models with Composer, then the command field is where you will write your Composer launch command.
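As a sketch, a Composer launch command might look like the following; the repository directory, script, and config file names are hypothetical:

```yaml
command: |
  cd my_repo
  composer train.py yamls/train_config.yaml
```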

Integrations#

We support many Integrations to customize aspects of both the run setup and environment.

Integrations are specified as a list in the YAML. Each item in the list must specify a valid integration_type along with the relevant fields for the requested integration.

Some examples, including automatically cloning a GitHub repository, installing Python packages, and setting up logging to a Weights & Biases project, are shown below:

integrations:
  - integration_type: git_repo
    git_repo: org/my_repo
    git_branch: my-work-branch
  - integration_type: pip_packages
    packages:
      - numpy>=1.22.1
      - requests
  - integration_type: wandb
    project: my_weight_and_biases_project
    entity: mosaicml

You can read more about integrations on the Integrations Page.

Some integrations may require adding secrets. For example, pulling from a private GitHub repository requires the git-ssh secret to be configured. See the Secrets Page.

Environment Variables#

Environment variables can also be injected into each run at runtime through the env_variables field.

Each environment variable in the list must have a key and value configured.

  • key: the name used to access the value of the environment variable.

  • value: the value of the environment variable.

For example, the YAML below will print “Hello MOSAICML my name is MOSAICML_TWO!”:

name: hello-world
gpu_type: none
gpu_num: 0
cluster: <YOUR CLUSTER>
image: python
env_variables:
  - key: NAME
    value: MOSAICML
  - key: SECOND_NAME
    value: MOSAICML_TWO
command: |
  sleep 2
  echo Hello $NAME my name is $SECOND_NAME!

The command accesses each environment variable’s value via its key (in this case, $NAME and $SECOND_NAME).

Parameters#

The provided parameters are mounted as a YAML file at /mnt/config/parameters.yaml for your code to access. Parameters are a popular way to easily configure your training run.
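For example, a hypothetical set of training parameters (the field names below are illustrative) could be declared in the run configuration and then read back from the mounted file:

```yaml
parameters:
  max_seq_len: 2048        # hypothetical model parameter
  learning_rate: 3.0e-4    # hypothetical optimizer parameter
command: |
  # the parameters above are available inside the run at this path
  cat /mnt/config/parameters.yaml
```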