Run Schema#
Run submissions to the MosaicML platform can be configured through a YAML file or using our Python API's RunConfig class. The fields are identical across both methods:
Field | Type | Required
---|---|---
name | str | required
gpu_type | str | optional
gpu_num | int | required
cluster | str | optional
scheduling | SchedulingConfig | optional
image | str | required
command | str | required
integrations | List[Dict] | optional
env_variables | List[Dict] | optional
parameters | Dict[str, Any] | optional
Here’s an example run configuration:
name: hello-composer
gpu_num: 0
scheduling:
  priority: low
image: mosaicml/pytorch:latest
command: 'echo $MESSAGE'
integrations:
- integration_type: git_repo
  git_repo: mosaicml/benchmarks
  git_branch: main
env_variables:
- name: welcome_message
  key: MESSAGE
  value: hello composer!
from mcli.sdk import RunConfig

config = RunConfig(
    name='hello-composer',
    gpu_num=0,
    scheduling={'priority': 'low'},
    image='mosaicml/pytorch:latest',
    command='echo $MESSAGE',
    integrations=[
        {
            'integration_type': 'git_repo',
            'git_repo': 'mosaicml/composer',
            'git_branch': 'main'
        }
    ],
    env_variables=[
        {
            'name': 'welcome_message',
            'key': 'MESSAGE',
            'value': 'hello composer!'
        }
    ],
)
Field Types#
Run Name#
Used to identify your run. For each run, a unique identifier is automatically appended to the provided run name. After submitting a run, the finalized unique name is displayed in the terminal, and can also be viewed with mcli get runs.
$ mcli run -f my_run.yaml --name run-test
✔ Run run-test-zwml submitted.
For the Python API, the run name can be retrieved from the Run object returned by create_run.
Resource Fields#
The cluster, gpu_type, and gpu_num fields are used to request compute resources for your run. When requesting compute resources, first specify a cluster, then a valid gpu_type within that cluster, and finally a valid gpu_num.
To see valid combinations of (cluster, gpu_type, gpu_num) available to you:
> mcli get clusters
NAME           NAMESPACE   GPU_TYPES_AND_NUMS
onprem-oregon  hanlin      a100_40gb: [1, 2, 4, 8, 16, 32, 64, 128]
                           none (CPU only): [0]
aws-us-west-2  hanlin      a100_80gb: [1, 2, 4, 8, 16]
                           none (CPU only): [0]
aws-us-east-1  hanlin      a100_40gb: [1, 2, 4, 8, 16]
                           none (CPU only): [0]
oracle-sjc     hanlin      a100_40gb: [1, 2, 4, 8, 16, 32, 64, 128, 256]
                           none (CPU only): [0]
Optional Resource Requests
The cluster field is optional if you only have one available cluster. Similarly, gpu_type is optional if it is the only available choice.
Launching multi-node runs is simple: just request a number of GPUs that spans multiple machines (e.g. 16).
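For example, assuming each node on the onprem-oregon cluster listed above provides 8 GPUs (an assumption for illustration), the following resource fields in a run config would request a two-node run:

```yaml
cluster: onprem-oregon
gpu_type: a100_40gb
gpu_num: 16
```

The scheduler handles placing the run across machines; no extra multi-node configuration is needed in these fields.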
Scheduling#
The scheduling field governs how the MosaicML platform's scheduler will manage your run. It is a simple dictionary, currently containing one key: priority.
Field | Type | Required
---|---|---
priority | str | optional
priority: Runs in the platform’s scheduling queue are first sorted by their priority, then by their creation time.
The priority field can be one of three values: low, default, or high. When omitted, the default value is used.
Best practice dictates that large numbers of experimental runs (think exploratory hyperparameter sweeps) should usually be run at low priority, whereas important "hero" runs should be run at high priority.
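Following that guidance, a sweep run could set:

```yaml
scheduling:
  priority: low
```

while a hero run would instead set priority: high.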
Image#
Runs are executed within Docker containers defined by a Docker image.
Images on Docker Hub can be configured as <organization>/<image name>. For private Docker Hub repositories, add a docker secret with:
mcli add secrets docker
For more details, see the Docker Secret Page.
Using Alternative Docker Registries
While we default to Docker Hub, custom registries are supported; see the docker documentation.
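For example, an image hosted outside Docker Hub can typically be referenced by its full registry path (the registry host, organization, and image name below are hypothetical):

```yaml
image: ghcr.io/my-org/my-training-image:latest
```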
Command#
The command field specifies what to execute when the run starts, typically to launch your training jobs.
For example, the following command:
command: |
  echo Hello World!
will result in a run that prints "Hello World!" to the console.
If you are training models with Composer, then the command field is where you will write your Composer launch command.
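As a sketch, a Composer launch command might look like the following; the repository directory, training script, and config file names are hypothetical:

```yaml
command: |
  cd my_repo
  composer train.py -f yamls/train_config.yaml
```

Here composer is the Composer launcher, and everything after the script name is passed through as arguments to train.py.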
Integrations#
We support many Integrations to customize aspects of both the run setup and environment.
Integrations are specified as a list in the YAML. Each item in the list must specify a valid integration_type along with the relevant fields for the requested integration. Some examples of integrations, including automatically cloning a GitHub repository, installing Python packages, and setting up logging to a Weights & Biases project, are shown below:
integrations:
- integration_type: git_repo
  git_repo: org/my_repo
  git_branch: my-work-branch
- integration_type: pip_packages
  packages:
  - numpy>=1.22.1
  - requests
- integration_type: wandb
  project: my_weight_and_biases_project
  entity: mosaicml
You can read more about integrations on the Integrations Page.
Some integrations may require adding secrets. For example, pulling from a private GitHub repository requires the git-ssh secret to be configured. See the Secrets Page.
Environment Variables#
Environment variables can also be injected into each run at runtime through the env_variables field. Each environment variable in the list must have a key and a value configured.

key: the name used to access the value of the environment variable
value: the value of the environment variable
For example, the below YAML will print “Hello MOSAICML my name is MOSAICML_TWO!”:
name: hello-world
gpu_type: none
gpu_num: 0
cluster: <YOUR CLUSTER>
image: python
env_variables:
- key: NAME
  value: MOSAICML
- key: SECOND_NAME
  value: MOSAICML_TWO
command: |
  sleep 2
  echo Hello $NAME my name is $SECOND_NAME!
The command accesses the value of each environment variable by its key field (in this case $NAME and $SECOND_NAME).
Parameters#
The provided parameters are mounted as a YAML file inside your run at /mnt/config/parameters.yaml for your code to access. Parameters are a popular way to easily configure your training run.
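For example, a run config might include a parameters block like the following (the keys and values here are hypothetical):

```yaml
parameters:
  learning_rate: 1.0e-4
  batch_size: 256
```

Inside the run, these values can then be read back from /mnt/config/parameters.yaml with any YAML parser (e.g. PyYAML's yaml.safe_load).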