To help users build a mental model of how the MosaicML platform operates, let's walk through what happens when the following YAML file is submitted to execute a training run:
    name: hello-world
    gpu_type: A100_80GB
    gpu_num: 1
    image: python
    env_variables:
      - key: NAME
        value: MOSAICML
    command: |
      sleep 2
      echo Hello $NAME!
The yaml specifies a few key components:
Resource requests (number of GPUs, GPU types, etc.)
The run environment (Docker image, environment variables, GitHub repos)
The command (e.g. the training run)
After the run is submitted, the MosaicML platform first attempts to request the resources from the cluster.
If the requested resources are not available, the run is queued (its status can be viewed with mcli get runs).
Run environment creation is highly configurable, with custom images, integrations that control repo cloning, secret and environment variable injection, and more. (See: Integrations, Secrets.)
Lastly, the command will be run.
Latest Docker Image
We always pull the most recently updated Docker image. However, users are encouraged not to use the :latest tag, since it makes it harder to determine which Docker image was deployed.
Instead, for transparent reproducibility, create and use meaningful tag names.
Importantly, the command runs from the working directory that was set in your Dockerfile. If the command refers to specific files in your Docker image, make sure its paths are correct relative to that directory.
For example, if your file structure is:
and your Docker image's working directory is /code/, your command will need to be:
python cool_model/my_train_script.py -f configs/train_config.yaml
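The effect of the working directory on relative paths can be tried locally without Docker. The sketch below is only an illustration: it builds a scratch directory that stands in for /code/, with a layout inferred from the command above (an assumption, not the original file listing), and uses a hypothetical helper named resolves_from to check whether a relative path exists when looked up from a given directory.

```shell
# Scratch directory standing in for the image's /code/ working directory.
# The layout is an assumption inferred from the example command.
workdir="$(mktemp -d)"
mkdir -p "$workdir/cool_model" "$workdir/configs"
touch "$workdir/cool_model/my_train_script.py" "$workdir/configs/train_config.yaml"

# Hypothetical helper: succeeds if <relative path> exists when resolved from <dir>.
# usage: resolves_from <dir> <relative path>
resolves_from() {
  (cd "$1" && [ -e "$2" ])
}

# From the working directory, the paths used in the command resolve.
if resolves_from "$workdir" cool_model/my_train_script.py; then
  echo "found from the working directory"
fi

# From any other directory, the same relative path does not resolve.
if ! resolves_from "$workdir/configs" cool_model/my_train_script.py; then
  echo "not found from configs/"
fi
```

If the second case matches what you see in a run, the command's paths are probably written relative to a directory other than the image's working directory.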
Figuring out your docker image’s working directory
The working directory for your Docker image can be retrieved with:
docker image inspect <image name> | grep 'WorkingDir'
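As an alternative to grepping the full JSON, docker image inspect accepts a --format Go template that prints a single field. The sketch below wraps this in a helper; the name image_workdir is hypothetical, and the fallback to "/" assumes that an empty value means no WORKDIR was set in the image.

```shell
# Hypothetical helper: print the working directory recorded in an image's metadata.
# --format takes a Go template; .Config.WorkingDir holds the image's WORKDIR.
image_workdir() {
  wd="$(docker image inspect --format '{{.Config.WorkingDir}}' "$1")" || return 1
  # Assumption: an empty value means no WORKDIR was set, so containers start in "/".
  printf '%s\n' "${wd:-/}"
}

# Usage (requires Docker and an image that is available locally):
#   image_workdir <image name>
```

This prints only the path, which is easier to compare against the relative paths in your command.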
Runs can encounter errors at any stage of the deployment process. If a run reaches the run environment creation stage, inspecting the console logs via mcli logs <run_name> can usually help diagnose the issue.
Common errors we’ve seen are:
fatal: repository git@github.com/.../... not found.
Either the GitHub repository doesn't exist (typo), you don't have access, or the GitHub SSH secret is missing. Use mcli create secret git-ssh to add your GitHub SSH key.
For more details on registering your SSH key with GitHub, see GitHub SSH keys and our docs on Secrets.
Run is taking awhile to start, returning you to the command line.
Not to worry, this just means that we stopped streaming the launch logs to you so you can get other work done! (It hit the 3-minute timer.)
Sometimes a run is queued waiting for resources to become available, or the Docker image is large enough that the download takes a while.
Monitoring a Queued Run
Use mcli get runs to see when a run has been taken off the queue and started.
Use mcli logs after a run has started to follow the logs.
If runs are still stalled after a while, consider reducing your resource requests or checking that the Docker setup is correct.
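Checking the queue by hand can be automated with a small loop. The sketch below simply re-runs mcli get runs a fixed number of times with a pause in between; the function name watch_runs, its arguments, and the default values are hypothetical, and it does not parse the command's output.

```shell
# Hypothetical helper: re-run `mcli get runs` several times, pausing between checks.
# usage: watch_runs [checks] [interval_seconds]
watch_runs() {
  checks="${1:-10}"     # how many times to check (assumed default)
  interval="${2:-30}"   # seconds to wait between checks (assumed default)
  i=0
  while [ "$i" -lt "$checks" ]; do
    mcli get runs
    i=$((i + 1))
    # Skip the pause after the final check.
    if [ "$i" -lt "$checks" ]; then
      sleep "$interval"
    fi
  done
  return 0
}

# Usage (requires mcli to be installed and configured):
#   watch_runs 20 60
```

Once the run shows as started, mcli logs <run_name> can be used to view its logs.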
File not Found
This is likely because your command does not use the correct paths.
The command is called from the working directory set in the Dockerfile. Launch the Docker image locally to see the working directory.