Cluster Introduction

To help you build a mental model of how the MosaicML platform operates, let’s walk through what happens when you submit the YAML file below, which executes a training run:

name: hello-world
gpu_type: A100_80GB
gpu_num: 1
image: python
env_variables:
  - key: NAME
    value: MOSAICML
command: |
  sleep 2
  echo Hello $NAME!

The YAML specifies a few key components:

  • Resource requests (number of GPUs, GPU types, etc.)

  • The run environment (Docker image, environment variables, GitHub repos)

  • The command (e.g. the training run)

After you submit the run, the MosaicML platform first attempts to request the resources from the cluster.

If the requested resources are not available, the run is queued (its status can be viewed with mcli get runs).

Next, the run environment is created. This step is highly configurable: custom images, integrations that control repo cloning, secrets and environment variable injection, and more. (See: Integrations, Secrets.)

Lastly, the command will be run.
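As a rough local preview of this last step, the sketch below exports the same environment variable and runs the two command lines from the YAML in a child shell. It only imitates the final stage on your own machine (no cluster, GPU, or image is involved), and the greeting variable is just a name chosen for this illustration.

```shell
# Mirror the env_variables entry from the YAML (key NAME, value MOSAICML).
export NAME=MOSAICML

# Run the same lines as the YAML's command block in a child shell
# and capture what they print.
greeting="$(sh -c 'sleep 2
echo Hello $NAME!')"

echo "$greeting"   # prints: Hello MOSAICML!
```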

Latest Docker Image

We always pull the most recently updated version of the Docker image you specify. However, we encourage you not to use the :latest tag, because it makes it harder to determine which Docker image was actually deployed.

Instead, for transparent reproducibility, create and use meaningful tag names (e.g. v1.7.0).
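One way to act on this advice is a small check before you submit a run. The sketch below is a hypothetical helper (image_is_pinned is not part of mcli or Docker) that treats an image reference as pinned only if it carries an explicit tag other than latest, or a digest. The image names are placeholders.

```shell
# Returns 0 if the image reference has an explicit, non-latest tag or a digest.
image_is_pinned() {
  # Look only at the last path component, so that a registry port
  # (e.g. localhost:5000/trainer) is not mistaken for a tag.
  case "${1##*/}" in
    *@*)      return 0 ;;  # pinned by digest (name@sha256:...)
    *:latest) return 1 ;;  # explicit :latest
    *:*)      return 0 ;;  # explicit tag such as :v1.7.0
    *)        return 1 ;;  # no tag: Docker falls back to :latest
  esac
}

for ref in python python:latest myorg/trainer:v1.7.0; do
  if image_is_pinned "$ref"; then
    echo "$ref: pinned"
  else
    echo "$ref: not pinned"
  fi
done
```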

Working directory

Importantly, the command runs from the working directory that was set in your Dockerfile. If the command refers to specific files in your Docker image, make sure its paths are correct relative to that directory.

For example, if your file structure is:

/code/cool_model/my_train_script.py
/code/configs/train_config.yaml

and your Docker image’s working directory is /code/, your command will need to be:

python cool_model/my_train_script.py -f configs/train_config.yaml
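To see why the relative paths matter, the sketch below recreates the example layout under a temporary directory (a stand-in for /code/) and checks whether the two relative paths resolve from different starting directories. The helper name paths_resolve_from is made up for this illustration.

```shell
# Recreate the example layout under a temporary directory (stand-in for /code/).
root="$(mktemp -d)"
mkdir -p "$root/cool_model" "$root/configs"
touch "$root/cool_model/my_train_script.py" "$root/configs/train_config.yaml"

# Returns 0 if both relative paths used by the command resolve from directory $1.
paths_resolve_from() {
  ( cd "$1" && [ -f cool_model/my_train_script.py ] && [ -f configs/train_config.yaml ] )
}

if paths_resolve_from "$root"; then from_workdir="ok"; else from_workdir="missing"; fi
if paths_resolve_from "$root/cool_model"; then from_subdir="ok"; else from_subdir="missing"; fi

echo "from the working directory: $from_workdir"   # ok
echo "from cool_model/:           $from_subdir"    # missing
```

The same command that works from the working directory fails from any other directory, which is the usual cause of the "File not Found" error described under Debugging.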

Figuring out your docker image’s working directory

The working directory of your Docker image can be retrieved with:

docker image inspect <image name> | grep 'WorkingDir'
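As an alternative to grep, docker image inspect accepts a --format template that prints only the field of interest. The sketch below wraps it in a hypothetical helper, image_workdir. It assumes Docker is installed and the image is available locally. An empty value means the Dockerfile set no WORKDIR, in which case Docker uses / as the working directory.

```shell
# Print the working directory recorded in an image's config.
image_workdir() {
  wd="$(docker image inspect --format '{{.Config.WorkingDir}}' "$1")" || return 1
  # An empty WorkingDir means no WORKDIR was set; Docker then defaults to "/".
  echo "${wd:-/}"
}

# Usage (the image name is a placeholder):
#   image_workdir myorg/trainer:v1.7.0
```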

Debugging

Runs can encounter errors at any stage of the deployment process. If a run reaches the run environment creation stage, inspecting the console logs with mcli logs <run_name> usually helps diagnose the issue.

Common errors we’ve seen are:


fatal: repository git@github.com/.../... not found.

Either the GitHub repository doesn’t exist (check for a typo), you don’t have access to it, or the GitHub SSH secret is missing in mcli. Use mcli create secret git-ssh to add your GitHub SSH key.

For more details on registering your SSH key with GitHub, see GitHub SSH keys and our docs on Secrets.
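A quick local check can help rule out the first two causes (a typo or missing access) before you look at the secret. The sketch below uses git ls-remote, which succeeds only if the repository exists and your local credentials can read it. It says nothing about whether the secret is registered in mcli. The helper name can_read_repo and the repository URL are placeholders.

```shell
# Returns 0 if the repository can be read with your local git credentials.
can_read_repo() {
  git ls-remote "$1" HEAD >/dev/null 2>&1
}

# Usage (substitute your own repository URL):
#   if can_read_repo git@github.com:my-org/my-repo.git; then
#     echo "repository is reachable from this machine"
#   else
#     echo "repository not found or no access"
#   fi
```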


Run is taking awhile to start, returning you to the command line.

Not to worry: this just means that the 3-minute timer was hit and we stopped streaming the launch logs so you can get other work done.

A run may be queued while it waits for resources to become available, or the Docker image may be large enough that downloading it takes a while.

Monitoring a Queued Run

Use mcli get runs to see when a run has been taken off the queue and started.

Use mcli logs after a run has started to follow the logs.

If a run is still stalled after a while, consider reducing your resource requests or checking that the Docker setup is correct.
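If you would rather not re-run the status command by hand, a simple loop can repeat it at a fixed interval. The sketch below uses only mcli get runs as shown above and assumes no extra flags. The helper name poll_runs, the interval, and the attempt count are choices made for this illustration.

```shell
# Print the run list every $1 seconds, $2 times in total.
poll_runs() {
  interval="$1"
  attempts="$2"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    mcli get runs
    i=$((i + 1))
    # Sleep between checks, but not after the final one.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$interval"
    fi
  done
}

# Usage: check once a minute for ten minutes.
#   poll_runs 60 10
```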


File not Found

This is likely due to your command not having the correct paths.

The command is called from the working directory set in the Dockerfile. Launch the Docker image locally to see what that working directory is.