MosaicML Cloud Platform

MosaicML Cloud Platform (MCP) is a machine learning platform for accelerating deep learning training. With our tools, you can quickly scale up training and easily invoke our optimizations to build the best model for your time and budget.

Featuring:

  • 🚀 Easily scale training across multiple nodes:

mcli run -f gpt_70b.yaml --gpus 256
  • 🏎️ Engage efficiency modes ``-o1`` and ``-o2`` (beta), which apply our optimizations to your ML training.

mcli run -f bert_train.yaml -o1
  • Direct jobs across multiple clouds with a single flag.

> mcli get clusters
NAME           NAMESPACE  GPU_TYPES_AND_NUMS
onprem-oregon  hanlin     a100_40gb: [1, 2, 4, 8, 16, 32, 64, 128]
                          none (CPU only): [0]
aws-us-west-2  hanlin     a100_80gb: [1, 2, 4, 8, 16]
                          none (CPU only): [0]
aws-us-east-1  hanlin     a100_40gb: [1, 2, 4, 8, 16]
                          none (CPU only): [0]
oracle-sjc     hanlin     a100_40gb: [1, 2, 4, 8, 16, 32, 64, 128, 256]
                          none (CPU only): [0]
> mcli run -f gpu_30b.yaml --gpus 64 --cluster oracle-sjc
  • 💻 Submit local code with just a few lines. We automatically build the Docker image and submit the job.

integrations:
- integration_type: "local"
  directories:
  - folder1
  - folder2
  push_image: myrepo/example
  • 🐍 Fully featured Python API. Build advanced workflows for your team.

from mcli.sdk import *
from time import sleep
import random

def monitor_run(run, config, max_retries: int = 3):
    """Monitor and resubmit failed runs to automatically resume."""
    num_retries = 0
    while get_status(run) != RunStatus.COMPLETED:
        sleep(1.0)  # wait between status polls
        if get_status(run) in (RunStatus.TERMINATING, RunStatus.FAILED):
            num_retries += 1
            if num_retries > max_retries:
                raise RuntimeError('Exceeded maximum number of retries')
            # Resubmit with the same config and keep monitoring the new run.
            print('Failure detected, resubmitting run.')
            run = create_run(config)


config = RunConfig.from_file('resnet50.yaml')
config.parameters['run_name'] = config.run_name + f'-{random.randint(100,999)}'

run = create_run(config)

monitor_run(run, config, max_retries=3)
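
The retry pattern in `monitor_run` can be tried out without access to a cluster. The following is a minimal sketch, assuming a hypothetical `Status` enum and a hypothetical `submit` callable that stand in for the SDK's run states and `create_run`; none of these names come from `mcli`, and polling is simulated by an iterator instead of `sleep`.

```python
from enum import Enum
from typing import Callable, Iterator


class Status(Enum):
    """Hypothetical stand-in for the platform's run states."""
    RUNNING = "running"
    FAILED = "failed"
    COMPLETED = "completed"


def monitor_with_retries(
    submit: Callable[[], Iterator[Status]],
    max_retries: int = 3,
) -> int:
    """Resubmit a run each time it fails; return the number of resubmissions.

    `submit` is a hypothetical callable that starts a run and returns an
    iterator over its successive statuses (a stand-in for polling).
    """
    num_retries = 0
    statuses = submit()
    while True:
        status = next(statuses)
        if status is Status.COMPLETED:
            return num_retries
        if status is Status.FAILED:
            num_retries += 1
            if num_retries > max_retries:
                raise RuntimeError("Exceeded maximum number of retries")
            statuses = submit()  # resubmit and start polling the new run


def make_flaky_submit(failures_before_success: int) -> Callable[[], Iterator[Status]]:
    """Build a fake submit function whose first N runs end in failure."""
    attempts = {"count": 0}

    def submit() -> Iterator[Status]:
        attempts["count"] += 1
        failed = attempts["count"] <= failures_before_success
        final = Status.FAILED if failed else Status.COMPLETED
        return iter([Status.RUNNING, final])

    return submit


retries_used = monitor_with_retries(make_flaky_submit(2), max_retries=3)
print(f"Run completed after {retries_used} resubmission(s)")
```

Passing the submit function in as an argument keeps the retry logic separate from the platform calls, so the same loop could wrap a real submission function once the status checks are mapped to the SDK's own states.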

We support integrations with all your favorite tooling: Git, Weights & Biases, CometML, and more!
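
The cluster listing shown under the multi-cloud bullet above can also drive a simple pre-submission check. The sketch below is illustrative only: the `CLUSTERS` table is copied by hand from that listing, `clusters_supporting` and `build_run_command` are hypothetical helpers, not part of `mcli`, and the command uses only the `-f`, `--gpus`, and `--cluster` flags that appear in the examples above.

```python
from typing import Dict, List, Optional

# GPU counts per cluster, copied by hand from the `mcli get clusters` listing.
CLUSTERS: Dict[str, Dict[str, List[int]]] = {
    "onprem-oregon": {"a100_40gb": [1, 2, 4, 8, 16, 32, 64, 128]},
    "aws-us-west-2": {"a100_80gb": [1, 2, 4, 8, 16]},
    "aws-us-east-1": {"a100_40gb": [1, 2, 4, 8, 16]},
    "oracle-sjc": {"a100_40gb": [1, 2, 4, 8, 16, 32, 64, 128, 256]},
}


def clusters_supporting(gpus: int, gpu_type: Optional[str] = None) -> List[str]:
    """Return the clusters that offer `gpus` GPUs, optionally of one GPU type."""
    matches = []
    for name, gpu_types in CLUSTERS.items():
        for type_name, counts in gpu_types.items():
            if gpus in counts and (gpu_type is None or type_name == gpu_type):
                matches.append(name)
                break  # one matching GPU type is enough for this cluster
    return matches


def build_run_command(yaml_path: str, gpus: int, cluster: str) -> List[str]:
    """Assemble the argument list for a run on a specific cluster."""
    return ["mcli", "run", "-f", yaml_path, "--gpus", str(gpus), "--cluster", cluster]


candidates = clusters_supporting(256)
command = build_run_command("gpt_70b.yaml", 256, candidates[0])
print(" ".join(command))
```

Only `oracle-sjc` lists 256 GPUs in the table, so it is the single candidate here; a request for 64 GPUs would match both `onprem-oregon` and `oracle-sjc`.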

About Us

MosaicML’s mission is to make efficient training of ML models accessible. We continually productionize state-of-the-art research on efficient model training and study how these methods combine, so that model training is ✨ as efficient as possible ✨. These findings are baked into our highly efficient model training stack, the MosaicML Cloud Platform.

If you have questions, please feel free to reach out to us on Twitter or by email, or join our Slack channel!