MosaicML Platform Documentation
The MosaicML platform is an AI training platform designed to tackle the challenges of training large models. With our tools, you can get started quickly and scale up training to build the best model for your data, time, and budget.
Featuring:
🚀 Easily scale training across multiple nodes:
mcli run -f gpt_70b.yaml --gpus 256
🏎️ Engage the efficiency modes ``-o1`` and ``-o2`` (beta), which optimize your ML training.
mcli run -f bert_train.yaml -o1
☁ Direct jobs across multiple clouds with a single flag.
> mcli get clusters
NAME            PROVIDER   GPU_TYPES_AND_NUMS
onprem-oregon   MosaicML   a100_40gb: [1, 2, 4, 8, 16, 32, 64, 128]
                           none (CPU only): [0]
aws-us-west-2   AWS        a100_80gb: [1, 2, 4, 8, 16]
                           none (CPU only): [0]
aws-us-east-1   AWS        a100_40gb: [1, 2, 4, 8, 16]
                           none (CPU only): [0]
oracle-sjc      OCI        a100_40gb: [1, 2, 4, 8, 16, 32, 64, 128, 256]
                           none (CPU only): [0]
mcli run -f gpu_30b.yaml --gpus 64 --cluster oracle-sjc
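The cluster listing above shows which GPU counts each cluster accepts, and that is what determines where a job of a given size can be sent. As a minimal sketch of reading that listing, the snippet below copies the GPU rows into a plain dictionary and filters them by requested GPU count. The `CLUSTERS` dictionary and the `clusters_supporting` helper are hypothetical names made up for this illustration. They do not call `mcli` and are not part of its API, and the CPU-only rows are left out.

```python
# Hypothetical data copied by hand from the `mcli get clusters` listing above.
# Nothing here talks to the platform; it only illustrates how to read the table.
CLUSTERS = {
    "onprem-oregon": {"a100_40gb": [1, 2, 4, 8, 16, 32, 64, 128]},
    "aws-us-west-2": {"a100_80gb": [1, 2, 4, 8, 16]},
    "aws-us-east-1": {"a100_40gb": [1, 2, 4, 8, 16]},
    "oracle-sjc": {"a100_40gb": [1, 2, 4, 8, 16, 32, 64, 128, 256]},
}


def clusters_supporting(gpus, clusters=CLUSTERS):
    """Return the names of clusters that list `gpus` as an allowed GPU count."""
    return [
        name
        for name, gpu_types in clusters.items()
        if any(gpus in counts for counts in gpu_types.values())
    ]


# Which clusters in the listing could take a 64-GPU job?
print(clusters_supporting(64))  # ['onprem-oregon', 'oracle-sjc']
```

In this listing, a 64-GPU job fits on either `onprem-oregon` or `oracle-sjc`, and the `--cluster` flag picks between them.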
🐍 Fully featured Python API. Build advanced workflows for your team.
from mcli.sdk import *
from time import sleep
import random


def monitor_run(run, config, max_retries: int = 3):
    """Monitor and resubmit failed runs to automatically resume."""
    num_retries = 0
    while get_status(run) != RunStatus.COMPLETED:
        sleep(1.0)
        if get_status(run) in (RunStatus.TERMINATING, RunStatus.FAILED):
            num_retries += 1
            if num_retries > max_retries:
                raise RuntimeError('Exceeded maximum number of retries')
            run = create_run(config)
            print('Failure detected, resubmitting run.')


config = RunConfig.from_file('resnet50.yaml')
config.parameters['run_name'] = config.run_name + f'-{random.randint(100,999)}'
run = create_run(config)
monitor_run(run, config, max_retries=3)
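The retry pattern in the example above can also be tried without a cluster. The sketch below restates the same loop with the status lookup and the resubmission passed in as plain callables. The `Status` enum, the `monitor` function, and the scripted status sequence are stand-ins invented for this illustration, not part of `mcli.sdk`.

```python
from enum import Enum
from time import sleep


class Status(Enum):
    """Hypothetical stand-in for the platform's run states (not mcli's RunStatus)."""
    RUNNING = "running"
    FAILED = "failed"
    COMPLETED = "completed"


def monitor(run, get_status, resubmit, max_retries=3, poll_seconds=0.0):
    """Poll `run` until it completes, resubmitting after each failure.

    `get_status(run)` and `resubmit()` are supplied by the caller, so the loop
    can be exercised offline. Returns the number of resubmissions performed.
    """
    retries = 0
    while True:
        status = get_status(run)  # read the status once per iteration
        if status is Status.COMPLETED:
            return retries
        if status is Status.FAILED:
            retries += 1
            if retries > max_retries:
                raise RuntimeError("Exceeded maximum number of retries")
            run = resubmit()
        sleep(poll_seconds)


# Simulated backend: the job fails twice, runs for one poll, then completes.
scripted = iter([Status.FAILED, Status.FAILED, Status.RUNNING, Status.COMPLETED])
resubmissions = monitor(
    run="run-0",
    get_status=lambda run: next(scripted),
    resubmit=lambda: "run-resubmitted",
)
print(resubmissions)  # 2
```

Reading the status once per iteration keeps the completion check and the failure check consistent with each other, which is one reasonable design choice when the state can change between two calls.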
We support integrations with all your favorite tooling: Git, Weights & Biases, CometML, and more!
About Us
MosaicML’s mission is to make efficient training of ML models accessible. We continually productionize state-of-the-art research on efficient model training, and study the combinations of these methods in order to ensure that model training is ✨ as efficient as possible ✨. These findings are baked into our highly efficient model training stack, the MosaicML platform.
If you have questions, please feel free to reach out to us on Twitter or by email, or join our Slack channel!