Configure a deployment#

Creating a deployment is simple and only requires configuring a few fields. This page describes the easiest way to get a deployment running, but if you’d like additional customizability, take a look at the custom inference schema page.

Deployment fields#

| Field | Type | Default |
| --- | --- | --- |
| name (required) | str | N/A |
| compute (required) | ComputeConfig | N/A |
| replicas (optional) | int | 1 |
| image (optional) | str | mosaicml/inference |
| command (optional) | str | "" |
| default_model (optional) | DefaultModelConfig | None |
| model (optional) | ModelConfig | None |
| batching (optional) | BatchingConfig | {max_batch_size: 1, max_timeout_ms: 1000} |
| integrations (optional) | List[Dict] | [] |
| env_variables (optional) | Dict | {} |
| metadata (optional) | Dict[str, Any] | {} |

Examples#

MPT-7B-Instruct with a custom checkpoint from S3:

name: mpt-7b-instruct
compute:
  gpus: 1
  instance: oci.vm.gpu.a10.1
default_model:
  model_type: mpt-7b-instruct
  checkpoint_path:
    s3_path: s3://my-s3-path/checkpoint-dir

MPT-30B-Chat using the default checkpoint:

name: mpt-30b-chat
compute:
  gpus: 4
  instance: oci.vm.gpu.a10.4
default_model:
  model_type: mpt-30b-chat

Default model fields#

The easiest way to specify your model definition is through the default_model field. However, if you’d like more customizability, you can configure the deployment with your own custom code through the model field detailed below.

Supported model types#

The following are the valid model types that can be used for the model_type field:

| Model Type | Description |
| --- | --- |
| mpt-7b | A 7B parameter decoder-style transformer pretrained from scratch on 1T tokens of English text and code. |
| mpt-7b-instruct | A model for short-form instruction following, finetuned from MPT-7B. |
| mpt-7b-chat | A chatbot-like model for dialogue generation, finetuned from MPT-7B. |
| mpt-7b-storywriter | A model designed to read and write fictional stories with super long context lengths. |
| mpt-30b | A 30B parameter decoder-style transformer pretrained from scratch on 1T tokens of English text and code. |
| mpt-30b-instruct | A model for long-form instruction following (especially summarization and question answering), built by finetuning MPT-30B. |
| mpt-30b-chat | A chatbot-like model for dialogue generation, built by finetuning MPT-30B. |
| llama2-70b | A state-of-the-art 70B parameter language model with a context length of 4096 tokens, trained by Meta. |
| llama2-13b | 13B parameter variant of the Llama2 model. |
| llama2-7b | 7B parameter variant of the Llama2 model. |
| llama2-70b-chat | 70B parameter fine-tuned Llama2 model optimized for dialogue use cases. |
| llama2-13b-chat | 13B parameter variant of the Llama2 chat model. |
| llama2-7b-chat | 7B parameter variant of the Llama2 chat model. |

Llama2 License

Llama2 is licensed under the LLAMA 2 Community License, Copyright © Meta Platforms, Inc. All Rights Reserved. Customers are responsible for ensuring compliance with applicable model licenses.

Checkpoint fields#

You can only specify one of the following sources to download your checkpoint files from.

| Field | Type | Details |
| --- | --- | --- |
| hf_path | str | The name of the Hugging Face model repo |
| s3_path | str | The S3 path to a model checkpoint in the Hugging Face format. The s3:// prefix is required (e.g. s3://mosaic-checkpoint/path) |
| gcp_path | str | The GCP path to a model checkpoint in the Hugging Face format. The gs:// prefix is required (e.g. gs://mosaic-checkpoint/path) |
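For instance, to pull a checkpoint directly from the Hugging Face Hub, set hf_path under checkpoint_path. The repo below is the default source for llama2-7b-chat and is shown only for illustration:

name: llama2-7b-chat
compute:
  gpus: 1
  instance: oci.vm.gpu.a10.1
default_model:
  model_type: llama2-7b-chat
  checkpoint_path:
    hf_path: meta-llama/Llama-2-7b-chat-hf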

We support checkpoint files in two different formats:

  1. Hugging Face checkpoint format - the checkpoint path should be a directory, and the downloader will fetch all the files in that directory.

  2. Composer checkpoint format - the checkpoint path should point to a single file, and it must end in .pt (see the sketch below).
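As a minimal sketch of the Composer format, assuming the checkpoint is referenced through s3_path like any other checkpoint (the bucket and filename below are placeholders):

default_model:
  model_type: mpt-7b
  checkpoint_path:
    s3_path: s3://my-bucket/composer-checkpoints/my-checkpoint.pt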

If the checkpoint_path field is omitted entirely, we download the default Hugging Face-format checkpoint for the selected model type, as listed in the following table.

| Model Type | Default Checkpoint Source |
| --- | --- |
| mpt-7b | mosaicml/mpt-7b |
| mpt-7b-instruct | mosaicml/mpt-7b-instruct |
| mpt-7b-chat | mosaicml/mpt-7b-chat |
| mpt-7b-storywriter | mosaicml/mpt-7b-storywriter |
| mpt-30b | mosaicml/mpt-30b |
| mpt-30b-instruct | mosaicml/mpt-30b-instruct |
| mpt-30b-chat | mosaicml/mpt-30b-chat |
| llama2-70b | meta-llama/Llama-2-70b-hf |
| llama2-13b | meta-llama/Llama-2-13b-hf |
| llama2-7b | meta-llama/Llama-2-7b-hf |
| llama2-70b-chat | meta-llama/Llama-2-70b-chat-hf |
| llama2-13b-chat | meta-llama/Llama-2-13b-chat-hf |
| llama2-7b-chat | meta-llama/Llama-2-7b-chat-hf |

Custom Model#

For custom model implementations, set the values in the model field (this is mutually exclusive with the default_model field). Note that your code will need to be mounted through a git integration. The model parameters you provide are mounted as a YAML file at /mnt/model/model_config.yaml in your deployment for your code to access.

These parameters configure the out-of-the-box MosaicML inference server. If you choose not to provide a model config, we submit the deployment under the assumption that you are providing your own inference server code, listening on port 8080, via a custom image.

The model schema fields are as follows:

| Field | Type | Details |
| --- | --- | --- |
| downloader | str | The module path to the function that downloads any necessary model files (i.e. checkpoint files). If not provided, uses the default downloader described below. |
| download_parameters | Dict[str, Any] | Kwargs passed into the downloader function |
| model_handler | str | The module path to the model handler class. If not provided, defaults to the Hugging Face model handler that comes with the inference server. |
| model_parameters | Dict[str, Any] | Kwargs used to initialize your model handler |

See the docs on custom model handlers for details about our default handler and how to implement your own model handler class.
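As a rough sketch of how these fields fit together (the repository, module paths, and parameters below are hypothetical, not a prescribed layout), a deployment using the model field might look like:

name: my-custom-deployment
compute:
  gpus: 1
  instance: oci.vm.gpu.a10.1
integrations:
  - integration_type: git_repo
    git_repo: org/my_repo
model:
  downloader: my_repo.downloading.download_checkpoint
  download_parameters:
    remote_path: s3://my-bucket/my-checkpoint-dir
  model_handler: my_repo.handlers.MyModelHandler
  model_parameters:
    dtype: bf16

Here download_parameters and model_parameters are passed as kwargs to the downloader function and the model handler, respectively, and the whole model block is also written to /mnt/model/model_config.yaml inside the deployment.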

Deployment Name#

A deployment name is the primary identifier for working with deployments. For each deployment, a unique identifier is automatically appended to the provided deployment name. After submitting a deployment, the finalized unique deployment name is displayed in the terminal, and can also be viewed with mcli get deployments or through the InferenceDeployment object.

Compute Fields#

The compute field specifies which compute resources to request for a single replica of your inference deployment. See the replicas section for details on how replicas interact with compute.

In cases where you underspecify compute, the MosaicML platform will try to infer which compute resources to use automatically. Which fields are required depends on which clusters and instance types are available to your organization. If the requested resources are not valid, or if multiple options are still available, an error will be raised on submission and the deployment will not be created.

| Field | Type | Details |
| --- | --- | --- |
| cluster | str | Required |
| gpus | int | Typically required, unless you specify instance or a CPU-only deployment |
| gpu_type | str | Optional. Not needed if you specify instance. |
| instance | str | Optional. Use if the cluster has multiple instances with the same GPU type (e.g. 1-wide and 2-wide A10 instances) |
| cpus | int | Optional. Typically not used other than for debugging small deployments. |

You can see clusters, instances, and compute resources available to you using:

mcli get clusters

For example, you can launch a multi-node deployment on cluster my-cluster with 16 A100 GPUs:

compute:
  cluster: my-cluster
  gpus: 16
  gpu_type: a100_80gb

You can also specify a cluster and instance name within that cluster as follows:

compute:
  cluster: my-cluster
  instance: oci.vm.gpu.a10.2

In the above case, the deployment will use all the GPUs on the instance by default. If you want to use fewer GPUs, you can also specify the gpus field using a value up to the total number of GPUs available on the instance.

Replicas#

If the value of replicas is n > 1 in your deployment YAML, then the deployment will spawn n copies of whatever you request in the compute field.

For example, if your YAML looks like this:

compute:
  cluster: my-cluster
  gpus: 1
  gpu_type: a100_40gb
replicas: 2

then your deployment will spawn 2 replicas each using 1 GPU. Since you did not specify an instance in the compute field, each replica will run on any instance that has a matching GPU type and 1 free GPU.

As another example, if your deployment YAML looks like this:

compute:
  cluster: my-cluster
  instance: oci.vm.gpu.a10.2
replicas: 2

then your deployment will spawn 2 replicas, each on an oci.vm.gpu.a10.2 instance. Since that particular instance has 2 GPUs, your deployment will use 4 GPUs in total (2 replicas × 2 GPUs per replica).

Batching#

The configuration for dynamic batching in the web server.

| Field | Type | Details |
| --- | --- | --- |
| max_batch_size | int | The maximum batch size to create before sending requests to the model. |
| max_timeout_ms | int | The maximum time in milliseconds to wait from the first request before sending requests to the model. |

Setting max_batch_size to 1 is equivalent to turning dynamic batching off, which is the default behavior if batching is not specified.
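For example, to batch up to 8 concurrent requests while waiting at most 100 ms from the first request (the values here are illustrative, not recommendations):

batching:
  max_batch_size: 8
  max_timeout_ms: 100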

Image#

Deployments are executed within Docker containers defined by a Docker image. Images on Docker Hub can be configured as <organization>/<image name>. For private Docker Hub repositories, add a docker secret with:

mcli create secret docker

For more details, see the Docker Secret Page.
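For example, to run the deployment in a custom image (the organization, repository, and tag below are hypothetical):

image: my-org/my-inference-image:latest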

Using Alternative Docker Registries

While we default to Docker Hub, custom registries are supported; see Docker’s documentation and the Docker Secret Page for more details.

Command#

The command is what’s executed when the deployment starts, typically to start the inference server. For example, the following command:

command: |
  echo Hello World!

will result in a deployment that prints “Hello World!” to the console.

If you are using a supported model format (Hugging Face, Custom Model), then the command field is optional and will be populated by default with the launch command for starting the MosaicML inference server.

Integrations#

We support many Integrations to customize aspects of both the deployment setup and environment.

Integrations are specified as a list in the YAML. Each item in the list must specify a valid integration_type along with the relevant fields for the requested integration.

Some examples of integrations include automatically cloning a GitHub repository or installing Python packages, as shown below:

integrations:
  - integration_type: git_repo
    git_repo: org/my_repo
    git_branch: my-work-branch

You can read more about integrations on the Integrations Page.

Some integrations may require adding secrets. For example, pulling from a private GitHub repository requires the git-ssh secret to be configured. See the Secrets Page.

Environment Variables#

Environment variables can also be injected into each deployment at runtime through the env_variables field.

For example, the below YAML will print “Hello MOSAICML my name is MOSAICML_TWO!”:

name: hello-world
image: python
command: |
  sleep 2
  echo Hello $NAME my name is $SECOND_NAME!
env_variables:
  NAME: MOSAICML
  SECOND_NAME: MOSAICML_TWO

The command accesses the value of each environment variable by its key (in this case, $NAME and $SECOND_NAME).

Metadata#

Metadata is meant to be a multi-purpose, unstructured place to put information about a deployment. It can be set when the deployment is created, for example to add custom version tags:

name: hello-world
image: bash
command: echo 'hello world'
metadata:
  model_version: 2

Metadata on your deployment is readable through the CLI or SDK:

> mcli describe deployment hello-world-VC5nFs
Inference Deployment Details
Inference Deployment Name      hello-world-VC5nFs
Image                          bash
...
Metadata                       {model_version: 2}
from mcli import get_deployment

deployment = get_deployment('hello-world-VC5nFs')
print(deployment.metadata)
# {"model_version": 2}