Inference Quickstart#

You can deploy your model on the MosaicML platform in just a few simple steps. Before starting, make sure you’ve configured MosaicML access.
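
If you haven’t configured access yet, one common way to do it is to initialize MCLI and register your API key (this assumes you’ve already created an API key in the MosaicML console):

mcli init
mcli set api-key <your-api-key>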

Creating Your First Deployment#

For this tutorial, we’re going to deploy the MPT-7B Instruct model. To submit your first deployment, copy the YAML below into a file called mpt_instruct_deploy.yaml:

name: mpt-7b-instruct          # deployment name; a unique hash is appended on submit
compute:
  gpus: 1                      # number of GPUs to request
  instance: oci.vm.gpu.a10.1   # instance type with a single A10 GPU
default_model:
  model_type: mpt-7b-instruct  # model to download from the HuggingFace Hub

This YAML tells the MosaicML platform that you are requesting a single A10 GPU and that you would like to download the mpt-7b-instruct model from the HuggingFace Hub.

Then, run:

mcli deploy -f mpt_instruct_deploy.yaml

Specifying a cluster

If you have access to more than one cluster, you’ll need to specify which cluster to deploy to using --cluster <name>. You can check which clusters you have access to using mcli get clusters.
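
For example:

mcli deploy -f mpt_instruct_deploy.yaml --cluster <name>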

After you’ve run the deploy command, you’ll see the following output in your terminal (note that the hash after mpt-7b-instruct- is a unique identifier appended to the deployment name you provided in your YAML):

✔  Deployment mpt-7b-instruct-0t30xo submitted.

To see the deployment's status, use:

mcli get deployments

If you run mcli get deployments, you’ll see the following output:

NAME                    USER            CLUSTER  GPU_TYPE  GPU_NUM  CREATED_TIME         STATUS
mpt-7b-instruct-0t30xo  user@email.com  r7z13    a10       1        2023-05-17 07:24 PM  PENDING

The mcli get deployments command shows you all the deployments in your organization, so you may see deployments that were not created by you.

You can also get more details about a specific deployment by running mcli describe deployment mpt-7b-instruct-0t30xo.

Interacting With Your Deployment#

You’ve created your first deployment, congrats! From here, MCLI has a few convenience commands that make it easier for you to interact with your deployment.

First, you may want to check your deployment’s status to see if it’s ready to start serving traffic. You can do that by running the following command:

mcli ping mpt-7b-instruct-0t30xo

If your deployment is ready, you should see the output:

mpt-7b-instruct-0t30xo's status:
{'status': 200}

where the status is an HTTP status code. If your status code is 200, your deployment is ready to serve traffic!
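
If you’re scripting against the platform, you can poll for readiness from Python as well. Below is a minimal sketch; it assumes the Python SDK exposes a ping helper that mirrors the CLI command and returns the same {'status': ...} dictionary shown above (if your mcli version doesn’t, shell out to mcli ping instead):

import time

from mcli import ping  # assumed SDK counterpart of `mcli ping`

def wait_until_ready(deployment_name, timeout=600, interval=15):
    """Poll the deployment until it reports HTTP 200 or the timeout elapses."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        # Assumes ping returns a dict like {'status': 200}, matching the CLI output.
        if ping(deployment_name).get("status") == 200:
            return True
        time.sleep(interval)
    return False

wait_until_ready("mpt-7b-instruct-0t30xo")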

Let’s try sending a request to your deployment using the Python SDK:

from mcli import predict, get_inference_deployment

# Look up the deployment by name, then send it a prediction request
deployment = get_inference_deployment("mpt-7b-instruct-0t30xo")
response = predict(deployment, {"inputs": ["hello world!"]})
print(response)

You can also make the same request via the command line:

mcli predict mpt-7b-instruct-0t30xo --input '{"inputs": ["hello world!"]}'

The same request also works with a basic curl command:

curl https://mpt-7b-instruct-0t30xo.inf.hosted-on.mosaicml.hosting/predict \
-H "Authorization: <your_api_key>" \
-d '{"inputs": ["hello world!"]}'

The address above is just an example; you can look up the address for your own deployment using mcli describe deployment <name>.

Once you’re done with your deployment, you can delete it with the following command:

mcli delete deployments --name mpt-7b-instruct-0t30xo

Next Steps#

There are many more ways to customize your deployments. We support downloading checkpoint files from any remote storage, such as S3, and you can customize your model’s forward logic by implementing a custom model handler. You can even write your own webserver and replace the mosaicml/inference image with your own. Take a look at the Inference Schema page for more information.
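
To give a flavor of what that looks like, here is a rough, hypothetical sketch of a custom model handler; the class name, method names, and loading helper are illustrative only, and the actual contract is defined on the Inference Schema page:

# Hypothetical sketch only; see the Inference Schema page for the real interface.
class MyModelHandler:
    def __init__(self, checkpoint_path):
        # Load model weights here, e.g. from a checkpoint downloaded from S3.
        # `load_checkpoint` is a hypothetical stand-in for your own loading code.
        self.model = load_checkpoint(checkpoint_path)

    def predict(self, inputs):
        # Implement your own forward logic and return the model's outputs.
        return self.model.generate(inputs)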