API Reference#

Runs#

  • create_run – Launch a run

  • delete_runs – Delete a list of runs

  • follow_run_logs – Follow the logs for an active or completed run in the MosaicML platform

  • get_run_logs – Get the current logs for an active or completed run

  • get_runs – Get a filtered list of runs

  • initialize – Initialize the MosaicML platform

  • set_api_key – Set the API key for the MosaicML platform

  • stop_runs – Stop a list of runs

  • wait_for_run_status – Wait for a launched run to reach a specific status

  • Run – A run that has been launched on the MosaicML platform

  • RunConfig – A run configuration for the MosaicML platform

  • RunStatus – Possible statuses of a run

mcli.sdk.create_run(run, timeout=10, future=False, _priority=None, _job_type=MCLIJobType.RUN)[source]#

Launch a run

Launch a run in the MosaicML platform. The provided run must contain enough details to fully configure the run. If it does not, an error is raised.

Parameters
  • run (RunConfig) – A run configuration with enough details to launch. The run will be queued and persisted in the run database.

  • timeout (Optional[float]) – Time, in seconds, in which the call should complete. If the run creation takes too long, a TimeoutError will be raised. If future is True, this value will be ignored.

  • future (bool) – Return the output as a Future. If True, the call to create_run() will return immediately and the request will be processed in the background. This takes precedence over the timeout argument. To get the Run output, use return_value.result() with an optional timeout argument.

  • _priority (Optional[PriorityLabel | str]) – DEPRECATED. An optional priority level at which the run should be created. Only effective for certain clusters.

  • _job_type (MCLIJobType) – DEPRECATED. An optional “job type” descriptor for the run.

Raises

InstanceTypeUnavailable – Raised if an invalid compute instance is requested

Returns
  • If future is False – The created Run object

  • Otherwise – A Future for the object
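The future option described above can be sketched as follows. This is a minimal sketch rather than the SDK's own code: launch_in_background is a hypothetical helper, and create_run is passed in as an argument so the snippet runs without a platform connection. It assumes the returned object behaves like a concurrent.futures.Future, as the return_value.result() wording suggests.

```python
from typing import Any, Callable


def launch_in_background(create_run: Callable[..., Any], config: Any,
                         wait_seconds: float = 30.0) -> Any:
    """Submit a run with future=True, then block until the Run is available.

    `create_run` is expected to be mcli.sdk.create_run (or a stand-in with
    the same calling convention); `config` is expected to be a RunConfig.
    """
    # With future=True the call returns immediately and the SDK's own
    # `timeout` argument is ignored.
    pending = create_run(config, future=True)

    # ... other work can happen here while the request is processed ...

    # The wait limit now applies to result() instead of to create_run().
    return pending.result(timeout=wait_seconds)
```

With the real SDK this would presumably be called as launch_in_background(mcli.sdk.create_run, config), where config is a RunConfig.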

mcli.sdk.delete_runs(runs, timeout=10, future=False)[source]#

Delete a list of runs

Delete a list of runs in the MosaicML platform. Any runs that are currently running will first be stopped.

Parameters
  • runs (Optional[List[str] | List[Run]]) – A list of runs or run names to delete. Using Run objects is most efficient. See the note below.

  • timeout (Optional[float]) – Time, in seconds, in which the call should complete. If the call takes too long, a TimeoutError will be raised. If future is True, this value will be ignored.

  • future (bool) – Return the output as a Future. If True, the call to delete_runs() will return immediately and the request will be processed in the background. This takes precedence over the timeout argument. To get the list of Run objects, use return_value.result() with an optional timeout argument.

Returns
  • If future is False – A list of deleted Run objects

  • Otherwise – A Future for the list

Note

The Kubernetes API requires the cluster for each run. If you provide runs as a list of names, we will get this by calling get_runs(). Since a common way to get the list of runs is to have already called get_runs(), you can avoid a second call by passing the output of that call in directly.
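The note above can be illustrated with a short sketch that passes the output of get_runs() straight into delete_runs(). The helper name delete_finished_runs and the default status list are assumptions made for this example; the two SDK functions are passed in as arguments so the snippet can be exercised with stand-ins.

```python
from typing import Any, Callable, List, Sequence


def delete_finished_runs(
    get_runs: Callable[..., List[Any]],
    delete_runs: Callable[..., List[Any]],
    statuses: Sequence[str] = ("COMPLETED", "FAILED", "STOPPED"),
) -> List[Any]:
    """Delete every run whose status is in `statuses`.

    `get_runs` and `delete_runs` are expected to be the mcli.sdk functions
    of the same name.
    """
    # One lookup: get_runs() returns Run objects that already carry the
    # cluster details delete_runs() needs.
    runs = get_runs(statuses=list(statuses))
    if not runs:
        return []
    # Passing Run objects (not names) avoids a second get_runs() call.
    return delete_runs(runs)
```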

mcli.sdk.follow_run_logs(run, rank=None, timestamps=False, timeout=10, future=False)[source]#

Follow the logs for an active or completed run in the MosaicML platform

This returns a generator that yields the logs line by line and waits for new lines while the run is still active. If you only need the logs up to the time of the request, consider using get_run_logs() instead.

Parameters
  • run (str | Run) – The run to get logs for. If a name is provided, the remaining required run details will be queried with get_runs().

  • rank (Optional[int]) – Node rank of a run to get logs for. Defaults to the lowest available rank. This will usually be rank 0 unless something has gone wrong.

  • timestamps (bool) – If True, each log line will also contain the timestamp at which it was emitted. If you wish to parse out a line’s timestamp and text, you can use LogLine.from_line().

  • timeout (Optional[float]) – Time, in seconds, in which the call should complete. If the call takes too long, a TimeoutError will be raised. If future is True, this value will be ignored.

  • future (bool) – Return the output as a Future. If True, the call to follow_run_logs() will return immediately and the request will be processed in the background. This takes precedence over the timeout argument. To get the generator, use return_value.result() with an optional timeout argument.

Returns
  • If future is False – A line-by-line Generator of the logs for a run

  • Otherwise – A Future of a line-by-line generator of the logs for a run
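Because the generator keeps waiting for new lines while a run is active, a caller usually needs its own stopping condition. The sketch below shows one way to consume such a generator; collect_until, marker, and max_lines are hypothetical names introduced for this example, and any iterable of strings can stand in for the output of follow_run_logs().

```python
from typing import Iterable, List


def collect_until(lines: Iterable[str], marker: str,
                  max_lines: int = 1000) -> List[str]:
    """Read log lines until one contains `marker` or `max_lines` is reached."""
    seen: List[str] = []
    for line in lines:
        seen.append(line)
        # Stop early so the loop does not block forever on an active run.
        if marker in line or len(seen) >= max_lines:
            break
    return seen
```

With the real SDK, `lines` would presumably be the return value of follow_run_logs(run).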

mcli.sdk.get_run_logs(run, rank=None, timestamps=False, timeout=10, future=False, failed=False)[source]#

Get the current logs for an active or completed run

Get the current logs for an active or completed run in the MosaicML platform. This returns the full logs as a str, as they exist at the time the request is made. If you want to follow the logs for an active run line-by-line, use follow_run_logs().

Parameters
  • run (str | Run) – The run to get logs for. If a name is provided, the remaining required run details will be queried with get_runs().

  • rank (Optional[int]) – Node rank of a run to get logs for. Defaults to the lowest available rank. This will usually be rank 0 unless something has gone wrong.

  • timestamps (bool) – If True, each log line will also contain the timestamp at which it was emitted. If you wish to parse out a line’s timestamp and text, you can use LogLine.from_line().

  • timeout (Optional[float]) – Time, in seconds, in which the call should complete. If the call takes too long, a TimeoutError will be raised. If future is True, this value will be ignored.

  • future (bool) – Return the output as a Future. If True, the call to get_run_logs() will return immediately and the request will be processed in the background. This takes precedence over the timeout argument. To get the log text, use return_value.result() with an optional timeout argument.

Raises
  • KubernetesException (HTTPStatus.NOT_FOUND) – Raised if the requested run does not exist

  • KubernetesException (HTTPStatus.BAD_REQUEST) – Raised if the run is not yet running, or if the run does not have a node of the requested rank.

Returns
  • If future is False – The full log text for a run at the time of the request as a str

  • Otherwise – A Future for the log text
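For a multi-node run, the rank and future parameters can be combined to request several ranks' logs at once. The following is a minimal sketch under stated assumptions: logs_by_rank is a hypothetical helper, get_run_logs is passed in so a stand-in can be used, and the returned futures are assumed to behave like concurrent.futures.Future objects.

```python
from typing import Any, Callable, Dict, Iterable


def logs_by_rank(get_run_logs: Callable[..., Any], run: Any,
                 ranks: Iterable[int],
                 wait_seconds: float = 30.0) -> Dict[int, str]:
    """Fetch the current logs for several node ranks of one run."""
    # Issue every request first; with future=True each call returns at once.
    pending = {rank: get_run_logs(run, rank=rank, future=True)
               for rank in ranks}
    # Then collect the log text, waiting at most `wait_seconds` per rank.
    return {rank: fut.result(timeout=wait_seconds)
            for rank, fut in pending.items()}
```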

mcli.sdk.get_runs(runs=None, cluster_names=None, before=None, after=None, gpu_types=None, gpu_nums=None, statuses=None, timeout=10, future=False, clusters=None)[source]#

Get a filtered list of runs

List runs that have been launched in the MosaicML platform. The returned list will contain all of the details stored about the requested runs.

Parameters
  • runs (Optional[List[str] | List[Run]]) – List of runs or run names on which to get information

  • cluster_names (Optional[List[str] | List[Cluster]]) – List of cluster names to filter runs. This can be a list of str or Cluster objects. Only runs submitted to these clusters will be returned.

  • gpu_types (Optional[List[str] | List[GPUType]]) – List of GPU types to filter runs. This can be a list of str or GPUType enums. Only runs scheduled on these GPUs will be returned.

  • gpu_nums (Optional[List[int]]) – List of GPU counts to filter runs. Only runs scheduled on one of these numbers of GPUs will be returned.

  • statuses (Optional[List[str] | List[RunStatus]]) – List of run statuses to filter runs. This can be a list of str or RunStatus enums. Only runs currently in one of these statuses will be returned.

  • timeout (Optional[float]) – Time, in seconds, in which the call should complete. If the call takes too long, a TimeoutError will be raised. If future is True, this value will be ignored.

  • future (bool) – Return the output as a Future. If True, the call to get_runs() will return immediately and the request will be processed in the background. This takes precedence over the timeout argument. To get the list of Run objects, use return_value.result() with an optional timeout argument.

Raises
  • MAPIException – If connecting to MAPI, raised when a MAPI communication error occurs

  • KubernetesException – Raised when a Kubernetes error occurs when communicating with only 1 cluster

  • RuntimeError – Raised when some error occurs in calls to multiple Kubernetes clusters

Returns
  • If future is False – A list of requested Run objects

  • Otherwise – A Future for the list
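Once a list of Run objects has been retrieved, it can be summarized locally. The sketch below groups runs by status; count_by_status is a hypothetical helper, and it relies only on each run exposing a status attribute that is either a string or an enum with a value.

```python
from collections import Counter
from typing import Any, Iterable


def count_by_status(runs: Iterable[Any]) -> Counter:
    """Count runs per status, accepting RunStatus enums or plain strings."""
    counts: Counter = Counter()
    for run in runs:
        status = run.status
        # Enum members expose .value; plain strings are used as they are.
        counts[str(getattr(status, "value", status))] += 1
    return counts
```

With the real SDK this might be used as count_by_status(get_runs(gpu_nums=[8])), where the filter values are placeholders.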

mcli.sdk.initialize(api_key=None)[source]#

Initialize the MosaicML platform

Parameters

api_key – Optional value to set

mcli.sdk.set_api_key(api_key)[source]#

Set the API key for the MosaicML platform

Parameters

api_key – value to set
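A common way to avoid hard-coding credentials is to read the key from the environment before initializing. The sketch below assumes this pattern; initialize_from_env is a hypothetical helper, the environment variable name is an assumption chosen for the example, and initialize is passed in so the snippet can run without the SDK installed.

```python
import os
from typing import Any, Callable


def initialize_from_env(initialize: Callable[..., Any],
                        env_var: str = "MOSAICML_API_KEY") -> bool:
    """Call initialize(api_key=...) with a key read from the environment.

    Returns False, without calling `initialize`, when the variable is unset
    or empty. The default variable name is an assumption of this sketch.
    """
    api_key = os.environ.get(env_var)
    if not api_key:
        return False
    initialize(api_key=api_key)
    return True
```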

mcli.sdk.stop_runs(runs, timeout=10, future=False)[source]#

Stop a list of runs

Stop a list of runs currently running in the MosaicML platform.

Parameters
  • runs (Optional[List[str] | List[Run]]) – A list of runs or run names to stop. Using Run objects is most efficient. See the note below.

  • timeout (Optional[float]) – Time, in seconds, in which the call should complete. If the call takes too long, a TimeoutError will be raised. If future is True, this value will be ignored.

  • future (bool) – Return the output as a Future. If True, the call to stop_runs() will return immediately and the request will be processed in the background. This takes precedence over the timeout argument. To get the list of Run objects, use return_value.result() with an optional timeout argument.

Raises

KubernetesException – Raised if stopping any of the requested runs failed. All successfully stopped runs will have the status `RunStatus.STOPPED`. You can freely retry any stopped and unstopped runs if this error is raised due to a connection issue.

Returns
  • If future is False – A list of stopped Run objects

  • Otherwise – A Future for the list

Note

The Kubernetes API requires the cluster for each run. If you provide runs as a list of names, we will get this by calling get_runs(). Since a common way to get the list of runs is to have already called get_runs(), you can avoid a second call by passing the output of that call in directly.

Warning

Stopping runs does not occur immediately. You may see up to a 40-second delay between your request and the run actually stopping.
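Given that delay, code that depends on a run having stopped may want to wait explicitly. The following is a minimal sketch that combines stop_runs() with wait_for_run_status(); stop_and_confirm and grace_seconds are hypothetical names, the 60-second default is an arbitrary margin above the documented delay, and both SDK functions are passed in as arguments.

```python
from typing import Any, Callable, List, Sequence


def stop_and_confirm(stop_runs: Callable[..., List[Any]],
                     wait_for_run_status: Callable[..., Any],
                     runs: Sequence[Any],
                     grace_seconds: float = 60.0) -> List[Any]:
    """Request a stop, then wait for each run to report STOPPED."""
    # stop_runs() returns the runs for which a stop was requested.
    stopping = stop_runs(runs)
    # Each wait raises TimeoutError if the run has not reached STOPPED
    # within `grace_seconds`.
    return [wait_for_run_status(run, "STOPPED", timeout=grace_seconds)
            for run in stopping]
```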

mcli.sdk.wait_for_run_status(run, status, timeout=None, future=False)[source]#

Wait for a launched run to reach a specific status

Parameters
  • run (str | Run) – The run whose status should be watched. This can be provided using the run’s name or an existing Run object.

  • status (str | RunStatus) – Status to wait for. This can be any valid RunStatus value. If the run never reaches this state (e.g. it fails or the wait times out), then an error will be raised. See exception details below.

  • timeout (Optional[float]) – Time, in seconds, in which the call should complete. If the call takes too long, a TimeoutError will be raised. If future is True, this value will be ignored.

  • future (bool) – Return the output as a Future. If True, the call to wait_for_run_status() will return immediately and the request will be processed in the background. This takes precedence over the timeout argument. To get the Run output, use return_value.result() with an optional timeout argument.

Raises
  • KubernetesException – Raised with status code 404 if the requested run could not be found

  • RunFailed – Raised if the run failed before reaching the requested status

  • TimeoutError – Raised if the run did not reach the correct status in the specified time

Returns
  • If future is False – A Run object once it has reached the requested status

  • Otherwise – A Future for the run. This will not resolve until the run reaches the requested status
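The timeout behavior can be handled by the caller when a missing status is not fatal. The sketch below catches only the built-in TimeoutError listed above; wait_or_none is a hypothetical helper, and wait_for_run_status is passed in so a stand-in can be used.

```python
from typing import Any, Callable, Optional


def wait_or_none(wait_for_run_status: Callable[..., Any], run: Any,
                 status: str = "RUNNING",
                 timeout: float = 300.0) -> Optional[Any]:
    """Wait for `status`; return None instead of raising on a timeout."""
    try:
        return wait_for_run_status(run, status, timeout=timeout)
    except TimeoutError:
        # The run did not reach `status` in time. Other errors, such as
        # a failed or missing run, are deliberately left to propagate.
        return None
```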

class mcli.sdk.Run(run_uid, name, status, created_at, updated_at, config, started_at=None, completed_at=None, reason=None, submitted_config=None, nodes=<factory>, _required_properties=('id', 'name', 'status', 'createdAt', 'updatedAt', 'runInput', 'originalRunInput'), _type=None)[source]#

A run that has been launched on the MosaicML platform

Parameters
  • run_uid (str) – Unique identifier for the run

  • name (str) – User-defined name of the run

  • status (RunStatus) – Status of the run at a moment in time

  • created_at (datetime) – Date and time when the run was created

  • updated_at (datetime) – Date and time when the run was last updated

  • config (RunConfig) – The run configuration that was used to launch the run

  • started_at (Optional[datetime]) – Date and time when the run entered the STARTED RunStatus

  • completed_at (Optional[datetime]) – Date and time when the run entered the COMPLETED RunStatus
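Because started_at and completed_at are optional, code that derives values from them should handle None. The following sketch computes the elapsed time of a run; run_duration is a hypothetical helper that relies only on the two attributes listed above.

```python
from datetime import timedelta
from typing import Any, Optional


def run_duration(run: Any) -> Optional[timedelta]:
    """Return completed_at - started_at, or None if either is missing."""
    if run.started_at is None or run.completed_at is None:
        # The run has not started yet, or has not finished yet.
        return None
    return run.completed_at - run.started_at
```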

class mcli.sdk.RunConfig(run_name=None, name=None, gpu_type=None, gpu_num=None, cpus=None, platform=None, cluster=None, image=None, partitions=None, optimization_level=None, integrations=<factory>, env_variables=<factory>, scheduling=<factory>, command='', parameters=<factory>, entrypoint='')[source]#

A run configuration for the MosaicML platform

Values in here are not yet validated and some required values may be missing.

Parameters
  • name (Optional[str]) – User-defined name of the run

  • gpu_type (Optional[str]) – GPU type (optional if your cluster has only one GPU type)

  • gpu_num (Optional[int]) – Number of GPUs

  • cpus (Optional[int]) – Number of CPUs

  • cluster (Optional[str]) – Cluster to use (optional if you only have one)

  • image (Optional[str]) – Docker image (e.g. mosaicml/composer)

  • integrations (List[Dict[str, Any]]) – List of integrations

  • env_variables (List[Dict[str, str]]) – List of environment variables

  • command (str) – Command to use when a run starts

  • parameters (Dict[str, Any]) – Parameters to mount into the environment

  • entrypoint (str) – Alternative to command
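Since values in a RunConfig are not validated when it is built, a caller may want a quick check before launching. The sketch below is one possible pre-flight check; missing_fields is a hypothetical helper, and the default list of fields is an assumption for illustration, not the platform's actual set of required values.

```python
from typing import Any, List, Sequence


def missing_fields(config: Any,
                   required: Sequence[str] = ("name", "image", "command")
                   ) -> List[str]:
    """List the attributes in `required` that are unset or empty."""
    # None and '' (the documented default for command) both count as missing.
    return [field for field in required
            if not getattr(config, field, None)]
```

An empty result does not guarantee that create_run() will accept the configuration; it only rules out the gaps this check knows about.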

class mcli.sdk.RunStatus(value)[source]#

Possible statuses of a run

COMPLETED = 'COMPLETED'#

The run has finished without any errors

FAILED = 'FAILED'#

The run has failed due to an issue at runtime

FAILED_PULL = 'FAILED_PULL'#

The run has failed due to a Kubernetes error

PENDING = 'PENDING'#

The run has been dispatched and is waiting to be received

QUEUED = 'QUEUED'#

The run is queued and is awaiting execution

RUNNING = 'RUNNING'#

The run is actively running

SCHEDULED = 'SCHEDULED'#

The run has been scheduled and is waiting to be queued

STARTING = 'STARTING'#

The run is starting up and preparing to run

STOPPED = 'STOPPED'#

The run has stopped

STOPPING = 'STOPPING'#

The run is in the process of being stopped

TERMINATING = 'TERMINATING'#

The run is in the process of being terminated

UNKNOWN = 'UNKNOWN'#

A valid run status cannot be found

after(other, inclusive=False)[source]#

Returns True if this state usually comes “after” the other

Parameters
  • other – Another RunStatus

  • inclusive – If True, equality evaluates to True. Default False.

Returns

If this state is “after” the other

Example

>>> RunStatus.COMPLETED.after(RunStatus.RUNNING)
True
>>> RunStatus.RUNNING.after(RunStatus.PENDING)
True

before(other, inclusive=False)[source]#

Returns True if this state usually comes “before” the other

Parameters
  • other – Another RunStatus

  • inclusive – If True, equality evaluates to True. Default False.

Returns

If this state is “before” the other

Example

>>> RunStatus.RUNNING.before(RunStatus.COMPLETED)
True
>>> RunStatus.PENDING.before(RunStatus.RUNNING)
True

classmethod from_string(run_status)[source]#

Convert a string to a valid RunStatus Enum

If the run status string is not recognized, this returns RunStatus.UNKNOWN instead of raising a KeyError.
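The ordering and fallback behavior described for RunStatus can be illustrated with a small local stand-in. DemoStatus below is not the library's implementation: it is a hypothetical enum with only four members, and its lifecycle order is an assumption taken from the documented examples (PENDING, then RUNNING, then COMPLETED).

```python
from enum import Enum

# Assumed lifecycle order for this sketch only.
_LIFECYCLE = ["PENDING", "RUNNING", "COMPLETED"]


class DemoStatus(Enum):
    PENDING = "PENDING"
    RUNNING = "RUNNING"
    COMPLETED = "COMPLETED"
    UNKNOWN = "UNKNOWN"  # has no place in the ordering in this sketch

    def before(self, other: "DemoStatus", inclusive: bool = False) -> bool:
        """True if this status comes before `other` in the lifecycle."""
        mine = _LIFECYCLE.index(self.value)
        theirs = _LIFECYCLE.index(other.value)
        return mine <= theirs if inclusive else mine < theirs

    def after(self, other: "DemoStatus", inclusive: bool = False) -> bool:
        """True if this status comes after `other` in the lifecycle."""
        mine = _LIFECYCLE.index(self.value)
        theirs = _LIFECYCLE.index(other.value)
        return mine >= theirs if inclusive else mine > theirs

    @classmethod
    def from_string(cls, text: str) -> "DemoStatus":
        """Look up a member by name, falling back to UNKNOWN."""
        try:
            return cls[text]
        except KeyError:
            return cls.UNKNOWN
```

In this stand-in, inclusive=True only changes the result when the two statuses are equal.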