Managing Runs#

Runs can be managed through the Python API. This page outlines how to work with runs, including creating, following, getting, stopping, and deleting them. Before getting started, familiarize yourself with Run Lifecycle and the Python API introduction.

Creating a run#

mcli.api.kube.runs.create_run(run, timeout=10, future=False, _priority=None, _job_type=MCLIJobType.RUN)[source]

Launch a run

Launch a run in the MosaicML platform. The provided run must contain enough details to fully configure the run. If it does not, an error will be thrown.

Parameters
  • run (RunConfig) – A run configuration with enough details to launch. The run will be queued and persisted in the run database.

  • timeout (Optional[float]) – Time, in seconds, in which the call should complete. If the run creation takes too long, a TimeoutError will be raised. If future is True, this value will be ignored.

  • future (bool) – Return the output as a Future. If True, the call to create_run() will return immediately and the request will be processed in the background. This takes precedence over the timeout argument. To get the Run output, use return_value.result() with an optional timeout argument.

  • _priority (Optional[PriorityLabel | str]) – DEPRECATED An optional priority level at which the run should be created. Only effective for certain clusters.

  • _job_type (MCLIJobType) – DEPRECATED An optional “job type” descriptor for the run

Raises

InstanceTypeUnavailable – Raised if an invalid compute instance is requested

Returns
  • If future is False – The created Run object

  • Otherwise – A Future for the object
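
The future=True pattern described above can be pictured with the standard library. This is a minimal sketch that assumes the returned object behaves like a concurrent.futures.Future; fake_create_run is a hypothetical stand-in for the platform request, not part of MCLI.

```python
import time
from concurrent.futures import Future, ThreadPoolExecutor


def fake_create_run(name: str) -> dict:
    """Hypothetical stand-in for a slow platform request (not an MCLI call)."""
    time.sleep(0.05)
    return {"name": name, "status": "PENDING"}


executor = ThreadPoolExecutor(max_workers=1)

# Analogous to create_run(config, future=True): the call returns immediately.
pending: Future = executor.submit(fake_create_run, "hello-world")

# ... other work can happen here while the request is processed ...

# Block for at most 5 seconds; concurrent.futures raises its TimeoutError
# if the result is not ready by then.
created = pending.result(timeout=5)
print(created["name"])
executor.shutdown()
```

The timeout passed to result() plays the role that the timeout argument plays in the blocking call.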

Runs can be created programmatically, giving you the flexibility to define custom workflows or create similar runs in quick succession. create_run() takes a RunConfig object, which is a fully-configured run ready to launch. The method launches the run and then returns a Run object, which includes the RunConfig data in Run.config as well as data received at the time the run was launched.

The RunConfig object#

The RunConfig object holds the configuration data needed to launch a run. This is the underlying Python data structure MCLI uses, so before beginning, make sure to familiarize yourself with the Run schema.

class mcli.api.kube.runs.RunConfig(run_name=None, name=None, gpu_type=None, gpu_num=None, cpus=None, platform=None, cluster=None, image=None, partitions=None, optimization_level=None, integrations=<factory>, env_variables=<factory>, scheduling=<factory>, command='', parameters=<factory>, entrypoint='')[source]

A run configuration for the MosaicML platform

Values in here are not yet validated and some required values may be missing.

Parameters
  • name (Optional[str]) – User-defined name of the run

  • gpu_type (Optional[str]) – GPU type (optional if only one gpu type for your cluster)

  • gpu_num (Optional[int]) – Number of GPUs

  • cpus (Optional[int]) – Number of CPUs

  • cluster (Optional[str]) – Cluster to use (optional if you only have one)

  • image (Optional[str]) – Docker image (e.g. mosaicml/composer)

  • integrations (List[Dict[str, Any]]) – List of integrations

  • env_variables (List[Dict[str, str]]) – List of environment variables

  • command (str) – Command to use when a run starts

  • parameters (Dict[str, Any]) – Parameters to mount into the environment

  • entrypoint (str) – Alternative to command

There are two ways to initialize a RunConfig object that can be used to configure and create a run. The first is by referencing a YAML file, equivalent to the file argument in MCLI:

from mcli.api.runs import RunConfig, create_run

run_config = RunConfig.from_file('hello_world.yaml')
created_run = create_run(run_config)

Alternatively, you can instantiate the RunConfig object directly in Python:

from mcli.api.runs import RunConfig, create_run

cluster = "<your-cluster>"
run_config = RunConfig(
    name='hello-world',
    image='bash',
    command='echo "Hello World!" && sleep 60',
    gpu_type='none',
    cluster=cluster,
)
created_run = create_run(run_config)

These can also be used in combination, for example loading a base configuration file and modifying select fields:

from mcli.api.runs import RunConfig, create_run

special_config = RunConfig.from_file('base_config.yaml')
special_config.gpu_num = 8  # RunConfig exposes the GPU count as gpu_num
created_run = create_run(special_config)

Changing parameters for parameter sweeps

If you are trying to kick off a bunch of runs with similar configurations and different training parameters, make sure you copy the parameters (and any other dict field) instead of modifying them directly:

import copy

from mcli.api.runs import RunConfig, create_run

config = RunConfig.from_file('base_config.yaml')

params = { ... }
for lr in (0.1, 0.01, 0.001):
    # Copy the nested parameters so each run gets its own independent dict
    new_params = copy.deepcopy(params)
    new_params['optimizers']['sgd']['lr'] = lr
    config.parameters = new_params
    created_run = create_run(config)
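
To see why the deep copy matters, here is a small sketch using plain dictionaries only. The parameter layout is a hypothetical example that mirrors the sweep above; no MCLI calls are involved.

```python
import copy

# Hypothetical base parameters; the nested layout mirrors the sweep above.
params = {"optimizers": {"sgd": {"lr": 0.1}}}

# A shallow copy shares the nested dicts with the original ...
shallow = dict(params)
shallow["optimizers"]["sgd"]["lr"] = 0.01
leaked = params["optimizers"]["sgd"]["lr"]  # the base value changed too

# ... while a deep copy is fully independent.
params["optimizers"]["sgd"]["lr"] = 0.1  # reset the base value
deep = copy.deepcopy(params)
deep["optimizers"]["sgd"]["lr"] = 0.001
preserved = params["optimizers"]["sgd"]["lr"]  # the base value is untouched

print(leaked, preserved)
```

With a shallow copy, every run in a sweep could end up pointing at the same nested dictionary, so later modifications would silently alter earlier configurations.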

The Run object#

Created runs are returned as a Run object by create_run(). This object can be used as input to any subsequent run function. For example, you can start a run and then immediately start following it:

created_run = create_run(config)
for line in follow_run_logs(created_run):
    print(line)

class mcli.api.kube.runs.Run(run_uid, name, status, created_at, updated_at, config, started_at=None, completed_at=None, reason=None, submitted_config=None, nodes=<factory>, _required_properties=('id', 'name', 'status', 'createdAt', 'updatedAt', 'runInput', 'originalRunInput'), _type=None)[source]

A run that has been launched on the MosaicML platform

Parameters
  • run_uid (str) – Unique identifier for the run

  • name (str) – User-defined name of the run

  • status (RunStatus) – Status of the run at a moment in time

  • created_at (datetime) – Date and time when the run was created

  • updated_at (datetime) – Date and time when the run was last updated

  • config (RunConfig) – The run configuration that was used to launch the run

  • started_at (Optional[datetime]) – Date and time when the run entered the STARTED RunStatus

  • completed_at (Optional[datetime]) – Date and time when the run entered the COMPLETED RunStatus

Observing a run#

Getting a run’s logs#

There are two functions for fetching run logs:

  • get_run_logs(): Gets currently available logs for any run. Ideal for completed runs or checking progress of an active run

  • follow_run_logs(): Follows logs line-by-line for any run. Ideal for monitoring active runs in real time or until a condition is reached (see also wait_for_run_status())

mcli.api.kube.runs.get_run_logs(run, rank=None, timestamps=False, timeout=10, future=False, failed=False)[source]

Get the current logs for an active or completed run

Get the current logs for an active or completed run in the MosaicML platform. This returns the full logs as a str, as they exist at the time the request is made. If you want to follow the logs for an active run line-by-line, use follow_run_logs().

Parameters
  • run (str | Run) – The run to get logs for. If a name is provided, the remaining required run details will be queried with get_runs().

  • rank (Optional[int]) – Node rank of a run to get logs for. Defaults to the lowest available rank. This will usually be rank 0 unless something has gone wrong.

  • timestamps (bool) – If True, each log line will also contain the timestamp at which it was emitted. If you wish to parse out a line’s timestamp and text, you can use LogLine.from_line().

  • timeout (Optional[float]) – Time, in seconds, in which the call should complete. If the call takes too long, a TimeoutError will be raised. If future is True, this value will be ignored.

  • future (bool) – Return the output as a Future. If True, the call to get_run_logs() will return immediately and the request will be processed in the background. This takes precedence over the timeout argument. To get the log text, use return_value.result() with an optional timeout argument.

Raises
  • KubernetesException (HTTPStatus.NOT_FOUND) – Raised if the requested run does not exist

  • KubernetesException (HTTPStatus.BAD_REQUEST) – Raised if the run is not yet running, or if the run does not have a node of the requested rank.

Returns
  • If future is False – The full log text for a run at the time of the request as a str

  • Otherwise – A Future for the log text

mcli.api.kube.runs.follow_run_logs(run, rank=None, timestamps=False, timeout=10, future=False)[source]

Follow the logs for an active or completed run in the MosaicML platform

This returns a generator of individual log lines, line-by-line, and will wait until new lines are produced if the run is still active. If you are only looking for the logs up until the time of the request, consider using get_run_logs() instead.

Parameters
  • run (str | Run) – The run to get logs for. If a name is provided, the remaining required run details will be queried with get_runs().

  • rank (Optional[int]) – Node rank of a run to get logs for. Defaults to the lowest available rank. This will usually be rank 0 unless something has gone wrong.

  • timestamps (bool) – If True, each log line will also contain the timestamp at which it was emitted. If you wish to parse out a line’s timestamp and text, you can use LogLine.from_line().

  • timeout (Optional[float]) – Time, in seconds, in which the call should complete. If the call takes too long, a TimeoutError will be raised. If future is True, this value will be ignored.

  • future (bool) – Return the output as a Future. If True, the call to follow_run_logs() will return immediately and the request will be processed in the background. This takes precedence over the timeout argument. To get the generator, use return_value.result() with an optional timeout argument.

Returns
  • If future is False – A line-by-line Generator of the logs for a run

  • Otherwise – A Future of a line-by-line generator of the logs for a run
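
Because the logs arrive as a generator, you can stop following as soon as a condition is met. The sketch below illustrates that pattern; fake_follow_run_logs and read_until are hypothetical names, and the generator is a stand-in that yields fixed strings rather than real platform logs.

```python
from typing import Iterator, List


def fake_follow_run_logs() -> Iterator[str]:
    """Hypothetical stand-in for follow_run_logs(): yields log lines one by one."""
    yield from ["epoch 0 loss 2.31", "epoch 1 loss 1.87", "training done", "cleanup"]


def read_until(lines: Iterator[str], marker: str) -> List[str]:
    """Collect lines up to and including the first one that contains marker."""
    seen: List[str] = []
    for line in lines:
        seen.append(line)
        if marker in line:
            break  # stop following once the condition is reached
    return seen


collected = read_until(fake_follow_run_logs(), "done")
print(collected)
```

Breaking out of the loop leaves the remaining lines unread, which is the main difference from fetching the full log text in one request.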

Monitoring a run throughout its lifecycle#

mcli.api.kube.runs.wait_for_run_status(run, status, timeout=None, future=False)[source]

Wait for a launched run to reach a specific status

Parameters
  • run (str | Run) – The run whose status should be watched. This can be provided using the run’s name or an existing Run object.

  • status (str | RunStatus) – Status to wait for. This can be any valid RunStatus value. If the run never reaches this state (e.g. it fails or the wait times out), then an error will be raised. See exception details below.

  • timeout (Optional[float]) – Time, in seconds, in which the call should complete. If the call takes too long, a TimeoutError will be raised. If future is True, this value will be ignored.

  • future (bool) – Return the output as a Future. If True, the call to wait_for_run_status() will return immediately and the request will be processed in the background. This takes precedence over the timeout argument. To get the Run output, use return_value.result() with an optional timeout argument.

Raises
  • KubernetesException – Raised with status code 404 if the requested run could not be found

  • RunFailed – Raised if the run failed before reaching the requested status

  • TimeoutError – Raised if the run did not reach the correct status in the specified time

Returns
  • If future is False – A Run object once it has reached the requested status

  • Otherwise – A Future for the run. This will not resolve until the run reaches the requested status
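
The contract described above (return once the status is reached, raise if the run fails first or time runs out) can be pictured as a polling loop. This is only an illustrative sketch, not how the library is implemented: wait_for, poll, and SketchRunFailed are hypothetical names, and the status sequences are simulated.

```python
import time
from typing import Callable, Optional


class SketchRunFailed(Exception):
    """Hypothetical stand-in for the RunFailed error."""


def wait_for(poll: Callable[[], str], target: str,
             timeout: Optional[float] = None, interval: float = 0.01) -> str:
    """Poll until target is seen; raise on failure or when time runs out."""
    deadline = None if timeout is None else time.monotonic() + timeout
    while True:
        status = poll()
        if status == target:
            return status
        if status == "FAILED":
            raise SketchRunFailed(f"run failed before reaching {target}")
        if deadline is not None and time.monotonic() >= deadline:
            raise TimeoutError(f"run did not reach {target} in {timeout} seconds")
        time.sleep(interval)


# Simulated status sequence for a healthy run.
healthy = iter(["PENDING", "QUEUED", "STARTING", "RUNNING"])
reached = wait_for(lambda: next(healthy), "RUNNING", timeout=1.0)

# Simulated run that fails before it starts running.
broken = iter(["PENDING", "FAILED"])
try:
    wait_for(lambda: next(broken), "RUNNING", timeout=1.0)
    outcome = "reached"
except SketchRunFailed:
    outcome = "failed"

print(reached, outcome)
```

In calling code, the same try/except shape applies: handle the failure case and the timeout case separately from the successful return.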

The RunStatus object#

The RunStatus object is attached to each Run object and reflects the most recent status the run has been observed with.

class mcli.api.kube.runs.RunStatus(value)[source]

Possible statuses of a run

PENDING = 'PENDING'

The run has been dispatched and is waiting to be received

SCHEDULED = 'SCHEDULED'

The run has been scheduled and is waiting to be queued

QUEUED = 'QUEUED'

The run is queued and is awaiting execution

STARTING = 'STARTING'

The run is starting up and preparing to run

RUNNING = 'RUNNING'

The run is actively running

TERMINATING = 'TERMINATING'

The run is in the process of being terminated

STOPPING = 'STOPPING'

The run is in the process of being stopped

COMPLETED = 'COMPLETED'

The run has finished without any errors

STOPPED = 'STOPPED'

The run has stopped

FAILED_PULL = 'FAILED_PULL'

The run has failed due to a Kubernetes error

FAILED = 'FAILED'

The run has failed due to an issue at runtime

UNKNOWN = 'UNKNOWN'

A valid run status cannot be found

before(other, inclusive=False)[source]

Returns True if this state usually comes “before” the other

Parameters
  • other – Another RunStatus

  • inclusive – If True, equality evaluates to True. Default False.

Returns

If this state is “before” the other

Example

>>> RunStatus.RUNNING.before(RunStatus.COMPLETED)
True
>>> RunStatus.PENDING.before(RunStatus.RUNNING)
True

after(other, inclusive=False)[source]

Returns True if this state usually comes “after” the other

Parameters
  • other – Another RunStatus

  • inclusive – If True, equality evaluates to True. Default False.

Returns

If this state is “after” the other

Example

>>> RunStatus.COMPLETED.after(RunStatus.RUNNING)
True
>>> RunStatus.RUNNING.after(RunStatus.PENDING)
True

classmethod from_string(run_status)[source]

Convert a string to a valid RunStatus Enum

If the run status string is not recognized, will return RunStatus.UNKNOWN instead of raising a KeyError
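
The ordering helpers and the string fallback can be illustrated with a small enum. This is a sketch under stated assumptions, not the library's code: SketchStatus is a hypothetical subset of the statuses, and it assumes that declaration order stands in for the usual lifecycle order.

```python
from enum import Enum


class SketchStatus(Enum):
    """Hypothetical subset of run statuses, declared in lifecycle order."""
    PENDING = "PENDING"
    RUNNING = "RUNNING"
    COMPLETED = "COMPLETED"
    UNKNOWN = "UNKNOWN"

    def order(self) -> int:
        # Assumption: position in declaration order stands in for lifecycle order.
        return list(type(self)).index(self)

    def before(self, other: "SketchStatus", inclusive: bool = False) -> bool:
        return self.order() < other.order() or (inclusive and self is other)

    def after(self, other: "SketchStatus", inclusive: bool = False) -> bool:
        return self.order() > other.order() or (inclusive and self is other)

    @classmethod
    def from_string(cls, text: str) -> "SketchStatus":
        # Fall back to UNKNOWN rather than raising on unrecognized input.
        try:
            return cls[text]
        except KeyError:
            return cls.UNKNOWN


print(SketchStatus.RUNNING.before(SketchStatus.COMPLETED))
print(SketchStatus.from_string("bogus"))
```

A fallback value keeps code that parses statuses from raising when an unrecognized status string appears.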

Listing runs#

All runs that you have launched in the MosaicML platform and have not deleted can be accessed using the get_runs() function. Optional filters allow you to list a subset of runs by name, cluster, GPU type, number of GPUs, or status.

mcli.api.kube.runs.get_runs(runs=None, cluster_names=None, before=None, after=None, gpu_types=None, gpu_nums=None, statuses=None, timeout=10, future=False, clusters=None)[source]

Get a filtered list of runs

List runs that have been launched in the MosaicML platform. The returned list will contain all of the details stored about the requested runs.

Parameters
  • runs (Optional[List[str] | List[Run]]) – List of run names on which to get information

  • cluster_names (Optional[List[str] | List[Cluster]]) – List of cluster names to filter runs. This can be a list of str or Cluster objects. Only runs submitted to these clusters will be returned.

  • gpu_types (Optional[List[str] | List[GPUType]]) – List of GPU types to filter runs. This can be a list of str or GPUType enums. Only runs scheduled on these GPUs will be returned.

  • gpu_nums (Optional[List[int]]) – List of GPU counts to filter runs. Only runs scheduled on this number of GPUs will be returned.

  • statuses (Optional[List[str] | List[RunStatus]]) – List of run statuses to filter runs. This can be a list of str or RunStatus enums. Only runs currently in these phases will be returned.

  • timeout (Optional[float]) – Time, in seconds, in which the call should complete. If the call takes too long, a TimeoutError will be raised. If future is True, this value will be ignored.

  • future (bool) – Return the output as a Future. If True, the call to get_runs() will return immediately and the request will be processed in the background. This takes precedence over the timeout argument. To get the list of Run output, use return_value.result() with an optional timeout argument.

Raises
  • MAPIException – If connecting to MAPI, raised when a MAPI communication error occurs

  • KubernetesException – Raised when a Kubernetes error occurs when communicating with only 1 cluster

  • RuntimeError – Raised when some error occurs in calls to multiple Kubernetes clusters

Returns
  • If future is False – A list of requested Run objects

  • Otherwise – A Future for the list

Stopping runs#

mcli.api.kube.runs.stop_runs(runs, timeout=10, future=False)[source]

Stop a list of runs

Stop a list of runs currently running in the MosaicML platform.

Parameters
  • runs (Optional[List[str] | List[Run]]) – A list of runs or run names to stop. Using Run objects is most efficient. See the note below.

  • timeout (Optional[float]) – Time, in seconds, in which the call should complete. If the call takes too long, a TimeoutError will be raised. If future is True, this value will be ignored.

  • future (bool) – Return the output as a Future. If True, the call to stop_runs() will return immediately and the request will be processed in the background. This takes precedence over the timeout argument. To get the list of Run output, use return_value.result() with an optional timeout argument.

Raises

KubernetesException – Raised if stopping any of the requested runs failed. All successfully stopped runs will have the status `RunStatus.STOPPED`. You can freely retry any stopped and unstopped runs if this error is raised due to a connection issue.

Returns
  • If future is False – A list of stopped Run objects

  • Otherwise – A Future for the list

Note

The Kubernetes API requires the cluster for each run. If you provide runs as a list of names, we will get this by calling get_runs(). Since a common way to get the list of runs is to have already called get_runs(), you can avoid a second call by passing the output of that call in directly.

Warning

Stopping runs does not occur immediately. You may see up to a 40 second delay between your request and the run actually stopping.
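
Since a stop request that fails because of a connection issue can safely be retried, a small retry wrapper is one way to handle it. The sketch below is illustrative only: retry and flaky_stop are hypothetical names, and the built-in ConnectionError stands in for whatever exception the real call raises.

```python
from typing import Callable, List


def retry(call: Callable[[], List[str]], attempts: int = 3) -> List[str]:
    """Run call(), retrying on ConnectionError up to attempts times in total."""
    for attempt in range(1, attempts + 1):
        try:
            return call()
        except ConnectionError:
            if attempt == attempts:
                raise  # give up after the final attempt
    raise ValueError("attempts must be at least 1")


calls = {"count": 0}


def flaky_stop() -> List[str]:
    """Hypothetical stand-in for stop_runs(): fails once, then succeeds."""
    calls["count"] += 1
    if calls["count"] == 1:
        raise ConnectionError("simulated connection issue")
    return ["run-a", "run-b"]


stopped = retry(flaky_stop)
print(stopped, calls["count"])
```

Given the delay mentioned in the warning above, a pause between attempts may also be worth adding in real use.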

Deleting runs#

To delete runs, you must supply the run names or Run objects. To delete a set of runs, you can use the output of get_runs() or even define your own filters directly:

import datetime as dt

from mcli.api.runs import delete_runs, get_runs

# delete a run by name
delete_runs(['delete-this-run'])

# delete failed runs on cluster xyz using 1 or 2 GPUs
failed_runs = get_runs(statuses=['FAILED'], cluster_names=['xyz'], gpu_nums=[1, 2])
delete_runs(failed_runs)

# delete completed runs older than a month with name pattern
completed = get_runs(statuses=['COMPLETED'])
ref_date = dt.datetime.now() - dt.timedelta(days=30)
old_runs = [r for r in completed if 'experiment1' in r.name and r.created_at < ref_date]
delete_runs(old_runs)

mcli.api.kube.runs.delete_runs(runs, timeout=10, future=False)[source]

Delete a list of runs

Delete a list of runs in the MosaicML platform. Any runs that are currently running will first be stopped.

Parameters
  • runs (Optional[List[str] | List[Run]]) – A list of runs or run names to delete. Using Run objects is most efficient. See the note below.

  • timeout (Optional[float]) – Time, in seconds, in which the call should complete. If the call takes too long, a TimeoutError will be raised. If future is True, this value will be ignored.

  • future (bool) – Return the output as a Future. If True, the call to delete_runs() will return immediately and the request will be processed in the background. This takes precedence over the timeout argument. To get the list of Run output, use return_value.result() with an optional timeout argument.

Returns
  • If future is False – A list of deleted Run objects

  • Otherwise – A Future for the list

Note

The Kubernetes API requires the cluster for each run. If you provide runs as a list of names, we will get this by calling get_runs(). Since a common way to get the list of runs is to have already called get_runs(), you can avoid a second call by passing the output of that call in directly.
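
The name-and-age filter used for deletion earlier in this section can be tried out offline before pointing it at real runs. In this sketch, SketchRun is a hypothetical stand-in that exposes only the two attributes the filter reads, and the dates are made up for illustration.

```python
import datetime as dt
from dataclasses import dataclass
from typing import List


@dataclass
class SketchRun:
    """Hypothetical stand-in exposing only the fields the filter reads."""
    name: str
    created_at: dt.datetime


now = dt.datetime(2023, 6, 1)  # fixed reference time for a reproducible example
runs = [
    SketchRun("experiment1-a", now - dt.timedelta(days=45)),
    SketchRun("experiment1-b", now - dt.timedelta(days=5)),
    SketchRun("experiment2-a", now - dt.timedelta(days=60)),
]

# Keep only runs that match the name pattern and are older than 30 days.
ref_date = now - dt.timedelta(days=30)
old_runs: List[SketchRun] = [
    r for r in runs if "experiment1" in r.name and r.created_at < ref_date
]
print([r.name for r in old_runs])
```

Python refuses to compare timezone-aware and naive datetimes, so if the created_at values on real Run objects turn out to be timezone-aware, ref_date needs to be timezone-aware as well.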