Managing Runs#
Runs can be managed through the Python API. This page outlines how to work with runs, including creating, following, getting, stopping, and deleting them. Before getting started, familiarize yourself with Run Lifecycle and the Python API introduction.
Creating a run#
- mcli.api.kube.runs.create_run(run, timeout=10, future=False, _priority=None, _job_type=MCLIJobType.RUN)
Launch a run
Launch a run in the MosaicML platform. The provided run must contain enough details to fully configure the run. If it does not, an error will be thrown.
- Parameters
run (RunConfig) – A run configuration with enough details to launch. The run will be queued and persisted in the run database.
timeout (Optional[float]) – Time, in seconds, in which the call should complete. If the run creation takes too long, a TimeoutError will be raised. If future is True, this value will be ignored.
future (bool) – Return the output as a Future. If True, the call to create_run() will return immediately and the request will be processed in the background. This takes precedence over the timeout argument. To get the Run output, use return_value.result() with an optional timeout argument.
_priority (Optional[PriorityLabel | str]) – DEPRECATED. An optional priority level at which the run should be created. Only effective for certain clusters.
_job_type (MCLIJobType) – DEPRECATED. An optional “job type” descriptor for the run
- Raises
InstanceTypeUnavailable – Raised if an invalid compute instance is requested
- Returns
Runs can be created programmatically, giving you flexibility to define custom workflows or create similar runs in quick succession.
create_run() takes a RunConfig object, which is a fully configured run ready to launch. The method launches the run and then returns a Run object, which includes the RunConfig data in Run.config as well as data received at the time the run was launched.
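The `future=True` behavior described for `create_run()` can be illustrated without the platform. The sketch below assumes the returned object behaves like a standard `concurrent.futures.Future`; `fake_create_run` is a hypothetical stand-in written for this example and is not part of MCLI.

```python
import time
from concurrent.futures import Future, ThreadPoolExecutor

_executor = ThreadPoolExecutor(max_workers=1)


def fake_create_run(name: str, future: bool = False, delay: float = 0.05):
    """Hypothetical stand-in for create_run(); it only sleeps and echoes a name."""

    def _submit() -> str:
        time.sleep(delay)  # pretend to talk to the platform
        return f"run:{name}"

    fut: Future = _executor.submit(_submit)
    if future:
        return fut  # return immediately; the caller resolves it later
    return fut.result()  # block until the work is done


pending = fake_create_run("hello-world", future=True)
# ... other work can happen here while the request runs in the background ...
# result() raises concurrent.futures.TimeoutError if not finished in time
resolved = pending.result(timeout=5)
blocking = fake_create_run("hello-world")
```

With the real API, the same shape would apply: call with `future=True`, continue with other work, then call `.result()` with an optional timeout when the `Run` is needed.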
The RunConfig object#
The RunConfig object holds configuration data needed to launch a run.
This is the underlying Python data structure MCLI uses, so before beginning make sure to familiarize yourself with the Run schema.
- class mcli.api.kube.runs.RunConfig(run_name=None, name=None, gpu_type=None, gpu_num=None, cpus=None, platform=None, cluster=None, image=None, partitions=None, optimization_level=None, integrations=<factory>, env_variables=<factory>, scheduling=<factory>, command='', parameters=<factory>, entrypoint='')
A run configuration for the MosaicML platform
Values in here are not yet validated and some required values may be missing.
- Parameters
name (Optional[str]) – User-defined name of the run
gpu_type (Optional[str]) – GPU type (optional if only one gpu type for your cluster)
gpu_num (Optional[int]) – Number of GPUs
cpus (Optional[int]) – Number of CPUs
cluster (Optional[str]) – Cluster to use (optional if you only have one)
image (Optional[str]) – Docker image (e.g. mosaicml/composer)
integrations (List[Dict[str, Any]]) – List of integrations
env_variables (List[Dict[str, str]]) – List of environment variables
command (str) – Command to use when a run starts
parameters (Dict[str, Any]) – Parameters to mount into the environment
entrypoint (str) – Alternative to command
There are two ways to initialize a RunConfig object that can be used to configure and create a run.
The first is by referencing a YAML file, equivalent to the file argument in MCLI:
from mcli.api.runs import RunConfig, create_run
run_config = RunConfig.from_file('hello_world.yaml')
created_run = create_run(run_config)
Alternatively, you can instantiate the RunConfig object directly in Python:
from mcli.api.runs import RunConfig, create_run
cluster = "<your-cluster>"
run_config = RunConfig(
    name='hello-world',
    image='bash',
    command='echo "Hello World!" && sleep 60',
    gpu_type='none',
    cluster=cluster,
)
created_run = create_run(run_config)
These can also be used in combination, for example loading a base configuration file and modifying select fields:
from mcli.api.runs import RunConfig, create_run
special_config = RunConfig.from_file('base_config.yaml')
special_config.gpu_num = 8  # field name as listed in the RunConfig parameters above
created_run = create_run(special_config)
Changing parameters for parameter sweeps
If you are trying to kick off a bunch of runs with similar configurations and different training parameters, make sure you copy the parameters (and any other dict field) instead of modifying them directly:
import copy
from mcli.api.runs import RunConfig, create_run
config = RunConfig.from_file('base_config.yaml')
params = { ... }
for lr in (0.1, 0.01, 0.001):
    # Deep-copy so nested dicts are not shared between runs
    new_params = copy.deepcopy(params)
    new_params['optimizers']['sgd']['lr'] = lr
    config.parameters = new_params
    created_run = create_run(config)
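To see why a deep copy matters for nested parameters, here is a small self-contained sketch using plain dictionaries. The `optimizers` / `sgd` / `lr` layout simply mirrors the sweep example and is otherwise an assumption; no MCLI calls are involved.

```python
import copy

base_params = {"optimizers": {"sgd": {"lr": 0.1}}}

# A shallow copy duplicates only the top-level dict; nested dicts are shared.
shallow = dict(base_params)
shallow["optimizers"]["sgd"]["lr"] = 0.01
leaked_lr = base_params["optimizers"]["sgd"]["lr"]  # the base changed as well

# A deep copy duplicates every nested level, so each run gets its own values.
base_params["optimizers"]["sgd"]["lr"] = 0.1  # reset the base
sweep = []
for lr in (0.1, 0.01, 0.001):
    new_params = copy.deepcopy(base_params)
    new_params["optimizers"]["sgd"]["lr"] = lr
    sweep.append(new_params)
```

After the loop, `base_params` still holds its original learning rate and each entry in `sweep` carries its own value.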
The Run object#
Runs created with create_run() are returned as a Run object.
This object can be used as input to any subsequent run function. For example, you can start a run and then immediately start following it:
created_run = create_run(config)
for line in follow_run_logs(created_run):
    print(line)
- class mcli.api.kube.runs.Run(run_uid, name, status, created_at, updated_at, config, started_at=None, completed_at=None, reason=None, submitted_config=None, nodes=<factory>, _required_properties=('id', 'name', 'status', 'createdAt', 'updatedAt', 'runInput', 'originalRunInput'), _type=None)
A run that has been launched on the MosaicML platform
- Parameters
run_uid (str) – Unique identifier for the run
name (str) – User-defined name of the run
status (RunStatus) – Status of the run at a moment in time
created_at (datetime) – Date and time when the run was created
updated_at (datetime) – Date and time when the run was last updated
config (RunConfig) – The run configuration that was used to launch the run
started_at (Optional[datetime]) – Date and time when the run entered the STARTED RunStatus
completed_at (Optional[datetime]) – Date and time when the run entered the COMPLETED RunStatus
Observing a run#
Getting a run’s logs#
There are two functions for fetching run logs:
get_run_logs(): Gets currently available logs for any run. Ideal for completed runs or checking progress of an active run
follow_run_logs(): Follows logs line-by-line for any run. Ideal for monitoring active runs in real time or waiting until a condition is reached (see also wait_for_run_status())
- mcli.api.kube.runs.get_run_logs(run, rank=None, timestamps=False, timeout=10, future=False, failed=False)
Get the current logs for an active or completed run
Get the current logs for an active or completed run in the MosaicML platform. This returns the full logs as a str, as they exist at the time the request is made. If you want to follow the logs for an active run line-by-line, use follow_run_logs().
- Parameters
run (str | Run) – The run to get logs for. If a name is provided, the remaining required run details will be queried with get_runs().
rank (Optional[int]) – Node rank of a run to get logs for. Defaults to the lowest available rank. This will usually be rank 0 unless something has gone wrong.
timestamps (bool) – If True, each log line will also contain the timestamp at which it was emitted. If you wish to parse out a line’s timestamp and text, you can use LogLine.from_line().
timeout (Optional[float]) – Time, in seconds, in which the call should complete. If the call takes too long, a TimeoutError will be raised. If future is True, this value will be ignored.
future (bool) – Return the output as a Future. If True, the call to get_run_logs() will return immediately and the request will be processed in the background. This takes precedence over the timeout argument. To get the log text, use return_value.result() with an optional timeout argument.
- Raises
KubernetesException (HTTPStatus.NOT_FOUND) – Raised if the requested run does not exist
KubernetesException (HTTPStatus.BAD_REQUEST) – Raised if the run is not yet running, or if the run does not have a node of the requested rank.
- Returns
- mcli.api.kube.runs.follow_run_logs(run, rank=None, timestamps=False, timeout=10, future=False)
Follow the logs for an active or completed run in the MosaicML platform
This returns a generator of individual log lines, line-by-line, and will wait until new lines are produced if the run is still active. If you are only looking for the logs up until the time of the request, consider using get_run_logs() instead.
- Parameters
run (str | Run) – The run to get logs for. If a name is provided, the remaining required run details will be queried with get_runs().
rank (Optional[int]) – Node rank of a run to get logs for. Defaults to the lowest available rank. This will usually be rank 0 unless something has gone wrong.
timestamps (bool) – If True, each log line will also contain the timestamp at which it was emitted. If you wish to parse out a line’s timestamp and text, you can use LogLine.from_line().
timeout (Optional[float]) – Time, in seconds, in which the call should complete. If the call takes too long, a TimeoutError will be raised. If future is True, this value will be ignored.
future (bool) – Return the output as a Future. If True, the call to follow_run_logs() will return immediately and the request will be processed in the background. This takes precedence over the timeout argument. To get the generator, use return_value.result() with an optional timeout argument.
- Returns
If future is False – A line-by-line Generator of the logs for a run
Otherwise – A Future of a line-by-line generator of the logs for a run
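Because `follow_run_logs()` yields lines one at a time, it can be consumed until some condition is met and then abandoned. The following is a minimal sketch of that pattern; `first_matching_line` and `fake_log_stream` are hypothetical helpers written for this example, with the fake generator standing in for the one the API would return.

```python
from typing import Callable, Iterable, Iterator, Optional


def first_matching_line(
    lines: Iterable[str], predicate: Callable[[str], bool]
) -> Optional[str]:
    """Consume lines lazily and return the first one that satisfies predicate."""
    for line in lines:
        if predicate(line):
            return line  # stop reading as soon as the condition is met
    return None  # the stream ended without a match


def fake_log_stream() -> Iterator[str]:
    """Hypothetical stand-in for the generator returned by follow_run_logs()."""
    yield "epoch 0 loss 2.31"
    yield "epoch 1 loss 1.07"
    yield "epoch 2 loss 0.42"


hit = first_matching_line(fake_log_stream(), lambda line: "epoch 1" in line)
miss = first_matching_line(fake_log_stream(), lambda line: "error" in line)
```

The helper accepts any iterable of strings, so in principle the generator from `follow_run_logs(created_run)` could be passed in place of the fake stream.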
Monitoring a run throughout its lifecycle#
- mcli.api.kube.runs.wait_for_run_status(run, status, timeout=None, future=False)
Wait for a launched run to reach a specific status
- Parameters
run (str | Run) – The run whose status should be watched. This can be provided using the run’s name or an existing Run object.
status (str | RunStatus) – Status to wait for. This can be any valid RunStatus value. If the run never reaches this state (e.g. it fails or the wait times out), then an error will be raised. See exception details below.
timeout (Optional[float]) – Time, in seconds, in which the call should complete. If the call takes too long, a TimeoutError will be raised. If future is True, this value will be ignored.
future (bool) – Return the output as a Future. If True, the call to wait_for_run_status() will return immediately and the request will be processed in the background. This takes precedence over the timeout argument. To get the Run output, use return_value.result() with an optional timeout argument.
- Raises
KubernetesException – Raised with status code 404 if the requested run could not be found
RunFailed – Raised if the run failed before reaching the requested status
TimeoutError – Raised if the run did not reach the correct status in the specified time
- Returns
The RunStatus object#
The RunStatus object is attached to each Run object and reflects the most recently observed status of the run.
- class mcli.api.kube.runs.RunStatus(value)
Possible statuses of a run
- PENDING = 'PENDING'
The run has been dispatched and is waiting to be received
- SCHEDULED = 'SCHEDULED'
The run has been scheduled and is waiting to be queued
- QUEUED = 'QUEUED'
The run is queued and is awaiting execution
- STARTING = 'STARTING'
The run is starting up and preparing to run
- RUNNING = 'RUNNING'
The run is actively running
- TERMINATING = 'TERMINATING'
The run is in the process of being terminated
- STOPPING = 'STOPPING'
The run is in the process of being stopped
- COMPLETED = 'COMPLETED'
The run has finished without any errors
- STOPPED = 'STOPPED'
The run has stopped
- FAILED_PULL = 'FAILED_PULL'
The run has failed due to a kubernetes error
- FAILED = 'FAILED'
The run has failed due to an issue at runtime
- UNKNOWN = 'UNKNOWN'
A valid run status cannot be found
- before(other, inclusive=False)
Returns True if this state usually comes “before” the other
- Parameters
other – Another RunStatus
inclusive – If True, equality evaluates to True. Default False.
- Returns
If this state is “before” the other
Example
>>> RunStatus.RUNNING.before(RunStatus.COMPLETED)
True
>>> RunStatus.PENDING.before(RunStatus.RUNNING)
True
- after(other, inclusive=False)
Returns True if this state usually comes “after” the other
- Parameters
other – Another RunStatus
inclusive – If True, equality evaluates to True. Default False.
- Returns
If this state is “after” the other
Example
>>> RunStatus.COMPLETED.after(RunStatus.RUNNING)
True
>>> RunStatus.RUNNING.after(RunStatus.PENDING)
True
- classmethod from_string(run_status)
Convert a string to a valid RunStatus Enum
If the run status string is not recognized, will return RunStatus.UNKNOWN instead of raising a KeyError
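The effect of the `inclusive` flag on `before()` and `after()` is easiest to see in a toy model. The enum below is a hypothetical three-state stand-in, not MCLI's `RunStatus`; it assumes the ordering follows declaration order, which the real implementation may or may not do.

```python
from enum import Enum


class ToyStatus(Enum):
    """Hypothetical stand-in for illustration only; NOT the real RunStatus."""

    PENDING = "PENDING"
    RUNNING = "RUNNING"
    COMPLETED = "COMPLETED"

    @property
    def rank(self) -> int:
        # Position of this member in declaration order (assumed ordering).
        return list(type(self)).index(self)

    def before(self, other: "ToyStatus", inclusive: bool = False) -> bool:
        # Strictly earlier, or equal when inclusive is requested.
        return self.rank < other.rank or (inclusive and self is other)

    def after(self, other: "ToyStatus", inclusive: bool = False) -> bool:
        # Strictly later, or equal when inclusive is requested.
        return self.rank > other.rank or (inclusive and self is other)


strict_same = ToyStatus.RUNNING.before(ToyStatus.RUNNING)
inclusive_same = ToyStatus.RUNNING.before(ToyStatus.RUNNING, inclusive=True)
```

A typical use of such comparisons is checking whether a run has already moved past a given point, for example "at least RUNNING", without listing every later status by hand.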
Listing runs#
All runs that you have launched in the MosaicML platform and have not deleted can be accessed using the get_runs() function.
Optional filters allow you to specify a subset of runs to list by name, cluster, GPU type, GPU number, or status.
- mcli.api.kube.runs.get_runs(runs=None, cluster_names=None, before=None, after=None, gpu_types=None, gpu_nums=None, statuses=None, timeout=10, future=False, clusters=None)
Get a filtered list of runs
List runs that have been launched in the MosaicML platform. The returned list will contain all of the details stored about the requested runs.
- Parameters
runs (Optional[List[str] | List[Run]]) – List of run names on which to get information
cluster_names (Optional[List[str] | List[Cluster]]) – List of cluster names to filter runs. This can be a list of str or Cluster objects. Only runs submitted to these clusters will be returned.
gpu_types (Optional[List[str] | List[GPUType]]) – List of GPU types to filter runs. This can be a list of str or GPUType enums. Only runs scheduled on these GPUs will be returned.
gpu_nums (Optional[List[int]]) – List of GPU counts to filter runs. Only runs scheduled on this number of GPUs will be returned.
statuses (Optional[List[str] | List[RunStatus]]) – List of run statuses to filter runs. This can be a list of str or RunStatus enums. Only runs currently in these phases will be returned.
timeout (Optional[float]) – Time, in seconds, in which the call should complete. If the call takes too long, a TimeoutError will be raised. If future is True, this value will be ignored.
future (bool) – Return the output as a Future. If True, the call to get_runs() will return immediately and the request will be processed in the background. This takes precedence over the timeout argument. To get the list of Run output, use return_value.result() with an optional timeout argument.
- Raises
MAPIException – If connecting to MAPI, raised when a MAPI communication error occurs
KubernetesException – Raised when a Kubernetes error occurs when communicating with only 1 cluster
RuntimeError – Raised when some error occurs in calls to multiple Kubernetes clusters
- Returns
Stopping runs#
- mcli.api.kube.runs.stop_runs(runs, timeout=10, future=False)
Stop a list of runs
Stop a list of runs currently running in the MosaicML platform.
- Parameters
runs (Optional[List[str] | List[Run]]) – A list of runs or run names to stop. Using Run objects is most efficient. See the note below.
timeout (Optional[float]) – Time, in seconds, in which the call should complete. If the call takes too long, a TimeoutError will be raised. If future is True, this value will be ignored.
future (bool) – Return the output as a Future. If True, the call to stop_runs() will return immediately and the request will be processed in the background. This takes precedence over the timeout argument. To get the list of Run output, use return_value.result() with an optional timeout argument.
- Raises
KubernetesException – Raised if stopping any of the requested runs failed. All successfully stopped runs will have the status `RunStatus.STOPPED`. You can freely retry any stopped and unstopped runs if this error is raised due to a connection issue.
- Returns
Note
The Kubernetes API requires the cluster for each run. If you provide runs as a list of names, we will get this by calling get_runs(). Since a common way to get the list of runs is to have already called get_runs(), you can avoid a second call by passing the output of that call in directly.
Warning
Stopping runs does not occur immediately. You may see up to a 40-second delay between your request and the run actually stopping.
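Since stopping is not immediate, one option is to pair the stop request with a wait on the STOPPED status. The sketch below takes the two API functions as arguments so that it can be run here with stubs; `stop_and_wait`, `stop_fn`, `wait_fn`, and the two fakes are hypothetical names introduced for this example, and the 60-second default is an assumption chosen only to exceed the delay mentioned in the warning.

```python
from typing import Any, Callable, List, Tuple


def stop_and_wait(
    run: Any,
    stop_fn: Callable[..., Any],
    wait_fn: Callable[..., Any],
    wait_timeout: float = 60.0,
) -> Any:
    """Request a stop, then block until the run reports STOPPED."""
    stop_fn([run])  # stop_runs() takes a list of runs or run names
    return wait_fn(run, "STOPPED", timeout=wait_timeout)


# Stubs that mimic the documented signatures so the sketch runs offline.
calls: List[Tuple] = []


def fake_stop(runs, timeout=10, future=False):
    calls.append(("stop", list(runs)))
    return runs


def fake_wait(run, status, timeout=None, future=False):
    calls.append(("wait", run, status, timeout))
    return run


result = stop_and_wait("my-run", fake_stop, fake_wait)
```

Against the platform, the intended call would be `stop_and_wait(run, stop_runs, wait_for_run_status)`, with the documented `TimeoutError` and `RunFailed` exceptions from the wait left to the caller to handle.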
Deleting runs#
To delete runs, you must supply the run names or Run objects.
To delete a set of runs, you can use the output of get_runs() or even define your own filters directly:
import datetime as dt
from mcli.api.runs import delete_runs, get_runs
# delete a run by name (delete_runs expects a list)
delete_runs(['delete-this-run'])
# delete failed runs on cluster xyz using 1 or 2 GPUs
failed_runs = get_runs(statuses=['FAILED'], cluster_names=['xyz'], gpu_nums=[1, 2])
delete_runs(failed_runs)
# delete completed runs older than a month with name pattern
completed = get_runs(statuses=['COMPLETED'])
ref_date = dt.datetime.now() - dt.timedelta(days=30)
old_runs = [r for r in completed if 'experiment1' in r.name and r.created_at < ref_date]
delete_runs(old_runs)
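The age-and-name filter from the last example can be pulled into a small helper that is easy to check before anything is deleted. This is a sketch under stated assumptions: `FakeRun` is a hypothetical stand-in exposing only the `name` and `created_at` fields of `Run`, `select_old_runs` is not an MCLI function, and the reference time is passed in explicitly so the result is reproducible.

```python
import datetime as dt
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class FakeRun:
    """Hypothetical stand-in with only the two Run fields used here."""

    name: str
    created_at: dt.datetime


def select_old_runs(
    runs: Iterable[FakeRun], name_part: str, max_age_days: int, now: dt.datetime
) -> List[FakeRun]:
    """Return runs whose name contains name_part and that predate the cutoff."""
    cutoff = now - dt.timedelta(days=max_age_days)
    return [r for r in runs if name_part in r.name and r.created_at < cutoff]


now = dt.datetime(2023, 6, 1)
runs = [
    FakeRun("experiment1-a", dt.datetime(2023, 1, 15)),  # old, name matches
    FakeRun("experiment1-b", dt.datetime(2023, 5, 25)),  # too recent
    FakeRun("experiment2-a", dt.datetime(2023, 1, 15)),  # name does not match
]
old = select_old_runs(runs, "experiment1", 30, now)
```

Inspecting the selected names first, and only then passing the list to `delete_runs()`, gives a chance to catch an overly broad filter.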
- mcli.api.kube.runs.delete_runs(runs, timeout=10, future=False)
Delete a list of runs
Delete a list of runs in the MosaicML platform. Any runs that are currently running will first be stopped.
- Parameters
runs (Optional[List[str] | List[Run]]) – A list of runs or run names to delete. Using Run objects is most efficient. See the note below.
timeout (Optional[float]) – Time, in seconds, in which the call should complete. If the call takes too long, a TimeoutError will be raised. If future is True, this value will be ignored.
future (bool) – Return the output as a Future. If True, the call to delete_runs() will return immediately and the request will be processed in the background. This takes precedence over the timeout argument. To get the list of Run output, use return_value.result() with an optional timeout argument.
- Returns
Note
The Kubernetes API requires the cluster for each run. If you provide runs as a list of names, we will get this by calling get_runs(). Since a common way to get the list of runs is to have already called get_runs(), you can avoid a second call by passing the output of that call in directly.