API Reference
Runs
| Name | Description |
|---|---|
| create_run | Launch a run |
| delete_runs | Delete a list of runs |
| follow_run_logs | Follow the logs for an active or completed run in the MosaicML platform |
| get_run_logs | Get the current logs for an active or completed run |
| get_runs | Get a filtered list of runs |
| initialize | Initialize the MosaicML platform |
| set_api_key | Set the API key for the MosaicML platform |
| stop_runs | Stop a list of runs |
| wait_for_run_status | Wait for a launched run to reach a specific status |
| Run | A run that has been launched on the MosaicML platform |
| RunConfig | A run configuration for the MosaicML platform |
| RunStatus | Possible statuses of a run |
- mcli.sdk.create_run(run, timeout=10, future=False, _priority=None, _job_type=MCLIJobType.RUN)
Launch a run
Launch a run in the MosaicML platform. The provided run must contain enough details to fully configure the run. If it does not, an error will be thrown.
- Parameters
run (RunConfig) – A run configuration with enough details to launch. The run will be queued and persisted in the run database.
timeout (Optional[float]) – Time, in seconds, in which the call should complete. If the run creation takes too long, a TimeoutError will be raised. If future is True, this value will be ignored.
future (bool) – Return the output as a Future. If True, the call to create_run() will return immediately and the request will be processed in the background. This takes precedence over the timeout argument. To get the Run output, use return_value.result() with an optional timeout argument.
_priority (Optional[PriorityLabel | str]) – DEPRECATED. An optional priority level at which the run should be created. Only effective for certain clusters.
_job_type (MCLIJobType) – DEPRECATED. An optional “job type” descriptor for the run.
- Raises
InstanceTypeUnavailable – Raised if an invalid compute instance is requested
- Returns
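As a rough illustration of the future flag described above, the sketch below launches a run in the background and then blocks on the result. It takes the mcli.sdk module as a parameter so it can be exercised without a cluster. The helper name launch_in_background and the 30-second wait are assumptions made for this example and are not part of the SDK.

```python
def launch_in_background(sdk, name, image, command, gpu_num=1, wait_seconds=30):
    """Launch a run with future=True, then wait for the resulting Run.

    `sdk` is expected to be the imported `mcli.sdk` module (or any object
    exposing the same `RunConfig` and `create_run` names). This helper is
    hypothetical; only `RunConfig` and `create_run` come from the reference.
    """
    # Field names follow the RunConfig parameters documented in this reference.
    config = sdk.RunConfig(name=name, image=image, command=command, gpu_num=gpu_num)

    # With future=True the call returns immediately and `timeout` is ignored.
    pending = sdk.create_run(config, future=True)

    # Block here instead, with our own upper bound on the wait.
    return pending.result(timeout=wait_seconds)
```

With the SDK installed, this could presumably be called as launch_in_background(sdk, "hello", "mosaicml/composer", "echo hi") after `from mcli import sdk`.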
- mcli.sdk.delete_runs(runs, timeout=10, future=False)
Delete a list of runs
Delete a list of runs in the MosaicML platform. Any runs that are currently running will first be stopped.
- Parameters
runs (Optional[List[str] | List[Run]]) – A list of runs or run names to delete. Using Run objects is most efficient. See the note below.
timeout (Optional[float]) – Time, in seconds, in which the call should complete. If the call takes too long, a TimeoutError will be raised. If future is True, this value will be ignored.
future (bool) – Return the output as a Future. If True, the call to delete_runs() will return immediately and the request will be processed in the background. This takes precedence over the timeout argument. To get the list of Run objects, use return_value.result() with an optional timeout argument.
- Returns
Note
The Kubernetes API requires the cluster for each run. If you provide runs as a list of names, we will look up the clusters by calling get_runs(). Since a common way to get the list of runs is to have already called get_runs(), you can avoid a second call by passing the output of that call in directly.
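The note above can be sketched as follows: fetch the runs once with get_runs() and hand the resulting Run objects straight to delete_runs(). The helper name delete_completed_runs is hypothetical, and passing "COMPLETED" as a string relies on the statuses parameter accepting str values, as documented for get_runs().

```python
def delete_completed_runs(sdk):
    """Delete every COMPLETED run, reusing the Run objects from get_runs().

    `sdk` is expected to be the imported `mcli.sdk` module. Returns the names
    of the runs that were passed to delete_runs().
    """
    # One lookup: get_runs() returns full Run objects, including cluster details.
    finished = sdk.get_runs(statuses=["COMPLETED"])
    if not finished:
        return []

    # Passing Run objects (not names) avoids a second get_runs() lookup.
    sdk.delete_runs(finished)
    return [run.name for run in finished]
```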
- mcli.sdk.follow_run_logs(run, rank=None, timestamps=False, timeout=10, future=False)
Follow the logs for an active or completed run in the MosaicML platform
This returns a generator of individual log lines, line-by-line, and will wait until new lines are produced if the run is still active. If you are only looking for the logs up until the time of the request, consider using get_run_logs() instead.
- Parameters
run (str | Run) – The run to get logs for. If a name is provided, the remaining required run details will be queried with get_runs().
rank (Optional[int]) – Node rank of a run to get logs for. Defaults to the lowest available rank. This will usually be rank 0 unless something has gone wrong.
timestamps (bool) – If True, each log line will also contain the timestamp at which it was emitted. If you wish to parse out a line’s timestamp and text, you can use LogLine.from_line().
timeout (Optional[float]) – Time, in seconds, in which the call should complete. If the call takes too long, a TimeoutError will be raised. If future is True, this value will be ignored.
future (bool) – Return the output as a Future. If True, the call to follow_run_logs() will return immediately and the request will be processed in the background. This takes precedence over the timeout argument. To get the generator, use return_value.result() with an optional timeout argument.
- Returns
If future is False – A line-by-line Generator of the logs for a run
Otherwise – A Future of a line-by-line generator of the logs for a run
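Because follow_run_logs() yields lines as they arrive, a caller can stop consuming as soon as it has seen what it needs. The following is a minimal sketch of that pattern; the helper name first_line_containing and the max_lines cap are assumptions added for illustration.

```python
def first_line_containing(sdk, run, needle, max_lines=1000):
    """Follow a run's logs and return the first line containing `needle`.

    `sdk` is expected to be the imported `mcli.sdk` module. Returns None if
    the log stream ends, or `max_lines` lines pass, without a match.
    """
    # The generator blocks while the run is active and no new line is ready.
    for count, line in enumerate(sdk.follow_run_logs(run), start=1):
        if needle in line:
            return line
        if count >= max_lines:
            break
    return None
```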
- mcli.sdk.get_run_logs(run, rank=None, timestamps=False, timeout=10, future=False, failed=False)
Get the current logs for an active or completed run
Get the current logs for an active or completed run in the MosaicML platform. This returns the full logs as a str, as they exist at the time the request is made. If you want to follow the logs for an active run line-by-line, use follow_run_logs().
- Parameters
run (str | Run) – The run to get logs for. If a name is provided, the remaining required run details will be queried with get_runs().
rank (Optional[int]) – Node rank of a run to get logs for. Defaults to the lowest available rank. This will usually be rank 0 unless something has gone wrong.
timestamps (bool) – If True, each log line will also contain the timestamp at which it was emitted. If you wish to parse out a line’s timestamp and text, you can use LogLine.from_line().
timeout (Optional[float]) – Time, in seconds, in which the call should complete. If the call takes too long, a TimeoutError will be raised. If future is True, this value will be ignored.
future (bool) – Return the output as a Future. If True, the call to get_run_logs() will return immediately and the request will be processed in the background. This takes precedence over the timeout argument. To get the log text, use return_value.result() with an optional timeout argument.
- Raises
KubernetesException (HTTPStatus.NOT_FOUND) – Raised if the requested run does not exist
KubernetesException (HTTPStatus.BAD_REQUEST) – Raised if the run is not yet running, or if the run does not have a node of the requested rank.
- Returns
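Since get_run_logs() is described as returning the full log text as a single str, trimming it to the most recent lines is plain string handling. A small sketch, with the hypothetical helper name tail_run_logs:

```python
def tail_run_logs(sdk, run, n=20, rank=None):
    """Return the last `n` lines of a run's current logs as a list of str.

    `sdk` is expected to be the imported `mcli.sdk` module. Relies on
    get_run_logs() returning one str, as described in this reference.
    """
    if n <= 0:
        return []
    text = sdk.get_run_logs(run, rank=rank)
    # splitlines() handles a trailing newline without yielding an empty entry.
    return text.splitlines()[-n:]
```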
- mcli.sdk.get_runs(runs=None, cluster_names=None, before=None, after=None, gpu_types=None, gpu_nums=None, statuses=None, timeout=10, future=False, clusters=None)
Get a filtered list of runs
List runs that have been launched in the MosaicML platform. The returned list will contain all of the details stored about the requested runs.
- Parameters
runs (Optional[List[str] | List[Run]]) – List of runs or run names on which to get information
cluster_names (Optional[List[str] | List[Cluster]]) – List of cluster names to filter runs. This can be a list of str or Cluster objects. Only runs submitted to these clusters will be returned.
gpu_types (Optional[List[str] | List[GPUType]]) – List of GPU types to filter runs. This can be a list of str or GPUType enums. Only runs scheduled on these GPUs will be returned.
gpu_nums (Optional[List[int]]) – List of GPU counts to filter runs. Only runs scheduled on one of these numbers of GPUs will be returned.
statuses (Optional[List[str] | List[RunStatus]]) – List of run statuses to filter runs. This can be a list of str or RunStatus enums. Only runs currently in these statuses will be returned.
timeout (Optional[float]) – Time, in seconds, in which the call should complete. If the call takes too long, a TimeoutError will be raised. If future is True, this value will be ignored.
future (bool) – Return the output as a Future. If True, the call to get_runs() will return immediately and the request will be processed in the background. This takes precedence over the timeout argument. To get the list of Run objects, use return_value.result() with an optional timeout argument.
- Raises
MAPIException – If connecting to MAPI, raised when a MAPI communication error occurs
KubernetesException – Raised when a Kubernetes error occurs when communicating with only 1 cluster
RuntimeError – Raised when some error occurs in calls to multiple Kubernetes clusters
- Returns
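One way to use the filtered list is to summarise it. The sketch below tallies runs by status for an optional set of clusters; the helper name count_runs_by_status is hypothetical, and it assumes each returned Run exposes the status attribute listed under the Run class.

```python
from collections import Counter


def count_runs_by_status(sdk, cluster_names=None):
    """Tally runs by their `status`, optionally restricted to some clusters.

    `sdk` is expected to be the imported `mcli.sdk` module. The Counter is
    keyed by whatever object each Run stores in `status`.
    """
    runs = sdk.get_runs(cluster_names=cluster_names)
    return Counter(run.status for run in runs)
```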
- mcli.sdk.initialize(api_key=None)
Initialize the MosaicML platform
- Parameters
api_key – Optional API key to set
- mcli.sdk.set_api_key(api_key)
Set the API key for the MosaicML platform
- Parameters
api_key – API key value to set
- mcli.sdk.stop_runs(runs, timeout=10, future=False)
Stop a list of runs
Stop a list of runs currently running in the MosaicML platform.
- Parameters
runs (Optional[List[str] | List[Run]]) – A list of runs or run names to stop. Using Run objects is most efficient. See the note below.
timeout (Optional[float]) – Time, in seconds, in which the call should complete. If the call takes too long, a TimeoutError will be raised. If future is True, this value will be ignored.
future (bool) – Return the output as a Future. If True, the call to stop_runs() will return immediately and the request will be processed in the background. This takes precedence over the timeout argument. To get the list of Run objects, use return_value.result() with an optional timeout argument.
- Raises
KubernetesException – Raised if stopping any of the requested runs failed. All successfully stopped runs will have the status RunStatus.STOPPED. You can freely retry any stopped and unstopped runs if this error is raised due to a connection issue.
- Returns
Note
The Kubernetes API requires the cluster for each run. If you provide runs as a list of names, we will look up the clusters by calling get_runs(). Since a common way to get the list of runs is to have already called get_runs(), you can avoid a second call by passing the output of that call in directly.
Warning
Stopping runs does not occur immediately. You may see up to a 40 second delay between your request and the run actually stopping.
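Given the delay mentioned in the warning, a caller that needs the runs to be fully stopped may want to wait explicitly. The sketch below combines stop_runs() with wait_for_run_status(). It assumes stop_runs() returns the list of Run objects (the Returns section is empty in this reference, so this is inferred from the future parameter description), and the helper name stop_and_confirm and the 60-second default are illustrative choices.

```python
def stop_and_confirm(sdk, runs, wait_seconds=60):
    """Request a stop, then wait until each run reports STOPPED.

    `sdk` is expected to be the imported `mcli.sdk` module. Assumes that
    stop_runs() returns a list of Run objects.
    """
    stopped = sdk.stop_runs(runs)

    # The stop may take a while to take effect, so poll each run's status
    # with a bound comfortably above the documented 40 second delay.
    return [
        sdk.wait_for_run_status(run, "STOPPED", timeout=wait_seconds)
        for run in stopped
    ]
```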
- mcli.sdk.wait_for_run_status(run, status, timeout=None, future=False)
Wait for a launched run to reach a specific status
- Parameters
run (str | Run) – The run whose status should be watched. This can be provided using the run’s name or an existing Run object.
status (str | RunStatus) – Status to wait for. This can be any valid RunStatus value. If the run never reaches this state (e.g. it fails or the wait times out), then an error will be raised. See exception details below.
timeout (Optional[float]) – Time, in seconds, in which the call should complete. If the call takes too long, a TimeoutError will be raised. If future is True, this value will be ignored.
future (bool) – Return the output as a Future. If True, the call to wait_for_run_status() will return immediately and the request will be processed in the background. This takes precedence over the timeout argument. To get the Run output, use return_value.result() with an optional timeout argument.
- Raises
KubernetesException – Raised with status code 404 if the requested run could not be found
RunFailed – Raised if the run failed before reaching the requested status
TimeoutError – Raised if the run did not reach the correct status in the specified time
- Returns
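A caller will often want to treat a timeout as "not ready yet" rather than as a failure. The sketch below does that for the RUNNING status. The helper name wait_until_running is hypothetical. Catching both the built-in TimeoutError and concurrent.futures.TimeoutError is a precaution, since the reference does not say which class is raised and the two are distinct on Python versions before 3.11.

```python
import concurrent.futures


def wait_until_running(sdk, run, timeout=300):
    """Wait for `run` to reach RUNNING; return None if the wait times out.

    `sdk` is expected to be the imported `mcli.sdk` module. Other documented
    errors (for example a failed run) are deliberately left to propagate.
    """
    try:
        return sdk.wait_for_run_status(run, "RUNNING", timeout=timeout)
    except (TimeoutError, concurrent.futures.TimeoutError):
        return None
```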
- class mcli.sdk.Run(run_uid, name, status, created_at, updated_at, config, started_at=None, completed_at=None, reason=None, submitted_config=None, nodes=<factory>, _required_properties=('id', 'name', 'status', 'createdAt', 'updatedAt', 'runInput', 'originalRunInput'), _type=None)
A run that has been launched on the MosaicML platform
- Parameters
run_uid (str) – Unique identifier for the run
name (str) – User-defined name of the run
status (RunStatus) – Status of the run at a moment in time
created_at (datetime) – Date and time when the run was created
updated_at (datetime) – Date and time when the run was last updated
config (RunConfig) – The run configuration that was used to launch the run
started_at (Optional[datetime]) – Date and time when the run entered the STARTED RunStatus
completed_at (Optional[datetime]) – Date and time when the run entered the COMPLETED RunStatus
- class mcli.sdk.RunConfig(run_name=None, name=None, gpu_type=None, gpu_num=None, cpus=None, platform=None, cluster=None, image=None, partitions=None, optimization_level=None, integrations=<factory>, env_variables=<factory>, scheduling=<factory>, command='', parameters=<factory>, entrypoint='')
A run configuration for the MosaicML platform
Values in here are not yet validated and some required values may be missing.
- Parameters
name (Optional[str]) – User-defined name of the run
gpu_type (Optional[str]) – GPU type (optional if only one gpu type for your cluster)
gpu_num (Optional[int]) – Number of GPUs
cpus (Optional[int]) – Number of CPUs
cluster (Optional[str]) – Cluster to use (optional if you only have one)
image (Optional[str]) – Docker image (e.g. mosaicml/composer)
integrations (List[Dict[str, Any]]) – List of integrations
env_variables (List[Dict[str, str]]) – List of environment variables
command (str) – Command to use when a run starts
parameters (Dict[str, Any]) – Parameters to mount into the environment
entrypoint (str) – Alternative to command
- class mcli.sdk.RunStatus(value)
Possible statuses of a run
- COMPLETED = 'COMPLETED'
The run has finished without any errors
- FAILED = 'FAILED'
The run has failed due to an issue at runtime
- FAILED_PULL = 'FAILED_PULL'
The run has failed due to a Kubernetes error
- PENDING = 'PENDING'
The run has been dispatched and is waiting to be received
- QUEUED = 'QUEUED'
The run is queued and is awaiting execution
- RUNNING = 'RUNNING'
The run is actively running
- SCHEDULED = 'SCHEDULED'
The run has been scheduled and is waiting to be queued
- STARTING = 'STARTING'
The run is starting up and preparing to run
- STOPPED = 'STOPPED'
The run has stopped
- STOPPING = 'STOPPING'
The run is in the process of being stopped
- TERMINATING = 'TERMINATING'
The run is in the process of being terminated
- UNKNOWN = 'UNKNOWN'
A valid run status cannot be found
- after(other, inclusive=False)
Returns True if this state usually comes “after” the other
- Parameters
other – Another RunStatus
inclusive – If True, equality evaluates to True. Default False.
- Returns
If this state is “after” the other
Example
>>> RunStatus.COMPLETED.after(RunStatus.RUNNING)
True
>>> RunStatus.RUNNING.after(RunStatus.PENDING)
True
- before(other, inclusive=False)
Returns True if this state usually comes “before” the other
- Parameters
other – Another RunStatus
inclusive – If True, equality evaluates to True. Default False.
- Returns
If this state is “before” the other
Example
>>> RunStatus.RUNNING.before(RunStatus.COMPLETED)
True
>>> RunStatus.PENDING.before(RunStatus.RUNNING)
True
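The ordering helpers can be used to filter runs by how far along they are. The sketch below keeps the runs whose status is at or past a target status, using after() with inclusive=True as documented above. The helper name runs_at_or_past is hypothetical, and it assumes each Run carries a RunStatus in its status attribute.

```python
def runs_at_or_past(runs, target):
    """Return the runs whose status equals `target` or usually comes after it.

    `runs` is an iterable of Run-like objects and `target` is a RunStatus
    member, for example RunStatus.RUNNING.
    """
    # inclusive=True makes a run that is exactly at `target` count as a match.
    return [run for run in runs if run.status.after(target, inclusive=True)]
```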