Sweep Hyperparameters

A hyperparameter sweep is typically a set of runs that are identical except for the few parameters being tested.

With our Python API, submitting sweeps is straightforward, and the code is easy to adapt to your specific sweeping strategy.

Suppose we start with a simple run configuration as a YAML file simple_sweep.yaml:

name: sweep-test
cluster: <your cluster>
gpu_num: 0
image: bash
command: >
  cat /mnt/config/parameters.yaml
parameters:
  foo: 0
  bar: 1

The parameters field is mounted as a YAML file into the run for your code to access. To sweep over some of these parameters, the following code loads the config above, modifies the parameters, and submits one run per value.

from mcli.sdk import RunConfig, create_run
import copy

# or create directly in python with RunConfig(...)
config = RunConfig.from_file('simple_sweep.yaml')
parameters = copy.deepcopy(config.parameters)

# Sweep over a few different values of 'foo'
runs = []
for foo in [0, 1, 2, 3, 4, 5]:
    # set the name of the run
    config.name = f'example-foo-{foo}'

    # Update the parameters
    # deepcopy for safety
    run_params = copy.deepcopy(parameters)
    run_params['foo'] = foo
    config.parameters = run_params

    # And run!
    run = create_run(config)
    runs.append(run)  # keep a handle so the run can be waited on later
    print(f'Launching run {run.name} with foo {foo}')

For more information about the Python API, see Python API.


For more complicated sweep setups, we always recommend deep-copying the base parameters first, and then copying and modifying those base parameters for each element in the sweep. Otherwise, parameter changes may leak into other runs.
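The deep-copy recommendation above can be sketched for a sweep over several parameters at once. The helper `expand_grid` is a hypothetical name, not part of the mcli SDK, and the nested `optimizer` entry is an assumption added only to show why a shallow copy would not be enough. The helper only builds parameter dictionaries; each one could then be assigned to `config.parameters` before calling `create_run`, as in the loop above.

```python
import copy
import itertools


def expand_grid(base_params, grid):
    """Return one independent parameter dict per combination in `grid`.

    `expand_grid` is a hypothetical helper, not part of the mcli SDK.
    """
    base = copy.deepcopy(base_params)  # snapshot the base parameters once
    keys = list(grid)
    sweep = []
    for values in itertools.product(*(grid[k] for k in keys)):
        params = copy.deepcopy(base)  # fresh copy for every run
        params.update(zip(keys, values))
        sweep.append(params)
    return sweep


# Assumed example parameters; 'optimizer' is illustrative only
base = {'foo': 0, 'bar': 1, 'optimizer': {'lr': 0.1}}
sweep = expand_grid(base, {'foo': [0, 1], 'bar': [1, 2]})

# Mutating a nested value for one run does not leak into the others
sweep[0]['optimizer']['lr'] = 0.5
```

Because every entry is deep-copied from the snapshot, changing a nested value for one run leaves both the base parameters and the other runs untouched.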

Fun with Futures

Every method in the Python API comes with the optional argument future=False. Setting this to True makes the call return a concurrent.futures.Future that resolves to the method's usual return value.

Futures let you, for example, handle the submitted runs as they complete:

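from concurrent.futures import as_completed

from mcli.sdk import wait_for_run_status

# One future per submitted run, each waiting on that run's status
wait_futures = [
    wait_for_run_status(run, status='completed', future=True)
    for run in runs
]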

for run_future in as_completed(wait_futures):
    run = run_future.result()
    print(f'Run {run.name} has completed with status {run.status}')
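
To try the `as_completed` pattern in isolation, without a cluster, here is a minimal sketch that uses only the standard library. `fake_wait_for_run` is a hypothetical stand-in for an SDK call made with `future=True`; it just sleeps for an assumed duration and returns the run name.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def fake_wait_for_run(name, seconds):
    """Hypothetical stand-in for an SDK call made with future=True."""
    time.sleep(seconds)
    return name


# Assumed durations: submitted slowest-first, so completion order
# will typically differ from submission order
durations = {'example-foo-0': 0.3, 'example-foo-1': 0.2, 'example-foo-2': 0.1}

finished = []
with ThreadPoolExecutor(max_workers=len(durations)) as pool:
    futures = [pool.submit(fake_wait_for_run, name, secs)
               for name, secs in durations.items()]
    # as_completed yields each future as soon as its result is ready
    for future in as_completed(futures):
        finished.append(future.result())
        print(f'{finished[-1]} finished')
```

The loop handles each result as soon as it is available instead of blocking on the first submitted run, which is the same benefit the futures returned by the Python API provide.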