First Model

Let’s train your first 1-billion-parameter GPT model!

Download the following run YAML file as mosaic_gpt_1b.yaml:

name: mosaic-gpt-1b-gpus-8
image: mosaicml/pytorch:1.13.1_cu117-python3.10-ubuntu20.04
gpu_num: 8
gpu_type: a100_40gb

integrations:
  - integration_type: git_repo
    git_repo: mosaicml/examples
    git_branch: v0.0.3
    pip_install: -e .[llm]
    ssh_clone: false

command: |
  cd examples/examples/llm
  python ../common/ --out_root ./my-copy-c4 --splits train_small val \
    --concat_tokens 2048 --tokenizer gpt2 --eos_text '<|endoftext|>'
  composer yamls/mosaic_gpt/1b.yaml \
    train_loader.dataset.split=train_small \
    max_duration=100ba \

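This run clones MosaicML’s LLM code from our public examples repository and trains a 1-billion-parameter GPT language model on the C4 dataset using 8 A100 40GB GPUs. The overrides on the composer command limit this first run to the train_small split and 100 batches.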


The configuration above first runs a conversion script to convert the C4 dataset into a format usable by our streaming dataloader. For more details, see the Streaming documentation.

Submit this run with either the CLI or the Python SDK.

CLI:

mcli run -f mosaic_gpt_1b.yaml --follow

Python SDK:

from mcli.sdk import RunConfig, create_run

config = RunConfig.from_file('mosaic_gpt_1b.yaml')
# config.cluster = <your_cluster_name> # Only needed if you have more than one cluster
run = create_run(config)  # submit the run to the platform

After a brief setup period, training starts:

Starting training...
[batch=0/24800]: epoch: 0
[batch=0/24800]: trainer/global_step: 0
[batch=0/24800]: trainer/batch_idx: 0
[trace]: algorithm_traces/GradientClipping/Event.AFTER_TRAIN_BATCH:1
[batch=0/24800]: memory/alloc_requests: 15805
[batch=0/24800]: memory/free_requests: 15740
[batch=0/24800]: memory/allocated_mem: 1632767455232
[batch=0/24800]: memory/active_mem: 6029999616
[batch=0/24800]: memory/inactive_mem: 269844992
[batch=0/24800]: memory/reserved_mem: 38658899968
[batch=0/24800]: memory/alloc_retries: 0
[batch=0/24800]: trainer/grad_accum: 4
[batch=0/24800]: loss/train/total: 11.2582
[batch=0/24800]: metrics/train/LanguageCrossEntropy: 11.2582
[batch=0/24800]: metrics/train/Perplexity: 77512.6016
[batch=1/24800]: wall_clock/train: 12.1141
[batch=1/24800]: wall_clock/val: 0.0000
[batch=1/24800]: wall_clock/total: 12.1141
[batch=1/24800]: lr-DecoupledAdamW/group0: 0.0000
[batch=1/24800]: trainer/global_step: 1
[batch=1/24800]: trainer/batch_idx: 1
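
The first logged values can be sanity-checked by hand. The sketch below assumes that metrics/train/Perplexity is the exponential of the cross-entropy loss (measured in nats), and it compares the initial loss with the loss of a uniform guess over the 50257-token vocabulary set by vocab_size in the example YAML further down this page. The function names are illustrative only and are not part of Composer or mcli.

```python
import math


def perplexity(cross_entropy_nats: float) -> float:
    """Perplexity is exp(loss) when the loss is measured in nats."""
    return math.exp(cross_entropy_nats)


def uniform_loss(vocab_size: int) -> float:
    """Cross-entropy of a model that gives every token equal probability."""
    return math.log(vocab_size)


if __name__ == "__main__":
    loss = 11.2582  # loss/train/total at batch 0 in the log above
    print(f"perplexity at batch 0: {perplexity(loss):.1f}")
    print(f"uniform-guess loss for vocab_size=50257: {uniform_loss(50257):.4f}")
```

The logged perplexity of 77512.6016 agrees with exp(11.2582) up to rounding of the printed loss, and the initial loss sits slightly above the uniform-guess value of about 10.82, which is plausible for a freshly initialized model.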

Unique Names

The MosaicML platform appends a unique six-character identifier to the run name you provide so that every run name is unique.

View your run status, and the unique run name, at any time with:

> mcli get runs
NAME                         CLUSTER  GPU_TYPE   GPU_NUM  ...  STATUS
mosaic-gpt-1b-gpus-8-3isk9a  r8z2     a100_40gb  8        ...  Running

Let’s stop your run:

mcli stop run <run-name>

Scaling up the number of GPUs is easy. If you have access to multiple nodes, override the GPU count when you submit the run:

mcli run -f mosaic_gpt_1b.yaml --gpus 16 --follow
i  Run mosaic-gpt-1b-gpus-16-4czz submitted. Waiting for it to start...
i  You can press Ctrl+C to quit and follow your run manually.
⠏ Rank 0: Waiting for resources to become available... 0:00:03
⠏ Rank 1: Waiting for resources to become available... 0:00:03
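
When the GPU count changes, the per-GPU workload changes with it. The following is a minimal sketch, assuming the global batch size stays fixed and is divided evenly across GPUs, and that each GPU then accumulates gradients over fixed-size microbatches. It uses global_train_batch_size: 512 and device_train_microbatch_size: 16 from the example YAML later on this page. The helper name grad_accum_steps is hypothetical and is not part of mcli or Composer.

```python
def grad_accum_steps(global_batch_size: int, num_gpus: int, microbatch_size: int) -> int:
    """Number of microbatches each GPU processes per optimizer step."""
    if global_batch_size % num_gpus != 0:
        raise ValueError("global batch size must divide evenly across GPUs")
    per_device = global_batch_size // num_gpus
    if per_device % microbatch_size != 0:
        raise ValueError("per-device batch must be a multiple of the microbatch size")
    return per_device // microbatch_size


if __name__ == "__main__":
    for gpus in (8, 16):
        steps = grad_accum_steps(512, gpus, 16)
        print(f"{gpus} GPUs: {512 // gpus} samples per GPU, {steps} microbatches per step")
```

With 8 GPUs this gives 4, which agrees with trainer/grad_accum: 4 in the training log above; with 16 GPUs it gives 2.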

Clean up all your runs with:

mcli delete runs --all


Our examples repository is designed to be easily modifiable for your own use cases. For example, you could fork the repository and edit the 1b.yaml configuration file.

Alternatively, we recommend using the parameters field, which makes it easy to tweak these settings for each run. Anything under parameters is mounted at /mnt/config/parameters.yaml for your code to access.

The resulting run YAML has the same structure as the one above, except that (1) the Python program is pointed at /mnt/config/parameters.yaml instead of the config file in the repository, and (2) the contents of 1b.yaml are appended under the parameters field.
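
The key=value arguments that follow the YAML path in the run command, such as train_loader.dataset.split=val, appear to override nested keys in the loaded configuration. The sketch below illustrates that idea on a plain dictionary using only the standard library. It is an illustration, not the examples repository's actual implementation: apply_overrides and the sample dictionary are hypothetical, and override values are kept as strings rather than being type-converted.

```python
from copy import deepcopy


def apply_overrides(config: dict, overrides: list[str]) -> dict:
    """Return a copy of config with 'a.b.c=value' overrides applied."""
    result = deepcopy(config)
    for item in overrides:
        dotted_key, sep, value = item.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {item!r}")
        *parents, leaf = dotted_key.split(".")
        node = result
        for name in parents:
            # Create intermediate mappings if the key path does not exist yet.
            node = node.setdefault(name, {})
        node[leaf] = value  # kept as a string in this sketch
    return result


if __name__ == "__main__":
    base = {"train_loader": {"dataset": {"split": "train"}}, "max_duration": "24800ba"}
    merged = apply_overrides(base, ["train_loader.dataset.split=val", "max_duration=100ba"])
    print(merged)
```

Because the function works on a deep copy, the base configuration stays untouched, which mirrors the workflow described here: the values under parameters act as defaults, and each run can tweak a few of them at submission time.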

Example YAML
name: mosaic-gpt-1b-gpus-8
image: mosaicml/pytorch:1.12.1_cu116-python3.9-ubuntu20.04
gpu_num: 8
gpu_type: a100_40gb

integrations:
  - integration_type: git_repo
    git_repo: mosaicml/examples
    git_branch: v0.0.2
    pip_install: -r llm/requirements.txt

command: |
  cd examples/llm
  python --out_root ./my-copy-c4 --splits val
  composer /mnt/config/parameters.yaml \
    train_loader.dataset.split=val \
    progress_bar=false \

parameters:
  data_remote: &data_remote ./my-copy-c4
  data_local: &data_local ./my-copy-c4
  max_seq_len: &max_seq_len 2048
  tokenizer_name: &tokenizer_name gpt2

  # Run Name
  run_name: gpt-1b

  # Model
  model:
    name: mosaic_gpt
    device: meta
    tokenizer_name: *tokenizer_name
    d_model: 2048
    n_heads: 16
    n_layers: 24
    mlp_ratio: 4
    max_seq_len: *max_seq_len
    vocab_size: 50257
    init_std: 0.02
    attn_pdrop: 0.0
    resid_pdrop: 0.0
    emb_pdrop: 0.0
    attn_impl: flash

  # Tokenizer
  tokenizer:
    type: hftokenizer
    args:
        tokenizer_name: *tokenizer_name
        max_seq_len: *max_seq_len

  # Dataloaders
  train_loader:
    name: c4
    dataset:
        remote: *data_remote
        local: *data_local
        split: train
        shuffle: true
        prefetch: 1_000_000
        tokenizer_name: *tokenizer_name
        max_seq_len: *max_seq_len
        group_method: concat
    drop_last: true
    num_workers: 8
    pin_memory: true
    prefetch_factor: 2
    persistent_workers: true
    timeout: 0

  eval_loader:
    name: c4
    dataset:
        remote: *data_remote
        local: *data_local
        split: val
        shuffle: false
        prefetch: 1000
        tokenizer_name: *tokenizer_name
        max_seq_len: *max_seq_len
        group_method: truncate
    drop_last: false
    num_workers: 8
    pin_memory: true
    prefetch_factor: 2
    persistent_workers: true
    timeout: 0

  # Optimization
  scheduler:
    name: cosine_with_warmup
    t_warmup: 100ba
    alpha_f: 0.1

  optimizer:
    name: decoupled_adamw
    lr: 2.0e-4
    betas:
    - 0.9
    - 0.95
    eps: 1.0e-08
    weight_decay: 0.0

  max_duration: 24800ba
  eval_interval: 2000ba
  global_train_batch_size: 512
  grad_clip_norm: 1.0

  # System
  seed: 17
  device_eval_batch_size: 16
  device_train_microbatch_size: 16
  # device_train_microbatch_size: auto
  precision: bf16

  # FSDP
  fsdp_config:
    sharding_strategy: FULL_SHARD
    min_params: 2e8
    mixed_precision: DEFAULT
    activation_checkpointing: true
    activation_cpu_offload: false
    verbose: true

  # Logging
  progress_bar: true
  log_to_console: true

  callbacks:
    speed_monitor:
        window_size: 10
    lr_monitor: {}
    memory_monitor: {}
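
As a rough cross-check of the "1 billion parameter" label and the training budget, the sketch below estimates both from the values in the YAML above. It assumes a standard GPT-style decoder (four attention projections plus a two-layer MLP per block), learned position embeddings, and an output layer tied to the token embedding, and it ignores biases and layer norms. These are assumptions about the architecture, not details confirmed on this page, and both function names are hypothetical.

```python
def estimate_params(d_model: int, n_layers: int, vocab_size: int,
                    max_seq_len: int, mlp_ratio: int = 4) -> int:
    """Approximate weight count, ignoring biases and layer norms."""
    attention = 4 * d_model * d_model                  # Q, K, V and output projections
    mlp = 2 * mlp_ratio * d_model * d_model            # up and down projections
    embeddings = (vocab_size + max_seq_len) * d_model  # token + learned position tables
    return n_layers * (attention + mlp) + embeddings


def tokens_seen(num_batches: int, global_batch_size: int, seq_len: int) -> int:
    """Tokens processed after num_batches optimizer steps."""
    return num_batches * global_batch_size * seq_len


if __name__ == "__main__":
    params = estimate_params(d_model=2048, n_layers=24, vocab_size=50257, max_seq_len=2048)
    print(f"estimated parameters: {params / 1e9:.2f}B")
    print(f"tokens for max_duration=24800ba: {tokens_seen(24800, 512, 2048) / 1e9:.1f}B")
    print(f"tokens for max_duration=100ba: {tokens_seen(100, 512, 2048) / 1e6:.1f}M")
```

Under these assumptions the estimate comes to roughly 1.3 billion weights. The full 24800-batch schedule covers about 26 billion tokens, while the 100ba override used in the first run on this page covers about 105 million.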