First Model#
Let’s train your first 1-billion-parameter GPT model!
Download the following run YAML file as mosaic_gpt_1b.yaml:
name: mosaic-gpt-1b-gpus-8
image: mosaicml/pytorch:1.13.1_cu117-python3.10-ubuntu20.04
gpu_num: 8
gpu_type: a100_40gb
integrations:
  - integration_type: git_repo
    git_repo: mosaicml/examples
    git_branch: v0.0.3
    pip_install: -e .[llm]
    ssh_clone: false
command: |
  cd examples/examples/llm
  python ../common/convert_c4.py --out_root ./my-copy-c4 --splits train_small val \
    --concat_tokens 2048 --tokenizer gpt2 --eos_text '<|endoftext|>'
  composer main.py yamls/mosaic_gpt/1b.yaml \
    train_loader.dataset.split=train_small \
    max_duration=100ba \
    eval_interval=0
This run clones MosaicML’s LLM code from our public examples repository and trains a 1-billion-parameter GPT language model on the C4 dataset with 8x A100 40GB GPUs.
C4
The configuration above first runs a script that converts the C4 dataset into a format usable by our streaming dataloader. For more details, see the Streaming documentation.
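If you want to inspect the converted data yourself before training, here is a minimal sketch that reads the local shards with the mosaicml-streaming package (illustration only; the run above already does this inside its training dataloader, and the exact StreamingDataset signature may differ between streaming versions):
from streaming import StreamingDataset

# Read the shards that convert_c4.py wrote to ./my-copy-c4 (local only, no remote copy).
dataset = StreamingDataset(local='./my-copy-c4', split='val', shuffle=False)
print(len(dataset))       # number of tokenized samples in the val split
print(dataset[0].keys())  # fields written by the conversion script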
Submit the run with either the mcli CLI or the Python SDK; after a brief setup period, training starts:
mcli run -f mosaic_gpt_1b.yaml --follow
from mcli.sdk import RunConfig, create_run
config = RunConfig.from_file('mosaic_gpt_1b.yaml')
# config.cluster = <your_cluster_name> # Only needed if you have more than one cluster
create_run(config)
Starting training...
[batch=0/24800]: epoch: 0
[batch=0/24800]: trainer/global_step: 0
[batch=0/24800]: trainer/batch_idx: 0
[trace]: algorithm_traces/GradientClipping/Event.AFTER_TRAIN_BATCH:1
[batch=0/24800]: memory/alloc_requests: 15805
[batch=0/24800]: memory/free_requests: 15740
[batch=0/24800]: memory/allocated_mem: 1632767455232
[batch=0/24800]: memory/active_mem: 6029999616
[batch=0/24800]: memory/inactive_mem: 269844992
[batch=0/24800]: memory/reserved_mem: 38658899968
[batch=0/24800]: memory/alloc_retries: 0
[batch=0/24800]: trainer/grad_accum: 4
[batch=0/24800]: loss/train/total: 11.2582
[batch=0/24800]: metrics/train/LanguageCrossEntropy: 11.2582
[batch=0/24800]: metrics/train/Perplexity: 77512.6016
[batch=1/24800]: wall_clock/train: 12.1141
[batch=1/24800]: wall_clock/val: 0.0000
[batch=1/24800]: wall_clock/total: 12.1141
[batch=1/24800]: lr-DecoupledAdamW/group0: 0.0000
[batch=1/24800]: trainer/global_step: 1
[batch=1/24800]: trainer/batch_idx: 1
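A quick sanity check on the metrics above: the logged train perplexity is simply the exponential of the cross-entropy loss.
import math

math.exp(11.2582)  # ≈ 77512.6, matching metrics/train/Perplexity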
Unique Names
The MosaicML platform appends a unique six-character identifier to your provided run name to ensure uniqueness.
View your run’s status, including its unique name, at any time with:
> mcli get runs
NAME CLUSTER GPU_TYPE GPU_NUM ... STATUS
mosaic-gpt-1b-gpus-8-3isk9a r8z2 a100_40gb 8 ... Running
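The same information is available from the Python SDK; the sketch below assumes the SDK exposes a get_runs() helper that mirrors the CLI (see the mcli SDK reference for the full API):
from mcli.sdk import get_runs

# Assumed SDK counterpart to `mcli get runs`: list your runs and their statuses.
for run in get_runs():
    print(run.name, run.status)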
Stop your run at any time with:
mcli stop run <run-name>
Scaling up the number of GPUs is easy. If you have access to multiple nodes, simply run:
mcli run -f mosaic_gpt_1b.yaml --gpus 16 --follow
i Run mosaic-gpt-1b-gpus-16-4czz submitted. Waiting for it to start...
i You can press Ctrl+C to quit and follow your run manually.
⠏ Rank 0: Waiting for resources to become available... 0:00:03
⠏ Rank 1: Waiting for resources to become available... 0:00:03
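The same scale-up works from the Python SDK by overriding the GPU count before submitting. This is a sketch that reuses only the RunConfig fields shown above; it assumes gpu_num is exposed as an attribute, just like cluster in the earlier snippet:
from mcli.sdk import RunConfig, create_run

config = RunConfig.from_file('mosaic_gpt_1b.yaml')
config.gpu_num = 16  # request 16 GPUs instead of the 8 in the YAML
create_run(config)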
Clean up all your runs with:
mcli delete runs --all
Customization#
Our examples repository is designed to be easily modifiable for your own use cases. For example, you could fork the repository and edit the 1b.yaml configuration file.
Alternatively, we recommend using the parameters field to make it easy to tweak these settings with each run. Anything under parameters is mounted at /mnt/config/parameters.yaml for your code to access.
The resulting run YAML looks the same as above, except that (1) the Python program is pointed at /mnt/config/parameters.yaml instead of its own config, and (2) the contents of 1b.yaml are appended under the parameters field.
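For example, any entrypoint can read the mounted parameters with a few lines of Python (a sketch assuming PyYAML is available in the image; the examples repo’s main.py already consumes this file via the command in the YAML below):
import yaml

# The contents of the run's `parameters` field are mounted here at runtime.
with open('/mnt/config/parameters.yaml') as f:
    params = yaml.safe_load(f)

print(params['model']['d_model'])  # 2048 in the example below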
Example YAML
name: mosaic-gpt-1b-gpus-8
image: mosaicml/pytorch:1.12.1_cu116-python3.9-ubuntu20.04
gpu_num: 8
gpu_type: a100_40gb
integrations:
  - integration_type: git_repo
    git_repo: mosaicml/examples
    git_branch: v0.0.2
    pip_install: -r llm/requirements.txt
command: |
  cd examples/llm
  python convert_c4.py --out_root ./my-copy-c4 --splits val
  composer main.py /mnt/config/parameters.yaml \
    train_loader.dataset.split=val \
    progress_bar=false \
    run_name=$COMPOSER_RUN_NAME
parameters:
  data_remote: &data_remote ./my-copy-c4
  data_local: &data_local ./my-copy-c4
  max_seq_len: &max_seq_len 2048
  tokenizer_name: &tokenizer_name gpt2

  # Run Name
  run_name: gpt-1b

  # Model
  model:
    name: mosaic_gpt
    device: meta
    tokenizer_name: *tokenizer_name
    d_model: 2048
    n_heads: 16
    n_layers: 24
    mlp_ratio: 4
    max_seq_len: *max_seq_len
    vocab_size: 50257
    init_std: 0.02
    attn_pdrop: 0.0
    resid_pdrop: 0.0
    emb_pdrop: 0.0
    attn_impl: flash

  # Tokenizer
  tokenizer:
    type: hftokenizer
    args:
      tokenizer_name: *tokenizer_name
      max_seq_len: *max_seq_len

  # Dataloaders
  train_loader:
    name: c4
    dataset:
      remote: *data_remote
      local: *data_local
      split: train
      shuffle: true
      prefetch: 1_000_000
      tokenizer_name: *tokenizer_name
      max_seq_len: *max_seq_len
      group_method: concat
    drop_last: true
    num_workers: 8
    pin_memory: true
    prefetch_factor: 2
    persistent_workers: true
    timeout: 0

  eval_loader:
    name: c4
    dataset:
      remote: *data_remote
      local: *data_local
      split: val
      shuffle: false
      prefetch: 1000
      tokenizer_name: *tokenizer_name
      max_seq_len: *max_seq_len
      group_method: truncate
    drop_last: false
    num_workers: 8
    pin_memory: true
    prefetch_factor: 2
    persistent_workers: true
    timeout: 0

  # Optimization
  scheduler:
    name: cosine_with_warmup
    t_warmup: 100ba
    alpha_f: 0.1

  optimizer:
    name: decoupled_adamw
    lr: 2.0e-4
    betas:
      - 0.9
      - 0.95
    eps: 1.0e-08
    weight_decay: 0.0

  max_duration: 24800ba
  eval_interval: 2000ba
  global_train_batch_size: 512
  grad_clip_norm: 1.0

  # System
  seed: 17
  device_eval_batch_size: 16
  device_train_microbatch_size: 16
  # device_train_microbatch_size: auto
  precision: bf16

  # FSDP
  fsdp_config:
    sharding_strategy: FULL_SHARD
    min_params: 2e8
    mixed_precision: DEFAULT
    activation_checkpointing: true
    activation_cpu_offload: false
    verbose: true

  # Logging
  progress_bar: true
  log_to_console: true

  callbacks:
    speed_monitor:
      window_size: 10
    lr_monitor: {}
    memory_monitor: {}
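The dotted overrides on the composer command line (for example, train_loader.dataset.split=val) are merged on top of this file by main.py. Below is a sketch of the pattern, assuming an omegaconf-style merge; check main.py in the examples repo for the exact handling:
import sys
from omegaconf import OmegaConf

# Load the mounted parameters, then layer dotted key=value overrides from the CLI on top.
base = OmegaConf.load('/mnt/config/parameters.yaml')
overrides = OmegaConf.from_cli(sys.argv[1:])  # e.g. ['train_loader.dataset.split=val']
cfg = OmegaConf.merge(base, overrides)
print(cfg.train_loader.dataset.split)  # -> val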