Managing Data#

The MosaicML platform is a streaming-only platform: all data (code, datasets, checkpoints, etc.) is streamed in over the network and is not kept after the run is complete. This approach has a few key advantages:

  • Privacy: Your data is ephemeral in our cluster and is destroyed after a training run completes.

  • Portability: You can access your data from anywhere, and we can run the compute on any cluster without worrying about data locality.

  • Versioning: Rather than keeping local copies of datasets everywhere and risking data sprawl, use an object store like S3 with versioning capabilities, or your favorite data versioning provider.

The MosaicML platform does not support persistent storage of data. All data used to train models is streamed into the cluster, and any checkpoints or artifacts are saved remotely.

We recommend the following workflows for our users:

  • Store your datasets in an object store (e.g. AWS S3). Use our fast and accurate Streaming tool, which provides a [StreamingDataset] to initialize remote datasets for training. For more details, see the Streaming documentation.

  • Use one of the popular experiment tracking platforms (e.g. Weights & Biases, CometML, MLflow) to store your run results.
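
In code, the first workflow follows a simple pattern: samples are fetched shard by shard from remote object storage and discarded after use, so nothing persists in the cluster. The stdlib-only sketch below illustrates that pattern with an in-memory stand-in for the object store; in a real run you would instead point Streaming's [StreamingDataset] at your bucket (the names here are hypothetical, for illustration only).

```python
# Conceptual sketch of dataset streaming: samples are pulled one shard
# at a time from a remote store and are not kept after iteration.
# REMOTE_SHARDS is a hypothetical in-memory stand-in for an S3 bucket.
from typing import Dict, Iterator, List

# Stand-in for an object store: shard name -> list of samples.
REMOTE_SHARDS: Dict[str, List[int]] = {
    "shard_0": [0, 1, 2],
    "shard_1": [3, 4, 5],
}

def stream_samples(remote: Dict[str, List[int]]) -> Iterator[int]:
    """Yield samples one shard at a time, keeping no persistent local copy."""
    for shard_name in sorted(remote):
        shard = remote[shard_name]  # fetched over the network in practice
        yield from shard            # the shard is discarded afterwards

samples = list(stream_samples(REMOTE_SHARDS))
```

With the real library, the equivalent is roughly a StreamingDataset constructed with a remote bucket path and a local cache directory, wrapped in your training dataloader; see the Streaming documentation for the exact API.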

Remember to create all of the secrets needed to access your data. For example, to access object stores such as S3, create an s3 secret with:

mcli create secret s3

To authenticate with SFTP servers, create the corresponding sftp-ssh secret. For more information, see our documentation on Secrets.

Our library Composer is designed for this stateless training environment, with dataset streaming and checkpoint uploading natively supported. See File Uploading and Checkpointing for more information.
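
Conceptually, stateless checkpointing is the same pattern in reverse: serialize state locally, push it to remote storage, and drop the local copy so nothing persists in the cluster. Composer handles this for you when you point the Trainer at a remote checkpoint destination; the sketch below only illustrates the underlying pattern, using a temporary directory as a hypothetical stand-in for an object store.

```python
# Conceptual sketch of remote checkpointing: write the checkpoint
# locally, "upload" it to remote storage, and remove the local copy.
# A temporary directory stands in for a hypothetical S3 bucket here;
# Composer performs the real upload natively during training.
import json
import os
import tempfile

def save_checkpoint_remotely(state: dict, remote_dir: str, name: str) -> str:
    """Write a checkpoint locally, push it to remote_dir, drop the local file."""
    local_path = os.path.join(tempfile.mkdtemp(), name)
    with open(local_path, "w") as f:
        json.dump(state, f)
    remote_path = os.path.join(remote_dir, name)
    os.replace(local_path, remote_path)  # the "upload" step; local copy is gone
    return remote_path

remote_dir = tempfile.mkdtemp()  # stand-in for a remote bucket
path = save_checkpoint_remotely({"step": 100, "loss": 0.25}, remote_dir, "ckpt.json")
```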