PRESENTER: Collin Wilson
With the popularity of Large Language Models and the general trend of scaling up model and dataset sizes come challenges in training. Despite hardware improvements, many models are too large to fit onto a single GPU, or large enough that the small batch sizes they force lead to long training times.
One strategy for parallelizing training is Fully Sharded Data Parallel (FSDP), provided by PyTorch. FSDP splits a model into shards and distributes those shards across parallel GPUs, which makes it possible to train very large models and to scale up training. In this talk, we'll discuss implementing FSDP in your training code, examine training performance from an efficiency perspective, and compare FSDP with another parallelization strategy, data parallelism. Some experience with Python, PyTorch, and deep learning is expected.
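To give a flavour of what "implementing FSDP in your training code" looks like, here is a minimal sketch. It assumes the script is launched with torchrun (one process per GPU) and uses a placeholder model, dummy data, and an illustrative training loop; none of these names come from the talk itself.

```python
# Minimal FSDP sketch. Assumes launch via torchrun, which sets the
# RANK, WORLD_SIZE, and LOCAL_RANK environment variables per process.
import os
import torch
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP


def main():
    torch.distributed.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Placeholder model; in practice this would be your own network.
    model = nn.Sequential(
        nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024)
    ).to(local_rank)

    # Wrapping with FSDP shards parameters, gradients, and optimizer
    # state across the participating GPUs.
    model = FSDP(model, device_id=local_rank)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    # Stand-in training loop with random data and a dummy loss.
    for step in range(10):
        inputs = torch.randn(8, 1024, device=local_rank)
        loss = model(inputs).pow(2).mean()
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

    torch.distributed.destroy_process_group()


if __name__ == "__main__":
    main()
```

A script like this would be launched with something along the lines of `torchrun --nproc_per_node=<num_gpus> train.py`, so that each GPU runs its own process and FSDP coordinates the sharded model across them.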