
Azure empowers easy-to-use, high-performance, and hyperscale model training using DeepSpeed | Azure Blog and Updates


This blog was written in collaboration with the DeepSpeed team, the Azure ML team, and the Azure HPC team at Microsoft.

Large-scale transformer-based deep learning models trained on large amounts of data have shown great results in recent years across a number of cognitive tasks, and they are behind new products and features that augment human capabilities. These models have grown several orders of magnitude in size over the last five years, from the few million parameters of the original transformer model all the way to the latest 530-billion-parameter Megatron-Turing (MT-NLG 530B) model, as shown in Figure 1. There is a growing need for customers to train and fine-tune large models at an unprecedented scale.

[Figure: Hardware has been unable to match the 200+ times growth in AI model size; DeepSpeed enables scaling AI training to thousands of nodes to achieve a 4000+ times speedup.]

Figure 1: Landscape of large models and hardware capabilities.

Azure Machine Learning (AzureML) brings large fleets of the latest GPUs powered by the InfiniBand interconnect to tackle large-scale AI training. We already train some of the largest models, including Megatron/Turing and GPT-3, on Azure. Previously, to train these models, users needed to set up and maintain a complex distributed training infrastructure that usually required several manual and error-prone steps. This led to a subpar experience in terms of both usability and performance.

Today, we are proud to announce a breakthrough in our software stack: using DeepSpeed and 1,024 A100 GPUs, we scaled the training of a 2-trillion-parameter model with a streamlined user experience at 1K+ GPU scale. We are bringing these software innovations to you through AzureML (including a fully optimized PyTorch environment), which offers great performance and an easy-to-use interface for large-scale training.

Customers can now use DeepSpeed on Azure with simple-to-use training pipelines that utilize either the recommended AzureML recipes or bash scripts for VMSS-based environments. As shown in Figure 2, Microsoft is taking a full-stack optimization approach where all the necessary pieces, including the hardware, the OS, the VM image, the Docker image (containing optimized PyTorch, DeepSpeed, ONNX Runtime, and other Python packages), and the user-facing AzureML APIs, have been optimized, integrated, and well-tested for excellent performance and scalability without unnecessary complexity.
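As an illustration of the AzureML path, below is a minimal sketch of submitting a multi-node DeepSpeed job through the AzureML Python SDK (v2). The subscription, workspace, compute, and environment names are hypothetical placeholders; the published AzureML recipe is the authoritative reference.

    # Minimal sketch: submit a multi-node DeepSpeed job via the AzureML SDK v2.
    # Subscription, workspace, compute, and environment names are placeholders.
    from azure.ai.ml import MLClient, command
    from azure.identity import DefaultAzureCredential

    ml_client = MLClient(
        credential=DefaultAzureCredential(),
        subscription_id="<subscription-id>",
        resource_group_name="<resource-group>",
        workspace_name="<workspace>",
    )

    job = command(
        code="./src",  # training script and DeepSpeed config live here
        command="python train.py --deepspeed_config ds_config.json",
        environment="<optimized-pytorch-deepspeed-image>@latest",
        compute="<a100-infiniband-cluster>",
        instance_count=8,  # number of nodes
        distribution={"type": "PyTorch", "process_count_per_instance": 8},
    )

    returned_job = ml_client.jobs.create_or_update(job)
    print(returned_job.studio_url)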

[Figure: Stack diagram of the different layers in the Azure AI software stack.]

Figure 2: Microsoft full-stack optimizations for scalable distributed training on Azure.

This optimized stack enabled us to efficiently scale training of large models using DeepSpeed on Azure. We are happy to share our performance results supporting 2x larger model sizes (2 trillion vs. 1 trillion parameters), scaling to 2x more GPUs (1,024 vs. 512), and up to 1.8x higher compute throughput per GPU (150 TFLOPs vs. 81 TFLOPs) compared to those published on other cloud providers.

We offer near-linear scalability both in terms of the increase in model size and the increase in the number of GPUs. As shown in Figure 3a, with DeepSpeed ZeRO-3, its novel CPU offloading capabilities, and a high-performance Azure stack powered by InfiniBand interconnects and A100 GPUs, we were able to maintain an efficient throughput per GPU (>157 TFLOPs) in a near-linear fashion as the model size increased from 175 billion to 2 trillion parameters. On the other hand, for a given model size, for example 175B, we achieve near-linear scaling as we increase the number of GPUs from 128 all the way to 1,024, as shown in Figure 3b. The key takeaway from the results presented in this blog is that Azure and DeepSpeed together are breaking the GPU memory wall and enabling our customers to easily and efficiently train trillion-parameter models at scale.
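For readers curious what enabling ZeRO-3 with CPU offloading looks like in practice, here is a minimal, hypothetical DeepSpeed configuration sketch; the batch size, precision, and optimizer settings are illustrative, not the exact values behind the results above.

    # Minimal sketch of a DeepSpeed ZeRO-3 configuration with CPU offloading.
    # Values are illustrative, not the exact settings used for these results.
    import torch
    import deepspeed

    # Tiny stand-in model; a real run would use a large transformer such as
    # the ones in the Megatron-DeepSpeed repository.
    model = torch.nn.Sequential(
        torch.nn.Linear(1024, 4096), torch.nn.GELU(), torch.nn.Linear(4096, 1024)
    )

    ds_config = {
        "train_micro_batch_size_per_gpu": 8,
        "fp16": {"enabled": True},
        "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
        "zero_optimization": {
            "stage": 3,  # partition parameters, gradients, and optimizer state
            "offload_optimizer": {"device": "cpu", "pin_memory": True},
            "offload_param": {"device": "cpu", "pin_memory": True},
        },
    }

    # Returns a DeepSpeedEngine wrapping the model; expects a distributed
    # environment set up by the `deepspeed` launcher or an AzureML job.
    model_engine, optimizer, _, _ = deepspeed.initialize(
        model=model, model_parameters=model.parameters(), config=ds_config
    )

With stage 3, DeepSpeed partitions parameters, gradients, and optimizer state across all GPUs, and the offload entries move parameters and optimizer state to host memory, which is what lets model sizes grow well past the memory capacity of any single GPU.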

[Figure: (a) Throughput/GPU measured at ~157 TFLOPs across model sizes from 175 billion to 2 trillion parameters, exhibiting near-perfect scaling. (b) Training throughput scales linearly with the number of GPUs, exhibiting near-perfect scaling efficiency on 1K GPUs.]

Figure 3: (a) Near-perfect throughput/GPU as we increase the model size from 175 billion to 2 trillion parameters (BS/GPU=8); (b) near-perfect performance scaling with the increase in the number of GPU devices for the 175B model (BS/GPU=16). The sequence length is 1024 for both cases.

Learn more

To learn more about the optimizations, technologies, and detailed performance trends presented above, please refer to our extended technical blog.

  • Learn more about DeepSpeed, which is part of Microsoft’s AI at Scale initiative.
  • Learn more about Azure HPC + AI.
  • To get started with DeepSpeed on Azure, please follow our getting started tutorial.
  • The results presented in this blog were produced on Azure by following the recipes and scripts published as part of the Megatron-DeepSpeed repository. The recommended and most easy-to-use method to run the training experiments is to utilize the AzureML recipe.
  • If you are running experiments in a custom environment built using Azure VMs or VMSS, please refer to the bash scripts we provide in Megatron-DeepSpeed (a minimal sketch of the training loop such scripts launch follows this list).
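To complement those recipes, here is a minimal, hypothetical sketch of the training loop such a script runs once the DeepSpeed engine has been created; the model and data are random stand-ins, not the Megatron-DeepSpeed training code itself.

    # Minimal sketch of a DeepSpeed training loop with stand-in model and data.
    # In practice this script is started by the `deepspeed` launcher or by the
    # AzureML recipe, which sets up the distributed environment.
    import torch
    import deepspeed

    model = torch.nn.Linear(1024, 1024)  # stand-in for a large transformer
    ds_config = {
        "train_micro_batch_size_per_gpu": 8,
        "fp16": {"enabled": True},
        "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
        "zero_optimization": {"stage": 3},
    }
    engine, _, _, _ = deepspeed.initialize(
        model=model, model_parameters=model.parameters(), config=ds_config
    )

    for step in range(10):
        # Random stand-in batch; a real job streams from a tokenized corpus.
        x = torch.randn(8, 1024, device=engine.device, dtype=torch.float16)
        loss = engine(x).pow(2).mean()  # dummy loss for illustration
        engine.backward(loss)  # DeepSpeed-managed backward pass and loss scaling
        engine.step()          # optimizer step plus ZeRO bookkeeping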
