NeurIPS 2018 Systems for ML Paper Reading Meetup

Transcript
  • 1. NeurIPS 2018 Systems for ML Paper Reading Meetup, Feb. 8, 2019, 福田圭祐 (Keisuke Fukuda), PREFERRED NETWORKS, INC.
  • 2. Paper presented: Image Classification at Supercomputer Scale. Chris Ying, Sameer Kumar, Dehao Chen, Tao Wang, Youlong Cheng. Google, Inc.
  • 3. Background: The ImageNet Racing Gangs. [Bar chart: training time of ResNet-50 (90 epochs) on ImageNet, Time [min], for results reported from Nov 2017 through Nov 2018.] (Compiled by 福田 from various sources.)
  • 4. Abstract • They trained a ResNet-50 model on the ImageNet dataset in 2.2 minutes • They used a 1024-chip TPU v3 Pod • Key technical challenges and contributions: – Mixed Precision Training – Learning rate scheduling (Gradual warmup & learning rate decay) – LARS – Distributed Batch Normalization – Input pipeline optimization – 2-D Torus Allreduce 4
  • 5. Cloud TPU v3 5
  • 6. TPU v3 Pod 6 (image from [1], The Next Platform) • 1024 chips (256 Cloud TPU v3 ?) • 107.5 PFLOPS peak (in bfloat16) • Interconnect: ???
  • 7. Technical challenges • Model Accuracy: –The “large batch” problem • Training Speed: –Feeding the accelerators –Gradient sharing (= communication) 7
  • 8. Mixed Precision Training • bfloat16 format: – 7 bits for mantissa, 8 bits for exponent – (cf. IEEE 754-2008 binary16/fp16: 10 bits for mantissa, 5 bits for exponent) • Convolutions are all in bfloat16 • Other parts are in float32 – BN, loss, gradient summation 8
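A minimal NumPy sketch of the bfloat16 format itself (an illustration of the bit layout, not the TPU implementation): bfloat16 keeps float32's 8-bit exponent and truncates the mantissa to 7 bits, so the format can be emulated by zeroing the lower 16 bits of a float32. The helper name `to_bfloat16` and the use of truncation instead of round-to-nearest-even are illustrative choices.

```python
import numpy as np

def to_bfloat16(x):
    """Emulate bfloat16 by keeping only the top 16 bits of each float32.

    bfloat16 reuses float32's 8-bit exponent (same dynamic range) but
    truncates the mantissa from 23 bits to 7.  Real hardware typically
    rounds to nearest even; plain truncation is used here for brevity.
    """
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    return ((bits >> 16) << 16).view(np.float32)   # low 16 mantissa bits zeroed

x = np.array([3.14159265, 1e-8, 1e30], dtype=np.float32)
print(to_bfloat16(x))   # values survive (wide dynamic range) but keep only ~2-3 decimal digits
```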
  • 9. Learning rate scheduling • Learning rate schedules – Gradual warmup and learning rate decay • LARS (Layer-wise Adaptive Rate Scaling) 9
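Both ideas on this slide can be sketched in a few lines of NumPy. The warmup/decay shape, step counts, and all hyperparameter values below are illustrative assumptions rather than the paper's exact schedule; the LARS trust ratio follows You et al. (2017).

```python
import numpy as np

def warmup_then_decay(step, base_lr, warmup_steps, total_steps):
    """Linear warmup to base_lr, then a polynomial-style decay toward 0.
    (Illustrative shape; the paper's exact warmup length and decay differ.)"""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    frac = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * (1.0 - frac) ** 2

def lars_update(w, grad, global_lr, weight_decay=1e-4, trust_coeff=0.001):
    """One LARS step for a single layer (You et al., 2017).

    The layer-wise "trust ratio" rescales the global learning rate by
    ||w|| / (||grad|| + wd * ||w||), so layers whose gradients are small
    relative to their weights still take reasonably sized steps.
    """
    w_norm = np.linalg.norm(w)
    g_norm = np.linalg.norm(grad)
    trust = trust_coeff * w_norm / (g_norm + weight_decay * w_norm + 1e-12)
    return w - global_lr * trust * (grad + weight_decay * w)
```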
  • 10. Distributed Batch Normalization • In most past large-batch ImageNet studies, BN is computed locally (“per-microbatch” or “per-replica”) • Distributed Batch Normalization: compute BN statistics over multiple replicas 10
  • 11. Distributed Batch Normalization (cont.) 11 In the final experiment, BN batch size = 64 (group size = 4)
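A toy NumPy sketch of the grouping idea (not the TPU implementation, which shares the statistics via a small cross-replica all-reduce): replicas are partitioned into groups, and each group pools its BN mean and variance, so the effective BN batch is group_size × per-replica batch. The shapes, group size, and helper name `grouped_batch_norm` are illustrative.

```python
import numpy as np

def grouped_batch_norm(acts, group_size, eps=1e-5):
    """Normalize activations using statistics shared within replica groups.

    acts: array of shape (num_replicas, per_replica_batch, features).
    Replicas are split into groups of `group_size`; each group pools its
    mean/variance (in practice via a cross-replica all-reduce), so the
    effective BN batch = group_size * per_replica_batch.
    """
    n, b, f = acts.shape
    out = np.empty_like(acts)
    for g in range(0, n, group_size):
        chunk = acts[g:g + group_size]                 # one replica group
        mean = chunk.mean(axis=(0, 1), keepdims=True)  # pooled statistics
        var = chunk.var(axis=(0, 1), keepdims=True)
        out[g:g + group_size] = (chunk - mean) / np.sqrt(var + eps)
    return out

# e.g. 16 replicas, 16 images each, group size 4 -> BN batch of 64
x = np.random.randn(16, 16, 8).astype(np.float32)
y = grouped_batch_norm(x, group_size=4)
```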
  • 12. Input Pipeline Optimization • Dataset sharding and caching (“Scatter” dataset over TPUs) • Pipelining input and compute by prefetching • Fused JPEG decode and cropping • Multicore processing 12
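A hedged tf.data sketch of how these four optimizations might look on a single input host: the file pattern, TFRecord feature key, shard layout, and the fixed 224×224 crop window are placeholders (the real pipeline uses random crops and TPU-specific infeed), and `tf.data.AUTOTUNE` assumes TF 2.x.

```python
import tensorflow as tf

def make_dataset(file_pattern, host_index, num_hosts, batch_size):
    """Illustrative per-host input pipeline: shard, cache, fused decode+crop, prefetch."""
    files = tf.data.Dataset.list_files(file_pattern, shuffle=False)
    files = files.shard(num_hosts, host_index)           # each host reads only its shard

    ds = files.interleave(tf.data.TFRecordDataset,
                          num_parallel_calls=tf.data.AUTOTUNE)
    ds = ds.cache()                                       # assumes the shard fits in host memory

    def parse(record):
        feats = tf.io.parse_single_example(
            record, {"image/encoded": tf.io.FixedLenFeature([], tf.string)})
        # Fused JPEG decode + crop: only the cropped region is decoded.
        crop = tf.constant([0, 0, 224, 224], dtype=tf.int32)   # [y, x, h, w], placeholder
        image = tf.io.decode_and_crop_jpeg(feats["image/encoded"], crop, channels=3)
        return tf.image.convert_image_dtype(image, tf.float32)

    ds = ds.map(parse, num_parallel_calls=tf.data.AUTOTUNE)    # multicore preprocessing
    ds = ds.batch(batch_size, drop_remainder=True)
    return ds.prefetch(tf.data.AUTOTUNE)                  # overlap input with accelerator compute
```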
  • 13. Allreduce acceleration 13 • Two-phase, concurrent ring all-reduce on a 2-D torus network • All-reduce the two halves of the gradients separately • Utilize the capacity of the 2-D, duplex network hardware.
  • 14. Allreduce acceleration (cont.) 14 [Benchmark chart comparing all-reduce schemes; “Better” marks the direction of improvement.]
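A toy NumPy emulation of the two-phase idea (the real system runs concurrent ring all-reduces over the torus links; here each ring reduction is replaced by a plain sum for clarity): the gradient is split in half, one half is reduced along the torus rows first and then the columns, the other half in the opposite order, so both network dimensions carry traffic at the same time. The grid shape and helper name `torus_allreduce_2d` are illustrative.

```python
import numpy as np

def torus_allreduce_2d(grads):
    """Toy emulation of the 2-D scheme: grads has shape (rows, cols, n).

    Each replica's gradient is split in half; one half is reduced along
    torus rows first and then columns, the other half in the opposite
    order (concurrently on real hardware, using both link directions).
    """
    rows, cols, n = grads.shape
    a, b = grads[..., : n // 2], grads[..., n // 2 :]

    # Half A: reduce along rows (axis 1), then along columns (axis 0).
    a = a.sum(axis=1, keepdims=True).sum(axis=0, keepdims=True)
    # Half B: reduce along columns first, then rows.
    b = b.sum(axis=0, keepdims=True).sum(axis=1, keepdims=True)

    reduced = np.concatenate([a, b], axis=-1)        # global sum of both halves
    return np.broadcast_to(reduced, grads.shape)     # every replica gets the result

g = np.random.randn(4, 8, 10)                        # 4x8 torus, 10 gradient values each
out = torus_allreduce_2d(g)
assert np.allclose(out[0, 0], g.reshape(-1, 10).sum(axis=0))
```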
  • 15. Results 15
  • 16. Summary • They trained a ResNet-50 model on the ImageNet dataset in 2.2 minutes • They used a 1024-chip TPU v3 Pod • Key technical challenges and contributions: – Mixed Precision Training – Learning rate scheduling (Gradual warmup & learning rate decay) – LARS – Distributed Batch Normalization – Input pipeline optimization – 2-D Torus Allreduce 16
  • 17. References [1] The Next Platform, “Tearing Apart Google’s TPU 3.0 AI Coprocessor”, https://www.nextplatform.com/2018/05/10/tearing-apart-googles-tpu-3-0-ai-coprocessor/ [2] Cloud TPU, https://cloud.google.com/tpu/?hl=ja 17