The goal of Project Fiddle is to build systems infrastructure that systematically speeds up distributed deep neural network (DNN) training while getting the most out of the resources used. Specifically, we are aiming for 100x more efficient training. To achieve this goal, we take a broad view of training: from a single GPU, to multiple GPUs on a machine, all the way to multiple machines in a cluster. Our innovations cut across the systems stack, from the memory subsystem, to the structuring of parallel computation, to the interconnects between GPUs and machines. Our work has generated interest and led to collaborations with product groups such as Cognitive Toolkit and Cloud Server Infrastructure.

