Demystifying the TensorFlow Eager Execution of Deep Learning Inference on a CPU-GPU Tandem
Abstract
Machine Learning (ML) frameworks are tools that facilitate the development and deployment of ML models. These tools are major catalysts of the recent explosion in ML models and hardware accelerators thanks to their high programming abstraction. However, such an abstraction also obfuscates the run-time execution of the model and complicates the understanding and identification of performance bottlenecks. In this paper, we demystify how a modern ML framework manages code execution from a high-level programming language. We focus our work on TensorFlow eager execution, which remains obscure to many users despite being the simplest mode of execution in TensorFlow. We describe in detail the process followed by the runtime to run code on a CPU-GPU tandem. We propose new metrics to analyze the framework's runtime performance overhead. We use our metrics to conduct an in-depth analysis of the inference process of two Convolutional Neural Networks (CNNs) (LeNet-5 and ResNet-50) and a transformer (BERT) for different batch sizes. Our results show that GPU kernel executions need to be long enough to exploit thread parallelism and to effectively hide the runtime overhead of the ML framework.
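For readers unfamiliar with the term, the following minimal sketch (assuming TensorFlow 2.x, where eager execution is enabled by default) illustrates what eager execution looks like from the user's perspective: each operation is dispatched by the runtime and executed immediately on the CPU or GPU, returning concrete values, rather than being recorded into a static graph for later execution.

```python
import tensorflow as tf

# In TensorFlow 2.x eager execution is the default mode.
print(tf.executing_eagerly())  # True

# Each op is dispatched and run as soon as it is called,
# and the result is a concrete tensor, not a graph node.
x = tf.constant([[1.0, 2.0], [3.0, 4.0]])
y = tf.matmul(x, x)  # executed immediately on CPU or GPU
print(y.numpy())
```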