Lifeng Nai

佴立峰 · nailifeng [at]

I am currently a computer architect at Google, where I lead the ML system (TPU) codesign and optimization effort for workloads such as LLMs (e.g., Gemini, Bard, Magi) and Ads recommendation. My work spans the hardware-software stack, from future TPU architecture pathfinding through system and compiler optimization to AutoML neural architecture search.

Before Google, I received my Ph.D. from the Georgia Institute of Technology, where I worked in the HPArch lab, advised by Prof. Hyesoon Kim. I also worked in the MARS lab, advised by Dr. Hsien-Hsin S. Lee and co-advised by Dr. Bo Hong.


  • TripLe: Revisiting Pretrained Model Reuse and Progressive Learning for Efficient Vision Transformer Scaling and Searching
    Cheng Fu, Hanxian Huang, Zixuan Jiang, Yun Ni, Lifeng Nai, Gang Wu, Liqun Cheng, Yanqi Zhou, Sheng Li, Andrew Li, Jishen Zhao
    International Conference on Computer Vision (ICCV), 2023

  • TPU v4: An Optically Reconfigurable Supercomputer for Machine Learning with Hardware Support for Embeddings
    Norm Jouppi, George Kurian, Sheng Li, Peter Ma, Rahul Nagarajan, Lifeng Nai, Nishant Patil, Suvinay Subramanian, Andy Swing, Brian Towles, Clifford Young, Xiang Zhou, Zongwei Zhou, David A Patterson
    International Symposium on Computer Architecture (ISCA), 2023

  • V10: Hardware-Assisted NPU Multi-tenancy for Improved Resource Utilization and Fairness
    Yuqi Xue, Yiqi Liu, Lifeng Nai, Jian Huang
    International Symposium on Computer Architecture (ISCA), 2023

  • NeuroMeter: An Integrated Power, Area, and Timing Modeling Framework for Machine Learning Accelerators
    Tianqi Tang, Sheng Li, Lifeng Nai, Norm Jouppi, Yuan Xie
    International Symposium on High-Performance Computer Architecture (HPCA), 2021

  • Thermal-Aware Processing-in-memory Instruction Offloading
    Lifeng Nai, Ramyad Hadidi, He Xiao, Hyojong Kim, Jaewoong Sim, Hyesoon Kim
    Journal of Parallel and Distributed Computing (JPDC), 2019

  • CODA: Enabling Co-location of Computation and Data for Near-Data Processing
    Hyojong Kim, Ramyad Hadidi, Lifeng Nai, Hyesoon Kim, Nuwan Jayasena, Yasuko Eckert, Onur Kayiran, Gabriel H. Loh
    ACM Transactions on Architecture and Code Optimization (TACO), 2018

  • CoolPIM: Thermal-Aware Source Throttling for Efficient PIM Instruction Offloading
    Lifeng Nai, Ramyad Hadidi, He Xiao, Hyojong Kim, Jaewoong Sim, Hyesoon Kim
    International Parallel and Distributed Processing Symposium (IPDPS), 2018

  • CAIRO: A Compiler-Assisted Technique for Enabling Instruction-Level Offloading of Processing-In-Memory
    Ramyad Hadidi, Lifeng Nai, Hyojong Kim, Hyesoon Kim
    ACM Transactions on Architecture and Code Optimization (TACO), 2017

  • SimProf: A Sampling Framework for Data Analytic Workloads
    Jen-Cheng Huang, Lifeng Nai, Pranith Kumar, Hyojong Kim, Hyesoon Kim
    International Parallel and Distributed Processing Symposium (IPDPS), 2017

  • GraphPIM: Enabling Instruction-Level PIM Offloading in Graph Computing Frameworks
    Lifeng Nai, Ramyad Hadidi, Jaewoong Sim, Hyojong Kim, Pranith Kumar, Hyesoon Kim
    International Symposium on High-Performance Computer Architecture (HPCA), 2017

  • Exploring Big Graph Computing --- an Empirical Study from Architectural Perspective
    Lifeng Nai, Yinglong Xia, Ilie G. Tanase, Hyesoon Kim
    Journal of Parallel and Distributed Computing (JPDC), 2016

  • Analyzing Consistency Issues In HMC Atomics
    Pranith Kumar, Lifeng Nai, Hyesoon Kim
    International Symposium on Memory Systems (MEMSYS), 2016

  • LDBC Graphalytics: A Benchmark for Large-Scale Graph Analysis on Parallel and Distributed Platforms
    Alexandru Iosup, Tim Hegeman, Wing Lung Ngai, Stijn Heldens, Arnau Prat Perez, Thomas Manhardt, Mihai Capota, Narayanan Sundaram, Michael Anderson, Ilie G. Tanase, Yinglong Xia, Lifeng Nai, Peter Boncz
    International Conference on Very Large Data Bases (VLDB), 2016

  • GraphBIG: Understanding Graph Computing in the Context of Industrial Solutions
    Lifeng Nai, Yinglong Xia, Ilie G. Tanase, Hyesoon Kim, Ching-Yung Lin
    International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2015

  • Instruction Offloading with HMC 2.0 Standard - A Case Study for Graph Traversals
    Lifeng Nai, Hyesoon Kim
    International Symposium on Memory Systems (MEMSYS), 2015

  • Towards Balance-Affinity Tradeoff in Concurrent Subgraph Traversals
    Yinglong Xia, Lifeng Nai, Jui-Hsin Lai
    International Parallel and Distributed Processing Symposium (IPDPS), 2015

  • Explore Efficient Data Organization for Large Scale Graph Analytics and Storage
    Yinglong Xia, Ilie G. Tanase, Lifeng Nai, Wei Tan, Yanbin Liu, Jason Crawford, Ching-Yung Lin
    International Conference on Big Data (BigData), 2014

  • Concurrent Image Query Using Local Random Walk with Restart on Large Scale Graphs
    Yinglong Xia, Jui-Hsin Lai, Lifeng Nai, Ching-Yung Lin
    Workshop on Multimedia Big Data Computing (MBDC) in conjunction with ICME, 2014

  • A Highly Efficient Runtime and Graph Library for Large Scale Graph Analytics
    Ilie G. Tanase, Yinglong Xia, Lifeng Nai, Wei Tan, Yanbin Liu, Jason Crawford, Ching-Yung Lin
    Workshop on Graph Data Management Experiences and Systems (GRADES) in conjunction with SIGMOD, 2014

  • Cache-Conscious Graph Collaborative Filtering on Multisocket Multicore Systems
    Lifeng Nai, Yinglong Xia, Ching-Yung Lin, Bo Hong, Hsien-Hsin Lee
    ACM International Conference on Computing Frontiers (CF), 2014

  • TBPoint: Reducing Simulation Time for Large Scale GPGPU Kernels
    Jen-Cheng Huang, Lifeng Nai, Hyesoon Kim, Hsien-Hsin Lee
    International Parallel and Distributed Processing Symposium (IPDPS), 2014

  • Reducing False Transactional Conflicts with Speculative Sub-blocking State - An Empirical Study for ASF Transactional Memory System
    Lifeng Nai, Hsien-Hsin Lee
    International Parallel and Distributed Processing Symposium Workshops (IPDPSW), 2013

Notice: This material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author's copyright. These works may not be reposted without the explicit permission of the copyright holder.


  • Accelerated Embedding Layer Computations
    US Patent 11,651,209

  • Multiplier and Adder in Systolic Array
    US Patent App. 17/377,743

  • Neural Network Accelerator in DIMM Form Factor
    US Patent App. 16/994,990

  • Graph-based Online Image Query
    US Patent App. 15/215,864

  • Efficient Property Graph Storage for Streaming/Multi-versioning Graphs
    US Patent App. 15/264,570

  • Trace/Trajectory Reconstruction via Wearable and/or Mobile Sensors for Indoor/Outdoor Location
    US Patent App. 15/263,314

  • Wearable Sensor based System for Person Identification
    US Patent 9,769,166

  • Remote Control System with Muscle Sensor and Alerting Sensor
    US Patent App. 15/286,528

  • A Differential Processing Mechanism for Spark-based Graph Computing
    (Filed at IBM), June 2015

  • A Controlling Method of Host Storage Device for Embedded Systems
    CN 200910116305.8

  • A New Video Decoding Method
    CN 200910116303.9

  • A New Embedded Storage Device Management Method for Multiple Hosts
    CN 200910116304.3

Academic Activities

  • TPC member, IEEE International Conference on Big Data (BigData 2019/2020/2021/2022/2023)
  • ERC member, ACM International Conference on Supercomputing (ICS 2019)
  • TPC member, IEEE International Parallel and Distributed Processing Symposium (IPDPS 2019)
  • TPC member, International Workshop on Big Graph Processing (BGP 2017, in conjunction with ICDCS 2017)
  • TPC member, IEEE International Conference on Parallel and Distributed Systems (ICPADS 2014)