Zhenkun Cai

蔡振坤

Amazon Web Services

Email: zekucai@gmail.com

I am an Applied Scientist at AWS, working on large-scale machine learning infrastructure. I joined AWS in 2022, starting in the Shanghai office, where I worked on Graph Neural Network (GNN) systems. Later, I moved to the Santa Clara office in the US, where I now focus on LLM infrastructure. Before joining AWS, I earned my PhD from The Chinese University of Hong Kong (CUHK), advised by Prof. James Cheng.

Awards

  • CUHK Postgraduate Studentship, 2018 - 2022
  • Undergraduate National Scholarship, 2014, 2015, 2016
  • ACM-ICPC World Finalist, 2017
  • ACM-ICPC Regional Contest Gold Medals, 2016
  • Guangdong Collegiate Programming Contest (GDCPC) Champion, 2016

Publications

  1. gSampler: General and Efficient GPU-based Graph Sampling for Graph Learning. SOSP 2023

    Ping Gong, Renjie Liu, Zunyao Mao, Zhenkun Cai, Xiao Yan, Cheng Li, Minjie Wang, Zhuozhao Li

  2. FEC: Efficient Deep Recommendation Model Training with Flexible Embedding Communication. SIGMOD 2023

    Kaihao Ma, Xiao Yan, Zhenkun Cai, Yuzhen Huang, Yidi Wu, James Cheng

  3. DGI: Easy and Efficient Inference for GNNs. KDD 2023

    Peiqi Yin, Xiao Yan, Jinjing Zhou, Qiang Fu, Zhenkun Cai, James Cheng, Bo Tang, Minjie Wang

  4. DSP: Efficient GNN Training with Multiple GPUs. PPoPP 2023

    Zhenkun Cai*, Qihui Zhou*, Xiao Yan, Da Zheng, Xiang Song, Chenguang Zheng, James Cheng, George Karypis

  5. TensorOpt: Exploring the Tradeoffs in Distributed DNN Training with Auto-Parallelism. TPDS 2022

    Zhenkun Cai, Kaihao Ma, Xiao Yan, Yidi Wu, Yuzhen Huang, James Cheng, Teng Su, Fan Yu

  6. DGCL: An Efficient Communication Library for Distributed GNN Training. EuroSys 2021

    Zhenkun Cai, Xiao Yan, Yidi Wu, Kaihao Ma, James Cheng, Fan Yu

  7. Seastar: Vertex-Centric Programming for Graph Neural Networks. EuroSys 2021

    Yidi Wu, Kaihao Ma, Zhenkun Cai, Tatiana Jin, Boyang Li, Chenguang Zheng, James Cheng, Fan Yu

  8. Elastic Deep Learning in Multi-Tenant GPU Clusters. TPDS 2021

    Yidi Wu, Kaihao Ma, Xiao Yan, Zhi Liu, Zhenkun Cai, Yuzhen Huang, James Cheng, Han Yuan, Fan Yu

  9. Improving Resource Utilization by Timely Fine-Grained Scheduling. EuroSys 2020

    Tatiana Jin, Zhenkun Cai, Boyang Li, Chenguang Zheng, Guanxian Jiang, James Cheng

  10. FlexPS: Flexible Parallelism Control in Parameter Server Architecture. VLDB 2018

    Yuzhen Huang, Tatiana Jin, Yidi Wu, Zhenkun Cai, Xiao Yan, Yuying Guo, Fan Yang, Jinfeng Li, James Cheng

  11. Scalable De Novo Genome Assembly Using Pregel. ICDE 2018

    Da Yan, Hongzhi Chen, Zhenkun Cai, James Cheng, Bin Shao

Projects

Systems for large-scale machine learning

TensorOpt: Training Large-scale DNNs with Auto-parallelism

  • Supports distributed training of large models (e.g., Transformer, WideResNet) using limited GPU memory
  • Optimally decides the parallelization strategy and automatically generates code for operators in a DNN
  • Developed on top of TensorFlow with user-friendly Python APIs

EDL: An Elastic Deep Learning System on GPUs

  • Supports elastic deep learning, i.e., dynamically adjusting the number of GPUs at runtime
  • Stop-free scaling and a dynamic data pipeline built on top of Horovod

FlexPS: A Parameter Server with Flexible Parallelism Control

  • A novel multi-stage abstraction to support flexible parallelism control in the parameter server architecture
  • Optimizations to reduce the overhead of parallelism adjustment

Systems for graph neural networks

DGCL: A Distributed Graph Communication Library for GNN systems

  • A general library to scale single-GPU GNN systems (e.g., DGL and PyG) to the multi-GPU setting
  • Efficient communication kernels optimized for load balancing and bandwidth utilization over NVLink, PCIe, and InfiniBand

Seastar: A Vertex-centric GNN System

  • A vertex-centric programming model for GNNs
  • Kernel optimizations such as operator fusion and vertex parallelism to reduce GPU memory consumption and improve training efficiency

Large-scale cluster scheduling

PPS: Fair and Efficient Scheduling for Multi-Tenant GPU Clusters

  • Probabilistic-prediction-based scheduling for clusters with thousands of GPUs
  • Black-box and non-preemptive scheduling for GPU jobs
  • Achieves efficiency and fairness simultaneously

Ursa: A Framework for Both Resource Scheduling and Execution of OLAP Jobs

  • Captures dynamic resource needs at runtime and enables fine-grained, timely scheduling
  • Achieves high resource utilization, which translates into significantly improved makespan and average job completion time (JCT)