Zhenkun Cai

蔡振坤

AWS Shanghai AI Lab

Email: zkcai [at] gmail.com

I received my Ph.D. degree from CUHK in 2022, advised by Prof. James Cheng. Before that, I obtained my B.Eng. degree from SCUT in 2017. My research interests lie in large-scale machine learning frameworks and GNN systems.

Currently, I am on the job market.

I joined AWS Shanghai AI Lab as an applied scientist in 2022, working closely with Dr. Minjie Wang and Prof. Zheng Zhang.

Internships

  • Oct 2021 - Aug 2022: Applied Scientist Intern at Amazon Web Services, AI Lab, DGL Team
  • Nov 2020 - Sept 2021: Research Intern at Alibaba Group, Apsara Platform, Fuxi Team
  • July 2019 - Sept 2020: Research Intern at HUAWEI, 2012 Lab, Mindspore Team
  • July 2018 - Aug 2018: Research Intern at Alibaba Group, Apsara Platform, MaxCompute Team
  • July 2017 - July 2018: Research Assistant at CUHK, supervised by Prof. James Cheng

Awards

  • CUHK Postgraduate Studentship, 2018 - 2022
  • Undergraduate National Scholarship, 2014, 2015, 2016
  • ACM-ICPC World Finalist, 2017
  • ACM-ICPC Regional Contest Gold Medals, 2016
  • Guangdong Collegiate Programming Contest (GDCPC) Champion, 2016

Publications

  1. gSampler: General and Efficient GPU-based Graph Sampling for Graph Learning. SOSP 2023

    Ping Gong, Renjie Liu, Zunyao Mao, Zhenkun Cai, Xiao Yan, Cheng Li, Minjie Wang, Zhuozhao Li

  2. FEC: Efficient Deep Recommendation Model Training with Flexible Embedding Communication. SIGMOD 2023

    Kaihao Ma, Xiao Yan, Zhenkun Cai, Yuzhen Huang, Yidi Wu, James Cheng

  3. DGI: Easy and Efficient Inference for GNNs. KDD 2023

    Peiqi Yin, Xiao Yan, Jinjing Zhou, Qiang Fu, Zhenkun Cai, James Cheng, Bo Tang, Minjie Wang

  4. DSP: Efficient GNN Training with Multiple GPUs. PPoPP 2023

    Zhenkun Cai*, Qihui Zhou*, Xiao Yan, Da Zheng, Xiang Song, Chenguang Zheng, James Cheng, George Karypis

  5. TensorOpt: Exploring the Tradeoffs in Distributed DNN Training with Auto-Parallelism. TPDS 2022

    Zhenkun Cai, Kaihao Ma, Xiao Yan, Yidi Wu, Yuzhen Huang, James Cheng, Teng Su, Fan Yu

  6. DGCL: An Efficient Communication Library for Distributed GNN Training. EuroSys 2021

    Zhenkun Cai, Xiao Yan, Yidi Wu, Kaihao Ma, James Cheng, Fan Yu

  7. Seastar: Vertex-Centric Programming for Graph Neural Networks. EuroSys 2021

    Yidi Wu, Kaihao Ma, Zhenkun Cai, Tatiana Jin, Boyang Li, Chenguang Zheng, James Cheng, Fan Yu

  8. Elastic Deep Learning in Multi-Tenant GPU Clusters. TPDS 2021

    Yidi Wu, Kaihao Ma, Xiao Yan, Zhi Liu, Zhenkun Cai, Yuzhen Huang, James Cheng, Han Yuan, Fan Yu

  9. Improving Resource Utilization by Timely Fine-Grained Scheduling. EuroSys 2020

    Tatiana Jin, Zhenkun Cai, Boyang Li, Chenguang Zheng, Guanxian Jiang, James Cheng

  10. FlexPS: Flexible Parallelism Control in Parameter Server Architecture. VLDB 2018

    Yuzhen Huang, Tatiana Jin, Yidi Wu, Zhenkun Cai, Xiao Yan, Yuying Guo, Fan Yang, Jinfeng Li, James Cheng

  11. Scalable De Novo Genome Assembly Using Pregel. ICDE 2018

    Da Yan, Hongzhi Chen, Zhenkun Cai, James Cheng, Bin Shao

Projects

Systems for large-scale machine learning

TensorOpt: Training Large-scale DNNs with Auto-parallelism

  • Supports distributed training of large models (e.g., Transformer, WideResNet) using limited GPU memory
  • Optimally decides the parallelization strategy and automatically generates code for operators in a DNN
  • Developed on top of TensorFlow with user-friendly Python APIs

EDL: An Elastic Deep Learning System on GPUs

  • Supports elastic deep learning, i.e., dynamically adjusting the number of GPUs at runtime
  • Stop-free scaling and a dynamic data pipeline built on Horovod

FlexPS: A Parameter Server with Flexible Parallelism Control

  • A novel multi-stage abstraction that supports flexible parallelism control in the parameter server architecture
  • Optimizations to reduce the overhead of parallelism adjustment

Systems for graph neural networks

DGCL: A Distributed Graph Communication Library for GNN systems

  • A general library to scale single-GPU GNN systems (e.g., DGL and PyG) to the multi-GPU setting
  • Efficient communication kernels optimized for load balancing and bandwidth utilization on NVLink, PCIe, and InfiniBand

Seastar: A Vertex-centric GNN System

  • A vertex-centric programming model for GNNs
  • Kernel optimizations such as operator fusion and vertex parallelism to reduce GPU memory consumption and improve training efficiency

Large-scale cluster scheduling

PPS: Fair and Efficient Scheduling for Multi-Tenant GPU Clusters

  • Probabilistic prediction-based scheduling for clusters with thousands of GPUs
  • Black-box and non-preemptive scheduling for GPU jobs
  • Achieves efficiency and fairness at the same time

Ursa: A Framework for Resource Scheduling and Execution of OLAP Jobs

  • Captures dynamic resource needs at runtime and enables fine-grained, timely scheduling
  • Achieves high resource utilization, which translates into significantly improved makespan and average job completion time (JCT)