Zhenkun Cai

蔡振坤

Amazon Web Services

Email: zekucai@gmail.com

I am an Applied Scientist at AWS, working on large-scale machine learning infrastructure. I joined AWS in 2022, starting in the Shanghai office, where I worked on Graph Neural Network (GNN) systems. Later, I moved to the Santa Clara office in the US, where I now focus on LLM infrastructure. Before joining AWS, I earned my PhD from The Chinese University of Hong Kong (CUHK), advised by Prof. James Cheng.

Awards

  • CUHK Postgraduate Studentship, 2018 - 2022
  • Undergraduate National Scholarship, 2014, 2015, 2016
  • ACM-ICPC World Finalist, 2017
  • ACM-ICPC Regional Contest Gold Medals, 2016
  • Guangdong Collegiate Programming Contest (GDCPC) Champion, 2016

Publications

  1. gSampler: General and Efficient GPU-based Graph Sampling for Graph Learning. SOSP 2023

    Ping Gong, Renjie Liu, Zunyao Mao, Zhenkun Cai, Xiao Yan, Cheng Li, Minjie Wang, Zhuozhao Li

  2. FEC: Efficient Deep Recommendation Model Training with Flexible Embedding Communication. SIGMOD 2023

    Kaihao Ma, Xiao Yan, Zhenkun Cai, Yuzhen Huang, Yidi Wu, James Cheng

  3. DGI: Easy and Efficient Inference for GNNs. KDD 2023

    Peiqi Yin, Xiao Yan, Jinjing Zhou, Qiang Fu, Zhenkun Cai, James Cheng, Bo Tang, Minjie Wang

  4. DSP: Efficient GNN Training with Multiple GPUs. PPoPP 2023

    Zhenkun Cai*, Qihui Zhou*, Xiao Yan, Da Zheng, Xiang Song, Chenguang Zheng, James Cheng, George Karypis

  5. TensorOpt: Exploring the Tradeoffs in Distributed DNN Training with Auto-Parallelism. TPDS 2022

    Zhenkun Cai, Kaihao Ma, Xiao Yan, Yidi Wu, Yuzhen Huang, James Cheng, Teng Su, Fan Yu

  6. DGCL: An Efficient Communication Library for Distributed GNN Training. EuroSys 2021

    Zhenkun Cai, Xiao Yan, Yidi Wu, Kaihao Ma, James Cheng, Fan Yu

  7. Seastar: Vertex-Centric Programming for Graph Neural Networks. EuroSys 2021

    Yidi Wu, Kaihao Ma, Zhenkun Cai, Tatiana Jin, Boyang Li, Chenguang Zheng, James Cheng, Fan Yu

  8. Elastic Deep Learning in Multi-Tenant GPU Clusters. TPDS 2021

    Yidi Wu, Kaihao Ma, Xiao Yan, Zhi Liu, Zhenkun Cai, Yuzhen Huang, James Cheng, Han Yuan, Fan Yu

  9. Improving Resource Utilization by Timely Fine-Grained Scheduling. EuroSys 2020

    Tatiana Jin, Zhenkun Cai, Boyang Li, Chenguang Zheng, Guanxian Jiang, James Cheng

  10. FlexPS: Flexible Parallelism Control in Parameter Server Architecture. VLDB 2018

    Yuzhen Huang, Tatiana Jin, Yidi Wu, Zhenkun Cai, Xiao Yan, Yuying Guo, Fan Yang, Jinfeng Li, James Cheng

  11. Scalable De Novo Genome Assembly Using Pregel. ICDE 2018

    Da Yan, Hongzhi Chen, Zhenkun Cai, James Cheng, Bin Shao

Projects

Systems for large-scale machine learning

TensorOpt: Training Large-scale DNNs with Auto-parallelism

  • Supports distributed training of large models (e.g., Transformer, WideResNet) using limited GPU memory
  • Optimally decides the parallelization strategy and automatically generates code for operators in a DNN
  • Developed on top of TensorFlow with user-friendly Python APIs

EDL: An Elastic Deep Learning System on GPUs

  • Supports elastic deep learning, i.e., dynamically adjusting the number of GPUs at runtime
  • Stop-free scaling and a dynamic data pipeline built on top of Horovod

FlexPS: A Parameter Server with Flexible Parallelism Control

  • A novel multi-stage abstraction to support flexible parallelism control in the parameter server architecture
  • Optimizations to reduce the overhead of parallelism adjustment

Systems for graph neural networks

DGCL: A Distributed Graph Communication Library for GNN systems

  • A general library to scale single-GPU GNN systems (e.g., DGL and PyG) to the multi-GPU setting
  • Efficient communication kernels optimized for load balancing and bandwidth utilization over NVLink, PCIe, and InfiniBand

Seastar: A Vertex-centric GNN System

  • A vertex-centric programming model for GNNs
  • Kernel optimizations such as operator fusion and vertex parallelism to reduce GPU memory consumption and improve training efficiency

Large-scale cluster scheduling

PPS: Fair and Efficient Scheduling for Multi-Tenant GPU Clusters

  • Probabilistic-prediction-based scheduling for clusters with thousands of GPUs
  • Black-box and non-preemptive scheduling for GPU jobs
  • Achieves efficiency and fairness simultaneously

Ursa: A Framework for Both Resource Scheduling and Execution of OLAP Jobs

  • Captures dynamic resource needs at runtime and enables fine-grained, timely scheduling
  • Achieves high resource utilization, which translates into significantly improved makespan and average job completion time (JCT)