Top Research Labs in Machine Learning Systems (MLSys) 🔬

This is a brief introduction to top-tier MLSys labs for Ph.D. applicants, with a primary focus on those in the United States and mainland China. If there are any omissions, please feel free to contact me to add them.

🇺🇸 United States

Google Brain (merged into Google DeepMind in 2023)

Microsoft Research (MSR)

Catalyst (Carnegie Mellon University)

  • Eric Xing: Prof. Eric Xing was a Ph.D. student of Prof. Michael I. Jordan.

  • Tianqi Chen:

    • TVM: TVM is an open-source framework for optimizing and deploying deep learning models, with its name derived from “Tensor Virtual Machine.” Its primary goal is to optimize and compile deep learning models in an automated manner, enabling efficient execution on various hardware platforms such as CPUs, GPUs, FPGAs, and specialized AI accelerators. A minimal compilation sketch appears after this list.

    • XGBoost: XGBoost is an optimized distributed gradient-boosting library designed to be highly efficient, flexible, and portable. It implements machine learning algorithms under the gradient boosting framework and provides parallel tree boosting (also known as GBDT or GBM) that solves many data science problems quickly and accurately. The same code runs on major distributed environments (Hadoop, SGE, MPI) and can scale to billions of examples. A minimal training sketch appears after this list.

    • MXNet: Apache MXNet (a project that has since been retired by the Apache Software Foundation) is a deep learning framework designed for both efficiency and flexibility. It lets you mix symbolic and imperative programming to maximize efficiency and productivity. At its core, MXNet contains a dynamic dependency scheduler that automatically parallelizes both symbolic and imperative operations on the fly, and a graph optimization layer on top makes symbolic execution fast and memory-efficient. MXNet is portable and lightweight, and scales to many GPUs and machines. A sketch of the symbolic/imperative mix appears after this list.

  • Zhihao Jia: Prof. Zhihao Jia was a Ph.D. student of Prof. Matei Zaharia (who is now at UC Berkeley). He now seems to focus more on systems for LLMs.
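
To make these concrete, here is a minimal sketch of the TVM compilation flow using the classic Relay API (illustrative only: the ONNX file and input shape are placeholders, and exact APIs differ across TVM releases):

```python
import onnx
import tvm
from tvm import relay
from tvm.contrib import graph_executor

# Import a pretrained model (hypothetical path) into Relay, TVM's
# high-level intermediate representation.
onnx_model = onnx.load("model.onnx")
mod, params = relay.frontend.from_onnx(
    onnx_model, shape={"input": (1, 3, 224, 224)}
)

# Compile for a target backend (here: CPU via LLVM); opt_level=3
# enables aggressive graph-level optimizations.
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target="llvm", params=params)

# Load the compiled module onto the chosen device, ready to run.
module = graph_executor.GraphModule(lib["default"](tvm.cpu()))
```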
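
A minimal XGBoost training sketch with the xgboost Python package (the synthetic data and hyperparameters are illustrative):

```python
import numpy as np
import xgboost as xgb

# Synthetic binary-classification data.
rng = np.random.default_rng(0)
X = rng.random((200, 4))
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

# DMatrix is XGBoost's optimized internal data structure.
dtrain = xgb.DMatrix(X, label=y)
params = {"objective": "binary:logistic", "max_depth": 3, "eta": 0.1}

# Train 50 boosting rounds, i.e. 50 trees added sequentially.
booster = xgb.train(params, dtrain, num_boost_round=50)
preds = booster.predict(dtrain)  # predicted probabilities in [0, 1]
```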
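
Finally, a short Gluon sketch of the symbolic/imperative mix MXNet is known for: the model is written imperatively, and hybridize() traces it into a symbolic graph (illustrative; MXNet is no longer actively developed):

```python
from mxnet import autograd, gluon, nd

# Define a small network imperatively with Gluon.
net = gluon.nn.HybridSequential()
net.add(gluon.nn.Dense(8, activation="relu"),
        gluon.nn.Dense(1))
net.initialize()

# hybridize() converts the imperative definition into a symbolic graph,
# enabling graph-level optimization while keeping imperative ergonomics.
net.hybridize()

x = nd.random.uniform(shape=(4, 2))
with autograd.record():   # record operations for autodiff
    y = net(x)
y.backward()              # compute gradients w.r.t. parameters
```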

DSAIL (MIT)

  • Song Han: Known for pruning- and sparsity-related work. Prof. Song Han seems to work on both algorithmic modifications and hardware; he is no longer focused mainly on TinyML, and now works on efficient diffusion models and LLMs.

  • Tim Kraska: Prof. Tim Kraska leads the lab and works on machine learning for systems, e.g., learned index structures.

CSAIL (MIT)

DAWN Project (Stanford)

  • Matei Zaharia (now at UC Berkeley): Prof. Matei Zaharia (UC Berkeley and Databricks) is highly respected for building Apache Spark, one of the most widely used frameworks for distributed data processing, from scratch into billion-dollar technology, and for co-starting other datacenter software such as Apache Mesos and Spark Streaming. He has served as a PC member and chair for major conferences. PipeDream, TASO, and FlexFlow are projects from his Ph.D. students (TASO and FlexFlow were built by Zhihao Jia). One standout aspect of his research is that it addresses real system needs, making it impactful and practical. Not all of his work prioritizes performance; for instance, one recent paper discusses offloading computation to GPUs via annotations for ease of use. Overall, pursuing a Ph.D. under his guidance would likely lead to significant influence in industry. For readers unfamiliar with Spark, a minimal sketch follows below.
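
A minimal PySpark sketch (illustrative: assumes a local Spark installation; the toy data is made up) showing the declarative, lazily evaluated DataFrame API Spark is known for:

```python
from pyspark.sql import SparkSession

# Start a Spark session using local threads as "cluster" workers.
spark = SparkSession.builder.appName("demo").master("local[*]").getOrCreate()

# DataFrame operations are expressed declaratively and executed lazily;
# on a real cluster the same code runs distributed across machines.
df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "a")], ["id", "label"])
df.groupBy("label").count().show()

spark.stop()
```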

Hazy Research (Stanford AI Lab)

Led by Prof. Christopher Ré, this research group focuses on MLSys and also organizes the Stanford MLSys Seminar Series.

RISELab (University of California, Berkeley)

Most recent flagship project: Ray, an open-source framework for scaling distributed Python and machine learning workloads (the lab itself has since been succeeded by Berkeley’s Sky Computing Lab). A minimal sketch of Ray’s task model appears below.
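
A minimal sketch of Ray’s core task model (illustrative; assumes the ray package is installed locally):

```python
import ray

ray.init()  # start a local Ray runtime

@ray.remote
def square(x):
    # Runs as an asynchronous task, potentially on another process or machine.
    return x * x

# Launch tasks in parallel; .remote() returns futures immediately.
futures = [square.remote(i) for i in range(8)]
print(ray.get(futures))  # [0, 1, 4, 9, 16, 25, 36, 49]

ray.shutdown()
```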

Professors at RISELab have offered a course called AI for Systems and Systems for AI (CS294).

Systems Lab (University of Washington)

  • Luis Ceze: Prof. Luis Ceze focuses on programming languages and computer architecture.
    • TVM: See the description of TVM under Tianqi Chen (Catalyst, CMU) above; Prof. Ceze is one of the leaders of the TVM project.
    • Arvind Krishnamurthy: Prof. Arvind Krishnamurthy primarily works on computer networks, applying networking techniques to challenges in distributed machine learning, so the group always has cutting-edge networking support.

SAMPL (University of Washington)

SymbioticLab (University of Michigan, Ann Arbor)

Systems Group (New York University)

Shivaram Venkataraman Research Group (University of Wisconsin, Madison)

  • Shivaram Venkataraman: Prof. Shivaram Venkataraman was a Ph.D. student of Prof. Ion Stoica. His strength lies more in machine learning than in systems. He has not published a large number of papers, but each involves a substantial amount of work.

EcoSystem (University of Toronto)

🇨🇳 China

In mainland China, most MLSys work seems to happen in companies. However, some strong distributed-systems teams also work on MLSys to some extent, such as IPADS (Shanghai Jiao Tong University).

Microsoft Research Asia (MSRA)

PACMAN Group (Tsinghua University)

More oriented toward computer architecture.

Center for Energy-Efficient Computing and Applications (CECA, Peking University)

More oriented toward computer architecture.

Cheng Li’s Research Group (University of Science and Technology of China)

IPADS (Shanghai Jiao Tong University)

Widely regarded as the best systems lab in mainland China; it is now also working on some MLSys projects.

🌟 Appendix

Some people worth following on Zhihu

  • Tianqi Chen: Dr. Tianqi Chen is currently an Assistant Professor at Carnegie Mellon University. He helps run the Catalyst Group.

  • Huaizheng Zhang: Dr. Huaizheng Zhang’s AI-System-School is an open project aimed at collecting and organizing research papers, tools, and resources related to MLSys, Large Language Models (LLM), and Generative AI (GenAI). It provides researchers and engineers with a systematic learning path and practical guide to help them better understand and apply these cutting-edge technologies. Here is his personal website.

  • Yue Zhao: Dr. Yue Zhao is currently an Assistant Professor at the University of Southern California. He was also a student of Prof. Zhihao Jia. Here is his personal website.