
Abhishek Singh Dissertation Defense

Dissertation Title: Decentralized Machine Learning over Fragmented Data

Abstract: 

The remarkable scaling of data and computation has unlocked unprecedented capabilities in text and image generation, raising the question: why hasn't healthcare seen similar breakthroughs? This disparity stems primarily from healthcare data being fragmented across thousands of institutions, each safeguarding patient records in regulatory-compliant silos. The problem is not limited to healthcare but extends to other industries where data is fragmented across institutions and individuals. Rather than centralizing these datasets, an approach that raises regulatory and ethical concerns, this thesis proposes systems and algorithms that decentralize machine learning, enabling models to learn over distributed data where it resides.

Current approaches to decentralized machine learning have centered on Federated Learning (FL), which enables model training across distributed data sources. However, FL's dependence on central coordination, narrow focus on training, and inflexibility with heterogeneous systems limit its applicability in healthcare settings. While the internet's layered protocols demonstrate how distributed systems can collaborate effectively through standardized interfaces and algorithms, applying these distributed computing principles to machine learning introduces unique statistical and optimization challenges. This thesis explores three core themes that address these challenges and enable machine learning to operate effectively in distributed settings.
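
For context, a minimal sketch of the federated averaging (FedAvg) step that underlies most FL systems; the linear-regression stand-in and all names here are illustrative, not taken from the thesis:

```python
# Minimal FedAvg-style round: each silo trains locally, a server
# averages the resulting weights. Linear regression stands in for
# the model; everything here is illustrative, not from the thesis.
import numpy as np

def local_update(weights, data, lr=0.1):
    """One local gradient step on a silo's private (X, y) data."""
    X, y = data
    grad = X.T @ (X @ weights - y) / len(y)  # least-squares gradient
    return weights - lr * grad

def fedavg_round(weights, silos):
    """Broadcast the model, train locally, average the updates."""
    return np.mean([local_update(weights, d) for d in silos], axis=0)

rng = np.random.default_rng(0)
silos = [(rng.normal(size=(20, 5)), rng.normal(size=20)) for _ in range(3)]
w = np.zeros(5)
for _ in range(50):
    w = fedavg_round(w, silos)  # the server is the single point of coordination
```

Note how the server is both the bottleneck and the trust anchor; the three themes below each relax a different part of this picture.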

1) Coordination – Today's coordination algorithms typically rely on static rules or randomized communication, approaches that turn out to be suboptimal when data distributions and institutional capabilities evolve. I present a new system and benchmark framework that enables systematic assessment of different coordination algorithms. Building on this, I propose an adaptive coordination algorithm that leverages historical performance and learning dynamics to evolve coalition patterns, improving overall system convergence and learning efficiency (illustrated in the first sketch after this list).

2) Heterogeneity – Data owners can vary significantly in their data distributions, computational resources, and privacy requirements. To address this heterogeneity, I first present algorithms for privacy-preserving collaborative inference, shifting focus from the traditionally protected training phase to securing the critical inference process (see the second sketch after this list). Next, I develop techniques for distributed training that adapt to heterogeneous computational capabilities across different agents.

3) Scalability – Enabling scaling in decentralized ML requires addressing three key challenges: parallelization, synchronization, and self-scaling. While parallelization has advanced significantly, the other two remain challenging. I present a framework for offline collaboration through sanitized, synthetic datasets that eliminates the need for constant synchronization while preserving privacy. Additionally, I develop open-source protocols and a peer-to-peer system that let researchers deploy decentralized ML solutions without central coordination, facilitating organic system scaling (see the third sketch after this list).
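
To make these themes concrete, the sketches below give one simplified, hypothetical realization of each; none of this code comes from the thesis itself. First, coordination: a minimal epsilon-greedy scheme in which each agent favors collaborators whose past updates most improved its validation metric. The class name, peer names, and feedback signal are all illustrative assumptions.

```python
# Hypothetical adaptive coordination: an epsilon-greedy bandit that
# prefers peers whose past contributions improved validation loss.
# Names and the feedback signal are assumptions for illustration.
import random
from collections import defaultdict

class AdaptiveCoordinator:
    def __init__(self, peers, epsilon=0.2):
        self.peers = list(peers)
        self.epsilon = epsilon          # exploration rate
        self.gain = defaultdict(float)  # running mean benefit per peer
        self.count = defaultdict(int)

    def select(self, k=2):
        """Explore randomly with prob. epsilon, else exploit history."""
        if random.random() < self.epsilon:
            return random.sample(self.peers, k)
        return sorted(self.peers, key=lambda p: -self.gain[p])[:k]

    def feedback(self, peer, improvement):
        """Fold this round's observed benefit into the running mean."""
        self.count[peer] += 1
        self.gain[peer] += (improvement - self.gain[peer]) / self.count[peer]

coord = AdaptiveCoordinator(["hospital_a", "hospital_b", "clinic_c"])
partners = coord.select(k=2)
coord.feedback(partners[0], improvement=0.03)  # e.g., drop in val. loss
```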
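
Second, privacy-preserving collaborative inference: one common pattern, assumed here rather than taken from the thesis, splits a network so raw inputs stay with the data owner and only a perturbed intermediate activation crosses the boundary. The layer sizes and the Gaussian noise level are placeholder choices.

```python
# Split inference sketch: early layers run at the data owner; only a
# noised activation is shared with the model host. Layer sizes and
# the noise scale are placeholder assumptions.
import torch
import torch.nn as nn

client_net = nn.Sequential(nn.Linear(32, 16), nn.ReLU())  # stays with the data owner
server_net = nn.Sequential(nn.Linear(16, 8), nn.ReLU(), nn.Linear(8, 2))

def collaborative_infer(x, noise_scale=0.1):
    with torch.no_grad():
        z = client_net(x)                          # raw input never leaves the client
        z = z + noise_scale * torch.randn_like(z)  # crude perturbation of the cut layer
        return server_net(z)                       # host finishes the forward pass

logits = collaborative_infer(torch.randn(4, 32))   # batch of 4 synthetic inputs
```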
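
Third, coordination-free scaling: gossip averaging is a standard peer-to-peer primitive in which random pairs of peers average their parameters, so the network drifts toward consensus with no server at all. It stands in here for the thesis's actual protocols, which the abstract does not detail.

```python
# Gossip averaging sketch: random pairs of peers average parameters
# with no server; a new peer joins by simply adding a model vector.
# A stand-in for the thesis's P2P protocols, not a description of them.
import numpy as np

rng = np.random.default_rng(0)
models = [rng.normal(size=5) for _ in range(6)]  # one parameter vector per peer

def gossip_step(models):
    """Pick two peers at random; both adopt their pairwise average."""
    i, j = rng.choice(len(models), size=2, replace=False)
    avg = (models[i] + models[j]) / 2
    models[i], models[j] = avg, avg.copy()

for _ in range(200):
    gossip_step(models)  # all peers drift toward the network-wide mean
```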

This thesis identifies and addresses the bottlenecks along these three core themes through a complementary set of solutions: adaptive coordination, heterogeneity-aware training, and scalable asynchronous collaboration. Together, these building blocks can enable a practical framework for unlocking healthcare data silos across institutions and patients of varying capabilities.



Committee members: 

Ramesh Raskar, Associate Professor, Camera Culture, MIT Media Lab
Tamar Sofer, Associate Professor, Department of Biostatistics, Beth Israel Deaconess Medical Center and Harvard Medical School
Martin Jaggi, Associate Professor, EPFL

