Large-Scale Benchmarking of Gene and Expression Encoding Strategies for Single-Cell Foundation Models
ICLR 2026 Workshop on Generative AI for Genomics
We present a systematic benchmark of gene and expression encoding strategies for single-cell foundation models, training models from scratch under controlled conditions and scaling to 10 million cells across 100 diverse datasets. Contrary to common assumptions, we find that pretrained embeddings from large protein language models such as ESM-2 consistently underperform task-specific learned embeddings. Our results provide clear empirical guidance for model design decisions and establish a controlled framework for evaluating encoding strategies in single-cell foundation models.
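
The comparison at the core of the benchmark is, roughly, between a gene-embedding table learned from scratch during pretraining and a lookup of precomputed ESM-2 protein embeddings projected into the model dimension. Below is a minimal PyTorch sketch of the two strategies; the class names, dimensions, and the frozen-lookup design are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn


class LearnedGeneEncoder(nn.Module):
    """Task-specific gene embeddings learned from scratch during pretraining."""

    def __init__(self, n_genes: int, d_model: int):
        super().__init__()
        self.embedding = nn.Embedding(n_genes, d_model)

    def forward(self, gene_ids: torch.Tensor) -> torch.Tensor:
        return self.embedding(gene_ids)


class ProteinEmbeddingGeneEncoder(nn.Module):
    """Gene embeddings taken from precomputed ESM-2 protein vectors (one per
    gene), kept frozen and linearly projected into the model dimension."""

    def __init__(self, esm2_vectors: torch.Tensor, d_model: int, freeze: bool = True):
        super().__init__()
        self.embedding = nn.Embedding.from_pretrained(esm2_vectors, freeze=freeze)
        self.proj = nn.Linear(esm2_vectors.shape[1], d_model)

    def forward(self, gene_ids: torch.Tensor) -> torch.Tensor:
        return self.proj(self.embedding(gene_ids))


# Usage example with placeholder sizes (hypothetical, for illustration only).
n_genes, d_model, d_esm = 20_000, 256, 1280
gene_ids = torch.randint(0, n_genes, (8, 512))   # (batch, genes per cell)

learned = LearnedGeneEncoder(n_genes, d_model)(gene_ids)

esm2_vectors = torch.randn(n_genes, d_esm)        # stand-in for real ESM-2 embeddings
pretrained = ProteinEmbeddingGeneEncoder(esm2_vectors, d_model)(gene_ids)
```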
