Large-Scale Benchmarking of Gene and Expression Encoding Strategies for Single-Cell Foundation Models
ICLR 2026 Workshop on Generative AI for Genomics
We present a systematic benchmark of gene and expression encoding strategies for single-cell foundation models, training models from scratch under controlled conditions and scaling to 10 million cells across 100 diverse datasets. Contrary to common assumptions, we find that pretrained embeddings from large protein language models such as ESM-2 consistently underperform task-specific learned embeddings. Our results provide clear empirical guidance for model design decisions and establish a controlled framework for evaluating encoding strategies in single-cell foundation models.
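
The comparison at the core of the benchmark is, roughly, between a gene-embedding table learned from scratch during pretraining and a lookup of precomputed ESM-2 protein embeddings projected into the model dimension. Below is a minimal PyTorch sketch of the two strategies; the class names, dimensions, and the frozen-lookup design are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn


class LearnedGeneEncoder(nn.Module):
    """Task-specific gene embeddings learned from scratch during pretraining."""

    def __init__(self, n_genes: int, d_model: int):
        super().__init__()
        self.embedding = nn.Embedding(n_genes, d_model)

    def forward(self, gene_ids: torch.Tensor) -> torch.Tensor:
        return self.embedding(gene_ids)


class ProteinEmbeddingGeneEncoder(nn.Module):
    """Gene embeddings taken from precomputed ESM-2 protein vectors (one per
    gene), kept frozen and linearly projected into the model dimension."""

    def __init__(self, esm2_vectors: torch.Tensor, d_model: int, freeze: bool = True):
        super().__init__()
        self.embedding = nn.Embedding.from_pretrained(esm2_vectors, freeze=freeze)
        self.proj = nn.Linear(esm2_vectors.shape[1], d_model)

    def forward(self, gene_ids: torch.Tensor) -> torch.Tensor:
        return self.proj(self.embedding(gene_ids))


# Usage example with placeholder sizes (hypothetical, for illustration only).
n_genes, d_model, d_esm = 20_000, 256, 1280
gene_ids = torch.randint(0, n_genes, (8, 512))   # (batch, genes per cell)

learned = LearnedGeneEncoder(n_genes, d_model)(gene_ids)

esm2_vectors = torch.randn(n_genes, d_esm)        # stand-in for real ESM-2 embeddings
pretrained = ProteinEmbeddingGeneEncoder(esm2_vectors, d_model)(gene_ids)
```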
