Zhihan Zhou Final Defense Monday, February 24th
Webcast Link (Hybrid)
Deciphering the Language of DNA with Genome Foundation Models
Abstract:
Deciphering the language of the genome is a fundamental challenge with transformative implications across multiple critical domains. This thesis aims to advance genome analysis and synthesis by introducing genome foundation models (gFMs): general-purpose genomic models designed to address a broad spectrum of genomics and metagenomics problems.
We first demonstrate the promise of self-supervised DNA modeling with DNABERT, the first gFM that outperforms traditional methods in diverse prediction tasks. Building on these insights, we introduce DNABERT-2, incorporating compact sequence representations, modern computational techniques, and multi-species genomes to enhance the applicability, efficiency, and effectiveness of discriminative gFMs.
Moving beyond analysis, we further explore generative modeling in genomics through GenomeOcean, a 4-billion-parameter generative gFM trained on massive metagenomic assemblies. GenomeOcean exhibits a profound understanding of protein functions and higher-order genomic functional modules by generating novel and realistic sequences under diverse prompts. Finally, we propose a generalizable framework for producing effective DNA embeddings tailored to biologically meaningful relationships.
Together, this class of models opens new avenues and provides a robust foundation for advancing genomics and metagenomics research.
Wynante R Charles
(847) 467-8174