When:
Monday, February 24, 2025
3:30 PM - 5:00 PM CT
Where:
Mudd Hall (formerly Seeley G. Mudd Library), 3514, 2233 Tech Drive, Evanston, IL 60208
Webcast Link
(Hybrid)
Audience: Faculty/Staff - Student - Post Docs/Docs - Graduate Students
Contact:
Wynante R Charles
(847) 467-8174
Group: Department of Computer Science (CS)
Category: Academic
Deciphering the Language of DNA with Genome Foundation Models
Abstract:
Deciphering the language of the genome is a fundamental challenge with transformative implications across multiple critical domains. This thesis aims to advance genome analysis and synthesis through the introduction of genome foundation models (gFMs)—general-purpose genomic models designed to address a broad spectrum of genomics and metagenomics problems.
We first demonstrate the promise of self-supervised DNA modeling with DNABERT, the first gFM that outperforms traditional methods in diverse prediction tasks. Building on these insights, we introduce DNABERT-2, incorporating compact sequence representations, modern computational techniques, and multi-species genomes to enhance the applicability, efficiency, and effectiveness of discriminative gFMs.
Moving beyond analysis, we further explore generative modeling in genomics through GenomeOcean, a 4-billion-parameter generative gFM trained on massive metagenomic assemblies. GenomeOcean exhibits a profound understanding of protein functions and higher-order genomic functional modules by generating novel and realistic sequences under diverse prompts. Finally, we propose a generalizable framework for producing effective DNA embeddings tailored to biologically meaningful relationships.
Together, these models open new avenues and provide a robust foundation for advancing genomics and metagenomics research.