When:
Friday, April 14, 2023
11:00 AM - 12:00 PM CT
Where: Chambers Hall, Ruan Conference Room – lower level, 600 Foster St, Evanston, IL 60208
Audience: Faculty/Staff - Post Docs/Docs - Graduate Students
Contact:
Kisa Kowal
(847) 491-3974
Group: Department of Statistics and Data Science
Category: Academic, Lectures & Meetings
Foundations for Feature Learning via Gradient Descent
Mahdi Soltanolkotabi, Director of the Center on AI Foundations for the Sciences (AIF4S) and Associate Professor, Departments of Electrical and Computer Engineering, Computer Science, and Industrial and Systems Engineering, University of Southern California
Abstract: One of the key mysteries in modern learning is that a variety of models, such as deep neural networks, when trained via (stochastic) gradient descent, can extract useful features and learn high-quality representations directly from data while simultaneously fitting the labels. This feature learning capability is also at the forefront of the recent success of a variety of contemporary paradigms such as transformer architectures, self-supervised learning, and transfer learning. Despite a flurry of exciting activity over the past few years, existing theoretical results are often too crude and/or pessimistic to explain feature/representation learning in practical regimes of operation or to serve as a guiding principle for practitioners. Indeed, existing literature often requires unrealistic hyperparameter choices (e.g., very small step sizes, large initialization, or wide models). In this talk I will focus on demystifying this feature/representation learning phenomenon for a variety of problems spanning matrix reconstruction, neural network training, transfer learning, and prompt-tuning via transformers. Our results are based on an intriguing spectral bias phenomenon for gradient descent that puts the iterations on a particular trajectory towards solutions that are not only globally optimal but also generalize well, simultaneously finding good features/representations of the data while fitting the labels. The proofs combine ideas from high-dimensional probability/statistics, optimization, and nonlinear control to develop a precise analysis of model generalization along the trajectory of gradient descent.
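To make the spectral bias idea in the abstract concrete, below is a minimal illustrative sketch in Python/NumPy, not taken from the talk: gradient descent on an overparameterized symmetric matrix factorization M ≈ U Uᵀ from a small random initialization, tracking how much energy the iterate places along each true eigenvector of M. The dominant spectral directions tend to be fit first along the trajectory, one simple instance of the kind of behavior the abstract describes. All hyperparameters and the small-initialization setup are illustrative assumptions, not details of the speaker's results.

# Illustrative sketch only (assumed setup, not the speaker's method): gradient descent
# on an overparameterized factorization M ≈ U U^T started from a small random init.
import numpy as np

rng = np.random.default_rng(0)
n, r = 50, 3                                  # ambient dimension, true rank (assumed)

# Ground-truth rank-3 PSD matrix with well-separated eigenvalues.
Q, _ = np.linalg.qr(rng.standard_normal((n, r)))
M = Q @ np.diag([10.0, 5.0, 1.0]) @ Q.T

U = 1e-3 * rng.standard_normal((n, n))        # small init, fully overparameterized
step = 0.01                                   # constant step size (assumed)

for it in range(501):
    R = U @ U.T - M                           # residual at the current iterate
    if it % 50 == 0:
        # Energy the iterate places along each true eigenvector of M:
        # the leading directions fill in first (spectral bias along the trajectory).
        energy = [float(q @ (U @ U.T) @ q) for q in Q.T]
        print(f"iter {it:3d}  loss {0.5 * np.linalg.norm(R)**2:9.4f}  "
              f"energy along eigvecs {np.round(energy, 2)}")
    U -= step * (R + R.T) @ U                 # gradient of 0.5 * ||U U^T - M||_F^2

With settings like these, one would expect the printed energy along the top eigenvector to saturate first, then the second, then the third, while the loss decreases along the same gradient descent trajectory.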