Feature learning and "the linear representation hypothesis" for monitoring and steering LLMs
Mikhail Belkin, HDSI Endowed Chair Professor in AI, Halicioglu Data Science Institute, University of California San Diego
Abstract: A trained Large Language Model (LLM) contains much of human knowledge. Yet it is difficult to gauge the extent or accuracy of that knowledge, as LLMs do not always "know what they know" and may even be unintentionally or actively misleading. In this talk I will discuss feature learning and introduce Recursive Feature Machines, a powerful method originally designed for extracting relevant features from tabular data. I will demonstrate how this technique enables us to detect LLM behaviors and precisely guide them toward almost any desired concept by manipulating a single fixed vector in the LLM activation space.
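As a rough illustration of what "manipulating a single fixed vector in the activation space" can mean, the sketch below monitors a concept by projecting an activation onto a direction and steers by shifting the activation along that direction. It is a toy under stated assumptions, not the speaker's method: the function names `concept_score` and `steer`, the strength `alpha`, and the random stand-ins for the activation and the concept direction are all hypothetical, and the Recursive Feature Machine step that would actually produce the direction is not shown.

```python
import numpy as np

def concept_score(h, v):
    """Monitoring: scalar projection of activation h onto the unit concept direction."""
    unit = v / np.linalg.norm(v)
    return float(h @ unit)

def steer(h, v, alpha):
    """Steering: shift activation h by alpha along the unit concept direction."""
    unit = v / np.linalg.norm(v)
    return h + alpha * unit

# Random placeholders (assumption): a real setup would read h from a model
# layer and obtain v from a feature-learning method.
rng = np.random.default_rng(0)
d = 16
v = rng.normal(size=d)   # hypothetical concept direction
h = rng.normal(size=d)   # hypothetical hidden activation

before = concept_score(h, v)
after = concept_score(steer(h, v, alpha=3.0), v)
print(f"score before: {before:.3f}, after steering: {after:.3f}")
```

Because the direction is normalized, steering with strength `alpha` raises the projection score by exactly `alpha`, which is why the same vector can serve both as a detector and as a control knob in this simplified picture.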
Cost: free
Audience
- Faculty/Staff
- Student
- Post Docs/Docs
- Graduate Students
Contact
Kisa Kowal
(847) 491-3974
Email
Interest
- Academic (general)