When:
Friday, February 14, 2025
11:00 AM - 12:00 PM CT
Where: Online
Audience: Faculty/Staff - Students - Postdocs - Graduate Students
Cost: Free
Contact:
Kisa Kowal
(847) 491-3974
Group: Department of Statistics and Data Science
Category: Academic, Lectures & Meetings
Towards Data-efficient Training of Large Language Models (LLMs)
Baharan Mirzasoleiman, Assistant Professor, Computer Science Department, UCLA
Abstract: High-quality data is crucial for training LLMs with superior performance. In this talk, I will present two theoretically rigorous approaches to finding smaller subsets of examples that can improve the performance and efficiency of training LLMs. First, I will present a one-shot data selection method for supervised fine-tuning of LLMs. Then, I will discuss an iterative data selection strategy for pretraining or fine-tuning LLMs on imbalanced mixtures of language data. I will conclude with empirical results confirming that these data selection strategies can effectively improve the performance of various LLMs during fine-tuning and pretraining.