Name: Language Models and Language Data

Course Overview

In this course, you will analyze how large language models are constructed from diverse text sources and examine the entire model life cycle, from pretraining data collection to generating meaningful outputs. You'll explore how choices about data type, genre, and tokenization affect a model's performance, and you'll learn how to compare real-world corpora such as Wikipedia, Reddit, and GitHub.

Through hands-on projects, you will design tokenizers, quantify text characteristics, and apply methods like byte-pair encoding to see how different preprocessing strategies shape model capabilities. You'll also investigate how models interpret context by studying keyword-in-context (KWIC) views and embedding-based analysis.
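To give a sense of what byte-pair encoding involves, here is a minimal sketch of the merge-learning step. It is an illustration only, not course material: the function names (`most_frequent_pair`, `merge_pair`, `learn_bpe`) and the toy corpus are assumptions, and it works on whitespace-split words with character symbols rather than raw bytes.

```python
from collections import Counter


def most_frequent_pair(words):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    # Ties resolve to the pair that was seen first.
    return max(pairs, key=pairs.get) if pairs else None


def merge_pair(words, pair):
    """Rewrite every word so each occurrence of `pair` becomes one symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(corpus, num_merges):
    """Learn up to `num_merges` merges from a whitespace-separated corpus."""
    # Each word starts as a tuple of single characters.
    words = dict(Counter(tuple(word) for word in corpus.split()))
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:  # every word is already a single symbol
            break
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges, words


# Hypothetical toy corpus, chosen only for illustration.
merges, vocab = learn_bpe("low low low lower lowest", num_merges=2)
print(merges)
```

Running the same procedure on text from different sources (for example, encyclopedia prose versus source code) would yield different merges, which is one way preprocessing choices can shape what a tokenizer covers well.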

By the end of this course, you will have a clear understanding of how data selection and processing decisions influence the way LLMs behave, preparing you to evaluate or improve existing models.

You are required to have completed the following courses or have equivalent experience before taking this course:

LLM Tools, Platforms, and Prompts
Language Models and Next-Word Prediction
Fine-Tuning LLMs

Key Course Takeaways

Summarize the life cycle of a language model, detailing each phase from data collection through inference
Assess the impact of data collection and curation choices on a model's predictive capabilities and domain coverage
Analyze pretraining documents to see how an LLM extends prompts within specific, real-world contexts
Classify and quantify text collections by genre, language, and code to gauge their effect on model behavior
Explain how a pretraining dataset's composition influences tokenizer coverage and performance across different text domains
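As a rough illustration of the keyword-in-context (KWIC) view mentioned in the overview, the sketch below lists every occurrence of a keyword with a fixed window of surrounding text. The function name `kwic`, the `width` parameter, and the sample sentence are assumptions made for this example, not part of the course.

```python
import re


def kwic(text, keyword, width=30):
    """Return one line per keyword occurrence with left and right context."""
    # Collapse whitespace so line breaks in the source do not split a row.
    text = " ".join(text.split())
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    rows = []
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so the keywords line up in a column.
        rows.append(f"{left:>{width}} [{match.group(0)}] {right}")
    return rows


# Hypothetical sample text, chosen only for illustration.
sample = "The model reads text. A larger model reads more text."
for row in kwic(sample, "model", width=12):
    print(row)
```

Lining up occurrences this way makes it easier to compare how the same word is used across documents or genres.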

How It Works

Course Length

2 weeks

Effort

6 to 8 hours of study per week

Format

100% online, instructor-led

Course Author

David Mimno

Associate Professor and Chair of the Department of Information Science, Cornell Bowers Computing and Information Science

David Mimno is an Associate Professor and Chair of the Department of Information Science in the Ann S. Bowers College of Computing and Information Science at Cornell University. He holds a Ph.D. from UMass Amherst and was previously the head programmer at the Perseus Project at Tufts as well as a researcher at Princeton University. Professor Mimno’s work has been supported by the Sloan Foundation, the NEH, and the NSF.

Who Should Enroll

Engineers
Developers
Analysts
Data scientists
AI engineers
Entrepreneurs
Data journalists
Product managers
Researchers
Policymakers
Legal professionals

Get It Done 100% Online

Our programs are expressly designed to fit the lives of busy professionals like you.

Learn From Cornell's Top Minds

Courses are personally developed by faculty experts to help you gain today's most in-demand skills.

Power Your Career

Cornell's internationally recognized standard of excellence can set you apart.

Stack To A Certificate

Large Language Model Fundamentals
