In this course, you will analyze how large language models are constructed from diverse text sources and examine the entire model life cycle, from collecting pretraining data to generating meaningful outputs. You'll explore how choices about data type, genre, and tokenization affect a model's performance, and learn how to compare real-world corpora such as Wikipedia, Reddit, and GitHub.

Through hands-on projects, you will design tokenizers, quantify text characteristics, and apply methods like byte-pair encoding to see how different preprocessing strategies shape model capabilities. You'll also investigate how models interpret context by studying keyword-in-context (KWIC) views and embedding-based analysis.

By the end of this course, you will have a clear understanding of how data selection and processing decisions influence the way LLMs behave, preparing you to evaluate or improve existing models.

You are required to have completed the following courses or have equivalent experience before taking this course:

  • LLM Tools, Platforms, and Prompts
  • Language Models and Next-Word Prediction
  • Fine-Tuning LLMs
 

How It Works

Course Length
2 weeks

Effort
6 to 8 hours of study per week

Format
100% online, instructor-led

Who Should Enroll
  • Engineers
  • Developers
  • Analysts
  • Data scientists
  • AI engineers
  • Entrepreneurs
  • Data journalists
  • Product managers
  • Researchers
  • Policymakers
  • Legal professionals
Get It Done 100% Online
Our programs are expressly designed to fit the lives of busy professionals like you.

Learn From Cornell's Top Minds
Courses are personally developed by faculty experts to help you gain today's most in-demand skills.

Power Your Career
Cornell's internationally recognized standard of excellence can set you apart.
