Build an End-to-End ML Pipeline

12 weeks · 3 milestones

Build a machine learning pipeline from raw data to deployed model serving real predictions in production.

Milestone map

3 milestones

Before writing any model code, define the prediction or classification problem precisely, select a real-world dataset (not an unmodified, pre-split tutorial dataset such as Iris or MNIST), and write an architecture decision record (ADR) that justifies the dataset choice, the evaluation metric, and the initial model family. The dataset must come with a genuine problem framing: 'predict whether a customer will churn' or 'classify the sentiment of product reviews', not 'I applied logistic regression to the Titanic dataset'. The ADR is a real ML engineering deliverable; it exists to prevent scope creep, justify design choices, and give the reviewer something concrete to challenge.

Proof required

Submit your architecture decision record (500–800 words) covering: the prediction problem, with a specific business or research context; the dataset (source URL, size, class or target distribution, and a note on known data quality issues); the evaluation metric and why it suits this problem (not 'I chose accuracy because it is standard'); the model family you plan to try first and why; and a reproducibility plan (how you will set seeds, log experiments, and ensure that someone else can reproduce your results).
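The reproducibility plan above can be made concrete with a small sketch. This is one possible approach, not a required implementation: the function names (`set_seed`, `log_run`, `shuffled_split`), the JSON-lines log format, and the 12-character run ID are all assumptions made for illustration. It uses only the Python standard library; NumPy, PyTorch, and similar libraries keep their own random state and would need to be seeded separately.

```python
import hashlib
import json
import random
import time
from pathlib import Path


def set_seed(seed: int) -> None:
    """Seed Python's built-in global RNG (other libraries need their own seeding)."""
    random.seed(seed)


def shuffled_split(n_rows: int, test_fraction: float, seed: int):
    """Return (train_indices, test_indices) from a seeded shuffle."""
    # A local RNG keeps the split independent of any global random state.
    rng = random.Random(seed)
    indices = list(range(n_rows))
    rng.shuffle(indices)
    n_test = int(n_rows * test_fraction)
    return indices[n_test:], indices[:n_test]


def log_run(log_path: Path, config: dict, metrics: dict) -> str:
    """Append one experiment record as a JSON line and return its run ID."""
    # Hash the key-sorted config so identical settings always get the same ID.
    config_json = json.dumps(config, sort_keys=True)
    run_id = hashlib.sha256(config_json.encode("utf-8")).hexdigest()[:12]
    record = {
        "run_id": run_id,
        "timestamp": time.time(),
        "config": config,
        "metrics": metrics,
    }
    with log_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return run_id
```

Calling `shuffled_split(100, 0.2, seed=42)` twice returns the same split both times, which is the property a reviewer would check. Logging the full config next to the metrics means any reported number can be traced back to the settings that produced it.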

What gets checked

Problem framing is specific: it names a decision-maker, the context, and the cost of a wrong prediction in each direction (in most real problems, false positives and false negatives do not cost the same)
Evaluation metric is justified relative to the class imbalance and problem context: a problem with a 1% positive rate should not use accuracy as the primary metric without explanation
Dataset is real, with a source URL and a note on at least one known data quality issue; a clean, pre-split tutorial dataset with no quality issues does not count as a real-world dataset
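To see why the first two checks matter, the sketch below scores two classifiers on a synthetic label set with a 1% positive rate. Everything here is illustrative: the labels are made up, and the error costs (5 per false positive, 200 per false negative) are hypothetical figures that a real decision record would replace with numbers from the actual problem context.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def summarize(y_true, y_pred, cost_fp, cost_fn):
    """Report accuracy, precision, recall, and total cost of errors."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        # Guard against division by zero when nothing is predicted positive.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        "total_cost": fp * cost_fp + fn * cost_fn,
    }


# Synthetic labels: 10 positives followed by 990 negatives (1% positive rate).
y_true = [1] * 10 + [0] * 990

# Classifier A never predicts the positive class.
always_negative = [0] * 1000

# Classifier B catches 8 of the 10 positives and raises 22 false alarms.
flags_some = [1] * 8 + [0] * 2 + [1] * 22 + [0] * 968

# Hypothetical costs: a missed positive is 40 times worse than a false alarm.
report_a = summarize(y_true, always_negative, cost_fp=5.0, cost_fn=200.0)
report_b = summarize(y_true, flags_some, cost_fp=5.0, cost_fn=200.0)

print("always negative:", report_a)
print("flags some:     ", report_b)
```

Under these assumptions, the classifier that never predicts a positive has the higher accuracy (0.99 versus 0.976) but zero recall and roughly four times the total error cost (2000 versus 510). That gap is the kind of argument the metric justification is expected to make.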

Resources

Mastery

UCI ML Repository

Use it for datasets with published academic baselines. Knowing the best published performance on a dataset gives you a meaningful target to benchmark your pipeline against.

Unlocks after completing Foundation + Depth