Your First Dataset
Garbage in = garbage out — understand your data before you model it
📌 Before You Start
- Complete Module 1 first
- Familiarity with Python lists and dictionaries
Estimated time: ~50 minutes
What you’ll learn: What a dataset is, the vocabulary of ML data (samples, features, labels), and how to explore a dataset using basic Python before building any model.
💡 The Big Idea
Before you can train a model, you need to understand your data. This is called Exploratory Data Analysis (EDA), and experienced ML engineers often spend a large share of their project time on it.
A dataset is a table. Every row is one example (a sample). Every column is a measurement about that example (a feature). One special column is the target — the thing you’re trying to predict.
The famous Iris dataset is the “Hello World” of machine learning. It contains 150 samples (one per iris flower), each described by 4 features and labeled with one of 3 species. We’ll use it throughout this course.
🧠 How It Works
Dataset Vocabulary
| Term | What It Means | Iris Example |
|---|---|---|
| Sample / Instance | One row — one observation | One specific iris flower |
| Feature / Attribute | One column — one measurement | Petal length, sepal width |
| Target / Label | The column you want to predict | Species (setosa, versicolor, virginica) |
| X (features matrix) | All input columns | All 4 measurements |
| y (target vector) | The output column | Species label |
| Shape | Rows × columns | 150 × 5 |
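To make the vocabulary concrete, here is a minimal sketch that splits a table into X and y using only lists. The three rows are a hand-typed excerpt in the Iris layout (the full dataset has 150), and the variable names `rows`, `X`, `y`, and `shape` are illustrative choices, not names taken from the course code.

```python
# A tiny hand-typed excerpt in the Iris layout: 4 measurements (cm) + species.
# The full dataset has 150 such rows; three are enough to show the split.
rows = [
    (5.1, 3.5, 1.4, 0.2, "setosa"),
    (7.0, 3.2, 4.7, 1.4, "versicolor"),
    (6.3, 3.3, 6.0, 2.5, "virginica"),
]

# X: every input column (the 4 measurements) for each sample.
X = [list(row[:4]) for row in rows]

# y: the single target column (the species label) for each sample.
y = [row[4] for row in rows]

# Shape is (number of rows, number of columns), counting the target column.
shape = (len(rows), len(rows[0]))

print("X:", X)
print("y:", y)
print("shape:", shape)
```

With all 150 rows loaded, the same code would report a shape of (150, 5), matching the table above.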
EDA: The 5 Questions to Ask Every Dataset
The Iris Dataset Up Close
Here is what one row (one flower sample) looks like:
Features: the 4 numeric measurements. Target: species. Shape: 150 × 5.
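As a rough illustration, one row can be written as a Python dictionary. The values below are the first flower in the commonly distributed Iris table; the column names (`sepal_length` and so on) are an assumption about naming, and the course's own code may label them differently.

```python
# Column names are an assumption; your copy of the data may name them differently.
feature_names = ["sepal_length", "sepal_width", "petal_length", "petal_width"]

# One sample: 4 numeric measurements (in cm) plus the species label.
sample = {
    "sepal_length": 5.1,
    "sepal_width": 3.5,
    "petal_length": 1.4,
    "petal_width": 0.2,
    "species": "setosa",
}

# Separate the inputs (features) from the thing we want to predict (target).
features = [sample[name] for name in feature_names]
target = sample["species"]

print("features:", features)
print("target:", target)
```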
▶️ See It In Code
Watch how we explore a dataset: shape, class counts, basic statistics. Run it to see the output.
This is a read-only example. The interactive exercise is below.
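If the embedded example is not visible, the following sketch shows the same three checks (shape, class counts, basic statistics) using only the standard library. It is not the course's original code: it runs on a nine-row hand-typed excerpt rather than the full 150 rows, and the names `rows`, `counts`, and `stats` are assumptions.

```python
from collections import Counter
from statistics import mean

feature_names = ["sepal_length", "sepal_width", "petal_length", "petal_width"]

# Nine hand-typed rows in the Iris layout (3 per species), not the full dataset.
rows = [
    (5.1, 3.5, 1.4, 0.2, "setosa"),
    (4.9, 3.0, 1.4, 0.2, "setosa"),
    (4.7, 3.2, 1.3, 0.2, "setosa"),
    (7.0, 3.2, 4.7, 1.4, "versicolor"),
    (6.4, 3.2, 4.5, 1.5, "versicolor"),
    (6.9, 3.1, 4.9, 1.5, "versicolor"),
    (6.3, 3.3, 6.0, 2.5, "virginica"),
    (5.8, 2.7, 5.1, 1.9, "virginica"),
    (7.1, 3.0, 5.9, 2.1, "virginica"),
]

# 1. Shape: how many rows and columns are there?
n_rows, n_cols = len(rows), len(rows[0])
print(f"Shape: {n_rows} x {n_cols}")

# 2. Class counts: how many samples does each species have?
counts = Counter(row[-1] for row in rows)
print("Class counts:", dict(counts))

# 3. Basic statistics: min, max, and mean of every feature column.
stats = {}
for i, name in enumerate(feature_names):
    column = [row[i] for row in rows]
    stats[name] = {
        "min": min(column),
        "max": max(column),
        "mean": round(mean(column), 2),
    }
    print(name, stats[name])
```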
👋 Your Turn
The code above shows stats for all features. Your task: modify the code to show statistics for sepal_length only, and also print whether the class distribution is balanced (equal number of each species).
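One possible shape for the balance check, sketched on two hard-coded count dictionaries so it runs on its own. The helper name `is_balanced` is hypothetical; in your solution, `counts` would come from the data rather than being typed in.

```python
def is_balanced(counts):
    """Return True when every class has the same number of samples."""
    # A set keeps only distinct values, so one distinct count means all equal.
    return len(set(counts.values())) == 1


# Iris has 50 flowers of each species.
iris_counts = {"setosa": 50, "versicolor": 50, "virginica": 50}

# A made-up skewed example for contrast.
skewed_counts = {"class_a": 95, "class_b": 5}

print("Iris balanced?", is_balanced(iris_counts))
print("Skewed balanced?", is_balanced(skewed_counts))
```

Note that this check is strict: counts of 50, 50, and 49 would be reported as not balanced, which may or may not be what you want on real data.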
Hint: use len(set(counts.values())) == 1 to check if all counts are identical. If they are, the dataset is balanced!
☕ Brain Break: 2 Minutes
Think about a dataset you interact with every day without realizing it — maybe your Spotify listening history, your purchase history, or your step counter.
- What would each row represent?
- What would the features be?
- What would you want the model to predict (the target)?
This mental habit — thinking in rows, columns, and targets — is how ML practitioners see the world.
✅ Key Takeaways
- A dataset = a table. Rows = samples, Columns = features, one column = target.
- Always explore before modeling: check shape, missing values, class distribution, and basic statistics.
- The Iris dataset: 150 samples, 4 features (sepal/petal length & width), 3 target classes. It’s the classic ML starting point.
- Imbalanced classes (e.g., 95% class A, 5% class B) can lead a model to mostly ignore the minority class while still reporting high accuracy, so check class counts early.
- “Garbage in = garbage out” — even the best algorithm fails if you don’t understand your data first.
🎉 Module 2 Complete!
You know how to read and explore a dataset. Next, we tackle the messiest part of ML — cleaning and preparing real data.