Learn Without Walls
Module 2 of 8 — Machine Learning Basics

Your First Dataset

Garbage in = garbage out — understand your data before you model it


📌 Before You Start

Estimated time: ~50 minutes

What you’ll learn: What a dataset is, the vocabulary of ML data (samples, features, labels), and how to explore a dataset using basic Python before building any model.

💡 The Big Idea

Before you can train a model, you need to understand your data. This is called Exploratory Data Analysis (EDA), and experienced ML engineers spend more time here than anywhere else.

A dataset is a table. Every row is one example (a sample). Every column is a measurement about that example (a feature). One special column is the target — the thing you’re trying to predict.

The famous Iris dataset is the “Hello World” of machine learning. It contains 150 measurements of iris flowers with 4 features and 3 species labels. We’ll use it throughout this course.
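If you have scikit-learn installed, you can load the real Iris dataset in a couple of lines. This is a sketch for the curious; the exercises in this module use a plain-Python subset so nothing extra is required:

```python
# A minimal sketch, assuming scikit-learn is installed
# (pip install scikit-learn).
from sklearn.datasets import load_iris

iris = load_iris()
print(iris.data.shape)      # (150, 4) -- 150 samples, 4 features
print(iris.feature_names)   # sepal/petal length and width
print(iris.target_names)    # ['setosa' 'versicolor' 'virginica']
```

Note that scikit-learn returns the features and target already separated, as `iris.data` and `iris.target`, matching the X/y vocabulary below.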

🧠 How It Works

Dataset Vocabulary

| Term | What It Means | Iris Example |
|------|---------------|--------------|
| Sample / Instance | One row — one observation | One specific iris flower |
| Feature / Attribute | One column — one measurement | Petal length, sepal width |
| Target / Label | The column you want to predict | Species (setosa, versicolor, virginica) |
| X (features matrix) | All input columns | All 4 measurements |
| y (target vector) | The output column | Species label |
| Shape | Rows × columns | 150 × 5 |
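The vocabulary above can be sketched in plain Python, assuming the dataset is a simple list of rows (the same toy table used later in this module):

```python
# A minimal sketch: splitting a list-of-rows table into X and y.
rows = [
    [5.1, 3.5, 1.4, 0.2, 'setosa'],
    [7.0, 3.2, 4.7, 1.4, 'versicolor'],
    [6.3, 3.3, 6.0, 2.5, 'virginica'],
]

X = [row[:-1] for row in rows]  # features matrix: all input columns
y = [row[-1] for row in rows]   # target vector: the species column

print(len(X), 'samples x', len(X[0]), 'features')  # 3 samples x 4 features
print(y)  # ['setosa', 'versicolor', 'virginica']
```

Libraries like NumPy and pandas give you this split (and `.shape`) directly, but the idea is the same: strip off the target column and keep the rest as inputs.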

EDA: The 5 Questions to Ask Every Dataset

1. How big is it? Check rows and columns. Too small (<100 samples) can make results unreliable; too large might need special tools.
2. What are the features? Which are numbers? Which are categories? Understand each column before using it.
3. Are there missing values? Most ML models can't handle gaps in data — you need a plan (drop them or fill them).
4. What's the class distribution? In classification, are your target classes balanced? In a dataset that is 99% one class, a model can score 99% accuracy by always predicting the majority class.
5. Are there obvious patterns? Do certain features seem to separate the classes well? This tells you what might matter most.
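Question 3 is the one the main example below skips, so here is a sketch of a missing-value check, assuming gaps are encoded as `None` in a plain list-of-rows table (real datasets often use NaN instead):

```python
# A sketch of Question 3: counting and handling missing values.
rows = [
    [5.1, 3.5, 1.4, 0.2, 'setosa'],
    [4.9, None, 1.4, 0.2, 'setosa'],     # missing sepal_width
    [6.3, 3.3, None, 2.5, 'virginica'],  # missing petal_length
]

missing = sum(value is None for row in rows for value in row)
print(f"Missing values: {missing}")  # Missing values: 2

# One plan: drop every row that has a gap.
complete = [row for row in rows if None not in row]
print(f"Rows after dropping: {len(complete)}")  # Rows after dropping: 1
```

Dropping rows is the simplest plan but throws away data; Module 3 covers filling gaps instead.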

The Iris Dataset Up Close

Here is what one row (one flower sample) looks like:

| sepal_length | sepal_width | petal_length | petal_width | species |
|--------------|-------------|--------------|-------------|------------|
| 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 7.0 | 3.2 | 4.7 | 1.4 | versicolor |
| 6.3 | 3.3 | 6.0 | 2.5 | virginica |

Features: the 4 numeric measurements. Target: species. Shape: 150 × 5.

▶️ See It In Code

Watch how we explore a dataset: shape, class counts, basic statistics. Run it to see the output.

```python
from collections import Counter

# Simulated Iris-like dataset (subset)
# Columns: sepal_length, sepal_width, petal_length, petal_width, species
data = [
    [5.1, 3.5, 1.4, 0.2, 'setosa'],
    [4.9, 3.0, 1.4, 0.2, 'setosa'],
    [4.7, 3.2, 1.3, 0.2, 'setosa'],
    [5.0, 3.6, 1.4, 0.2, 'setosa'],
    [5.4, 3.9, 1.7, 0.4, 'setosa'],
    [7.0, 3.2, 4.7, 1.4, 'versicolor'],
    [6.4, 3.2, 4.5, 1.5, 'versicolor'],
    [6.9, 3.1, 4.9, 1.5, 'versicolor'],
    [5.5, 2.3, 4.0, 1.3, 'versicolor'],
    [6.5, 2.8, 4.6, 1.5, 'versicolor'],
    [6.3, 3.3, 6.0, 2.5, 'virginica'],
    [5.8, 2.7, 5.1, 1.9, 'virginica'],
    [7.1, 3.0, 5.9, 2.1, 'virginica'],
    [6.3, 2.9, 5.6, 1.8, 'virginica'],
    [6.5, 3.0, 5.8, 2.2, 'virginica'],
]
feature_names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']

# ---- QUESTION 1: How big is the dataset? ----
print(f"Shape: {len(data)} rows x {len(data[0])} columns")
print(f"Features: {feature_names}")
print("Target: species (last column)")

# ---- QUESTION 2: What are the features? Peek at the first 3 rows ----
print("\n--- First 3 Rows ---")
print(f"{'sepal_l':>9} {'sepal_w':>9} {'petal_l':>9} {'petal_w':>9} {'species'}")
for row in data[:3]:
    print(f"{row[0]:>9.1f} {row[1]:>9.1f} {row[2]:>9.1f} {row[3]:>9.1f} {row[4]}")

# (QUESTION 3, missing values, is skipped here: this toy table has none.)

# ---- QUESTION 4: Class distribution ----
labels = [row[-1] for row in data]
print("\n--- Class Distribution ---")
for species, count in Counter(labels).items():
    print(f"  {species}: {count} samples")

# ---- QUESTION 5: Stats per feature ----
print("\n--- Feature Statistics ---")
print(f"{'Feature':<15} {'Min':>6} {'Max':>6} {'Mean':>6}")
print("-" * 37)
for i, name in enumerate(feature_names):
    vals = [row[i] for row in data]
    print(f"{name:<15} {min(vals):>6.2f} {max(vals):>6.2f} {sum(vals)/len(vals):>6.2f}")
```

This is a read-only example. The interactive exercise is below.

👋 Your Turn

The code above shows stats for all features. Your task: modify the code to show statistics for sepal_length only, and also print whether the class distribution is balanced (equal number of each species).

💡 Hint: Build counts = Counter(labels), then use len(set(counts.values())) == 1 to check whether all counts are identical. If they are, the dataset is balanced!
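The hint's idiom can be sketched on its own, using a hypothetical `labels` list (not the exercise data): a distribution is balanced exactly when the set of class counts contains a single value.

```python
# A sketch of the balance check from the hint.
from collections import Counter

labels = ['setosa'] * 5 + ['versicolor'] * 5 + ['virginica'] * 5
counts = Counter(labels)
print(len(set(counts.values())) == 1)  # True -- all counts are 5

counts['setosa'] += 1  # add one extra setosa sample
print(len(set(counts.values())) == 1)  # False -- counts are now 6, 5, 5
```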

☕ Brain Break — 2 Minutes

Think about a dataset you interact with every day without realizing it — maybe your Spotify listening history, your purchase history, or your step counter.

This mental habit — thinking in rows, columns, and targets — is how ML practitioners see the world.

✅ Key Takeaways

A dataset is a table: each row is a sample, each column is a feature, and one special column is the target you want to predict.

X is the features matrix, y is the target vector, and shape is rows × columns (150 × 5 for Iris).

Always do EDA before modeling — check size, feature types, missing values, class balance, and obvious patterns.

🎉 Module 2 Complete!

You know how to read and explore a dataset. Next, we tackle the messiest part of ML — cleaning and preparing real data.

Continue to Module 3: Preparing Data →
