Learn Without Walls
Module 2 of 8 — Machine Learning Basics

Your First Dataset

Garbage in = garbage out — understand your data before you model it


📌 Before You Start

Estimated time: ~50 minutes

What you’ll learn: What a dataset is, the vocabulary of ML data (samples, features, labels), and how to explore a dataset using basic Python before building any model.

💡 The Big Idea

Before you can train a model, you need to understand your data. This is called Exploratory Data Analysis (EDA), and experienced ML engineers spend more time here than anywhere else.

A dataset is a table. Every row is one example (a sample). Every column is a measurement about that example (a feature). One special column is the target — the thing you’re trying to predict.

The famous Iris dataset is the “Hello World” of machine learning. It contains 150 measurements of iris flowers with 4 features and 3 species labels. We’ll use it throughout this course.
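If you have scikit-learn installed, you can load the real Iris dataset in a couple of lines. This is a sketch for the curious; the exercises in this module use a plain-Python subset so nothing extra is required:

```python
# A minimal sketch, assuming scikit-learn is installed
# (pip install scikit-learn).
from sklearn.datasets import load_iris

iris = load_iris()
print(iris.data.shape)      # (150, 4) -- 150 samples, 4 features
print(iris.feature_names)   # sepal/petal length and width
print(iris.target_names)    # ['setosa' 'versicolor' 'virginica']
```

Note that scikit-learn returns the features and target already separated, as `iris.data` and `iris.target`, matching the X/y vocabulary below.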

🧠 How It Works

Dataset Vocabulary

| Term | What It Means | Iris Example |
|------|---------------|--------------|
| Sample / Instance | One row — one observation | One specific iris flower |
| Feature / Attribute | One column — one measurement | Petal length, sepal width |
| Target / Label | The column you want to predict | Species (setosa, versicolor, virginica) |
| X (features matrix) | All input columns | All 4 measurements |
| y (target vector) | The output column | Species label |
| Shape | Rows × columns | 150 × 5 |
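The vocabulary above can be sketched in plain Python, assuming the dataset is a simple list of rows (the same toy table used later in this module):

```python
# A minimal sketch: splitting a list-of-rows table into X and y.
rows = [
    [5.1, 3.5, 1.4, 0.2, 'setosa'],
    [7.0, 3.2, 4.7, 1.4, 'versicolor'],
    [6.3, 3.3, 6.0, 2.5, 'virginica'],
]

X = [row[:-1] for row in rows]  # features matrix: all input columns
y = [row[-1] for row in rows]   # target vector: the species column

print(len(X), 'samples x', len(X[0]), 'features')  # 3 samples x 4 features
print(y)  # ['setosa', 'versicolor', 'virginica']
```

Libraries like NumPy and pandas give you this split (and `.shape`) directly, but the idea is the same: strip off the target column and keep the rest as inputs.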

EDA: The 5 Questions to Ask Every Dataset

1. How big is it? Check rows and columns. Too small (<100 samples) can make results unreliable; too large might need special tools.
2. What are the features? Which are numbers? Which are categories? Understand each column before using it.
3. Are there missing values? Most ML models can't handle gaps in data — you need a plan (drop them or fill them).
4. What's the class distribution? In classification, are your target classes balanced? In a dataset that is 99% one class, a model can score 99% accuracy by always predicting the majority class.
5. Are there obvious patterns? Do certain features seem to separate the classes well? This tells you what might matter most.
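Question 3 is the one the main example below skips, so here is a sketch of a missing-value check, assuming gaps are encoded as `None` in a plain list-of-rows table (real datasets often use NaN instead):

```python
# A sketch of Question 3: counting and handling missing values.
rows = [
    [5.1, 3.5, 1.4, 0.2, 'setosa'],
    [4.9, None, 1.4, 0.2, 'setosa'],     # missing sepal_width
    [6.3, 3.3, None, 2.5, 'virginica'],  # missing petal_length
]

missing = sum(value is None for row in rows for value in row)
print(f"Missing values: {missing}")  # Missing values: 2

# One plan: drop every row that has a gap.
complete = [row for row in rows if None not in row]
print(f"Rows after dropping: {len(complete)}")  # Rows after dropping: 1
```

Dropping rows is the simplest plan but throws away data; Module 3 covers filling gaps instead.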

The Iris Dataset Up Close

Here is what one row (one flower sample) looks like:

| sepal_length | sepal_width | petal_length | petal_width | species |
|--------------|-------------|--------------|-------------|------------|
| 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 7.0 | 3.2 | 4.7 | 1.4 | versicolor |
| 6.3 | 3.3 | 6.0 | 2.5 | virginica |

Features: the 4 numeric measurements. Target: species. Shape: 150 × 5.

▶️ See It In Code

Watch how we explore a dataset: shape, class counts, basic statistics. Run it to see the output.

```python
from collections import Counter

# Simulated Iris-like dataset (subset)
# Columns: sepal_length, sepal_width, petal_length, petal_width, species
data = [
    [5.1, 3.5, 1.4, 0.2, 'setosa'],
    [4.9, 3.0, 1.4, 0.2, 'setosa'],
    [4.7, 3.2, 1.3, 0.2, 'setosa'],
    [5.0, 3.6, 1.4, 0.2, 'setosa'],
    [5.4, 3.9, 1.7, 0.4, 'setosa'],
    [7.0, 3.2, 4.7, 1.4, 'versicolor'],
    [6.4, 3.2, 4.5, 1.5, 'versicolor'],
    [6.9, 3.1, 4.9, 1.5, 'versicolor'],
    [5.5, 2.3, 4.0, 1.3, 'versicolor'],
    [6.5, 2.8, 4.6, 1.5, 'versicolor'],
    [6.3, 3.3, 6.0, 2.5, 'virginica'],
    [5.8, 2.7, 5.1, 1.9, 'virginica'],
    [7.1, 3.0, 5.9, 2.1, 'virginica'],
    [6.3, 2.9, 5.6, 1.8, 'virginica'],
    [6.5, 3.0, 5.8, 2.2, 'virginica'],
]
feature_names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']

# ---- QUESTION 1: How big is the dataset? ----
print(f"Shape: {len(data)} rows x {len(data[0])} columns")
print(f"Features: {feature_names}")
print("Target: species (last column)")

# ---- QUESTION 2: What are the features? Peek at the first 3 rows ----
print("\n--- First 3 Rows ---")
print(f"{'sepal_l':>9} {'sepal_w':>9} {'petal_l':>9} {'petal_w':>9} {'species'}")
for row in data[:3]:
    print(f"{row[0]:>9.1f} {row[1]:>9.1f} {row[2]:>9.1f} {row[3]:>9.1f} {row[4]}")

# (QUESTION 3, missing values, is skipped here: this toy table has none.)

# ---- QUESTION 4: Class distribution ----
labels = [row[-1] for row in data]
print("\n--- Class Distribution ---")
for species, count in Counter(labels).items():
    print(f"  {species}: {count} samples")

# ---- QUESTION 5: Stats per feature ----
print("\n--- Feature Statistics ---")
print(f"{'Feature':<15} {'Min':>6} {'Max':>6} {'Mean':>6}")
print("-" * 37)
for i, name in enumerate(feature_names):
    vals = [row[i] for row in data]
    print(f"{name:<15} {min(vals):>6.2f} {max(vals):>6.2f} {sum(vals)/len(vals):>6.2f}")
```

This is a read-only example. The interactive exercise is below.

👋 Your Turn

The code above shows stats for all features. Your task: modify the code to show statistics for sepal_length only, and also print whether the class distribution is balanced (equal number of each species).

💡 Hint: Build counts = Counter(labels), then use len(set(counts.values())) == 1 to check whether all counts are identical. If they are, the dataset is balanced!
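The hint's idiom can be sketched on its own, using a hypothetical `labels` list (not the exercise data): a distribution is balanced exactly when the set of class counts contains a single value.

```python
# A sketch of the balance check from the hint.
from collections import Counter

labels = ['setosa'] * 5 + ['versicolor'] * 5 + ['virginica'] * 5
counts = Counter(labels)
print(len(set(counts.values())) == 1)  # True -- all counts are 5

counts['setosa'] += 1  # add one extra setosa sample
print(len(set(counts.values())) == 1)  # False -- counts are now 6, 5, 5
```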

☕ Brain Break — 2 Minutes

Think about a dataset you interact with every day without realizing it — maybe your Spotify listening history, your purchase history, or your step counter.

This mental habit — thinking in rows, columns, and targets — is how ML practitioners see the world.

✅ Key Takeaways

A dataset is a table: each row is a sample, each column is a feature, and one special column is the target you want to predict.

X is the features matrix, y is the target vector, and shape is rows × columns (150 × 5 for Iris).

Always do EDA before modeling — check size, feature types, missing values, class balance, and obvious patterns.

🎉 Module 2 Complete!

You know how to read and explore a dataset. Next, we tackle the messiest part of ML — cleaning and preparing real data.

Continue to Module 3: Preparing Data →
