pandas Basics

The data analyst’s most essential library

← Lab 7: Data Lab 8 of 10 Lab 9: Visualization →



Concept Recap

pandas is the go-to Python library for data analysis. Key concepts:

DataFrame — a 2D table (like a spreadsheet). Rows = observations, Columns = variables.
Series — a single column (1D labeled array).
Selecting columns: df['Name'] (one column, returns a Series) or df[['Name', 'Age']] (list of columns, returns a DataFrame)
Filtering rows: df[df['Score'] > 90] or combined: df[(cond1) & (cond2)]
Groupby: df.groupby('Category')['Value'].mean()
Stats: df['col'].describe(), .mean(), .sum(), .value_counts()
Sorting: df.sort_values('col', ascending=False)
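The operations in this recap can be tried together on a tiny table. The sketch below uses hypothetical data (the names, categories, and scores are invented for illustration) and assumes nothing beyond the calls listed above.

```python
import pandas as pd

# Hypothetical data, used only to illustrate the operations in the recap.
df = pd.DataFrame({
    'Name': ['Ann', 'Ben', 'Cy', 'Dee'],
    'Category': ['A', 'B', 'A', 'B'],
    'Score': [95, 88, 91, 97],
})

one_col = df['Name']               # single brackets: a Series
two_cols = df[['Name', 'Score']]   # list of names: a DataFrame

# Each condition needs its own parentheses when combined with & (and) or | (or).
high_a = df[(df['Score'] > 90) & (df['Category'] == 'A')]

mean_by_cat = df.groupby('Category')['Score'].mean()
ranked = df.sort_values('Score', ascending=False)

print(type(one_col).__name__, type(two_cols).__name__)
print(high_a)
print(mean_by_cat)
print(ranked)
```

The parentheses around each condition matter because & binds more tightly than comparison operators such as > and ==.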

Worked Example

Creating and exploring a DataFrame:

import pandas as pd
import numpy as np

data = {
    'Name':       ['Alice', 'Bob', 'Carol', 'David', 'Eve'],
    'Department': ['Eng', 'Marketing', 'Eng', 'HR', 'Marketing'],
    'Salary':     [95000, 72000, 105000, 68000, 78000],
    'Years':      [5, 3, 8, 2, 6]
}
df = pd.DataFrame(data)

print(df)
print("\nShape:", df.shape)

print("\nBasic stats:")
print(df['Salary'].describe())

print("\nAvg salary by dept:")
print(df.groupby('Department')['Salary'].mean())

Guided

Exercise 1 — Student DataFrame Explorer

Complete the pandas operations by filling in the column name strings in the blanks.

import pandas as pd

students = pd.DataFrame({
    'Name':        ['Alex','Beth','Carlos','Diana','Ethan','Fiona','George','Hana'],
    'Major':       ['CS','Math','CS','English','CS','Math','English','CS'],
    'GPA':         [3.8, 3.2, 3.9, 2.9, 3.5, 3.7, 3.1, 3.6],
    'Year':        [3, 2, 4, 1, 2, 3, 4, 1],
    'Scholarship': [True, False, True, False, False, True, False, True]
})

# 1. Show only CS students
cs = students[students[___] == 'CS']
print("CS Students:\n", cs[['Name', 'GPA']])

# 2. Average GPA by Major
print("\nAvg GPA by Major:")
print(students.groupby(___)[___].mean())

# 3. Scholarship students with GPA > 3.5
elite = students[(students['Scholarship'] == True) & (students[___] > 3.5)]
print("\nElite scholars:\n", elite[['Name', 'GPA']])


Hint: The blanks, in order, are the column name strings 'Major', 'Major', 'GPA', 'GPA'. The column names in this DataFrame are strings, so they must be quoted.

Independent

Exercise 2 — DataFrame Analysis

Using the same students DataFrame, write pandas code to:

Find the student with the highest GPA (use .idxmax() or .sort_values())
Count students per year (use .value_counts())
Find the average GPA of scholarship vs non-scholarship students (use .groupby())
Sort by GPA descending and show the top 3 students

import pandas as pd

students = pd.DataFrame({
    'Name':        ['Alex','Beth','Carlos','Diana','Ethan','Fiona','George','Hana'],
    'Major':       ['CS','Math','CS','English','CS','Math','English','CS'],
    'GPA':         [3.8, 3.2, 3.9, 2.9, 3.5, 3.7, 3.1, 3.6],
    'Year':        [3, 2, 4, 1, 2, 3, 4, 1],
    'Scholarship': [True, False, True, False, False, True, False, True]
})

# Your 4 analyses here


Hint: Highest GPA: students.loc[students['GPA'].idxmax()]. Year counts: students['Year'].value_counts(). Top 3: students.sort_values('GPA', ascending=False).head(3).
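The patterns named in the hint can be seen on a separate, hypothetical table (the cities, zones, and temperatures below are invented for illustration). The index is deliberately set to letters to show that .idxmax() returns an index label, which .loc then uses to look up the row.

```python
import pandas as pd

# Hypothetical data; the index uses letter labels on purpose.
temps = pd.DataFrame(
    {
        'City': ['Oslo', 'Cairo', 'Lima'],
        'Zone': ['north', 'south', 'south'],
        'High': [21.5, 38.0, 27.3],
    },
    index=['a', 'b', 'c'],
)

label = temps['High'].idxmax()     # index label of the largest value
hottest = temps.loc[label]         # .loc looks up a row by its label

zone_counts = temps['Zone'].value_counts()        # rows per zone
by_zone = temps.groupby('Zone')['High'].mean()    # mean high per zone
top2 = temps.sort_values('High', ascending=False).head(2)

print(label)
print(hottest)
print(zone_counts)
print(by_zone)
print(top2)
```

Using .sort_values(...).head(n) is the more flexible of the two approaches when more than one top row is needed, while .idxmax() is convenient for a single row.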

Challenge

Exercise 3 — GPA Categories with pd.cut()

Create a new column 'GPA_Category' using pd.cut() that labels each student’s GPA:

Excellent: GPA ≥ 3.7
Good: 3.3 ≤ GPA < 3.7
Satisfactory: 3.0 ≤ GPA < 3.3
Needs Improvement: GPA < 3.0

Then count how many students fall into each category.

import pandas as pd

students = pd.DataFrame({
    'Name': ['Alex','Beth','Carlos','Diana','Ethan','Fiona','George','Hana'],
    'GPA':  [3.8, 3.2, 3.9, 2.9, 3.5, 3.7, 3.1, 3.6],
})

# Use pd.cut() to create GPA_Category column
# bins = [0, 3.0, 3.3, 3.7, 4.01]   (top edge just above 4.0 so a 4.0 GPA is not left out)
# labels = ['Needs Improvement', 'Satisfactory', 'Good', 'Excellent']
students['GPA_Category'] = pd.cut(
    students['GPA'],
    bins=___,
    labels=___,
    right=False   # left-inclusive: 3.7 goes into 'Excellent'
)

print(students[['Name', 'GPA', 'GPA_Category']])
print("\nCategory counts:")
print(students['GPA_Category'].value_counts())


Hint: bins is the list [0, 3.0, 3.3, 3.7, 4.01] (slightly above 4 to include 4.0). labels is the list of 4 category strings in matching order.
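To see why the top bin edge has to sit above the maximum when right=False, here is a minimal sketch on hypothetical exam scores (the values, edges, and labels are invented for illustration). Several scores are placed exactly on a bin edge.

```python
import pandas as pd

# Hypothetical exam scores; 60, 80 and 100 sit exactly on bin edges.
scores = pd.Series([59, 60, 79, 80, 100])
labels = ['Low', 'Mid', 'High']

# right=False makes each bin left-inclusive: [0, 60), [60, 80), [80, 101).
bands = pd.cut(scores, bins=[0, 60, 80, 101], labels=labels, right=False)

# With a top edge of exactly 100, the last bin is [80, 100) and excludes 100.
clipped = pd.cut(scores, bins=[0, 60, 80, 100], labels=labels, right=False)

print(bands.tolist())
print(bands.value_counts())
print("Unbinned values with top edge 100:", clipped.isna().sum())
```

A value that falls outside every bin becomes NaN rather than raising an error, so checking .isna().sum() after pd.cut() is a cheap way to catch a bin edge that is off by one.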

Mini Project

Mini Project — Sales Dataset Explorer

Analyze the sales DataFrame below. Answer all 5 business questions using pandas and print a formatted report:

Total revenue and total quota across all rows
Top salesperson by total revenue
Best month by total revenue
Revenue by region (sorted highest to lowest)
Percentage of rows where the rep exceeded their quota (Revenue > Quota)

import pandas as pd

sales = pd.DataFrame({
    'Rep':     ['Alice','Bob','Carol','Alice','David','Bob','Carol','David','Eve','Eve'],
    'Region':  ['West','East','West','West','South','East','West','South','North','North'],
    'Month':   ['Jan','Jan','Jan','Feb','Jan','Feb','Feb','Feb','Jan','Feb'],
    'Revenue': [18500,12300,22100,19800,15600,16900,25400,13200,11800,17600],
    'Quota':   [15000,15000,20000,15000,15000,15000,20000,15000,12000,12000]
})

# Answer all 5 business questions here


Hint: Q1: sales['Revenue'].sum(). Q2: sales.groupby('Rep')['Revenue'].sum().idxmax(). Q5: create sales['Hit_Quota'] = sales['Revenue'] > sales['Quota'], then compute sales['Hit_Quota'].mean() * 100.
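The two less obvious tricks in the hint are taking .idxmax() of a grouped sum and taking the mean of a boolean column. A minimal sketch on a hypothetical orders table (the agents, amounts, and targets are invented and unrelated to the sales data) shows both, along with one way to format a report line.

```python
import pandas as pd

# Hypothetical orders table, used only to illustrate the two patterns.
orders = pd.DataFrame({
    'Agent':  ['Kim', 'Lee', 'Kim', 'Lee'],
    'Amount': [120, 80, 60, 150],
    'Target': [100, 100, 100, 100],
})

totals = orders.groupby('Agent')['Amount'].sum()
top_agent = totals.idxmax()    # index label (agent name) with the largest total

met = orders['Amount'] > orders['Target']   # boolean Series, one value per row
pct_met = met.mean() * 100                  # True counts as 1, so the mean is a fraction

print(f"Top agent: {top_agent} ({int(totals.max()):,})")
print(f"Rows above target: {pct_met:.1f}%")
```

The f-string format specs :, and :.1f add a thousands separator and fix the number of decimals, which keeps a printed report readable.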

Lab 8 Complete!

You’ve created DataFrames, filtered rows, grouped data, and answered real business questions with pandas. This is the core skill of a data analyst.

Continue to Lab 9: Data Visualization →
