Welcome to DS/AI Interview Pulse, a series helping you crush ML interviews one real-world topic at a time. Think less textbook, more "how this shows up in a Stripe or Meta loop."
This week: missing values.
Simple? Yes. Ignored? Constantly. Mismanaged in interviews? All the time.
🚨 Quick Hit: Why Interviewers Ask This
Because it's everywhere.
And because how you handle missing data tells them if you:
Know how real datasets actually look
Can build pipelines that won't crash
Avoid subtle leakage that wrecks models
Also: cleaning data is still 60-70% of the job. It's boring to some, but if you skip it, you're out.
🎯 The Ask Behind the Ask
If they ask:
"How do you detect missing values in a dataset?"
They're actually testing:
🧠 Are you comfortable with isnull(), .info(), and value counts?
👀 Do you notice weird encodings (like "?", -999, None)?
📊 Do you use visualizations (heatmaps, missingno)?
⚠️ Do you think about what causes missingness?
It's not just "run .isnull().sum()". It's:
✅ Can you spot broken data pipelines, business rules, or bugs?
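Spotting those hidden problems can be automated with a quick sentinel scan before any modeling. A minimal sketch; the toy frame, column names, and token list here are invented for illustration:

```python
import pandas as pd

# Hypothetical data: missingness hidden behind sentinel tokens
df = pd.DataFrame({
    "age": [34, -999, 29, -999],           # -999 used as a fake "missing" code
    "city": ["NYC", "?", "LA", "Unknown"],  # "?" and "Unknown" are not real NaN
})

SUSPECTS = {"?", "N/A", "Unknown", "missing", "none", "", -1, -999}

# Count suspicious tokens per column -- none of these show up in isnull()
suspect_counts = {col: int(df[col].isin(SUSPECTS).sum()) for col in df.columns}
print(suspect_counts)  # {'age': 2, 'city': 2}
```

Note that df.isnull().sum() on this frame is all zeros, which is exactly the trap.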
🧪 In Code: What You Should Say (and Show)
# Step 1: Basic null scan
df.isnull().sum()
df.info()
# Step 2: % missing per column
df.isnull().mean().sort_values(ascending=False)
# Step 3: Visual
import seaborn as sns
import matplotlib.pyplot as plt
sns.heatmap(df.isnull(), cbar=False)
plt.show()
Also: mention missingno.matrix(df) for bonus points.
🧠 If You're Smart, You'll Say This
"First I check the shape of missingness with .info() and .isnull().mean(). Then I look for weird patterns, like whether high-income users have fewer blanks. That tells me whether data is missing at random or tied to a group. Also, I always search for values like '-999' or 'missing'; those aren't nulls, but they should be."
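That "weird patterns" check can be made concrete: compare missing rates across a grouping column. A hedged sketch with invented column names (income_band, phone):

```python
import numpy as np
import pandas as pd

# Hypothetical data: blanks in `phone` concentrated in one income band
df = pd.DataFrame({
    "income_band": ["high", "high", "low", "low", "low", "low"],
    "phone": ["555-1", "555-2", np.nan, np.nan, np.nan, "555-3"],
})

# Missing rate per group -- a large gap suggests the data is NOT
# missing completely at random
rates = df["phone"].isnull().groupby(df["income_band"]).mean()
print(rates)  # high: 0.0, low: 0.75
```

A big spread between groups is a cue to dig into the pipeline or business rule that produced the blanks.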
This is what interviewers love:
Realism
Risk thinking
Awareness of traps
⚠️ Common Pitfall (They Hope You Miss It)
Data isn't always missing as NaN. It's often hidden.
Look for:
"?", "N/A", "Unknown"
0, -1, -999
Empty strings ("")
Categorical features with "none" or "blank" as levels
Pro tip: Run .value_counts(dropna=False) on every column you suspect.
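The pro tip in action: value_counts(dropna=False) surfaces the hidden placeholders, and replace() promotes them to real NaN so .isnull() can finally see them. A small sketch with made-up values:

```python
import numpy as np
import pandas as pd

s = pd.Series(["red", "?", "blue", "", "N/A", "red"])

# dropna=False would also show true NaN alongside the fake ones
print(s.value_counts(dropna=False))

# Promote the fake missing tokens to real NaN
cleaned = s.replace(["?", "", "N/A"], np.nan)
print(cleaned.isnull().sum())  # 3 -- versus 0 before cleaning
```

Before the replace, s.isnull().sum() is 0, which is exactly why the basic null scan alone isn't enough.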
🗣️ Mock Follow-Up Question (That Catches People)
"If a feature has 45% missing, what do you do?"
Wrong answers:
"Drop it."
"Just fill it with the mean."
Better:
"Depends. If it's informative (like Employment History), I might create a 'missing' flag and treat missing as a category. If it's random noise with no business meaning, I might drop it. But I'd first check whether the missingness correlates with the target."
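That answer can be sketched in a few lines: an indicator flag, an explicit "Missing" level, and a quick look at whether missingness tracks the target. The column names (employment_history, target) and the toy data are invented:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "employment_history": ["5y", np.nan, "2y", np.nan, "10y", np.nan],
    "target": [1, 0, 1, 0, 1, 0],
})

# 1) Keep the signal: flag rows where the value was missing
df["employment_history_missing"] = df["employment_history"].isnull().astype(int)

# 2) Treat missing as its own category instead of dropping 45% of a column
df["employment_history"] = df["employment_history"].fillna("Missing")

# 3) Does missingness correlate with the target? (perfectly, in this toy data)
print(df.groupby("employment_history_missing")["target"].mean())
```

If the two group means differ sharply, the missingness itself is predictive, and dropping the column would throw that signal away.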
📦 The Interviewer's Checklist
They're checking whether you:
Mention .isnull() or .info()
Bring up weird encodings
Show awareness of correlation with target
Don't blindly drop columns or rows
Can tie this to real-world ETL problems
🧩 Mini Case: The Curveball
You're working with a hiring dataset. Education_Level is missing in 38% of rows. Turns out, 90% of those are intern applicants.
What do you do?
Best answer shows:
You ask why it's missing
You segment the missingness
You don't panic-drop or blindly impute
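Segmenting the missingness from the mini case might look like this; the frame and its numbers are fabricated to mirror the scenario:

```python
import numpy as np
import pandas as pd

# Toy hiring data: Education_Level blanks cluster in intern applications
df = pd.DataFrame({
    "Applicant_Type": ["intern"] * 4 + ["full_time"] * 6,
    "Education_Level": [np.nan, np.nan, np.nan, "HS",
                        "BS", "MS", np.nan, "BS", "PhD", "MS"],
})

# Missing rate per segment -- interns dominate the blanks
by_segment = df["Education_Level"].isnull().groupby(df["Applicant_Type"]).mean()
print(by_segment.round(2))
```

Once you see the rate concentrated in one segment, the conversation shifts from "how do I impute?" to "does this field even apply to interns?", which is the answer interviewers want.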
🔜 Coming Up
Episode 002: Filling Missing Data - The Safe, the Lazy, and the Risky
Mean imputation is for rookies. Let's talk smarter strategies (and how to explain them like a pro).
Want the next episodes in your inbox? Hit subscribe.