Welcome back to DS/AI Interview Pulse, a curated breakdown of technical ML interview questions that actually show up, especially in roles at Big Tech, startups, and high-bar AI teams.
🤔 Today's Question:
"How do you fill in missing values?"
Sounds innocent.
But it's one of the fastest ways for an interviewer to check:
Have you seen real-world messiness?
Do you handle missing data with nuance?
Can you detect and prevent leakage?
Let's break it down, with the tools, instincts, and red flags they're looking for.
🔍 Let's Start Here: Mean Imputation is Lazy
Yes, it's fast. Yes, it "works."
But mean/median/mode imputation is the duct tape of data science.
# Lazy, and it shows
df['age'] = df['age'].fillna(df['age'].mean())
You're smoothing over real-world gaps without asking why they exist.
Interviewers want more than one-liners.
🛡️ The Safe Route: Group-Aware and Flagged
Better strategy:
# Add the missing flag FIRST, before any values are filled in
# (if you flag after imputing, the flag is all zeros)
df['age_missing'] = df['age'].isnull().astype(int)

# Segment-aware imputing
df['age'] = df.groupby('job_level')['age'].transform(lambda x: x.fillna(x.median()))
Why this works:
You honor structure (e.g., age varies by job level)
You preserve signal in the fact that data is missing
You give the model flexibility to decide what's relevant
Bonus points: check if that missing flag correlates with your label.
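A quick sketch of that check, on a toy frame with a hypothetical binary label `churned` (the label name is an assumption for illustration): group the label by the flag and see whether missingness itself predicts the outcome.

```python
import numpy as np
import pandas as pd

# Toy frame; 'churned' is a hypothetical binary label, assumed for illustration
df = pd.DataFrame({
    'age': [25, np.nan, 40, np.nan, 31, 52],
    'churned': [1, 1, 0, 1, 0, 0],
})

# Create the flag before any imputation touches the column
df['age_missing'] = df['age'].isnull().astype(int)

# If the label rate differs sharply between flagged and unflagged rows,
# the missingness itself carries signal
print(df.groupby('age_missing')['churned'].mean())
print(df['age_missing'].corr(df['churned']))
```

In this toy data, every row with missing `age` churned, so the flag alone is a strong feature.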
🤖 ML-Based Imputation (Flashy, Risky)
You can predict missing values using ML.
It's powerful, but you need to prove you understand the trade-offs.
from sklearn.impute import KNNImputer

# Fill each missing value using the 5 nearest rows (numeric features only)
imputer = KNNImputer(n_neighbors=5)
# Note: fit_transform returns a NumPy array, so column names are lost
df_imputed = imputer.fit_transform(df)
Other options:
IterativeImputer (scikit-learn's version of chained regression)
LightGBM/XGBoost to predict a column
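A minimal sketch of `IterativeImputer` on toy data (note the experimental opt-in import scikit-learn still requires; the data here is made up to show the mechanic):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401, required opt-in
from sklearn.impute import IterativeImputer

# Toy data where the second column is exactly twice the first
X = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [3.0, np.nan],
              [4.0, 8.0]])

# Each feature with missing values is regressed on the others, iteratively
imp = IterativeImputer(random_state=0)
X_filled = imp.fit_transform(X)
print(X_filled)  # the NaN is estimated from the column relationship
```

Because the columns are perfectly linearly related here, the imputed value lands near 6, which is the point: the model learns the relationship instead of a global constant.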
⚠️ Caution:
Requires careful pipeline control
Can accidentally leak info from target variable
Might not be worth the complexity in a lean system
🎤 What to Say in the Interview
"I'd start with understanding why data is missing: is it random, segment-based, or time-related? Then I'd look at group-level imputing, like filling age by job level or department. For features with meaningful missingness, I'd add a binary flag to help the model pick up on that. For some columns I might try KNN or iterative imputation, but only if I'm confident I'm avoiding target leakage and preserving reproducibility."
This shows:
You've seen real datasets
You think before filling
You can argue your choices with business or ML reasoning
🧠 Categorical Features? Don't Just Mode-It
# Safe default
df['gender'] = df['gender'].fillna(df['gender'].mode()[0])

# More nuanced
df['gender'] = df['gender'].fillna('missing')
Sometimes "missing" is a real category.
Example: if education_level is missing for 90% of interns, don't fill it. Encode it.
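Treating "missing" as its own level flows straight into encoding. A small sketch (the tiny frame is invented for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'education_level': ['BS', np.nan, 'MS', np.nan]})

# Encode missingness as its own category rather than guessing a value
df['education_level'] = df['education_level'].fillna('missing')
dummies = pd.get_dummies(df['education_level'])
print(dummies)  # one column per level, including 'missing'
```

Now the model sees "we don't know this person's education" as a first-class signal instead of a fabricated mode value.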
🚨 Common Traps (That Get You Dinged)
"Imputing before train-test split."
→ This can cause leakage. Impute after the split, inside your pipeline.
"Filling categorical columns with 0."
→ That's numerically valid, but semantically wrong.
"Dropping too quickly."
→ Dropping 30% of rows is sometimes worse than noisy imputing. Know when to take that risk.
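To make the first trap concrete, here's a sketch of imputation living inside a scikit-learn Pipeline, so the imputer is fit on the training fold only (the data is synthetic, purely for demonstration):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Synthetic data: label depends on the sign of feature 0
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = (X[:, 0] > 0).astype(int)
X[rng.random(size=(200, 3)) < 0.1] = np.nan  # inject ~10% missingness

# Split FIRST, then let the pipeline handle imputation
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The imputer's medians are learned from X_train only,
# so no test-set statistics leak into the fill values
model = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('clf', LogisticRegression()),
])
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
```

The same pipeline object also keeps cross-validation honest: each fold refits the imputer on its own training portion.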
🧩 Interviewer's Checklist
They're listening for:
Awareness of .fillna() and its limits
Mention of group-level imputation
Use of missing-value flags
Risk reasoning (leakage, overfitting)
Strategic thinking about feature vs. model performance
⚾ Mock Follow-Up: The Curveball
"What if 40% of values in income are missing?"
Bad answers:
"I'd drop the column."
"I'd fill with the mean."
Good answer:
"I'd segment the missing rows: are they all from one user type? If so, maybe that missingness means something. I'd create a flag, try group-level imputing, and measure the difference in predictive power with vs. without that column."
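That last step, measuring predictive power with vs. without the column, can be sketched like this. Everything here is synthetic; `segment` and `income` are stand-in names, and the missingness is deliberately tied to one segment:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in: income differs by user segment, and is missing
# for most of segment 0 (missingness is NOT random here)
rng = np.random.default_rng(1)
n = 500
segment = rng.integers(0, 2, n)
income = np.where(segment == 1, rng.normal(80, 10, n), rng.normal(40, 10, n))
y = (income + rng.normal(0, 10, n) > 60).astype(int)

income_obs = income.copy()
income_obs[(segment == 0) & (rng.random(n) < 0.8)] = np.nan

df = pd.DataFrame({'segment': segment, 'income': income_obs})
df['income_missing'] = df['income'].isnull().astype(int)  # flag first
df['income'] = df.groupby('segment')['income'].transform(
    lambda s: s.fillna(s.median()))  # group-level impute

# Compare cross-validated accuracy with and without the noisy column
with_col = cross_val_score(
    RandomForestClassifier(random_state=0),
    df[['segment', 'income', 'income_missing']], y, cv=5).mean()
without_col = cross_val_score(
    RandomForestClassifier(random_state=0),
    df[['segment']], y, cv=5).mean()
print(f'with income: {with_col:.3f}  without income: {without_col:.3f}')
```

If the imputed column plus its flag do not beat the model without them, that's your evidence the noisy imputation isn't earning its complexity.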
📅 Next Drop
Episode 003: How to Avoid Overfitting - Think Like an Interviewer