Welcome back to DS/AI Interview Pulse, a curated breakdown of technical ML interview questions that actually show up, especially in roles at Big Tech, startups, and high-bar AI teams.
🤔 Today's Question:
"How do you fill in missing values?"
Sounds innocent.
But it's one of the fastest ways for an interviewer to check:
Have you seen real-world messiness?
Do you handle missing data with nuance?
Can you detect and prevent leakage?
Let's break it down, with the tools, instincts, and red flags they're looking for.
🔍 Let's Start Here: Mean Imputation is Lazy
Yes, it's fast. Yes, it "works."
But mean/median/mode imputation is the duct tape of data science.
# Lazy, and it shows
df['age'] = df['age'].fillna(df['age'].mean())
You're smoothing over real-world gaps without asking why they exist.
Interviewers want more than one-liners.
🛡️ The Safe Route: Group-Aware and Flagged
Better strategy:
# Add the missing flag FIRST, before any values are filled in
# (if you flag after imputing, the flag is all zeros)
df['age_missing'] = df['age'].isnull().astype(int)

# Segment-aware imputing
df['age'] = df.groupby('job_level')['age'].transform(lambda x: x.fillna(x.median()))
Why this works:
You honor structure (e.g., age varies by job level)
You preserve signal in the fact that data is missing
You give the model flexibility to decide what's relevant
Bonus points: check if that missing flag correlates with your label.
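A quick sketch of that check, on a toy frame with a hypothetical binary label `churned` (the label name is an assumption for illustration): group the label by the flag and see whether missingness itself predicts the outcome.

```python
import numpy as np
import pandas as pd

# Toy frame; 'churned' is a hypothetical binary label, assumed for illustration
df = pd.DataFrame({
    'age': [25, np.nan, 40, np.nan, 31, 52],
    'churned': [1, 1, 0, 1, 0, 0],
})

# Create the flag before any imputation touches the column
df['age_missing'] = df['age'].isnull().astype(int)

# If the label rate differs sharply between flagged and unflagged rows,
# the missingness itself carries signal
print(df.groupby('age_missing')['churned'].mean())
print(df['age_missing'].corr(df['churned']))
```

In this toy data, every row with missing `age` churned, so the flag alone is a strong feature.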
🤖 ML-Based Imputation (Flashy, Risky)
You can predict missing values using ML.
It's powerful, but you need to prove you understand the trade-offs.
from sklearn.impute import KNNImputer

# Fill each missing value using the 5 nearest rows (numeric features only)
imputer = KNNImputer(n_neighbors=5)
# Note: fit_transform returns a NumPy array, so column names are lost
df_imputed = imputer.fit_transform(df)
Other options:
IterativeImputer (scikit-learn's version of chained regression)
LightGBM/XGBoost to predict a column
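A minimal sketch of `IterativeImputer` on toy data (note the experimental opt-in import scikit-learn still requires; the data here is made up to show the mechanic):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401, required opt-in
from sklearn.impute import IterativeImputer

# Toy data where the second column is exactly twice the first
X = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [3.0, np.nan],
              [4.0, 8.0]])

# Each feature with missing values is regressed on the others, iteratively
imp = IterativeImputer(random_state=0)
X_filled = imp.fit_transform(X)
print(X_filled)  # the NaN is estimated from the column relationship
```

Because the columns are perfectly linearly related here, the imputed value lands near 6, which is the point: the model learns the relationship instead of a global constant.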
⚠️ Caution:
Requires careful pipeline control
Can accidentally leak info from target variable
Might not be worth the complexity in a lean system
🎤 What to Say in the Interview
"I'd start with understanding why data is missing: is it random, segment-based, or time-related? Then I'd look at group-level imputing, like filling age by job level or department. For features with meaningful missingness, I'd add a binary flag to help the model pick up on that. For some columns I might try KNN or iterative imputation, but only if I'm confident I'm avoiding target leakage and preserving reproducibility."
This shows:
You've seen real datasets
You think before filling
You can argue your choices with business or ML reasoning
🧠 Categorical Features? Don't Just Mode-It
# Safe default
df['gender'] = df['gender'].fillna(df['gender'].mode()[0])

# More nuanced
df['gender'] = df['gender'].fillna('missing')
Sometimes "missing" is a real category.
Example: if education_level is missing for 90% of interns, don't fill it. Encode it.
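Treating "missing" as its own level flows straight into encoding. A small sketch (the tiny frame is invented for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'education_level': ['BS', np.nan, 'MS', np.nan]})

# Encode missingness as its own category rather than guessing a value
df['education_level'] = df['education_level'].fillna('missing')
dummies = pd.get_dummies(df['education_level'])
print(dummies)  # one column per level, including 'missing'
```

Now the model sees "we don't know this person's education" as a first-class signal instead of a fabricated mode value.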
🚨 Common Traps (That Get You Dinged)
"Imputing before train-test split."
→ This can cause leakage. Impute after the split, inside your pipeline.
"Filling categorical columns with 0."
→ That's numerically valid, but semantically wrong.
"Dropping too quickly."
→ Dropping 30% of rows is sometimes worse than noisy imputing. Know when to take that risk.
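To make the first trap concrete, here's a sketch of imputation living inside a scikit-learn Pipeline, so the imputer is fit on the training fold only (the data is synthetic, purely for demonstration):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Synthetic data: label depends on the sign of feature 0
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = (X[:, 0] > 0).astype(int)
X[rng.random(size=(200, 3)) < 0.1] = np.nan  # inject ~10% missingness

# Split FIRST, then let the pipeline handle imputation
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The imputer's medians are learned from X_train only,
# so no test-set statistics leak into the fill values
model = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('clf', LogisticRegression()),
])
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
```

The same pipeline object also keeps cross-validation honest: each fold refits the imputer on its own training portion.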
🧩 Interviewer's Checklist
They're listening for:
Awareness of .fillna() and its limits
Mention of group-level imputation
Use of missing-value flags
Risk reasoning (leakage, overfitting)
Strategic thinking about feature vs. model performance
⚾ Mock Follow-Up: The Curveball
"What if 40% of values in income are missing?"
Bad answers:
"I'd drop the column."
"I'd fill with the mean."
Good answer:
"I'd segment the missing rows: are they all from one user type? If so, maybe that missingness means something. I'd create a flag, try group-level imputing, and measure the difference in predictive power with vs. without that column."
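That last step, measuring predictive power with vs. without the column, can be sketched like this. Everything here is synthetic; `segment` and `income` are stand-in names, and the missingness is deliberately tied to one segment:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in: income differs by user segment, and is missing
# for most of segment 0 (missingness is NOT random here)
rng = np.random.default_rng(1)
n = 500
segment = rng.integers(0, 2, n)
income = np.where(segment == 1, rng.normal(80, 10, n), rng.normal(40, 10, n))
y = (income + rng.normal(0, 10, n) > 60).astype(int)

income_obs = income.copy()
income_obs[(segment == 0) & (rng.random(n) < 0.8)] = np.nan

df = pd.DataFrame({'segment': segment, 'income': income_obs})
df['income_missing'] = df['income'].isnull().astype(int)  # flag first
df['income'] = df.groupby('segment')['income'].transform(
    lambda s: s.fillna(s.median()))  # group-level impute

# Compare cross-validated accuracy with and without the noisy column
with_col = cross_val_score(
    RandomForestClassifier(random_state=0),
    df[['segment', 'income', 'income_missing']], y, cv=5).mean()
without_col = cross_val_score(
    RandomForestClassifier(random_state=0),
    df[['segment']], y, cv=5).mean()
print(f'with income: {with_col:.3f}  without income: {without_col:.3f}')
```

If the imputed column plus its flag do not beat the model without them, that's your evidence the noisy imputation isn't earning its complexity.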
📅 Next Drop
Episode 003: How to Avoid Overfitting - Think Like an Interviewer