🏠 California House Price Prediction Model By Mirza Yasir Abdullah Baig
🧠 Introduction
The real estate market is influenced by multiple complex factors such as location, income, population, and housing conditions. Accurately estimating house prices is a challenging yet valuable problem — especially for buyers, sellers, and investors.
To explore how machine learning can address real-world business use cases, I built an AI-powered web application that predicts median house prices in California using the California Housing Dataset.
This project demonstrates an end-to-end ML workflow — from data preprocessing, feature scaling, model training, and evaluation to interactive deployment using Streamlit — bridging the gap between raw data and actionable real estate insights.
Source code: https://github.com/mirzayasirabdullahbaig07/House-Price-Prediction-Model
💡 Motivation & Intuition Behind the Project
- Understanding Real-World Regression Problems:
  Many practical ML tasks are regression problems that predict continuous values such as price, temperature, or sales. I wanted to implement one such use case on a real dataset.
- Showcasing End-to-End ML Skills:
  My goal was to demonstrate the complete data science lifecycle (preprocessing, scaling, training, saving, and deploying a model) in a web app with user interaction.
- Practical & Business-Oriented Problem:
  Real estate price prediction is a tangible, easily understood domain that connects data science with real-world economics, which makes it a good fit for interviews and portfolio showcases.
- Portfolio & Learning Value:
  This project helped me strengthen my knowledge of Linear Regression, feature scaling, and deployment, all core concepts for ML engineering roles.
📊 Dataset Overview
Dataset Name: California Housing Dataset
Source: Scikit-learn Built-in Datasets
The dataset contains real-world data on California districts (census block groups) derived from the 1990 U.S. Census: 20,640 rows with 8 numeric features, which makes it well suited to housing price prediction tasks.
Feature | Description |
---|---|
MedInc | Median income in block group (in tens of thousands of dollars) |
HouseAge | Median house age in block group |
AveRooms | Average number of rooms per household |
AveBedrms | Average number of bedrooms per household |
Population | Block group population |
AveOccup | Average number of household members |
Latitude | Latitude of the block group |
Longitude | Longitude of the block group (not among the predictors selected for the model) |
MedHouseVal (target) | Median house value (in units of $100,000) |
Why This Dataset?
- It’s clean, reliable, and widely used in regression benchmarks.
- It demonstrates how socioeconomic and geographic factors affect housing prices.
- Ideal for understanding linear and non-linear feature relationships.
⚙️ Techniques & Workflow
The project follows a standard machine learning pipeline, described step by step below.
1️⃣ Data Preprocessing
- Loaded the dataset using `sklearn.datasets.fetch_california_housing()`.
- Checked for missing values and outliers.
- Scaled features using StandardScaler so that all predictors are on a comparable scale.
- Selected key predictors: `MedInc, HouseAge, AveRooms, AveBedrms, Population, AveOccup, Latitude`.
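The loading, missing-value check, and predictor selection above could look roughly like the sketch below. It is not the project's actual code: the helper name `prepare_features` and the two-row stand-in frame are hypothetical, and the outlier check is left out. With the real data, the frame would come from `fetch_california_housing(as_frame=True).frame`, which downloads the dataset on first use.

```python
import pandas as pd

# The seven predictors listed above (Longitude is left out).
PREDICTORS = ["MedInc", "HouseAge", "AveRooms", "AveBedrms",
              "Population", "AveOccup", "Latitude"]
TARGET = "MedHouseVal"  # target column name in scikit-learn's frame


def prepare_features(frame: pd.DataFrame):
    """Fail on missing values, then return (X, y) with the selected predictors."""
    missing = frame[PREDICTORS + [TARGET]].isna().sum()
    if missing.any():
        raise ValueError(f"Missing values found:\n{missing[missing > 0]}")
    return frame[PREDICTORS], frame[TARGET]


# Real data (downloads on first use):
#   from sklearn.datasets import fetch_california_housing
#   frame = fetch_california_housing(as_frame=True).frame
# Hypothetical two-row stand-in so the sketch runs offline:
frame = pd.DataFrame({
    "MedInc": [8.3, 3.1], "HouseAge": [41.0, 20.0], "AveRooms": [6.9, 5.2],
    "AveBedrms": [1.0, 1.1], "Population": [322.0, 1500.0],
    "AveOccup": [2.5, 3.0], "Latitude": [37.88, 34.05],
    "Longitude": [-122.23, -118.24], "MedHouseVal": [4.5, 2.0],
})
X, y = prepare_features(frame)
print(X.shape)
```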
2️⃣ Feature Scaling
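Since the dataset includes features on very different scales (median income in tens of thousands of dollars, average rooms per household in single digits, block group population in the hundreds or thousands), I applied StandardScaler to bring all features to a comparable range.
For ordinary least squares this does not change the predictions, but it puts the regression weights on a common scale, so their magnitudes can be compared across features.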
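As a small illustration of what StandardScaler does, the sketch below standardizes each column with z = (x - mean) / std, learning the mean and standard deviation from the training rows and reusing them for new inputs. The numbers are made-up example rows, not values from the project.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical rows: [MedInc, AveRooms, Population], on very different scales.
X_train = np.array([[8.3, 6.9, 322.0],
                    [3.1, 5.2, 1500.0],
                    [5.6, 6.1, 900.0],
                    [2.4, 4.8, 2400.0]])
X_new = np.array([[4.0, 5.5, 1000.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from training data only
X_new_scaled = scaler.transform(X_new)          # reuse the same mean/std for new inputs

print(X_train_scaled.mean(axis=0).round(6))  # each column is centered at (about) 0
print(X_train_scaled.std(axis=0).round(6))   # each column has unit standard deviation
```

Fitting the scaler on the training rows only, and merely applying it to later inputs, keeps information from unseen data out of the preprocessing step.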
3️⃣ Model Selection
Chosen Model: Linear Regression
I experimented with a few other models (Decision Tree, Random Forest), but chose Linear Regression for deployment because:
- It’s simple, interpretable, and efficient for continuous value prediction.
- A linear fit already explains most of the variance in the target (R² ≈ 0.73), so the relationship is reasonably well approximated as linear.
- It is an excellent baseline model for regression tasks.
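One way to run this kind of comparison is cross-validated R² for each candidate, sketched below. The data here is a synthetic stand-in generated by `make_regression` (an assumption, so the sketch runs offline); because that data is linear by construction, Linear Regression will score best, which says nothing about the housing data. The hyperparameters are illustrative, not the project's settings.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption); swap in the housing X, y for a real comparison.
X, y = make_regression(n_samples=300, n_features=7, noise=10.0, random_state=0)

candidates = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(random_state=0),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

scores = {}
for name, model in candidates.items():
    # Mean R^2 over 5 cross-validation folds; higher is better.
    scores[name] = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    print(f"{name}: mean R^2 = {scores[name]:.3f}")
```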
4️⃣ Model Training
- Split data into training and test sets (80/20).
- Trained Linear Regression model on scaled features.
- Evaluated using R² score, MAE, and RMSE.
Metric | Result |
---|---|
R² Score | ~0.73 |
MAE (Mean Absolute Error) | Low (exact value not reported) |
RMSE (Root Mean Square Error) | Moderate (exact value not reported) |
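The training and evaluation steps can be sketched as a single function. This is a minimal sketch under assumptions: the function name `train_and_evaluate`, the random seed, and the synthetic stand-in data are all hypothetical, so the printed metrics will not match the table above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def train_and_evaluate(X, y, seed=42):
    """80/20 split, scale, fit Linear Regression, return (model, scaler, metrics)."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed)

    scaler = StandardScaler().fit(X_train)  # fit on training rows only
    model = LinearRegression().fit(scaler.transform(X_train), y_train)

    y_pred = model.predict(scaler.transform(X_test))
    metrics = {
        "r2": r2_score(y_test, y_pred),
        "mae": mean_absolute_error(y_test, y_pred),
        "rmse": float(np.sqrt(mean_squared_error(y_test, y_pred))),
    }
    return model, scaler, metrics


# Synthetic stand-in with 7 features (assumption); pass the housing X, y instead.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 7))
weights = np.array([0.8, 0.1, -0.2, 0.3, 0.0, -0.1, -0.4])
y = X @ weights + rng.normal(scale=0.5, size=500)

model, scaler, metrics = train_and_evaluate(X, y)
print(metrics)
```

MAE and RMSE are in the same units as the target, so on the housing data they would read as multiples of $100,000. RMSE is never smaller than MAE, and a large gap between them points to a few large errors.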
5️⃣ Model Serialization
- Saved the trained model as `AIModel_For_House.pkl`.
- Saved the fitted scaler as `scaler.pkl` for consistent input normalization during deployment.
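Saving and reloading both artifacts with `pickle` might look like the sketch below. The file names come from the project; the tiny stand-in model, the synthetic data, and the temporary output folder are assumptions made so the sketch runs on its own.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Tiny stand-in model and scaler (assumption); in the project these are the trained objects.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 7))
y = X.sum(axis=1)
scaler = StandardScaler().fit(X)
model = LinearRegression().fit(scaler.transform(X), y)

out_dir = Path(tempfile.mkdtemp())  # temp folder so the sketch leaves the project untouched

# Save both artifacts under the file names used in the project.
with open(out_dir / "AIModel_For_House.pkl", "wb") as f:
    pickle.dump(model, f)
with open(out_dir / "scaler.pkl", "wb") as f:
    pickle.dump(scaler, f)

# Load them back, as the deployed app would at startup.
with open(out_dir / "AIModel_For_House.pkl", "rb") as f:
    loaded_model = pickle.load(f)
with open(out_dir / "scaler.pkl", "rb") as f:
    loaded_scaler = pickle.load(f)

sample = X[:1]
restored = loaded_model.predict(loaded_scaler.transform(sample))
original = model.predict(scaler.transform(sample))
print(np.allclose(restored, original))
```

Pickle files should only be loaded from trusted sources, and they are most reliable when loaded with the same scikit-learn version that wrote them.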
💻 Web App Development – Streamlit Deployment
I built a Streamlit web app to make the model accessible to anyone, even non-technical users.
🌐 App Features
- Sidebar input form for key housing features.
- Predicts median house price instantly using the trained Linear Regression model.
- Displays predicted value with a clear visual style.
- Includes an About Me section with portfolio links.
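A sidebar form of this kind could be wired up roughly as follows. This is a sketch, not the deployed app's code: the function names `predict_value` and `run_app`, the widget labels, and the default values are all hypothetical, and the dollar conversion assumes the target is in units of $100,000 as in scikit-learn's copy of the dataset. The About Me section is omitted.

```python
import pickle

import numpy as np

FEATURES = ["MedInc", "HouseAge", "AveRooms", "AveBedrms",
            "Population", "AveOccup", "Latitude"]


def predict_value(model, scaler, inputs):
    """Scale one row of inputs (in FEATURES order) and return the model's prediction."""
    row = np.array([[inputs[name] for name in FEATURES]], dtype=float)
    return float(model.predict(scaler.transform(row))[0])


def run_app():
    """Streamlit UI; widget labels and default values are illustrative guesses."""
    import streamlit as st  # imported here so predict_value stays usable without Streamlit

    with open("AIModel_For_House.pkl", "rb") as f:
        model = pickle.load(f)
    with open("scaler.pkl", "rb") as f:
        scaler = pickle.load(f)

    st.title("California House Price Prediction")
    st.sidebar.header("Housing features")
    defaults = {"MedInc": 3.5, "HouseAge": 30.0, "AveRooms": 5.0, "AveBedrms": 1.0,
                "Population": 1200.0, "AveOccup": 3.0, "Latitude": 35.0}
    inputs = {name: st.sidebar.number_input(name, value=defaults[name])
              for name in FEATURES}

    if st.sidebar.button("Predict"):
        value = predict_value(model, scaler, inputs)
        # Assumes the target is in units of $100,000.
        st.success(f"Predicted median house value: ${value * 100_000:,.0f}")
```

To try it, one would add a call to `run_app()` at the bottom of the script (say, a hypothetical `app.py`) and launch it with `streamlit run app.py`. Keeping the prediction logic in a plain function makes it testable without starting the web app.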
🛠️ Tech Stack
Component | Technology Used |
---|---|
Programming Language | Python 3.9+ |
Frontend Framework | Streamlit |
Modeling Library | Scikit-learn |
Data Processing | Pandas, NumPy |
Model Storage | Pickle |
Visualization | Matplotlib / Seaborn |
📈 Results & Insights
- The model explains roughly 73% of the variance in median house value (R² ≈ 0.73).
- The most influential feature is Median Income, followed by Latitude and House Age, which is consistent with real-world housing market trends.
- Achieved a strong balance between simplicity and performance.
📊 Key Insights:
- Areas with higher median income → higher predicted prices.
- Newer houses (lower HouseAge) → higher predicted value.
- Population has a mild negative effect on the predicted price. Congestion is one possible explanation, though the model alone cannot establish the cause.
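With a linear model trained on standardized features, insights like these are usually read off the coefficients: the sign gives the direction of the effect and the absolute value gives a rough measure of influence. The sketch below shows the mechanics on synthetic data. The helper name `rank_features` is hypothetical, and the weights are invented for illustration, so the printed numbers are not the project's coefficients.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

FEATURES = ["MedInc", "HouseAge", "AveRooms", "AveBedrms",
            "Population", "AveOccup", "Latitude"]


def rank_features(model, names):
    """Pair each feature with its coefficient, sorted by absolute magnitude."""
    return sorted(zip(names, model.coef_), key=lambda pair: abs(pair[1]), reverse=True)


# Synthetic stand-in (assumption): columns on different raw scales, invented weights.
rng = np.random.default_rng(1)
Z = rng.normal(size=(400, 7))
X = Z * np.array([2.0, 12.0, 2.0, 0.5, 1000.0, 1.0, 2.0])
y = 0.9 * Z[:, 0] - 0.4 * Z[:, 6] + 0.15 * Z[:, 1] + rng.normal(scale=0.05, size=400)

X_scaled = StandardScaler().fit_transform(X)
model = LinearRegression().fit(X_scaled, y)

ranked = rank_features(model, FEATURES)
for name, coef in ranked:
    print(f"{name:>10}: {coef:+.3f}")
```

Comparing magnitudes this way is only meaningful because the features were standardized first; on raw features, a column measured in thousands would get a tiny coefficient regardless of its importance.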
🧩 Interview Talking Points
If asked to elaborate on this project in an interview, here’s how to respond:
- What problem does it solve?
  It predicts house prices based on socioeconomic and geographic data, which is useful for buyers and investors.
- Why did you choose Linear Regression?
  It provides high interpretability, requires minimal tuning, and fits well with continuous target prediction problems.
- What preprocessing did you perform?
  Feature selection, feature scaling, and a train-test split, so that performance is measured on data the model has not seen.
- How did you evaluate your model?
  Used R², MAE, and RMSE to measure accuracy and reliability.
- What did you learn?
  - How to design and deploy a full ML system from scratch.
  - The importance of data preprocessing and scaling.
  - How to interpret a real-world regression model through its feature weights.
🚀 Demo & Access
🎥 Video Demo: House-Prediction.webm
🔗 Live App: https://housepredictionapp07.streamlit.app/
🧠 Model Files:
- AIModel_For_House.pkl
- scaler.pkl
👨‍💻 Author
Mirza Yasir Abdullah Baig
⚠️ Disclaimer
This project is for educational and demonstration purposes only.
It is not intended for commercial real estate use without further professional validation.