Predictive Analysis of Housing Data

The aim of this project is to estimate the Sale Price of residential properties in 2010, with the renovatable and non-renovatable features of the house as predictors. A pre-2010 housing dataset is provided for training the model. The underlying question: how can data science (descriptive and predictive analytics) be used to determine the best properties to buy and resell?

Project Outline

Problem Statement
Predict Sale Price and find the features that are most effective
- EDA
- Data Modelling
Find features that predict an Abnormal Sale condition
Conclusion

Problem Statement

Predict Sale Price and find the features that are most effective
- Develop an algorithm to reliably estimate the value of residential houses based on fixed characteristics.
- Identify characteristics of houses that the company can cost-effectively change/renovate with their construction team.
What property characteristics predict an “abnormal” sale?

Predict Sale Price and find the features that are most effective

EDA

Initial look at the dataset

Shape: 1460 rows, 81 columns. The 81 features are described here

The following categories of predictors are given in the dataset.

  Fixed characteristics, changeable characteristics, property characteristics, external surrounding characteristics,
  material used in the property, utilities in the property, condition of a property feature, quality factor of any property feature,
  the area of any property feature, number of rooms in the property, basement characteristics of the property,
  garage characteristics of the property, any year related to buying, selling or renovating the property,
  porch-related characteristics of the house, pool-related characteristics of the house,
  miscellaneous characteristics of the house, price-related details, continuous, categorical (make dummies)

There are 10 non-residential properties. These were dropped.
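A minimal sketch of how such a filter might look in pandas. The column name `MSZoning`, the zoning codes and the toy rows are assumptions for illustration; the post does not say which column or codes were used.

```python
import pandas as pd

# Toy stand-in for the housing data (made-up rows, not the real dataset).
houses = pd.DataFrame({
    "MSZoning": ["RL", "RM", "C (all)", "FV", "RH", "C (all)"],
    "SalePrice": [208500, 140000, 40000, 250000, 129500, 34900],
})

# Assumed residential zoning codes; anything else is treated as non-residential.
RESIDENTIAL_ZONES = {"RL", "RM", "RH", "RP", "FV"}

is_residential = houses["MSZoning"].isin(RESIDENTIAL_ZONES)
residential = houses[is_residential].reset_index(drop=True)

print(f"dropped {int((~is_residential).sum())} non-residential rows")
```

Keeping the boolean mask in its own variable makes it easy to report how many rows were removed before modelling.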

Next, take a look at the target, Sale Price. The best way is to plot its distribution, which also helps in checking for outliers.
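As a rough illustration of that check, the sketch below prints a text histogram and flags outliers with the 1.5 x IQR rule. The prices are made up, and the IQR rule is an assumption; the post does not state which outlier criterion was applied.

```python
import numpy as np

# Made-up sale prices for illustration only.
sale_price = np.array([129500, 140000, 143000, 181500, 208500,
                       223500, 250000, 307000, 345000, 755000])

# Text histogram: one '#' per house in each equal-width price bin.
counts, edges = np.histogram(sale_price, bins=5)
for count, left, right in zip(counts, edges[:-1], edges[1:]):
    print(f"{left:>9.0f} to {right:>9.0f}: {'#' * int(count)}")

# Flag values more than 1.5 interquartile ranges outside the quartiles.
q1, q3 = np.percentile(sale_price, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = sale_price[(sale_price < lower) | (sale_price > upper)]
print("outliers:", outliers.tolist())
```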

I examined the fixed characteristics of the house and, after the required feature engineering, selected the predictors and plotted a heatmap before modelling.

Predictors selected
Total number of rooms, total basement square feet
GarageArea, Age of the house at the time of sale
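The engineered age feature and the correlation matrix behind a heatmap could be sketched as follows. Apart from `GarageArea`, the column names (`YearBuilt`, `YrSold`, `TotRmsAbvGrd`, `TotalBsmtSF`, `AgeAtSale`) and the three rows are assumptions for illustration.

```python
import pandas as pd

# Toy rows; column names other than GarageArea are assumed.
houses = pd.DataFrame({
    "YearBuilt": [2003, 1976, 1915],
    "YrSold": [2008, 2007, 2006],
    "TotRmsAbvGrd": [8, 6, 7],
    "TotalBsmtSF": [856, 1262, 756],
    "GarageArea": [548, 460, 642],
})

# Engineered feature: age of the house at the time of sale, in years.
houses["AgeAtSale"] = houses["YrSold"] - houses["YearBuilt"]

predictors = ["TotRmsAbvGrd", "TotalBsmtSF", "GarageArea", "AgeAtSale"]
X = houses[predictors]

# Pairwise correlations: the matrix a heatmap would visualise.
print(X.corr().round(2))
```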

Data Modelling

A. Linear regression model results with selected predictors

The model was validated using 10-fold cross-validation and a train/test split.
I tried different numbers of folds, from 5 to 20, but the results did not vary much.
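A sketch of this validation setup with scikit-learn, run on synthetic data rather than the housing set. The data, the fold counts tried and the 70/30 split are assumptions; only the general approach (k-fold plus a hold-out split) comes from the text.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split

# Synthetic stand-in: 4 predictors with a known linear relationship plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = X @ np.array([3.0, 2.0, 1.0, -1.5]) + rng.normal(scale=0.5, size=200)

model = LinearRegression()

# Compare the mean cross-validated R^2 for several fold counts.
mean_r2_by_folds = {}
for n_folds in (5, 10, 20):
    cv = KFold(n_splits=n_folds, shuffle=True, random_state=0)
    scores = cross_val_score(model, X, y, cv=cv, scoring="r2")
    mean_r2_by_folds[n_folds] = scores.mean()
    print(f"{n_folds:>2} folds: mean R^2 = {scores.mean():.3f}")

# Separate train/test check on a held-out 30% of the rows.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)
holdout_r2 = model.fit(X_train, y_train).score(X_test, y_test)
print(f"hold-out R^2 = {holdout_r2:.3f}")
```

If the mean score barely moves between fold counts, as reported above, the estimate is probably not sensitive to how the data is partitioned.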

B. Linear regression model results after regularization

Lasso regularization provided better results than Ridge regularization.
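One way such a comparison might be set up is sketched below on synthetic data. The penalty strengths (`alpha=0.1` for Lasso, `alpha=1.0` for Ridge), the scaling step and the data are assumptions, and the scores it prints say nothing about the housing results.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data in which only 3 of the 8 predictors carry signal.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 8))
true_coef = np.array([4.0, 2.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0])
y = X @ true_coef + rng.normal(scale=1.0, size=200)

# Scale features first so the penalty treats every predictor equally.
models = {
    "lasso": make_pipeline(StandardScaler(), Lasso(alpha=0.1)),
    "ridge": make_pipeline(StandardScaler(), Ridge(alpha=1.0)),
}

# Mean 10-fold cross-validated R^2 for each regularized model.
cv_r2 = {
    name: cross_val_score(model, X, y, cv=10, scoring="r2").mean()
    for name, model in models.items()
}
for name, score in cv_r2.items():
    print(f"{name}: mean R^2 = {score:.3f}")
```

Lasso can shrink some coefficients exactly to zero while Ridge only shrinks them towards zero, which is one reason the two can rank differently on data with many weak predictors.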

C. Linear regression model with renovatable and non-renovatable features of the house as predictors

The R2 value is high, which could be a result of overfitting.
Evaluating the residual error is also important when judging the goodness of fit of a model.
The residual error is high, so the renovatable features may not be good predictors for the model.
The second model suggests that renovating the houses may not provide any profit; instead, it may even result in a net loss.
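To make the point about R2 versus residual error concrete, here is a small worked example with made-up prices and predictions. It computes both measures by hand to show that a high R2 can sit alongside errors that are large in currency terms.

```python
import numpy as np

# Made-up actual and predicted sale prices, for illustration only.
y_true = np.array([200000.0, 150000.0, 300000.0, 250000.0])
y_pred = np.array([210000.0, 140000.0, 280000.0, 270000.0])

residuals = y_true - y_pred

# R^2 = 1 - SS_res / SS_tot
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

# Root mean squared error, in the same units as the sale price.
rmse = np.sqrt(np.mean(residuals ** 2))

print(f"R^2 = {r2:.2f}, RMSE = {rmse:.0f}")
```

Here R2 comes out at 0.92 while the typical error is still close to 16,000, which could easily exceed the margin on a renovation.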

Find features that predict an Abnormal Sale condition

Lower sale prices are associated with more abnormal sale conditions.

Residential low-density zones have more sales overall and the most abnormal sales. This may not be a great predictor, though.

The suburb with the most abnormal sales is Briardale.
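Counts like these could be produced with a simple group-by, sketched below. The column names (`Neighborhood`, `SaleCondition`, `SalePrice`), the `Abnorml` label and the toy rows with placeholder suburb names are all assumptions for illustration.

```python
import pandas as pd

# Toy rows with placeholder suburb names; not the real dataset.
sales = pd.DataFrame({
    "Neighborhood": ["North", "North", "South", "East", "South", "North"],
    "SaleCondition": ["Abnorml", "Normal", "Abnorml", "Normal", "Normal", "Abnorml"],
    "SalePrice": [98000, 180000, 110000, 215000, 190000, 105000],
})

# Binary target: was the sale flagged as abnormal?
sales["IsAbnormal"] = sales["SaleCondition"] == "Abnorml"

# Number of abnormal sales per suburb, largest first.
abnormal_by_area = (
    sales.groupby("Neighborhood")["IsAbnormal"].sum().sort_values(ascending=False)
)

# Median sale price for abnormal versus other sales.
median_price = sales.groupby("IsAbnormal")["SalePrice"].median()

print(abnormal_by_area)
print(median_price)
```

Dividing the counts by the total number of sales per suburb would give a rate, which is a fairer comparison when some areas have far more sales than others.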

Conclusion

The best renovatable predictors selected after EDA: total number of rooms, total basement square feet and GarageArea.
Predictor added after feature engineering: age of the house at the time of sale.
These features explain 73% of the variance with a Linear regression model.
Renovating the houses may not provide any profit, instead it may even result in a net loss.
The most abnormal sales are found in the Briardale suburb, and houses with lower sale prices contribute more of them.

View project 3 code on GitHub

Written on April 29, 2017