
Research Question and Data Overview

Our research question is how accurately the sale price of a house can be predicted from available information about the property. We used a dataset of homes sold in King County, Washington, between May 2014 and May 2015. The dataset contains 21 variables, which allows us to examine the factors that contribute to sale price from several angles. Our goal is a robust model that predicts house sale prices accurately and offers useful insight to both buyers and sellers in the real estate market.

Data Importing and Preprocessing

The dataset contains 21,613 records, and the target variable is sale price. During preprocessing, we identified four variables with missing values and imputed each one with the median value for the corresponding zip code. To keep the imputed values realistic, we rounded imputed bedroom counts to the nearest integer and imputed bathroom counts to the nearest 0.5. This preserved the integrity of the data ahead of analysis and modeling.
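The imputation step described above could be sketched roughly as follows. This is a minimal illustration on made-up data, not the project's actual code: the column names `zipcode`, `bedrooms`, and `bathrooms`, the helper `impute_group_median`, and its `step` parameter are all assumptions introduced for the example.

```python
import numpy as np
import pandas as pd


def impute_group_median(df, col, group_col, step):
    """Fill missing values in `col` with the median of their `group_col` group.

    Only the imputed entries are rounded to the nearest multiple of `step`;
    observed values are left untouched. (Hypothetical helper for illustration.)
    """
    missing = df[col].isna()
    # Per-group median, broadcast back to the original row order (NaNs skipped).
    group_median = df.groupby(group_col)[col].transform("median")
    filled = df[col].fillna(group_median)
    # Note: pandas rounds exact ties to the nearest even value.
    filled.loc[missing] = (filled.loc[missing] / step).round() * step
    return filled


# Toy data standing in for the real records (values are invented).
homes = pd.DataFrame({
    "zipcode":   [98001, 98001, 98001, 98001, 98001, 98002, 98002, 98002],
    "bedrooms":  [3, 4, 4, 5, np.nan, 2, 2, np.nan],
    "bathrooms": [1.0, 2.0, 2.25, 3.0, np.nan, 1.5, 1.75, np.nan],
})

clean = homes.copy()
clean["bedrooms"] = impute_group_median(homes, "bedrooms", "zipcode", step=1.0)
clean["bathrooms"] = impute_group_median(homes, "bathrooms", "zipcode", step=0.5)
print(clean)
```

Rounding only the imputed rows keeps recorded values such as 2.25 bathrooms intact. A zip code with no observed values at all would still be left missing under this sketch and would need a fallback, such as the overall median.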

Finding Missing Data

Handling Outliers

Normalizing the Target Variable

Correlation Matrix

Data Analytics - Models Used

In the data analytics phase, we used several algorithms to predict house sale price: Linear Regression (without an intercept), K-Nearest Neighbors, Decision Tree, Random Forest, and two gradient boosting implementations, XGBoost and LightGBM. To assess model performance on unseen data, we split the dataset into 80% for training and 20% for testing. The evaluation metric was the Root Mean Squared Error (RMSE), the square root of the mean squared difference between predicted and actual sale prices. RMSE is expressed in the same units as the target and penalizes large errors more heavily than small ones, so the model with the lowest test RMSE is the most accurate by this measure.
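As a rough sketch of the evaluation setup described above, the following compares a few of the listed model types on an 80/20 split using RMSE. It assumes scikit-learn and runs on synthetic data rather than the King County records. The hyperparameters, random seeds, and the `rmse` helper are illustrative choices, and XGBoost and LightGBM are left out only to keep the example free of extra dependencies.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the real features and target (assumption).
rng = np.random.default_rng(0)
n = 500
X = rng.uniform(0, 1, size=(n, 3))
y = 3 * X[:, 0] + 2 * X[:, 1] - X[:, 2] + rng.normal(0, 0.1, size=n)

# 80% training, 20% testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)


def rmse(y_true, y_pred):
    """Root Mean Squared Error: sqrt of the mean squared difference."""
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))


# Illustrative settings, not the ones used in the project.
models = {
    "linear_no_intercept": LinearRegression(fit_intercept=False),
    "knn": KNeighborsRegressor(n_neighbors=5),
    "decision_tree": DecisionTreeRegressor(random_state=42),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=42),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = rmse(y_test, model.predict(X_test))

# Lowest test RMSE first.
for name, score in sorted(scores.items(), key=lambda kv: kv[1]):
    print(f"{name}: RMSE = {score:.3f}")
```

Fitting every model on the same training split and scoring on the same held-out split keeps the RMSE values directly comparable.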