“Zestimates” are estimated home values computed by 7.5 million statistical and machine learning models that analyze hundreds of data points on each property. By continually improving the median margin of error (from 14% at the onset to 5% today), Zillow has established itself as one of the largest, most trusted marketplaces for real estate information in the U.S. and a leading example of impactful machine learning.
I am aiming to answer the following questions:
Question 1: Have houses gotten bigger over the years in California?
Question 2: Is there seasonality in the transactions?
Question 3: Where are the underestimated/overestimated houses located?
Question 4: Where are the historic houses located?
Finally, using linear regression, I will predict the log error.
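For reference, the target here follows the Zillow Prize definition of log error: the log-scale gap between the Zestimate and the actual sale price,

logerror = log(Zestimate) − log(SalePrice)

so a positive log error means the Zestimate overestimated the sale price, and a negative one means it underestimated it.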
Step 1: Understand the data provided by looking at summary statistics alongside the given data definitions, to make sense of what each data series (column) represents: whether it is continuous, binary or categorical, what values it takes, and whether it is affected by duplicates and nulls.
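As a first pass, a pandas sketch along these lines covers those checks (the file and column names follow the public Zillow Prize dataset and may differ in your copy):

```python
import pandas as pd

# Assumed file names from the Zillow Prize dataset; adjust to your copy.
props = pd.read_csv("properties_2016.csv")
train = pd.read_csv("train_2016_v2.csv", parse_dates=["transactiondate"])
df = train.merge(props, on="parcelid", how="left")

# Summary statistics and basic structure checks
print(df.describe())    # ranges, means and quartiles of numeric columns
print(df.dtypes)        # hints at continuous vs. binary vs. categorical
print(df.isnull().mean().sort_values(ascending=False).head(20))  # null share per column
print(df.duplicated(subset="parcelid").sum())                    # duplicate parcels
```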
Step 2: Based on the findings from the previous step, clean the data, replacing, converting or deleting nulls and duplicates as appropriate.
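A minimal cleaning sketch, continuing from the load above (the 60% null threshold and the median fill are illustrative choices, not rules from the dataset):

```python
# Drop columns that are mostly empty (threshold is an assumption)
null_share = df.isnull().mean()
df = df.drop(columns=null_share[null_share > 0.6].index)

# Fill remaining numeric nulls with the column median
num_cols = df.select_dtypes("number").columns
df[num_cols] = df[num_cols].fillna(df[num_cols].median())

# Keep one row per parcel per transaction date
df = df.drop_duplicates(subset=["parcelid", "transactiondate"])
```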
Step 3: Based on the given data, and with the ultimate aim of predicting house prices in mind, ask a few questions that may reveal interesting trends or useful insights. We plan to use visualisation to help answer them.
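For Questions 1 and 2, two quick plots of this kind would do (column names again assumed from the Zillow Prize data):

```python
import matplotlib.pyplot as plt

# Q1: median finished square feet by year built
df.groupby("yearbuilt")["calculatedfinishedsquarefeet"].median().plot()
plt.xlabel("Year built")
plt.ylabel("Median finished square feet")
plt.show()

# Q2: transaction counts by calendar month, to look for seasonality
df["transactiondate"].dt.month.value_counts().sort_index().plot(kind="bar")
plt.xlabel("Month")
plt.ylabel("Transactions")
plt.show()
```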
Step 4: Use visualisation to further study the distribution of the potential key independent variables indicated by the correlation matrix. We will also check for a linear relationship between the independent variables and the dependent variable via scatter plots, and look out for any obvious outliers in the graphs.
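A sketch of both checks, with an illustrative candidate list (swap in whichever columns the correlation matrix actually flags):

```python
# Correlation of candidate predictors with the target
candidates = ["calculatedfinishedsquarefeet", "bathroomcnt",
              "bedroomcnt", "taxvaluedollarcnt", "yearbuilt"]
print(df[candidates + ["logerror"]].corr()["logerror"].sort_values())

# Scatter plot to eyeball linearity and outliers for one candidate
df.plot.scatter(x="calculatedfinishedsquarefeet", y="logerror", alpha=0.2)
plt.show()
```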
Step 5: Fit and evaluate the predictive model, including checking the model assumptions.
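A fitting-and-diagnostics sketch using statsmodels (one reasonable choice; the same fit could equally be done with scikit-learn), continuing from the candidate list above:

```python
import statsmodels.api as sm
from sklearn.model_selection import train_test_split

X, y = df[candidates], df["logerror"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = sm.OLS(y_train, sm.add_constant(X_train)).fit()
print(model.summary())  # coefficients, R-squared, p-values

# Assumption check: residuals vs. fitted values (homoscedasticity)
plt.scatter(model.fittedvalues, model.resid, alpha=0.2)
plt.axhline(0, color="red")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()

# Assumption check: normality of residuals via a Q-Q plot
sm.qqplot(model.resid, line="s")
plt.show()
```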
Step 6: Interpret the model, give examples of how the independent variables relate to the dependent variable, and show how the model can be used to predict it.
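Continuing from the fitted model above, interpretation and prediction then come down to a couple of lines:

```python
# Each coefficient is the expected change in logerror for a one-unit
# increase in that predictor, holding the others fixed.
print(model.params)

# Predict logerror for held-out properties (constant added to match training)
preds = model.predict(sm.add_constant(X_test))
print(preds.head())
```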
Step 7: Draw further insights from the data, bearing in mind the limitations of the model.