The goal of this project is to build a model that predicts electricity consumption, which is stored in the KWH field of the dataset.
The dataset contains information on energy costs and usage for heating, cooling, appliances and other end uses, collected from a sample of housing units.
The dataset is taken from link.
(Number of rows: approx. 12000,
number of columns: approx. 940)
- Google Colab
- Random Forest Regressor
- Pandas
- Numpy
- Matplotlib
- Seaborn
- Scikit-learn
- Data understanding
- Data exploration
- Data preparation
- One-Hot Encoding the categorical columns
- Handling NaN values
- Removing the unnecessary columns
- Assumptions and considerations:
- Columns starting with 'Z' are imputation flags for other variables. They are removed because they do not contribute to the prediction.
- Columns expressed in a thermal unit other than KWH are assumed to be unhelpful and are therefore removed.
- Columns that hold the total electricity consumption of a group of elements are redundant, because the individual contributions of those elements are already present in the data. They are removed to avoid redundancy.
- One-Hot Encoding the categorical columns
- Data Analysis
- Finding the correlation of features with output variable and visualizing
- Random Forest Regressor
- Using GridSearchCV for selecting optimal hyperparameters for the model
- Choosing important features by calculating feature importances
- Using GridSearchCV for selecting optimal hyperparameters for the model
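The preparation and correlation steps listed above can be sketched roughly as follows. This is a minimal illustration on a tiny made-up table, not the project's actual notebook: every column name except `KWH` is hypothetical, the `BTU` prefix used to spot non-KWH columns is an assumption, and median filling is only one of several reasonable ways to handle NaN values.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the real data; all names except KWH are hypothetical.
df = pd.DataFrame({
    "KWH": [9000.0, 12500.0, 7000.0, 15000.0],               # target
    "BTU_ELEC": [30700.0, 42600.0, 23900.0, 51200.0],        # non-KWH unit column
    "Z_SQFT": [0, 1, 0, 0],                                  # imputation flag
    "SQFT": [1500.0, np.nan, 900.0, 2600.0],
    "HOUSE_TYPE": ["detached", "apartment", "apartment", "detached"],
})

# 1. Drop imputation-flag columns (names starting with 'Z').
flag_cols = [c for c in df.columns if c.startswith("Z")]
# 2. Drop columns in a unit other than KWH (assumed 'BTU' prefix).
unit_cols = [c for c in df.columns if c.startswith("BTU")]
clean = df.drop(columns=flag_cols + unit_cols)

# 3. Handle NaN values: fill numeric gaps with the column median.
num_cols = clean.select_dtypes(include="number").columns
clean[num_cols] = clean[num_cols].fillna(clean[num_cols].median())

# 4. One-hot encode the categorical columns.
cat_cols = list(clean.select_dtypes(exclude="number").columns)
encoded = pd.get_dummies(clean, columns=cat_cols, dtype=float)

# 5. Correlation of every feature with the output variable.
corr = (
    encoded.drop(columns="KWH")
    .corrwith(encoded["KWH"])
    .sort_values(ascending=False)
)
print(corr)
```

On the real data, the sorted `corr` series could be passed to a Seaborn or Matplotlib bar plot for the visualization step.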
About 14 features from the entire dataset were found to contribute most to electricity consumption. They were identified after several steps of data cleaning, processing and feature engineering.
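One way such a short list of features could be obtained from feature importances is sketched below. It runs on synthetic data from `make_regression`, so the selected indices mean nothing in themselves. The cut-off `top_k = 14` simply mirrors the count mentioned above, and the forest settings are placeholders rather than the project's tuned values.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for the prepared feature matrix and the KWH target.
X, y = make_regression(
    n_samples=300, n_features=30, n_informative=5, noise=5.0, random_state=0
)

# Fit a forest and read its impurity-based feature importances.
model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X, y)

# Rank features from most to least important and keep the top k.
top_k = 14  # assumption: mirrors the "about 14 features" in the text
order = np.argsort(model.feature_importances_)[::-1]
top_idx = order[:top_k]
X_top = X[:, top_idx]

print("Selected feature indices:", top_idx)
```

Impurity-based importances can favor high-cardinality features, so a cross-check with permutation importance may be worthwhile before fixing the final list.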
The Random Forest Regressor gives a fair prediction of consumption in kilowatt-hours (KWH), with an R² score of 0.875. With more data exploration and manipulation, a better-optimised prediction may be obtained.
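The tuning and scoring described above might look like the sketch below. It uses synthetic data, so it will not reproduce the 0.875 figure. The parameter grid, the 80/20 split and the 3-fold cross-validation are illustrative assumptions, not the settings used in the project.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the selected features and the KWH target.
X, y = make_regression(
    n_samples=300, n_features=14, n_informative=8, noise=10.0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Small illustrative grid; a real search would likely cover more values.
param_grid = {"n_estimators": [50, 100], "max_depth": [None, 10]}
search = GridSearchCV(
    RandomForestRegressor(random_state=0), param_grid, cv=3, scoring="r2"
)
search.fit(X_train, y_train)

# Evaluate the refitted best model on data not seen during the search.
best_params = search.best_params_
test_r2 = r2_score(y_test, search.predict(X_test))
print("Best parameters:", best_params)
print("Held-out R2:", round(test_r2, 3))
```

Reporting R² on a held-out split, rather than on the training data, keeps the score from being inflated by overfitting.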
Other models, such as neural networks, could also be tried for this prediction.
The features can be explored in more depth with further EDA and with libraries such as FeatureSelector, both to refine the feature importance analysis and to improve the model.
- https://towardsdatascience.com/a-feature-selection-tool-for-machine-learning-in-python-b64dd23710f0
- https://machinelearningmastery.com/feature-selection-with-real-and-categorical-data/
- https://www.youtube.com/watch?v=ioXKxulmwVQ&t=0s
- https://machinelearninghd.com/gridsearchcv-hyperparameter-tuning-sckit-learn-regression-classification/
- https://towardsdatascience.com/improving-random-forest-in-python-part-1-893916666cd