This repository contains the solution file and dataset for the Analytics Vidhya JOB-A-THON. It includes the notebook with the method I used during the event to get the best solution I could.
In this contest, the dataset holds 9 years of hourly green energy consumption for a country named Green Energy. The task was to train a machine learning model on this data that predicts hourly green energy consumption for the following 3 years.
The dataset contains the following columns:
- datetime
- energy
The dataset has no null values.
- Plotting the values shows that energy consumption grows roughly linearly from year to year.
- The dataset also has some outliers, so they were removed.
- For prediction, I needed more features that could cover the 3-year prediction horizon, so I added seasonality-based features to the dataset.
- For that, I created 8 lag columns based on energy, with a lag of 1 year each.
- Using the datetime column, I extracted values such as 'year', 'day', 'dayofweek', 'month', 'weekofyear', etc. and added them to the dataset as features.
- Then I used TimeSeriesSplit from sklearn to split the data. Because time series data is sequential, the training and test sets must follow chronological order rather than being selected at random.
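
The feature steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the synthetic data, the helper names `add_calendar_features` and `add_yearly_lags`, the lag column names, and the choice of 365 * 24 rows per year (leap days ignored) are all assumptions. It also reads "8 columns with a lag of 1 year each" as lags of 1 to 8 years, which is one possible interpretation.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit

HOURS_PER_YEAR = 365 * 24  # assumption: one year = 8760 hourly rows, leap days ignored


def add_calendar_features(df):
    """Add calendar columns derived from the 'datetime' column."""
    out = df.copy()
    dt = out["datetime"].dt
    out["year"] = dt.year
    out["month"] = dt.month
    out["day"] = dt.day
    out["dayofweek"] = dt.dayofweek
    out["weekofyear"] = dt.isocalendar().week.astype(int)
    return out


def add_yearly_lags(df, n_lags=8):
    """Add n_lags columns holding 'energy' shifted back by 1..n_lags years."""
    out = df.copy()
    for k in range(1, n_lags + 1):
        out[f"energy_lag_{k}y"] = out["energy"].shift(k * HOURS_PER_YEAR)
    return out


# Synthetic stand-in for the contest data: 9 years of hourly values with a linear trend.
n = HOURS_PER_YEAR * 9
df = pd.DataFrame({
    "datetime": pd.date_range("2008-01-01", periods=n, freq=pd.Timedelta(hours=1)),
    "energy": np.linspace(1000.0, 2000.0, n),
})
features = add_yearly_lags(add_calendar_features(df))

# TimeSeriesSplit keeps each training fold strictly before its test fold.
tscv = TimeSeriesSplit(n_splits=5)
for fold, (train_idx, test_idx) in enumerate(tscv.split(features)):
    print(fold, train_idx.max(), test_idx.min())
```

Rows at the start of the series have no earlier year to look back on, so their lag columns are NaN and would need to be dropped or handled by the model.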
For modelling purposes I used two different algorithms:
- RandomForest
- XGBoost
The best result was achieved using XGBoost.
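
A comparison like this could be run with a small helper such as the one below. It is only a sketch under stated assumptions: the random data, the helper name `cv_rmse`, the hyperparameters, and RMSE as the metric are illustrative choices, not taken from the notebook or the contest rules. XGBoost is included only if the `xgboost` package is installed.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import TimeSeriesSplit


def cv_rmse(model, X, y, n_splits=5):
    """Mean RMSE over chronological folds (training data always precedes test data)."""
    scores = []
    for train_idx, test_idx in TimeSeriesSplit(n_splits=n_splits).split(X):
        model.fit(X[train_idx], y[train_idx])
        pred = model.predict(X[test_idx])
        scores.append(np.sqrt(mean_squared_error(y[test_idx], pred)))
    return float(np.mean(scores))


# Synthetic regression data standing in for the engineered features (an assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=600)

models = {"RandomForest": RandomForestRegressor(n_estimators=50, random_state=0)}
try:
    # Optional dependency: skipped when xgboost is not installed.
    from xgboost import XGBRegressor
    models["XGBoost"] = XGBRegressor(n_estimators=200, learning_rate=0.1, random_state=0)
except ImportError:
    pass

results = {name: cv_rmse(model, X, y) for name, model in models.items()}
for name, score in results.items():
    print(f"{name}: RMSE = {score:.3f}")
```

Scoring both models on the same chronological folds keeps the comparison fair, since neither model sees data from after its test window.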
Time series data is complex, and anyone using it for prediction should make sure that sufficient data is available. The prediction horizon must be chosen based on how much data there is. Time series data is not easy to work with, but it gives a lot of insight into real-world trends and patterns. XGBoost is one of the stronger algorithms for time series prediction, and it gave the most accurate predictions in this project.
Secured 18th rank on the leaderboard.