Predict a numerical target variable, with R^2 as the quality metric (yes, not the best metric).
The train set has 3263 samples with 145 features, including year (hence time-series cross-validation), and the test set has 1000 samples.
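For reference, the R^2 score used throughout can be computed from a vector of predictions as follows (a minimal sketch; `y` and `y_hat` are hypothetical vectors, not objects from this repo):

```r
# R^2 = 1 - SS_res / SS_tot: the fraction of target variance explained.
r_squared <- function(y, y_hat) {
  ss_res <- sum((y - y_hat)^2)        # residual sum of squares
  ss_tot <- sum((y - mean(y))^2)      # total sum of squares around the mean
  1 - ss_res / ss_tot
}

r_squared(c(1, 2, 3, 4), c(1, 2, 3, 4))  # perfect predictions give 1
```

Predicting the mean of `y` everywhere gives R^2 = 0, and a model worse than the mean can go negative.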
An exploratory chart of the target variable and one feature:
- Extensive exploratory analysis, mainly removing features that are pure noise or constant over time.
- A search for features correlated with peculiar peaks in the target time series: not all peaks in the target can be attributed to peaks (or dips) in the features.
- Selecting the existing features that matter and constructing new ones: transformations such as PCA and rolling means were tried but did not bring significant improvement to the model.
- The model is a simple glmnet trained with time-series cross-validation.
- A results folder contains, for each run, the training parameters and score encoded in the filename.
- Result: ~0.8969 R^2 (SD 0.0305) on local cross-validation and ~0.87 on the test set after final scoring.
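The time-series cross-validation for glmnet can be sketched as a rolling-origin scheme: train on all years up to some cutoff, validate on the next year. This is an illustrative sketch, not the actual training code; `df`, its `year` column, and the target name `y` are assumptions about the data layout (note that `cv.glmnet`'s inner lambda selection uses random folds, a simplification here):

```r
library(glmnet)

# Rolling-origin CV: for each validation year, fit on strictly earlier years.
ts_cv_r2 <- function(df, target = "y") {
  years <- sort(unique(df$year))
  scores <- sapply(years[-(1:2)], function(val_year) {  # keep >= 2 years of history
    train <- df[df$year < val_year, ]
    valid <- df[df$year == val_year, ]
    feats <- setdiff(names(df), target)
    fit   <- cv.glmnet(as.matrix(train[, feats]), train[[target]])
    pred  <- predict(fit, as.matrix(valid[, feats]), s = "lambda.min")
    # fold-level R^2 against the validation year's own mean
    1 - sum((valid[[target]] - pred)^2) /
        sum((valid[[target]] - mean(valid[[target]]))^2)
  })
  c(mean = mean(scores), sd = sd(scores))  # matches the mean/SD reporting above
}
```

Averaging the fold scores and their standard deviation mirrors the "~0.8969 R^2 (SD 0.0305)" style of result reported above.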
- init.R - reads in the data
- explore.R - automatically builds many charts: scatter plots and feature correlations
- imp_features.R - trains a model with all features and ranks the most important ones
- exp_features.R - experiments with one-hot-encoded (OHE) features (proved not very useful)
- train_glmnet.R - trains the final model and saves predictions to .csv
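The pruning of constant and pure-noise features described in the exploration step can be sketched as below. This is a hypothetical helper, not the actual explore.R code; the correlation threshold of 0.02 is an illustrative assumption:

```r
# Drop features that are constant, then features whose absolute correlation
# with the target falls below a small threshold (treated as noise).
prune_features <- function(df, target = "y", min_abs_cor = 0.02) {
  feats    <- setdiff(names(df), target)
  constant <- feats[sapply(df[feats], function(v) length(unique(v)) <= 1)]
  feats    <- setdiff(feats, constant)
  noise    <- feats[abs(sapply(df[feats], cor, y = df[[target]])) < min_abs_cor]
  df[, c(setdiff(feats, noise), target)]
}
```

In practice a noise check based only on marginal correlation is crude (it misses nonlinear or interaction effects), which is why the importance ranking from imp_features.R complements it.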