vc_robot/README.md at master · zhitkovk/vc_robot · GitHub

VC ds contest

Task

Predict a numerical target variable, with R^2 as the quality metric (yes, not the best metric).
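For reference, R^2 is one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below is a plain base R version of that formula; the function name `r_squared` and the toy vectors are illustrative and not taken from the repository.

```r
# Coefficient of determination: 1 - SS_res / SS_tot.
r_squared <- function(actual, predicted) {
  ss_res <- sum((actual - predicted)^2)      # squared error of the model
  ss_tot <- sum((actual - mean(actual))^2)   # squared error of a mean-only baseline
  1 - ss_res / ss_tot
}

actual <- c(3, 5, 7, 9)
r_squared(actual, actual)                  # perfect predictions give 1
r_squared(actual, rep(mean(actual), 4))    # predicting the mean gives 0
r_squared(actual, c(9, 7, 5, 3))           # worse than the mean: -3
```

Note that on held-out data the value can be negative, since nothing forces the model to beat the mean-only baseline there.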

Data

The training set has 3263 samples of 145 features, including year (so time series cross-validation etc. applies); the test set has 1000 samples.
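Because the data has a year column, validation folds should respect time order: train on earlier years, validate on a later one. Below is a minimal base R sketch of expanding-window folds. The helper `make_time_folds`, its `min_train_years` parameter, and the toy `year` vector are hypothetical and only show the idea; the repository may split the data differently.

```r
# Build expanding-window folds: train on all years before a cutoff,
# validate on the cutoff year itself.
make_time_folds <- function(year, min_train_years = 3) {
  years <- sort(unique(year))
  stopifnot(length(years) > min_train_years)
  val_years <- years[(min_train_years + 1):length(years)]
  lapply(val_years, function(y) {
    list(train = which(year < y), valid = which(year == y))
  })
}

year <- rep(2001:2006, each = 2)   # toy data: 6 years, 2 rows per year
folds <- make_time_folds(year, min_train_years = 3)
length(folds)        # 3 folds: validate on 2004, 2005, 2006
folds[[1]]$train     # rows 1 to 6 (years 2001-2003)
folds[[1]]$valid     # rows 7 and 8 (year 2004)
```

Each fold only ever trains on rows that precede its validation year, so no future information leaks into the fit.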

[Figure: exploratory chart of the target variable and one feature]

Solution

Lots of exploratory work, mainly removing features that are pure noise or constant over time
Searching for features that correlate with the peculiar peaks in the target time series: not all peaks in the target can be attributed to peaks (or dips) in the features
Selecting the existing features that matter and trying to construct new ones: things like PCA and rolling mean feature transformations did not bring significant improvements to the model
The trained model is a simple glmnet with time series cross-validation
The results folder contains files with the training parameters and score in the filename.
Result: ~0.8969 R^2 (R^2 SD is 0.0305) on local cross-validation and ~0.87 on test after final scoring.
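The steps above can be sketched roughly as follows: drop constant features, fit glmnet on an expanding window of years, and report the mean and SD of R^2 across folds. This is a sketch under stated assumptions and not the repository's code: the data is synthetic, and the `alpha` and `lambda` values, the fold years, and the column names are placeholders. It assumes the `glmnet` package is installed.

```r
library(glmnet)  # assumed installed: install.packages("glmnet")

set.seed(1)
# Synthetic stand-in for the contest data (the real data is not shown here).
per_year <- 30
year <- rep(2001:2010, each = per_year)
n <- length(year)
x <- matrix(rnorm(n * 8), nrow = n, dimnames = list(NULL, paste0("f", 1:8)))
x <- cbind(x, const = 1)                       # a constant, uninformative column
y <- 2 * x[, "f1"] - x[, "f2"] + rnorm(n, sd = 0.5)

# Step 1: drop features that never change.
keep <- apply(x, 2, function(col) sd(col) > 0)
x <- x[, keep, drop = FALSE]

r_squared <- function(actual, predicted) {
  1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)
}

# Step 2: expanding-window validation, one fold per held-out year.
val_years <- 2006:2010
scores <- sapply(val_years, function(yr) {
  train <- year < yr
  valid <- year == yr
  # alpha and lambda are illustrative values, not tuned parameters
  fit <- glmnet(x[train, , drop = FALSE], y[train], alpha = 0.5, lambda = 0.05)
  pred <- predict(fit, newx = x[valid, , drop = FALSE], s = 0.05)
  r_squared(y[valid], as.numeric(pred))
})

mean(scores)  # average R^2 across folds
sd(scores)    # spread of R^2 across folds
```

Reporting the SD next to the mean, as the result line does, shows how much the score depends on which year is held out.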

Files breakdown

init.R - reads the data in
explore.R - automatically builds lots of charts, scatter plots, and feature correlations
imp_features.R - trains a model with all the features and ranks the most important ones
exp_features.R - experiments with one-hot encoded (OHE) features (proved not very useful)
train_glmnet.R - trains the final model and saves predictions to .csv
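One common way to rank features with a fitted glmnet model, in the spirit of what imp_features.R is described as doing, is to sort them by the absolute size of their coefficients scaled by each feature's standard deviation. The sketch below uses synthetic data and an illustrative `lambda`; whether the actual script ranks features this way is an assumption.

```r
library(glmnet)  # assumed installed

set.seed(42)
n <- 200
x <- matrix(rnorm(n * 5), nrow = n, dimnames = list(NULL, paste0("f", 1:5)))
# By construction only f1 and f3 drive the target.
y <- 3 * x[, "f1"] - 1.5 * x[, "f3"] + rnorm(n, sd = 0.5)

fit <- glmnet(x, y, alpha = 1, lambda = 0.1)   # lasso, illustrative lambda
beta <- as.matrix(coef(fit))[-1, 1]            # named vector, intercept dropped
importance <- abs(beta) * apply(x, 2, sd)      # effect of a one-SD change
ranking <- sort(importance, decreasing = TRUE)
names(ranking)[1:2]                            # the two strongest features
```

Scaling by the standard deviation matters because glmnet reports coefficients on the original scale of each feature, so raw coefficients are not directly comparable between features with different units.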