This script reads from 'config.txt' the month names of the files to be preprocessed. Each preprocessed file is renamed with a 'proc_' prefix, and all processed files are then merged into a single file, 'RS_data.csv'. Lastly, the file 'days_scheduler.txt' is created from the first and last timestamps. To run the preprocessing and create the scheduler:
bash s3_bucket/preprocessing.sh
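The scheduler step can be sketched as follows. This is a hypothetical illustration of what create_scheduler.py might do (the function name, dates, and output format are assumptions): given the first and last timestamps of the merged data, it writes one day per line to 'days_scheduler.txt'.

```python
# Hypothetical sketch of create_scheduler.py: derive one line per day
# between the first and last timestamps of the merged data.
from datetime import date, timedelta

def build_day_schedule(first_day: date, last_day: date) -> list:
    """Return one ISO-formatted date string per day, inclusive."""
    days = []
    current = first_day
    while current <= last_day:
        days.append(current.isoformat())
        current += timedelta(days=1)
    return days

if __name__ == "__main__":
    # Example dates are placeholders; the real script would read them
    # from the first and last timestamps in RS_data.csv.
    schedule = build_day_schedule(date(2019, 10, 1), date(2019, 12, 31))
    with open("days_scheduler.txt", "w") as f:
        f.write("\n".join(schedule))
```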
This script iterates through each day (one row in 'days_scheduler.txt') and extracts the rows corresponding to that day to train the model. After each training run, the script sleeps for 10 minutes and then trains on the data for the next day.
bash s3_bucket/train_scheduler.sh
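The per-day loop above can be sketched in Python. This is an illustrative outline, not the actual implementation: the function names, the `timestamp` column name, and the file paths are assumptions.

```python
# Hypothetical sketch of the loop driven by train_scheduler.sh: extract
# the rows of RS_data.csv whose timestamp falls on a given day, train on
# them, then wait 10 minutes before moving to the next day.
import csv
import time

def rows_for_day(csv_path: str, day: str, ts_column: str = "timestamp") -> list:
    """Return the rows whose timestamp starts with the given ISO day."""
    with open(csv_path, newline="") as f:
        return [row for row in csv.DictReader(f) if row[ts_column].startswith(day)]

def run_schedule(schedule_path: str, csv_path: str, pause_s: int = 600) -> None:
    with open(schedule_path) as f:
        days = [line.strip() for line in f if line.strip()]
    for day in days:
        daily_rows = rows_for_day(csv_path, day)
        # train.py would be invoked here on daily_rows
        time.sleep(pause_s)  # 10-minute pause between daily trainings
```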
- run_preprocessing.sh
- train_scheduler.sh
- Oct.csv, Nov.csv, Dec.csv (small samples, to be replaced with the full files)
- config.txt
- create_scheduler.py
- preprocessing.py
- train.py
The project is built upon Python 3.8 using the PySpark package.
We recommend installing Anaconda, which comes bundled with many useful modules and tools, such as virtual environments.
After Anaconda is installed, you can install Python's dependencies with:
pip install -r requirements.txt
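The actual requirements.txt ships with the repository; as a reference point only, a minimal file for this project would at least list PySpark (any version pin would be an assumption, so none is given here):

```
pyspark
```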
At this point you should have the correct environment to interact with the scripts in this project.
Make sure the dependencies are installed, then simply type:
python train.py
To build the container, from the root folder (the one containing the Dockerfile, requirements.txt, etc.) type:
bash scripts/docker_build.sh
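For orientation, a minimal Dockerfile consistent with the layout described above might look like the following. The real Dockerfile ships with the repository; the base image, paths, and default command here are assumptions.

```
# Hypothetical minimal Dockerfile (the real one ships with the repo).
# Note: PySpark also needs a Java runtime, which python:3.8-slim does
# not include, so the actual image likely installs one as well.
FROM python:3.8-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
CMD ["python", "train.py"]
```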
To run the container, from the root folder type:
bash scripts/docker_run.sh