This project can be used to modify almost any type of structured file and send the data to an endpoint through a POST request (typically a REST API request). The project is very flexible: you can tailor it to your needs through the .cfg file. You can also produce a simple statistics file about the data in your file before any data mining action.
These are the instructions to set up the project on your local environment. Steps 3 and 4 are optional.
I used Python 3.6 to run the commands in this project, so if you prefer (or already have installed) a Python 2.* version, a few mandatory edits are required.
In that case, just have a look at the comments in the code to fix the environment and you are ready to start.
Prerequisites
Install the third-party libraries used in the files:
- pandas
- requests
- xlrd
The json, time and datetime modules are also used, but they ship with Python's standard library and need no installation.
You can install them with the pip command:
- Windows: python -m pip install library-name
- Linux: pip install library-name
1. Git clone the repository into your folder:

   git clone https://github.com/your_username/data-ingestion.git
2. Copy project.cfg.example to project.cfg. Use the GENERAL section to set up the file info used in steps 3 and 4.
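As an illustration, a project.cfg along these lines could describe the input file and the columns to analyze. The option names below are assumptions made up for this sketch; the authoritative keys are in project.cfg.example:

```ini
; Hypothetical sketch of project.cfg -- option names are illustrative only,
; check project.cfg.example for the real keys.
[GENERAL]
input_file = files/input.xlsx
file_type = excel

[STATISTICSCOLS]
columns = age, income, city
```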
3. Launch the summury.py file to get basic statistics about the dataset or some particular columns. In the project.cfg file, set the columns you want to analyze using the STATISTICSCOLS section. It will generate a .csv file in the statistics folder, with a name established in the SUMMARY section:

   python summury.py
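The statistics step can be sketched with pandas: select the configured columns and write their summary statistics to a .csv file. The dataset, column names and output path below are placeholders invented for this example, not the project's actual configuration:

```python
import pandas as pd

# Placeholder dataset standing in for the configured input file.
df = pd.DataFrame({
    "age": [25, 31, 47, 52],
    "income": [30000, 45000, 62000, 58000],
})

# Columns listed in the (hypothetical) STATISTICSCOLS section.
stats_cols = ["age", "income"]

# describe() yields count, mean, std, min, quartiles and max per column.
summary = df[stats_cols].describe()
summary.to_csv("statistics_summary.csv")
```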
4. Launch the mining.py file to do the data mining and correct the dataset: changing column names, modifying values or dropping columns. In the project.cfg file, set the flag (in the GENERAL section) to decide whether any edits to the dataset are necessary, then set the columns you want to modify (MODIFIERSCOLS section), the values to change (MODIFIERSVALUES section), the columns to merge (MERGECOLS section) and the column(s) to drop. It will generate a .csv file named data.csv in the files folder; it will also produce some log files to trace every step:

   python mining.py
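The kinds of edits this step applies can be sketched with pandas. The column names and value mappings below are invented for illustration and do not come from the project's configuration:

```python
import pandas as pd

df = pd.DataFrame({
    "Name ": ["Anna", "Marco"],   # note the stray trailing space to clean up
    "country": ["IT", "it"],
    "unused": [1, 2],
})

# Rename columns (MODIFIERSCOLS-style edit).
df = df.rename(columns={"Name ": "name"})

# Normalize values (MODIFIERSVALUES-style edit).
df["country"] = df["country"].replace({"it": "IT"})

# Drop columns that should not be ingested.
df = df.drop(columns=["unused"])

df.to_csv("data.csv", index=False)
```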
5. Launch the ingestion.py file to finally send the data in your file to an endpoint (for example, an application of yours into which you want to feed the data). In the project.cfg file, set every parameter (in the INGESTION section) that is necessary to send the data. This step will generate an errors.csv file in the history_errors folder, with an incremental name composed of date_hour_minute, to keep every error file for reuse. It will also produce some log files to trace every step:

   python ingestion.py
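A minimal sketch of such an ingestion loop, written with only the standard library for portability (the project itself uses the requests library). The endpoint URL, payload shape and file name pattern are assumptions made for this example:

```python
import csv
import json
import urllib.error
import urllib.request
from datetime import datetime

API_URL = "https://example.com/api/items"  # hypothetical endpoint


def error_filename(now=None):
    """Name the errors file with date_hour_minute so every run is kept."""
    now = now or datetime.now()
    return now.strftime("errors_%Y%m%d_%H_%M.csv")


def send_row(row):
    """POST one record as JSON; return True on success, False on failure."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(row).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return 200 <= resp.status < 300
    except urllib.error.URLError:
        return False


def ingest(rows):
    """Send every row; write the failed ones to a timestamped errors file."""
    failed = [row for row in rows if not send_row(row)]
    if failed:
        with open(error_filename(), "w", newline="") as fh:
            writer = csv.DictWriter(fh, fieldnames=failed[0].keys())
            writer.writeheader()
            writer.writerows(failed)
    return failed
```

Keeping the failed rows in their own timestamped .csv means a later run can re-ingest only the records that did not make it through.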