ML Prep and Train Scripts

This repository contains a collection of Python scripts designed to streamline data analysis, preprocessing, and binary classification modeling.

Note: The scripts in this repository are a work in progress and may not be perfect. I'm aware of potential improvements, and your insights and contributions are welcome! If you find something that could be enhanced or have ideas for new features, feel free to open an issue or submit a clear and concise Pull Request.

Scripts Overview

`analyze_dataset.py`

Description

The analyze_dataset.py script provides functionality to analyze a given CSV file, offering insights into the structure and statistics of the dataset. It covers essential details such as:

Total number of columns
Total number of rows
Number of duplicate rows
Information about each column, including data type, unique values, missing values, and more
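
To make the list above concrete, here is a minimal sketch of how such statistics can be computed. It is not the code of analyze_dataset.py: it uses only the Python standard library, and the function names (`analyze_rows`, `infer_type`), the report keys, and the rule that an empty cell counts as missing are all assumptions made for illustration.

```python
import csv
import io
from collections import Counter


def infer_type(values):
    """Guess a coarse type label from string cell values (illustrative only)."""
    def all_parse(cast):
        try:
            for value in values:
                cast(value)
            return True
        except ValueError:
            return False

    if not values:
        return "empty"
    if all_parse(int):
        return "integer"
    if all_parse(float):
        return "float"
    return "string"


def analyze_rows(header, rows):
    """Summarize a table given its header and a list of row lists."""
    # Duplicate rows: every occurrence of an identical row beyond the first.
    row_counts = Counter(tuple(row) for row in rows)
    duplicates = sum(count - 1 for count in row_counts.values())

    columns = {}
    for index, name in enumerate(header):
        values = [row[index] for row in rows]
        # Assumption: an empty string marks a missing value.
        present = [value for value in values if value != ""]
        columns[name] = {
            "inferred_type": infer_type(present),
            "unique_values": len(set(present)),
            "missing_values": len(values) - len(present),
        }
    return {
        "total_columns": len(header),
        "total_rows": len(rows),
        "duplicate_rows": duplicates,
        "columns": columns,
    }


# Hypothetical in-memory CSV standing in for a file on disk.
sample_csv = "id,age,city\n1,34,Oslo\n2,,Bergen\n1,34,Oslo\n3,28.5,Oslo\n"
reader = csv.reader(io.StringIO(sample_csv))
header = next(reader)
report = analyze_rows(header, list(reader))
```

For the sample above, `report` records 3 columns, 4 rows, and 1 duplicate row, and the `age` column is reported as a float column with one missing value.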

Usage

You can run the script from the command line using the following syntax:

python analyze_dataset.py <path_to_csv_file> [OPTIONS]

Output

The script will generate a text file containing detailed information about the dataset's structure and characteristics. The output file will be named according to the source CSV file, appending _analysis.txt to the original name. For example, if the analyzed CSV file is named data.csv, the output text file will be named data_analysis.txt.
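
The naming rule can be expressed in a few lines. This is a sketch of the rule as described above, not the script's own code; the helper name `analysis_output_path` is hypothetical, and placing the report next to the source file is an assumption.

```python
from pathlib import Path


def analysis_output_path(csv_path):
    """Return the report path: the CSV's base name plus '_analysis.txt'."""
    source = Path(csv_path)
    # with_name keeps the parent directory; whether the script writes next to
    # the source file or somewhere else is an assumption in this sketch.
    return source.with_name(source.stem + "_analysis.txt")


print(analysis_output_path("data.csv"))  # data_analysis.txt
```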

Additional Help

For more detailed information on the usage of the script and the available options, you can run the script with the --help flag from the command line:

python analyze_dataset.py --help

`preprocess_data.py`

Description

The preprocess_data.py script provides a comprehensive tool to preprocess data according to a user-defined JSON configuration. It performs various data preprocessing tasks, including but not limited to:

Dropping unnecessary columns
Handling missing values
Encoding categorical features
Scaling numerical features
Splitting the data into training and testing sets
Handling imbalanced classes

The script takes a CSV file as input and produces processed training and testing datasets, allowing seamless integration into a machine learning workflow.
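
The sketch below illustrates what each of the listed steps does on a tiny in-memory table. It is written in plain Python for readability and is not taken from preprocess_data.py: the function names, the choice of mean imputation, one-hot encoding, min-max scaling, and random oversampling, and the sample columns are all assumptions. The real script is driven by the JSON configuration, whose format is not shown here.

```python
import random


def drop_columns(rows, names):
    """Remove the given columns from every row."""
    return [{k: v for k, v in row.items() if k not in names} for row in rows]


def fill_missing_with_mean(rows, column):
    """Replace None in a numeric column with the mean of the known values."""
    known = [row[column] for row in rows if row[column] is not None]
    mean = sum(known) / len(known)
    return [{**row, column: mean if row[column] is None else row[column]} for row in rows]


def one_hot_encode(rows, column):
    """Replace a categorical column with one 0/1 column per category."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != column}
        for category in categories:
            new_row[f"{column}_{category}"] = int(row[column] == category)
        encoded.append(new_row)
    return encoded


def min_max_scale(rows, column):
    """Rescale a numeric column to the range [0, 1]."""
    values = [row[column] for row in rows]
    low, high = min(values), max(values)
    span = (high - low) or 1.0  # avoid division by zero for constant columns
    return [{**row, column: (row[column] - low) / span} for row in rows]


def split_train_test(rows, test_fraction, seed):
    """Shuffle reproducibly and split into (train, test)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    test_size = round(len(shuffled) * test_fraction)
    return shuffled[test_size:], shuffled[:test_size]


def oversample_minority(rows, label, seed):
    """Duplicate random rows of smaller classes until all classes are equal in size."""
    by_class = {}
    for row in rows:
        by_class.setdefault(row[label], []).append(row)
    target = max(len(group) for group in by_class.values())
    rng = random.Random(seed)
    balanced = []
    for group in by_class.values():
        balanced.extend(group)
        balanced.extend(rng.choice(group) for _ in range(target - len(group)))
    return balanced


# Hypothetical sample data; column names are illustrative only.
rows = [
    {"id": 1, "age": 20.0, "sex": "male", "survived": 0},
    {"id": 2, "age": None, "sex": "female", "survived": 1},
    {"id": 3, "age": 40.0, "sex": "male", "survived": 0},
    {"id": 4, "age": 30.0, "sex": "female", "survived": 0},
]
rows = drop_columns(rows, {"id"})
rows = fill_missing_with_mean(rows, "age")
rows = one_hot_encode(rows, "sex")
rows = min_max_scale(rows, "age")
train, test = split_train_test(rows, test_fraction=0.25, seed=0)
train = oversample_minority(train, "survived", seed=0)
```

For brevity the sketch imputes and scales before splitting. In practice those statistics are normally computed on the training split only and then applied to the test split, so that no information from the test data leaks into training.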

Usage

You can run the script from the command line using the following syntax:

python preprocess_data.py <path_to_csv_file> <path_to_json_config> [OPTIONS]

Output

The script will create processed training and testing datasets ready for model training. The output files might include:

Training dataset
Testing dataset
Any additional files or visualizations based on the specific functionalities of the script

Additional Help

For more detailed information on the usage of the script and the available options, you can run the script with the --help flag from the command line:

python preprocess_data.py --help

`create_models.py`

Description

The create_models.py script is designed to create (train) models for binary classification tasks according to a user-specified JSON configuration. It supports a wide range of classification algorithms, including but not limited to:

Logistic Regression
Random Forest
Gradient Boosting
XGBoost
LightGBM
CatBoost
Neural Networks (using Keras)

The script leverages the configuration provided in the JSON file to set up the desired models with their training details.
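
As a rough picture of what training a binary classifier involves, here is logistic regression (the first algorithm in the list above) written from scratch in plain Python. This is only an illustration under stated assumptions: create_models.py is not shown in this README and presumably builds its models with the corresponding libraries, and the function names, learning rate, epoch count, and toy data below are all hypothetical.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic_regression(features, labels, learning_rate=0.5, epochs=2000):
    """Fit weights and a bias with batch gradient descent on the log loss."""
    n_samples, n_features = len(features), len(features[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for x, y in zip(features, labels):
            # Gradient of the log loss w.r.t. the linear output is (p - y).
            error = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for j in range(n_features):
                grad_w[j] += error * x[j]
            grad_b += error
        weights = [w - learning_rate * g / n_samples for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n_samples
    return weights, bias


def predict(weights, bias, x):
    """Return the predicted class (0 or 1) using a 0.5 probability threshold."""
    return int(sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) >= 0.5)


# Toy, linearly separable data: the label is 1 when the single feature is large.
train_x = [[0.0], [0.1], [0.2], [0.8], [0.9], [1.0]]
train_y = [0, 0, 0, 1, 1, 1]
weights, bias = train_logistic_regression(train_x, train_y)

# Evaluate on held-out points, mirroring the separate train and test files.
test_x, test_y = [[0.05], [0.95]], [0, 1]
accuracy = sum(predict(weights, bias, x) == y for x, y in zip(test_x, test_y)) / len(test_y)
```

The separate training and evaluation data mirror the script's interface, which takes a train file and a test file as distinct arguments.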

Usage

You can run the script from the command line using the following syntax:

python create_models.py <path_to_train_file> <path_to_test_file> <path_to_json_config>

Output

The script will create and save the trained models, possibly along with other information such as performance metrics, visualizations, or additional files based on the specific functionalities of the script.

Additional Help

For more detailed information on the usage of the script and the available options, you can run the script with the --help flag from the command line:

python create_models.py --help

Example Usage & Documentation

Comprehensive documentation for these scripts has not been written yet, though I hope to add it in the future. In the meantime, the examples folder provides a hands-on example to help you get started.

Inside this folder, you will find an additional README file that guides you through a step-by-step example of creating a model based on the well-known "Titanic Survival Datasets" available from Kaggle. This example should help illustrate how to utilize the scripts for your own datasets and projects.

The examples folder contains:

README with instructions
User configuration
Example dataset

Feel free to explore and adapt this example to your needs, and don't hesitate to reach out with any questions or suggestions.

Contributing

If you find areas for improvement or have ideas for enhancements, feel free to open an issue or submit a Pull Request!

Share Your Support

Like the project? Please give it a star ⭐

You can find more about starring here.

Contributors

Made with contrib.rocks.

License

GNU General Public License v3.0. See LICENSE for full details.
