Project Overview

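This project runs machine learning experiments for protein analysis. It trains classical models (KNN, logistic regression, random forest) on combined protein embeddings and evaluates them with configurable data splits and nested cross-validation.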

Configuration Parameters

The configuration file allows customization of various aspects of the experiment:

random_state: Controls the random seed.
test_size: Defines the test set size. Use 0.0 to skip the initial train/test split.
n_jobs: Number of processors to use (1 for a single processor, -1 for all processors).
use_GPU: Enables GPU usage if available.
nested_cv_outer_splits / nested_cv_inner_splits: Sets the number of splits for nested cross-validation.
make_protein_level_splits: Specifies if splits should be made at the protein level.
per_fold: If True, computes metrics for each fold separately and then averages them.
print_debug_messages: Enables detailed debugging messages.
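
To illustrate what per_fold changes, the sketch below compares averaging per-fold scores with scoring all folds pooled together. It is not the repository's code: the helper names (accuracy, score_folds), the use of accuracy as the metric, and the pooled behaviour for per_fold=False are assumptions made for illustration only.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def score_folds(folds, per_fold=True):
    """Score a list of (y_true, y_pred) tuples, one tuple per CV fold.

    per_fold=True : score each fold separately, then average the scores.
    per_fold=False: pool all predictions and score once (assumed behaviour).
    """
    if per_fold:
        scores = [accuracy(y_true, y_pred) for y_true, y_pred in folds]
        return sum(scores) / len(scores)
    all_true = [label for y_true, _ in folds for label in y_true]
    all_pred = [label for _, y_pred in folds for label in y_pred]
    return accuracy(all_true, all_pred)


# Two folds of different sizes (made-up labels).
folds = [
    ([1, 0, 1, 1], [1, 0, 0, 1]),  # 3 of 4 correct
    ([0, 1], [0, 1]),              # 2 of 2 correct
]

per_fold_score = score_folds(folds, per_fold=True)   # (0.75 + 1.0) / 2
pooled_score = score_folds(folds, per_fold=False)    # 5 correct out of 6
print(per_fold_score, pooled_score)
```

When folds differ in size, the two values differ because per-fold averaging weights every fold equally, while pooling weights every sample equally.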

Two example configuration files, used to obtain the results reported in the article, are provided: experiment_random_split.py and experiment_unseen_protein_split.py.
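
As a rough sketch, a configuration file using the parameters listed above might look like the following. The parameter names come from this README; every value is a hypothetical placeholder, and treating test_size as a fraction of the dataset is an assumption. Check the two example files for the values actually used.

```python
# Hypothetical configuration sketch: names from the README, values assumed.
random_state = 42                  # random seed (placeholder value)
test_size = 0.2                    # assumed to be a fraction; 0.0 skips the initial split
n_jobs = -1                        # -1 uses all processors, 1 uses a single one
use_GPU = False                    # set to True to use the GPU if available
nested_cv_outer_splits = 5         # outer folds of the nested cross-validation (placeholder)
nested_cv_inner_splits = 3         # inner folds of the nested cross-validation (placeholder)
make_protein_level_splits = True   # make splits at the protein level
per_fold = True                    # compute metrics per fold, then average them
print_debug_messages = False       # detailed debugging output
```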

Available Models

You can select classic machine learning models via the models_to_exec parameter:

KNN: k-nearest neighbors
LR: logistic regression classifier
RF: random forest classifier

Embedding Combinations

To handle different embedding representations, the following combinations are offered via embeddings_combinators:

ConcatEmbeddings: Concatenates embeddings.
AddEmbeddings: Sums the values of embeddings.
MultiplyEmbeddings: Multiplies the values of embeddings.
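
The three combinations can be sketched on two small embedding vectors as shown below. The function names are hypothetical stand-ins and do not reproduce the interface of the ConcatEmbeddings, AddEmbeddings, or MultiplyEmbeddings classes; plain Python lists are used only to keep the example self-contained.

```python
def concat_embeddings(a, b):
    """Join two embeddings end to end (output length = len(a) + len(b))."""
    return list(a) + list(b)


def add_embeddings(a, b):
    """Element-wise sum; both embeddings must have the same length."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    return [x + y for x, y in zip(a, b)]


def multiply_embeddings(a, b):
    """Element-wise product; both embeddings must have the same length."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    return [x * y for x, y in zip(a, b)]


# Made-up embeddings for illustration.
emb_a = [1.0, 2.0, 3.0]
emb_b = [0.5, 0.5, 2.0]

print(concat_embeddings(emb_a, emb_b))    # [1.0, 2.0, 3.0, 0.5, 0.5, 2.0]
print(add_embeddings(emb_a, emb_b))       # [1.5, 2.5, 5.0]
print(multiply_embeddings(emb_a, emb_b))  # [0.5, 1.0, 6.0]
```

Concatenation doubles the feature dimension and depends on the order of the inputs, whereas addition and multiplication keep the original dimension and give the same result in either order.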

Example Usage

Set up the experiment: Edit the EXPERIMENT CONFIGURATION section in analysis.py.
Run the experiment: Use the following command to run the analysis with the selected configuration and, optionally, a name for the experiment log:
```
python analysis.py experiment_unseen_protein_split [experiment_name]
```
The first argument is the name of the configuration file to use, without the .py extension. The second argument is optional and gives an additional name for the experiment logs folder.
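
For readers curious how a script can accept a configuration name in this way, here is a minimal sketch. It is an assumption about one possible approach, not the actual contents of analysis.py; parse_cli and load_config are hypothetical helpers.

```python
import importlib


def parse_cli(argv):
    """Split command-line arguments into (config_name, experiment_name)."""
    if len(argv) < 2:
        raise SystemExit("usage: python analysis.py <config_name> [experiment_name]")
    config_name = argv[1]
    # The experiment name is optional.
    experiment_name = argv[2] if len(argv) > 2 else None
    return config_name, experiment_name


def load_config(config_name):
    """Import a configuration module such as experiment_random_split by name."""
    return importlib.import_module(config_name)


# Simulated argument list; a real script would pass sys.argv instead.
config_name, experiment_name = parse_cli(
    ["analysis.py", "experiment_unseen_protein_split", "unseen_protein_split"]
)
print(config_name, experiment_name)
```

Because the configuration is imported as a Python module, its name is given without the .py extension, which matches the command shown above.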

By default, Python buffers its output, so stdout logs may be held back for a while, especially when stdout is written to a file (e.g. with nohup), where large buffers are used. To see the logs as soon as they are produced, run Python with the -u (unbuffered) option. For example:
```
python -u analysis.py experiment_unseen_protein_split unseen_protein_split
```

Creating the virtual environment

Python virtual environment

```
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

Conda

```
conda create -n rapids-24.02 -c rapidsai -c conda-forge -c nvidia cuml=24.02 python=3.10 cuda-version=11.8
```

Running with GPU

To run on GPU, activate the Conda environment (conda activate rapids-24.02) and set use_GPU = True in the configuration file.
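
Since the GPU is used only "if available", a script could check for the GPU library before relying on it. The sketch below is an assumption about how such a check might look, not the repository's logic: resolve_backend is a hypothetical helper, cuml is taken from the Conda command above, and scikit-learn as the CPU fallback is assumed.

```python
from importlib.util import find_spec


def resolve_backend(use_gpu):
    """Return "cuml" when GPU use is requested and cuML is installed, else "sklearn"."""
    if use_gpu and find_spec("cuml") is not None:
        return "cuml"
    # Fall back to the CPU when the GPU is not requested or cuML is missing.
    return "sklearn"


print(resolve_backend(use_gpu=False))
print(resolve_backend(use_gpu=True))
```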
