Judred is a peptide-specific molecular descriptor generator, loosely based on the Mordred Python package. The aim of this programme is to generate descriptors for peptides from their single-letter codes alone, providing a significantly faster way of generating descriptors for very large search spaces. It uses the HDF5 file format to store the data, as this format is supported by many programming languages and platforms.
This programme was first used in the Journal of Chemical Theory and Computation article "Beyond tripeptides": https://doi.org/10.1021/acs.jctc.1c00159
Generate the full dataset of zwitterionic dipeptides:
julia judred.jl 2
Generate the full dataset of zwitterionic tripeptides:
julia judred.jl 3
Generate the full dataset of zwitterionic tetrapeptides:
julia judred.jl 4
etc.
The image below shows example data (dipeptides) generated by the Judred programme.
The HDF5 database contains a matrix holding the value of each descriptor for each peptide (left), a list of descriptor names (middle), and the list of peptides (right). These can be recombined into, for example, a pandas DataFrame, or the matrix of values can be used on its own. A minimal loading sketch is given below.
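The following sketch recombines the three HDF5 datasets into a pandas DataFrame. The dataset keys used here ("values", "descriptor_names", "peptides") and the file name are assumptions for illustration; inspect the file (e.g. with `list(f.keys())`) to find the actual names.

```python
import h5py
import pandas as pd

# Sketch: rebuild a labelled table from the three HDF5 datasets.
# Dataset keys are assumed names -- check the file for the real ones.
with h5py.File("dipeptides.hdf5", "r") as f:
    values = f["values"][:]                                   # descriptor matrix, one row per peptide
    names = [n.decode() for n in f["descriptor_names"][:]]    # descriptor labels
    peptides = [p.decode() for p in f["peptides"][:]]         # single-letter peptide codes

df = pd.DataFrame(values, index=peptides, columns=names)
print(df.head())
```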
The unofficial version 2 (Judred_tiny.py) is written in Python and no longer uses the HDF5 file format. By removing the peptide labels and switching to the Apache Parquet format for the data output, the dataset size is reduced by a factor of 13. The labels can be dropped because the specific peptide that a row of parameters represents can be calculated from that row's position in the dataset, as sketched below.
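The sketch below shows one way to recover a peptide sequence from its row index, assuming the dataset enumerates all sequences in alphabetical order over the 20 canonical single-letter codes (i.e. the index is read as a base-20 number). The actual ordering used by Judred_tiny.py may differ, so this is illustrative only.

```python
# Assumption: peptides are enumerated alphabetically over the 20 canonical residues.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def index_to_peptide(index: int, length: int) -> str:
    """Interpret the row index as a base-20 number over the amino-acid alphabet."""
    residues = []
    for _ in range(length):
        index, r = divmod(index, len(AMINO_ACIDS))
        residues.append(AMINO_ACIDS[r])
    return "".join(reversed(residues))

print(index_to_peptide(0, 4))    # -> "AAAA" under this ordering
print(index_to_peptide(21, 2))   # -> "CC" (21 = 1*20 + 1)
```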
We have also added the isoelectric point (pI) as a parameter, as we found it to be a useful descriptor in an upcoming publication.
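For context, one common way to estimate a peptide's pI is to bisect the pH at which the Henderson-Hasselbalch net charge crosses zero. The sketch below uses a typical textbook pKa set purely for illustration; it is not necessarily the scheme or the pKa values used by Judred_tiny.py.

```python
# Illustrative pI estimate: bisection on the net charge vs. pH curve.
# pKa values below are a common textbook set, assumed for this sketch only.
PKA_POS = {"Nterm": 9.0, "K": 10.5, "R": 12.5, "H": 6.0}
PKA_NEG = {"Cterm": 2.0, "D": 3.9, "E": 4.1, "C": 8.3, "Y": 10.1}

def net_charge(peptide: str, ph: float) -> float:
    charge = 1.0 / (1.0 + 10 ** (ph - PKA_POS["Nterm"]))      # free N-terminus
    charge -= 1.0 / (1.0 + 10 ** (PKA_NEG["Cterm"] - ph))     # free C-terminus
    for aa in peptide:
        if aa in PKA_POS:
            charge += 1.0 / (1.0 + 10 ** (ph - PKA_POS[aa]))
        elif aa in PKA_NEG:
            charge -= 1.0 / (1.0 + 10 ** (PKA_NEG[aa] - ph))
    return charge

def isoelectric_point(peptide: str, lo: float = 0.0, hi: float = 14.0) -> float:
    # Net charge decreases monotonically with pH, so bisection converges to pI.
    for _ in range(50):
        mid = (lo + hi) / 2.0
        if net_charge(peptide, mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

print(round(isoelectric_point("DK"), 2))
```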
Since this is a complete dataset and the limits of every parameter are known in terms of min/max values, we have 'cooked in' scaling so that all values are saved between 0 (or -1) and 1; there is no longer any need to scale after loading the dataset. The new file-naming scheme is therefore "tetrapeptides_normalized.parquet" instead of the old "tetrapeptides.hdf5".
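A minimal sketch of the kind of scaling described above, assuming each descriptor column is divided by its known maximum absolute value so that non-negative descriptors land in [0, 1] and signed ones in [-1, 1]; the exact rule baked into Judred_tiny.py may differ in detail.

```python
import numpy as np
import pandas as pd

def normalize_columns(values: np.ndarray, abs_max: np.ndarray) -> np.ndarray:
    # values: (n_peptides, n_descriptors); abs_max: known |max| of each descriptor.
    # Assumed scaling rule for illustration only.
    return values / abs_max

# Because the Parquet files are written pre-normalized, they can be used directly:
df = pd.read_parquet("tetrapeptides_normalized.parquet")
```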
Another improvement is the addition of GPU calculations. The arguments for the programme are now:
python Judred_tiny.py [peptide_length] [chunk_size]
python Judred_tiny.py 4 10000
Here chunk_size refers to how many values are calculated simultaneously (or as many as your processors will allow). This is an important parameter, particularly for the GPU computation, because much of the 'processing' time is spent moving data to and from the GPU, so you should set the chunk size as large as your system will allow. A sketch of this chunked pattern follows.
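To illustrate why larger chunks help on the GPU, the sketch below processes the peptide space in chunks, moving each chunk to the device, computing on it, and copying the result back, so the per-chunk transfer overhead is amortised over more work. CuPy is used here as an illustrative GPU backend; it is an assumption, not necessarily the backend Judred_tiny.py actually uses, and the descriptor calculation is replaced by a stand-in.

```python
import numpy as np

try:
    import cupy as xp      # illustrative GPU backend (an assumption for this sketch)
    GPU = True
except ImportError:
    xp = np                # fall back to NumPy on machines without a GPU
    GPU = False

def process_in_chunks(n_peptides: int, chunk_size: int) -> np.ndarray:
    # Each iteration moves one chunk onto the device, computes there, and copies
    # the result back. The host<->device copies are per-chunk overhead, which is
    # why chunks as large as memory allows tend to run faster on the GPU.
    results = []
    for start in range(0, n_peptides, chunk_size):
        stop = min(start + chunk_size, n_peptides)
        indices = xp.arange(start, stop)                      # placeholder for real work
        chunk_out = xp.sqrt(indices.astype(xp.float64))       # stand-in "descriptor"
        if GPU:
            chunk_out = xp.asnumpy(chunk_out)                 # copy results back to the host
        results.append(chunk_out)
    return np.concatenate(results)

# e.g. the full tetrapeptide space (20**4 sequences) in chunks of 10000
out = process_in_chunks(20 ** 4, 10000)
```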
See the following for how this affects run time (chunk sizes of 10k/100k, CPU vs. GPU):