dependence on chemaxon is annoying #17
Comments
Agree, chemaxon's change of the academic license is inappropriate. I am now considering different options. Previously I tested Dimorphite-DL, pkasolver and chemaxon, and at first glance chemaxon performed better in complex cases. However, I would prefer a more systematic study. At least it would be reasonable to compare Dimorphite-DL and pkasolver. Unfortunately, this will take time, so it will not be fast, but we will try different alternatives. Actually, this could be an opportunity to develop something new |
It is still possible to disable chemaxon protonation ( |
As another hackish alternative, there is obabel's -p option (add hydrogens appropriate for a given pH) |
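For illustration only, a minimal sketch of calling that obabel workaround from Python; the file names are placeholders, and obabel's -p flag takes the target pH:

```python
import subprocess

# Protonate/deprotonate at pH 7.4 with Open Babel's -p option.
# "input.smi" and "protonated.smi" are placeholder file names.
subprocess.run(
    ["obabel", "input.smi", "-O", "protonated.smi", "-p", "7.4"],
    check=True,
)
```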
I implemented an easy way to integrate protonation tools. I also implemented dimorphite as an example in a separate branch. What I dislike about dimorphite:
I would prefer if dimorphite were converted into a package that could be installed by pip and run using . If someone can help with that, it will be faster to integrate dimorphite. @Feriolet, would you be able to help with that? |
Yes, these have been some of the pain points of using Dimorphite-DL for me as well. I have tried to run 250k compounds, but the . It is also annoying that the argparse of Dimorphite is problematic, so I can't use it as a package, and the current solution I have uses subprocess. I can try to fix the bug of Dimorphite_DL |
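For context, a minimal sketch of such a subprocess workaround; the script path and file names are assumptions, and the flags follow Dimorphite-DL's documented command-line options:

```python
import subprocess

def protonate_with_dimorphite(smi_file: str, out_file: str,
                              min_ph: float = 6.4, max_ph: float = 8.4):
    # Running dimorphite_dl.py in its own process keeps its argparse from
    # picking up the parent program's command-line arguments.
    subprocess.run(
        ["python", "dimorphite_dl.py",
         "--smiles_file", smi_file,
         "--output_file", out_file,
         "--min_ph", str(min_ph),
         "--max_ph", str(max_ph)],
        check=True,
    )
```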
I have found the reason for the argparse error for dimorphite_dl. The main reason dimorphite catches the main program's arguments is that its parser reads the command line directly. Since the params dict has to be converted into an argument list before parsing, I added:

```python
def convert_kwargs_to_sys_argv(params: dict) -> list:
    sys_argv_list = []
    for key, value in params.items():
        # Since some arguments are store_true flags, we don't need to include the 'True' value in sys.argv
        if value is True:
            sys_argv_list.append('--' + key)
        elif value is False:
            continue
        else:
            sys_argv_list.append('--' + key)
            sys_argv_list.append(str(value))
    return sys_argv_list
```

Then the main() function parses this list instead of the real command line:

```python
def main(params=None):
    """The main definition run when you call the script from the command line.

    :param params: The parameters to use. Entirely optional. If absent,
        defaults to None, in which case arguments will be taken from
        those given at the command line.
    :param params: dict, optional
    :return: Returns a list of the SMILES strings if the return_as_list parameter is
        True. Otherwise, returns None.
    """

    sys_argv_list = convert_kwargs_to_sys_argv(params)
    parser = ArgParseFuncs.get_args()
    args = vars(parser.parse_args(sys_argv_list))
    print(args)

    if not args["silent"]:
        print_header()

    # Add in any parameters in params.
    if params is not None:
        for k, v in params.items():
            args[k] = v

    # If being run from the command line, print out all parameters.
    if __name__ == "__main__":
        if not args["silent"]:
            print("\nPARAMETERS:\n")
            for k in sorted(args.keys()):
                print(k.rjust(13) + ": " + str(args[k]))
            print("")

    if args["test"]:
        # Run tests.
        TestFuncs.test()
    else:
        # Run protonation
        if "output_file" in args and args["output_file"] is not None:
            # An output file was specified, so write to that.
            with open(args["output_file"], "w") as file:
                for protonated_smi in Protonate(args):
                    file.write(protonated_smi + "\n")
        elif "return_as_list" in args and args["return_as_list"] == True:
            return list(Protonate(args))
        else:
            # No output file specified. Just print it to the screen.
            for protonated_smi in Protonate(args):
                print(protonated_smi)
```
|
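With that patch, calling it from Python could look like the sketch below; the import path and file names are assumptions, and the keys mirror Dimorphite-DL's command-line options:

```python
from dimorphite_dl import main  # assumed import path for the packaged fork

params = {
    "smiles_file": "ligands.smi",              # placeholder input file
    "output_file": "ligands_protonated.smi",   # placeholder output file
    "min_ph": 6.4,
    "max_ph": 8.4,
    "silent": True,                            # store_true flag, passed as '--silent'
}
main(params)  # writes the protonated SMILES to output_file
```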
Good first step. Thank you! I checked the code of Dimorphite for parallelization and can say it may not be the most difficult part. The code of the generator class looks quite complex, and I do not understand whether this complexity is necessary or not. |
Yes, it will take time for me to understand the code, as it uses multiple classes in huge chunks as well. |
I noticed that the EDIT: nvm it seems that a |
In theory it should be possible, but I do not have experience with such classes. I usually implement this kind of thing in a different way. A very simple example:

```python
with Pool(ncpu) as pool:
    with open(smi_filename) as f:
        for mol_id, protonated_smi in pool.imap_unordered(Protonate, f):
            if protonated_smi:
                update_db(mol_id, protonated_smi)
```

where … There are some examples of generators working with multiprocessing - https://codereview.stackexchange.com/questions/259029/compute-the-outputs-of-a-generator-in-parallel, but those implementations are also simpler. |
I have tried to understand how the iterator class works, and I think I have a few ideas on what to do. The dimorphite class needs to take the input filename, and the iterator is generated by the . The above code probably won't work because the . My idea is that we can either change how the . If we plan to change the code, we can probably make the :

```python
def convert_list_to_iter_in_iter(input_list: list):
    iter_list = [iter(str(element)) for element in input_list]
    return iter(iter_list)
```

then we can use |
The idea of multiple input files is very nice, because we will store all these data on disk anyway and it does not matter in how many files. We can create 2x-3x as many files as the number of CPUs submitted by a user on the command line (argument ). If the speed is too slow, we may think of using |
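A minimal sketch of that splitting idea; the function name and chunk-file naming are hypothetical, not from the branch:

```python
def split_smi_file(smi_filename: str, ncpu: int, factor: int = 2) -> list:
    # Split a SMILES file into roughly factor * ncpu chunk files and return their names.
    with open(smi_filename) as f:
        lines = f.readlines()
    nchunks = max(1, factor * ncpu)
    chunk_size = max(1, -(-len(lines) // nchunks))  # ceiling division
    chunk_names = []
    for i in range(0, len(lines), chunk_size):
        chunk_name = f"{smi_filename}.chunk{i // chunk_size}"
        with open(chunk_name, "w") as out:
            out.writelines(lines[i:i + chunk_size])
        chunk_names.append(chunk_name)
    return chunk_names
```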
You may create a package of dimorphite-dl under your account. Uploading it to PyPI is not necessary; we may specify in the README to install dimorphite directly from your repository. |
By installing, do you mean through ? Btw, using 1 CPU to protonate 250k compounds requires 1037 s. So I guess using ncpu would be sufficient if we want to scale up the protonation |
I mean |
Is Dimorphite ready to use in the branch ? |
Oh okay, but how is it different from using the dimorphite_example currently? I am forking that branch to modify the code to implement the current feature. I will try to make a pull request soon to fix the dimorphite implementation |
@Samuel-gwb, yes, but it uses one core, and for large sets this will take some time: 250k molecules ~ 20 min. @Feriolet, the difference will be that we can completely remove the directory and files of dimorphite from the easydock package. Dimorphite will be installed as an individual package and imported similarly to the current import (only the name of the package will change). If you need more details, please ask |
But we do still need to change the dimorphite source code, right? I feel that it is inconvenient for the user to install it as a separate package and then alter it by themselves. Or is there something that I missed about installing it in a different directory from easydock? Edit: nvm, I didn't read that separating them will solve the issue of argparse. Will do that! |
Yes, the source code should be modified to enable multiprocessing and correct treatment of command line arguments. A separate package will not solve the issue with command line arguments, so this should be explicitly fixed. Installation as a package is preferable and convenient.
|
I can't install dimorphite through pip.
|
Yes, of course. Therefore, it should be converted into a Python package to enable installation from pip. This is quite easy. You may check the structure of . In the |
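For illustration only, a minimal setup.py of the kind such a fork could use so that pip installation from the repository works; the package name, version, dependency, and module path are assumptions:

```python
from setuptools import setup, find_packages

setup(
    name="dimorphite_dl",
    version="1.2.4",              # assumed version
    packages=find_packages(),
    install_requires=["rdkit"],   # assumed dependency
    entry_points={
        "console_scripts": [
            # hypothetical module:function path for the command-line entry point
            "dimorphite_dl = dimorphite_dl.dimorphite_dl:main",
        ],
    },
)
```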
Huh, I tried creating the . It works for now though. I have tried re-running it and it works. I forgot what min and max pH I used for my initial run, so I can't test whether my code in the pull request actually affects the docking score. As I can't reproduce the same protonated SMILES and docking score for my code, maybe you can see if the code I sent in the pull request is good enough. Also, since we are creating the |
Installing with . You have to create your fork on GitHub with |
Seems that dimorphite_dl can be used successfully via "pip install git+https://github.com/Feriolet/dimorphite_dl". |
Yes, tests passed successfully. I merged all changes to master, but I'll postpone releasing a new version, because there is one critical issue which should be somehow addressed. I noticed that Dimorphite does not behave well with nitrogen centers and molecules having more than a single protonation center. Below are several test examples. In cases 3 and 4 the nitrogen atom was not protonated as expected.
Dimorphite was designed to enumerate a list of possible forms at a given pH, and it cannot predict the most favorable one. The logic behind it was to enumerate all reasonable forms and dock all of them. Since we limit the number of protonated forms to 1, Dimorphite takes the first one from the list and returns it. Therefore, we have two options:
An alternative solution is pkasolver. That is why, in the beginning, I mentioned that it would be good to study and compare all available solutions (free or open-source) to choose the more appropriate one(s). |
@Samuel-gwb sorry for the late reply, but yes, the code should have been merged. We will also update the README to include the dimorphite_dl fork as well. While the latter one seems ideal, it may not work well with small molecules with multiple protonation states, as you mentioned. I have tried to protonate this molecule . Yes, it would be great if we can implement multiple solutions in the future if possible. Regarding pkasolver, did you mean this: https://github.com/mayrf/pkasolver ? From my brief look, it seems that we would need to use all possible protonated molecules as the input and return a single molecule with the closest pKa=7. |
Yes, this is the pkasolver repo. You are right, I missed the issue of too many forms for a single molecule. It may really substantially increase docking time and may not be necessary. This is another con. |
Many thanks for the installation notes. We will include them in the README. Is the output now identical to mine, with no difference in protonation states? If so, I will merge everything to the master branch and add this solution with . Finally, I will keep the dimorphite implementation inside the code, but will remove it from the command line interface, because currently it is not useful and will only confuse users. |
Yes, the protonated SMILES are identical to the most recent ones you showed |
One more question. Does it work on computers with GPU? Should we add |
I have not tested the GPU from scratch. It should also work, given that it has the same result as the CPU for the previous protonation (as in yesterday's result). I'll update you once I can test it on the GPU. Assuming that the users will follow the torch installation, there may be no need to use |
I updated . The minor issue which remains is enumeration of stereoisomers after protonation, e.g. a new unspecified chiral center will appear in :

```python
if not os.path.isfile(args.output):
    create_db(args.output, args)
    init_db(args.output, args.input, args.prefix)
else:
    args_dict, tmpfiles = restore_setup_from_db(args.output)
    # this will ignore stored values of those args which were supplied via command line
    # command line args have precedence over stored ones
    for arg in supplied_args:
        del args_dict[arg]
    args.__dict__.update(args_dict)

dask_client = create_dask_client(args.hostfile)

if args.protonation:
    add_protonation(args.output, program=args.protonation, tautomerize=not args.no_tautomerization, ncpu=args.ncpu)

populate_stereoisomers(args.output, args.max_stereoisomers)
```

However, this will create an issue that we will have records with identical . Currently I tend to ignore this issue and postpone its solution for the future. |
Some errors occur when using the newest environment and run_dock ... --protonation pkasolver ... :

RuntimeError: unable to open shared memory object </torch_4154596_2592645159_498> in read-write mode: Too many open files (24)

But it is successful to use: run_dock ... --protonation dimorphite ...
|
Have you tried reinstalling the conda environment from scratch? The error seems to be caused by torch.multiprocessing, but I am not sure if the default multiprocessing can call the torch multiprocessing. Also, how many CPUs did you use? |
Yes, I freshly installed a new conda environment, named easydock_new. |
I was referring to the . I tried to reinstall it from scratch again and I still can't replicate your error. Maybe you can give the full error log on your side and your environment.txt? I'm not sure how to approach this error. From what I found on the internet, the error is either caused by the Linux limit on how many files you can write or read (unlikely, because your |
@DrrDom btw, for your previous question on GPU (if you are still interested): both GPU and CPU give identical protonated SMILES. |
The command is: "--protonation dimorphite " using the same input file. For QueryModel(), I think it should be created by "pip install ..." into miniconda3/env/easydock_pka/lib/python3.9/site-packages/pkasolver/query.py:

```python
class QueryModel:
```

The environment is: |
I am assuming |
I have tried installing . I am now a bit lost. What about sending me the easydock ? Also, it would be helpful if you can show the error before this one too
|
Input files are also attached: . I re-ran the command; the error is a little different, with both showing "Too many open files"
|
Yes, they are the same; just typo mistakes |
Yes, it still runs without issue. OK, what about changing the protonation function? Maybe it works for your case?

```python
def protonate_pkasolver(input_fname: str, output_fname: str, ncpu: int = 1):
    from pkasolver.query import QueryModel
    model = QueryModel()
    with contextlib.redirect_stdout(None):
        pool = Pool(ncpu)
        with open(output_fname, 'wt') as f:
            pkasolver_output = pool.imap_unordered(partial(__protonate_pkasolver, model=model), read_input(input_fname))
            pool.close()
            pool.join()
            for smi, name in pkasolver_output:
                f.write(f'{smi}\t{name}\n')
```
|
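Usage of that function would simply be the call below (file names are placeholders):

```python
protonate_pkasolver("ligands.smi", "ligands_protonated.smi", ncpu=4)
```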
@Samuel-gwb, I'm a little bit lost. You posted two error messages with "too many open files". One is related to . Since you use , you may increase the number of file descriptors opened simultaneously |
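If raising the descriptor limit is acceptable, a sketch of doing it from Python before creating the Pool (assuming Linux; the hard limit cannot be raised by an unprivileged process):

```python
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open-file limit: soft={soft}, hard={hard}")
# Bump the soft limit up to the hard limit for this process.
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
```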
Yes, that is what I thought as well. From the error, it looks like the default multiprocessing calls the torch version, which calls the default version again. I tried not to use the ulimit solution, as it is surprising that accessing one CPU would cause this issue; it may be used as a last resort if everything else fails |
If . In that case I see two possible solutions:
@Samuel-gwb, could you test the function below?

```python
def protonate_pkasolver(input_fname: str, output_fname: str, ncpu: int = 1):
    import torch
    from pkasolver.query import QueryModel
    model = QueryModel()
    with contextlib.redirect_stdout(None):
        if torch.cuda.is_available() or ncpu == 1:
            with open(output_fname, 'wt') as f:
                for mol, mol_name in read_input(input_fname):
                    smi, name = _protonate_pkasolver(mol, mol_name, model=model)
                    f.write(f'{smi}\t{name}\n')
        else:
            pool = Pool(ncpu)
            with open(output_fname, 'wt') as f:
                for smi, name in pool.imap_unordered(partial(__protonate_pkasolver, model=model), read_input(input_fname)):
                    f.write(f'{smi}\t{name}\n')
```
|
Yes, I use the same easydock_pka environment for the different tests. And the last error message has been repeated over the last several attempts. |
Very confused! Then:
- Using the default protonate_pkasolver function, error:
- Using Feriolet's version to modify "/home/gwb/miniconda3/envs/easydock_test1/lib/python3.9/site-packages/easydock/protonation.py", replacing the contents of 'def protonate_pkasolver', then:
- Using Pavel's version, needed to modify 'smi, name = protonate...' --> 'smi, name = _protonate...' : |
Please put brackets around mol and mol_name. This will allow the call: `smi, name = __protonate_pkasolver((mol, mol_name), model=model)` |
Great, it works! Many thanks !
Any -c > 1 will cause a "Too many open files" error.
GTP smi: |
We hope that using 1 CPU is sufficient for your use case. I honestly still can't reproduce your error, so I can't really help you too much with that. I also tried running it on an Apple M1, and there is also no such issue. We can still try to tackle the
@DrrDom correct me if there is any mistake in what I said |
The error for ncpu > 1 is strange. This means that you do not have a detectable GPU and use exclusively . Wrong protonations may occur. Every protonation tool is incorrect to some extent. |
I updated |
Great !
|
Thank you!
|
I agree with @Samuel-gwb on his 2nd point. Don't we need the ? Edit: nvm, I think I got your point, my bad |
You were right) Thanks for pointing that out. I indeed missed adding these packages |
maybe switch to Dimorphite-DL:
https://jcheminf.biomedcentral.com/articles/10.1186/s13321-019-0336-9
I am in academia and I don't even have a chemaxon license anymore...
Software vendors always change their license terms one day or another...