
dependence on chemaxon is annoying #17

Open
UnixJunkie opened this issue Jan 17, 2024 · 84 comments

@UnixJunkie

maybe switch to Dimorphite-DL:
https://jcheminf.biomedcentral.com/articles/10.1186/s13321-019-0336-9

I am in academia and I don't even have a chemaxon license anymore...
Software vendors always change their license terms one day or another...

@DrrDom (Contributor) commented Jan 17, 2024

Agreed, ChemAxon's change to its academic license is unacceptable. I am now considering different options. Previously I tested Dimorphite-DL, pkasolver and chemaxon, and at first glance chemaxon performed better in complex cases. However, I would prefer a more systematic study; at the very least it would be reasonable to compare Dimorphite-DL and pkasolver. Unfortunately, this will take time, so it will not be fast, but we will try different alternatives. Actually, this could be an opportunity to develop something new.

@DrrDom (Contributor) commented Jan 17, 2024

It is still possible to disable chemaxon protonation (--no_protonation) and use an external tool to protonate structures before docking.

@UnixJunkie (Author)

As another hackish alternative, there is obabel's -p option (add hydrogens appropriate for a given pH).
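
For example, a pre-protonation step could be scripted like this (a minimal sketch, not part of easydock; it assumes obabel is on PATH and the file names are placeholders):

import subprocess

# protonate SMILES at pH 7.4 with Open Babel, then dock with --no_protonation
subprocess.run(['obabel', 'input.smi', '-O', 'protonated.smi', '-p', '7.4'],
               check=True)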

@DrrDom (Contributor) commented Mar 16, 2024

I implemented an easy way to integrate protonation tools and added dimorphite as an example in a separate branch, dimorphite-example.

What I dislike about dimorphite:

  • It is not a package, and integrating third-party programs as an integral part makes them hard to update and support. Dimorphite is pretty old and is not currently developed, but avoiding the inclusion of third-party software is a general rule anyway.
  • I did not test it, but dimorphite may be slow on large sets of compounds, because it works in a single process.
  • The license of dimorphite is Apache 2 and differs from easydock's BSD-3. They are compatible, but if dimorphite becomes an integral part, easydock would have to carry two licenses.
  • It is impossible to use the dimorphite API (the run function), because it catches the argparse arguments of the main program. This requires fixes to the dimorphite code.

I would prefer that dimorphite be converted into a package that can be installed with pip and run using a pool on multiple cores. Ideally, the bug of catching the main argparser when using the dimorphite API would also be fixed.

If someone helps with that, it will be faster to integrate dimorphite.

@Feriolet, would you be able to help with that?

@Feriolet (Contributor)

Yes, these have been some of the pain points of using Dimorphite-DL for me as well. I have tried to run 250k compounds, but init_db ran slowly (around 7 hours). Dimorphite-DL may be the cause, but I have not tested it.

It is also annoying that the argparse handling of Dimorphite is problematic, so it cannot be used as a package; my current solution uses subprocess.

I can try to fix the bug in Dimorphite-DL.

@Feriolet (Contributor)

I have found the reason for the argparse error in dimorphite_dl. Dimorphite catches the main program's arguments because of the line args = vars(parser.parse_args()): by default, .parse_args() collects the arguments from the terminal (i.e., the main run_dock arguments). To override this and pass the desired arguments, we have to supply a list like ['--argument', value, '--argument', value] to .parse_args(list).

Since **kwargs is a dict, I have made a temporary function that converts the dict into a flat argument list:

def convert_kwargs_to_sys_argv(params: dict) -> list:
    sys_argv_list = []
    for key, value in params.items():
        # store_true flags take no value, so only the flag name is appended
        if value is True:
            sys_argv_list.append('--' + key)
        elif value is False:
            continue
        else:
            sys_argv_list.append('--' + key)
            sys_argv_list.append(str(value))
    return sys_argv_list
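
Hypothetical usage (the argument names here mirror dimorphite's command-line flags):

convert_kwargs_to_sys_argv({'min_ph': 7.4, 'max_ph': 7.4, 'silent': True})
# -> ['--min_ph', '7.4', '--max_ph', '7.4', '--silent']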

Then, the main() function of the dimorphite_dl is modified to:

def main(params=None):
    """The main definition run when you call the script from the command line.

    :param params: The parameters to use. Entirely optional. If absent,
                   defaults to None, in which case arguments will be taken
                   from those given at the command line.
    :type params: dict, optional
    :return: Returns a list of the SMILES strings if the return_as_list
             parameter is True. Otherwise, returns None.
    """
    # Convert the params dict into a flat argument list; if params is None,
    # parse_args() falls back to the real command-line arguments.
    sys_argv_list = convert_kwargs_to_sys_argv(params) if params is not None else None
    parser = ArgParseFuncs.get_args()
    args = vars(parser.parse_args(sys_argv_list))
    if not args["silent"]:
        print_header()

    # Add in any parameters in params.
    if params is not None:
        for k, v in params.items():
            args[k] = v

    # If being run from the command line, print out all parameters.
    if __name__ == "__main__":
        if not args["silent"]:
            print("\nPARAMETERS:\n")
            for k in sorted(args.keys()):
                print(k.rjust(13) + ": " + str(args[k]))
            print("")

    if args["test"]:
        # Run tests.
        TestFuncs.test()
    else:
        # Run protonation.
        if "output_file" in args and args["output_file"] is not None:
            # An output file was specified, so write to that.
            with open(args["output_file"], "w") as file:
                for protonated_smi in Protonate(args):
                    file.write(protonated_smi + "\n")
        elif "return_as_list" in args and args["return_as_list"] is True:
            return list(Protonate(args))
        else:
            # No output file specified. Just print to the screen.
            for protonated_smi in Protonate(args):
                print(protonated_smi)

@DrrDom (Contributor) commented Mar 17, 2024

Good first step. Thank you!

I checked the code of Dimorphite with parallelization in mind, and I'd say that may not be the most difficult part. The code of the generator class looks quite complex, and I do not understand whether that complexity is necessary or not.

@Feriolet (Contributor)

Yes, it will take time for me to understand the code, as it uses multiple classes in big chunks as well.

@Feriolet (Contributor) commented Mar 18, 2024

I noticed that the Protonate() class is an iterator with __next__() and __iter__(). Is it even possible to use multiprocessing with this type of class? Calling Protonate(args) may immediately loop over the smi list before we can multiprocess it.

EDIT: never mind, it seems that a for loop calls the next() function of an iterator. I'll see if I can call multiprocessing within or outside of the class.

@DrrDom (Contributor) commented Mar 18, 2024

In theory it should be possible, but I do not have experience with such classes. I usually implement this kind of thing in a different way.

A very simple example:

with Pool(ncpu) as pool:
    with open(smi_filename) as f:
        for mol_id, protonated_smi in pool.imap_unordered(Protonate, f):
            if protonated_smi:
                update_db(mol_id, protonated_smi)

where Protonate would take a single line and return mol_id and protonated_smi. Here f acts as a generator, which avoids memory overflow.

There are some examples of generators working with multiprocessing - https://codereview.stackexchange.com/questions/259029/compute-the-outputs-of-a-generator-in-parallel, but those implementations are also simpler.

@Feriolet (Contributor) commented Mar 19, 2024

I have tried to understand how the iterator class works, and I think I have a few ideas on what to do. The dimorphite class needs to take the input filename; the iterator is generated by the LoadSMIFile() class and stored in args["smiles_and_data"] when ArgParseFuncs.clean_args() is called. Because of the nature of an iterator class, which supports next(iterator), the input must be an iterator, not merely an iterable.

The code above probably won't work, because Protonate() will treat each line of f as the args input, which is not what we want.

My idea is that we can change how Protonate() treats the input by separating the args["smiles_and_data"] code from args, which means adding another argument: Protonate.__init__(self, args, smi_iterator). But I don't know if that would break how dimorphite intends to run the code. Another, rather ridiculous, way is to split the smi file into n separate files (e.g., 250k smi would give 250k files), which can be easily multiprocessed without modifying the code too much. But again, generating 250k files is kind of ridiculous.

If we plan to change the code, we can probably make Protonate() take an input of the form iter([iter(smi), iter(smi), ...]), where smi represents one SMILES string, using the following function:

def convert_list_to_iter_in_iter(input_list: list):
    # wrap each SMILES string in its own iterator, then wrap the whole list
    iter_list = [iter(str(element)) for element in input_list]
    return iter(iter_list)

Then we can use pool.imap_unordered(Protonate, iter(iter_list)) so that the input is treated as an iterator rather than a list/string.

@DrrDom (Contributor) commented Mar 19, 2024

The idea of multiple input files is very nice: we will store all these data on disk anyway, and it does not matter in how many files.

We can create 2-3 times as many files as the number of CPUs supplied by the user on the command line (argument -c), propagate this ncpu argument to the add_protonation function, and use it to run the input files with a pool of subprocesses (a rough sketch follows below). Not a perfect solution, but fast and workable.

If the speed turns out to be too slow, we may consider dask for this task, but I hope it will be reasonably good.
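
A rough sketch of the chunk-and-subprocess idea (the helper names and the dimorphite_dl CLI invocation are illustrative assumptions, not easydock code):

import subprocess
from multiprocessing import Pool

def chunk_smi_file(smi_fname, nchunks):
    # split the input SMILES file into roughly equal chunk files
    with open(smi_fname) as f:
        lines = f.readlines()
    size = len(lines) // nchunks + 1
    fnames = []
    for i in range(0, len(lines), size):
        chunk_fname = f'{smi_fname}.chunk{i // size}'
        with open(chunk_fname, 'wt') as out:
            out.writelines(lines[i:i + size])
        fnames.append(chunk_fname)
    return fnames

def run_dimorphite(smi_fname):
    # one dimorphite-dl subprocess per chunk file
    subprocess.run(['python', 'dimorphite_dl.py', '--smiles_file', smi_fname,
                    '--output_file', smi_fname + '.protonated'], check=True)
    return smi_fname + '.protonated'

def protonate_in_parallel(smi_fname, ncpu):
    chunks = chunk_smi_file(smi_fname, nchunks=2 * ncpu)
    with Pool(ncpu) as pool:
        return list(pool.imap_unordered(run_dimorphite, chunks))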

@DrrDom (Contributor) commented Mar 19, 2024

You may create a package of dimorphite-dl under your account. Uploading it to PyPI is not necessary; we can state in the README that dimorphite should be installed directly from your repository.

@Feriolet (Contributor) commented Mar 19, 2024

By installing, do you mean through a git clone of the repo?

Btw, using 1 CPU to protonate 250k compounds takes 1037 s. So I guess using ncpu would be sufficient if we want to scale up the protonation.

@DrrDom (Contributor) commented Mar 19, 2024

I mean pip install git+.... So I suggest making it similar to easydock. Then we will be able to import it from easydock and run it directly from Python even without subprocess (once we solve the issue with command-line argument parsing).

@Samuel-gwb

Is Dimorphite ready to use in the branch?

@Feriolet (Contributor)

I mean pip install git+.... So I suggest making it similar to easydock. Then we will be able to import it from easydock and run it directly from Python even without subprocess (once we solve the issue with command-line argument parsing).

Oh okay, but how is that different from the current dimorphite-example? I am forking that branch to modify the code and implement the current feature. I will try to make a pull request soon to fix the dimorphite implementation.

@DrrDom (Contributor) commented Mar 22, 2024

@Samuel-gwb, yes, but it uses one core, and for large sets this will take some time: 250k molecules ~ 20 min.

@Feriolet, the difference will be that we can completely remove the dimorphite directory and files from the easydock package. Dimorphite will be installed as an individual package and imported similarly to the current import (only the name of the package will change). If you need more details, please ask.

@Feriolet (Contributor) commented Mar 22, 2024

But we do still need to change the dimorphite source code, right? I feel that it is inconvenient for users to install it as a separate package and then alter it by themselves. Or is there something I missed about installing it in a directory different from easydock?

Edit: never mind, I didn't read that separating them will solve the argparse issue. Will do that!

@DrrDom (Contributor) commented Mar 22, 2024

Yes, the source code should be modified to enable multiprocessing and correct treatment of command-line arguments. A separate package will not solve the issue with command-line arguments, so that should be fixed explicitly.

Installation as a package is preferable and convenient:

  1. No license mixture is needed.
  2. Maintaining a separate package is simpler. It may also be developed further independently if someone finds these changes useful.
  3. A user installs a number of packages anyway; this will only require copy-pasting one more line from the README.

@Feriolet (Contributor)

I can't install dimorphite through pip.

ERROR: git+https://github.com/durrantlab/dimorphite_dl does not appear to be a Python project: neither 'setup.py' nor 'pyproject.toml' found.

@DrrDom (Contributor) commented Mar 22, 2024

Yes, of course. That is why it should be converted into a Python package to enable installation with pip.

This is quite easy. You may check the structure of the easydock repository. You have to add setup.cfg, setup.py, LICENSE.txt, README.md and MANIFEST.in to the root dir of your fork of the repository, move the other files into a subdirectory named, for example, dimorphite, and add a file __init__.py to that subdirectory (it may be empty). A minimal setup.py is sketched below.

In the MANIFEST.in you should list files which are not Python scripts (*.py) but which should be supplied with the package. In dimorphite this will be the file with the substructure patterns (at first glance, at least).
Example of such a file: https://github.com/DrrDom/pmapper/blob/master/MANIFEST.in
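
A minimal setup.py along those lines might look as follows (the version, description and the site_substructures.smarts data file name are assumptions for illustration, not the actual package metadata):

from setuptools import setup, find_packages

setup(
    name='dimorphite_dl',
    version='0.0.1',  # placeholder version
    description='Adds hydrogens to molecules as appropriate for a pH range',
    packages=find_packages(),
    include_package_data=True,  # ship non-.py files listed in MANIFEST.in
    package_data={'dimorphite_dl': ['site_substructures.smarts']},
    install_requires=['rdkit'],
)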

@Feriolet (Contributor)

Huh, I tried creating the setup.cfg and setup.py for dimorphite, but it does not create a library in the conda environment (unless you consider an egg-link to be one). I made the dimorphite directory in my Desktop directory, then installed it with pip install -e .

It works for now though. I have tried re-running it and it works. I forgot what min and max pH I used for my initial run, so I can't test whether the code in my pull request actually affects the docking score. As I can't reproduce the same protonated SMILES and docking scores, maybe you can check whether the code I sent in the pull request is good enough.

Also, since we are creating the setup.cfg and setup.py, are we going to include them in our repository as well, or do we ask users to do it on their own per the README?

@DrrDom (Contributor) commented Mar 22, 2024

Installing with pip install -e . does not transfer files to the conda environment but creates links. This is almost like an ordinary installation, so everything is OK.

You have to create your fork on GitHub with the dimorphite code and edit it to fix the problems which interfere with easydock. Afterwards you can make a PR to the original repository. If the maintainers accept it, this will be excellent. If not, we can state in the easydock README to install dimorphite with pip install git+https://github.com/Feriolet/dimorphite_dl. In the latter case you will become the maintainer of this specific version of dimorphite.
So, no dimorphite files should be included in the easydock repo.

@Samuel-gwb

It seems dimorphite_dl can be used successfully via "pip install git+https://github.com/Feriolet/dimorphite_dl".
Great work!
Will the official code be updated according to the branches?

@DrrDom (Contributor) commented Apr 8, 2024

Yes, the tests passed successfully. I merged all changes to master, but I'll postpone releasing a new version, because there is one critical issue which should somehow be addressed.

I noticed that Dimorphite does not behave well with nitrogen centers and molecules having more than a single protonation center. Below are several test examples. In cases 3 and 4 the nitrogen atom was not protonated as expected.

smi                 smi_protonated
Oc1ccccn1           [O-]c1ccccn1
O=c1cccc[nH]1       O=c1cccc[n-]1
NCC(=O)O            NCC(=O)[O-]
C[C@@H]1CCCN(C)C1   C[C@@H]1CCCN(C)C1

Dimorphite was designed to enumerate a list of possible forms at a given pH; it cannot predict the most favorable one. The logic behind it was to enumerate all reasonable forms and dock all of them. Since we limit the number of protonated forms to one, Dimorphite takes the first one from the list and returns it.

Therefore, we have two options:

  • use dimorphite as is (and probably get incorrect protonation states for some molecules) - this looks very misleading and unexpected to a user
  • re-implement the treatment of protonation forms (similarly to stereoisomers) to consider all of them - this will increase the complexity of manipulating database records. Docking all possible protonation forms may also be misleading in my opinion, but probably less so than docking a single incorrect form.

An alternative solution is pkasolver.
Half a year ago I adapted pkasolver to predict the most stable protonation state: I took the form with the pKa closest to 7.4 (the selection step is sketched below). pkasolver also has some drawbacks and errors in the assignment of protonation states. However, to some extent it can treat molecules with multiple centers (though the publicly available model was trained on single-center molecules only). I have to adapt the previous implementation to the new program architecture and can make a PR to enable discussion of all solutions.
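
The selection step is essentially the following (a minimal sketch; the candidates argument is an assumed list of (smiles, predicted_pka) pairs, not the actual pkasolver API):

def choose_protonation_state(candidates, ph=7.4):
    # pick the enumerated form whose predicted pKa is closest to the target pH
    smiles, _ = min(candidates, key=lambda c: abs(c[1] - ph))
    return smiles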

That is why, in the beginning, I mentioned that it would be good to study and compare all available solutions (free or open source) to choose the most appropriate one(s).

@Feriolet (Contributor) commented Apr 8, 2024

@Samuel-gwb sorry for the late reply, but yes, the code should have been merged. We will also update the README to include the dimorphite_dl fork.

While the latter option seems ideal, it may not work well with small molecules having multiple protonation states, as you mentioned. I have tried to protonate the molecule CCCC[C@@H](CN[C@@H](CCCC)C(=O)[N-][C@@H](CCC([NH-])=O)C(=O)[N-][C@@H](CCCNC(N)=[NH2+])C([NH-])=O)[N-]C(=O)[C@@H]([N-]C(=O)[C@@H]([N-]C(C)=O)[C@@H](C)O)[C@@H](C)CC in the past, and dimorphite_dl produced over 128 protonated molecules (the default limit, if I remember correctly), which may significantly and unnecessarily increase computational time and resources.

Yes, it would be great if we could implement multiple solutions in the future if possible. Regarding pkasolver, did you mean this: https://github.com/mayrf/pkasolver? From a brief look, it seems we would need to use all possible protonated molecules as the input and return the single molecule with pKa closest to 7.

@DrrDom (Contributor) commented Apr 8, 2024

Yes, this is the pkasolver repo.
It takes a single molecule and enumerates multiple protonation forms with predicted pKa values, and I chose the one closest to 7.4.
I'll try to adapt it quickly and make a PR to show the idea and test it. With the new architecture it should be relatively easy to do.

You are right, I missed the issue of too many forms for a single molecule. It may really substantially increase docking time and may not be necessary. This is another con.

@Samuel-gwb

Thanks for your quick response @DrrDom @Feriolet, and I appreciate your thinking and efforts very much!
I also like easydock very much. It makes screening really easy :-)
Hoping for an efficient solution.

@DrrDom (Contributor) commented Apr 11, 2024

Many thanks for the installation notes. We will include them in the README.

Is the output now identical to mine, with no difference in protonation states? If so, I will merge everything to the master branch and add this solution with contextlib.

Finally, I will keep the dimorphite implementation inside the code but remove it from the command-line interface, because currently it is not useful and will only confuse users.

@Feriolet (Contributor)

Yes, the protonated SMILES are identical to the most recent ones you showed.

@DrrDom (Contributor) commented Apr 11, 2024

One more question: does it work on computers with a GPU? Should we add a force_cpu option or not?

@Feriolet (Contributor)

I have not tested the GPU from scratch. It should also work, given that it produced the same result as the CPU for the previous protonation (as in yesterday's result). I'll update you once I can test it on the GPU.

Assuming that users follow the torch installation instructions, there may be no need to use force_cpu=True. I guess it can be a good option if you want to make sure that people who accidentally installed torch-cuda get a warning to use the CPU only.

@DrrDom (Contributor) commented Apr 11, 2024

I updated easydock/pkasolver2 and pkasolver/main.

The minor issue which remains is the enumeration of stereoisomers after protonation; e.g., a new unspecified chiral center will appear in C[C@@H]1CCCN(C)C1 after protonation. I'm thinking about how to handle that with minimal code perturbation and maximum flexibility for future changes. It may be worth redesigning init_db, pulling the get_isomers function out of it, and applying it only after the protonated molecules have been generated:

if not os.path.isfile(args.output):
    create_db(args.output, args)
    init_db(args.output, args.input, args.prefix)
else:
    args_dict, tmpfiles = restore_setup_from_db(args.output)
    # this will ignore stored values of those args which were supplied via command line
    # command line args have precedence over stored ones
    for arg in supplied_args:
        del args_dict[arg]
    args.__dict__.update(args_dict)

dask_client = create_dask_client(args.hostfile)

if args.protonation:
    add_protonation(args.output, program=args.protonation, tautomerize=not args.no_tautomerization, ncpu=args.ncpu)

populate_stereoisomers(args.output, args.max_stereoisomers)

However, this will create an issue: we will have records with identical smi, different stereo_id and different protonated_smi, which is very misleading and may result in many issues in the future. A solution may be to introduce an additional field, protonated_id, to the DB and allow a single molecule (SMILES) to have several protonation states (not alternative protonation states in the dimorphite sense, but different stereoisomers appearing after protonation); a possible layout is sketched below. I'm not confident in this solution, because it will complicate the logic of functions and data manipulation. However, I do not see a better alternative.

Currently I tend to ignore this issue and postpone its solution for the future.
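
For illustration, such a table could look like this (a hypothetical sqlite sketch; the column names are illustrative, not the actual easydock schema):

import sqlite3

con = sqlite3.connect('docking.db')
con.execute("""CREATE TABLE IF NOT EXISTS mols (
                   id TEXT,                -- molecule identifier
                   stereo_id INTEGER,      -- enumerated stereoisomer
                   protonated_id INTEGER,  -- protonation state / post-protonation isomer
                   smi TEXT,               -- input SMILES
                   protonated_smi TEXT,    -- SMILES after protonation
                   PRIMARY KEY (id, stereo_id, protonated_id)
               )""")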

@Samuel-gwb commented Apr 11, 2024

An error occurs using the newest environment and run_dock ... --protonation pkasolver ...:
############################
  File ".../miniconda3/envs/easydock_pka/lib/python3.9/site-packages/torch/multiprocessing/reductions.py", line 366, in reduce_storage
    fd, size = storage.share_fd_cpu()

RuntimeError: unable to open shared memory object </torch_4154596_2592645159_498> in read-write mode: Too many open files (24)
#############################

But it is successful to use: run_dock ... --protonation dimorphite ...

I can finally run it without any of those issues. I reinstalled everything from scratch and the problem went away. I guess there were conflicting torch versions that I had installed, which messed up the other packages.

For future reference:

conda create -n easydock -c conda-forge python=3.9 numpy=1.20 rdkit scipy dask distributed
conda activate easydock
pip install paramiko meeko vina
pip install git+https://github.com/Feriolet/dimorphite_dl.git
pip install git+https://github.com/DrrDom/pkasolver.git@noprints
pip install git+https://github.com/ci-lab-cz/easydock.git@pkasolver2
pip install torch==1.13.1+cpu  --extra-index-url https://download.pytorch.org/whl/cpu
pip install torch-geometric==2.0.1
pip install torch_scatter==2.1.1+pt113cpu -f https://data.pyg.org/whl/torch-1.13.1%2Bcpu.html
pip install torch_sparse==0.6.17+pt113cpu -f https://data.pyg.org/whl/torch-1.13.1%2Bcpu.html
pip install torch_spline_conv==1.2.2+pt113cpu -f https://data.pyg.org/whl/torch-1.13.1%2Bcpu.html

Yes, I agree that we should use the CPU for convenience and consistency with the other protonation software (dimorphite_dl and chemaxon). To silence the protonation output, I have added contextlib to the add_protonation() function inside the database.py file:

                if program == 'chemaxon':
                    protonate_func = partial(protonate_chemaxon, tautomerize=tautomerize)
                    read_func = read_protonate_chemaxon
                elif program == 'dimorphite':
                    protonate_func = partial(protonate_dimorphite, ncpu=ncpu)
                    read_func = read_smiles
                elif program == 'pkasolver':
                    protonate_func = partial(protonate_pkasolver, ncpu=ncpu)
                    read_func = read_smiles
                else:
                    protonate_func = empty_func
                    read_func = empty_generator
                
                with contextlib.redirect_stdout(None):
                    protonate_func(input_fname=tmp.name, output_fname=output)

@Feriolet (Contributor)

Have you tried reinstalling the conda environment from scratch? The error seems to be caused by torch.multiprocessing, but I am not sure whether the default multiprocessing can call the torch multiprocessing.

Also, how many CPUs did you use?

@Samuel-gwb

Yes, I freshly installed a new conda environment, named easydock_new.
My computer has 64 CPUs. Do I need to specify a CPU, something like cpu:0?

@Feriolet (Contributor)

I was referring to the -c argument that you use to run the code.

I tried to reinstall it from scratch again and I still can't replicate your error. Maybe you can give the full error log on your side and your environment.txt? I'm not sure how to approach this error.

From what I found on the internet, the error is caused either by the Linux limit on how many files you can have open (unlikely, because your --protonation dimorphite works and I assume both access a similar number of files) or by pkasolver's torch or QueryModel(). Maybe you can also give us the snippet of the QueryModel() class?

@Feriolet (Contributor)

@DrrDom btw, for your previous question on the GPU (if you are still interested):

Both GPU and CPU give identical protonated SMILES.
For 100 SMILES:
-c 30 protonates in 57.70 s
1 GPU protonates in 46.74 s
2 GPU pool (shared GPU) protonates in 27.77 s
4 GPU pool (shared GPU) protonates in 18.94 s

@Samuel-gwb commented Apr 12, 2024

The command is:
run_dock -i "$smi_file" -o "$output_file" --program vina --config config_vina.yml --protonation pkasolver -c 1 --sdf

"--protonation dimorphite" uses the same input file.

For QueryModel(), I think it was installed by "pip install ..." into miniconda3/env/easydock_pka/lib/python3.9/site-packages/pkasolver/query.py.
I have not modified it; it is as follows:

class QueryModel:
    def __init__(self):

        self.models = []

        for i in range(25):
            model_name, model_class = "GINPair", GINPairV1
            model = model_class(
                num_node_features, num_edge_features, hidden_channels=96
            )
            base_path = path.dirname(__file__)
            if torch.cuda.is_available() == False:  # If only CPU is available
                checkpoint = torch.load(
                    f"{base_path}/trained_model_without_epik/best_model_{i}.pt",
                    map_location=torch.device("cpu"),
                )
            else:
                checkpoint = torch.load(
                    f"{base_path}/trained_model_without_epik/best_model_{i}.pt"
                )

            model.load_state_dict(checkpoint["model_state_dict"])
            model.eval()
            model.to(device=DEVICE)
            self.models.append(model)

    def predict_pka_value(self, loader: DataLoader) -> np.ndarray:
        """
        Parameters
        ----------
        loader
            data to be predicted
        Returns
        -------
        np.array
            list of predicted pKa values
        """

        results = []
        assert len(loader) == 1
        for data in loader:  # Iterate in batches over the training dataset.
            data.to(device=DEVICE)
            consensus_r = []
            for model in self.models:
                y_pred = (
                    model(
                        x_p=data.x_p,
                        x_d=data.x_d,
                        edge_attr_p=data.edge_attr_p,
                        edge_attr_d=data.edge_attr_d,
                        data=data,
                    )
                    .reshape(-1)
                    .detach()
                )

                consensus_r.append(y_pred.tolist())
            results.extend(
                (
                    float(np.average(consensus_r, axis=0)),
                    float(np.std(consensus_r, axis=0)),
                )
            )
        return results

The environment is attached:
easydock_pka.txt

@Feriolet (Contributor)

I am assuming easydock_pka is the same as the easydock_new environment?

@Feriolet (Contributor) commented Apr 12, 2024

I have tried installing easydock_pka (the torch dependencies, easydock, dimorphite, and pkasolver are installed separately with pip, because conda probably won't recognise them) and it still works on my side.

I am now a bit lost. What about sending me the easydock protonation.py file then? It should be the most updated one, right?

Also, it would be helpful if you could show the error that comes before this one too:

############################
  File ".../miniconda3/envs/easydock_pka/lib/python3.9/site-packages/torch/multiprocessing/reductions.py", line 366, in reduce_storage
    fd, size = storage.share_fd_cpu()

RuntimeError: unable to open shared memory object </torch_4154596_2592645159_498> in read-write mode: Too many open files (24)
#############################

@Samuel-gwb commented Apr 12, 2024

The input files are also attached:
test.zip

I re-ran the command; the error is a little different, but both say "Too many open files".
Error message:

(easydock_pka) gwb@node01: Small_Molecule/Y73C_GTP$ ./Ensemble_RunDock.sh
Traceback (most recent call last):
  File "/home/gwb/miniconda3/envs/easydock_pka/bin/run_dock", line 8, in <module>
    sys.exit(main())
  File "/home/gwb/miniconda3/envs/easydock_pka/lib/python3.9/site-packages/easydock/run_dock.py", line 207, in main
    add_protonation(args.output, program=args.protonation, tautomerize=not args.no_tautomerization, ncpu=args.ncpu)
  File "/home/gwb/miniconda3/envs/easydock_pka/lib/python3.9/site-packages/easydock/database.py", line 348, in add_protonation
    protonate_func(input_fname=tmp.name, output_fname=output)
  File "/home/gwb/miniconda3/envs/easydock_pka/lib/python3.9/site-packages/easydock/protonation.py", line 92, in protonate_pkasolver
    for smi, name in pool.imap_unordered(partial(__protonate_pkasolver, model=model), read_input(input_fname)):
  File "/home/gwb/miniconda3/envs/easydock_pka/lib/python3.9/multiprocessing/pool.py", line 870, in next
    raise value
  File "/home/gwb/miniconda3/envs/easydock_pka/lib/python3.9/multiprocessing/pool.py", line 537, in _handle_tasks
    put(task)
  File "/home/gwb/miniconda3/envs/easydock_pka/lib/python3.9/multiprocessing/connection.py", line 206, in send
    self._send_bytes(_ForkingPickler.dumps(obj))
  File "/home/gwb/miniconda3/envs/easydock_pka/lib/python3.9/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
  File "/home/gwb/miniconda3/envs/easydock_pka/lib/python3.9/site-packages/torch/multiprocessing/reductions.py", line 367, in reduce_storage
    df = multiprocessing.reduction.DupFd(fd)
  File "/home/gwb/miniconda3/envs/easydock_pka/lib/python3.9/multiprocessing/reduction.py", line 198, in DupFd
    return resource_sharer.DupFd(fd)
  File "/home/gwb/miniconda3/envs/easydock_pka/lib/python3.9/multiprocessing/resource_sharer.py", line 48, in __init__
    new_fd = os.dup(fd)
OSError: [Errno 24] Too many open files

@Samuel-gwb

I am assuming easydock_pka is the same as the easydock_new environment?

Yes, they are the same; a typo on my part.

@Feriolet (Contributor) commented Apr 12, 2024

Yes, it still runs without issue.

OK, what about changing the protonation function? Maybe it works in your case:

def protonate_pkasolver(input_fname: str, output_fname: str, ncpu: int = 1):
    from pkasolver.query import QueryModel
    model = QueryModel()
    with contextlib.redirect_stdout(None):
        pool = Pool(ncpu)
        with open(output_fname, 'wt') as f:
            pkasolver_output = pool.imap_unordered(partial(__protonate_pkasolver, model=model), read_input(input_fname))
            pool.close()
            pool.join()
            for smi, name in pkasolver_output:
                f.write(f'{smi}\t{name}\n')

@DrrDom (Contributor) commented Apr 12, 2024

@Samuel-gwb, I'm a little bit lost. You posted two error messages with "too many open files". One is related to torch, the other to standard multiprocessing. Do you use the latest version of the easydock pkasolver2 branch? Do you have a GPU?

Since you use -c 1, I cannot imagine how you could exceed the number of open files.

You may increase the number of file descriptors that can be open simultaneously (ulimit -n 4096), but this does not look like a proper solution.
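
For reference, the same limit can also be raised from inside Python (a workaround sketch only; it does not fix the underlying descriptor leak):

import resource

# raise the soft limit on open file descriptors up to the hard limit
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))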

@Feriolet (Contributor)

Yes, that is what I thought as well.

From the error, it looks like the default multiprocessing calls into the torch version, which calls the default version again. It is very interesting.

I'd rather not use the ulimit solution, as it is surprising that using one CPU would cause this issue; it may serve as a last resort if everything else fails.

@DrrDom (Contributor) commented Apr 12, 2024

If multiprocessing.Pool called multiprocessing.Pool directly, it should result in an error about nested processes or the like, because this is forbidden by design. If this call happens through torch, maybe that avoids this error but causes another one.

In that case I see two possible solutions:

  1. Add a force_cpu argument to QueryModel and set it to True.
  2. Detect GPUs within the protonate_pkasolver function and call protonation without multiprocessing.Pool.

@Samuel-gwb, could you test the function below?

def protonate_pkasolver(input_fname: str, output_fname: str, ncpu: int = 1):
    import torch
    from pkasolver.query import QueryModel
    model = QueryModel()
    with contextlib.redirect_stdout(None):
        if torch.cuda.is_available() or ncpu == 1:
            with open(output_fname, 'wt') as f:
                for mol, mol_name in read_input(input_fname):
                    smi, name = _protonate_pkasolver(mol, mol_name, model=model)
                    f.write(f'{smi}\t{name}\n')
        else:
            pool = Pool(ncpu)
            with open(output_fname, 'wt') as f:
                for smi, name in pool.imap_unordered(partial(__protonate_pkasolver, model=model), read_input(input_fname)):
                    f.write(f'{smi}\t{name}\n')

@Samuel-gwb

Yes, I use the same easydock_pka environment for the different tests, and the last error message has been reproducible over the last several runs.
I will try your solutions with the modified protonate_pkasolver function!

@Samuel-gwb commented Apr 12, 2024

Very confusing!
Again, I freshly installed an environment, just changing easydock --> easydock_test1:
$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
conda create -n easydock_test1 -c conda-forge python=3.9 numpy=1.20 rdkit scipy dask distributed
conda activate easydock_test1
pip install paramiko meeko vina
pip install git+https://github.com/Feriolet/dimorphite_dl.git
pip install git+https://github.com/DrrDom/pkasolver.git@noprints
pip install git+https://github.com/ci-lab-cz/easydock.git@pkasolver2
pip install torch==1.13.1+cpu --extra-index-url https://download.pytorch.org/whl/cpu
pip install torch-geometric==2.0.1
pip install torch_scatter==2.1.1+pt113cpu -f https://data.pyg.org/whl/torch-1.13.1%2Bcpu.html
pip install torch_sparse==0.6.17+pt113cpu -f https://data.pyg.org/whl/torch-1.13.1%2Bcpu.html
pip install torch_spline_conv==1.2.2+pt113cpu -f https://data.pyg.org/whl/torch-1.13.1%2Bcpu.html
pip install molvs chembl_webresource_client matplotlib pytest-cov codecov svgutils cairosvg ipython
$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$

Then, using the default protonate_pkasolver function, the error is:
$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
Traceback (most recent call last):
  File "/home/gwb/miniconda3/envs/easydock_test1/bin/run_dock", line 8, in <module>
    sys.exit(main())
  File "/home/gwb/miniconda3/envs/easydock_test1/lib/python3.9/site-packages/easydock/run_dock.py", line 207, in main
    add_protonation(args.output, program=args.protonation, tautomerize=not args.no_tautomerization, ncpu=args.ncpu)
  File "/home/gwb/miniconda3/envs/easydock_test1/lib/python3.9/site-packages/easydock/database.py", line 348, in add_protonation
    protonate_func(input_fname=tmp.name, output_fname=output)
  File "/home/gwb/miniconda3/envs/easydock_test1/lib/python3.9/site-packages/easydock/protonation.py", line 92, in protonate_pkasolver
    for smi, name in pool.imap_unordered(partial(__protonate_pkasolver, model=model), read_input(input_fname)):
  File "/home/gwb/miniconda3/envs/easydock_test1/lib/python3.9/multiprocessing/pool.py", line 870, in next
    raise value
  File "/home/gwb/miniconda3/envs/easydock_test1/lib/python3.9/multiprocessing/pool.py", line 537, in _handle_tasks
    put(task)
  File "/home/gwb/miniconda3/envs/easydock_test1/lib/python3.9/multiprocessing/connection.py", line 206, in send
    self._send_bytes(_ForkingPickler.dumps(obj))
  File "/home/gwb/miniconda3/envs/easydock_test1/lib/python3.9/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
  File "/home/gwb/miniconda3/envs/easydock_test1/lib/python3.9/site-packages/torch/multiprocessing/reductions.py", line 367, in reduce_storage
    df = multiprocessing.reduction.DupFd(fd)
  File "/home/gwb/miniconda3/envs/easydock_test1/lib/python3.9/multiprocessing/reduction.py", line 198, in DupFd
    return resource_sharer.DupFd(fd)
  File "/home/gwb/miniconda3/envs/easydock_test1/lib/python3.9/multiprocessing/resource_sharer.py", line 48, in __init__
    new_fd = os.dup(fd)
OSError: [Errno 24] Too many open files
$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$

Using Feriolet's version to modify "/home/gwb/miniconda3/envs/easydock_test1/lib/python3.9/site-packages/easydock/protonation.py" (replacing the contents of def protonate_pkasolver), then:
$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
Traceback (most recent call last):
  File "/home/gwb/miniconda3/envs/easydock_test1/bin/run_dock", line 8, in <module>
    sys.exit(main())
  File "/home/gwb/miniconda3/envs/easydock_test1/lib/python3.9/site-packages/easydock/run_dock.py", line 207, in main
    add_protonation(args.output, program=args.protonation, tautomerize=not args.no_tautomerization, ncpu=args.ncpu)
  File "/home/gwb/miniconda3/envs/easydock_test1/lib/python3.9/site-packages/easydock/database.py", line 348, in add_protonation
    protonate_func(input_fname=tmp.name, output_fname=output)
  File "/home/gwb/miniconda3/envs/easydock_test1/lib/python3.9/site-packages/easydock/protonation.py", line 106, in protonate_pkasolver
    for smi, name in pkasolver_output:
  File "/home/gwb/miniconda3/envs/easydock_test1/lib/python3.9/multiprocessing/pool.py", line 870, in next
    raise value
  File "/home/gwb/miniconda3/envs/easydock_test1/lib/python3.9/multiprocessing/pool.py", line 537, in _handle_tasks
    put(task)
  File "/home/gwb/miniconda3/envs/easydock_test1/lib/python3.9/multiprocessing/connection.py", line 206, in send
    self._send_bytes(_ForkingPickler.dumps(obj))
  File "/home/gwb/miniconda3/envs/easydock_test1/lib/python3.9/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
  File "/home/gwb/miniconda3/envs/easydock_test1/lib/python3.9/site-packages/torch/multiprocessing/reductions.py", line 366, in reduce_storage
    fd, size = storage.share_fd_cpu()
RuntimeError: unable to open shared memory object </torch_55536_471936146_501> in read-write mode: Too many open files (24)
$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$

Using Pavel's version, I needed to modify 'smi, name = _protonate_pkasolver(...)' --> 'smi, name = __protonate_pkasolver(...)':
$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
Traceback (most recent call last):
  File "/home/gwb/miniconda3/envs/easydock_test1/bin/run_dock", line 8, in <module>
    sys.exit(main())
  File "/home/gwb/miniconda3/envs/easydock_test1/lib/python3.9/site-packages/easydock/run_dock.py", line 207, in main
    add_protonation(args.output, program=args.protonation, tautomerize=not args.no_tautomerization, ncpu=args.ncpu)
  File "/home/gwb/miniconda3/envs/easydock_test1/lib/python3.9/site-packages/easydock/database.py", line 348, in add_protonation
    protonate_func(input_fname=tmp.name, output_fname=output)
  File "/home/gwb/miniconda3/envs/easydock_test1/lib/python3.9/site-packages/easydock/protonation.py", line 119, in protonate_pkasolver
    smi, name = __protonate_pkasolver(mol, mol_name, model=model)
TypeError: __protonate_pkasolver() got multiple values for argument 'model'
$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$

@Feriolet (Contributor) commented Apr 12, 2024

Please put brackets around mol and mol_name. This makes __protonate_pkasolver treat mol and mol_name as a single tuple argument instead of two positional arguments. Without that, the function treats mol_name as the model, which is why the call fails:

smi, name = __protonate_pkasolver((mol, mol_name), model=model)

@Samuel-gwb

Great, it works! Many thanks!

  1. One more thing: -c has to be 1.
     run_dock -i GTP.smi -o GTP_vina.db --program vina --config config_vina.yml --protonation pkasolver -c 1 --sdf

Any -c > 1 will cause the "Too many open files" error.

  2. Another thing is that the additional nitrogen adjacent to the imidazole ring of GTP was protonated to NH-.
     I know that it would be NH2 if one used Schrodinger.

GTP smi:
OC1C(COP(=O)(OP(=O)(OP(=O)(O)O)O)O)OC(C1O)n1cnc2c1[nH]c(N)nc2=O
protonated by pkasolver -->
[NH-]c1nc(=O)c2ncn([C@H]3OC@@HC@H[C@H]3[O-])c2[n-]1

GTP_vina.sdf.txt

@Feriolet (Contributor)

  1. Yes, we have made it so that multiprocessing is used when -c > 1, and that is the package giving you the error.

We hope that using 1 CPU is sufficient for your use case. I honestly still can't reproduce your error, so I can't really help you much with that. I also tried running it on an Apple M1, and there is no such issue there either. We can still try to tackle the multiprocessing issue if you wish to use more than 1 CPU, but it may be challenging, as some of the obvious solutions do not work.

  2. Regarding this, easydock only gives one protonated SMILES. I am sure GTP has multiple protonation centers, and it just happens that pkasolver gives the [NH-] form out of the many possible protonated SMILES. We are unsure whether we should give multiple protonated SMILES per input SMILES, as we are still considering the stereoisomer enumeration issue after protonation.

@DrrDom, correct me if I said anything wrong.

@DrrDom (Contributor) commented Apr 14, 2024

The error for ncpu > 1 is strange. It means that you do not have a detectable GPU and use multiprocessing exclusively. I have never met such an error with multiprocessing.

Wrong protonations may occur. Every protonation tool is incorrect to some extent. The publicly available pkasolver model was trained on single-center molecules; therefore, predictions for complex molecules with multiple protonation centers may be incorrect. That is why the applicability of the different protonation tools should be studied more thoroughly. Meanwhile, we may use pkasolver as an alternative to chemaxon.

@DrrDom (Contributor) commented Apr 15, 2024

I updated master with the most recent changes. I'll keep the issue open, because I believe we will return to it in the future.
Thanks a lot to everybody who helped with this!

@Samuel-gwb

Great!
Some tiny things:

  1. In the README, the line for pip installation of torch_spline_conv contains an extra ''' at the end.

  2. It seems that "pip install cairosvg svgutils" is needed.
     And, finally, installation may need to include "pip install ." at $easydock_home.

  3. So, when using CPU-based pkasolver for protonation, one needs to set "-c 1"? If so, should it be included in the README?

@DrrDom (Contributor) commented Apr 16, 2024

Thank you!

  1. Fixed.
  2. Installation of easydock is already described. Your suggestion is not relevant for ordinary users; it is mainly for developers who have a clone of the repository. I'll update the PyPI package soon. I expect to close another PR before officially updating the version.
  3. Currently it is not necessary. Your case seems very specific. We will collect other users' responses on whether they have issues with that. However, it may be worth mentioning this issue in the README to attract users' attention.

@Feriolet (Contributor) commented Apr 17, 2024

I agree with @Samuel-gwb on his 2nd point. Don't we need pip install cairosvg svgutils to run pkasolver? At least on my side it gave an import error for cairosvg.

Edit: never mind, I think I got your point, my bad.

@DrrDom (Contributor) commented Apr 17, 2024

You were right) Thanks for pointing that out. I indeed missed adding the packages cairosvg and svgutils to the list of required ones. I'll do that.
