Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

run_dock is running, but stops output into the output.db file #22

Closed
Samuel-gwb opened this issue Mar 1, 2024 · 22 comments
Closed

run_dock is running, but stops output into the output.db file #22

Samuel-gwb opened this issue Mar 1, 2024 · 22 comments

Comments

@Samuel-gwb
Copy link

Samuel-gwb commented Mar 1, 2024

When I try to docking a .smi file containing 10k molecules against a receptor, the size of output.db increases continually.
However, after some time (tens of minutes), the increase stopped, while run_dock still gives good output.

Then I stop the run_dock running, and then just re-run the run_dock task. The size of output.db will increase again.

This process will repeat and repeat.

How to solve this problem and continue docking until the last molecue ?

@Samuel-gwb
Copy link
Author

Update of the problem:
when I use a series of .smi files containing 200 molecules each, run_dock will go on docking for every file, with some .db be 1M size or 2 M size without corresponding .sdf output, while a few .db files be 10 M with corresponding .sdf output.

It seems that some molecules cause docking problem, thus no write into .db file, while docking for next molecule in the .smi file will go on. And after all molecules are docked, next smi file run_dock begins to work, but no .sdf file was written since .db file is incomplete.

This is annoying, as you have to check again and again, and rerun the run_dock to overcome the problem molecules.

@DrrDom
Copy link
Contributor

DrrDom commented Mar 2, 2024

This is a strange behavior, because if docking of a molecule is failed, this molecule should be simply skipped (the fields pdb_block, mol_block and others will be empty).

  1. Does the work of the script run_dock finish successfully or it returns some error?
  2. Is the issue reproducible on the same SMILES file and the same molecules are omitted?
  3. I did not understand, after restarting the script all molecules were successfully docked or not?
  4. Did you try to dock SMILES causing problem to other protein?
  5. Could you share the set of SMILES which cause such a behavior? The protein structure may be also useful to reproduce this behavior.
  6. Provide please which settings/arguments do you use to run scripts.

@DrrDom
Copy link
Contributor

DrrDom commented Mar 2, 2024

Just a note. If a molecule is failed during the first run of the script, in the second run the script will try to dock it again.

Normally we did not observed many failed docking attempts across millions of compounds. Maybe your compounds are somewhat special. Maybe there are issues in molecule preparation by meeko or conversion them back from PDBQT to MOL. You may also try to use the previous version of the software 0.2.7 - where the old pipeline was implemented.

@Samuel-gwb
Copy link
Author

Samuel-gwb commented Mar 4, 2024

Yes, if unsucessful docking can be skipped and next successful docking of molecule can be written into sdf, that will be quite fine.
Now checking according to your suggestions one by one.
1, If error messages come, no sdf would be output.
2, No. Different try gives different
3, yes, most can be docked and give a reasonable docking score.
4, no
5,6, Files used in my docking are attached, Thanks!

Attachment contains:
.smi (10 files, each containing 10 molecules)
.db
.sdf (successful one)
fail list of smi
config yml
receptor pdbqt
receptor box txt
iterative script of run_dock fo all smi files

Failed_smi.tar.gz

Following is another run using the same smi files and settings. Output message is included containing some error messages
RepeatV2_test_20Fails.tar.gz

This is a strange behavior, because if docking of a molecule is failed, this molecule should be simply skipped (the fields pdb_block, mol_block and others will be empty).

  1. Does the work of the script run_dock finish successfully or it returns some error?
  2. Is the issue reproducible on the same SMILES file and the same molecules are omitted?
  3. I did not understand, after restarting the script all molecules were successfully docked or not?
  4. Did you try to dock SMILES causing problem to other protein?
  5. Could you share the set of SMILES which cause such a behavior? The protein structure may be also useful to reproduce this behavior.
  6. Provide please which settings/arguments do you use to run scripts.

@Samuel-gwb
Copy link
Author

Samuel-gwb commented Mar 4, 2024

Yes, just as this !

Just a note. If a molecule is failed during the first run of the script, in the second run the script will try to dock it again.

I split the overall smi file into files, each containing 20 molecules. Now get 1588 sdf files and 1906 db files.
Failed rate seems to be about 318 in (1906*20). About 1/120
Is this abnormal ?

Normally we did not observed many failed docking attempts across millions of compounds. Maybe your compounds are somewhat special. Maybe there are issues in molecule preparation by meeko or conversion them back from PDBQT to MOL. You may also try to use the previous version of the software 0.2.7 - where the old pipeline was implemented.

@Feriolet
Copy link
Contributor

Feriolet commented Mar 4, 2024

Hi! perhaps it might be a gnina issue? I have tried running the smi files using vina program (I can't replicate in gnina because I am using Mac OS M1), and I did not encounter the error issue. All of the compounds so far can be docked and the sdf output is there for the dk_* files

I have also adjusted the config file to accomodate the vina program and my limited cpu (in case it plays a part in the error):

protein: PDIab_loopF240P.pdbqt
protein_setup: PDIab_vina_box.txt
exhaustiveness: 16
scoring: default
n_poses: 15
ncpu: 1 
seed: 1001

Output for your reference (pls ignore the stereo_id in the .db file, I should have used the main repo for consistency):
smi_vina_result.tar.gz

@Samuel-gwb
Copy link
Author

Thanks a lot! I also tried and it works well. The results is similar with your providing one.
In my previous test of an ATP-binding protein, gnina works better than vina. Results from gnina (after re-running) is some different from vina.

Great work !

Currently, I have not added protonation and just use --no_protonation option.
Hope your work smooth to replace chemaxon !

@DrrDom
Copy link
Contributor

DrrDom commented Mar 8, 2024

We also encountered this issue with the latest version of gnina now. @Samuel-gwb and @Feriolet, could you share your errors?
In our case it throws an error on a wrong angle - nan. I suspect, this may be an issue with RDKit conformer generator and I'm investigating it right now.

However, the error is not well reproducible. The same set of 12000 molecules may be docked without errors for some protein conformations, but for others gives an error just once, for some - two and more times.

terminate called after throwing an instance of 'numerical_error'
  what():  Numerical degeneracy encountered. Check for non-physical inputs. nan should be between -pi and pi.
*** Aborted at 1709916512 (unix time) try "date -d @1709916512" if you are using GNU date ***
PC: @                0x0 (unknown)
*** SIGABRT (@0x1d530000e589) received by PID 58761 (TID 0x2afecf016700) from PID 58761; stack trace: ***
    @     0x2afeccc10630 (unknown)
    @     0x2afecdb04387 __GI_raise
    @     0x2afecdb05a78 __GI_abort
    @     0x2afecd2b815d __gnu_cxx::__verbose_terminate_handler()
    @     0x2afecd2b61a6 __cxxabiv1::__terminate()
    @     0x2afecd2b61f1 std::terminate()
    @     0x2afecd2b6408 __cxa_throw
    @           0x591fd4 normalize_angle()
    @           0x593ee4 bfgs<>()
    @           0x590bfd quasi_newton::operator()()
    @           0x5622d9 monte_carlo::operator()()
    @           0x5737b2 parallel_mc_aux::operator()()
    @           0x56f5a4 

Overall, the behavior, where docking continues after the core dump but the database is not updated, is strange. If a molecule causes an error it should be skipped, but it somehow blocks database update. This is an issue of EasyDock already and it should be fixed.

@Feriolet
Copy link
Contributor

I can't provide the gnina error as I am not using Ubuntu right now to test the run_dock. However, if I interpret your comment right, the EasyDock issue is that the output.db for some molecules are not updated correctly even if they are docked successfully after some molecules are skipped. I have found a solution to that issue:

#### Using autodock vina

def mol_dock(mol, config):
    # Load config and temp file

    # Reproducing a random error on certain molecules
    try:
        randomnumber = random.randint(100) 
        if randomnumber > 10:
            print(randomnumber)
            raise ValueError
        else:
            print('success',randomnumber)

        with open(ligand_fname, 'wt') as f:
           f.write(ligand_pdbqt)

    # Adding an except argument
    except:
        pass
    finally:
        # close and unlink 

For some reason, adding the except line solve the issue of output.db not updating.

@DrrDom
Copy link
Contributor

DrrDom commented Mar 11, 2024

Thank you for the suggestion, but this did not help. The current version of the function I use for testing.

I added explicit catching exceptions in subprocess and treat them explicitly, and afterwards add catching of all other exceptions. The issue here is that error occurs in subprocess and not in Python code.

def mol_dock(mol, config):
    """

    :param mol: RDKit Mol of a ligand with title
    :param config: yml-file with docking settings
    :return:
    """

    output = None

    config = __parse_config(config)

    mol_id = mol.GetProp('_Name')
    boron_replacement = config["cnn_scoring"] in [None, "none"]
    ligand_pdbqt = ligand_preparation(mol, boron_replacement=boron_replacement)
    if ligand_pdbqt is None:
        return mol_id, None

    output_fd, output_fname = tempfile.mkstemp(suffix='_output.pdbqt', text=True)
    ligand_fd, ligand_fname = tempfile.mkstemp(suffix='_ligand.pdbqt', text=True)

    try:
        with open(ligand_fname, 'wt') as f:
            f.write(ligand_pdbqt)

        cmd = f'{config["script_file"]} --receptor {config["protein"]} --ligand {ligand_fname} --out {output_fname} ' \
              f'--config {config["protein_setup"]} --exhaustiveness {config["exhaustiveness"]} ' \
              f'--seed {config["seed"]} --scoring {config["scoring"]} ' \
              f'--cpu {config["ncpu"]} --addH {config["addH"]} --cnn_scoring {config["cnn_scoring"]} ' \
              f'--cnn {config["cnn"]} --num_modes {config["n_poses"]}'
        start_time = timeit.default_timer()
        subprocess.run(cmd, shell=True, check=True)  # this will trigger CalledProcessError and skip next lines
        dock_time = round(timeit.default_timer() - start_time, 1)

        score, pdbqt_out = __get_pdbqt_and_score(output_fname)
        mol_block = pdbqt2molblock(pdbqt_out.split('MODEL')[1], mol, mol_id)

        output = {'docking_score': score,
                  'pdb_block': pdbqt_out,
                  'mol_block': mol_block,
                  'dock_time': dock_time}

    except subprocess.CalledProcessError as e:
        sys.stderr.write(f'Error caused by docking of {mol_id}\n')
        sys.stderr.write(str(e))
        sys.stderr.write('STDERR output:\n')
        sys.stderr.write(e.stderr)
        sys.stderr.flush()
        output = None

    except:
        pass

    finally:
        os.close(output_fd)
        os.close(ligand_fd)
        os.unlink(ligand_fname)
        os.unlink(output_fname)

    return mol_id, output

Now I think that the issue may be in releasing file resources in the finally block. However, I do not understand how this may block database update with results returned from other processes running in parallel.

@Feriolet
Copy link
Contributor

I see. Does the gnina command has std.err == None ? I tried to create a subprocess error, but the line sys.stderr.write(e.stderr) give a TypeError: write() argument must be str, not None error. I have to add the argument check_output=True, text=True for the subprocess line in order to make it work.

Using autodock vina as a reference as I don't have gnina:

def mol_dock(mol, config):
    """

    :param mol: RDKit Mol of a ligand with title
    :param config: yml-file with docking settings
    :return:
    """
    output = None

    config = __parse_config(config)

    mol_id = mol.GetProp('_Name')
    ligand_pdbqt = ligand_preparation(mol, boron_replacement=True)
    if ligand_pdbqt is None:
        return mol_id, None

    output_fd, output_fname = tempfile.mkstemp(suffix='_output.json', text=True)
    ligand_fd, ligand_fname = tempfile.mkstemp(suffix='_ligand.pdbqt', text=True)

    try:
        with open(ligand_fname, 'wt') as f:
            f.write(ligand_pdbqt)

        p = os.path.realpath(__file__)
        python_exec = sys.executable

        randomnum = random.randint(100)
        if randomnum > 50:
            cmd = 'vina make error'
        else:
            cmd = f'{python_exec} {os.path.dirname(p)}/vina_dock_cli.py -l {ligand_fname} -p {config["protein"]} ' \
                f'-o {output_fname} --center {" ".join(map(str, config["center"]))} ' \
                f'--box_size {" ".join(map(str, config["box_size"]))} ' \
                f'-e {config["exhaustiveness"]} --seed {config["seed"]} --nposes {config["n_poses"]} -c {config["ncpu"]}'
        start_time = timeit.default_timer()
        subprocess.run(cmd, shell=True, check=True, capture_output=True, text=True) # not including capture_output and text output does not update output.db if an error occurs (and if e.stderr == None)
        dock_time = round(timeit.default_timer() - start_time, 1)

        with open(output_fname) as f:
            res = f.read()
            if res:
                res = json.loads(res)
                mol_block = pdbqt2molblock(res['poses'].split('MODEL')[1], mol, mol_id)
                output = {'docking_score': res['docking_score'],
                          'pdb_block': res['poses'],
                          'mol_block': mol_block,
                          'dock_time': dock_time}

    except subprocess.CalledProcessError as e:
        sys.stderr.write(f'Error caused by docking of {mol_id}\n')
        sys.stderr.write(str(e))
        sys.stderr.write('STDERR output:\n')
        sys.stderr.write(e.stderr)
        sys.stderr.flush()
        output = None

    except:
        pass

    finally:
        os.close(output_fd)
        os.close(ligand_fd)
        os.unlink(ligand_fname)
        os.unlink(output_fname)
    return mol_id, output
Screenshot 2024-03-11 at 9 08 26 PM Screenshot 2024-03-11 at 9 08 49 PM

@DrrDom
Copy link
Contributor

DrrDom commented Mar 11, 2024

It works! So, the solution could be sys.stderr.write(str(e.stderr)). I had to explicitly cast type of e.stderr to str.
Thanks a lot! I'll update repo later today.

@DrrDom
Copy link
Contributor

DrrDom commented Mar 11, 2024

I pushed the commit fixing this freezing of database update. Please check it on your files and settings. You can check the latest version from the repo. Now problematic molecules should be skipped as expected. This fix will be included in the next release soon.

It is important to use the latest version also due to fixed extraction of docking scores from gnina docking outputs. There was a bug in the main repo and scores extracted previously were smina-like scores.

@Samuel-gwb
Copy link
Author

I tried gnina with updated code, and still meet an error. This time, the command stopped after output error. Before, it go on docking next molecule but not update the .db file.

easydock_error.txt

@Feriolet
Copy link
Contributor

Feriolet commented Mar 12, 2024

Can you try redoing it with the updated repo? The error you sent has the old code string_with_score = pdbqt_out.split('MODEL')[1].split('\n')[1] . The updated get_pdbqt_score should now be:

def __get_pdbqt_and_score(ligand_out_fname):
    with open(ligand_out_fname) as f:
        pdbqt_out = f.read()
    match = re.search(r'REMARK CNNaffinity\s+([\d.]+)', pdbqt_out)
    if match:
        score = round(float(match.group(1)), 3)
    else:
        match = re.search(r'REMARK minimizedAffinity\s+(-?[\d.]+)', pdbqt_out)
        score = round(float(match.group(1)), 3)

    return score, pdbqt_out

@Samuel-gwb
Copy link
Author

I checked. gnina_dock.py should contain the very codes above.

Attachments contain input files and gnina_dock.py (named as gnina_dock_py.txt) files.

Just now, I have not "pip install easydock" after git pull / update. After pip install, the gnina docking go on when meeting problem molecules, while .db file is not updated, seems similar with problem several days ago.

test.tar.gz
gnina_dock_py.txt

@Feriolet
Copy link
Contributor

To confirm, there is only one easydock version on your computer right? I want to rule out the possibility that the easydock use the old version which cause the not updated .db file instead of the new version

@Samuel-gwb
Copy link
Author

Yes, only one environment.
In easydock/easydock directory, you can find the vina_dock.py and gnina_dock.py is the latest one:

-rw-rw-r-- 1 gwb gwb 17399 3月 1 10:15 database.py
-rw-rw-r-- 1 gwb gwb 91 3月 1 10:15 auxiliary.py
-rwxrwxr-x 1 gwb gwb 5684 3月 12 07:55 vina_dock.py
-rw-rw-r-- 1 gwb gwb 3060 3月 12 07:55 gnina_dock.py

Then "pip install easydock" in the main directory.

@Feriolet
Copy link
Contributor

Feriolet commented Mar 12, 2024

That's odd. I have tried to run it on my end and it does not produce any issue. The only issue I encountered is when I use more memory than I have. Here is the result on my end. I have used binary version of gnina. Even when I encountered the out of memory error, the child process is not terminated, and the .db file still updates even when some molecules are skipped.

STDERR output:
terminate called after throwing an instance of 'std::runtime_error'
  what():  out of memory
*** Aborted at 1710213751 (unix time) try "date -d @1710213751" if you are using GNU date ***
PC: @                0x0 (unknown)
*** SIGABRT (@0x3ed0013c651) received by PID 1295953 (TID 0x7fbdda1c7000) from PID 1295953; stack trace: ***
    @     0x7fbde4569420 (unknown)
    @     0x7fbde3fbc00b gsignal
    @     0x7fbde3f9b859 abort
    @     0x7fbde437bee6 (unknown)
    @     0x7fbde438df8c (unknown)
    @     0x7fbde438dff7 std::terminate()
    @     0x7fbde438e258 __cxa_throw
    @           0x75cde2 caffe::SyncedMemory::mutable_gpu_data()
    @           0x5c1e92 caffe::Blob<>::mutable_gpu_data()
    @           0x78f93a caffe::BatchNormLayer<>::Forward_gpu()
    @           0x727541 caffe::Net<>::ForwardFromTo()
    @           0x727666 caffe::Net<>::Forward()
    @           0x50b950 CNNScorer::score()
    @           0x49927e get_cnn_info()
    @           0x4a0435 do_search()
    @           0x4a23bd main_procedure()
    @           0x4a2cfe threads_at_work()
    @           0x4ba7e3 boost::_bi::list6<>::operator()<>()
    @           0x81f24e thread_proxy
    @     0x7fbde455d609 start_thread
    @     0x7fbde4098353 clone
    @                0x0 (unknown)
Aborted (core dumped)

gnina_output.tar.gz

@Samuel-gwb
Copy link
Author

Oh.
I'll try a fresh git clone and environment install to have a check !
Thanks for your so many efforts!

@Samuel-gwb
Copy link
Author

Hi @Feriolet ,
After remove previous easydock installation and pip install the latest one, it's ok !
Cheers !

@DrrDom
Copy link
Contributor

DrrDom commented Mar 12, 2024

Thank you very much! We finally overcome this issue and I hope it will not appear again)

@DrrDom DrrDom closed this as completed Mar 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants