running charcoal on GTDB: using snakemake to run really large charcoal runs #199

Open
taylorreiter opened this issue Oct 29, 2021 · 4 comments
Labels
documentation Improvements or additions to documentation

@taylorreiter
Member

I'm trying to run charcoal on all of GTDB rs202. I have it running in an srun session on farm, but it kept filling /tmp, so I tried using the --default-resources flag to change TMPDIR to /scratch/tereiter. That doesn't seem to work:

$ python -m charcoal run inputs/charcoal_conf/gtdb_rs202_genomes.conf -j 16 clean --nolock \
     --use-conda --latency-wait 15 --rerun-incomplete -k --default-resources 'tmpdir=/scratch/tereiter'
** read 258406 provided lineages
** config file checks PASSED!
** from here on out, it's all snakemake...
Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 16
Rules claiming more threads will be scaled down.
Job counts:
        count   jobs
        1       clean
        258406  clean_contigs
        1       combine_hit_list
        256564  compare_taxonomy_single
        256547  contigs_sig
        256562  make_contigs_taxonomy_json
        256548  prefetch_all
        1284629
Select jobs to execute...
InputFunctionException in line 297 of /home/tereiter/github/2020-charcoal-paper/.snakemake/conda/6a946a6e41b403242a5405ce20e387aa/lib/python3.9/site-packages/charcoal/Snakefile:
Error:
  SyntaxError: invalid syntax (<string>, line 1)
Wildcards:
  filename=GCA_002441395.1_genomic.fna.gz
Traceback:

Error in snakemake invocation: Command '['snakemake', '-s', '/home/tereiter/github/2020-charcoal-paper/.snakemake/conda/6a946a6e41b403242a5405ce20e387aa/lib/python3.9/site-packages/charcoal/Snakefile', '--use-conda', '-j', '1', '-j', '16', 'clean', '--nolock', '--use-conda', '--latency-wait', '15', '--rerun-incomplete', '-k', '--default-resources', 'tmpdir=/scratch/tereiter', '--configfile', '/home/tereiter/github/2020-charcoal-paper/.snakemake/conda/6a946a6e41b403242a5405ce20e387aa/lib/python3.9/site-packages/charcoal/conf/defaults.conf', '/home/tereiter/github/2020-charcoal-paper/.snakemake/conda/6a946a6e41b403242a5405ce20e387aa/lib/python3.9/site-packages/charcoal/conf/system.conf', 'inputs/charcoal_conf/gtdb_rs202_genomes.conf']' returned non-zer

I get the same error when I pass the tmpdir value without quotes:

$ python -m charcoal run inputs/charcoal_conf/gtdb_rs202_genomes.conf -j 16 clean --nolock \
     --use-conda --latency-wait 15 --rerun-incomplete -k --default-resources tmpdir=/scratch/tereiter

However, this works until it fills /tmp:

$ python -m charcoal run inputs/charcoal_conf/gtdb_rs202_genomes.conf -j 16 clean --nolock \
     --use-conda --latency-wait 15 --rerun-incomplete -k 

I’m contemplating spamming bml as shown below and hoping that works, but wanted to see if you had any hot takes first:

python -m charcoal run inputs/charcoal_conf/gtdb_rs202_genomes.conf \
   -j 500 clean --nolock --use-conda --latency-wait 15 \
   --rerun-incomplete -k --default-resources tmpdir=/scratch/tereiter \
   --cluster "sbatch -t 480 -J char -p bml -n 1 -N 1 -c 1 --mem=16Gb"
@ctb
Member

ctb commented Nov 1, 2021

tl;dr so far:

  • /tmp shouldn't be filling up because of charcoal itself; it looks like snakemake is putting stuff there.
  • the --default-resources failure looks like a bug in snakemake to me; I can't find a clear source for it in the charcoal Snakefile.

I'll have to dig into this more.

details

I got a similar error with an updated version of snakemake, over in #200:

% python -m charcoal run demo/demo.conf -j 4 --default-resources 'tmpdir=/scratch/ctbrown'
** read 2 provided lineages
** config file checks PASSED!
** from here on out, it's all snakemake...
Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 4
Rules claiming more threads will be scaled down.
InputFunctionException in line 337 of /home/ctbrown/charcoal/charcoal/Snakefile:
Error:
  WorkflowError:
    Failed to evaluate DefaultResources value '/scratch/ctbrown'.
        String arguments may need additional quoting. Ex: --default-resources "tmpdir='/home/user/tmp'".
Wildcards:
  filename=GCF_000005845-subset.fa.gz
Traceback:

Error in snakemake invocation: Command '['snakemake', '-s', '/home/ctbrown/charcoal/charcoal/Snakefile', '--use-conda', '-j', '1', '-j', '4', '--default-resources', 'tmpdir=/scratch/ctbrown', '--configfile', '/home/ctbrown/charcoal/charcoal/conf/defaults.conf', '/home/ctbrown/charcoal/charcoal/conf/system.conf', 'demo/demo.conf']' returned non-zero exit status 1.

But ... I have no idea what's going on, as charcoal doesn't explicitly use TMPDIR for anything.

If I had to guess, there's a bugaroni somewhere in snakemake's interactions with charcoal rules, related to this statement in the snakemake docs:

"The tmpdir resource automatically leads to setting the TMPDIR variable for shell commands, scripts, wrappers and notebooks."

That is, snakemake is trying to do something clever in wrapping the shell: block.
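
One possible reading of the two errors above, sketched under the assumption that snakemake evaluates each --default-resources value as a Python expression. The SyntaxError in the first traceback and the "may need additional quoting" hint in the second both point that way, but nothing in this thread confirms it. `evaluate_resource` is a hypothetical stand-in for illustration, not snakemake's actual code.

```python
def evaluate_resource(value: str):
    """Hypothetical stand-in: evaluate a resource value as a Python expression."""
    # Empty builtins so that only literals and simple expressions evaluate.
    return eval(value, {"__builtins__": {}})


# A bare path is not a valid Python expression, so evaluation fails.
try:
    evaluate_resource("/scratch/ctbrown")
except SyntaxError as exc:
    print("unquoted:", type(exc).__name__)

# With inner quotes the value is a string literal and evaluates cleanly.
print("quoted:", evaluate_resource("'/scratch/ctbrown'"))
```

If that reading is right, the form given in snakemake's own hint, --default-resources "tmpdir='/scratch/ctbrown'", would be the one to try. It was not tested in this thread.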

I dug into what is created in the temp directory currently, and I got this:

(base) ctbrown@bm5:~$ ls -lat /tmp | head
total 10072
drwxrwxrwt 1460 root     root     679936 Nov  1 07:06 .
drwx------    2 ctbrown  ctbrown    4096 Nov  1 07:06 tmpc4ubcozrsnakemake-runtime-source-cache

and

(base) ctbrown@bm5:~$ ls -la /tmp/tmpc4ubcozrsnakemake-runtime-source-cache/
total 704
drwx------    2 ctbrown ctbrown   4096 Nov  1 07:06 .
drwxrwxrwt 1460 root    root    679936 Nov  1 07:06 ..
-rw-rw-r--    1 ctbrown ctbrown    358 Nov  1 07:06 243097586a3d1f901fb796b7e018dfa30973f3f555ad5ff80daff1b80a777026
-rwxrwxr-x    1 ctbrown ctbrown      0 Nov  1 07:06 243097586a3d1f901fb796b7e018dfa30973f3f555ad5ff80daff1b80a777026.lock
-rw-rw-r--    1 ctbrown ctbrown  22327 Nov  1 07:06 8f549f4cc30a55f65af98023cd4d1e73f45c5582800b8832fcd7b19f40615e74
-rwxrwxr-x    1 ctbrown ctbrown      0 Nov  1 07:06 8f549f4cc30a55f65af98023cd4d1e73f45c5582800b8832fcd7b19f40615e74.lock
-rw-rw-r--    1 ctbrown ctbrown    336 Nov  1 07:06 99745d730084e46388ea4659a1f87c3b8b772a36c437e6d4b710102ccb6f5e89
-rwxrwxr-x    1 ctbrown ctbrown      0 Nov  1 07:06 99745d730084e46388ea4659a1f87c3b8b772a36c437e6d4b710102ccb6f5e89.lock

to me it all looks like snakemake stuff.

@taylorreiter
Member Author

I've deleted them all now, but I was getting files like ad083b9d9e2f4318ba015e6fa2aaf8e3-pulp.mps and ad083b9d9e2f4318ba015e6fa2aaf8e3-pulp.sol, each of which was around 60 MB, I think. I'm pretty sure these were caused by charcoal or by snakemake running charcoal. There is a chance they were written by some other process I was running on the nodes, but I tried to check and thought I had narrowed it down to charcoal.
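
For anyone trying to confirm what is filling the temp directory, a minimal diagnostic sketch follows. It counts and sizes files whose names match the `-pulp.mps` and `-pulp.sol` suffixes mentioned above. The function name `pulp_file_usage` and the glob patterns are assumptions made for illustration; they are not part of charcoal or snakemake.

```python
import tempfile
from pathlib import Path


def pulp_file_usage(directory, patterns=("*-pulp.mps", "*-pulp.sol")):
    """Return (count, total_bytes) for files in directory matching patterns."""
    count = 0
    total = 0
    for pattern in patterns:
        for path in Path(directory).glob(pattern):
            try:
                if path.is_file():
                    total += path.stat().st_size
                    count += 1
            except OSError:
                # The file may have been removed while we were scanning.
                continue
    return count, total


# Scan whatever directory Python currently treats as the temp directory.
count, total = pulp_file_usage(tempfile.gettempdir())
print(f"{count} pulp files, {total / 1e6:.1f} MB")
```

Running this before and during a charcoal run would show whether the files appear only while snakemake is scheduling jobs.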

@ctb
Member

ctb commented Nov 1, 2021

oh, I'm sure they're created by snakemake, they're just not charcoal specific (as in, snakemake is creating them without any charcoal-specific configuration or instruction).

the pulp in the filenames makes me think it has to do with the scheduler that snakemake uses, which I think uses (used?) pulp.

A quick google search suggests that this may be a related issue - snakemake/snakemake#1003 - and suggests a workaround, which is to set TMPDIR explicitly in the environment. Can you give that a try?

@taylorreiter
Member Author

This worked!
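
A minimal sketch of the workaround that worked here, setting TMPDIR in the environment of the launched process rather than through --default-resources. The helper `run_with_tmpdir` is hypothetical. The demonstration uses a throwaway directory in place of /scratch/tereiter and a small probe command in place of charcoal, so that it can run anywhere.

```python
import os
import subprocess
import sys
import tempfile


def run_with_tmpdir(cmd, tmpdir):
    """Run cmd with TMPDIR set to tmpdir and return the completed process."""
    os.makedirs(tmpdir, exist_ok=True)
    # Copy the current environment and override only TMPDIR.
    env = dict(os.environ, TMPDIR=tmpdir)
    return subprocess.run(cmd, env=env, check=True, capture_output=True, text=True)


# Throwaway directory standing in for a scratch location.
scratch = tempfile.mkdtemp()

# Probe: ask a child Python process which temp directory it would use.
probe = [sys.executable, "-c", "import tempfile; print(tempfile.gettempdir())"]
result = run_with_tmpdir(probe, scratch)
print(result.stdout.strip())
```

For the real run, `cmd` would be the charcoal invocation from the top of this issue (without --default-resources) and `tmpdir` would be /scratch/tereiter. The shell equivalent is to export TMPDIR before launching charcoal.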

@taylorreiter taylorreiter added the documentation Improvements or additions to documentation label Jun 17, 2022