Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issue with a large Pseudomonas dataset #299

Open
RyanCFink opened this issue Jan 16, 2024 · 4 comments
Open

Issue with a large Pseudomonas dataset #299

RyanCFink opened this issue Jan 16, 2024 · 4 comments

Comments

@RyanCFink
Copy link

About the dataset

We have two datasets, one with 8046 Pseudomonas genomes, and
another, a subset of the first, with 6347 Pseudomonas aeruginosa genomes.
For the issues described here we will focus on the latter, but the issues are common to both.

About the issue

As part of our workflow, we create subsets based on the core and accessory distances, as infered by PopPUNK.
In this dataset we found that many of the distances for the genomes are 0 for both the accessory and the core genomes.

Here are the command we used to extract the distance table:

  1. poppunk --create-db
poppunk \
    --create-db \
    --output poppunk_db \
    --r-files poppunk_samplesheet.tsv  \
    --threads 24
  1. poppunk_extract_distances.py
poppunk_extract_distances.py \
    --distances poppunk_db/poppunk_db.dists \
    --output poppunk_db_distances.tsv
  1. We also fit to the dbscan model:
poppunk \
    --fit-model dbscan \
    --ref-db poppunk_db \
    --output poppunk_db \
    --threads 24 \
    --k-step 3 --min-k 17 --max-k 41 --plot-fit 5

Here is the resulting figure:

  1. We then tried to subset the genomes based on their "similarity" (1-distance). Here are the genomes retained for a few different thresholds:

You can see the number of retained genomes is very small compared to the input size (6347), this is something we've only encountered in this dataset.

Other datasets of similar size have yielded more sensible subsets, e.g. an 11373 genome dataset resulted in these subsets (with a few more thresholds displayed to show how it scales):

Some things we've tried

  • Running the PopPUNK QC prior to fitting the model didn't work, most type isolates we chose got removed during the QC and the QC failed:
  Graph-tools OpenMP parallelisation enabled: with 1 threads
  Running sequence QC
  Running QC on sketches
  93 samples failed
  Running distance QC
  Selected type isolate for distance QC is SAMD00012124
  5282 samples failed
  Traceback (most recent call last):
    File "/usr/local/bin/poppunk", line 11, in <module>
      sys.exit(main())
    File "/usr/local/lib/python3.9/site-packages/PopPUNK/__main__.py", line 413, in main
      raise RuntimeError('Type isolate ' + qc_dict['type_isolate'] + \
  RuntimeError: Type isolate SAMD00012124 not found in isolates after QC; check name of type isolate and QC options
  • Adding some of our genomes to the BacPop Pseudomonas database using poppunk_assign. This seemingly didn't change anything on the original PopPUNK db.

Before adding to the database:

After:

  • Running an even smaller subset with only 100 random Pseudomonas aeruginosa genomes. This led to similar issues.
@johnlees
Copy link
Member

johnlees commented Jan 17, 2024

Regarding the QC options, could you try with:

--max-pi-dist 1 --max-a-dist 1 --max-zero-dist 20 --length-sigma 5 --prop-n 0.1

It would be useful to know from the output (which I think you should still get) the reasons isolates are being removed

Regarding query assignment: this does not update the distance plot but you should still get clusters assignments out. Are you having problems/errors in this output? A flag that can be helpful in these cases is --serial which means that assignments are made one by one, so if there are problematic input sequences they only cause issues in their own assignments. See also: https://poppunk.bacpop.org/troubleshooting.html#most-all-of-my-samples-merge-when-i-run-a-query

@RyanCFink
Copy link
Author

We tried the extra flags you provided when running the QC, this time it ran to completion,
but the underlying issue remains the same:

PopPUNK (POPulation Partitioning Using Nucleotide Kmers)
        (with backend: sketchlib v2.0.1
         sketchlib: /usr/local/lib/python3.9/site-packages/pp_sketchlib.cpython-39-x86_64-linux-gnu.so)

Graph-tools OpenMP parallelisation enabled: with 1 threads
Running sequence QC
Running QC on sketches
93 samples failed
Running distance QC
0 samples failed
Removing 93 sequences
Recalculating random matches with strand_preserved = False
Calculating random match chances using Monte Carlo

Done

We were wondering, since you used a similar Pseudomonas dataset in the original PopPUNK paper and in bacpop, if you could share:

  • If there were any options different from the default when creating your pseudomonas db, and if so, if you could share
    the commands that were used.
  • If you performed any cleanup on the genomes prior to running PopPUNK (e.g. removing plasmids)

@johnlees
Copy link
Member

@samhorsfield96 I think you created the pseudomonas DB? Do you remember/know if there was anything special you did?

We wouldn't have changed the sequence files (e.g. removing plasmids), but we would have removed isolates. It still looks to me like you have a few isolates that need to be removed to get this to work. The other thing that would be useful is to find those which are giving zero core distance. You can do that with a command like:

sketchlib query dist poppunk_db | awk '$3 == 0 && $4 > 0.1 {print $1}' | sort | uniq -c | sort -k2,2 -n -r

Which should give you the isolates with the most 'bad' distances from most to least

@samhorsfield96
Copy link
Collaborator

Hi both,

The database on the bacpop website was generated using:

poppunk --create-db --output db --r-files input.txt --qc-filter prune --length-sigma 1 --plot-fit 3 --min-k 17 --max-k 41 --sketch-size 100000

From memory I had to increase the sketch size and k-mer ranges to increase resolution at smaller core distances. I also reduced the genome size standard deviation boundary to remove likely contaminated genomes.

Let me know if you have any further questions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants