Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Small fixes to database pruning and updating #328

Merged
merged 70 commits into from
Oct 9, 2024
Merged

Conversation

nickjcroucher
Copy link
Collaborator

  • Fix to --remove-samples, which did not work due to missing parentheses
  • Updated QC mode to also prune samples out of the graph with prune_graph - e.g. for trimming a database, or for removing inappropriate samples only identified after visualisation (I assume networks code will be optimised in the future, so this is a little rough and ready)
  • Updated unwords.py to no longer use sets - https://stackoverflow.com/questions/15837729/random-choice-from-set

@johnlees
Copy link
Member

I'll have a look at this next week. We also need to co-ordinate with #322 which I had almost finished to fix the latter issue too, maybe I can just remove it from there.

Also, was #304 supposed to have been merged at some point, or does this replace that?

@nickjcroucher
Copy link
Collaborator Author

Apologies, hadn't seen #322 - I don't get notifications of issues as I used to for some reason. These were just the changes I needed to make to prepare data for @absternator. #304 is more a rumination on the nature of time and mortality at this point, but I'll pull across the changes from master into it when I get a chance.

PopPUNK/__main__.py Show resolved Hide resolved
PopPUNK/__main__.py Show resolved Hide resolved
@@ -753,7 +753,7 @@ def assign_query_hdf5(dbFuncs,
storePickle(combined_seq, combined_seq, True, None, dists_out)

# Clique pruning
if model.type != 'lineage':
if model.type != 'lineage' and os.path.isfile(ref_file_name):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Check this one against #322 – I think I remember changing it there perhaps. Might be best just to remove this change for now

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it changes the behaviour if there is no reference file, as in the current GPS database - I suppose it just depends what we want the behaviour to be in such a situation.

PopPUNK/bgmm.py Outdated Show resolved Hide resolved
@@ -333,7 +334,7 @@ def fit(self, X, max_components):
else:
y = self.assign(self.subsampled_X, max_batch_size = self.max_batch_size)
self.within_label = findWithinLabel(self.means, y)
self.between_label = findWithinLabel(self.means, y, 1)
self.between_label = findBetweenLabel_bgmm(self.means, y)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why this change in behaviour? This would change some of the examples I think

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This harmonises the behaviour of BGMM and DBSCAN - identifying the between-strain cluster as that containing the largest number of points. Name is a bit ugly.

PopPUNK/network.py Outdated Show resolved Hide resolved
PopPUNK/network.py Outdated Show resolved Hide resolved
PopPUNK/plot.py Outdated Show resolved Hide resolved
PopPUNK/utils.py Outdated
Comment on lines 545 to 546
adj (float)
Distance by which to shift the interception point
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where is this now used? Also which distance, as in up or down the y-axis?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Used in

x_max_start, y_max_start = decisionBoundary(mean0, gradient, adj = -1*min_move)
to harmonise behaviour of unconstrained with other refinement modes.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better definition in 2054e6c.

PopPUNK/visualise.py Outdated Show resolved Hide resolved
PopPUNK/plot.py Outdated Show resolved Hide resolved
@nickjcroucher
Copy link
Collaborator Author

nickjcroucher commented Oct 8, 2024

Thank you for the comments @johnlees! I think I've addressed most of them - would it now make sense:

  • merge in this and Add fast update function #322
  • set up an issue for updating the network functions, as suggested here, and to enable GPU-based fast pruning (yes, the can would be kicked down the road)
  • set up a new branch to test whether we can get beebop running on the full database using the modifications you suggested, so we can keep @absternator busy?

@johnlees
Copy link
Member

johnlees commented Oct 8, 2024

One final response in my comments, just on checking input is sorted

Thank you for the comments @johnlees! I think I've addressed most of them - would it now make sense:

* merge in this and [unword consonants as list #322](https://github.com/bacpop/PopPUNK/pull/322)

Let's merge this first. I still wanted to test a couple of things on that branch (should have time in a couple of weeks), then I can just resolve and merge problems on that one myself when it's ready.

* set up an issue for updating the network functions, as suggested here, and to enable GPU-based fast pruning (yes, the can would be kicked down the road)

Yep, good plan

* set up a new branch to test whether we can get beebop running on the full database using the modifications you suggested, so we can keep @absternator busy?

Also makes sense!

@nickjcroucher nickjcroucher mentioned this pull request Oct 9, 2024
3 tasks
@nickjcroucher nickjcroucher merged commit 2943e5f into master Oct 9, 2024
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants