Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MRG: read sig.gz_{n} sigs from zipfiles #136

Merged
merged 5 commits into from
Oct 9, 2023
Merged

Conversation

bluegenes
Copy link
Contributor

as noted in sourmash-bio/sourmash#2749 (comment), sourmash zipfiles store repeated md5sums as .sig.gz_{n} instead of _{n}.sig.gz. This PR allows us to read those files as well, rather than skipping over them.

Copy link
Collaborator

@ctb ctb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean, ok, but ... I worry we're now introducing different behavior b/t pyo3_branchwater and sourmash :)

Two questions:

  • how does manysketch handle this situation? Are multiple sketches created and saved?
  • does pyo3_branchwater currently pay any attention to manifests when reading?

Anyway, I think it's fine to go ahead and merge, but let's also create an issue to re-examine this in the future :)

@bluegenes
Copy link
Contributor Author

bluegenes commented Oct 6, 2023

  • how does manysketch handle this situation? Are multiple sketches created and saved?

manysketch writes multiple signatures, using format!("signatures/{}_{}.sig.gz", md5sum_str, count) for duplicates. I had thought this was replicating what we do in sourmash, didn't realize we were appending the count after the gz.

relevant utils.rs code for writing sigs from manysketch

 for (sig, param) in sigs.iter().zip(params.iter()) {
                        let md5sum_str = sig.md5sum();
                        let count = md5sum_occurrences.entry(md5sum_str.clone()).or_insert(0);
                        *count += 1;
                        let sig_filename = if *count > 1 {
                            format!("signatures/{}_{}.sig.gz", md5sum_str, count)
                        } else {
                            format!("signatures/{}.sig.gz", md5sum_str)
                        };
                        write_signature(sig, &mut zip, options, &sig_filename);
                        manifest_rows.push(make_manifest_row(
                            sig,
                            &filename,
                            &sig_filename,
                            param.scaled,
                            param.num,
                            param.track_abundance,
                            param.is_dna,
                            param.is_protein,
                        ));
  • does pyo3_branchwater currently pay any attention to manifests when reading?

no, it currently looks through the sig files themselves. I expect to switch to using manifests as part of #134 or follow ups.

Hmm, the optimal solution might be to use the manifest to load the same file for duplicates (not storing the duplicates at all). What are the challenges associated with that solution?

@ctb
Copy link
Collaborator

ctb commented Oct 6, 2023

Hmm, the optimal solution might be to use the manifest to load the same file for duplicates (not storing the duplicates at all). What are the challenges associated with that solution?

ooh! I like it ;). Not ready to commit to it on sourmash yet, but hot take is it's a leading contender!

@bluegenes
Copy link
Contributor Author

bluegenes commented Oct 6, 2023

Note: some of our utilities may only keep *sig.gz files, leaving duplicates behind

sig extract does not keep duplicates:

sourmash sig extract -k 21 mm.dna.zip -o mm.dna-k21.zip

== This is sourmash version 4.8.4. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

loaded 428 total that matched ksize & molecule type
extracted 328 signatures from 1 file(s)

sourmash sig grep does keep duplicate sigs

sourmash sig grep mammarenavirus spillover.dna.zip -k 21 -o mm.dna.zip

== This is sourmash version 4.8.4. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

saving matching signatures to 'mm.dna.zip'
loaded 33566 total that matched ksize & molecule type
extracted 428 signatures from 1 file(s)

I confirmed this is not a reporting difference by listing files in *zip/signatures dir.

@ctb
Copy link
Collaborator

ctb commented Oct 9, 2023

(just to be clear, I think you can/should merge this :)

@bluegenes bluegenes merged commit ed28743 into main Oct 9, 2023
1 check passed
@bluegenes bluegenes deleted the load-all-from-zip branch October 9, 2023 20:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants