Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conflicting file names when saving SBT #145

Open
luizirber opened this issue Mar 10, 2017 · 11 comments
Open

Conflicting file names when saving SBT #145

luizirber opened this issue Mar 10, 2017 · 11 comments
Labels

Comments

@luizirber
Copy link
Member

SBT allows repeatedly inserting the same leaf:

from sourmash_lib.sbt import SBT, GraphFactory
from sourmash_lib.sbtmh import SigLeaf
from sourmash_lib import signature

sig = next(signature.load_signatures('demo/SRR2060939_1.fastq.gz.sig'))
a = SigLeaf('name', sig)
tree = SBT(factory)
tree = SBT(GraphFactory(31, 1e5, 4)
tree.add_node(leaf)
tree.save('test')

The generated JSON

{
    "version": 2,
    "nodes": {
        "0": {
            "name": "internal.0",
            "filename": ".sbt.test/test.internal.0.sbt"
        },
        "1": {
            "name": "internal.1",
            "filename": ".sbt.test/test.internal.1.sbt"
        },
        "2": {
            "name": "name",
            "filename": ".sbt.test/test.name.sbt",
            "metadata": "name"
        },
        "3": {
            "name": "name",
            "filename": ".sbt.test/test.name.sbt",
            "metadata": "name"
        },
        "4": {
            "name": "name",
            "filename": ".sbt.test/test.name.sbt",
            "metadata": "name"
        }
    },
    "d": 2
} 

A graphical representation:
tree

Note that we have a DAG in this case, not a tree anymore... The figure is misleading, in fact there memory representation will be more like this:
tree
but the content of each name node will come from the same signature.

This is the content of .sbt.test:

.sbt.test/
├── test.internal.0.sbt
├── test.internal.1.sbt
└── test.name.sbt

The question is: is this the behavior we expect?

@luizirber luizirber added the sbt label Mar 10, 2017
@phiweger
Copy link

Is this the reason for the SBT behaviour in #133? I.e., that sbt_search returns the same signature multiple times?

@luizirber
Copy link
Member Author

@viehwegerlib it's possible (@ctb didn't send me the files for testing yet =P)

@phiweger
Copy link

shame >> @ctb

;)

@phiweger
Copy link

@luizirber should I send you the testfiles? (would need some address)

@ctb
Copy link
Contributor

ctb commented Mar 21, 2017

I think this is not the behavior we expect, at least not from the command line :). Perhaps the Python API should balk at overwriting an existing file?

Also, if two signatures are identical, we should not insert them twice (from the command line) but rather spit out a warning.

@phiweger
Copy link

phiweger commented Mar 21, 2017

+1

@ctb
Copy link
Contributor

ctb commented Apr 15, 2020

This appears to still happen:

% sourmash index -k 31 blah podar-ref/1.fa.sig podar-ref/1.fa.sig
...
== This is sourmash version 3.2.3.dev4+g0362ac3e. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

loading 2 files into SBT

loaded 2 sigs; saving SBT under "blah"

% jq . < blah.sbt.json
{
  "d": 2,
  "version": 4,
  "storage": {
    "backend": "FSStorage",
    "args": {
      "path": ".sbt.blah"
    }
  },
  "factory": {
    "class": "GraphFactory",
    "args": [
      1,
      100000,
      4
    ]
  },
  "nodes": {
    "0": {
      "filename": "internal.0",
      "name": "internal.0",
      "metadata": {
        "min_n_below": 1478
      }
    },
    "1": {
      "filename": "c11126d0591db94cd3d1c8568499375f",
      "name": "c11126d0591db94cd3d1c8568499375f",
      "metadata": "c11126d0591db94cd3d1c8568499375f"
    },
    "2": {
      "filename": "c11126d0591db94cd3d1c8568499375f",
      "name": "c11126d0591db94cd3d1c8568499375f",
      "metadata": "c11126d0591db94cd3d1c8568499375f"
    }
  }
}
% ls -al .sbt.blah/
total 160
drwxr-xr-x    4 t  staff    128 Apr 15 07:36 .
drwxr-xr-x  322 t  staff  10304 Apr 15 07:36 ..
-rw-r--r--    1 t  staff  26036 Apr 15 07:36 c11126d0591db94cd3d1c8568499375f
-rw-r--r--    1 t  staff  50042 Apr 15 07:36 internal.0

@ctb
Copy link
Contributor

ctb commented Apr 15, 2020

However, search does not find it more than once --

% sourmash search podar-ref/1.fa.sig blah

== This is sourmash version 3.2.3.dev4+g0362ac3e. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

selecting default query k=31.
loaded query: CP001941.1 Aciduliprofundum bo... (k=31, DNA)
loaded 1 databases.

1 matches:
similarity   match
----------   -----
100.0%       CP001941.1 Aciduliprofundum boonei T469, complete genome

This is due to code introduced in #556 (the Index base class refactor) that collapses repeat md5s across a database search - see code.

@ctb
Copy link
Contributor

ctb commented Apr 15, 2020

See also #884 which fixes a separate problem I introduced 👀 where the filename was based not on signature md5sum but rather on signature name, so if you had two different signatures with the same name, one overwrote the other.

@ctb
Copy link
Contributor

ctb commented Jul 3, 2020

is this now resolved by #994?

@luizirber
Copy link
Member Author

luizirber commented Jul 3, 2020

#994 solves the "storing duplicates" part, but not the "you are adding a duplicate, is this really what you want?" problem.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants