Preprocessing Script Options
Running `collate.py` will process an assembly graph file so that it can be visualized, producing a SQLite3 `.db` file that can be loaded in the viewer interface to visualize the assembly graph.
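Since the output is an ordinary SQLite3 file, it can be inspected with Python's standard `sqlite3` module. The sketch below only lists the tables present in a `.db` file; it assumes nothing about MetagenomeScope's schema, and the function name `list_tables` is hypothetical.

```python
import sqlite3


def list_tables(db_path):
    """Return the names of all tables stored in a SQLite database file."""
    # Note: sqlite3.connect() creates an empty database if db_path does not
    # exist, so check the path first if that would be a problem.
    connection = sqlite3.connect(db_path)
    try:
        rows = connection.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        connection.close()
    return [name for (name,) in rows]
```

For example, `list_tables("prefix.db")` would return whatever table names the preprocessing script wrote for the output prefix `prefix`.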
usage: mgsc [-h] -i INPUTFILE -o OUTPUTPREFIX [-d OUTPUTDIRECTORY] [-w]
[-maxn MAXNODECOUNT] [-maxe MAXEDGECOUNT] [-ub USERBUBBLEFILE]
[-ubl] [-up USERPATTERNFILE] [-upl] [-spqr] [-b BICOMPONENTFILE]
[-sp] [-pg] [-px] [-nbdf] [-npdf]
Prepares an assembly graph file for visualization, generating a database file
that can be loaded in the MetagenomeScope viewer interface.
optional arguments:
-h, --help show this help message and exit
-i INPUTFILE, --inputfile INPUTFILE
input assembly graph filename (LastGraph, GFA, or
MetaCarvel GML)
-o OUTPUTPREFIX, --outputprefix OUTPUTPREFIX
output file prefix for .db files; also used for most
auxiliary files
-d OUTPUTDIRECTORY, --outputdirectory OUTPUTDIRECTORY
directory in which all output files will be stored;
defaults to current working directory (this directory
will be created if it does not exist, but if the
directory cannot be created then an error will be
raised)
-w, --overwrite overwrite output files (if this isn't passed, and a
non-auxiliary file would need to be overwritten, an
error will be raised)
-maxn MAXNODECOUNT, --maxnodecount MAXNODECOUNT
connected components with more nodes than this value
will not be laid out or available for display in the
viewer interface (default 7999, must be at least 1)
-maxe MAXEDGECOUNT, --maxedgecount MAXEDGECOUNT
connected components with more edges than this value
will not be laid out or available for display in the
viewer interface (default 7999, must be at least 1)
-ub USERBUBBLEFILE, --userbubblefile USERBUBBLEFILE
file describing pre-identified bubbles in the graph,
in the format of MetaCarvel's bubbles.txt output: each
line of the file is formatted as (source ID) (tab)
(sink ID) (tab) (all node IDs in the bubble, including
source and sink IDs, all separated by tabs). See the
MetaCarvel documentation for more details on this
format.
-ubl, --userbubblelabelsused
use node labels instead of IDs in the pre-identified
bubbles file specified by -ub
-up USERPATTERNFILE, --userpatternfile USERPATTERNFILE
file describing any pre-identified structural patterns
in the graph: each line of the file is formatted as
(pattern type) (tab) (all node IDs in the pattern, all
separated by tabs). If (pattern type) is "Bubble" or
"Frayed Rope", then the pattern will be represented in
the visualization as a Bubble or Frayed Rope,
respectively; otherwise, the pattern will be
represented as a generic "misc. user-specified
pattern," and colorized accordingly in the
visualization.
-upl, --userpatternlabelsused
use node labels instead of IDs in the pre-identified
misc. patterns file specified by -up
-spqr, --computespqrdata
compute data for the SPQR "decomposition modes" in
MetagenomeScope; necessitates a few additional system
requirements (see MetagenomeScope's installation
instructions wiki page for details)
-b BICOMPONENTFILE, --bicomponentfile BICOMPONENTFILE
file containing bicomponent information for the
assembly graph (this argument is only used if -spqr is
passed, and is not required even in that case; the
needed files will be generated if -spqr is passed and
this option is not passed)
-sp, --structuralpatterns
save .txt files containing node information for all
structural patterns identified in the graph
-pg, --preservegv save all .gv (DOT) files generated for nontrivial
(i.e. containing more than one node, or at least one
edge or node group) connected components
-px, --preservexdot save all .xdot files generated for nontrivial
connected components
-nbdf, --nobackfilldotfiles
produces .gv (DOT) files without cluster "backfilling"
for each nontrivial connected component in the graph;
use of this argument doesn't impact the .db file
produced by this script -- it just demonstrates the
functionality in layout linearization provided by
cluster "backfilling"
-npdf, --nopatterndotfiles
produces .gv (DOT) files without any structural
pattern information embedded; as with -nbdf, this
doesn't actually impact the .db file -- it just
provides a frame of reference for the impact
clustering can have on dot's layouts
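As a rough illustration of how the required and common optional arguments combine, the following sketch assembles a command line from the flags documented above. The helper name `build_collate_command`, the script path `collate.py`, and the choice to launch it through the current Python interpreter are all assumptions; adjust them to however the script is installed on your system.

```python
import sys


def build_collate_command(input_file, output_prefix, output_dir=None,
                          overwrite=False, script="collate.py"):
    """Build an argument list using the -i, -o, -d, and -w flags."""
    # -i and -o are the only arguments shown as required in the usage text.
    command = [sys.executable, script, "-i", input_file, "-o", output_prefix]
    if output_dir is not None:
        command += ["-d", output_dir]
    if overwrite:
        command.append("-w")
    return command


# The list can then be passed to subprocess.run(command, check=True).
```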
The script will always produce a `.db` file. Certain arguments (`-pg`, `-px`, `-nbdf`, `-sp`) can be passed to produce more output files; for a thorough description of these arguments, see the command-line argument descriptions.
The script will also generate a few types of auxiliary files containing various information about the structure of the assembly graph. These files are:
- `*_links`, where `*` is the output prefix passed via `-o`. Only one of these files will be generated per execution of `collate.py`. This file indicates all the edges in the assembly graph. If you pass in `-b` and the input assembly graph has unoriented contigs, then this file will not be generated (since it would be equivalent to the `_single_links` file in that case).
- `*_single_links`, where `*` is the output prefix passed via `-o`. This file will only be generated if the input assembly graph has unoriented contigs. In terms of currently supported input filetypes, this means that this file will only be generated when the input assembly graph is of type LastGraph or GFA.
- `*_bicmps`, where `*` is the output prefix passed via `-o`. Only one of these files will be generated per execution of `collate.py`. This file indicates the various separation pairs contained within the assembly graph (see Nijkamp et al. for a brief overview of separation pairs and their usage in bubble detection). An existing version of this file can be passed to the script using `-b`, which avoids the work of creating the file again.
- `component_D.info`, where `D` is an integer greater than 0. One of these files will be created for every biconnected component contained within the assembly graph: these files indicate the contents of the SPQR tree defined for their corresponding biconnected component.
- `spqrD.gml`, where `D` is an integer greater than 0. These files correspond to `component_D.info` files: they indicate the connections between the metanodes of an SPQR tree.
The script requires all `component_D.info` and `spqrD.gml` files to be removed from the output directory before it generates more of them. If `-w` is enabled, then all existing files with corresponding names in the output directory will be deleted; however, if `-w` is not enabled, then an error will be raised.

Similarly, if files exist in the output directory with filenames overlapping those of the `*_links` and `*_bicmps` files, then either those files will be deleted (if `-w` is enabled) or an error will be raised (if `-w` is not enabled).
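The rule for leftover SPQR files can be sketched as follows. This is a minimal illustration of the behavior described above, not the script's actual code: the function names are hypothetical, and the glob patterns are an approximation (they match any text in place of `D`, not only positive integers).

```python
import glob
import os


def find_stale_spqr_files(output_dir):
    """Return existing component_D.info and spqrD.gml files, sorted by path."""
    matches = []
    for pattern in ("component_*.info", "spqr*.gml"):
        matches.extend(glob.glob(os.path.join(output_dir, pattern)))
    return sorted(matches)


def clear_stale_spqr_files(output_dir, overwrite):
    """Delete stale SPQR files if overwrite is set; otherwise raise an error."""
    stale = find_stale_spqr_files(output_dir)
    if stale and not overwrite:
        raise FileExistsError(
            "SPQR files already exist in %s; pass -w to overwrite" % output_dir
        )
    for path in stale:
        os.remove(path)
    return stale
```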
- `-i` The input assembly graph file to be used.
  - See the MetagenomeScope README for an up-to-date list of input assembly graph filetypes supported.
- `-o` The file prefix to be used for all files generated (with the exception of some SPQR files). As an example, given the argument `-o prefix`, the file `prefix.db` would be generated. If .gv and/or .xdot files are created (depending on the `-pg` or `-px` arguments, respectively), then those files will be numbered according to the relative size rank (in nodes) of their respective connected component within the assembly graph.
- `-d` This optional argument specifies the name of the directory in which all output files will be stored. If this argument is not given, then all files will be generated in the current working directory.
  - If the specified directory does not already exist, then the preprocessing script will create it. If the directory cannot be created (e.g. there exists a file in the current working directory with the same name as the specified directory), an error will be raised.
- `-pg` This optional argument produces DOT files (suffix .gv) in the output directory. As an example, given the arguments `-o prefix` and `-pg` for an assembly graph with 3 connected components, the files `prefix.db`, `prefix_1.gv`, `prefix_2.gv`, and `prefix_3.gv` would be created (where `prefix_1.gv` indicates the largest connected component by number of nodes, `prefix_2.gv` indicates the next largest connected component, and so on).
- `-px` This optional argument produces .xdot files in the output directory. These files are labelled in an identical fashion to `.gv` files, with the only difference in naming being the file suffix (.xdot instead of .gv).
- `-sp` This optional argument will produce .txt files in the output directory describing the nodes contained in the various types of structural patterns identified in the assembly graph.
  - Each file will be named `sp_clustertypes.txt`, where `clustertypes` is one of (`bubbles`, `frayed_ropes`, `chains`, `cyclic_chains`, `misc_patterns`).
  - Files will only be created for structural pattern types that were identified in the graph; so if an input assembly graph only contains chains (and no bubbles, frayed ropes, cyclic chains, etc.) then only a file named `sp_chains.txt` will be produced.
- `-w` This optional argument allows the overwriting of output files (.db/.xdot/.gv/links/single_links/bicmps/.info/spqr.gml/structural pattern .txt files). If this argument is not given, then:
  - An error will be raised if writing a .db file would cause another .db file to be overwritten.
  - A warning will be displayed if writing to a .gv or .xdot file would cause another .gv/.xdot file to be overwritten. In this case, the .gv/.xdot file in question simply would not be saved.
  - Note that a folder in the output directory whose name conflicts with an output file (e.g. a directory named `e_coli.db/` in the output directory while attempting to produce a file named `e_coli.db`) will cause an error/warning to be raised regardless of whether or not `-w` is set.
  - See this page for details on how this option works, and a few possible boundary conditions.
- `-b` This optional argument lets you pass in an existing file indicating the separation pairs in the graph (to be used in the detection of complex bubbles) to the script.
- `-ub` This optional argument lets you pass in a file describing pre-identified bubbles in the input graph, which will be automatically highlighted and grouped (as with "normal" bubbles discovered by MetagenomeScope).
  - The format of this file should match MetaCarvel's `bubbles.txt` output file: each line of the file should be formatted as `(source contig ID)\t(sink contig ID)\t(all node IDs in the bubble, including source and sink IDs, all separated by tabs)`.
  - As with normal MetagenomeScope-identified bubbles, the same contig can't be contained in multiple bubbles. Bubbles specified in the input file here are processed starting from the first line and going down; any bubbles containing already-"used" contigs will be skipped.
  - The contigs contained in these bubbles should at least be contiguous in some fashion. (This will eventually be validated.)
- `-ubl` If this optional argument is passed (and if `-ub` is passed), then the pre-identified bubbles file specified by `-ub` will be processed looking for contig labels instead of IDs.
- `-up` Like `-ub`, this optional argument lets you pass in a file describing pre-identified miscellaneous patterns in the input graph, which will be automatically highlighted and colored.
  - Each line of the file should be formatted as `(pattern type)\t(all node IDs in the pattern, separated by tabs)`.
    - `(pattern type)` can be any string not containing a tab or newline. It's the name of the pattern, seen when it is selected in the viewer interface.
  - User-defined bubbles, if present, will be processed by MetagenomeScope before user-defined misc. patterns. So if a contig is present in multiple "groups" for whatever reason, the user-defined bubble will be given higher priority.
  - The same contig can't be contained in multiple misc. patterns. As with the user-specified bubbles file, misc. patterns are processed starting from the first line and going down; any patterns containing already-"used" contigs will be skipped.
- `-upl` If this optional argument is passed (and if `-up` is passed), then the pre-identified misc. patterns file specified by `-up` will be processed looking for contig labels instead of IDs.
- `-nbdf` If this optional argument is passed, DOT files for each nontrivial standard mode connected component (see the note below) that don't use backfilling for node groups will be generated in the output directory. That is, these files (all with the suffix `_nobackfill.gv`) will contain all node groups in their respective connected component represented as "clusters" in Graphviz.
  - This differs from the normal way we lay out graphs in standard mode using Graphviz, in which node groups are laid out separately and represented as rectangular nodes in the overall graph layout; these node groups are later "backfilled" to contain their child nodes. Since this makes all node groups "atomic" (they're represented as nodes, so `dot` doesn't route any edges through them), this has the effect of linearizing the graph in many cases.
  - Using the `-pg` option will produce DOT files that do use backfilling. This is nice if you'd like to compare components' layouts with and without backfilling.
  - Note that this option doesn't actually change the way the .db file is created; that will still use backfilling, regardless of this option. All passing `-nbdf` does is create extra DOT files in the output directory.
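The file formats and the first-come, first-served rule described for `-ub` and `-up` can be sketched as below. This is an illustration under stated assumptions rather than MetagenomeScope's implementation: the function names are hypothetical, malformed lines are silently skipped (the real script may raise an error instead), and a shared `used` set stands in for whatever bookkeeping the script performs.

```python
def parse_user_bubbles(lines, used):
    """Parse -ub style lines: source ID, sink ID, then every node ID (tab-separated)."""
    bubbles = []
    for line in lines:
        fields = [field for field in line.rstrip("\r\n").split("\t") if field]
        if len(fields) < 3:
            continue  # assumption: skip lines without source, sink, and nodes
        source, sink, nodes = fields[0], fields[1], fields[2:]
        members = set(nodes) | {source, sink}
        if members & used:
            continue  # a contig already belongs to an earlier group
        used |= members  # updates the caller's set in place
        bubbles.append((source, sink, nodes))
    return bubbles


def parse_user_patterns(lines, used):
    """Parse -up style lines: pattern type, then every node ID (tab-separated)."""
    patterns = []
    for line in lines:
        fields = [field for field in line.rstrip("\r\n").split("\t") if field]
        if len(fields) < 2:
            continue  # assumption: skip lines without a type and at least one node
        pattern_type, nodes = fields[0], fields[1:]
        if used.intersection(nodes):
            continue  # skipped: overlaps a bubble or an earlier pattern
        used.update(nodes)
        patterns.append((pattern_type, nodes))
    return patterns
```

Calling `parse_user_bubbles` first and then `parse_user_patterns` with the same `used` set reproduces the stated priority: user-defined bubbles claim their contigs before any misc. pattern is considered.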
Graphviz seems to round input node dimensions to the nearest point value (where an inch is defined as 72 points). See this issue for details on the rounding process. We don't use these rounded dimensions in the viewer interface, although the rounded dimensions will persist in `.xdot` files and when Graphviz performs layout on/draws the `.gv` files produced via `-pg`. This results in a very slight discrepancy in node sizes between the viewer interface and Graphviz' drawings.
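To get a feel for the size of this discrepancy, the sketch below rounds a dimension given in inches to the nearest point. It assumes plain nearest-integer rounding; the exact rule Graphviz applies is not confirmed here, and Python's `round` resolves exact halves to the nearest even integer, which may differ from Graphviz. The function names are hypothetical.

```python
POINTS_PER_INCH = 72


def round_to_nearest_point(inches):
    """Round a length in inches to the nearest whole point, returned in inches."""
    points = round(inches * POINTS_PER_INCH)
    return points / POINTS_PER_INCH


def rounding_discrepancy(inches):
    """Absolute difference, in inches, between the rounded and original lengths."""
    return abs(round_to_nearest_point(inches) - inches)
```

Under this assumption the discrepancy is never more than half a point, i.e. 1/144 of an inch.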