
Preprocessing Script Options

Marcus Fedarko edited this page Sep 9, 2019 · 7 revisions

Running collate.py processes an assembly graph file and produces a SQLite3 .db file that can be loaded in the viewer interface to visualize the graph.

Usage

usage: mgsc [-h] -i INPUTFILE -o OUTPUTPREFIX [-d OUTPUTDIRECTORY] [-w]
            [-maxn MAXNODECOUNT] [-maxe MAXEDGECOUNT] [-ub USERBUBBLEFILE]
            [-ubl] [-up USERPATTERNFILE] [-upl] [-spqr] [-b BICOMPONENTFILE]
            [-sp] [-pg] [-px] [-nbdf] [-npdf]

Prepares an assembly graph file for visualization, generating a database file
that can be loaded in the MetagenomeScope viewer interface.

optional arguments:
  -h, --help            show this help message and exit
  -i INPUTFILE, --inputfile INPUTFILE
                        input assembly graph filename (LastGraph, GFA, or
                        MetaCarvel GML)
  -o OUTPUTPREFIX, --outputprefix OUTPUTPREFIX
                        output file prefix for .db files; also used for most
                        auxiliary files
  -d OUTPUTDIRECTORY, --outputdirectory OUTPUTDIRECTORY
                        directory in which all output files will be stored;
                        defaults to current working directory (this directory
                        will be created if it does not exist, but if the
                        directory cannot be created then an error will be
                        raised)
  -w, --overwrite       overwrite output files (if this isn't passed, and a
                        non-auxiliary file would need to be overwritten, an
                        error will be raised)
  -maxn MAXNODECOUNT, --maxnodecount MAXNODECOUNT
                        connected components with more nodes than this value
                        will not be laid out or available for display in the
                        viewer interface (default 7999, must be at least 1)
  -maxe MAXEDGECOUNT, --maxedgecount MAXEDGECOUNT
                        connected components with more edges than this value
                        will not be laid out or available for display in the
                        viewer interface (default 7999, must be at least 1)
  -ub USERBUBBLEFILE, --userbubblefile USERBUBBLEFILE
                        file describing pre-identified bubbles in the graph,
                        in the format of MetaCarvel's bubbles.txt output: each
                        line of the file is formatted as (source ID) (tab)
                        (sink ID) (tab) (all node IDs in the bubble, including
                        source and sink IDs, all separated by tabs). See the
                        MetaCarvel documentation for more details on this
                        format.
  -ubl, --userbubblelabelsused
                        use node labels instead of IDs in the pre-identified
                        bubbles file specified by -ub
  -up USERPATTERNFILE, --userpatternfile USERPATTERNFILE
                        file describing any pre-identified structural patterns
                        in the graph: each line of the file is formatted as
                        (pattern type) (tab) (all node IDs in the pattern, all
                        separated by tabs). If (pattern type) is "Bubble" or
                        "Frayed Rope", then the pattern will be represented in
                        the visualization as a Bubble or Frayed Rope,
                        respectively; otherwise, the pattern will be
                        represented as a generic "misc. user-specified
                        pattern," and colorized accordingly in the
                        visualization.
  -upl, --userpatternlabelsused
                        use node labels instead of IDs in the pre-identified
                        misc. patterns file specified by -up
  -spqr, --computespqrdata
                        compute data for the SPQR "decomposition modes" in
                        MetagenomeScope; necessitates a few additional system
                        requirements (see MetagenomeScope's installation
                        instructions wiki page for details)
  -b BICOMPONENTFILE, --bicomponentfile BICOMPONENTFILE
                        file containing bicomponent information for the
                        assembly graph (this argument is only used if -spqr is
                        passed, and is not required even in that case; the
                        needed files will be generated if -spqr is passed and
                        this option is not passed)
  -sp, --structuralpatterns
                        save .txt files containing node information for all
                        structural patterns identified in the graph
  -pg, --preservegv     save all .gv (DOT) files generated for nontrivial
                        (i.e. containing more than one node, or at least one
                        edge or node group) connected components
  -px, --preservexdot   save all .xdot files generated for nontrivial
                        connected components
  -nbdf, --nobackfilldotfiles
                        produces .gv (DOT) files without cluster "backfilling"
                        for each nontrivial connected component in the graph;
                        use of this argument doesn't impact the .db file
                        produced by this script -- it just demonstrates the
                        functionality in layout linearization provided by
                        cluster "backfilling"
  -npdf, --nopatterndotfiles
                        produces .gv (DOT) files without any structural
                        pattern information embedded; as with -nbdf, this
                        doesn't actually impact the .db file -- it just
                        provides a frame of reference for the impact
                        clustering can have on dot's layouts
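
As a rough illustration of how these options combine, the following sketch assembles an argument list using only flags from the usage text above. The helper `build_collate_args`, its parameter names, the `python collate.py` invocation form, and the file names are all assumptions made for this example; the list is only built and printed, not executed.

```python
def build_collate_args(input_file, output_prefix, output_dir=None,
                       overwrite=False, preserve_gv=False,
                       preserve_xdot=False, max_nodes=None, max_edges=None):
    """Assemble an argument list for the preprocessing script.

    Hypothetical helper: only flags listed in the usage text are emitted.
    """
    # -i and -o are the two required arguments.
    args = ["python", "collate.py", "-i", input_file, "-o", output_prefix]
    if output_dir is not None:
        args += ["-d", output_dir]
    if overwrite:
        args.append("-w")
    if max_nodes is not None:
        args += ["-maxn", str(max_nodes)]
    if max_edges is not None:
        args += ["-maxe", str(max_edges)]
    if preserve_gv:
        args.append("-pg")
    if preserve_xdot:
        args.append("-px")
    return args


# Hypothetical input file and prefix, chosen only for illustration.
example = build_collate_args("assembly.gfa", "e_coli", output_dir="out",
                             overwrite=True, preserve_gv=True)
print(" ".join(example))
```

Building the command as a list rather than a single string keeps each argument separate, which avoids quoting problems if a path contains spaces.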

Script output

The script will always produce a .db file. Certain arguments (-pg, -px, -nbdf, -npdf, -sp) can be passed to produce more output files; for a thorough description of these arguments, see the command-line argument descriptions.

The script will also generate a few types of auxiliary files containing various information about the structure of the assembly graph. These files are:

  • *_links, where * is the output prefix passed via -o. Only one of these files will be generated per execution of collate.py. This file indicates all the edges in the assembly graph. If you pass in -b and the input assembly graph has unoriented contigs, then this file will not be generated (since it would be equivalent to the _single_links file in that case).
  • *_single_links, where * is the output prefix passed via -o. This file will only be generated if the input assembly graph has unoriented contigs. In terms of currently supported input filetypes, this means that this file will only be generated when the input assembly graph is of type LastGraph or GFA.
  • *_bicmps, where * is the output prefix passed via -o. Only one of these files will be generated per execution of collate.py. This file indicates the various separation pairs contained within the assembly graph (see Nijkamp et al. for a brief overview of separation pairs and their usage in bubble detection). An existing version of this file can be passed to the script using -b, which avoids the work of creating the file again.
  • component_D.info, where D is an integer greater than 0. There will be one of these files created for every biconnected component contained within the assembly graph: these files indicate the contents of the SPQR tree defined for their corresponding biconnected component.
  • spqrD.gml, where D is an integer greater than 0. These files correspond to component_D.info files: they indicate the connections between the metanodes of a SPQR tree.

The script requires all component_D.info and spqrD.gml files to be removed from the output directory before it generates more of them. If -w is enabled, then all existing files with corresponding names in the output directory will be deleted; however, if -w is not enabled, then an error will be raised.

Similarly, if files exist in the output directory with filenames overlapping those of the *_links and *_bicmps files, then either those files will be deleted (if -w is enabled) or an error will be raised (if -w is not enabled).
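
Before running the script without -w, it can be useful to check whether the output directory already contains files that would conflict. The sketch below is not part of MetagenomeScope: the function `find_conflicting_aux_files` is hypothetical, and the filename patterns are a literal reading of the names listed above (prefix_links, prefix_single_links, prefix_bicmps, component_D.info, spqrD.gml).

```python
import os
import re
import tempfile


def find_conflicting_aux_files(output_dir, prefix):
    """Return sorted names in output_dir that look like auxiliary files.

    Hypothetical helper; the patterns are assumptions based on the
    filenames described in this section.
    """
    fixed_names = {prefix + "_links", prefix + "_single_links",
                   prefix + "_bicmps"}
    # D is described as an integer greater than 0.
    spqr_pattern = re.compile(r"^(component_[1-9]\d*\.info|spqr[1-9]\d*\.gml)$")
    conflicts = []
    for name in sorted(os.listdir(output_dir)):
        if name in fixed_names or spqr_pattern.match(name):
            conflicts.append(name)
    return conflicts


# Demonstration in a throwaway directory with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["e_coli_links", "component_1.info", "spqr1.gml", "notes.txt"]:
        open(os.path.join(tmp, name), "w").close()
    found = find_conflicting_aux_files(tmp, "e_coli")

print(found)  # ['component_1.info', 'e_coli_links', 'spqr1.gml']
```

An empty result suggests the script should not hit an auxiliary-file conflict; it says nothing about .db, .gv, or .xdot files, which are handled as described under -w.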

Command-line argument descriptions

  • -i The input assembly graph file to be used.

    • See the MetagenomeScope README for an up-to-date list of input assembly graph filetypes supported.
  • -o The file prefix to be used for all files generated (with the exception of some SPQR files). As an example, given the argument -o prefix, the file prefix.db would be generated. If .gv and/or .xdot files are created (depending on the -pg or -px arguments, respectively), then those files will be numbered according to the relative size rank (in nodes) of their respective connected component within the assembly graph.

  • -d This optional argument specifies the name of the directory in which all output files will be stored. If this argument is not given, then all files will be generated in the current working directory.

    • If the specified directory does not already exist, then the preprocessing script will create it. If the directory cannot be created (e.g. because a file with the same name as the specified directory already exists in the current working directory), an error will be raised.
  • -pg This optional argument produces DOT files (suffix .gv) in the output directory. As an example, given the arguments -o prefix and -pg for an assembly graph with 3 connected components, the files prefix.db, prefix_1.gv, prefix_2.gv, and prefix_3.gv would be created (where prefix_1.gv indicates the largest connected component by number of nodes, prefix_2.gv indicates the next largest connected component, and so on).

    • See the note below about node dimension scaling in the layouts and drawings Graphviz produces from .gv files.
    • Also see the note below about components containing one node and no other elements.
  • -px This optional argument produces .xdot files in the output directory. These files are labelled in an identical fashion to .gv files, with the only difference in naming being the file suffix (.xdot instead of .gv).

    • See the note below about node dimension scaling, which also applies to .xdot files.
    • Also see the note below about components containing one node and no other elements.
  • -sp This optional argument will produce .txt files in the output directory describing the nodes contained in the various types of structural patterns identified in the assembly graph.

    • Each file will be named sp_clustertypes.txt, where clustertypes is one of (bubbles, frayed_ropes, chains, cyclic_chains, misc_patterns).
    • Files will only be created for structural pattern types that were identified in the graph; so if an input assembly graph only contains chains (and no bubbles, frayed ropes, cyclic chains, etc.) then only a file named sp_chains.txt will be produced.
  • -w This optional argument allows the overwriting of output files (.db/.xdot/.gv/links/single_links/bicmps/.info/spqr.gml/structural pattern .txt files). If this argument is not given, then:

    • An error will be raised if writing a .db file would cause another .db file to be overwritten.
    • A warning will be displayed if writing to a .gv or .xdot file would cause another .gv/.xdot file to be overwritten. In this case, the .gv/.xdot file in question simply would not be saved.
    • Note that a directory in the output directory whose name conflicts with an output file (e.g. a directory named e_coli.db/ when attempting to produce a file named e_coli.db) will cause an error/warning to be raised regardless of whether or not -w is set.
    • See this page for details on how this option works, and a few possible boundary conditions.
  • -b This optional argument lets you pass in an existing file indicating the separation pairs in the graph (to be used in the detection of complex bubbles) to the script.

  • -ub This optional argument lets you pass in a file describing pre-identified bubbles in the input graph, which will be automatically highlighted and grouped (as with "normal" bubbles discovered by MetagenomeScope).

    • The format of this file should match MetaCarvel's bubbles.txt output file: each line of the file should be formatted as (source contig ID)\t(sink contig ID)\t(all node IDs in the bubble, including source and sink IDs, all separated by tabs).
    • As with normal MetagenomeScope-identified bubbles, the same contig can't be contained in multiple bubbles. Bubbles specified in the input file here are processed starting from the first line and going down; any bubbles containing already-"used" contigs will be skipped.
    • The contigs contained in these bubbles should at least be contiguous in some fashion. (This will eventually be validated.)
  • -ubl If this optional argument is passed -- and if -ub is passed -- then the pre-identified bubbles file specified by -ub will be processed looking for contig labels instead of IDs.

  • -up Like -ub, this optional argument lets you pass in a file describing pre-identified miscellaneous patterns in the input graph, which will be automatically highlighted and colored.

    • Each line of the file should be formatted as (pattern type)\t(all node IDs in the pattern, separated by tabs).
      • (pattern type) can be any string not containing a tab or newline. It's the name of the pattern, seen when it is selected in the viewer interface.
    • User-defined bubbles, if present, will be processed by MetagenomeScope before user-defined misc. patterns. So if a contig is present in multiple "groups" for whatever reason, the user-defined bubble will be given higher priority.
    • The same contig can't be contained in multiple misc. patterns. As with the user-specified bubbles file, misc. patterns are processed starting from the first line and going down; any patterns containing already-"used" contigs will be skipped.
  • -upl If this optional argument is passed -- and if -up is passed -- then the pre-identified misc. patterns file specified by -up will be processed looking for contig labels instead of IDs.

  • -nbdf If this optional argument is passed, DOT files that don't use backfilling for node groups will be generated in the output directory for each nontrivial standard mode connected component (see the note below). That is, these files (all with the suffix _nobackfill.gv) will contain all node groups in their respective connected component represented as "clusters" in Graphviz.

    • This differs from the normal way we lay out graphs in standard mode using Graphviz, in which node groups are laid out separately and represented as rectangular nodes in the overall graph layout; these node groups are later "backfilled" to contain their children nodes. Since this makes all node groups "atomic" -- they're represented as nodes, so dot doesn't route any edges through them -- this has the effect of linearizing the graph in many cases.
    • Using the -pg option will produce DOT files that do use backfilling. This is nice if you'd like to compare components' layouts with and without backfilling.
    • Note that this option doesn't actually change the way the .db file is created -- that'll still use backfilling, regardless of this option. All passing -nbdf does is create extra DOT files in the output directory.
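
The processing order described for -ub and -up (bubbles first, then misc. patterns, top to bottom, skipping any group that reuses a contig) can be sketched as follows. This is an illustrative reading of the descriptions above, not MetagenomeScope's actual code: the function `parse_user_groups` is hypothetical, and silently skipping malformed lines is an assumption, since the text does not say how such lines are handled.

```python
def parse_user_groups(bubble_lines, pattern_lines):
    """Return accepted (pattern type, node IDs) groups in processing order.

    Hypothetical sketch of the first-come, first-served rule: a group is
    skipped if any of its nodes already belongs to an accepted group.
    """
    used = set()
    accepted = []

    def consider(kind, node_ids):
        if not node_ids or any(node in used for node in node_ids):
            return  # reuses an already-"used" contig, so skip it
        used.update(node_ids)
        accepted.append((kind, node_ids))

    # Bubbles are processed before misc. patterns, so they take priority.
    for line in bubble_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # assumption: ignore malformed or blank lines
        # fields[0] is the source ID and fields[1] the sink ID; the
        # remaining fields list every node, including source and sink.
        consider("Bubble", fields[2:])

    for line in pattern_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # assumption: ignore malformed or blank lines
        consider(fields[0], fields[1:])

    return accepted


# Made-up node IDs for illustration.
bubbles = ["1\t4\t1\t2\t3\t4", "4\t7\t4\t5\t6\t7"]       # 2nd reuses contig 4
patterns = ["Frayed Rope\t8\t9\t10", "repeat\t3\t11"]  # 2nd reuses contig 3
groups = parse_user_groups(bubbles, patterns)
print(groups)
```

In this example only the first bubble and the "Frayed Rope" pattern are accepted; the second bubble and the "repeat" pattern are skipped because contigs 4 and 3 are already part of the first bubble.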

A note about node dimensions

Graphviz seems to round input node dimensions to the nearest point value (where an inch is defined as 72 points). See this issue for details on the rounding process.

We don't use these rounded dimensions in the viewer interface, although they persist in .xdot files and in the layouts and drawings Graphviz produces from the .gv files generated via -pg. This results in a very slight discrepancy in node sizes between the viewer interface and Graphviz's drawings.
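
To give a sense of the size of this discrepancy, the sketch below rounds a dimension given in inches to the nearest point, using the 72 points per inch stated above. It is only an approximation of the behavior described: the function name is hypothetical, and Graphviz's exact rounding rule (for instance, how exact halves are handled) may differ from Python's `round`.

```python
POINTS_PER_INCH = 72


def round_to_nearest_point(inches):
    """Round a dimension in inches to the nearest whole point.

    Hypothetical approximation of the rounding described above.
    """
    return round(inches * POINTS_PER_INCH) / POINTS_PER_INCH


# Made-up node widths in inches.
for width in (1.005, 0.51):
    rounded = round_to_nearest_point(width)
    print(width, rounded, abs(width - rounded))
```

Because the result is always within half a point of the input, the difference never exceeds 1/144 of an inch under this model.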