
Autocycler table

Ryan Wick edited this page Oct 30, 2024 · 20 revisions

Basics

The autocycler table command generates a TSV line from the various metrics stored in YAML files during an Autocycler assembly.

When conducting many automated Autocycler assemblies, you can use Autocycler table to build a TSV file containing metrics for each assembly. This lets you identify samples which have assembled poorly and warrant further investigation.

Example command

For this example, I assume you have conducted many Autocycler assemblies, where each sample's assembly is in a directory that starts with SAM:

autocycler table > metrics.tsv  # create the TSV header
for sample in SAM*; do
    autocycler table -a "$sample" -n "$sample" >> metrics.tsv  # append a TSV row
done

When you run autocycler table, you will likely need to change the SAM* glob to whatever will catch your samples.

Full usage

Usage: autocycler table [OPTIONS]

Options:
  -a, --autocycler_dir <AUTOCYCLER_DIR>  Autocycler directory (if absent, a header line will be output)
  -n, --name <NAME>                      Sample name [default: blank]
  -f, --fields <FIELDS>                  Comma-delimited list of YAML fields to include [default:
                                         "input_reads, pass_cluster_count, fail_cluster_count,
                                         overall_clustering_score, untrimmed_cluster_size,
                                         untrimmed_cluster_distance, trimmed_cluster_size,
                                         trimmed_sequence_length_mad, consensus_assembly_total_length,
                                         consensus_assembly_total_unitigs,
                                         consensus_assembly_fully_resolved"]
  -s, --sigfigs <SIGFIGS>                Significant figures to use for floating point numbers
                                         [default: 3]
  -h, --help                             Print help
  -V, --version                          Print version
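The options above can be combined to build a narrower table. The following is a minimal sketch, not part of Autocycler itself: write_metrics_table is a hypothetical helper name, and it assumes that running the command without -a but with -f produces a header for just those fields, and that the field list is accepted without spaces after the commas. The same -f value is passed for the header and for every row so that the columns line up.

```shell
# Hypothetical helper (not part of Autocycler): write a metrics table
# restricted to a chosen set of fields.
#   $1 = output TSV file
#   $2 = comma-delimited field list passed to -f
#   remaining arguments = Autocycler directories
write_metrics_table() {
    out="$1"
    fields="$2"
    shift 2
    # Header line: no -a, same -f as the rows (assumed to give matching columns).
    autocycler table -f "$fields" > "$out"
    for dir in "$@"; do
        # Skip anything the glob matched that is not a directory.
        [ -d "$dir" ] || continue
        # One row per sample, using 4 significant figures instead of the default 3.
        autocycler table -a "$dir" -n "$dir" -f "$fields" -s 4 >> "$out"
    done
}

# Example call, using field names from the default list:
# write_metrics_table metrics.tsv "pass_cluster_count,fail_cluster_count,overall_clustering_score" SAM*
```

Wrapping the loop in a function keeps the field list in one place, which avoids a header and rows that were generated with different -f values.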

Default fields

  • input_read_count: the number of reads in the input read set.
  • input_read_bases: the total number of bases in the input read set.
  • input_read_n50: the N50 length for the input read set.
  • pass_cluster_count: the number of clusters which passed QC, ideally matching the number of sequences in the genome.
  • fail_cluster_count: the number of clusters which failed QC, lower is better.
  • overall_clustering_score: a relative metric of how well the input assembly contigs clustered. Ranges from 0–1, with higher values being better.
  • untrimmed_cluster_size: the number of sequences in each QC-pass cluster before trimming. Ideally, these will be close to the number of input assemblies.
  • untrimmed_cluster_distance: the maximum pairwise distance between sequences for each QC-pass cluster before trimming. Lower is better, as this indicates tighter clusters.
  • trimmed_cluster_size: the number of sequences in each QC-pass cluster after trimming. Ideally, these will be close to the number of input assemblies, but will likely be lower than the untrimmed cluster sizes (because outlier sequences are discarded during trimming).
  • trimmed_sequence_length_mad: the median absolute deviation of sequence lengths for each QC-pass cluster after trimming. Lower is better, as this indicates consistent sequence lengths within the cluster.
  • consensus_assembly_bases: the total number of bases in the final consensus assembly.
  • consensus_assembly_unitigs: the number of unitigs in the final consensus assembly. Ideally, this will match the pass_cluster_count metric.
  • consensus_assembly_fully_resolved: whether or not each cluster has resolved to a single sequence. A value of true is good and false is bad.

See the Metrics page for a full list of available fields.

Example output

Here is an example output table for some S. aureus genomes which had a smooth Autocycler assembly:

name       input_read_count  input_read_bases  input_read_n50  pass_cluster_count  fail_cluster_count  overall_clustering_score  untrimmed_cluster_size  untrimmed_cluster_distance   trimmed_cluster_size  trimmed_cluster_median  trimmed_cluster_mad  consensus_assembly_bases  consensus_assembly_unitigs  consensus_assembly_fully_resolved
wildtype   206111            1881436150        20927           3                   0                   0.858                     [24,12,9]               [0.000173,0.000676,0.00164]  [17,11,7]             [2879034,4439,3125]     [1,0,0]              2886598                   3                           true
walKT389A  170651            975514879         16573           3                   3                   0.820                     [24,11,11]              [0.00155,0.00359,0.00547]    [18,9,10]             [2879034,4439,3125]     [1,0,0]              2886598                   3                           true
IMAL014    304441            1863791162        18223           3                   2                   0.841                     [24,11,12]              [0.00239,0.000676,0.000989]  [16,10,10]            [2879035,4439,3125]     [1,0,0]              2886599                   3                           true
IMAL031    212199            1814081856        19308           3                   0                   0.857                     [24,13,10]              [0.00396,0.00161,0.00]       [17,11,10]            [2879034,4439,3125]     [0,0,0]              2886598                   3                           true
IMAL058    105653            1057335967        24664           3                   2                   0.827                     [24,11,9]               [0.000550,0.00669,0.00260]   [16,9,7]              [2879033,4439,3125]     [1,0,0]              2886597                   3                           true
IMAL065    198975            2093837921        20387           3                   1                   0.843                     [24,7,8]                [0.000405,0.00184,0.00128]   [17,5,7]              [2879034,4439,3125]     [0,0,0]              2886598                   3                           true
IMAL070    76391             697918832         20037           3                   3                   0.832                     [24,13,8]               [0.00180,0.00207,0.00128]    [16,10,7]             [2879033,4439,3125]     [1,0,0]              2886598                   3                           true
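Once a table like this exists, a short filter can pull out the samples that may need a closer look. The sketch below is one possible approach, not an Autocycler feature: flag_poor_assemblies is a hypothetical helper, and the two criteria (assembly not fully resolved, unitig count differing from the pass cluster count) are assumptions drawn from the field descriptions above. It also assumes the file is tab-delimited, that its first column is headed name, and that the other three columns it reads are present.

```shell
# Hypothetical helper: list samples whose metrics suggest a problematic assembly.
#   $1 = TSV file made by autocycler table (header line first)
flag_poor_assemblies() {
    awk -F'\t' '
        NR == 1 {
            # Map column names to positions so the check does not depend on column order.
            for (i = 1; i <= NF; i++) col[$i] = i
            next
        }
        {
            reason = ""
            if ($col["consensus_assembly_fully_resolved"] != "true")
                reason = reason "not_fully_resolved;"
            if ($col["consensus_assembly_unitigs"] != $col["pass_cluster_count"])
                reason = reason "unitig_count_mismatch;"
            # Print the sample name and the reasons it was flagged.
            if (reason != "") print $col["name"] "\t" reason
        }
    ' "$1"
}

# Example call:
# flag_poor_assemblies metrics.tsv
```

fail_cluster_count is deliberately not used as a criterion here, since the smooth assemblies in the example table include non-zero values for it.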

Notes

  • Some fields will be in multiple files and will be combined into a list. For example, the trimmed_sequence_length_mad metric is stored in 2_trimmed.yaml files which are made for each cluster. Since a single assembly can have multiple clusters, there can be multiple trimmed_sequence_length_mad metrics for an assembly (one value per cluster).
  • Only top-level metrics can be used with Autocycler table. For example, the filename metric is nested under input_assembly_details so cannot be used with Autocycler table.
  • Some top-level metrics contain nested metrics, e.g. input_assembly_details. These can be used with Autocycler table, but all of their nested data will be combined into a single string, which can be clunky in TSV format.
  • If a YAML file is missing from the Autocycler directory, that's okay, but any metric from that file will be blank.
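Because a missing YAML file only shows up as blank cells, it can be worth scanning the finished table for them. This is a minimal sketch under stated assumptions: report_blank_metrics is a hypothetical helper, the table is tab-delimited with the sample name in the first column, and a blank metric appears as an empty cell.

```shell
# Hypothetical helper: report which metrics are blank for which samples.
#   $1 = TSV file made by autocycler table (header line first)
report_blank_metrics() {
    awk -F'\t' '
        NR == 1 {
            # Remember the column names and how many columns the header has.
            n = NF
            for (i = 1; i <= NF; i++) header[i] = $i
            next
        }
        {
            # Check every metric column (column 1 is the sample name).
            for (i = 2; i <= n; i++)
                if ($i == "") print $1 "\t" header[i]
        }
    ' "$1"
}

# Example call:
# report_blank_metrics metrics.tsv
```

Each output line gives a sample name and the name of one blank metric, which points to the YAML file that is likely missing for that sample.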