
Autocycler table

Ryan Wick edited this page Oct 29, 2024 · 20 revisions

Basics

The autocycler table command generates a TSV line from the various metrics stored in YAML files during an Autocycler assembly.

When conducting many automated Autocycler assemblies, you can use Autocycler table to build a TSV file containing metrics for each assembly. This lets you identify any samples which have assembled poorly and warrant further investigation.

Example command

For this example, I assume you have conducted many Autocycler assemblies, where each sample's assembly is in a directory that starts with SAM:

autocycler table > metrics.tsv  # create the TSV header
for sample in SAM*; do
    autocycler table -a "$sample" -n "$sample" >> metrics.tsv  # append a TSV row
done

When you run autocycler table, you will likely need to change the SAM* glob to one that matches your own sample directories.

Full usage

Usage: autocycler table [OPTIONS]

Options:
  -a, --autocycler_dir <AUTOCYCLER_DIR>  Autocycler directory (if absent, a header line will be output)
  -n, --name <NAME>                      Sample name [default: blank]
  -f, --fields <FIELDS>                  Comma-delimited list of YAML fields to include [default:
                                         "input_reads, pass_cluster_count, fail_cluster_count,
                                         overall_clustering_score, untrimmed_cluster_size,
                                         untrimmed_cluster_distance, trimmed_cluster_size,
                                         trimmed_sequence_length_mad, consensus_assembly_total_length,
                                         consensus_assembly_total_unitigs,
                                         consensus_assembly_fully_resolved"]
  -s, --sigfigs <SIGFIGS>                Significant figures to use for floating point numbers
                                         [default: 3]
  -h, --help                             Print help
  -V, --version                          Print version
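
As a rough sketch of how the -f and -s options might be combined with the loop shown earlier, the function below writes a smaller table using a reduced field list and two significant figures. The function name build_table, its argument order and the chosen field subset are illustrative assumptions and not part of Autocycler. Only the flags themselves come from the usage text above. The sketch also assumes that the header line honours -f when -a is absent.

```shell
# Hypothetical helper (not part of Autocycler): write a reduced metrics table.
# Usage: build_table OUTPUT.tsv SAMPLE_DIR [SAMPLE_DIR ...]
# The body runs in a subshell, so its variables do not leak into the caller.
build_table() (
    out="$1"
    shift
    # Illustrative subset of the default fields (comma-delimited, no spaces).
    fields="pass_cluster_count,fail_cluster_count,consensus_assembly_total_unitigs,consensus_assembly_fully_resolved"
    # No -a given, so a header line is output.
    autocycler table -f "$fields" -s 2 > "$out"
    for sample in "$@"; do
        # Append one TSV row per Autocycler directory.
        autocycler table -a "$sample" -n "$sample" -f "$fields" -s 2 >> "$out"
    done
)
```

It could then be called as build_table metrics_small.tsv SAM*, again adjusting the glob to match your samples.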

Default fields

  • input_reads: information on the full long-read set.
  • pass_cluster_count: the number of clusters which passed QC, ideally matching the number of sequences in the genome.
  • fail_cluster_count: the number of clusters which failed QC. Lower is better.
  • overall_clustering_score: a relative metric of how well the input assembly contigs clustered. Ranges from 0–1, with higher values being better.
  • untrimmed_cluster_size: the number of sequences in each QC-pass cluster before trimming. Ideally, these will be close to the number of input assemblies.
  • untrimmed_cluster_distance: the maximum pairwise distance between sequences for each QC-pass cluster before trimming. Lower is better, as this indicates tighter clusters.
  • trimmed_cluster_size: the number of sequences in each QC-pass cluster after trimming. Ideally, these will be close to the number of input assemblies, but will likely be lower than the untrimmed cluster sizes (because outlier sequences are discarded during trimming).
  • trimmed_sequence_length_mad: the median absolute deviation of sequence lengths for each QC-pass cluster after trimming. Lower is better, as this indicates consistent sequence lengths within the cluster.
  • consensus_assembly_total_length: the total number of bases in the final consensus assembly.
  • consensus_assembly_total_unitigs: the number of unitigs in the final consensus assembly. Ideally, this will match the pass_cluster_count metric.
  • consensus_assembly_fully_resolved: whether or not each cluster has resolved to a single sequence. A value of true is good; false is bad.

See the Metrics page for a full list of available fields.
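
The guidance in the field descriptions above can be turned into a simple screen for samples that may need a closer look. The following is a minimal sketch, not an official Autocycler tool: the function name flag_samples is hypothetical, and it assumes that the first TSV column holds the sample name and that the header line uses the field names as column titles. It prints a sample's name if any of the three checked fields is blank, if any cluster failed QC, or if the unitig count differs from the pass cluster count.

```shell
# Hypothetical helper (not part of Autocycler): list samples worth inspecting.
# Usage: flag_samples metrics.tsv
# Assumes column 1 is the sample name and header titles match the field names.
flag_samples() (
    awk -F'\t' '
        NR == 1 {
            # Map each header title to its column index.
            for (i = 1; i <= NF; i++) col[$i] = i
            if (!("pass_cluster_count" in col) ||
                !("fail_cluster_count" in col) ||
                !("consensus_assembly_total_unitigs" in col)) {
                print "flag_samples: required column missing" > "/dev/stderr"
                exit 2
            }
            next
        }
        {
            pass    = $col["pass_cluster_count"]
            fail    = $col["fail_cluster_count"]
            unitigs = $col["consensus_assembly_total_unitigs"]
            # Flag blank metrics, any failed cluster, or a unitig/cluster mismatch.
            if (pass == "" || fail == "" || unitigs == "" ||
                fail + 0 > 0 || unitigs + 0 != pass + 0)
                print $1
        }
    ' "$1"
)
```

A flagged sample is not necessarily a bad assembly. The thresholds here are deliberately strict and are only a starting point for manual inspection.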

Notes

  • Some fields will be in multiple files and will be combined into a list. For example, the trimmed_sequence_length_mad metric is stored in 2_trimmed.yaml files which are made for each cluster. Since a single assembly can have multiple clusters, there can be multiple trimmed_sequence_length_mad metrics for an assembly (one value per cluster).
  • Only top-level metrics can be used with Autocycler table. For example, the filename metric is nested under input_assembly_details so cannot be used with Autocycler table.
  • Some top-level metrics contain nested metrics, e.g. input_assembly_details. These can be used with Autocycler table, but all of their nested data will be combined into a single string, which can be clunky in TSV format.
  • If a YAML file is missing from the Autocycler directory, that's okay, but any metric from that file will be blank.