Autocycler table
The autocycler table command generates a TSV line from the various metrics stored in YAML files during an Autocycler assembly.
When conducting many automated Autocycler assemblies, you can use Autocycler table to build a TSV file containing metrics for each assembly. This lets you identify samples which have assembled poorly and warrant further investigation.
For this example, I assume you have conducted many Autocycler assemblies, where each sample's assembly is in a directory that starts with SAM:
autocycler table > metrics.tsv # create the TSV header
for sample in SAM*; do
autocycler table -a "$sample" -n "$sample" >> metrics.tsv # append a TSV row
done
When you run autocycler table, you will need to change the SAM* glob to whatever will catch your samples.
Usage: autocycler table [OPTIONS]
Options:
-a, --autocycler_dir <AUTOCYCLER_DIR> Autocycler directory (if absent, a header line will be output)
-n, --name <NAME> Sample name [default: blank]
-f, --fields <FIELDS> Comma-delimited list of YAML fields to include [default:
"input_reads, pass_cluster_count, fail_cluster_count,
overall_clustering_score, untrimmed_cluster_size,
untrimmed_cluster_distance, trimmed_cluster_size,
trimmed_sequence_length_mad, consensus_assembly_total_length,
consensus_assembly_total_unitigs,
consensus_assembly_fully_resolved"]
-s, --sigfigs <SIGFIGS> Significant figures to use for floating point numbers
[default: 3]
-h, --help Print help
-V, --version Print version
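The -f and -s options above can be combined with the loop shown earlier to build a narrower table. The following is a minimal sketch, not part of Autocycler itself: the function name build_metrics_table is hypothetical, and it assumes the header call (without -a) accepts the same -f list as the per-sample calls.

```shell
# Hypothetical helper: write a TSV (header plus one row per sample) to stdout,
# restricted to a chosen set of fields.
# Usage: build_metrics_table "field1,field2" SIGFIGS DIR [DIR ...]
build_metrics_table() {
    bmt_fields="$1"
    bmt_sigfigs="$2"
    shift 2
    # No -a given, so autocycler table prints the header line.
    autocycler table -f "$bmt_fields" -s "$bmt_sigfigs"
    for bmt_sample in "$@"; do
        # Skip anything the glob matched that is not a directory.
        [ -d "$bmt_sample" ] || continue
        autocycler table -a "$bmt_sample" -n "$(basename "$bmt_sample")" \
            -f "$bmt_fields" -s "$bmt_sigfigs"
    done
}
```

For example, build_metrics_table "pass_cluster_count,fail_cluster_count,consensus_assembly_fully_resolved" 4 SAM* > metrics.tsv would give a compact table for a quick pass/fail overview.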
- input_read_count: the number of reads in the input read set.
- input_read_bases: the total number of bases in the input read set.
- input_read_n50: the N50 length for the input read set.
- pass_cluster_count: the number of clusters which passed QC, ideally matching the number of sequences in the genome.
- fail_cluster_count: the number of clusters which failed QC, lower is better.
- overall_clustering_score: a relative metric of how well the input assembly contigs clustered. Ranges from 0–1, with higher values being better.
- untrimmed_cluster_size: the number of sequences in each QC-pass cluster before trimming. Ideally, these will be close to the number of input assemblies.
- untrimmed_cluster_distance: the maximum pairwise distance between sequences for each QC-pass cluster before trimming. Lower is better, as this indicates tighter clusters.
- trimmed_cluster_size: the number of sequences in each QC-pass cluster after trimming. Ideally, these will be close to the number of input assemblies, but will likely be lower than the untrimmed cluster sizes (because outlier sequences are discarded during trimming).
- trimmed_sequence_length_mad: the median absolute deviation of sequence lengths for each QC-pass cluster after trimming. Lower is better, as this indicates consistent sequence lengths within the cluster.
- consensus_assembly_bases: the total number of bases in the final consensus assembly.
- consensus_assembly_unitigs: the number of unitigs in the final consensus assembly. Ideally, this will match the pass_cluster_count metric.
- consensus_assembly_fully_resolved: whether or not each cluster has resolved to a single sequence. 'True' is good and 'false' is bad.
See the Metrics page for a full list of available fields.
Here is an example output table for some S. aureus genomes which had a smooth Autocycler assembly:
name input_read_count input_read_bases input_read_n50 pass_cluster_count fail_cluster_count overall_clustering_score untrimmed_cluster_size untrimmed_cluster_distance trimmed_cluster_size trimmed_cluster_median trimmed_cluster_mad consensus_assembly_bases consensus_assembly_unitigs consensus_assembly_fully_resolved
wildtype 206111 1881436150 20927 3 0 0.858 [24,12,9] [0.000173,0.000676,0.00164] [17,11,7] [2879034,4439,3125] [1,0,0] 2886598 3 true
walKT389A 170651 975514879 16573 3 3 0.820 [24,11,11] [0.00155,0.00359,0.00547] [18,9,10] [2879034,4439,3125] [1,0,0] 2886598 3 true
IMAL014 304441 1863791162 18223 3 2 0.841 [24,11,12] [0.00239,0.000676,0.000989] [16,10,10] [2879035,4439,3125] [1,0,0] 2886599 3 true
IMAL031 212199 1814081856 19308 3 0 0.857 [24,13,10] [0.00396,0.00161,0.00] [17,11,10] [2879034,4439,3125] [0,0,0] 2886598 3 true
IMAL058 105653 1057335967 24664 3 2 0.827 [24,11,9] [0.000550,0.00669,0.00260] [16,9,7] [2879033,4439,3125] [1,0,0] 2886597 3 true
IMAL065 198975 2093837921 20387 3 1 0.843 [24,7,8] [0.000405,0.00184,0.00128] [17,5,7] [2879034,4439,3125] [0,0,0] 2886598 3 true
IMAL070 76391 697918832 20037 3 3 0.832 [24,13,8] [0.00180,0.00207,0.00128] [16,10,7] [2879033,4439,3125] [1,0,0] 2886598 3 true
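A table like this can be screened automatically for samples that deserve a closer look. The sketch below is one possible approach and not an Autocycler feature: the function name flag_poor_assemblies and the default threshold of zero failed clusters are assumptions, and it expects a tab-separated file with a header row containing the name, fail_cluster_count and consensus_assembly_fully_resolved columns.

```shell
# Hypothetical helper: list samples with too many failed clusters or an
# assembly that is not fully resolved.
# Usage: flag_poor_assemblies metrics.tsv [MAX_FAILED_CLUSTERS]
flag_poor_assemblies() {
    awk -F'\t' -v max_fail="${2:-0}" '
        NR == 1 {
            # Map column names to positions so column order does not matter.
            for (i = 1; i <= NF; i++) col[$i] = i
            if (!("name" in col) || !("fail_cluster_count" in col) ||
                !("consensus_assembly_fully_resolved" in col)) {
                print "required column missing from header" > "/dev/stderr"
                exit 1
            }
            next
        }
        {
            fail = $(col["fail_cluster_count"])
            resolved = $(col["consensus_assembly_fully_resolved"])
            # A blank value (missing YAML file) is also reported.
            if (fail + 0 > max_fail + 0 || resolved != "true")
                print $(col["name"]) "\t" fail "\t" resolved
        }
    ' "$1"
}
```

Running flag_poor_assemblies metrics.tsv prints the name, failed cluster count and resolved status of each flagged sample. A failed cluster is not necessarily a problem (several samples in the table above have failed clusters yet a fully resolved assembly), so passing a higher threshold as the second argument may suit some datasets better.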
- Some fields will be in multiple files and will be combined into a list. For example, the trimmed_sequence_length_mad metric is stored in 2_trimmed.yaml files which are made for each cluster. Since a single assembly can have multiple clusters, there can be multiple trimmed_sequence_length_mad metrics for an assembly (one value per cluster).
- Only top-level metrics can be used with Autocycler table. For example, the filename metric is nested under input_assembly_details so cannot be used with Autocycler table.
- Some metrics contain nested metrics, e.g. input_assembly_details. These can be used with Autocycler table, but all of their nested data will be combined into a single string, which can be clunky in TSV format.
- If a YAML file is missing from the Autocycler directory, that's okay, but any metric from that file will be blank.
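Because per-cluster metrics arrive as a bracketed list in a single cell (for example [17,11,7]), downstream scripts usually need to unpack them. As a rough sketch, the hypothetical helper below (list_min is not part of Autocycler) returns the smallest value in such a cell, which could be used to spot a cluster with few sequences after trimming.

```shell
# Hypothetical helper: print the smallest value in a bracketed list cell.
# Usage: list_min "[17,11,7]"
list_min() {
    # Strip brackets and spaces, then scan the comma-separated values.
    printf '%s\n' "$1" | tr -d '[] ' | awk -F',' '{
        m = $1
        for (i = 2; i <= NF; i++) if ($i + 0 < m + 0) m = $i
        print m
    }'
}
```

Calling list_min "[17,11,7]" prints 7. A blank cell (from a missing YAML file) yields an empty line, so callers may want to check for that case.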