Autocycler table
Ryan Wick edited this page Oct 29, 2024
·
20 revisions
The autocycler table command generates a TSV line from the various metrics stored in YAML files during an Autocycler assembly.
When conducting many automated Autocycler assemblies, you can use Autocycler table to build a TSV file containing metrics for each assembly. This lets you identify any samples which have assembled poorly and warrant further investigation.
For this example, I assume you have conducted many Autocycler assemblies, where each sample's assembly is in a directory that starts with SAM:
autocycler table > metrics.tsv  # create the TSV header
for sample in SAM*; do
    autocycler table -a "$sample" -n "$sample" >> metrics.tsv  # append a TSV row
done
When you run autocycler table, you will likely need to change the SAM* glob to whatever will catch your samples.
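Once metrics.tsv exists, one way to pick out samples that warrant a closer look is to filter it by a single column. The following is a minimal sketch, not part of Autocycler: flag_rows is a hypothetical helper, and it assumes the header line uses the field names as column names and that the first column holds the sample name given with -n.

```shell
# flag_rows FILE COLUMN VALUE   (hypothetical helper, not an Autocycler command)
# Prints the first column of every data row whose COLUMN differs from VALUE.
# COLUMN is looked up by name in the header line; the comparison ignores case.
flag_rows() {
    awk -F'\t' -v col="$2" -v want="$3" '
        NR == 1 {
            # Find the index of the requested column in the header line.
            for (i = 1; i <= NF; i++) if ($i == col) idx = i
            if (!idx) { print "column not found: " col > "/dev/stderr"; exit 1 }
            next
        }
        # Assumption: the first column is the sample name.
        tolower($idx) != tolower(want) { print $1 }
    ' "$1"
}
```

For example, flag_rows metrics.tsv consensus_assembly_fully_resolved true would list the samples whose consensus assembly did not fully resolve, assuming that column holds a plain true/false value.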
Usage: autocycler table [OPTIONS]
Options:
-a, --autocycler_dir <AUTOCYCLER_DIR> Autocycler directory (if absent, a header line will be output)
-n, --name <NAME> Sample name [default: blank]
-f, --fields <FIELDS> Comma-delimited list of YAML fields to include [default:
"input_reads, pass_cluster_count, fail_cluster_count,
overall_clustering_score, untrimmed_cluster_size,
untrimmed_cluster_distance, trimmed_cluster_size,
trimmed_sequence_length_mad, consensus_assembly_total_length,
consensus_assembly_total_unitigs,
consensus_assembly_fully_resolved"]
-s, --sigfigs <SIGFIGS> Significant figures to use for floating point numbers
[default: 3]
-h, --help Print help
-V, --version Print version
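The -f and -s options can be combined with the loop shown earlier to build a narrower table. Here is a rough sketch using only the options listed above. The function name make_metrics_table and its argument order are my own, and I am assuming that the header call (the one without -a) accepts -f so that the header matches the chosen fields.

```shell
# make_metrics_table OUT FIELDS SIGFIGS DIR...   (hypothetical wrapper)
# Writes a header line to OUT, then appends one row per Autocycler directory.
make_metrics_table() {
    local out="$1" fields="$2" sigfigs="$3"
    shift 3
    # No -a here, so a header line is output (assumed to respect -f).
    autocycler table -f "$fields" > "$out"
    local dir
    for dir in "$@"; do
        [ -d "$dir" ] || continue   # skip anything the glob caught that is not a directory
        autocycler table -a "$dir" -n "$(basename "$dir")" \
            -f "$fields" -s "$sigfigs" >> "$out"
    done
}
```

A call such as make_metrics_table metrics.tsv "pass_cluster_count,fail_cluster_count" 4 SAM* would then produce a two-field table with four significant figures for floating point values.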
- input_reads: information on the full long-read set.
- pass_cluster_count: the number of clusters which passed QC, ideally matching the number of sequences in the genome.
- fail_cluster_count: the number of clusters which failed QC, lower is better.
- overall_clustering_score: a relative metric of how well the input assembly contigs clustered. Ranges from 0–1, with higher values being better.
- untrimmed_cluster_size: the number of sequences in each QC-pass cluster before trimming. Ideally, these will be close to the number of input assemblies.
- untrimmed_cluster_distance: the maximum pairwise distance between sequences for each QC-pass cluster before trimming. Lower is better, as this indicates tighter clusters.
- trimmed_cluster_size: the number of sequences in each QC-pass cluster after trimming. Ideally, these will be close to the number of input assemblies, but will likely be lower than the untrimmed cluster sizes (because outlier sequences are discarded during trimming).
- trimmed_sequence_length_mad: the median absolute deviation of sequence lengths for each QC-pass cluster after trimming. Lower is better, as this indicates consistent sequence lengths within the cluster.
- consensus_assembly_total_length: the total number of bases in the final consensus assembly.
- consensus_assembly_total_unitigs: the number of unitigs in the final consensus assembly. Ideally, this will match the pass_cluster_count metric.
- consensus_assembly_fully_resolved: whether or not each cluster has resolved to a single sequence. 'True' is good and 'false' is bad.
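Since consensus_assembly_total_unitigs should ideally match pass_cluster_count, a quick check is to compare those two columns for every sample. This is only a sketch under a few assumptions: cols_differ is a hypothetical helper, the header line uses the field names as column names, the first column is the sample name, and both columns hold a single number per sample.

```shell
# cols_differ FILE COL_A COL_B   (hypothetical helper, not an Autocycler command)
# Prints sample name, COL_A value and COL_B value for rows where the two differ.
cols_differ() {
    awk -F'\t' -v a="$2" -v b="$3" '
        NR == 1 {
            # Locate both columns by their header names.
            for (i = 1; i <= NF; i++) {
                if ($i == a) ia = i
                if ($i == b) ib = i
            }
            if (!ia || !ib) { print "column not found" > "/dev/stderr"; exit 1 }
            next
        }
        # Assumption: the first column is the sample name.
        $ia != $ib { print $1 "\t" $ia "\t" $ib }
    ' "$1"
}
```

Running cols_differ metrics.tsv pass_cluster_count consensus_assembly_total_unitigs would list the samples where the unitig count and the number of QC-pass clusters disagree.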
See the Metrics page for a full list of available fields.
- Some fields will be in multiple files and will be combined into a list. For example, the trimmed_sequence_length_mad metric is stored in 2_trimmed.yaml files which are made for each cluster. Since a single assembly can have multiple clusters, there can be multiple trimmed_sequence_length_mad metrics for an assembly (one value per cluster).
- Only top-level metrics can be used with Autocycler table. For example, the filename metric is nested under input_assembly_details so cannot be used with Autocycler table.
- Some metrics contain nested metrics, e.g. input_assembly_details. These can be used with Autocycler table, but all of their nested data will be combined into a single string, which can be clunky in TSV format.
- If a YAML file is missing from the Autocycler directory, that's okay, but any metric from that file will be blank.
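Because metrics from a missing YAML file come out blank, scanning the table for empty fields is a simple way to spot assemblies that stopped partway. The sketch below is not part of Autocycler: blank_fields is a hypothetical helper, and it assumes a blank metric appears as an empty tab-separated field and that the first column is the sample name.

```shell
# blank_fields FILE   (hypothetical helper, not an Autocycler command)
# For each data row with at least one empty field, prints the sample name
# followed by a comma-separated list of the blank column names.
blank_fields() {
    awk -F'\t' '
        NR == 1 {
            # Remember the column names from the header line.
            for (i = 1; i <= NF; i++) name[i] = $i
            n = NF
            next
        }
        {
            missing = ""
            # Column 1 is assumed to be the sample name, so start at column 2.
            for (i = 2; i <= n; i++)
                if ($i == "") missing = missing (missing == "" ? "" : ",") name[i]
            if (missing != "") print $1 "\t" missing
        }
    ' "$1"
}
```

Running blank_fields metrics.tsv prints nothing when every field of every sample is filled in.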