Library Format Specification

Sebastian Benjamin edited this page Sep 7, 2021 · 2 revisions

JSON Format

A nimble library is a valid JSON file. The top-level of the file is an array containing two JSON objects.
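
As a minimal sketch of that layout, the two top-level objects can be separated in Python as shown here. The split_library helper and the trimmed example string are illustrative assumptions, not part of nimble itself.

```python
import json


def split_library(text):
    """Parse library JSON text and return (aligner_config, reference_metadata)."""
    library = json.loads(text)
    # The top level must be an array holding exactly two JSON objects.
    if not isinstance(library, list) or len(library) != 2:
        raise ValueError("expected a top-level array with two objects")
    config, metadata = library
    if not isinstance(config, dict) or not isinstance(metadata, dict):
        raise ValueError("both top-level entries must be JSON objects")
    return config, metadata


# Hypothetical, heavily trimmed library used only for illustration.
example = '[{"score_threshold": 60}, {"headers": [], "columns": []}]'
config, metadata = split_library(example)
```

The first element is always treated as the aligner configuration and the second as the reference metadata, so the order of the two objects matters.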

Aligner Configuration

The first object contains the aligner configuration:

{
  "score_threshold": number,
  "score_filter": number,
  "num_mismatches": number,
  "discard_multiple_matches": boolean,
  "intersect_level": number,
  "group_on": string,
  "discard_multi_hits": number,
  "require_valid_pair": boolean,
  "data_type": string,
  "filters": array<string>
}
  • score_threshold: controls the score an alignment needs to reach to be considered a match. For perfect matches, set this value equal to the length of the reads being aligned to the reference library.

  • score_filter: sets a lower bound on the number of matches a reference needs before it is reported. For instance, if you set "score_filter": 25, no reference with fewer than 25 matches will be reported in the output.

  • num_mismatches: sets the allowable number of mismatches during alignment.

  • discard_multiple_matches: flag for whether a read that matches multiple references should be discarded. If true, the read's matches are discarded. If false, a read that matches multiple references will count toward the scores of all of those references.

  • intersect_level: controls how matches are counted during alignment. There are three intersect levels. intersect_level: 0 takes the best matches from either the read or the reverse read, as determined by alignment score. intersect_level: 1 takes the intersection of the matches from the read and the reverse read; if there is no intersection, it falls back to the best match. intersect_level: 2 takes the intersection and reports no match if there is no intersection.

  • group_on: if this is set to the name of a header in the reference metadata file, the output results.tsv will be filtered to that level of specificity. For instance, if you've added a column with lineage information under a header called "lineage", setting "group_on": "lineage" will report lineage-level information, rather than the default case of allele-level information. If a single read matches onto the group_on category more than once during alignment (for instance, if a read matches multiple alleles in the same lineage and you're grouping on lineage), it will only count as one match. If group_on is unset, allele-level information is returned.
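
The intersect_level and group_on rules above can be sketched in Python as follows. This is an illustration under stated assumptions, not nimble's implementation: each read's matches are assumed to be a dict mapping a reference name to its alignment score, "best match" is assumed to mean the top-scoring references of whichever read scored higher, and the names resolve_pair, count_by_group, and group_of are hypothetical.

```python
def best_matches(scores):
    """Return the set of references tied for the highest score."""
    if not scores:
        return set()
    top = max(scores.values())
    return {ref for ref, score in scores.items() if score == top}


def resolve_pair(forward, reverse, intersect_level):
    """Combine per-read matches ({reference: score}) for one read pair."""
    if intersect_level not in (0, 1, 2):
        raise ValueError("intersect_level must be 0, 1, or 2")
    # Assumption: "best match" comes from whichever read scored higher.
    fwd_top = max(forward.values(), default=float("-inf"))
    rev_top = max(reverse.values(), default=float("-inf"))
    best = best_matches(forward if fwd_top >= rev_top else reverse)
    if intersect_level == 0:
        return best
    shared = set(forward) & set(reverse)
    if shared:
        return shared
    # No intersection: level 1 falls back to the best match, level 2 reports none.
    return best if intersect_level == 1 else set()


def count_by_group(matches_per_read, group_of):
    """Count reads per group_on category, given {reference: group}."""
    counts = {}
    for matches in matches_per_read:
        # A read counts once per group, even if it hits several alleles in it.
        for group in {group_of[ref] for ref in matches}:
            counts[group] = counts.get(group, 0) + 1
    return counts
```

With group_on set to a lineage column, group_of would map each allele name to its lineage, so a read matching two alleles of the same lineage adds one to that lineage's count rather than two.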

Reference Metadata

The second object is the reference metadata:

{
  "headers": ["reference_genome", "sequence_name", "nt_length", "sequence", ...],
  "columns": [[...], [...], [...], [...], ...]
}

This object contains a headers field and a columns field. headers is an array of strings, each of which labels the column at the same index in the columns field. The aligner requires at least the reference_genome, sequence_name, nt_length, and sequence headers, along with their corresponding columns.

  • reference_genome: string data about which genome the read is from
  • sequence_name: name of the read
  • nt_length: length of the sequence data
  • sequence: RNA string
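
A small check for the required headers might look like the sketch below. The check_metadata name is hypothetical, and the rule that every column has the same length is an assumption based on each column holding one entry per reference.

```python
REQUIRED_HEADERS = ("reference_genome", "sequence_name", "nt_length", "sequence")


def check_metadata(metadata):
    """Return a list of problems found in a reference metadata object."""
    problems = []
    headers = metadata.get("headers", [])
    columns = metadata.get("columns", [])
    for name in REQUIRED_HEADERS:
        if name not in headers:
            problems.append(f"missing required header: {name}")
    # Each header needs exactly one column at the same index.
    if len(headers) != len(columns):
        problems.append("headers and columns differ in length")
    # Assumption: every column holds one entry per reference sequence.
    if len({len(column) for column in columns}) > 1:
        problems.append("columns differ in length")
    return problems
```

An empty list means the metadata passed these checks; it does not guarantee that the aligner will accept the library.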

The columns field is a multidimensional array of strings. Each sub-array corresponds to a header in the headers field.
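
Because the metadata is stored column by column, reading it per reference means transposing it. The sketch below does that in Python; the to_records name and the two example references are made up for illustration.

```python
def to_records(metadata):
    """Turn column-major metadata into one dict per reference sequence."""
    headers = metadata["headers"]
    columns = metadata["columns"]
    # zip(*columns) walks the columns in parallel, yielding one row at a time.
    return [dict(zip(headers, row)) for row in zip(*columns)]


# Hypothetical two-reference library; values are placeholders.
metadata = {
    "headers": ["reference_genome", "sequence_name", "nt_length", "sequence"],
    "columns": [
        ["genomeA", "genomeA"],
        ["allele_1", "allele_2"],
        ["4", "6"],
        ["ACGT", "ACGTAC"],
    ],
}
records = to_records(metadata)
```

Each entry in records is then a dict keyed by header name, which is convenient for lookups such as finding the sequence for a given sequence_name.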

To add another header/column pair (e.g. to add per-allele lineage or locus information), add a string to the headers array and add a column to the corresponding index in the columns field. However, you shouldn't need to directly edit this object -- nimble generate has several convenient options for adding additional metadata to libraries.
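
For cases where editing the object by hand is unavoidable, the manual step could be sketched as below. The add_column helper is hypothetical and is not one of the nimble generate options; it only shows that the new header and its column must land at the same index.

```python
def add_column(metadata, header, values):
    """Return a copy of the metadata with one extra header/column pair."""
    if header in metadata["headers"]:
        raise ValueError(f"header already present: {header}")
    columns = metadata["columns"]
    # The new column needs one value per reference, like the existing ones.
    if columns and len(values) != len(columns[0]):
        raise ValueError("new column must have one value per reference")
    return {
        "headers": metadata["headers"] + [header],
        "columns": columns + [list(values)],
    }


# Hypothetical library with two references and a made-up lineage column.
metadata = {
    "headers": ["reference_genome", "sequence_name", "nt_length", "sequence"],
    "columns": [["g", "g"], ["allele_1", "allele_2"], ["4", "6"], ["ACGT", "ACGTAC"]],
}
with_lineage = add_column(metadata, "lineage", ["lineage_x", "lineage_y"])
```

Appending to the end of both arrays keeps the indices aligned without touching the required columns.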
