Library Format Specification
A nimble library is a valid JSON file. The top level of the file is an array containing two JSON objects.
The first object contains the aligner configuration:
{
"score_threshold": number,
"score_filter": number,
"num_mismatches": number,
"discard_multiple_matches": boolean,
"intersect_level": number,
"group_on": string,
"discard_multi_hits": number,
"require_valid_pair": boolean,
"data_type": string,
"filters": array<string>
}
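As a rough illustration of the schema above, the following Python sketch checks that the keys present in a configuration object have the listed JSON types. The helper name `config_type_errors` and the example values are hypothetical and are not part of nimble. The example omits `discard_multi_hits` and `data_type` because this page does not describe their valid values.

```python
# Expected JSON types per key, copied from the configuration schema above.
NUMBER = (int, float)
CONFIG_TYPES = {
    "score_threshold": NUMBER,
    "score_filter": NUMBER,
    "num_mismatches": NUMBER,
    "discard_multiple_matches": bool,
    "intersect_level": NUMBER,
    "group_on": str,
    "discard_multi_hits": NUMBER,
    "require_valid_pair": bool,
    "data_type": str,
    "filters": list,
}


def config_type_errors(config):
    """Return a list of type problems for the keys present in `config`."""
    errors = []
    for key, value in config.items():
        expected = CONFIG_TYPES.get(key)
        if expected is None:
            errors.append(f"{key}: not a key in the schema")
        elif expected is not bool and isinstance(value, bool):
            # bool is a subclass of int in Python, so reject it explicitly.
            errors.append(f"{key}: got a boolean")
        elif not isinstance(value, expected):
            errors.append(f"{key}: unexpected type {type(value).__name__}")
        elif key == "filters" and not all(isinstance(f, str) for f in value):
            errors.append("filters: every entry should be a string")
    return errors


# Hypothetical values for illustration only; they are not recommended settings.
example = {
    "score_threshold": 60,
    "score_filter": 25,
    "num_mismatches": 0,
    "discard_multiple_matches": False,
    "intersect_level": 0,
    "group_on": "lineage",
    "require_valid_pair": False,
    "filters": [],
}
print(config_type_errors(example))  # prints []
```

A check like this catches a misplaced quote such as `"intersect_level: number"`, which would otherwise produce a key that is not in the schema.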
- score_threshold: controls the score an alignment needs to reach to be considered a match. For perfect matches, set this value equal to the length of the reads being aligned to the reference library.
- score_filter: sets a lower bound on the number of matches needed on a reference before it is reported. For instance, if you set "score_filter": 25, no reference with fewer than 25 matches will be reported in the output.
- num_mismatches: sets the allowable number of mismatches during alignment.
- discard_multiple_matches: flag for whether a read that matches multiple references should be counted. If true, a read that matches multiple references will count toward the scores of all of those references. If false, the read's matches are discarded.
- intersect_level: controls the logic for counting matches during alignment. There are three intersect levels. intersect_level: 0 takes the best matches from either the read or the reverse read, determined by alignment score. intersect_level: 1 takes the intersection between the read and the reverse read; if there is no intersection, it defaults to the best match. intersect_level: 2 takes the intersection and reports no match if there is no intersection.
- group_on: if this is set to the name of a header in the reference metadata file, the output results.tsv will be filtered to that level of specificity. For instance, if you've added a column with lineage information under a header called "lineage", setting "group_on": "lineage" will report lineage-level information, rather than the default case of allele-level information. If a single read matches the group_on category more than once during alignment (for instance, if a read matches multiple alleles in the same lineage and you're grouping on lineage), it will only count as one match. If group_on is unset, allele-level information is returned.
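The intersect levels and the group_on counting rule can be sketched in Python as follows. This is only an interpretation of the descriptions above, not nimble's implementation. The function names and the dictionary-of-scores representation are hypothetical. The sketch also assumes that "best matches" means the matches of whichever mate has the higher top alignment score, with ties going to the forward read.

```python
from collections import Counter


def best_matches(forward, reverse):
    """Matches from whichever mate has the higher top score (assumed rule)."""
    best_fwd = max(forward.values(), default=float("-inf"))
    best_rev = max(reverse.values(), default=float("-inf"))
    chosen = forward if best_fwd >= best_rev else reverse
    return set(chosen)


def resolve_pair(forward, reverse, intersect_level):
    """forward/reverse map reference name -> alignment score for each mate."""
    if intersect_level not in (0, 1, 2):
        raise ValueError("intersect_level must be 0, 1, or 2")
    if intersect_level == 0:
        return best_matches(forward, reverse)
    shared = set(forward) & set(reverse)
    if shared:
        return shared
    # No intersection: level 1 falls back to the best match, level 2 reports none.
    return best_matches(forward, reverse) if intersect_level == 1 else set()


def count_by_group(reads, allele_to_group):
    """Count each group at most once per read, as described for group_on."""
    counts = Counter()
    for matched_alleles in reads:
        counts.update({allele_to_group[a] for a in matched_alleles})
    return counts


forward = {"allele_1": 50, "allele_2": 50}
reverse = {"allele_2": 48, "allele_3": 48}
print(resolve_pair(forward, reverse, 0))
print(resolve_pair(forward, reverse, 1))

lineage = {"allele_1": "lineage_a", "allele_2": "lineage_a", "allele_3": "lineage_b"}
print(count_by_group([{"allele_1", "allele_2"}, {"allele_3"}], lineage))
```

In the last call, the first read matches two alleles of the same hypothetical lineage, so that lineage is counted once for that read.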
The second object is the reference metadata:
{
"headers": ["reference_genome", "sequence_name", "nt_length", "sequence", ...],
"columns": [[...], [...], [...], [...], ...]
}
This object contains a headers field and a columns field. headers is an array of strings that label the corresponding columns in the columns field. The aligner must have at least the reference_genome, sequence_name, nt_length, and sequence headers, along with their corresponding columns.
- reference_genome: string data about which genome the read is from
- sequence_name: name of the read
- nt_length: length of the sequence data
- sequence: RNA string
The columns field is a multidimensional array of strings. Each sub-array corresponds to a header in the headers field.
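Because the metadata is stored column by column, reading one reference at a time requires transposing it. The Python sketch below shows one way to do that. The helper `metadata_rows` and the sample values are hypothetical, and the sketch assumes every column has the same number of entries.

```python
REQUIRED_HEADERS = ("reference_genome", "sequence_name", "nt_length", "sequence")


def metadata_rows(metadata):
    """Turn the column-oriented metadata object into one dict per reference."""
    headers, columns = metadata["headers"], metadata["columns"]
    missing = [h for h in REQUIRED_HEADERS if h not in headers]
    if missing:
        raise ValueError(f"missing required headers: {missing}")
    if len(headers) != len(columns):
        raise ValueError("headers and columns differ in length")
    if len({len(col) for col in columns}) > 1:
        raise ValueError("columns differ in length")
    # zip(*columns) transposes the list of columns into a sequence of rows.
    return [dict(zip(headers, row)) for row in zip(*columns)]


# Made-up reference entries for illustration only.
metadata = {
    "headers": ["reference_genome", "sequence_name", "nt_length", "sequence"],
    "columns": [
        ["genome_a", "genome_a"],
        ["allele_1", "allele_2"],
        ["8", "6"],
        ["ACGTACGT", "ACGTAC"],
    ],
}
for row in metadata_rows(metadata):
    print(row["sequence_name"], row["nt_length"])
```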
To add another header/column pair (e.g. to add per-allele lineage or locus information), add a string to the headers array and add a column at the corresponding index in the columns field. However, you shouldn't need to edit this object directly; nimble generate has several convenient options for adding additional metadata to libraries.
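For cases where the metadata object does need to be edited by hand, the following Python sketch appends a header and its column at the same index. The helper `add_column` and the sample data are hypothetical. It is not a substitute for nimble generate, which remains the recommended route.

```python
def add_column(metadata, header, values):
    """Return a copy of `metadata` with `header` and its column appended."""
    headers, columns = metadata["headers"], metadata["columns"]
    if header in headers:
        raise ValueError(f"header already present: {header}")
    n_rows = len(columns[0]) if columns else 0
    if len(values) != n_rows:
        raise ValueError(f"expected {n_rows} values, got {len(values)}")
    return {
        "headers": headers + [header],
        # The new column sits at the same index as the new header.
        "columns": [list(col) for col in columns] + [list(values)],
    }


# Made-up reference entries for illustration only.
metadata = {
    "headers": ["reference_genome", "sequence_name", "nt_length", "sequence"],
    "columns": [
        ["genome_a", "genome_a"],
        ["allele_1", "allele_2"],
        ["8", "6"],
        ["ACGTACGT", "ACGTAC"],
    ],
}
updated = add_column(metadata, "lineage", ["lineage_a", "lineage_a"])
index = updated["headers"].index("lineage")
print(updated["columns"][index])
```

Returning a copy rather than mutating the input keeps the original library object intact if the new column turns out to be the wrong length.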