- The IgBLAST cache is now disabled by default. We assume that, in most cases, datasets will not be re-run with the exact same parameters, and then it only fills up the disk. Delete your cache with ``rm -r ~/.cache/igdiscover`` to reclaim the space. To enable the cache, create a file ``~/.config/igdiscover.conf`` with the contents ``use_cache: true``.
- If you choose to enable the cache, results from the PEAR merging step will also be cached in it.
- Added detection of chimeras to the (pre-)germline filters. Any novel allele that can be explained as a chimera of two unmodified reference alleles is marked in the ``new_V_germline.tab`` file. This is a bit sensitive, so the candidate is currently not discarded.
- Two additional files, ``annotated_V_germline.tab`` and ``annotated_V_pregermline.tab``, are created in each iteration during the germline filtering step. These are identical to the ``candidates.tab`` file, except that they contain a ``why_filtered`` column that describes why a sequence was filtered. See the :ref:`documentation for this feature <annotated_v_tab>`.
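
To illustrate the chimera check, here is a minimal sketch. It assumes that a candidate counts as chimeric when some prefix of it starts one reference allele and the remaining suffix ends a different one. The function name ``find_chimera`` and the toy sequences are hypothetical, and this is not necessarily how IgDiscover implements the filter.

```python
def find_chimera(candidate, references, min_part=1):
    """Return (name_a, name_b, breakpoint) if the candidate can be written as
    a prefix of one reference followed by a suffix of another, else None.

    references: dict mapping allele name -> sequence (hypothetical layout).
    """
    # A candidate that is identical to a reference is not a chimera.
    if candidate in references.values():
        return None
    n = len(candidate)
    for bp in range(min_part, n - min_part + 1):
        head, tail = candidate[:bp], candidate[bp:]
        heads = [name for name, seq in references.items() if seq.startswith(head)]
        tails = [name for name, seq in references.items() if seq.endswith(tail)]
        for a in heads:
            for b in tails:
                if a != b:
                    return a, b, bp
    return None


refs = {"V1": "AAAACCCC", "V2": "GGGGTTTT"}
print(find_chimera("AAAATTTT", refs))  # head from V1, tail from V2
print(find_chimera("AAAACCGC", refs))  # a point mutation, not a chimera
```

A real implementation would probably also require a minimum length for each part (here ``min_part``) so that very short shared ends do not trigger the check.
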
- When computing a consensus sequence, allow some sequences to be truncated at the 3' end. Many of the discovered novel V alleles were truncated by one nucleotide at the 3' end because IgBLAST does not always extend the alignment to the end of the V sequence. If these slightly too short V sequences were in the majority, their consensus would be truncated as well. The new consensus algorithm allows for this effect at the 3' end and therefore finds the full sequence more often than before. Example::

      TACTGTGCGAGAGA  (seq 1)
      TACTGTGCGAGAGA  (seq 2)
      TACTGTGCGAGAG-  (seq 3)
      TACTGTGCGAG---  (seq 4)
      TACTGTGCGAG---  (seq 5)

      TACTGTGCGAGAG   (previous consensus)
      TACTGTGCGAGAGA  (new consensus)

- Add a column ``database_changes`` to the ``new_V_germline.tab`` file that describes how the novel sequence differs from the database sequence. Example: ``93C>T; 114A>G``
- Allow filtering by ``CDR3_shared_ratio`` and do so by default (needs documentation).
- Cache the edit distance when computing the distance matrix. Speeds up the ``discover`` command slightly.
- ``discover``: Use more than six CPU cores if available.
- ``igblast``: Print progress every minute.
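
The 3'-tolerant consensus described in this list can be sketched as follows. This is a simplified illustration, not IgDiscover's actual code: it assumes the sequences are already left-aligned, takes a plain majority vote per column, and the function name ``consensus`` and its ``ignore_3prime_gaps`` flag are made up for the example.

```python
from collections import Counter


def consensus(sequences, ignore_3prime_gaps):
    """Column-wise majority consensus of left-aligned sequences.

    With ignore_3prime_gaps=True, gaps that only pad the 3' end of a
    sequence do not take part in the vote for that column.
    """
    width = max(len(s) for s in sequences)
    padded = [s.ljust(width, "-") for s in sequences]
    result = []
    for col in range(width):
        votes = []
        for s in padded:
            is_trailing_gap = s[col] == "-" and set(s[col:]) == {"-"}
            if ignore_3prime_gaps and is_trailing_gap:
                continue
            votes.append(s[col])
        if not votes:
            continue
        winner = Counter(votes).most_common(1)[0][0]
        if winner != "-":
            result.append(winner)
    return "".join(result)


seqs = [
    "TACTGTGCGAGAGA",
    "TACTGTGCGAGAGA",
    "TACTGTGCGAGAG-",
    "TACTGTGCGAG---",
    "TACTGTGCGAG---",
]
print(consensus(seqs, ignore_3prime_gaps=False))  # TACTGTGCGAGAG
print(consensus(seqs, ignore_3prime_gaps=True))   # TACTGTGCGAGAGA
```

In the last column, three of five sequences are truncated, so a plain vote drops the final ``A``. Ignoring the trailing gaps leaves only the two full-length sequences to vote.
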
- Implemented allele ratio filtering for J gene discovery
- J genes are discovered as part of the pipeline (previously, one needed to run the ``discoverj`` script manually).
- In each iteration, dendrograms are now created not only for V genes, but also for D and J genes. The file names are ``dendrogram_D.pdf``, ``dendrogram_J.pdf``.
- The V dendrograms are now in ``dendrogram_V.pdf`` (no longer ``V_dendrogram.pdf``). This puts all the dendrograms together when looking at the files in the iteration directory.
- The ``V_usage.tab`` and ``V_usage.pdf`` files are no longer created. Instead, ``expressed_V.tab`` and ``expressed_V.pdf`` are created. These contain similar information, but an allele-ratio filter is used to filter out artifacts.
- Similarly, ``expressed_D.tab`` and ``expressed_J.tab`` and their ``.pdf`` counterparts are created in each iteration.
- Removed ``parse`` subcommand (functionality is in the ``igblast`` subcommand).
- New CDR3 detection method (only heavy chain sequences): CDR3 start/end coordinates are pre-computed using the database V and J sequences. Increases detection rate to 99% (previously less than 90%).
- Remove the ability to check discovered genes for required motifs. This has never worked well.
- Add a column ``clonotypes`` to the ``candidates.tab`` file that tries to count how many clonotypes are associated with a single candidate (using only exact occurrences). This is intended to replace the ``CDR3s_exact`` column.
- Add an ``exact_ratio`` to the germline filtering options. This checks the ratio between the exact V occurrence counts (``exact`` column) between alleles.
- Germline filtering option ``allele_ratio`` was renamed to ``clonotypes_ratio``.
- Implement a cache for IgBLAST results. When the same dataset is re-analyzed, possibly with different parameters, the cached results are used instead of re-running IgBLAST, which saves a lot of time. If the V/D/J database or the IgBLAST version has changed, results are not re-used.
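
One way to get the cache behavior described here is to derive the cache key from everything that affects the IgBLAST output and nothing else. The sketch below is an assumption about how such a key could be built, not a description of IgDiscover's cache. The function name ``cache_key`` and the dictionary layout of ``database`` are hypothetical.

```python
import hashlib


def cache_key(igblast_version, database, sequences):
    """Hash the IgBLAST version, the V/D/J database and the input sequences.

    database: dict with keys "V", "D", "J" mapping to FASTA text
    (hypothetical layout). Other pipeline parameters are deliberately
    left out, so changing them still hits the cache.
    """
    h = hashlib.sha256()
    h.update(igblast_version.encode())
    for gene in ("V", "D", "J"):
        # Separators keep the individual fields from running together.
        h.update(b"\0" + gene.encode() + b"\0")
        h.update(database[gene].encode())
    h.update(b"\0")
    for seq in sequences:
        h.update(seq.encode() + b"\n")
    return h.hexdigest()


db = {"V": ">V1\nACGT\n", "D": ">D1\nGG\n", "J": ">J1\nTTT\n"}
reads = ["ACGTGGTTT", "ACGTGGTTA"]
key = cache_key("1.7.0", db, reads)
# A different IgBLAST version or database yields a different key.
print(key != cache_key("1.8.0", db, reads))
```
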
- Add a ``barcodes_exact`` column to the candidates table. It gives the number of unique barcode sequences that were used by the sequences in the set of exact sequences. Also, add a configuration setting ``barcode_consensus`` that can turn off consensus taking of barcode groups, which needs to be set to ``false`` for ``barcodes_exact`` to work.
- Add a ``Ds_exact`` column to the candidates table.
- Add a ``D_coverage`` configuration option.
- The pre-processing filtering step no longer reads in the full table of IgBLAST assignments, but filters the table piece by piece. Memory usage for this step therefore no longer depends on the dataset size and should always be below 1 GB.
- The functionality of the ``parse`` subcommand has been integrated into the ``igblast`` subcommand. This means that ``igdiscover igblast`` now directly outputs a result table (``assigned.tab``). This makes it easier to use that subcommand directly instead of only via the workflow.
- The ``igblast`` subcommand now always runs ``makeblastdb`` by itself and deletes the BLAST database afterwards. This reduces clutter and ensures the database is always up to date.
- Remove the ``library_name`` configuration setting. Instead, the ``library_name`` is now always the same as the name of the analysis directory.
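
Filtering a table piece by piece, as the pre-processing step now does, keeps memory usage independent of the table size. The following is a minimal sketch of that idea using only the standard library. It streams row by row, and the column name ``V_covered``, the threshold and the function name ``filter_table`` are assumptions made for the example rather than IgDiscover's actual filter.

```python
import csv
import io


def filter_table(infile, outfile, min_v_coverage=90.0):
    """Copy rows that pass the filter from infile to outfile, one at a time.

    Only a single row is held in memory, regardless of the table size.
    Returns the number of rows kept.
    """
    reader = csv.DictReader(infile, delimiter="\t")
    writer = csv.DictWriter(
        outfile, fieldnames=reader.fieldnames, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    kept = 0
    for row in reader:
        if float(row["V_covered"]) >= min_v_coverage:
            writer.writerow(row)
            kept += 1
    return kept


table = io.StringIO("name\tV_covered\nr1\t95.0\nr2\t50.0\nr3\t100\n")
filtered = io.StringIO()
kept = filter_table(table, filtered)
print(kept)
print(filtered.getvalue(), end="")
```

With real files, the two ``io.StringIO`` objects would be replaced by file handles opened with ``newline=""``.
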
- Add an “allele ratio” criterion to the germline filter to further reduce the number of false positives. The filter is activated by default and can be configured through the ``allele_ratio`` setting in the configuration file. :ref:`See the documentation for how it works <allele-ratio>`.
- Ignore the CDR3-encoding bases whenever comparing two V gene sequences.
- Avoid finding 5'-truncated V genes by extending found hits towards the 5' end.
- By default, candidate sequences are no longer merged if they are nearly identical. That is, the ``differences`` setting within the two germline filter configuration sections is now set to zero by default. Previously, we believed the merging would remove some false positives, but it turns out we also miss true positives. It also seems that, with the other changes in this version, we no longer get the particular false positives the setting was supposed to catch.
- Implement an experimental ``discoverj`` script for J gene discovery. It is currently not run automatically as part of ``igdiscover run``. See ``igdiscover discoverj --help`` for how to run it manually.
- Add a ``config`` subcommand, which can be used to change the configuration file from the command line.
- Add a ``V_CDR3_start`` column to the ``assigned.tab``/``filtered.tab`` tables. It describes where the CDR3 starts within the V sequence.
- Similarly, add a ``CDR3_start`` column to the ``new_V_germline.tab`` file describing where the CDR3 starts within a discovered V sequence. It is computed by using the most common CDR3 start of the sequences within the cluster.
- Rename the ``compose`` subcommand to ``germlinefilter``.
- The ``init`` subcommand automatically fixes certain problems in the input database (duplicate sequences, empty records, duplicate sequence names). Previously, it would complain, but the user would have to fix the problems themselves.
- Move source code to GitHub
- Set up automatic code testing (continuous integration) via Travis
- Many documentation improvements
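
The three kinds of database problems that ``init`` now repairs (empty records, duplicate sequences, duplicate sequence names) can be handled along the following lines. The changelog does not say how each problem is fixed, so the choices here are assumptions: empty records and repeated sequences are dropped, and repeated names get a numeric suffix. The function name ``clean_database`` is hypothetical.

```python
def clean_database(records):
    """Return a cleaned copy of a list of (name, sequence) records.

    Assumed policy: skip empty records, keep only the first record for
    each distinct sequence, and make repeated names unique with a suffix.
    """
    seen_sequences = set()
    used_names = set()
    cleaned = []
    for name, sequence in records:
        if not sequence.strip():
            continue  # empty record
        if sequence in seen_sequences:
            continue  # duplicate sequence
        seen_sequences.add(sequence)
        unique_name = name
        counter = 1
        while unique_name in used_names:
            counter += 1
            unique_name = f"{name}_{counter}"
        used_names.add(unique_name)
        cleaned.append((unique_name, sequence))
    return cleaned


records = [("V1", "ACGT"), ("V2", ""), ("V3", "ACGT"), ("V1", "TTTT")]
print(clean_database(records))
```
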
- The FASTA files of the input V/D/J gene lists now need to be named ``V.fasta``, ``D.fasta`` and ``J.fasta``. The species name is no longer part of the file name. This should reduce confusion when working with species not supported by IgBLAST.
- The ``species:`` setting in the configuration can (and should) now be left empty. Its only use was that it is passed to IgBLAST, but since IgDiscover provides IgBLAST with its own V/D/J sequences anyway, it does not seem to make a difference.
- A “cross-mapping” detection has been added, which should reduce the number of false positives. :ref:`See the documentation for an explanation <cross-mapping>`.
- Novel sequences identical to a database sequence no longer get the ``_S1234`` suffix.
- No longer trim the initial ``G`` run in sequences (due to RACE) by default. It is now a configuration setting.
- Add ``cdr3_location`` configuration setting: It controls whether a CDR3 is used in addition to the barcode for grouping sequences.
- Create a ``groups.tab.gz`` file by default (describing the de-barcoded groups).
- The pre-processing filter is now configurable. See the ``preprocessing_filter`` section in the configuration file.
- Many improvements to the documentation
- Extended and fixed unit tests. These are now run via a CI system.
- Statistics in JSON format are written to ``stats/stats.json``.
- IgBLAST 1.5.0 output can now be parsed. Parsing is also faster by 25%.
- More helpful warning message when no sequences were discovered in an iteration.
- Drop support for Python 3.3.
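
The optional trimming of the initial ``G`` run mentioned in this list amounts to a very small operation. The sketch below only illustrates it: the function name ``trim_leading_g`` and the ``trim_g`` parameter are hypothetical and do not correspond to a documented IgDiscover setting name.

```python
def trim_leading_g(sequence, trim_g=False):
    """Remove the run of G nucleotides at the 5' end if trim_g is set.

    The default leaves the sequence untouched, matching the new default
    behavior described in the changelog.
    """
    if not trim_g:
        return sequence
    return sequence.lstrip("G")


read = "GGGGACGTGGA"
print(trim_leading_g(read))               # unchanged by default
print(trim_leading_g(read, trim_g=True))  # leading G run removed
```

Only the run at the start of the read is removed. ``G`` nucleotides further inside the sequence are kept.
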
- V sequences of the input database are now whitelisted by default. The meaning of the ``whitelist`` configuration option has changed: If set to ``false``, those sequences are no longer whitelisted. To whitelist additional sequences, create a ``whitelist.fasta`` file as before.
- Sequences with stop codons are now filtered out by default.
- Use more stringent germline filtering parameters by default.
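
A stop codon check of the kind used by this filter can be sketched as follows. The example assumes the standard genetic code and that the reading frame is known. The function name ``has_stop`` mirrors the column name used elsewhere in this changelog, but the code is an illustration, not IgDiscover's implementation.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def has_stop(sequence, frame=0):
    """Return True if the nucleotide sequence contains an in-frame stop codon.

    frame is the 0-based offset of the first complete codon.
    """
    sequence = sequence.upper()
    for i in range(frame, len(sequence) - 2, 3):
        if sequence[i:i + 3] in STOP_CODONS:
            return True
    return False


print(has_stop("ATGTAAGGG"))           # ATG TAA GGG
print(has_stop("ATGTTAAGG"))           # ATG TTA AGG
print(has_stop("ATGTTAAGG", frame=1))  # TGT TAA
```

The second and third calls show why the frame matters: the same nucleotides contain ``TAA`` only when read with an offset of one.
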
- It is now possible to install and run IgDiscover on OS X. Appropriate Conda packages are available on bioconda.
- Add column ``has_stop`` to ``candidates.tab``, which indicates whether the candidate sequence contains a stop codon.
- Add a configuration option that makes it possible to disable the 5' motif check by setting ``check_motifs: false`` (the ``looks_like_V`` column is ignored in this case).
- Make it possible to whitelist known sequences: If a found gene candidate appears in that list, the sequence is included in the list of discovered sequences even when it would otherwise not pass filtering criteria. To enable this, just add a ``whitelist.fasta`` file to the project directory before starting the analysis.
- The criteria for germline filter and pre-germline filter are now configurable: See the ``germline_filter`` and ``pre_germline_filter`` sections in the configuration file.
- Different runs of IgDiscover with the same parameters on the same input files will now give the same results. See the ``seed`` parameter in the configuration, also on how to get non-reproducible results as before.
- Both the germline and pre-germline filter are now applied in each iteration. Instead of the ``new_V_database.fasta`` file, two files named ``new_V_germline.fasta`` and ``new_V_pregermline.fasta`` are created.
- The ``compose`` subcommand now outputs a filtered version of the ``candidates.tab`` file in addition to a FASTA file. The table contains columns ``closest_whitelist``, which is the name of the closest whitelist sequence, and ``whitelist_diff``, which is the number of differences to that whitelist sequence.
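
The ``closest_whitelist`` and ``whitelist_diff`` columns can be thought of as the result of a nearest-neighbor search over the whitelist. The sketch below assumes that "number of differences" means edit distance (substitutions, insertions and deletions). The changelog does not specify the measure, and the helper names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # match or substitution
            ))
        previous = current
    return previous[-1]


def closest_whitelist(candidate, whitelist):
    """Return (name, differences) for the whitelist entry nearest to candidate.

    whitelist: dict mapping sequence name -> sequence.
    """
    name = min(whitelist, key=lambda n: edit_distance(candidate, whitelist[n]))
    return name, edit_distance(candidate, whitelist[name])


whitelist = {"VH1": "ACGTACGT", "VH2": "ACGTTTTT"}
print(closest_whitelist("ACGTACGA", whitelist))
```
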
- Optionally, sequences are not renamed in the ``assigned.tab`` file, but retain their original name as in the FASTA or FASTQ file. Set ``rename: false`` in the configuration file to get this behavior.
- Started an “advanced” section in the manual.
- IgDiscover can now also detect kappa and lambda light chain V genes (VK, VL)