0.4.6

@akikuno akikuno released this 17 May 07:41
· 144 commits to main since this release

💥 Breaking

  • Update the log file Commit Detail

    • Add the version of DAJIN2 to the log file to track the version of the analysis.
    • Rename the log file from <current time>_DAJIN2.log to DAJIN2_log_<current time>.txt so that it can be opened in any text editor.
  • Update mutation_extractor.is_dissimilar_loci Commit Detail

    • Rename identify_dissimilar_loci to is_dissimilar_loci to make explicit that it returns a boolean.
    • Use cosine distance instead of cosine similarity so that "difference from control" is more intuitive.
    • Add a condition so that the cosine distance does not depend on one specific index: the cosine distance is also calculated for the 10 bases starting from the position adjacent to the corresponding indel, and the cosine distances of these adjacent 10 bases are required to be similar.
  • Update preprocess.insertions_to_fasta.py, which detects unintended insertion alleles. Commit Detail

    • clustering_insertions: Set bin_seeding=True to accelerate MeanShift clustering. In addition, because clustering decoys without variation is extremely slow, the decoys now include slight variations.
    • extract_unique_insertions: Unintended insertion alleles that are similar to the intended allele provided by the user are now excluded.
      • An allele is treated as distinct from the intended allele only when it differs by more than 10 bases.
  • Update preprocess.insertions_to_fasta.clustering_insertions to consider the length of each insertion sequence during clustering. This allows two alleles, such as N,(30-base Insertion) and (30-base Insertion),N, to be weighted with different scores as [(1, 30), (30, 1)], enabling correct clustering. Commit Detail

  • Update preprocess.homopolymer_handler to scale data to [0, 1] before calculating cosine similarity. This normalization matches the scales of samples and controls, whose mutation rates differ significantly. Commit Detail
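
The [0, 1] scaling described for preprocess.homopolymer_handler can be illustrated with a minimal sketch. The function names `min_max_scale` and `cosine_similarity` and the example mutation rates are hypothetical and are not taken from DAJIN2; the sketch assumes plain min-max scaling, which may differ from the scaling the project actually uses.

```python
import math


def min_max_scale(values):
    """Rescale values linearly to [0, 1]; a constant vector maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def cosine_similarity(x, y):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0


# Hypothetical mutation rates: the sample is an order of magnitude above the control.
sample_rates = [0.20, 0.22, 0.60, 0.21]
control_rates = [0.010, 0.012, 0.011, 0.010]

# Both vectors are brought onto the same [0, 1] scale before comparison.
scaled_sample = min_max_scale(sample_rates)
scaled_control = min_max_scale(control_rates)
similarity = cosine_similarity(scaled_sample, scaled_control)
```

After scaling, both vectors span exactly [0, 1], so the comparison reflects the shape of each profile rather than the absolute mutation rate.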

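The change to mutation_extractor.is_dissimilar_loci can be sketched as follows. This is one possible reading of the description, not the DAJIN2 implementation: the function `is_dissimilar`, its parameters `threshold` and `tolerance`, and the example rates are all assumptions made for illustration.

```python
import math


def cosine_distance(x, y):
    """Return 1 - cosine similarity, so larger values mean 'more different'."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    if norm == 0:
        return 0.0  # assumption: all-zero vectors are treated as identical
    return 1.0 - dot / norm


def is_dissimilar(sample, control, index, window=10, threshold=0.05, tolerance=0.05):
    """Hypothetical check: compare a 10-base window at the locus and at its neighbor."""
    at_index = cosine_distance(
        sample[index:index + window], control[index:index + window]
    )
    at_neighbor = cosine_distance(
        sample[index + 1:index + 1 + window], control[index + 1:index + 1 + window]
    )
    # Dissimilar only if the distance is large and stable across adjacent windows.
    return at_index > threshold and abs(at_index - at_neighbor) < tolerance


control = [0.01] * 30
sample_same = [0.02] * 30  # same profile as control, only rescaled
sample_diff = [0.01 if i % 2 == 0 else 0.5 for i in range(30)]

same_result = is_dissimilar(sample_same, control, index=5)
diff_result = is_dissimilar(sample_diff, control, index=5)
```

A distance of 0 means the sample profile matches the control, which is the sense in which distance reads more directly as "difference from control" than similarity does.
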
📝 Documentation

  • Add a note to the README.md that the supported Python versions are 3.8 to 3.10, due to a Bioconda issue. Commit Detail

  • Enhance the descriptions in GitHub Issue templates to clarify their purpose. Commit Detail

🔧 Maintenance

  • Move DAJIN2_VERSION to utils.config.py from main.py to make it easier to recognize its location. Commit Detail

  • Update io.read_csv to return list[dict[str, str]] instead of list[str], aligning the output format with read_xlsx. Commit Detail

  • Update utils.input_validator and preprocess.genome_fetcher to temporarily disable SSL certificate verification, allowing access to UCSC servers. Commit Detail

  • Add an example of flox knock-in design to the examples. Commit Detail

  • Update preprocess.insertions_to_fasta.py: the label names for the insertions did not start from 1, so they now begin at 1. Commit Detail

  • Change the installer from pip to conda so that mappy can be installed on macos-latest (macos-14-arm64) in GitHub Actions. Commit Detail
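
The new return type of io.read_csv, a list of dictionaries keyed by column name, can be sketched with the standard library. The function name `read_csv_as_dicts` and the column names are hypothetical, and the sketch reads from a string rather than a file path to stay self-contained; it is not the DAJIN2 function.

```python
import csv
import io


def read_csv_as_dicts(text):
    """Parse CSV text into list[dict[str, str]], using the first row as keys."""
    reader = csv.DictReader(io.StringIO(text))
    return [dict(row) for row in reader]


# Hypothetical batch file contents.
records = read_csv_as_dicts("sample,control\nbarcode01,barcode02\n")
```

Each row is addressed by column name (for example `records[0]["sample"]`), which is the shape a spreadsheet reader would naturally produce as well.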

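Disabling SSL certificate verification, as described for utils.input_validator and preprocess.genome_fetcher, can be done with the standard `ssl` module. The sketch below is an assumption about the general technique, not the code used in DAJIN2, and the helper name `unverified_ssl_context` is hypothetical.

```python
import ssl


def unverified_ssl_context():
    """Build an SSL context that skips certificate and hostname checks."""
    context = ssl.create_default_context()
    # check_hostname must be turned off before verify_mode can be CERT_NONE.
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    return context


context = unverified_ssl_context()
# The context would then be passed to a request, for example:
# urllib.request.urlopen(url, context=context)
```

Because an unverified context no longer authenticates the server, it suits a temporary workaround such as the one described here.
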
🚀 Performance

  • Update consensus.similarity_searcher to cache one-hot encoded controls, avoiding redundant computations and increasing processing speed. Commit Detail
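
Caching one-hot encoded sequences can be sketched with `functools.lru_cache`. This is a minimal illustration of the idea, assuming the controls are plain strings over A, C, G, T; the function `onehot_encode` is hypothetical and DAJIN2 may cache its encodings differently.

```python
from functools import lru_cache

BASES = "ACGT"


@lru_cache(maxsize=None)
def onehot_encode(sequence):
    """Return a tuple of one-hot tuples; results are memoized per sequence."""
    return tuple(
        tuple(1 if base == candidate else 0 for candidate in BASES)
        for base in sequence
    )


first = onehot_encode("ACGT")   # computed
second = onehot_encode("ACGT")  # served from the cache
```

The encoding is returned as nested tuples rather than lists so that the cached value cannot be mutated by one caller and then seen by another.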

🐛 Bug Fixes

  • Fix bugs in clustering.strand_bias_handler Commit Detail

    • positive_strand_counts_by_labels (dict): accessing a non-existent key raised an error and halted execution. A missing key now yields 0 instead.
    • Add a wrapper function, annotate_strand_bias_by_labels, for outputting strand bias. Fix a bug where the second and subsequent arguments were not passed correctly when reallocating clusters with strand bias.
  • Fix preprocess.knockin_handler to correctly identify the flox knock-in sites as deletions not present in the control. Commit Detail

  • Fix reallocate_insertion_within_deletion Commit Detail

    • The code that treats the region between two deletions as an insertion sequence did not take the size of the other deletion into account: even a single-base deletion caused the entire sequence between the deletions to be treated as an insertion sequence. The region between two deletions is now defined only if both deletions are at least as long as the specified threshold (default = 3).
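
The size condition on both deletions can be sketched as follows. The interval representation, the function name `regions_between_deletions`, and the example coordinates are assumptions for illustration; the actual reallocate_insertion_within_deletion operates on DAJIN2's own data structures.

```python
def regions_between_deletions(deletions, threshold=3):
    """Return the gaps between consecutive deletions that both meet the threshold.

    deletions: sorted list of (start, end) half-open intervals.
    """
    regions = []
    for (start1, end1), (start2, end2) in zip(deletions, deletions[1:]):
        both_large = (end1 - start1) >= threshold and (end2 - start2) >= threshold
        if both_large:
            regions.append((end1, start2))
    return regions


# Hypothetical deletions of length 5, 1, 6 and 4.
deletions = [(10, 15), (20, 21), (30, 36), (40, 44)]

with_default = regions_between_deletions(deletions)
with_threshold_1 = regions_between_deletions(deletions, threshold=1)
```

With the default threshold, the single-base deletion at (20, 21) no longer defines a region on either side, so only the gap between the two larger deletions remains.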