- get_dependencies_version: Check the versions of TADbit and all its dependencies, and retrieve system information. May be used to ensure reproducibility.
- generate_rnd_tads: Generates random TADs over a chromosome of a given size, according to a given distribution of TAD lengths.
- generate_shuffle_tads: Returns a shuffled version of a given list of TADs.
- randomization_test: Returns the probability that the original alignment is better than an alignment of randomized boundaries.
Specific class for TADs, used only within Alignment objects. It inherits directly from the Python dict. A TAD holds these keys (a hypothetical example is sketched after this list):
- 'start': start position of the TAD
- 'end': end position of the TAD
- 'score': score of the boundary prediction
- 'brk': same as 'end'
- 'pos': position in the alignment (column number)
- 'exp': Experiment this TAD belongs to
- 'index': index of this TAD among all TADs in the Experiment
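As a purely hypothetical illustration of such an entry (values are made up; positions are expressed in bins at the experiment resolution)::

    # Hypothetical TAD entry; all values are illustrative only.
    tad = {
        'start': 120,             # start position of the TAD (in bins)
        'end'  : 165,             # end position of the TAD (in bins)
        'brk'  : 165,             # same as 'end'
        'score': 7.0,             # score of the boundary prediction
        'pos'  : 12,              # column number in the alignment
        'exp'  : 'HindIII_rep1',  # Experiment this TAD belongs to (a name is used here for simplicity)
        'index': 23,              # index of this TAD among all TADs of the Experiment
    }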
Alignment of TAD borders
- draw [1]: Draw alignments as a plot.
- get_column: Get a list of columns matching a given characteristic.
- itercolumns: Iterate over columns in the alignment
- iteritems: Iterate over experiment names and aligned boundaries
- itervalues: Iterate over experiment names and aligned boundaries
- write_alignment: Print alignment of TAD boundaries between different experiments. Alignments are displayed with colors according to the TADbit confidence score for each boundary.
- align: Align Topologically Associating Domain borders. Supports multiple-alignment by building a consensus TAD sequence and aligning each experiment to it.
- consensusize: Given two alignments, returns a consensus alignment. Used for the generation of multiple alignments.
- needleman_wunsch: Align two lists of TAD boundaries using a Needleman-Wunsch implementation.
- reciprocal: Method based on reciprocal closest boundaries (bd). bd1 will be aligned with bd2 (the boundary closest to bd1) if and only if bd1 is also the boundary closest to bd2 (and, of course, if the distance between bd1 and bd2 is lower than max_dist). A simplified sketch of this strategy is given after this list.
- find_closest_reciprocal: Function to check the needleman_wunsch algorithm.
- load_chromosome: Load a Chromosome object from a file. A Chromosome object can be saved with the save_chromosome function.
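As a rough illustration of the reciprocal strategy described above, the following sketch (not TADbit's implementation; boundaries are plain bin positions here) pairs boundaries from two experiments only when they are mutually closest and within max_dist::

    def reciprocal_pairs(bds1, bds2, max_dist):
        """Pair boundaries that are mutually closest and within max_dist."""
        closest = lambda bd, others: min(others, key=lambda o: abs(o - bd))
        pairs = []
        for bd1 in bds1:
            bd2 = closest(bd1, bds2)
            if closest(bd2, bds1) == bd1 and abs(bd1 - bd2) <= max_dist:
                pairs.append((bd1, bd2))
        return pairs

    # Boundaries (bin positions) from two experiments
    print(reciprocal_pairs([10, 55, 120], [12, 60, 200], max_dist=10))
    # -> [(10, 12), (55, 60)]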
Dictionary of Alignment objects.
Modified getitem, setitem, and append in order to be able to search alignments by index or by name.
Linked to a Chromosome instance.
Inherits from the Python built-in list, modified for TADbit Experiment objects.
Mainly, getitem, setitem, and append were modified in order to be able to search for experiments by index or by name, and to add experiments simply using Chromosome.experiments.append(Experiment).
The whole ExperimentList object is linked to a Chromosome instance (Chromosome).
A Chromosome object designed to deal with Topologically Associating Domain predictions from different experiments, in different cell types, for a given chromosome, and to compare them. A usage sketch combining the methods below is given after the list.
- add_experiment: Add a Hi-C experiment to Chromosome
- align_experiments: Align the predicted boundaries of two different experiments. The resulting alignment will be stored in the self.experiment list.
- find_tad: Call the tadbit function to calculate the positions of Topologically Associating Domain boundaries.
- get_experiment: Fetch an Experiment according to its name. This can also be done directly with Chromosome.experiments[name].
- get_tad_hic: Retrieve the Hi-C data matrix corresponding to a given TAD.
- iter_tads: Iterate over the TADs corresponding to a given experiment.
- save_chromosome: Save a Chromosome object to a file (it uses load from the pickle). Once saved, the object can be loaded with load_chromosome.
- set_max_tad_size: Change the maximum size allowed for TADs. It also applies to the computed experiments.
- tad_density_plot [1]: Draw a summary of the TADs found in a given experiment and their density in terms of relative Hi-C interaction count.
- visualize [1]: Visualize the matrix of Hi-C interactions of a given experiment
- load_experiment_from_reads: Loads an Experiment object from TADbit-generated read files, which are lists of pairs of reads mapped to a reference genome.
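A sketch of how the methods above are typically combined; file names and experiment names are placeholders, and exact parameter names may vary between TADbit versions::

    from pytadbit import Chromosome

    # Placeholder names and files, for illustration only
    crm = Chromosome(name='chr19')
    crm.add_experiment('k562', hic_data='k562_chr19.tsv', resolution=100000)
    crm.add_experiment('gm06690', hic_data='gm_chr19.tsv', resolution=100000)
    crm.find_tad('k562')                               # predict TAD boundaries
    crm.find_tad('gm06690')
    crm.align_experiments(names=['k562', 'gm06690'])   # alignment stored in the Chromosome
    crm.save_chromosome('chr19.tdb')                   # reload later with load_chromosome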
Hi-C experiment.
- filter_columns [1]: Call filtering function, to remove artifactual columns in a given Hi-C matrix. This function will detect columns with very low interaction counts. Filtered out columns will be stored in the dictionary Experiment._zeros.
- get_hic_matrix: Return the Hi-C matrix.
- get_hic_zscores: Normalize the Hi-C raw data. The result will be stored into the private Experiment._zscore list.
- load_hic_data: Add a Hi-C experiment to the Chromosome object.
- load_norm_data: Add a normalized Hi-C experiment to the Chromosome object.
- load_tad_def: Add Topologically Associating Domain definitions to the Experiment.
- model_region [2]: Generates three-dimensional models using IMP for a given segment of a chromosome.
- normalize_hic: Normalize the Hi-C data. This normalization step does the same as the tadbit function (with default parameters). It fills the Experiment.norm variable with the Hi-C values divided by the calculated weight. The weight of a given cell in column i and row j corresponds to the square root of the product of the sum of column i by the sum of row j. Normalization is done according to the formula reconstructed after this list.
- optimal_imp_parameters [2]: Find the optimal set of parameters to be used for the 3D modeling in IMP.
- print_hic_matrix: Return the Hi-C matrix as a string.
- set_resolution: Set a new value for the resolution. Copies the original data into Experiment._ori_hic and replaces Experiment.hic_data with the data corresponding to the new resolution.
- view [1]: Visualize the matrix of Hi-C interactions
- write_interaction_pairs: Creates a tab separated file with all the pairwise interactions.
- write_json: Save hic matrix in the json format, read by TADkit.
- write_tad_borders [2]: Print a table summarizing the TADs found by tadbit. This function outputs something similar to the R function.
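The normalization formula referred to in normalize_hic, reconstructed from the description above (the weight of a cell being the square root of the product of its row sum and column sum)::

    w_{ij} = \sqrt{\Big(\sum_{k} x_{ik}\Big)\Big(\sum_{k} x_{kj}\Big)},
    \qquad
    \mathrm{norm}_{ij} = \frac{x_{ij}}{w_{ij}}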
This may also hold the print/write-to-file matrix functions
- add_sections: Add genomic coordinates to the HiC_data object by getting them from a FASTA file containing chromosome sequences. Order matters.
- add_sections_from_fasta: Add genomic coordinates to the HiC_data object by getting them from a FASTA file containing chromosome sequences.
- cis_trans_ratio: Counts the number of interactions occurring within chromosomes (cis) with respect to the total number of interactions
- find_compartments [1] [2]: Search for A/B compartments in each chromosome of the Hi-C matrix. The Hi-C matrix is normalized by the number of interactions expected at a given distance, and by visibility (one iteration of ICE). A correlation matrix is then calculated from this normalized matrix, and its first eigenvector is used to identify compartments, changes in sign marking boundaries between compartments. The result is stored as a dictionary of compartment boundaries, with chromosome names as keys. A minimal numpy sketch of this principle is given further below.
- find_compartments_beta [1] [2]: Search for A/B compartments in each chromosome of the Hi-C matrix. The Hi-C matrix is normalized by the number of interactions expected at a given distance, and by visibility (one iteration of ICE). A correlation matrix is then calculated from this normalized matrix, and its first eigenvector is used to identify compartments, changes in sign marking boundaries between compartments. The result is stored as a dictionary of compartment boundaries, with chromosome names as keys.
- get_hic_data_as_csr: Returns a scipy sparse matrix in Compressed Sparse Row format of the Hi-C data in the dictionary
- get_matrix: returns a matrix.
- load_biases: Load biases, decay and bad columns from pickle file
- save_biases: Save biases, decay and bad columns in pickle format (to be loaded by the function load_hic_data_from_bam)
- sum: Sum of the Hi-C data matrix. WARNING: parameters are not meant to be used by external users.
- write_compartments [2]: Write compartments to a file.
- write_cooler: writes the hic_data to a cooler file.
- write_coord_table: writes a coordinate table to a file.
- write_matrix: writes the matrix to a file.
- yield_matrix: Yields the matrix line by line. Bad rows/columns are returned as null rows/columns.
- eq_reads: Compare reads accounting for multicontacts
- get_intersection: Merges the two files corresponding to the two read ends. Reads found in both files are merged and written to an output file. Multiple contacts are handled as follows: a pairwise contact is created for each possible combination of the multicontacts; if no other fragment from the read is mapped, both are kept; otherwise they are merged into one longer fragment (as if they were mapped on the positive strand).
- gt_reads: Compare reads accounting for multicontacts
- merge_2d_beds: Merge two result files (file resulting from get_intersection or from the filtering) into one.
- merge_bams: Merge two bam files with samtools into one.
- correlate_matrices [1] [2]: Compare the interactions of two Hi-C matrices at a given distance, with Spearman rank correlation. Also computes the SCC reproducibility score as in HiCrep (see https://doi.org/10.1101/gr.220640.117).
- eig_correlate_matrices [1] [2]: Compare the interactions of two Hi-C matrices using their first 6 eigenvectors, with Pearson correlation.
- fragment_size [1]: Plots the distribution of dangling-end lengths.
- get_reproducibility: Compute reproducibility score similarly to HiC-spector (https://doi.org/10.1093/bioinformatics/btx152)
- hic_map [1] [2]: Function to retrieve data from an HiC_data object. Data can be stored as a square matrix, or drawn using matplotlib.
- insert_sizes [1]: Deprecated function, use fragment_size
- plot_distance_vs_interactions [1]: Plot the number of interactions observed versus the genomic distance between the mapped ends of the read. The slope is expected to be around -1, in logarithmic scale and between 700 kb and 10 Mb (according to the prediction of the fractal globule model).
- plot_genomic_distribution [1] [2]: Plot the number of reads in bins along the genome (or along a given chromosome).
- plot_iterative_mapping [1]: Plots the number of reads mapped at each step of the mapping process (in the case of iterative mapping, each step is a mapping round with a given read-fragment size).
- plot_strand_bias_by_distance [1]: Classify reads into four categories depending on the strand on which each of their ends is mapped, and plot the proportion of each category as a function of the genomic distance between the two ends. Only fully mapped reads whose ends fall on two different restriction fragments (but on the same chromosome) are considered. The four categories are: both read ends mapped on the same strand (forward); both read ends mapped on the same strand (reverse); read ends mapped on different strands, facing each other (like extra dangling-ends); read ends mapped on different strands, in opposed orientation (like extra self-circles).
- apply_filter [2]: Create a new file with reads filtered
- filter_reads [2]: Filter mapped read pairs in order to remove experimental artifacts (e.g. dangling-ends, self-circles, PCR artifacts).
- fast_fragment_mapping: Maps FASTQ reads to an indexed reference genome with the knowledge of the restriction enzyme used (fragment-based mapping).
- full_mapping: Maps FASTQ reads to an indexed reference genome. Mapping can be done either without knowledge of the restriction enzyme used, or for experiments performed without one, like Micro-C (iterative mapping), or using the ligation sites created from the digested ends (fragment-based mapping).
- transform_fastq: Given a FASTQ file, split it into chunks of a given number of reads, trim each read according to start/end positions, or split reads into restriction enzyme fragments.
- map_re_sites: Map all restriction enzyme (RE) sites of a given enzyme in a genome. The position of an RE site is defined as the genomic coordinate of the first nucleotide after the first cut (genomic coordinates start at 1); for HindIII, for example, this is the coordinate of the second A of the A^AGCTT site. A small illustration is sketched further below.
- iupac2regex: Convert target sites with IUPAC nomenclature to regex pattern
- religateds: Returns the list of all possible sequences resulting from the religation of two digested and repaired ends.
- repaired: Returns the resulting sequence after repair of digested ends, marking dangling ends.
- identify_re: Search most probable restriction enzyme used in the Hi-C experiment. Uses binomial test and some heuristics.
- map_re_sites_nochunk: Map all restriction enzyme (RE) sites of a given enzyme in a genome. The position of an RE site is defined as the genomic coordinate of the first nucleotide after the first cut (genomic coordinates start at 1); see map_re_sites for the HindIII example.
- load_impmodel_from_cmm: Loads an IMPmodel object from a cmm file.
- load_impmodel_from_xyz: Loads an IMPmodel object from an xyz file.
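As a minimal illustration of the compartment-calling principle described for find_compartments above (correlation matrix, first eigenvector, sign changes as boundaries), here is a numpy sketch; it omits the distance-decay and ICE normalization steps and is not TADbit's implementation::

    import numpy as np

    def compartment_boundaries(norm_matrix):
        """Toy A/B compartment caller: boundaries are the bins where the sign
        of the first eigenvector of the correlation matrix changes."""
        corr = np.corrcoef(norm_matrix)           # bin-by-bin correlation matrix
        eigval, eigvec = np.linalg.eigh(corr)     # symmetric eigendecomposition
        ev1 = eigvec[:, np.argmax(eigval)]        # first (largest) eigenvector
        return np.where(np.diff(np.sign(ev1)) != 0)[0] + 1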
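In the same spirit, the sketch below illustrates what iupac2regex and map_re_sites describe: converting an IUPAC target site to a regular expression and reporting, for each match, the 1-based coordinate of the first nucleotide after the cut. The table is partial and the code is illustrative, not TADbit's::

    import re

    # Partial IUPAC-to-regex table (extend as needed)
    IUPAC = {'A': 'A', 'C': 'C', 'G': 'G', 'T': 'T', 'N': '[ACGT]',
             'R': '[AG]', 'Y': '[CT]', 'W': '[AT]', 'S': '[CG]'}

    def iupac_to_regex(site):
        return ''.join(IUPAC[nt] for nt in site.upper())

    def re_site_positions(sequence, site, cut_offset):
        """1-based coordinates of the first nucleotide after the cut,
        for every occurrence of the (IUPAC) restriction site."""
        pattern = re.compile(iupac_to_regex(site))
        return [m.start() + cut_offset + 1 for m in pattern.finditer(sequence.upper())]

    # HindIII cuts A^AGCTT, i.e. one nucleotide into the site
    print(re_site_positions('CCAAGCTTGGAAGCTTAA', 'AAGCTT', cut_offset=1))
    # -> [4, 12]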
A container for the IMP modeling results.
- objective_function [1]: This function plots the objective function value at each Monte-Carlo step.
A container for the IMP modeling results.
- accessible_surface [1]: Calculates a mesh surface around the model (at a distance equal to the input radius) and checks whether each point of this mesh could be replaced by an object (i.e. a protein) of a given radius. The outer part of the model can be excluded from the estimation of accessible surface, as the occupancy outside the model is unknown (see the superradius option).
- center_of_mass: Gives the center of mass of a model
- contour: Total length of the model
- cube_side: Calculates the side of a cube containing the model.
- cube_volume: Calculates the volume of a cube containing the model.
- distance: Calculates the distance between one point of the model and an external coordinate
- inaccessible_particles: Gives the number of loci/particles that are not accessible to an object (i.e. a protein) of a given size.
- longest_axe: Gives the distance between the two most distant particles of the model.
- min_max_by_axis: Calculates the minimum and maximum coordinates of the model
- persistence_length: Calculates the persistence length (Lp) of a given section of the model. Persistence length is calculated according to [Bystricky2004]; a plausible reconstruction of the formula is given after this list.
- radius_of_gyration: Calculates the radius of gyration (gyradius) of the model, defined by the standard formula given after this list.
- shortest_axe: Minimum distance between two particles in the model
- view_model [1]: Visualize a selected model in three dimensions (either with Chimera or through matplotlib).
- write_cmm [2]: Save a model in the cmm format, read by Chimera (http://www.cgl.ucsf.edu/chimera). Note: If none of model_num, models or cluster parameter are set, ALL the models will be written.
- write_xyz [2]: Writes an xyz file containing the 3D coordinates of each particle in the model. The output file is tab-separated, with the bead number in the first column, then the genomic coordinate, and finally the three coordinates X, Y and Z. Note: if none of the model_num, models or cluster parameters is set, ALL the models will be written.
- write_xyz_babel [2]: Writes an xyz file containing the 3D coordinates of each particle in the model, using a file format compatible with babel (http://openbabel.org/wiki/XYZ_%28format%29). The output file is tab-separated, with the bead number in the first column, then the genomic coordinate, and finally the three coordinates X, Y and Z. Note: if none of the model_num, models or cluster parameters is set, ALL the models will be written.
- load_structuralmodels: Loads StructuralModels from a file (generated with save_models).
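For reference, the standard definition behind radius_of_gyration is given below (r_i being the particle coordinates and r_com the center of mass). For persistence_length the exact expression was not preserved in this summary; the common worm-like-chain relation consistent with [Bystricky2004], with Lc the contour length, is given as a plausible reconstruction::

    R_g = \sqrt{\frac{1}{N}\sum_{i=1}^{N} \lVert \mathbf{r}_i - \mathbf{r}_{\mathrm{com}} \rVert^{2}}

    \langle R^2 \rangle = 2\, L_p L_c \left(1 - \frac{L_p}{L_c}\left(1 - e^{-L_c/L_p}\right)\right)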
This class contains three-dimensional models generated from a single Hi-C dataset. They can be reached either by their index (an integer representing their rank according to objective function value), or by their IMP initial random number (as a string).
- accessibility [1] [2]: Calculates a mesh surface around the model (at a distance equal to the input radius) and checks whether each point of this mesh could be replaced by an object (i.e. a protein) of a given radius. The outer part of the model can be excluded from the estimation of accessible surface, as the occupancy outside the model is unknown (see the superradius option).
- align_models: Three-dimensional aligner for structural models.
- angle_between_3_particles: Calculates the angle between 3 particles. Given three particles A, B and C, it computes the angle g (angle ACB, with vertex at C).
- average_model: Builds and returns an average model representing a given group of models
- centroid_model: Estimates and returns the centroid model of a given group of models.
- cluster_analysis_dendrogram [1]: Representation of the clustering results. The length of the leaves is proportional to the final objective function value of each model. The branch widths are proportional to the number of models in a given cluster (or group of clusters, if it is an internal branch).
- cluster_models: This function performs a clustering analysis of the generated models based on structural comparison. The result will be stored in StructuralModels.clusters. Clustering is done according to a pairwise comparison score.
- contact_map [1] [2]: Plots a contact map representing the frequency of interaction (defined by a distance cutoff) between two particles.
- correlate_with_real_data [1]: Plots the result of a correlation between a given group of models and original Hi-C data.
- deconvolve [1]: This function performs a deconvolution analysis of a given group of models. It first clusters models based on structural comparison (dRMSD), and then computes a differential contact map between each possible pair of clusters.
- define_best_models: Defines the number of top models (based on the objective function) to keep. If keep_all is set to True in generate_3d_models or in model_region, then the full set of models (n_models parameter) will be used, otherwise only the n_keep models will be available.
- density_plot [1] [2]: Plots the number of nucleotides per nm of chromatin vs the modeled region bins.
- dihedral_angle: Calculates the dihedral angle between 2 planes formed by 5 particles (one common to both planes).
- fetch_model_by_rand_init: Models are stored according to their objective function value (first best), but in order to reproduce a model, we need its initial random number. This method helps to fetch the model corresponding to a given initial random number stored under StructuralModels.models[N]['rand_init'].
- get_contact_matrix: Returns a matrix with the number of interactions observed below a given cutoff distance (the idea is sketched further below).
- get_persistence_length [1] [2]: Calculates the persistence length (Lp) of a given section of the model. Persistence length is calculated according to [Bystricky2004].
- infer_unrestrained_particle_coords: If a given particle (and its direct neighbors) has no restraints, infer its coordinates by linear interpolation using the closest particles with restraints.
- interactions [1] [2]: Plots, for each particle, the number of interactions (particles closer than the given cut-off). The value given is the average for all models.
- median_3d_dist [1]: Computes the median distance between two particles over a set of models
- model_consistency [1] [2]: Plots the particle consistency, over a given set of models, vs the modeled region bins. The consistency is a measure of the variability (or stability) of the modeled region (the higher the consistency value, the higher stability).
- objective_function_model [1]: This function plots the objective function value at each Monte-Carlo step.
- particle_coordinates: Returns the mean coordinate of a given particle in a group of models.
- save_models [2]: Saves all the models in pickle format (python object written to disk).
- view_centroid: shortcut for view_models(tool='plot', show='highlighted', highlight='centroid')
- view_models [1]: Visualize a selected model in three dimensions (either with Chimera or through matplotlib).
- walking_angle [1] [2]: Plots the angle between successive loci in a given model or set of models. In order to limit the noise of the measure, the angle is calculated between three loci separated by two other loci each (e.g. between loci A, D and G).
- walking_dihedral [1] [2]: Plots the dihedral angle between successive planes. A plane is formed by 3 successive loci.
- zscore_plot [1]: Generate 3 plots. Two are heatmaps of the Z-scores used for modeling, one of which is binary, showing in red the Z-scores higher than the upper cut-off and in blue the Z-scores lower than the lower cut-off. The last plot is a histogram of the distribution of Z-scores, showing the selected regions; the histogram also shows the fit to a normal distribution.
- parse_bed: Simple BED and BEDgraph parser that only checks fields 1, 2, 3 and 5 (or 1, 2 and 3 if field 5 is not available).
- parse_mappability_bedGraph: Parse a BEDgraph file containing mappability values. A GEM mappability file can be obtained with: gem-indexer -i hg38.fa -o hg38; gem-mappability -I hg38.gem -l 50 -o hg38.50mer -T 8; gem-2-wig -I hg38.gem -i hg38.50mer.mappability -o hg38.50mer; wigToBigWig hg38.50mer.wig hg38.50mer.sizes hg38.50mer.bw; bigWigToBedGraph hg38.50mer.bw hg38.50mer.bedGraph.
- cooler_file: Cooler file wrapper.
- close: Copy the remaining buffer to the file, index the pixels and complete the information.
- create_bins: Write bins to cooler file.
- prepare_matrix: Prepare matrix datasets to be written as chunks.
- write_bins: Write the bins table.
- write_indexes: Write the indexes from existing bins and pixels.
- write_info: Write the file description and metadata attributes.
- write_iter: Write bin1, bin2, value to buffer. When the chunk number changes the buffer is written to the h5py file.
- write_regions: Write the regions table.
- write_weights: Write the weights in the bins table.
- is_cooler: Check if file is a cooler and contains the wanted resolution
- parse_cooler: Read matrix stored in cooler
- parse_header: Read matrix header stored in cooler
- rlencode: Run-length encoding, based on http://stackoverflow.com/a/32681075, which is itself based on the rle function from R. Parameters: x (1D array_like), the input array to encode; dropna (bool, optional), drop all runs of NaNs. Returns: start positions, run lengths, run values. A generic sketch is given further below.
- parse_fasta: Parse a list of fasta files, or just one fasta. WARNING: The order is important
- get_gc_content: Get the GC content in bins of a given size. Ns are not taken into account in the calculation; only the number of Gs and Cs over As, Ts, Gs and Cs.
- bed2D_to_BAMhic: Function adapted from a script by Enrique Vidal <[email protected]> to convert 2D BED files into compressed BAM format. Takes the _both_filled_map.tsv contacts from TADbit (and the corresponding filter files) and outputs a modified indexed BAM with the following fields: read ID; filtering flag (see codes in the header); chromosome ID of the first end of the contact; genomic position of the first end of the contact; MAPQ set to 0; pseudo CIGAR with the sequence length and information about the current copy (P: first copy, S: second copy); chromosome ID of the second end of the contact; genomic position of the second end of the contact; mapped length of the second end of the contact; sequence is missing (*); quality is missing (*); TC tag indicating single (1) or multi contact (3, 6, ...).
- get_biases_region: Retrieve biases, decay, and bad bins from a dictionary, and re-index it according to a region of interest.
- get_filters: get all filters
- read_matrix: Reads and checks a matrix from a file (using autoreader) or from a list.
- load_hic_data_from_bam: Load Hi-C data from a BAM file.
- load_hic_data_from_reads: Load Hi-C data from a file of mapped read pairs.
- abc_reader: Read matrix stored in 3 column format (bin1, bin2, value)
- autoreader: Auto-detect matrix format of HiC data file.
- is_asymmetric: Helper function for autoreader.
- is_asymmetric_dico: Helper function for optimal_reader.
- optimal_reader: Reads a matrix generated by TADbit. Can be slower than autoreader, but uses almost a third of the memory
- symmetrize: Make a matrix symmetric by summing two halves of the matrix
- symmetrize_dico: Make an HiC_data object symmetric by summing two halves of the matrix
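A generic numpy sketch of the run-length encoding described for rlencode above (the dropna option is omitted); not necessarily identical to TADbit's implementation::

    import numpy as np

    def rlencode(x):
        """Run-length encode a 1D array.
        Returns (start positions, run lengths, run values)."""
        x = np.asarray(x)
        n = len(x)
        if n == 0:
            return np.empty(0, dtype=int), np.empty(0, dtype=int), x
        starts = np.r_[0, np.flatnonzero(x[1:] != x[:-1]) + 1]
        lengths = np.diff(np.r_[starts, n])
        return starts, lengths, x[starts]

    print(rlencode([1, 1, 2, 2, 2, 3]))
    # -> (array([0, 2, 5]), array([2, 3, 1]), array([1, 2, 3]))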
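Similarly, the idea behind get_contact_matrix (listed with the StructuralModels methods above) can be sketched on plain coordinate arrays: for each pair of particles, count how often, across models, they lie below the distance cutoff. A minimal numpy sketch, assuming models are given as an (n_models, n_particles, 3) array::

    import numpy as np

    def contact_matrix(models, cutoff):
        """Fraction of models in which each particle pair is closer than cutoff."""
        models = np.asarray(models, dtype=float)
        diff = models[:, :, None, :] - models[:, None, :, :]   # pairwise vectors
        dists = np.sqrt((diff ** 2).sum(axis=-1))               # pairwise distances
        return (dists < cutoff).mean(axis=0)                    # average over models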
Exception to handle failed autoreader.
- parse_map: Parse map files. Keeps a summary of the results in 2 tab-separated files that will contain 6 columns: read ID, chromosome, position, strand (either 0 or 1), mapped sequence length, position of the closest upstream RE site, position of the closest downstream RE site. The position of reads mapped on the reverse strand is computed from the end of the read (original position + read length - 1).
- parse_gem_3c: Parse gem 3c sam file using pysam tools.
- parse_sam: Parse sam/bam files using pysam tools. Keeps a summary of the results in 2 tab-separated files that will contain 6 columns: read ID, chromosome, position, strand (either 0 or 1), mapped sequence length, position of the closest upstream RE site, position of the closest downstream RE site.
- parse_tads: Parse a tab-separated value file that contains the list of TADs of a given experiment. This file might have been generated with print_result_R or with the R binding for tadbit.
- core_nw: Core of the fast Needleman-Wunsch algorithm that aligns matrices
- core_nw_long: Core of the long Needleman-Wunsch algorithm that aligns matrices
- optimal_cmo: Calculates the optimal contact map overlap between two matrices. TODO: make the selection of the number of eigenvectors automatic, or rely on the summed contribution (e.g. select the EVs that account for 80% of the information).
- virgin_score: Fill a matrix with zeros, except first row and first column filled with multiple values of penalty.
- batch_tadbit [2]: Use tadbit on directories of data files. All files in the specified directory will be considered data files; the presence of non-data files will cause the function to either crash or produce aberrant results. Each file has to contain the data for a single unit/chromosome. The files can be separated into sub-directories corresponding to single experiments or any other organization. Data files that should be considered replicates have to start with the same characters, up to the character sep. For instance, all replicates of the unit 'chr1' should start with 'chr1_', using the default value of sep. The data files are read through read.delim. You can pass options to read.delim through the list read_options. For instance, if the files have no header, use read_options=list(header=FALSE), and if they also have row names, read_options=list(header=FALSE, row.names=1). Other arguments such as max_size, n_CPU and verbose are passed to tadbit. NOTE: only used externally, not from Chromosome.
- tadbit: The TADbit algorithm works on raw chromosome interaction count data. The normalization is neither necessary nor recommended, since the data is assumed to be discrete counts. TADbit is a breakpoint detection algorithm that returns the optimal segmentation of the chromosome under BIC-penalized likelihood. The model assumes that counts have a Poisson distribution and that the expected value of the counts decreases like a power-law with the linear distance on the chromosome. This expected value of the counts at position (i,j) is corrected by the counts at diagonal positions (i,i) and (j,j). This normalizes for different restriction enzyme site densities and 'mappability' of the reads in case a bin contains repeated regions.
- colorize: Colorize a string with ANSI colors for printing in a shell, according to a given number between 0 and 10.
- nicer: Writes a resolution number in a human-readable form.
- add_subplot_axes: from https://stackoverflow.com/questions/17458580/embedding-small-plots-inside-subplots-in-matplotlib/35966183
- augmented_dendrogram [1]: Plots an augmented dendrogram (dendrogram annotated with distances) of the clustering results.
- chimera_view [1]: Open a list of .cmm files with Chimera (http://www.cgl.ucsf.edu/chimera) to view models.
- color_residues: Function to color residues from blue to red.
- compare_models: Plots the difference between the contact maps of two groups of structural models.
- pcolormesh_45deg [1]: Draw a triangular matrix.
- plot_2d_optimization_result [1]: A grid of heatmaps representing the result of the optimization. The maps will be divided into different pages depending on the 'scale' and 'kbending' values. On each page there will be different maps depending on the 'maxdist' values. Each map has 'upfreq' values along the x-axis and 'lowfreq' values along the y-axis.
- plot_3d_model [1]: Given 3 lists of coordinates (x, y, z), plots a three-dimensional model using matplotlib.
- plot_3d_optimization_result: Displays a three dimensional scatter plot representing the result of the optimization.
- plot_HiC_matrix [1]: Plot HiC matrix with histogram of values inside color bar.
- tad_border_coloring: Colors TAD borders from blue to red (bad to good score). TADs are displayed in a scale of grey, from light to dark grey (first to last particle in the TAD).
- tad_coloring: Colors TADs from blue to red (first to last TAD). TAD borders are displayed in a scale of grey, from light to dark grey (again, first to last border).
- quality_plot [1]: Plots the sequencing quality of a given FASTQ file. If a restriction enzyme (RE) name is provided, it can also represent the distribution of digested and undigested RE sites and estimate an expected proportion of dangling-ends. The proportion of dangling-ends is inferred by counting the number of times a dangling-end site is found at the beginning of any of the reads (divided by the number of reads).
- magic_open: To read uncompressed, zip, gzip, bzip2 or tar.xx files.
- which: Test whether an executable exists (from stackoverflow: http://stackoverflow.com/questions/377017/test-if-executable-exists-in-python).
- get_free_space_mb: Return folder/drive free space (in bytes) Based on stackoverflow answer: http://stackoverflow.com/questions/51658/cross-platform-space-remaining-on-volume-using-python
- is_fastq: Check if a given file is in fastq format
- wc: Pythonic way to count lines
- hic_filtering_for_modelling [1]: Call filtering function, to remove artifactual columns in a given Hi-C matrix. This function will detect columns with very low interaction counts; and columns with NaN values (in this case NaN will be replaced by zero in the original Hi-C data matrix). Filtered out columns will be stored in the dictionary Experiment._zeros.
- filter_by_mean [1]: Fits the distribution of Hi-C interaction counts per column in the matrix to a polynomial, and uses the fit to derive a cut-off below which columns are filtered out.
- filter_by_zero_count: Filters out columns with no (or almost no) interaction counts.
- filter_by_cis_percentage [1]: Define artifactual columns with either too low or too high counts of interactions by comparing their percentage of cis (intra-chromosomal) interactions.
- best_path: Viterbi algorithm with backpointers
- gaussian_prob: Probability of x under a Gaussian with a given expected value E (https://en.wikipedia.org/wiki/Normal_distribution).
- iterative: Implementation of the iterative correction of Imakaev et al., 2012 (a minimal sketch is given after this list).
- expected: Computes the expected values by averaging observed interactions at a given distance in a given HiC matrix.
- zscore: Calculates the Z-score of the log10 of a given list of values.
- calinski_harabasz: Implementation of the CH score [CalinskiHarabasz1974], which has been shown to be one of the most accurate ways to compare clustering methods [MilliganCooper1985] [Tibshirani2001]. The standard form of the CH score is given after this list.
- mad: Median Absolute Deviation, a "robust" version of the standard deviation that indicates the variability of the sample. https://en.wikipedia.org/wiki/Median_absolute_deviation
- mean_none: Calculates the mean of a list of values without taking None values into account.
- newton_raphson: Newton-Raphson method as defined in: http://www.maths.tcd.ie/~ryan/TeachingArchive/161/teaching/newton-raphson.c.html used to search for the persistence length of a given model.
- right_double_mad: Double Median Absolute Deviation, a "robust" version of the standard deviation that indicates the variability of the sample. http://eurekastatistics.com/using-the-median-absolute-deviation-to-find-outliers
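A minimal dense-matrix sketch of the iterative correction referenced for iterative above (Imakaev et al. 2012): the matrix is repeatedly divided by its row/column coverage until balanced. This is a generic illustration, not TADbit's optimized implementation::

    import numpy as np

    def iterative_correction(matrix, n_iter=50):
        """Minimal ICE balancing of a symmetric matrix; returns the
        corrected matrix and the accumulated per-bin biases."""
        w = np.array(matrix, dtype=float)
        biases = np.ones(len(w))
        for _ in range(n_iter):
            s = w.sum(axis=1)
            s /= s[s != 0].mean()        # keep the overall scale stable
            s[s == 0] = 1                # leave empty rows/columns untouched
            w /= s[:, None] * s[None, :]
            biases *= s
        return w, biases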
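The Calinski-Harabasz score mentioned above is, in its standard textbook form, the ratio of between-cluster to within-cluster dispersion (B and W), with n models and k clusters; this standard form is assumed to be the formula the truncated entry pointed to::

    \mathrm{CH}(k) = \frac{B(k)\,/\,(k - 1)}{W(k)\,/\,(n - k)}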
Simple linear interpolation, to be used when the one from scipy is not available.
- angle_between_3_points: Calculates the angle between 3 points. Given three points A, B and C, it computes the angle g (angle ACB, with vertex at C); a small sketch follows this list.
- build_mesh: Main function for the calculation of the accessibility of a model.
- calc_eqv_rmsd: Calculates the RMSD, the dRMSD, the number of equivalent positions, and a score combining these three measures. The measures are done between a group of models in a one-against-all manner.
- dihedral: Calculates dihedral angle between 4 points in 3D (array with x,y,z)
- fast_square_distance: Calculates the square distance between two coordinates.
- find_angle_rotation_improve_x: Finds the rotation angle needed to face the longest edge of the molecule
- generate_circle_points: Returns a list of 3D coordinates of points on a circle, using the Rodrigues rotation formula (see Murray, G. (2013), Rotation About an Arbitrary Axis in 3 Dimensions, for details).
- generate_sphere_points: Returns a list of 3D coordinates of points on a sphere, using the Golden Section Spiral algorithm (a sketch follows this list).
- get_center_of_mass: Get the center of mass of a given object from its lists of x, y, z coordinates.
- mass_center: Transforms coordinates according to the center of mass
- mmp_score [1]: Computes (and plots) the MMP score, an estimate of how well a Hi-C matrix can be modeled in 3D.
- rotate_among_y_axis: Rotate an object, given as a list of x, y, z coordinates, around the y axis through its center of mass.
- square_distance: Calculates the square distance between two particles.
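To make the geometry used by angle_between_3_points (and angle_between_3_particles) concrete, the angle g = ACB has its vertex at C; a small numpy sketch::

    import numpy as np

    def angle_acb(a, c, b):
        """Angle (in degrees) at vertex C formed by points A, C and B."""
        a, c, b = (np.asarray(p, dtype=float) for p in (a, c, b))
        v1, v2 = a - c, b - c
        cosang = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
        return np.degrees(np.arccos(np.clip(cosang, -1.0, 1.0)))

    print(angle_acb((1, 0, 0), (0, 0, 0), (0, 1, 0)))  # -> 90.0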
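And a standard version of the Golden Section Spiral used by generate_sphere_points, which spreads n points roughly evenly on the unit sphere; again a generic sketch rather than TADbit's exact code::

    import numpy as np

    def sphere_points(n):
        """n points roughly evenly distributed on the unit sphere."""
        k = np.arange(n)
        phi = k * np.pi * (3.0 - np.sqrt(5.0))     # golden angle increment
        z = 1.0 - 2.0 * (k + 0.5) / n              # z evenly spaced in (-1, 1)
        r = np.sqrt(1.0 - z * z)
        return np.c_[r * np.cos(phi), r * np.sin(phi), z]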
[1] Functions generating plots.
[2] Functions writing text files.