Introduction
Installation
Install from source
Pip install
Requirements
EEL/smFISH data life cycle in the Linnarsson Lab
Data organisation
Data flow
pysmFISH processing pipeline requirements
Setup the processing folder
Experiment configuration file
Dataset file
Pipeline
Pipeline running scheme
Overview processing steps
Create folders step
Collect commit step
QC the experiment-name.config.yaml file
Start processing cluster
Local cluster
HTCondor
Unmanaged cluster
Parsing graph step
Parsing introduction
Parsing overview processing graph
Parsing graph steps
Prepare processing dataset step
Create analysis configuration file from dataset step
Determine tiles organization step
Create running functions step
Processing barcoded EEL graph step
EEL processing steps
Remove overlapping dots from data stitched using the microscope coords
QC registration error between different rounds of hybridization
Global stitching and overlapping dots removal
Global stitching graph step
Global stitching dots removal step
Processing fresh tissue step
Segmentation, registration and generation of cell/gene matrix
Processing serial smFISH
Processing serial smFISH steps
Pipeline runs
run_parsing_only
run_required_steps
run_full
test_run_after_editing
test_run_short
test_run_decoding
test_run_from_registration
test_run_Lars_mouse_atlas
How to run the pipeline
Run with papermill
Run with jupyter lab
Connect to dask dashboard
Jupyter lab notebooks
Download and organize test data
Jupyter lab examples
Template running pysmFISH pipeline
Setup processing environment
Create dark image from standalone file
Local processing cluster activation test
Convert excel codebook to parquet
Visualize raw counting
Test filtering and counting
Process fresh tissue
Stitching counts EEL
Create data for fishScale
QC Registration
pysmFISH
is a Python package used to analyse data generated by the Linnarsson Lab automated systems (called ROBOFISH). The data can be analysed on a local computer or on an HPC cluster. The cluster can be unmanaged or managed by HTCondor and must have a shared file system.
Install from source
Pip install
You can install pysmFISH with pip or from source.
CORRECT THE LINK TO THE PACKAGE
# Create your conda env
conda create -n pysmFISH-env python=3.8.5
conda activate pysmFISH-env
pip install --upgrade pip # not always required
# Create the directory that will contain the locally installed copy of the package
mkdir run_code_here
cd run_code_here
# Clone the package
git clone https://github.com/linnarsson-lab/pysmFISH_auto.git
# currently the version
# Install the package
cd pysmFISH_auto
pip install --use-feature=in-tree-build .
# Install the kernel to run with papermill
python -m ipykernel install --user --name pysmFISH-env --display-name 'pysmFISH-env'
# Create your conda env
conda create -n pysmFISH-env python=3.8.5
conda activate pysmFISH-env
pip install pysmFISH
# Install the kernel to run with papermill
python -m ipykernel install --user --name pysmFISH-env --display-name 'pysmFISH-env'
All the requirements are included in the setup.py file.
ON PREMISES CLUSTER (MONOD)
The cluster has a shared file system. Each experiment is labelled with a specific tag.
DRIVE 1 (fish):
- current_folder: symlink to the storage drive where all the raw data are saved. The data are stored on a different drive that is not backed up but is on a RAID.
- processing_folder:
  - experiment_name: symlink to the experiment to process
  - config_db: contains the configuration files required to start the processing
  - codebooks: contains all the files with the codebooks used in the experiments
  - probes_sets: contains all the fasta files with the probes used in the analysis
  - fish_projects: contains folders named according to the project. Each folder contains the symlinks to the associated experiment folders.
- preprocessed_data_backup: contains a folder for each experiment with the preprocessed data and the relevant metadata used for processing. IMPORTANT: this folder is backed up in a different physical location (e.g. outside Karolinska).
DRIVE 2 (datb):
- sl: parent directory with the data of the Linnarsson Lab. The drive is not backed up but is on a RAID.
  - fish_rawdata: contains all the experiment folders with the fish data.
    - config_db: symlink to the config_db folder on DRIVE 1 (fish) that contains the configuration files required to start the processing
    - codebooks: symlink to the codebooks folder on DRIVE 1 (fish) that contains all the files with the codebooks used in the experiments
    - probes_sets: symlink to the probes_sets folder on DRIVE 1 (fish) that contains all the fasta files with the probes used in the analysis
    - experiment_name: experiment to process
DRIVE N (datN):
- sl: parent directory with the data of the Linnarsson Lab. The drive is not backed up but is on a RAID.
  - fish_rawdata: contains all the experiment folders with the fish data.
    - config_db: symlink to the config_db folder on DRIVE 1 (fish) that contains the configuration files required to start the processing
    - codebooks: symlink to the codebooks folder on DRIVE 1 (fish) that contains all the files with the codebooks used in the experiments
    - probes_sets: symlink to the probes_sets folder on DRIVE 1 (fish) that contains all the fasta files with the probes used in the analysis
    - experiment_name: experiment to process
- Microscope: the raw data are collected in a folder named according to the experiment.
- The experiment folders containing the raw data are transferred from the microscope to /fish/current_folder on monod. This is a symlink that points to a /datX/sl/fish_rawdata folder on a drive where the raw data are stored. When the drive is full, the /fish/current_folder symlink is modified to point to a different drive. Important: /datX/sl/fish_rawdata is not backed up but is on a RAID. Do not change the symlink while data are being processed, otherwise the processing will fail because it won't be able to find the data.
- The location of the current folder (e.g. /datX/sl/fish_rawdata) must contain symlinks to the config_db / probes_sets / codebooks folders on the fish drive.
- The data are processed.
- The preprocessed data and preliminary results are transferred to /fish/preprocessed_data_backup/Experiment_name for secure storage, together with the metadata required for the processing.
- The remaining data generated by intermediate steps are deleted.
- Setup the processing folder
- Experiment configuration file
- Dataset file
- If the processing starts from parsing raw data generated by a ROBOFISH machine, a specific set of requirements needs to be satisfied (described below)
The data and the configuration files must be organised according to the following tree:
PROCESSING FOLDER:
required subfolders
- config_db: contains the configuration files required to start the processing
  - analysis_config.yaml: parameters required for running the analysis.
  - ROBOFISH1_dark_img.npy: camera noise image to use for filtering if a camera noise image has not been generated in the specific experiment.
  - ROBOFISH2_dark_img.npy
  - ROBOFISH3_dark_img.npy
  - UNDEFINED.npy
- codebooks: contains all the files with the codebooks used in the analysis of barcoded experiments
  - gene_hammming_16_11_6_positive_bits.parquet
  - gene_hammming_16_11_6_positive_bits.xslx
- probes_sets: contains all the fasta files with the probes used in the analysis
  - HE.fasta
experiments subfolders
Many experiment folders can be present in the processing folder at any given time.
- EXP20200922_run_smFISH: folder with the data to process; the organisation of an experiment folder is described below.
- EXP20200922_run_smFISH_transfer_to_monod_completed.txt: matching empty text file used to confirm that the data from the machine have been fully transferred to the processing server.
Run the processing_env->setup_de_novo_env command inside the processing folder. The output will look like:
- config_db: contains the configuration files required to start the processing
- codebooks: contains all the files with the codebooks used in the analysis of barcoded experiments
- probes_sets: contains all the fasta files with the probes used in the analysis
Run the processing_env->create_general_analysis_config command. The output will look like:
- config_db: contains the configuration files required to start the processing
  - analysis_config.yaml: parameters required for running the analysis (analysis_config example).
- codebooks: contains all the files with the codebooks used in the analysis of barcoded experiments
- probes_sets: contains all the fasta files with the probes used in the analysis
Copy the fasta files that will be used in the processing into the probes_sets folder inside the processing folder. Update the folder content with new files when a new probe set is required for the processing. The output will look like:
- config_db: contains the configuration files required to start the processing
  - analysis_config.yaml: parameters required for running the analysis.
- codebooks: contains all the files with the codebooks used in the analysis of barcoded experiments
- probes_sets: contains all the fasta files with the probes used in the analysis
  - HE.fasta
The reference codebooks (codebook example) are generated as .xlsx files with the following columns: Barcode, Index, Group, Fluorophore, Tail1, Tail2, Tail3, Tail4, Tail5, Tail6, Gene, Pool. First copy the .xlsx file into the codebooks folder, then run the conversion command processing_env->convert_codebook. The processing codebook will have only the Barcode and Gene columns and will be stored as a .parquet file. The barcode is stored as bytes (np.int8) in the parquet file (a minimal conversion sketch is shown after the folder tree below). The output will look like:
- config_db: contains the configuration files required to start the processing
  - analysis_config.yaml: parameters required for running the analysis.
- codebooks: contains all the files with the codebooks used in the analysis of barcoded experiments
  - gene_hammming_16_11_6_positive_bits.parquet
  - gene_hammming_16_11_6_positive_bits.xslx
- probes_sets: contains all the fasta files with the probes used in the analysis
  - HE.fasta
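For illustration only (the supported path is the processing_env->convert_codebook command), the sketch below shows the transformation the conversion performs, assuming the Barcode column holds strings of 0/1 characters; the exact on-disk representation used by the package may differ.

```python
# Hypothetical stand-in for processing_env->convert_codebook: keep only the Barcode and
# Gene columns and store the barcode bits as np.int8 before writing the .parquet file.
from pathlib import Path

import numpy as np
import pandas as pd


def convert_codebook_sketch(xlsx_fpath: str) -> Path:
    codebook = pd.read_excel(xlsx_fpath)  # requires openpyxl
    converted = pd.DataFrame({
        # '1010...' -> array([1, 0, 1, 0, ...], dtype=int8)
        "Barcode": codebook["Barcode"].astype(str).map(
            lambda bits: np.array(list(bits), dtype=np.int8)),
        "Gene": codebook["Gene"],
    })
    out_fpath = Path(xlsx_fpath).with_suffix(".parquet")
    converted.to_parquet(out_fpath)  # requires pyarrow or fastparquet
    return out_fpath


# convert_codebook_sketch("codebooks/gene_hammming_16_11_6_positive_bits.xslx")
```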
The dark image file is acquired at the end of the experiment as an .nd2 file labeled Blank*.nd2. The processing functions will use this file for the analysis. However, if the file has not been acquired, it is necessary to have a MACHINE-NAME_dark_img.npy file in the config_db folder. Therefore save a copy of an already generated dark image for each of the machines that will generate data inside the config_db folder (a sketch of how such a file can be produced is shown after the tree below). The output will look like:
- config_db: contains the configuration files required to start the processing
  - analysis_config.yaml: parameters required for running the analysis.
  - ROBOFISH1_dark_img.npy: camera noise image to use for filtering if a camera noise image has not been generated in the specific experiment.
  - ROBOFISH2_dark_img.npy
  - ROBOFISH3_dark_img.npy
  - UNDEFINED.npy
- codebooks: contains all the files with the codebooks used in the analysis of barcoded experiments
  - gene_hammming_16_11_6_positive_bits.parquet
  - gene_hammming_16_11_6_positive_bits.xslx
- probes_sets: contains all the fasta files with the probes used in the analysis
  - HE.fasta
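As an illustration (not the package's own routine) of how a MACHINE-NAME_dark_img.npy file can be produced, the sketch below takes the median over a stack of blank frames; loading the Blank*.nd2 file into a numpy array is assumed to have happened already.

```python
# Minimal sketch: median projection of a stack of blank frames saved as a dark image.
from pathlib import Path

import numpy as np


def save_dark_img(blank_stack: np.ndarray, machine_name: str, config_db: str) -> Path:
    """blank_stack: (n_frames, rows, cols) array loaded from the Blank*.nd2 file."""
    dark_img = np.median(blank_stack, axis=0)
    out_fpath = Path(config_db) / f"{machine_name}_dark_img.npy"
    np.save(out_fpath, dark_img)
    return out_fpath


# Example with synthetic camera noise:
# fake_blank = np.random.poisson(100, size=(20, 512, 512)).astype(np.uint16)
# save_dark_img(fake_blank, "ROBOFISH2", "config_db")
```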
The folder containing the data to process must be moved into the processing folder. This is independent of the status of the processing.
EXAMPLE
Structure of an experiment folder generated by ROBOFISH. It is a two-color-channel (Cy5/Cy3) EEL experiment using Europium as the reference channel.
Reference images collected at 10X and 40X:
- Reference nuclei 10X: 20210613_175632_037__ChannelCy3_Nuclei_Seq0000.nd2
- Reference nuclei images 40X: 20210613_175632_037__ChannelCy3_Nuclei_Seq0001.nd2
- Reference Europium beads images 40X: 20210613_175632_037__ChannelEuropium_Cy3_Seq0002.nd2
Round images (one file per channel with all FOVs); Count0000X must be present in the file name:
20210615_231228_197__Count00001_ChannelCy3_Seq0002.nd2
20210615_231228_197__Count00001_ChannelCy5_Seq0000.nd2
20210615_231228_197__Count00001_ChannelEuropium_Seq0001.nd2
20210615_231228_197__Count00002_ChannelCy3_Seq0005.nd2
20210615_231228_197__Count00002_ChannelCy5_Seq0003.nd2
20210615_231228_197__Count00002_ChannelEuropium_Seq0004.nd2
20210615_231228_197__Count00003_ChannelCy3_Seq0008.nd2
20210615_231228_197__Count00003_ChannelCy5_Seq0006.nd2
20210615_231228_197__Count00003_ChannelEuropium_Seq0007.nd2
20210615_231228_197__Count00004_ChannelCy3_Seq0011.nd2
20210615_231228_197__Count00004_ChannelCy5_Seq0009.nd2
20210615_231228_197__Count00004_ChannelEuropium_Seq0010.nd2
20210615_231228_197__Count00005_ChannelCy3_Seq0014.nd2
20210615_231228_197__Count00005_ChannelCy5_Seq0012.nd2
20210615_231228_197__Count00005_ChannelEuropium_Seq0013.nd2
20210615_231228_197__Count00006_ChannelCy3_Seq0017.nd2
20210615_231228_197__Count00006_ChannelCy5_Seq0015.nd2
20210615_231228_197__Count00006_ChannelEuropium_Seq0016.nd2
20210615_231228_197__Count00007_ChannelCy3_Seq0020.nd2
20210615_231228_197__Count00007_ChannelCy5_Seq0018.nd2
20210615_231228_197__Count00007_ChannelEuropium_Seq0019.nd2
20210615_231228_197__Count00008_ChannelCy3_Seq0023.nd2
20210615_231228_197__Count00008_ChannelCy5_Seq0021.nd2
20210615_231228_197__Count00008_ChannelEuropium_Seq0022.nd2
20210615_231228_197__Count00009_ChannelCy3_Seq0026.nd2
20210615_231228_197__Count00009_ChannelCy5_Seq0024.nd2
20210615_231228_197__Count00009_ChannelEuropium_Seq0025.nd2
20210615_231228_197__Count00010_ChannelCy3_Seq0029.nd2
20210615_231228_197__Count00010_ChannelCy5_Seq0027.nd2
20210615_231228_197__Count00010_ChannelEuropium_Seq0028.nd2
20210615_231228_197__Count00011_ChannelCy3_Seq0032.nd2
20210615_231228_197__Count00011_ChannelCy5_Seq0030.nd2
20210615_231228_197__Count00011_ChannelEuropium_Seq0031.nd2
20210615_231228_197__Count00012_ChannelCy3_Seq0035.nd2
20210615_231228_197__Count00012_ChannelCy5_Seq0033.nd2
20210615_231228_197__Count00012_ChannelEuropium_Seq0034.nd2
20210615_231228_197__Count00013_ChannelCy3_Seq0038.nd2
20210615_231228_197__Count00013_ChannelCy5_Seq0036.nd2
20210615_231228_197__Count00013_ChannelEuropium_Seq0037.nd2
20210615_231228_197__Count00014_ChannelCy3_Seq0041.nd2
20210615_231228_197__Count00014_ChannelCy5_Seq0039.nd2
20210615_231228_197__Count00014_ChannelEuropium_Seq0040.nd2
20210615_231228_197__Count00015_ChannelCy3_Seq0044.nd2
20210615_231228_197__Count00015_ChannelCy5_Seq0042.nd2
20210615_231228_197__Count00015_ChannelEuropium_Seq0043.nd2
20210615_231228_197__Count00016_ChannelCy3_Seq0047.nd2
20210615_231228_197__Count00016_ChannelCy5_Seq0045.nd2
20210615_231228_197__Count00016_ChannelEuropium_Seq0046.nd2
Configuration files matching the round images:
Count00001_JJEXP20210613_SL001_Section1_C1H01.pkl
Count00002_JJEXP20210613_SL001_Section1_C1H02.pkl
Count00003_JJEXP20210613_SL001_Section1_C1H03.pkl
Count00004_JJEXP20210613_SL001_Section1_C1H04.pkl
Count00005_JJEXP20210613_SL001_Section1_C1H05.pkl
Count00006_JJEXP20210613_SL001_Section1_C1H06.pkl
Count00007_JJEXP20210613_SL001_Section1_C1H07.pkl
Count00008_JJEXP20210613_SL001_Section1_C1H08.pkl
Count00009_JJEXP20210613_SL001_Section1_C1H09.pkl
Count00010_JJEXP20210613_SL001_Section1_C1H10.pkl
Count00011_JJEXP20210613_SL001_Section1_C1H11.pkl
Count00012_JJEXP20210613_SL001_Section1_C1H12.pkl
Count00013_JJEXP20210613_SL001_Section1_C1H13.pkl
Count00014_JJEXP20210613_SL001_Section1_C1H14.pkl
Count00015_JJEXP20210613_SL001_Section1_C1H15.pkl
Count00016_JJEXP20210613_SL001_Section1_C1H16.pkl
Experiment configuration file:
JJEXP20210613_SL001_Section1_config.yaml (see below for the content and the required fields)
Images of the crosses used to register the slide after EEL:
Left_cross.nd2
Right_cross.nd2
Coords of the acquired FOVs:
- Coords of the initial grid of FOVs (multipoints_775.xml)
- Coords of the crosses added to the initial grid (multipoints_775_Crosses.xml)
- Coords corrected after registration of the crosses (multipoints_775_Crosses_Adjusted.xml)
- Extra FOVs removed (multipoints_631_Final.xml)
ROBOFISH logs:
2021-Jun-15_17-58-05_ROBOFISH2.log
2021-Jun-15_21-24-21_ROBOFISH2.log
# Count00014_JJEXP20210613_SL001_Section1_C1H14.pkl
{'round_code': 'C1H14',
'experiment_name': 'JJEXP20210613_SL001_Section1',
'Description': 'GBM SL001 with HG1 and HG2 pools',
'Protocols_io': 'https://www.protocols.io/edit/eel-t92er8e',
'chamber': 'chamber1',
'Machine': 'ROBOFISH2',
'Operator': 'operator4',
'Timestamp_robofish': '2021-06-18 13-09-14',
'hybridization_fname': 'Unknown-at-dict-generation-time',
'hybridization_number': 14,
'Hyb_time_A': 0.16,
'Hyb_time_B': 'None',
'Hyb_time_C': 'None',
'Hybmix_volume': 500,
'Imaging_temperature': 20.0,
'Fluidic_Program': 'EEL_barcoded',
'Readout_temperature': 22.0,
'Staining_temperature': 37.0,
'Start_date': '20210613',
'Target_cycles': 16,
'Species': 'Homo sapiens',
'Sample': 'SL001',
'Strain': 'None',
'Age': 'None',
'Tissue': 'Glioblastoma',
'Orientation': 'None',
'RegionImaged': 'None',
'SectionID': 'None',
'Position': 'None',
'Experiment_type': 'eel-barcoded',
'Chemistry': 'EELV2_corev2',
'Probes_FASTA': {
'Probes_Atto425': 'None',
'Probes_Cy3': 'HG2.fasta',
'Probes_Cy5': 'HG2.fasta',
'Probes_Cy7': 'None',
'Probes_DAPI': 'None',
'Probes_FITC': 'None',
'Probes_TxRed': 'None',
'Probes_Europium': 'None'},
'Barcode': 'True',
'Barcode_length': 16,
'Codebooks': {
'Codebook_DAPI': 'None',
'Codebook_Atto425': 'None',
'Codebook_FITC': 'None',
'Codebook_Cy3': 'codebookHG2_20210508.parquet',
'Codebook_TexasRed': 'None',
'Codebook_Cy5': 'gene_hGBM20201124.parquet',
'Codebook_Cy7': 'None',
'Codebook_Europium': 'None'},
'Multicolor_barcode': 'False',
'Stitching_type': 'both-beads',
'StitchingChannel': 'Europium',
'Overlapping_percentage': '8',
'channels': {'Code': 'C1H14',
'Chamber': 1,
'Hybridization': 'Hybridization14',
'DAPI': 'None',
'Atto425': 'None',
'FITC': 'None',
'Cy3': 'EELCy3-14',
'TxRed': 'None',
'Cy5': 'EEL647-14',
'Cy7': 'None',
'QDot': 'None',
'BrightField': 'None',
'Europium': 'Europium'},
'roi': '[[0, 529]]',
'Pipeline': 'eel-human-GBM',
'system_log': 'log_files/2021-Jun-15_21-24-21_ROBOFISH2.log'}
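The dump above suggests that each round configuration file is a plain pickled dictionary; if that holds, a quick way to inspect one is:

```python
# Inspect a round configuration file (file name taken from the listing above).
import pickle

with open("Count00014_JJEXP20210613_SL001_Section1_C1H14.pkl", "rb") as fh:
    round_info = pickle.load(fh)

print(round_info["round_code"], round_info["hybridization_number"])
print(round_info["channels"]["Cy5"])
```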
In order to be processed, each experiment folder must contain a documentation file: Experiment_Name_auto_config.yaml.
If the data are generated by a ROBOFISH machine, the experiment configuration file is automatically generated and included in the experiment folder. If the data are generated by another instrument, a minimal working configuration file can be created manually.
# JJEXP20210613_SL001_Section1_config.yaml
Age: None
Barcode: 'True'
Barcode_length: 16
Chamber_EXP: Chamber1
Chemistry: EELV2_corev2
# One codebook for each channel
# key is Codebook_+ channel name
# NB: use the channel name and not the optical config name
Codebooks:
Codebook_Atto425: None
Codebook_Cy3: codebookHG2_20210508.parquet
Codebook_Cy5: gene_hGBM20201124.parquet
Codebook_Cy7: None
Codebook_DAPI: None
Codebook_FITC: None
Codebook_TexasRed: None
Codebook_Europium: None
Description: GBM SL001 with HG1 and HG2 pools
EXP_name: JJEXP20210613_SL001_Section1
# Experiment type can be:
# eel-barcoded
# serial-smfish
Experiment_type: eel-barcoded
Heatshock_temperature: 22.0
Hyb_time_1_A: 0.16
Hyb_time_1_B: None
Hyb_time_1_C: None
Hyb_time_2_A: None
Hyb_time_2_B: None
Hyb_time_2_C: None
Hybmix_volume: 500
Imaging_temperature: 20.0
# Machine can be:
# ROBOFISH1
# ROBOFISH2
# ROBOFISH3
# NOT_DEFINED
Machine: ROBOFISH2
# Bool
Multicolor_barcode: 'False'
Operator: operator4
Orientation: None
Overlapping_percentage: '8'
# A pipeline is defined as a dictionary of predefined
# functions to run in the preprocessing and dots-calling
# steps.
# It can be one of the built-in pipelines.
# Other pipelines can be added to the package in
# pysmFISH.configuration_files.create_function_runner
# The currently available pipelines are:
# eel-human-embryo
# eel-human-GBM
# eel-human-adult-brain
# eel-mouse-brain
# smfish-serial-adult-human
# smfish-serial-mouse
# smfish-serial-controls-eel
Pipeline: eel-human-GBM
Position: None
Probes_FASTA:
Probes_FASTA_Atto425: None
Probes_FASTA_Cy3: HG2.fasta
Probes_FASTA_Cy5: HG2.fasta
Probes_FASTA_Cy7: None
Probes_FASTA_DAPI: None
Probes_FASTA_FITC: None
Probes_FASTA_TxRed: None
Probes_FASTA_Europium: None
Program: EEL_barcoded
Protocols_io: https://www.protocols.io/edit/eel-t92er8e
Readout_temperature: 22.0
RegionImaged: None
Sample: SL001
SectionID: None
Species: Homo sapiens
Staining_temperature: 37.0
Start_date: '20210613'
# Currently process Europium or DAPI
StitchingChannel: Europium
# Define the type of reference channel used
# for registering the FOVs in different rounds and
# for stitching the entire experiment
# can be:
# small-beads
# large-beads
# both-beads
# nuclei
Stitching_type: both-beads
Strain: None
Stripping_temperature: 22.0
Target_cycles: 16
Tissue: Glioblastoma
roi: '[[0, 529]]'
# Minimal working experiment configuration file
Barcode: 'True'
Barcode_length: 16
Chemistry: EELV2_corev2
# One codebook for each channel
# key is Codebook_+ channel name
# NB: use the channel name and not the optical config name
Codebooks:
- Codebook_Atto425: None
- Codebook_Cy3: codebookHG2_20210508.parquet
- Codebook_Cy5: gene_hGBM20201124.parquet
- Codebook_Cy7: None
- Codebook_DAPI: None
- Codebook_FITC: None
- Codebook_TexasRed: None
EXP_name: LBEXP20210428_EEL_HE_3370um
Start_date: '20210428'
# Experiment type can be:
# eel-barcoded
# serial-smfish
Experiment_type: eel-barcoded
# Machine can be:
# ROBOFISH1
# ROBOFISH2
# ROBOFISH3
# NOT_DEFINED
Machine: ROBOFISH2
Operator: lars
Orientation: Sagittal
Overlapping_percentage: '8'
# A pipeline is defined as a dictionary of predefined
# functions to run in the preprocessing and dots-calling
# steps.
# It can be one of the built-in pipelines.
# Other pipelines can be added to the package in
# pysmFISH.configuration_files.create_function_runner
# The currently available pipelines are:
# eel-human-embryo
# eel-human-GBM
# eel-human-adult-brain
# eel-mouse-brain
# smfish-serial-adult-human
# smfish-serial-mouse
# smfish-serial-controls-eel
Pipeline: eel-human-embryo
# NB: use the channel name and not the optical config name
Probes_FASTA:
Probes_FASTA_Atto425: None
Probes_FASTA_Cy3: HG2.fasta
Probes_FASTA_Cy5: HG2.fasta
Probes_FASTA_Cy7: None
Probes_FASTA_DAPI: None
Probes_FASTA_FITC: None
Probes_FASTA_TxRed: None
# Define the type of reference channel used
# for registering the FOVs in different rounds and
# for stitching the entire experiment
# can be:
# small-beads
# large-beads
# both-beads
# nuclei
Stitching_type: both-beads
# Currently process Europium or DAPI
StitchingChannel: Europium
# Bool
Multicolor_barcode: 'False'
Species: Homo sapiens
Age: 7W1D PCA
Strain: None
Tissue: Head
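A minimal sketch (not the pipeline's QC step) of loading a configuration file like the ones above with PyYAML and checking that a few keys are present; the required-keys list below is an illustrative subset.

```python
import yaml

# Illustrative subset of keys; the pipeline's own QC step defines the authoritative list.
REQUIRED_KEYS = ["EXP_name", "Experiment_type", "Machine", "Pipeline",
                 "Stitching_type", "StitchingChannel", "Codebooks", "Probes_FASTA"]

with open("JJEXP20210613_SL001_Section1_config.yaml") as fh:
    experiment_config = yaml.safe_load(fh)

missing = [key for key in REQUIRED_KEYS if key not in experiment_config]
if missing:
    raise KeyError(f"experiment configuration is missing the keys: {missing}")
print(experiment_config["Pipeline"], experiment_config["Stitching_type"])
```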
The dataset (XXX add link to location of an example dataset XXX) is a pandas dataframe containing all the metadata corresponding to a specific field of view. In experiments generated by ROBOFISH machines the dataset is generated automatically.
The dataset contains all the metadata parsed from the .nd2 files, the matching .pkl files and the experiment-name_config.yaml file. This facilitates the reprocessing of the images because all the information required for processing is saved in one single file.
The dataset is created using the Dataset class of the data_models module. The Dataset is created by collecting the metadata from the .zmetadata aggregated .json file of the zarr container with the parsed raw data. If the .zmetadata file is not available but the field-of-view-related metadata are saved in single files, it is possible to build the Dataset using the Dataset.create_full_dataset_from_files utility function. The function currently supports only .pkl files but can be expanded to combine different file types (a loading sketch is shown after the field list below).
The dataset contains the following experimental info:
# Experiment related properties
experiment_name
experiment_type
probe_fasta_name
total_fovs
# Sample related properties
species
strain
tissue
# Data generator related properties
operator
machine
start_date
# FOV related properties
fov_name
channel
round_num
target_name
fov_acquisition_coords_x
fov_acquisition_coords_y
fov_acquisition_coords_z
img_height
img_width
overlapping_percentage
pixel_microns
zstack # Number of z-planes
# Processing related properties
barcode # True or False
barcode_length
codebook
pipeline
processing_type
stitching_channel
stitching_type
# Data storage related properties
grp_name # zarr group name of raw data
raw_data_location
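Once the dataset dataframe has been loaded into pandas it can be inspected per FOV as sketched below; the .parquet file name is purely illustrative, since the storage format and location depend on how the dataset was saved.

```python
import pandas as pd

# Hypothetical file name; adapt to where the dataset was actually saved.
dataset = pd.read_parquet("LBEXP20210718_EEL_Mouse_448_2_dataset.parquet")

# One row is expected per fov / channel / hybridization round.
per_fov = dataset.groupby(["fov_name", "channel", "round_num"]).size()
print(per_fov.head())

# Metadata shared by all the FOVs.
print(dataset["experiment_name"].unique())
print(dataset["stitching_channel"].unique())
```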
The pipeline module contains the classes used to run processing pipelines. Currently there is only one class (Pipeline) that can be used to process eel and smfish data.
Create the folder structure used for the data processing. If a folder is already present it won't be overwritten.
FOLDER STRUCTURE:
codebook: contains the codebooks used for the analysis.
original_robofish_logs: contains all the original robofish logs.
extra_files: contains the extra files acquired during imaging.
extra_processing_data: contains extra files used in the analysis, like the dark images for flat field correction.
pipeline_config: contains all the configuration files.
raw_data: contains the renamed .nd2 files and the corresponding .pkl metadata files.
output_figures: contains the reports and visualizations.
notebooks: contains potential notebooks used for processing the data.
probes: contains the fasta file with the probes used in the experiment.
fresh_tissue: contains the images and the processed data obtained from imaging the fresh tissue before eel processing.
logs: contains the dask and htcondor logs.
microscope_tiles_coords: contains the coords of the FOVs according to the microscope stage.
results: contains all the processing results.
Collect the hash corresponding to the version of the pipeline used to process the data. The output is saved in a git_info.yaml file in the results folder. Runs on 1 CPU on the main node.
This step makes sure that the content of the experiment-name_config.yaml file is complete. This file contains the metadata required for the processing. Runs on 1 CPU on the main node.
In order to be able to process large data in a feasible amount of time we parallelised many of the steps of the pipeline. To scale out the processing we use dask. Dask takes care of scheduling and distributing the workload to the workers. The workers can be on the same machine or on different computational units. We decided to use dask because it is able to run locally, on our on-premises cluster or in the cloud. We currently run dask using processes (one CPU per worker) and not threads.
- LOCAL: when running locally the pipeline uses (see the local cluster sketch after this list):
  number_cpus: defined in the pipeline parameters
  memory_limit: defined in the pipeline parameters
  processes: True
  threads_per_worker: 1
- HTCONDOR: the workload on our cluster is managed by HTCondor. The cluster has a shared file system. In order to facilitate the creation of a dask cluster on top of HTCondor we use dask-jobqueue.
  The values of the parameters for setting up dask on our cluster are strictly dependent on the configuration of our HTCondor installation.
  We use an adaptive cluster (it increases/decreases in size depending on the load) with a minimum of 1 and a maximum of 15 jobs (NB: in dask-jobqueue terminology jobs = workers). Each of the jobs can have a custom-defined number of cores. In our setup the default number is 20 and the total memory for all the workers is 200 GB.
  Summary:
  jobs (workers): 1:15
  cores / worker: 20
  RAM / worker: 200 GB
  cluster.scheduler.allowed_failures: 1000
  By specifying these parameters we override the standard configuration files that are written on the processing machine after installation. To better understand the standard configuration refer to the following doc page: dask configuration docs. We included a copy of our configuration files in the code repository: pysmFISH/docs/dask-general-config-files
  IMPORTANT: during our data processing we noted that if the processes run for too long (e.g. blocking the GIL for a long time) the probability of losing the connection between the dask scheduler and the workers is quite high. This results in a hanged or crashed processing run. To avoid the issue it is better to combine the long processing steps in chunks. In our particular case the issue surfaces when processing the different FOVs. Therefore, instead of building an eel-processing graph that processes all the FOVs, we first chunk the FOVs in groups (usually 40-50 FOVs in 1-channel EEL 16-bit experiments) and then build and process a graph for each chunk.
- Unmanaged Cluster: use a cluster where job submission is not controlled by a workload manager (e.g. HTCondor, SLURM, PBS). Job scheduling is done using dask's SSHCluster. The parameters used for setting up the SSHCluster are strictly dependent on the configuration of the unmanaged cluster available.
  Example settings for MONOD: currently in MONOD four nodes are not managed by HTCondor (monod10, monod11, monod12, monod33). In our current setup the main node hosts the scheduler. It is possible to host the scheduler on a node as well.
  nprocs: 40
  memory: "6GB"
  nthreads: 1
  scheduler_port: 23875
  dashboard_port: 25399
  scheduler_address: 'localhost'
  workers_addresses_list = ['monod10','monod11','monod12','monod33']
  IMPORTANT: because of a bug in the SSHCluster it is not enough to shut down the client and the cluster objects
  running_pipeline_name.client.close()
  running_pipeline_name.cluster.close()
  but the processes must also be killed using the processing_cluster_setup.kill_process() function
  running_pipeline_name.client.close()
  running_pipeline_name.cluster.close()
  if running_pipeline_name.processing_engine == 'unmanaged_cluster':
      processing_cluster_setup.kill_process()
  If the processing fails and the cluster doesn't get killed you need to manually shut it down:
  # In the main node list the dask distributed processes
  ps -fA | grep distributed.cli.dask.scheduler
  kill process-number
  # For each of the nodes used, ssh into the node and kill all the python processes
  pkill -f python
  NOTE: when using an unmanaged cluster it is possible to reuse the cluster for additional processing. You need to pass the dask scheduler address to the scheduler_address parameter when a pipeline is initialized.
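A minimal sketch of starting a local dask cluster with the LOCAL settings listed above (processes-based workers, one thread per worker); the worker count and memory limit stand in for the values passed in the pipeline parameters.

```python
from dask.distributed import Client, LocalCluster

cluster = LocalCluster(n_workers=8,          # number_cpus from the pipeline parameters
                       threads_per_worker=1,
                       processes=True,
                       memory_limit="6GB")   # memory_limit from the pipeline parameters
client = Client(cluster)
print(client.dashboard_link)                 # dashboard to monitor the processing

# ... run the processing here ...

client.close()
cluster.close()
```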
The goal of the parsing step is to convert the raw .nd2 files acquired by the microscope, which cannot be accessed in parallel, into a format that is more parallel-friendly and has better compression. We decided to use zarr. The parsing outputs a zarr container named experiment-name_parsed-image-tag.zarr, such as LBEXP20210718_EEL_Mouse_448_2_img_data.zarr. The zarr file contains a group for each FOV / round of hybridization / channel position. Each group contains a dataset with the raw image stack. The dataset is chunked along the z-dimension (chunk=(1, img.shape[0], img.shape[1])). The parsed metadata are stored in the .zattrs .json file associated with the dataset (a layout sketch is shown after the file structure below).
The graph is built using dask.delayed
Parsed raw data file structure
File Name: LBEXP20210718_EEL_Mouse_448_2_img_data.zarr
- Group name LBEXP20210718_EEL_Mouse_448_2_Hybridization01_Cy5_fov_10
- Dataset name: raw_data_fov_10:
- File related metadata parsed from .nd2: .zattrs
- Group name LBEXP20210718_EEL_Mouse_448_2_Hybridization01_Cy5_fov_21
- Dataset name: raw_data_fov_21:
- File related metadata parsed from .nd2: .zattrs
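The sketch below mimics the layout described above with plain zarr calls (placeholder image data, group names taken from the example listing); it is an illustration of the container structure, not the package's parsing code.

```python
import numpy as np
import zarr

store = zarr.DirectoryStore("LBEXP20210718_EEL_Mouse_448_2_img_data.zarr")
root = zarr.group(store=store)

img_stack = np.zeros((3, 256, 256), dtype=np.uint16)   # placeholder raw stack (z, y, x)
grp = root.create_group("LBEXP20210718_EEL_Mouse_448_2_Hybridization01_Cy5_fov_10")
dset = grp.create_dataset("raw_data_fov_10", data=img_stack,
                          chunks=(1, img_stack.shape[1], img_stack.shape[2]))
dset.attrs["channel"] = "Cy5"      # metadata parsed from the .nd2 file go in .zattrs
dset.attrs["fov_num"] = 10

# Combine all the .zattrs into the single .zmetadata file used to build the dataset.
zarr.consolidate_metadata(store)
```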
Overview of the processing graph
Parsing types:
- no_parsing: skip the parsing step
- original: parse the files originated by ROBOFISH, rename the original files and store them in the raw_data subfolder
- reparsing_from_processing_folder: reparse the renamed raw files stored in the raw_data subfolder, still in the experiment folder
- reparsing_from_storage: reparse the renamed raw files stored in the raw_data subfolder after the raw_data has been moved to storage
Parsing original files:
- Collect all processing files: gather codebooks and probes from the storage folders.
- Sort data into folders: sort the files created by ROBOFISH into subfolders.
- Select raw data: identify the .nd2 raw microscopy files generated by the ROBOFISH machine. Important: the files must contain CountXXXXX in the name.
- QC_matching_nd2_metadata_robofish: this function is used to check that each of the .nd2 files generated by the microscope has a matching .pkl metadata file generated by ROBOFISH.
- Parse .nd2 files: this function is parallelised. Each .nd2 file is processed by a different CPU and the output is written in the same zarr file container. The metadata related to the imaging and stored in the .nd2 files are also collected during the parsing.
- Consolidate the zarr metadata: each dataset in the zarr container has the metadata stored in a .zattrs file. In order to accelerate the access to the information contained in these files, all the metadata are combined in a single .json file named .zmetadata and stored in the zarr container.
Reparsing from processing_folder/storage:
- Define the raw data folder: in order to reparse the data you need to identify the location of the parsed and renamed raw_data folder.
- Select renamed raw data: function used to identify the .nd2 files in a folder. The files do not need to have CountXXX in the name.
The dataset is a pandas dataframe that contains, for each field of view, all the metadata parsed from the .nd2 files, the matching .pkl files and the experiment-name_config.yaml file. This facilitates the reprocessing of the images because all the information required for processing is saved in one single file. The dataset dataframe matches the organisation of the parsed images zarr container. A subset of metadata common to all the FOVs is stored in the metadata attribute.
metadata:
- list_all_fovs
- list_all_channels
- total_rounds
- stitching_channel
- img_width
- img_height
- img_zstack
- pixel_microns
- experiment_name
- overlapping_percentage
- machine
- barcode_length
- processing_type
- experiment_type
- pipeline
- stitching_type
- list_all_codebooks
Load or create the analysis_config.yaml file with all the parameters for running the analysis. The step first loads the analysis_config.yaml file present in the pipeline_config folder. If the file is not present, it creates one using the master template stored in the config_db directory, selecting the parameters according to the Machine and Experiment_type info present in the dataset file.
Example of an analysis_config.yaml file for the processing of an eel-barcoded experiment of a human sample:
fish:
PreprocessingFishFlatFieldKernel:
- 1
- 100
- 100
PreprocessingFishFilteringSmallKernel:
- 1
- 8
- 8
PreprocessingFishFilteringLaplacianKernel:
- 0.02
- 0.01
- 0.01
CountingFishMinObjDistance: 1
CountingFishMaxObjSize: 200
CountingFishMinObjSize: 1
CountingFishNumPeaksPerLabel: 20
LargeObjRemovalPercentile: 95
LargeObjRemovalMinObjSize: 100
LargeObjRemovalSelem: 3
both-beads:
PreprocessingFishFilteringSmallKernel:
- 1
- 8
- 8
PreprocessingFishFilteringLaplacianKernel:
- 0.02
- 0.01
- 0.01
PreprocessingFishFlatFieldKernel:
- 1
- 100
- 100
CountingFishMinObjDistance: 5
CountingFishMaxObjSize: 600
CountingFishMinObjSize: 10
CountingFishNumPeaksPerLabel: 1
LargeObjRemovalPercentile: 95
LargeObjRemovalMinObjSize: 100
LargeObjRemovalSelem: 3
staining:
PreprocessingStainingFlatFieldKernel:
- 2
- 100
- 100
fresh-tissue:
nuclei:
PreprocessingFreshNucleiLargeKernelSize:
- 5
- 50
- 50
beads:
PreprocessingFishFlatFieldKernel:
- 100
- 100
CountingFishMinObjDistance: 2
CountingFishMaxObjSize: 200
CountingFishMinObjSize: 2
CountingFishNumPeaksPerLabel: 1
BarcodesExtractionResolution: 2
RegistrationReferenceHybridization: 1
RegistrationTollerancePxl: 3
RegistrationMinMatchingBeads: 5
This step is used to determine how the field of views are organised in the imaged region and to identify the coords of the overlapping regions between the tiles. The stage-based coords are then normalised to an image notation (numpy matrix notation) in which the top-left corner of the image is the origin (0,0), the x-axis goes left→right (columns of the numpy matrix) and the y-axis goes top→bottom (rows of the numpy matrix). The process of normalisation depends on the stage and the camera orientation in the machine. The coords systems of the machines used in the Linnarsson lab are different.
The identification of the organisation of the FOVs in the composite image could be simplified if the (0,0) coords of the stage/camera were set to the same position for all machines. In our case we started running experiments with the coords not adjusted and that's why the position of (0,0) is different for all the machines that are used to generate the data.
It is possible to assign a specified orientation in the camera tab of NIS Elements.
The tile organisation is saved as image_space_tiles_organization.png
in the output_figures
subfolder.
Modify the normalize_coords method of pysmFISH.stitching.organize_square_tile. The easiest way is to add the normalisation function to the NON_DEFINED machine. Otherwise add your machine name (the same name of the machine included in the experiment_config.yaml file). A generic sketch of such a normalisation is shown below.
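The sketch below is a generic illustration (not the actual normalize_coords method) of the kind of transformation involved: flip the stage axes according to a per-machine orientation table and move the origin to the top-left tile. The flip table is hypothetical.

```python
import numpy as np

# Hypothetical per-machine (row_flip, col_flip) orientation table.
MACHINE_AXIS_FLIPS = {
    "ROBOFISH1": (1, -1),
    "ROBOFISH2": (-1, -1),
    "NOT_DEFINED": (1, 1),
}


def normalize_stage_coords(stage_coords: np.ndarray, machine: str) -> np.ndarray:
    """stage_coords: (n_fovs, 2) array of (x, y) stage positions."""
    row_flip, col_flip = MACHINE_AXIS_FLIPS.get(machine, (1, 1))
    # Image notation: rows grow top->bottom (y), columns grow left->right (x).
    image_coords = np.column_stack([stage_coords[:, 1] * row_flip,   # rows
                                    stage_coords[:, 0] * col_flip])  # columns
    image_coords = image_coords - image_coords.min(axis=0)           # origin at (0, 0)
    return image_coords


# normalize_stage_coords(np.array([[100.0, 200.0], [612.0, 200.0]]), "ROBOFISH2")
```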
The pysmFISH pipeline is used to run the analysis of a variety of samples. Each type of tissue and/or organism (mouse vs. human) shows different properties, such as background fluorescence or the presence of lipofuscin. Therefore the image filtering and the dots calling are done using different methodologies. In addition, different types of reference images can be used for image registration. We currently support nuclei (for smFISH-serial) or Europium beads (small 200 nm and large beads; options: small-beads, large-beads, both-beads) for both eel and smFISH experiments.
The running functions are combined into a dictionary with the following structure:
{processing step name: processing function name}
example of running dictionary:
running_functions = {
'fish_channels_preprocessing':'filter_remove_large_objs',
'fish_channels_dots_calling':'osmFISH_peak_based_detection_fast',
'fresh_sample_reference_preprocessing':'large_beads_preprocessing',
'fresh_sample_reference_dots_calling':'osmFISH_peak_based_detection_fast',
'fresh_sample_nuclei_preprocessing':'fresh_nuclei_filtering'}
In order to run a specific function we use the [getattr()](https://docs.python.org/3/library/functions.html#getattr) Python built-in function.
example:
running_functions = {
'fish_channels_preprocessing':'filter_remove_large_objs',
'fish_channels_dots_calling':'osmFISH_peak_based_detection_fast',
'fresh_sample_reference_preprocessing':'large_beads_preprocessing',
'fresh_sample_reference_dots_calling':'osmFISH_peak_based_detection_fast',
'fresh_sample_nuclei_preprocessing':'fresh_nuclei_filtering'}
# preprocess smFISH signal
filtering_fun = running_functions['fish_channels_preprocessing']
filt_out = getattr(pysmFISH.preprocessing,filtering_fun)(
zarr_grp_name,
parsed_raw_data_fpath,
processing_parameters,
dark_img)
The groups of functions used to process different types of tissue are hard-coded in pysmFISH.configuration_files.create_function_runner, and during the pipeline run the group of functions to run is specified by the Pipeline parameter in the experiment_config.yaml file. We currently have the following processing configurations (possible Pipeline values):
eel-human-GBM: to process human glioblastoma
eel-human-adult-brain: to process adult human brain
eel-human-embryo: to process human embryonic tissue
eel-mouse-brain: to process adult mouse
smfish-serial-adult-human: to process adult human
smfish-serial-mouse: to process adult mouse
smfish-serial-controls-eel: to process control experiments for eel paper
IMPORTANT
The image processing and the dots calling functions have a standard structure in order to allow quick iteration for the testing of new processing steps and integration in the preprocessing graph
Image processing function signature
def filtering_fun(
zarr_grp_name: str,
parsed_raw_data_fpath: str,
processing_parameters: dict,
dark_img: np.ndarray)-> Tuple[Tuple[np.ndarray,],dict]
# example of output:
((img,),metadata) # img is the processed img
((masked_img,img),metadata) # img is the processed img and masked_img is the masked image
Dots calling function signature
def counting_fun(ImgStack: np.ndarray,
fov_subdataset: pd.Series,
parameters_dict: dict,
dimensions: int=2,
stringency:int =0,
min_int:float=False,
max_int:float=False,
min_peaks:int=False)->pd.DataFrame
# example of output
counts_df # Pandas dataframe with the counts
This step builds the processing graph for eel-barcoded experiments and is parallelised by field of view.
IMPORTANT: because some of the processing steps take quite a bit of time, it is necessary to process the FOVs in chunks to avoid that the processes fail (workers get lost and stop communicating with the scheduler).
On our system (when it is not too busy) we are able to build and process 50 FOVs for a 3-color (counting the reference channel) eel-barcoded experiment without processing issues. A separate processing task graph is built and processed for each chunk. The parallelisation is done by fov / channel / hybridization (a single group in the .zarr raw data container); however, the graph is built maintaining the relation to the fov number (see the example graph below).
The graph is built using dask.delayed
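A minimal sketch of the chunking strategy (the placeholder functions filter_fov and count_fov stand in for the pipeline's running functions): each chunk of FOVs gets its own small delayed graph that is computed before the next chunk is built.

```python
import dask
from dask import delayed


def filter_fov(fov):           # placeholder for the preprocessing running function
    return f"filtered_{fov}"


def count_fov(filtered):       # placeholder for the dots-calling running function
    return f"counts_{filtered}"


all_fovs = list(range(200))
chunk_size = 50                # ~40-50 FOVs per chunk works well in our experience

for start in range(0, len(all_fovs), chunk_size):
    chunk = all_fovs[start:start + chunk_size]
    tasks = [delayed(count_fov)(delayed(filter_fov)(fov)) for fov in chunk]
    results = dask.compute(*tasks)   # each chunk finishes before the next graph is built
```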
- Create empty zarr file: creates the zarr container used to save the processed data that will be stored in the off-site backup directory.
- Create dark image: function used to generate the image containing the camera noise. The image is acquired as a .nd2 file labeled Blank*.nd2 and is stored in the extra_files subfolder. The file contains a stack of images acquired with no illumination in order to evaluate the camera noise. If the file is present in the extra_files folder, the median is calculated and saved as experiment_name + machine_name + '_dark_img.npy' in the extra_processing_files subfolder. If the file is not present, the machine + '_dark_img.npy' present in the config_db folder is copied into the extra_processing_files subfolder.
- Load dark image: the dark image is loaded as a .npy file and converted into a delayed object in order to avoid sending the data separately for each function call.
- Load codebook: load the .parquet codebook file and convert it into a delayed object.
- Chunk the FOVs according to the size defined in the pipeline call. A task graph is built for each chunk. The parallelisation is done by fov / channel / hybridization (a single group in the .zarr raw data container); however, the graph is built maintaining the relation to the fov number.
- Filtering: each group in the .zarr raw data container that is part of the processing chunk is preprocessed and the z-stack is flattened. The type of filtering function applied depends on the pipeline selected. The filtering parameters are loaded from the analysis_config.yaml file.
- Counting: the output of the filtering function is fed to the counting step. The type of counting function applied depends on the pipeline selected. The counting parameters are loaded from the analysis_config.yaml file.
- Concatenate counts: all the dataframes with counts corresponding to a specific fov are concatenated together.
- Register reference channel (beads): the different rounds of the reference channel are registered to the reference round defined by the RegistrationReferenceHybridization parameter in the analysis_config.yaml file. The registration is fft based, using a synthetic image reconstructed from the coords of the beads (a triangulation-based approach has been tested but using fft is much more time efficient). The quality of the output is evaluated by determining the minimum number of matching beads with a distance equal to or below RegistrationTollerancePxl between the reference image and the translated round. The quality measure is stored in the output dataframe in the min_number_matching_dots_registration field.
- Register the fish channels: the shift between rounds calculated for the reference channel is used to correct the coords of the dots identified in the smFISH images.
- Decode the barcodes: we use a nearest neighbour approach to determine the barcode. The decoding happens in two steps: identification of all possible barcodes followed by decoding of the barcodes (see the decoding sketch after this list).
  - Identification of all barcodes: we start the processing from the reference round defined by the RegistrationReferenceHybridization parameter in the analysis_config.yaml file (default is round 1). Using a nearest neighbour approach with the euclidean distance metric we identify all the peaks of the remaining rounds that form a barcode which starts with a positive bit in the reference round. Two dots are neighbours if their distance is equal to or below BarcodesExtractionResolution pixels (default 2). The dots forming a barcode are removed and the process is repeated until all the barcodes with positive bits starting in all the rounds (besides the last) are identified.
  - Decoding: we map the identified barcodes to the codebook using a nearest neighbour approach with the hamming distance metric. The barcodes are not filtered by selecting a predefined hamming distance. All the barcodes are stored in the pandas.DataFrame saved in the results subfolder. This processing step can be optimised for performance, e.g. by mapping only barcodes with all positive bits. In addition, running the code on GPU would significantly increase the processing speed.
- Stitch microscope coords: the coords of the smFISH dots and the reference channel are mapped to the stage coords. This stitching is used for a quick evaluation of the results.
- Consolidate the zarr metadata: consolidate the metadata of the processed files. Each dataset in the zarr container has the metadata stored in a .zattrs file. In order to accelerate the access to the information contained in these files, all the metadata are combined in a single .json file named .zmetadata and stored in the zarr container.
- Simple output plotting: utility function used to create a pandas dataframe with a simplified version of the eel analysis output that can be used for a quick visualisation. The type of stitching coords to be visualised can be selected using the stitching_selected parameter.
stitching_selected: microscope_stitched
# Example of columns of the output dataframe
fov_num
r_px_microscope_stitched
c_px_microscope_stitched
decoded_genes
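As referenced in the decoding step above, this is an illustrative sketch (not the package's implementation) of mapping extracted barcodes to a codebook with a nearest-neighbour search using the hamming metric; the toy codebook and barcodes are made up.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors

# Toy codebook: one 16-bit barcode per gene.
codebook = pd.DataFrame({
    "Gene": ["GeneA", "GeneB"],
    "Barcode": [np.array([1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1], dtype=np.int8),
                np.array([1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1], dtype=np.int8)],
})
codebook_bits = np.vstack(codebook["Barcode"].values)

# Barcodes extracted from the registered rounds: one exact match and one with a 1-bit error.
noisy = codebook_bits[1].copy()
noisy[0] = 1 - noisy[0]
extracted_bits = np.vstack([codebook_bits[0], noisy])

nn = NearestNeighbors(n_neighbors=1, metric="hamming").fit(codebook_bits)
hamming_dist, idx = nn.kneighbors(extracted_bits)

decoded = pd.DataFrame({
    "decoded_genes": codebook["Gene"].values[idx.ravel()],
    # sklearn returns the fraction of differing bits; convert back to a bit count
    "hamming_distance_bits": hamming_dist.ravel() * codebook_bits.shape[1],
})
print(decoded)   # GeneA with 0 mismatching bits, GeneB with 1 mismatching bit
```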
The main outputs from this step are:
- the zarr container with the filtered images
- counts and dots properties stored as .parquet files in the results subfolder. The data of each fov are saved as a single file. Single-fov files in combination with dask.dataframe make the parallelisation of the downstream processes quite efficient.
We create a dask graph to remove duplicated dots in parallel (a minimal sketch is shown after the list below). The stitching_selected parameter can be used to define which coords to use for the removal of the overlapping dots.
- Using the tiles organisation info we determine which FOVs are overlapping and calculate the coords of the overlapping regions.
- For each overlapping region we identify the overlapping dots for all the genes. To identify matching dots we use a nearest neighbour approach. The search radius is defined by the same_dot_radius_duplicate_dots parameter, which can be specified in the pipeline definition. The default value is 5 pixels. The overlapping dots are not removed right after being identified because the same fov can be part of two different overlapping couples and we may end up modifying the same result file in two different processes.
- All the dots to remove are combined by FOV and then removed in parallel.
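A minimal sketch (not the package's code) of the duplicate identification inside one overlapping region: dots of the second FOV within same_dot_radius_duplicate_dots pixels of a dot of the first FOV are flagged for removal. The coordinates are made up.

```python
import numpy as np
from scipy.spatial import cKDTree

same_dot_radius_duplicate_dots = 5  # pixels (default mentioned above)

# Stitched (row, col) coords of the dots of one gene inside the overlapping region.
dots_fov_a = np.array([[10.0, 12.0], [40.0, 41.0], [80.0, 15.0]])
dots_fov_b = np.array([[11.0, 13.0], [200.0, 180.0]])

tree = cKDTree(dots_fov_a)
distances, _ = tree.query(dots_fov_b, k=1,
                          distance_upper_bound=same_dot_radius_duplicate_dots)
# query returns inf when no neighbour falls inside the search radius.
duplicated_idx = np.where(np.isfinite(distances))[0]
print(duplicated_idx)   # -> [0]: the first dot of FOV B duplicates a dot of FOV A
```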
Each of the output dataframes contains the information relative to the quality of the registration between rounds (min_number_matching_dots_registration field). The quality of the registration is extremely important for the calling of the barcodes. The minimum acceptable number of matching Europium dots is defined by the RegistrationMinMatchingBeads parameter in the analysis_config.yaml file.
Some of the registration issues that are identified are mapped in the Registration_errors class in the errors module and reported in the min_number_matching_dots_registration field in the counts dataframe.
Example
If the reference channel has no counts, the error reported in min_number_matching_dots_registration is -1, there won't be any registered counts and the tile will be completely empty in the plotting.
- Using dask.dataframe we collect the min_number_matching_dots_registration for all fovs. The error data are saved as a registration_error.parquet file in the results subfolder (a collection sketch is shown after this list).
- Plot all the errors as a scatterplot. The scatterplot is saved as registration_error.png in the output_figures subfolder.
  Color code of the dots:
  - black: fish channel dataframe has no counts (-6)
  - dimgrey: registration channel dataframe has no counts (-5)
  - silver: missing counts in the reference round (-4)
  - orange: missing counts in one of the rounds (-3)
  - green: number of beads matching after registration is below the tolerance counts (-2)
  - steelblue: slide with no issues
  Numbers in the dots:
  - Top number in the circle: FOV number
  - Middle number: round with the lowest number of beads
  - Bottom string: number of matching beads in the round with the lowest number of beads
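A sketch of the collection step above using dask.dataframe and matplotlib; the file pattern and column names follow this description, but the exact schema of the per-FOV .parquet files may differ.

```python
import dask.dataframe as dd
import matplotlib.pyplot as plt

# Hypothetical file pattern for the per-FOV counts saved in the results subfolder.
counts = dd.read_parquet("results/*_decoded_fov_*.parquet")

errors = (counts.groupby("fov_num")["min_number_matching_dots_registration"]
          .min().compute().to_frame())
errors.to_parquet("results/registration_error.parquet")

fig, ax = plt.subplots()
ax.scatter(errors.index, errors["min_number_matching_dots_registration"])
ax.set_xlabel("fov_num")
ax.set_ylabel("min_number_matching_dots_registration")
fig.savefig("output_figures/registration_error.png")
```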
To better determine the position of the tiles we use the dots in the overlapping regions between tiles to determine the shift between the tiles. All the overlapping regions are processed in parallel (a sketch of the shift estimation is shown below). If the registration for a certain overlapping region fails:
shift = np.array([1000,1000])
registration[cpl] = [shift, np.nan]
After determining the shift between overlapping regions we run a global registration step based on a linear regression approach to refine the shift between the tiles. If the registration for a specific tile fails, we infer the expected position according to the outcome of the global minimization.
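An illustrative sketch (not the pipeline's code) of estimating the shift of one overlapping region from dot coordinates: the dots are rendered into synthetic binary images and the translation is recovered with an fft-based phase correlation from scikit-image.

```python
import numpy as np
from skimage.registration import phase_cross_correlation


def shift_from_dots(dots_ref: np.ndarray, dots_mov: np.ndarray, shape=(256, 256)):
    """dots_*: (n, 2) arrays of (row, col) coords inside the overlapping region."""
    img_ref = np.zeros(shape)
    img_mov = np.zeros(shape)
    img_ref[tuple(np.round(dots_ref).astype(int).T)] = 1
    img_mov[tuple(np.round(dots_mov).astype(int).T)] = 1
    shift, error, _ = phase_cross_correlation(img_ref, img_mov, upsample_factor=10)
    return shift, error


dots_ref = np.array([[20.0, 30.0], [100.0, 150.0], [200.0, 60.0]])
dots_mov = dots_ref + np.array([3.0, -2.0])         # simulate a tile shift
print(shift_from_dots(dots_ref, dots_mov))          # recovers the simulated offset
```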
Same approach used in the remove overlapping dots (microscope stitched coords) graph step, but the stitching_selected parameter is set to global_stitched and the tile coords and overlapping regions are recalculated according to the newly registered microscope coords.
Processing graph for the low magnification images (40X) of the tissue nuclei acquired before EEL, used for segmentation and identification of the cells. The processing of the fresh tissue follows the same approach used for processing fish experiments (same folder structure).
- Parsing of the raw images (if required).
- Create the fresh nuclei images dataset.
- Process and counts the reference beads.
- Create a beads images dataset.
- Process the nuclei (background removal and flattening of the images).
- Stitch the nuclei and adjust reference beads coords.
XXXXXXXXXXX
This step builds the processing graph for smFISH-serial experiments and is parallelised by field of view. smFISH experiments may use nuclei or beads to register the different rounds of hybridization and stitch the field of views.
IMPORTANT
Because some of the processing steps take quite a bit of time, it is necessary to process the FOVs in chunks to avoid that the processes fail (workers get lost and stop communicating with the scheduler).
The parallelisation is done by fov / channel / hybridization (a single group in the .zarr raw data container); however, the graph is built maintaining the relation to the fov number (see the example graph below).
The graph is built using dask.delayed
- Create empty zarr file: creates the zarr container used to save the processed data that will be stored in the off-site backup directory.
- Create dark image: function used to generate the image containing the camera noise. The image is acquired as a .nd2 file labeled Blank*.nd2 and is stored in the extra_files subfolder. The file contains a stack of images acquired with no illumination in order to evaluate the camera noise. If the file is present in the extra_files folder, the median is calculated and saved as experiment_name + machine_name + '_dark_img.npy' in the extra_processing_files subfolder. If the file is not present, the machine + '_dark_img.npy' present in the config_db folder is copied into the extra_processing_files subfolder.
- Load dark image: the dark image is loaded as a .npy file and converted into a delayed object in order to avoid sending the data separately for each function call.
- Chunk the FOVs according to the size defined in the pipeline call. A task graph is built for each chunk. The parallelisation is done by fov / channel / hybridization (a single group in the .zarr raw data container); however, the graph is built maintaining the relation to the fov number.
- Preprocessing: the type of preprocessing depends on whether we need to process nuclei or smFISH images.
  Nuclei
  - Single fov round processing serial nuclei: run the filtering using conditions specific for nuclei (e.g. a large filtering kernel).
  - Collect output: all the filtered nuclei images corresponding to all hybridization rounds for a field of view are collected together.
  smFISH
  - Filtering: each group in the .zarr raw data container that is part of the processing chunk is preprocessed and the z-stack is flattened. The type of filtering function applied depends on the pipeline selected. The filtering parameters are loaded from the analysis_config.yaml file.
  - Counting: the output of the filtering function is fed to the counting step. The type of counting function applied depends on the pipeline selected. The counting parameters are loaded from the analysis_config.yaml file.
  - Concatenate counts: all the dataframes with counts corresponding to a specific fov are concatenated together.
- Registration: the type of registration depends on whether the stitching type corresponds to nuclei or beads.
  Nuclei
  - Combine filtered images: the collected images are combined into a single z-stack in which the position corresponds to hybridization round - 1.
  - Nuclei based registration: fft-based registration of the nuclei and correction of the smFISH coords.
  Beads
  - Beads based registration: the different rounds of the reference channel are registered to the reference round defined by the RegistrationReferenceHybridization parameter in the analysis_config.yaml file. The registration is fft based, using a synthetic image reconstructed from the coords of the beads (a triangulation-based approach has been tested but using fft is much more time efficient). The quality of the output is evaluated by determining the minimum number of matching beads with a distance equal to or below RegistrationTollerancePxl between the reference image and the translated round. The quality measure is stored in the output dataframe in the min_number_matching_dots_registration field. The shift between rounds calculated for the reference channel is then used to correct the coords of the dots identified in the smFISH images.
- Stitch microscope coords: the coords of the smFISH dots and the reference channel are mapped to the stage coords. This stitching is used for a quick evaluation of the results.
- Consolidate the zarr metadata: consolidate the metadata of the processed files. Each dataset in the zarr container has the metadata stored in a .zattrs file. In order to accelerate the access to the information contained in these files, all the metadata are combined in a single .json file named .zmetadata and stored in the zarr container.
- Simple output plotting: utility function used to create a pandas dataframe with a simplified version of the analysis output that can be used for a quick visualisation. The type of stitching coords to be visualised can be selected using the stitching_selected parameter.
The pipeline runs are characterised by different combinations of steps. Different types of runs can be created to process different types of data. The pipeline runs with test in the name are used during development or for testing processing conditions.
This run is used to parse the data from the Nikon files generated by the ROBOFISH systems in the Linnarsson lab. This step is optional and needs to be modified to fit the data generated by other microscopy systems.
steps
create_folders_step
save_git_commit
QC_check_experiment_yaml_file_step
processing_cluster_init_step
nikon_nd2_parsing_graph_step
prepare_processing_dataset_step
This is a handy run used to set up and run all the requirements of the pipeline.
steps
prepare_processing_dataset_step
create_analysis_config_file_from_dataset_step
determine_tiles_organization
create_running_functions_step
This is the most complete run of the pipeline. It is usually what is run as soon as the data are transferred to the processing server. Using the resume parameter in the pipeline definition it is possible to re-run the entire pipeline without reprocessing fovs that have already been processed.
steps
run_setup
run_cluster_activation
run_parsing
run_required_steps
if experiment_type == 'eel-barcoded':
processing_barcoded_eel_step
QC_registration_error_step
microscope_stitched_remove_dots_eel_graph_step
stitch_and_remove_dots_eel_graph_step
processing_fresh_tissue_step
if experiment_type == 'smfish-serial':
processing_serial_fish_step
microscope_stitched_remove_dots_eel_graph_step
stitch_and_remove_dots_eel_graph_step
This testing pipeline run is used to run the entire eel processing after modification of parameters or functions.
steps
if experiment_type == 'eel-barcoded':
processing_barcoded_eel_step
Example: run specific fovs
- Start processing in a jupyter lab by running the following functions after initialization of the pipeline:
  run_setup
  run_cluster_activation
  run_parsing
  run_required_steps
- Select the number of fovs to process
- Run the test_run_after_editing
steps
if experiment_type == 'eel-barcoded':
processing_barcoded_eel_step
QC_registration_error_step
IMPORTANT
In order to function, this run requires the raw data counts (_raw_fov_) in the results subfolder.
steps
rerun_decoding_step
IMPORTANT
In order to function, this run requires the raw data counts (_raw_fov_) in the results subfolder.
rerun_from_registration_step
QC_registration_error_step
microscope_stitched_remove_dots_eel_graph_step
stitch_and_remove_dots_eel_graph_step
This is a special run used to process the data for the eel method paper. The data were collected before the rearrangement of the imaging room.
- In the notebooks folder there is a parameterized jupyter lab notebook that can be used to run the pipeline: Template_running_pysmFISH_pipeline.ipynb
- Run the jupyter lab template from the command line using papermill:
# The first notebook is the template to run, the second is the output notebook.
# Use -p to pass the pipeline parameters; the last options are papermill related and
# collect the logs for a fast check of issues.
papermill -k your-kernel-name-here \
  notebooks/Template_running_pysmFISH_pipeline.ipynb \
  Your_runs_folder_here/20101007-full-run.ipynb \
  -p experiment_fpath /rawa/sl/fish_rawdata/AMEXP20210609_EEL_V1C_HA2 \
  -p run_type re-run \
  -p parsing_type reparsing_from_processing_folder \
  -p save_bits_int True \
  -p chunk_size 20 \
  -p nprocs 40 \
  -p memory 6GB \
  -p processing_engine unmanaged_cluster \
  -p scheduler_port 23875 \
  -p dashboard_port 25399 \
  --start_timeout 6000 \
  --log-output --stdout-file ~/papermill_out.log \
  --stderr-file ~/papermill_err.log
See examples in notebooks
IMPORTANT
If running on monod you need to connect to the main node and start a jupyter lab there on a predefined port. On your laptop, ssh into monod with the command below. In your web browser use localhost:JUPYTER_PORT
JUPYTER_PORT=25788
ssh -L $JUPYTER_PORT:localhost:$JUPYTER_PORT [email protected] "ssh -L $JUPYTER_PORT:localhost:$JUPYTER_PORT
The dask dashboard can be used to monitor the processing in real time. It is very useful to check whether the resources are properly used. Add the following lines to your .bashrc:
dask_dashboard(){
ssh -L "$1":localhost:"$1" [email protected]
}
Connect to the dashboard port (default 25399) by running the following command:
dask_dashboard 25399
and access the dashboard on your web browser: localhost:25399
Template running pysmFISH pipeline.ipynb
Setup processing environment.ipynb
Create dark image from standalone file.ipynb
Local processing cluster activation test.ipynb
Convert excel codebook to parquet.ipynb
Visualize raw counting.ipynb
Test filtering and counting.ipynb
Process fresh tissue.ipynb
Stitching counts EEL.ipynb
Create data for fishScale.ipynb
QC Registration.ipynb