Transpose + np.copy() + (fancy indexing and/or ndarray.copy()) causes major slowdowns #322

Conversation
Codecov Report
@@ Coverage Diff @@
##           main    #322    +/-  ##
=====================================
  Coverage   4.22%   4.23%
=====================================
  Files         23      23
  Lines       5154    5150     -4
=====================================
  Hits         218     218
+ Misses     4936    4932     -4
Great find! I've always been a bit confused by the many np.copy() calls around the recOrder and waveorder codebases. In this case a deep copy does seem redundant.
Another potential issue is precision: float64 is used liberally, and for uint16 raw data the benefit of doubles is not always worth the halved bandwidth, especially for L3-cache/RAM-intensive ops like FFTs.
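A rough sketch of that idea (illustrative only; the array names and shapes are hypothetical, and scipy.fft is shown because older numpy.fft releases upcast to complex128):

import numpy as np
import scipy.fft

# Stand-in for a stack of uint16 raw frames.
raw = np.random.randint(0, 2**16, size=(8, 2048, 2048), dtype=np.uint16)

# Cast once to float32 instead of letting downstream math promote to float64;
# this halves the memory traffic for bandwidth-bound steps.
img32 = raw.astype(np.float32)

# scipy.fft preserves single precision, so the spectrum comes back as complex64.
spectrum = scipy.fft.fft2(img32, axes=(-2, -1))
print(spectrum.dtype)  # complex64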
@talonchandler Great find!
@talonchandler I am unable to get the speedups after replacing the old call with the updated birefringence = reconstructor.Polarization_recon(stokes).
I'm still getting relatively similar results with increasing Z, even though the new code looks good and makes sense. Am I missing something?
I was changing pol_data = reader.get_zarr(0)[t, 1:6, z_start:z_end]
to different intervals for different Z positions.
Here is some test code:
#%%
from recOrder.io.utils import load_bg
from recOrder.compute.reconstructions import (
initialize_reconstructor,
reconstruct_qlipp_stokes,
reconstruct_qlipp_birefringence,
reconstruct_phase3D,
)
from datetime import datetime
import numpy as np
import os
import sys
from iohub.reader import ImageReader
#debugging
import time
import cProfile as profile
import pstats
# %%
# Load the Data
data_root = "/hpc/projects/comp_micro/projects/zebrafish-infection/2023_02_02_hummingbird_casper_GFPmacs"
dataset = "intoto_casper_short_1_2023_02_08_110049.zarr"
dataset_folder = os.path.join(data_root, dataset)
print('data:' + dataset_folder)
# Background folder name
bg_root = "/hpc/projects/comp_micro/rawdata/hummingbird/Ed/2023_02_02_zebrafish_casper"
bg_folder = os.path.join(bg_root, "BG_3")
# %%
#Setup Readers
reader = ImageReader(dataset_folder)
t = 0
pos = 0
pol_data = reader.get_zarr(0)[t, 1:6]
C, Z, Y, X = pol_data.shape
print(pol_data.shape)
# fluor_data = data[6:]
bg_data = load_bg(bg_folder, height=Y, width=X)
# %%
# Get first position
print("Start Reconstructor")
reconstructor_args = {
"image_dim": (Y, X),
"mag": 16, # magnification is 200/165
"pixel_size_um": 3.45, # pixel size in um
"wavelength_nm": 532,
"swing": 0.08,
"calibration_scheme": "5-State", # "4-State" or "5-State"
"NA_obj": 0.55, # numerical aperture of objective
"NA_illu": 0.45, # numerical aperture of condenser
"n_obj_media": 1.0, # refractive index of objective immersion media
"bg_correction": "None", # BG correction method: "None", "local_fit", "global"
}
reconstructor = initialize_reconstructor(
pipeline="birefringence", **reconstructor_args
)
# Reconstruct data Stokes w/ background correction
print("Begin stokes calculation")
start_time = time.time()
# prof = profile.Profile()
# prof.enable()
# bg_stokes = reconstruct_qlipp_stokes(bg_data,reconstructor)
# stokes = reconstruct_qlipp_stokes(pol_data, reconstructor,bg_stokes)
stokes = reconstruct_qlipp_stokes(pol_data, reconstructor, bg_stokes=None)
# prof.disable()
print(f'Stokes elapsed time:{time.time() - start_time}')
print(f"Shape of background corrected data Stokes: {np.shape(stokes)}")
# Reconstruct Birefringence from Stokes
print("Begin birefringence calculation")
bire_start_time = time.time()
birefringence = reconstructor.Polarization_recon(stokes)
birefringence[0] = (
birefringence[0] / (2 * np.pi) * reconstructor_args["wavelength_nm"]
)
print(f'Bire elapsed time:{time.time() - bire_start_time}')
print(f'Total elapsed time:{time.time() - start_time}')
# %%
# print profiling output
# stats = pstats.Stats(prof).strip_dirs().sort_stats("cumtime")
# stats.print_stats() # top 10 rows
Hi Ed, I just ran a comparison of the old and new code with 10, 20, and 40 slices. As the number of slices grows, the speed improvement increases. (Note that the "Birefringence (old)" numbers match your results in mehta-lab/waveorder#102). Let me know if you can't reproduce this result.
Thanks @talonchandler. Yes, I can confirm there is a boost in speed and that this fixes the slowdown I was experiencing. Great debugging!!
One thing to note, and perhaps it should become its own separate issue, is that using the scatter-gather method here and just throwing silicon (64 cores) at it, we can do 40 slices (Stokes + birefringence) in 9 seconds, with each process taking 0.8 s for Stokes and 0.4 s for birefringence.
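A minimal scatter-gather sketch along these lines (illustrative only; it reuses pol_data, Z, reconstructor, and reconstruct_qlipp_stokes from the script above, and assumes fork-based multiprocessing so those globals and the reconstructor are usable inside the worker processes):

from concurrent.futures import ProcessPoolExecutor

def recon_z_chunk(z_slice):
    # Each worker reconstructs Stokes + birefringence for its own z-chunk.
    chunk = pol_data[:, z_slice]  # (C, z_chunk, Y, X)
    stokes = reconstruct_qlipp_stokes(chunk, reconstructor, bg_stokes=None)
    return reconstructor.Polarization_recon(stokes)

z_slices = [slice(z, z + 1) for z in range(Z)]  # one z-slice per task
with ProcessPoolExecutor(max_workers=64) as pool:
    chunks = list(pool.map(recon_z_chunk, z_slices))
# `chunks` now holds one reconstruction per z-slice; stitch them back together
# along whichever axis Polarization_recon uses for z in its output.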
I just confirmed that for this specific function's input with size Z x 2000 x 2000 I'm seeing:
I agree. I'll open a separate issue.
A stroke of genius in dealing with Stokes parameters.
Just did a final test on hummingbird... everything's working well here. Merging.
Fixes waveorder #102.

TL;DR: transposes and copies are slowing down the recOrder -> waveorder interface.

The recOrder -> waveorder adapter functions use transposes to bridge the gap between recOrder's CZYX order and waveorder's CXYZ order. Transposes of >=3-dimensional arrays can result in non-contiguous views of the original arrays, and the copied versions also inhabit non-contiguous regions of memory. This means that waveorder is often receiving non-contiguous arrays.
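For example, a small self-contained illustration (not code from either repository) of how a CZYX -> CXYZ transpose produces a non-contiguous view:

import numpy as np

czyx = np.zeros((5, 40, 256, 256), dtype=np.float64)  # recOrder-style (C, Z, Y, X) array
cxyz = czyx.transpose(0, 3, 2, 1)                     # reorder axes to (C, X, Y, Z)

print(czyx.flags['C_CONTIGUOUS'])  # True
print(cxyz.flags['C_CONTIGUOUS'])  # False: the transpose is only a strided view,
                                   # so downstream calls see non-contiguous data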
numpy functions give the correct answers when you pass non-contiguous arrays, but the results are often much slower. Fancy indexing and .copy operations are particularly slow when they operate on non-contiguous arrays, since they need to seek across a large region of memory. waveorder_reconstructor.Polarization_recon receives non-contiguous arrays and uses fancy indexing (e.g. sa_wrapped[ret_wrapped < 0] += np.pi / 2) and .copy operations (Recon_para[0] = ret_wrapped.copy()). The .copy operations are particularly slow when they receive non-contiguous arrays that they don't expect.
Confusingly, np.copy() has different default behavior than ndarray.copy(): np.copy matches the source's memory layout, while ndarray.copy assumes C order (and is the form recommended by numpy). Here's a minimal working example to illustrate the difference in behavior (and the core source of slowdown that this PR fixes):

Minimal example script (click to expand):

Results in:
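A small illustrative sketch of the same behavior (not the original collapsed script; exact timings will vary by machine):

import numpy as np
import time

base = np.random.rand(20, 1024, 1024)   # C-contiguous (Z, Y, X) data
transposed = base.transpose(2, 1, 0)    # non-contiguous (X, Y, Z) view

a = np.copy(transposed)    # np.copy defaults to order='K': the new buffer keeps the
                           # source's memory layout, so it is still not C-contiguous
b = transposed.copy()      # ndarray.copy defaults to order='C': a C-contiguous buffer

print(a.flags['C_CONTIGUOUS'], b.flags['C_CONTIGUOUS'])  # False True

# Downstream fancy indexing and .copy() are noticeably slower on the order='K' copy.
for name, arr in (("np.copy (order='K')", a), ("ndarray.copy (order='C')", b)):
    work = arr.copy(order='K')          # private buffer with the same layout
    t0 = time.perf_counter()
    work[work < 0.5] += np.pi / 2       # boolean indexing, as in Polarization_recon
    _ = work.copy()                     # ndarray.copy back to C order
    print(f"{name}: {time.perf_counter() - t0:.3f} s")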
This PR is a first step towards unravelling these issues. The Polarization_recon function doesn't need any transposes (even though recOrder applies them), so we can get the correct results quickly by removing the transposes and their inverses.

I suspect that we'll find other similar speedups across the recOrder -> waveorder interface, and I will start my search in commonly used areas with transposes and copies. The longest-term solution will involve changing waveorder to use the CZYX order so that no transposes are necessary.