Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for dask and zarr arrays #805

Open
wants to merge 15 commits into
base: main
Choose a base branch
from

Conversation

ejeschke
Copy link
Owner

@ejeschke ejeschke commented Oct 15, 2019

This adds support for dask and zarr arrays into BaseImage-derived objects (e.g. AstroImage), e.g.

>>> aimg = AstroImage()
>>> aimg.load_data(dask_arr)

These images can then be loaded directly into a Ginga viewer.

Three new pytest files are added: one for numpy, dask and zarr

Developer documentation has been updated.

@ejeschke
Copy link
Owner Author

@pllim, no hurry.

@pllim
Copy link
Collaborator

pllim commented Oct 16, 2019

Not sure if I'm qualified to review this one as I have never used dask nor zarr. Maybe someone from SunPy like @nabobalis is a better person to review?

@ejeschke
Copy link
Owner Author

@pllim, how about just a basic code review?

@pllim
Copy link
Collaborator

pllim commented Oct 31, 2019

@ejeschke , sure, I'll have a look after you resolve the conflict.

@nabobalis
Copy link
Contributor

Sorry! I clearly glossed over the email notification I got from this 15 days ago.

I've not used dask myself (outside of using xarray with it). I think either @Cadair or @wtbarnes have used it in a more direct form. I am happy to provide a code review regardless.

@ejeschke ejeschke force-pushed the dask-support branch 4 times, most recently from 29366f5 to b0ce57e Compare October 31, 2019 20:06
@ejeschke
Copy link
Owner Author

@pllim, @nabobalis, thank you! Yes, pleased to have any code review.

You can check out this gist to see the tests I did.

Copy link
Contributor

@nabobalis nabobalis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks good by me. That is some tricky reshaping to be done for zarr.

@Cadair
Copy link
Contributor

Cadair commented Nov 1, 2019

I would be happy to review this, but wont be able to get to it until next week. If you could post a review request for me, it will end up on my list. :)

@pllim
Copy link
Collaborator

pllim commented Nov 1, 2019

@Cadair , can't add you as reviewer, but added you to assignee.

Copy link
Collaborator

@pllim pllim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like the three test modules could be merged into one to avoid code duplication via subclassing and changing a few things here and there at setup.

I wonder if a rebase will fix RTD build for this PR.

ginga/tests/test_dask.py Show resolved Hide resolved
ginga/tests/test_dask.py Outdated Show resolved Hide resolved
ginga/tests/test_dask.py Outdated Show resolved Hide resolved
ginga/tests/test_dask.py Outdated Show resolved Hide resolved
ginga/tests/test_dask.py Show resolved Hide resolved
ginga/trcalc.py Outdated Show resolved Hide resolved
ginga/trcalc.py Outdated Show resolved Hide resolved
ginga/trcalc.py Show resolved Hide resolved
ginga/trcalc.py Outdated Show resolved Hide resolved
arr = arr.reshape(shape)
return arr

else:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you guarantee that d_obj would definitely be Dask object at this point? Or should this be an elif instead?

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good question. The current code sort of assumes that by process of elimination. Probably it should be an elif with an exception raised if it didn't match anything before. Problem is that I want to detect the cases without having to import zarr or dask, because this becomes the basic slicing function. So if a good duck-typing test could be done that might be a possibility...

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At this point I think the code for this is ok, we can refactor it later if a better test for dask arrays can be found.

@ejeschke ejeschke mentioned this pull request Nov 1, 2019
@pllim
Copy link
Collaborator

pllim commented Nov 1, 2019

Re: RTD failure -- we'll revisit if it persists after your next round of edits.

@ejeschke ejeschke added this to the 3.1 milestone Nov 2, 2019
@ejeschke
Copy link
Owner Author

ejeschke commented Nov 2, 2019

@Cadair, do you have some examples of large images (too large for RAM) that you open up using dask arrays?

@ejeschke
Copy link
Owner Author

ejeschke commented Nov 2, 2019

@pllim, I believe I addressed all your points. Please have another look.

@ejeschke
Copy link
Owner Author

ejeschke commented Nov 2, 2019

I feel like the three test modules could be merged into one to avoid code duplication via subclassing and changing a few things here and there at setup.

Maybe for a future PR? I want to add some more tests to the new test_trcalc and since those would overlap with these somewhat they could all be done together.


def _2ddata(self, shape, data_np=None):
if data_np is None:
data_np = np.asarray([min(i, j)
Copy link
Collaborator

@pllim pllim Nov 2, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this can be optimized, especially if you are using sizable shape values:

In [30]: shape = (1000, 500)

In [31]: %timeit np.min(np.indices(shape), axis=0)
2.52 ms ± 13.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [32]: %timeit np.asarray([min(i, j) for i in range(shape[0]) for j in range(shape[1])]).reshape(shape)
92.7 ms ± 198 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

This comment also applies to all similar occurrences throughout.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ejeschke , do you not wish to address this one? Either way is fine by me, but I just want to make sure it was not overlooked.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whups, ok...pushed another commit. Have a look.


def _get_data(self, shape, data_np=None):
if data_np is None:
data_np = np.random.randint(0, 10000, shape)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not also remove random here?

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch.

ginga/tests/test_numpy.py Show resolved Hide resolved
@ejeschke
Copy link
Owner Author

Rebased, still passing all tests with latest conda installs of zarr and dask.

@pllim
Copy link
Collaborator

pllim commented Jun 16, 2021

I haven't been using these two packages, so as long as tests pass and you are happy with it, FFTM.

@ejeschke ejeschke mentioned this pull request Jun 30, 2023
8 tasks
@ejeschke ejeschke changed the base branch from master to main July 12, 2023 22:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants