Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fetch VDR Data from Butler For Region Search #531

Merged
merged 13 commits into from
Apr 5, 2024
Merged

Conversation

wilsonbb
Copy link
Collaborator

@wilsonbb wilsonbb commented Mar 27, 2024

Initial PR for Region Search module that creates a representation of VDR (Visit, Detector, Region) data fetched from a Butler repository for analysis of which images to use for KBMOD.

Note that we have broken out butler mocking to its own utils file.

The RegionSearch object also provides a few static methods which are wrappers for initial exploration of how collections datatypes are organized in the butler.

@wilsonbb wilsonbb requested a review from DinoBektesevic March 29, 2024 18:19
@wilsonbb wilsonbb marked this pull request as ready for review March 29, 2024 18:20
@wilsonbb
Copy link
Collaborator Author

@DinoBektesevic I was hoping to spend a bit more time on the butler mocking, but wanted to go ahead and mark this for review so you could take a pass at the interface

Copy link
Member

@DinoBektesevic DinoBektesevic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My general comment on the PR is that it's ok and I want to approve. There is one major issue of the build-system configuration that will require redoing how importing works, and then sorting out a few doc-strings.

My general comment about the implementation is that I worry about wrapping the butler functionality because there's so much of it.
In struggled with the same in the ButlerStandardizer. The user could have executed a near-to-sql-query by using the butler.registry and get back lists of tuples or whatever data structure they Rubin gives back to them. There's so many ways in which those can be derived and they change all the time that it's hard to standardize on a particular way.
I wonder about the utility of wrapping butler.queryCollections with additional checks on whether a collection was given or wasn't - because butler.queryCollections will already perform those checks anyhow. The purpose here looks more educational than functional, and that's ok if these are the tools you need in order to make progress atm, but I wonder how far that would scale as we start adding more. Ultimately all you're doing is executing:

# all of these are just database queries really
butler = Butler(repo_root, without_datastore=True) 

# changeable
collections = ("DECam/imdiff/20210318/20240130T225831Z",)

# safe to assume, but could be changed
datasets=("goodSeeingDiff_differenceExp", )

# basically constant
dimensions = ["visit", "physical_filter", "exposure", "visit", "detector", "patch", "tract",]

# DataCoordinate iterator
datcoord = butler.registry.queryDataIds(dimensions, collections=collections, dimensions=dimensions)

# This would require using a datastore
uris = butler.getURis(datasets[0], data_id=list(datcoord)[0], collections=collections)
# but this gets a dataset ref and it doesn't even need the datastore
refs = b.find_dataset(datasets[0], data_id=list(datcoord)[0], collections=collections)
# butler.get would work I think (with datastore), butler.find, butler.getDataset etc...

so perhaps we should, long term, be looking to take collections of data coordinates, unwinding them for ra, dec, and then actually performing some sort of grouping based on spatial and temporal constraints. As a side-note, if butler was instantiated with a datastore (leve out without_datastore) the refs can directly be passed to the Standardizer from which KBMOD can run searches.

pyproject.toml Outdated Show resolved Hide resolved
src/kbmod/region_search.py Outdated Show resolved Hide resolved
src/kbmod/region_search.py Outdated Show resolved Hide resolved
src/kbmod/region_search.py Outdated Show resolved Hide resolved
src/kbmod/region_search.py Outdated Show resolved Hide resolved
src/kbmod/region_search.py Outdated Show resolved Hide resolved
src/kbmod/region_search.py Outdated Show resolved Hide resolved
src/kbmod/region_search.py Outdated Show resolved Hide resolved
tests/test_region_search.py Show resolved Hide resolved
tests/test_region_search.py Outdated Show resolved Hide resolved
@wilsonbb
Copy link
Collaborator Author

wilsonbb commented Apr 3, 2024

My general comment on the PR is that it's ok and I want to approve. There is one major issue of the build-system configuration that will require redoing how importing works, and then sorting out a few doc-strings.

My general comment about the implementation is that I worry about wrapping the butler functionality because there's so much of it. In struggled with the same in the ButlerStandardizer. The user could have executed a near-to-sql-query by using the butler.registry and get back lists of tuples or whatever data structure they Rubin gives back to them. There's so many ways in which those can be derived and they change all the time that it's hard to standardize on a particular way. I wonder about the utility of wrapping butler.queryCollections with additional checks on whether a collection was given or wasn't - because butler.queryCollections will already perform those checks anyhow. The purpose here looks more educational than functional, and that's ok if these are the tools you need in order to make progress atm, but I wonder how far that would scale as we start adding more. Ultimately all you're doing is executing:

# all of these are just database queries really
butler = Butler(repo_root, without_datastore=True) 

# changeable
collections = ("DECam/imdiff/20210318/20240130T225831Z",)

# safe to assume, but could be changed
datasets=("goodSeeingDiff_differenceExp", )

# basically constant
dimensions = ["visit", "physical_filter", "exposure", "visit", "detector", "patch", "tract",]

# DataCoordinate iterator
datcoord = butler.registry.queryDataIds(dimensions, collections=collections, dimensions=dimensions)

# This would require using a datastore
uris = butler.getURis(datasets[0], data_id=list(datcoord)[0], collections=collections)
# but this gets a dataset ref and it doesn't even need the datastore
refs = b.find_dataset(datasets[0], data_id=list(datcoord)[0], collections=collections)
# butler.get would work I think (with datastore), butler.find, butler.getDataset etc...

so perhaps we should, long term, be looking to take collections of data coordinates, unwinding them for ra, dec, and then actually performing some sort of grouping based on spatial and temporal constraints. As a side-note, if butler was instantiated with a datastore (leve out without_datastore) the refs can directly be passed to the Standardizer from which KBMOD can run searches.

Yeah I think these are fair concerns. There's definitely some danger with wrapping too much of the butler functionality, and we'll have to revisit this approach for scaling. In particular I was thinking we would need our own datastore for caching butler query results and the spatial computations we're doing that would replace the astropy table, but I wasn't as aware of the options around instantiating a butler with a datastore and should probably revisit that.

I think the main goal here was to start the process of moving current region search work into modules with tests but still have a nice interface for interfacing with notebooks. The next steps are taking the various approaches, like find overlapping images and generating discrete piles, and move them into this module as well.

So after the next PR the envisioned interface for generating ImageCollections for a discrete pile would be something like:

from kbmod.region_search import RegionSearch
rs = RegionSearch(
    '/repo/path',
    [collection_names],
    [dataTypes],
    max_workers=4,
    fetch_data=True)

# inspect fetched vdr data and potentially filter it down further
rs.vdr_data

# Next PR with simplest "region search" style approach for discrete piles
image_collections = rs.get_overlapping_image_collections(overlap_uncertainty_radius_arcsec=30)

# Reproject image collection work units and start kbmod serarch

But I agree we'll probably soon need to reassess how this approach scales

@wilsonbb wilsonbb requested a review from DinoBektesevic April 3, 2024 18:09
Copy link
Member

@DinoBektesevic DinoBektesevic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, can you just in-indent the class TestRegionSearch and cleanup the unused improrts and feel free to merge.

@wilsonbb wilsonbb merged commit 3612d95 into main Apr 5, 2024
2 checks passed
@wilsonbb wilsonbb deleted the region_search_init branch April 16, 2024 00:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants